Storing and accessing large amounts of time-based data
31 visualizaciones (últimos 30 días)
Mostrar comentarios más antiguos
General Intro
I am working on a project where the ultimate goal is to analyse data and produce insights on this. However, I'm having a little trouble with the data storage itself, before even starting any analysis. I am hoping someone with more experience who has a feel for this can give some insight.
Data
We are recording ~70 signals from several hydrogen fuel cell vehicles that are driving around. Most of the signals are at 1Hz, about 10 at 5Hz and a handful at slower (0.1Hz, 0.002Hz) frequencies. The cars are driven for anywhere from 5 minutes to 2 hours (roughly) any number of times per day (sometimes 0 for multiple days), but while they are being used the data is being continuously recorded. All signals are doubles.
I currently save everything to one large nested structure (which is good for sharing with coworkers) that is just a nested-cell structure. This structure is inherited from the reading of the RAW signal data. Signals are things like GPS, speed, steering wheel angle, fuel cell voltages etc. The current data structure works well in some ways, but calculating the mean vehicle speed for example requires a 2-level nested for-loop (loops through cars, then trips [trips are vehicle on->vehicle off]) instead of mean(ts) as can be done with timeseries objects. Part of the problem is that the types of analysis that can be performed is very broad so it's known what the specific query functions will be.
The dataset currently covers one year of data from a four year project, to give a sense of the scale. It's also possible more vehicles will be added, increasing the amount of data for analysis.
Desires
- Relatively easy to use, for less MATLAB savvy coworkers
- Filtering out trips on different conditions, such as: length, signal value, and date
- Easy to perform statistical analysis
- Ability to perform analysis on entire dataset easily (across different sample frequencies)
- Units as a property of a signal
- (Very optional) Saved on a server (I have both Windows and Linux servers available) so that users can dynamically query the data
Attempts
I have tried several different approaches to this so far:
- Single timeseries collection per car and per sample rate. Used append to make them into single timeseries whilst maintaining proper timestamps (I like how this works when plotting all data). This seems to work alright, unless the PC has limited amounts of RAM....
- Single timetable per car, with missing data for resampling filled by NaNs. This is my current in use method, but at 3,629,763; 2,853,868; 1,707,074; rows of data (with 70+ signals), they can use quite a lot of data.
Thoughts
I've been looking at support docs and everything and don't want to put a lot of work in to pursuing the wrong solution.
- Use a timeseries collection for each car and resample the data using ZoH (probably causes errors due to massive TSC size in memory)
- Use a timetable, but this doesn't seem as applicable for use if you've only got numbers.
- Use one of the above, but have each trip in a nested array instead of one big table (ease of use of one big table is nice, though).
- Use one of the above, store it in HDF5 data set (not sure if possible) so data is only loaded as needed.
- Use a datastore with the above?
- Double down on "big data" and start learning map-reduce stuff
Edit
I am no where near running into limits on storage or memory; that is a future problem I will have, but not the important thing to solve now. As mentioned in the comments there's much I can do to reduce size that has nothing to do with structure.
My MAIN QUESTION is about whether I should use timeseries, timetables or non-time-based arrays, and how I should organise and store these. I have many sets of data from each car and I have contained each set (which can be recorded minutes or hours apart) in a single table as this simplifies finding things like the mean of all samples very quickly, but massive tables aren't very nice to memory. So how do I store and organise the data such that I can easily perform analysis? Memory restrictions can be kept in mind but they are definitely not a driving force in this at all. My computer handles what I've described just fine.
2 comentarios
dpb
el 3 de Oct. de 2016
Editada: dpb
el 3 de Oct. de 2016
Is the "~70 signals from several ... vehicles" the current total number of channels or is this per vehicle?
Is there sufficient precision in the measurements to justify double storage or (it would seem highly likely) single would halve the actual memory; shorter integers reflective of the actual A/D bit resolution even more...
Respuestas (4)
Kirby Fears
el 3 de Oct. de 2016
Editada: Kirby Fears
el 3 de Oct. de 2016
Hi Ben,
You have discussed your ideas and questions regarding the Matlab data structures that allow for easy access and manipulation of your data in memory, but how are you storing this data in the first place on your disk? Are you using a database technology like MongoDB or HBase that can be integrated into map-reduce calculations?
If you're not using a database while generating about 2 million rows times 70 signals per year for each car, you will soon have difficulty loading that data into memory in the first place. Imagine your data doubles or triples. What will you do to get it into memory?
Your querying should be done outside of memory through a database, not by indexing/searching through a complex structure that's been completely loaded into your memory from the start.
If you don't have access to databases managed by a professional database administrator, you might consider using Amazon Web Services (or a competitor) to set up data storage and big data processing without the need to own and maintain all that software & hardware in-house.
https://aws.amazon.com/emr/details/hadoop/
Once your data is stored and managed, it's much easier to talk about the data structures needed for your analyses since you'll probably be working with a lot less data in memory to begin with. Probably one of the data structures you have already used will work fine.
Hope this helps.
8 comentarios
ESala
el 26 de Jul. de 2018
Hi @Ben, @Kirby,
I had the same problem where I needed to access InfluxDB from MATLAB, so I made a small client library.
Steven Lord
el 26 de Jul. de 2018
In release R2017a we added the capability to work with out-of-memory time-stamped data using a tall timetable. Consider storing your data in files that can be used to create a datastore then create a tall array from that datastore and convert or create a timetable from that tall array. See the "Extended Capabilities" section on the documentation page for timetable for more information.
3 comentarios
Steven Lord
el 26 de Jul. de 2018
There were some timetable methods that didn't work for tall timetable arrays in release R2017a. We added tall timetable support for a few more functions (including retime and synchronize in particular) in release R2017b.
If there are particular functions that you want to use on tall table or timetable arrays but they don't work, please let Technical Support know so they can add a request for those specific functions (with your use cases as motivation) to the enhancement database.
Peter Perkins
el 14 de Sept. de 2018
Consider this:
>> x = rand(100,5);
>> c1 = mat2cell(x,100,[1 1 1 1 1])
c1 =
1×5 cell array
{100×1 double} {100×1 double} {100×1 double} {100×1 double} {100×1 double}
>> c2 = num2cell(x);
>> whos
Name Size Bytes Class Attributes
c1 1x5 4560 cell
c2 100x5 60000 cell
x 100x5 4000 double
Tables are like c1, which splits x into five 100x1 columns, as opposed to c2, which splits it into 500 separate scalars. Mostly people use tables for a medium-to-large number of observations of a small-to-medium number of variables.
0 comentarios
Peter Perkins
el 6 de Sept. de 2018
Ben, I'm coming to this post a bit late, and I don't think I have a silver bullet for you, but I wonder if I can get you to say more about, "Easy to perform statistical analysis" and "Ability to perform analysis on entire dataset easily (across different sample frequencies)". It would be helpful to hear a more detailed description of a concrete example or two of the kind of analysis you need to do.
You say, "as this simplifies finding things like the mean of all samples very quickly". It's often the case that people want to compute descriptive statistics, but for each trip, or each car, or grouped in some other way. Is that what you mean, or do you really mean across "all samples"?
Also, I guess I can understand the appeal of one timetable per car, but as you say, with multi-rate data there's a lot of NaN-filling (or ZOH, or interpolation) to be done, and so I'm wondering why you tried using multiple timeseries collections per vehicle to deal with multi-rate, but not multiple timetables per car?
Finally, I wonder if you could say what's making you conclude that a timetable "doesn't seem as applicable for use if you've only got numbers"?
Thanks for shedding any light, if you can.
3 comentarios
Peter Perkins
el 12 de Sept. de 2018
OK, thanks. It sounds like grouped calculations in one form or another on timetables is working OK for you.
Depending on what kind of data you have, it may not be true that one numeric array uses less memory than a timetable with all numeric column variables. As long as the timetable is not very wide, there's very little extra memory needed to cut the numeric array into columns. Performance differences depends on what you are doing.
But it's not "wrong" to put all numeric variables into a table or a timetable if the variables are conceptually different things (perhaps different units) because in those cases you're unlikely to want to treat them all as one homogeneous matrix.
It all depends on how you want to think of your data.
Ver también
Productos
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!