What is the best way to work with large "Table" data type variables?

Matt on 29 Jul 2016
Commented: Matt on 15 Sep 2016
I find the "table" data type very useful and I would like to take advantage of its features when working with very large tables. Essentially, what I want to do is take a number of delimited text files, all with the same number of columns/variables, import each one as a table, and vertically concatenate them into one MATLAB table. The size of the tables presents a problem, as the files get very big, very fast. My initial thought was to use the "matfile" function, but it is not compatible with the table data type; you have to load the entire table variable to add rows to it, which defeats the purpose. As an example, if I have a .mat file called "test.mat" that contains a table variable called "table1," I cannot access it with "matfile."
m = matfile('test.mat','Writable',true);  % the file name must be a quoted string
m.table1(1,1);
The second line produces an error:
The variable 'table1' is of class 'table'. To use 'table1', load the entire variable.
I used that example for simplicity, but the same error is generated if I attempt to add rows from a table in the workspace to table1.
Is there a way to do what I want that does not require the entire table to be loaded? If the entire table has to be loaded, I imagine I'll run out of memory very quickly. I would also like to minimize the time required to process the data, so loading the entire table does not lend itself to that goal. It may well be that tables are not an option, but I wanted to ask to see if anyone else had any ideas. If I have to move away from tables, what would be the most efficient alternative?
Thanks, Matt

Answers (4)

Heather Gorr, PhD on 1 Aug 2016
Hi Matt, Depending on the type of data processing you are doing, you may be able to read data from each of your files into a table and process incrementally using the datastore function. This will also allow you to specify the appropriate data type (categorical for example) before the data is brought into memory, which could potentially save overhead.
For example, if you want to only retain valid information from a select number of columns, you could do something like this: (adapted from datastore doc, see doc examples for more!)
ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
preview(ds)
% Choose data of interest and data types
ds.SelectedVariableNames = {'Year','UniqueCarrier','ArrDelay','DepDelay'};
ds.SelectedFormats{2} = '%C';
% Set the read size (number of rows returned by each call to read)
ds.ReadSize = 5000;
% Read the first 5000 rows and keep only valid data
data = read(ds);
data = data(~any(ismissing(data),2),:);
% Read the rest and keep only valid data
while hasdata(ds)
    t = read(ds);
    idx = ~any(ismissing(t),2);  % rows with no missing values
    t = t(idx,:);
    data = [data;t];
end
This example reads a single file, but the same code works for a directory of files that share the same overall structure. Here is a bit more info on datastore: http://www.mathworks.com/help/matlab/import_export/what-is-a-datastore.html
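As a rough sketch of the multi-file case: a datastore can be pointed at a wildcard pattern so that every matching file is read in turn. The folder name, file names, and the variables Id and Value below are made up for illustration; the two small CSV files are written only so the sketch runs on its own.

```matlab
% Hypothetical folder and files, created here only so the sketch is runnable
folder = fullfile(tempdir,'ds_demo');
if ~exist(folder,'dir'), mkdir(folder); end
writetable(table([1;2],[10;NaN],'VariableNames',{'Id','Value'}), fullfile(folder,'part1.csv'));
writetable(table([3;4],[30;40],'VariableNames',{'Id','Value'}), fullfile(folder,'part2.csv'));

% One datastore covering every CSV file in the folder
ds = datastore(fullfile(folder,'*.csv'));
ds.ReadSize = 'file';              % each call to read returns one whole file

% Collect the filtered chunks in a cell array and concatenate once at the end
chunks = {};
while hasdata(ds)
    t = read(ds);
    t = t(~any(ismissing(t),2),:); % drop rows that contain missing values
    chunks{end+1} = t;             %#ok<SAGROW>
end
data = vertcat(chunks{:});
```

Concatenating once at the end avoids regrowing the table on every iteration, which tends to matter when there are many chunks.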
1 comment
Matt on 5 Aug 2016
Thanks Heather. I found this video helpful to gain a high level understanding of some of the methods for working with "big data," such as datastores. I'm not sure what the answer will ultimately be for me, but I definitely have some options to look into.



Edric Ellis on 15 Sep 2016
Edited: Edric Ellis on 15 Sep 2016
New in R2016b is the ability to create a "tall" table which lets you perform operations on the table as if it were an in-memory table, but the data is only read in on demand from a datastore (which can reference multiple delimited text files). See more in the doc here: http://www.mathworks.com/help/matlab/tall-arrays.html.
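A minimal sketch of what that could look like, assuming R2016b or later and the airlinesmall.csv sample file used in the datastore example elsewhere in this thread. The choice of columns and the mean-delay calculation are only illustrative.

```matlab
% Assumes R2016b or later and the airlinesmall.csv sample file on the path
ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'Year','UniqueCarrier','ArrDelay'};

tt = tall(ds);                       % tall table backed by the datastore; no data is read yet
valid = tt(~isnan(tt.ArrDelay),:);   % deferred: keep rows with a known arrival delay
avgDelay = mean(valid.ArrDelay);     % deferred: still an unevaluated tall result

avgDelay = gather(avgDelay);         % triggers the pass over the data and returns a scalar
```

Operations on the tall table are queued until gather is called, so only the final (small) result needs to fit in memory.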
1 comment
Matt on 15 Sep 2016
Wow, very cool. Now I just need to get IT to hurry up and get the newest version so I can use that.



Star Strider on 29 Jul 2016
One possibility is to generate the table in your workspace from a cell array (or a double array) each time using cell2table or array2table, and store it in your .mat file as a cell array (or double array), using table2cell, table2array and their inverses for the conversion each time you want to work on the entire table.
You can probably add to the arrays without loading the entire table this way.
I don’t have any actual experience doing this with tables (I’ve never needed to) but have with arrays, so I leave it to you to experiment. It’s the only way I can think of to do what you want.
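An untested-in-production sketch of that idea, assuming all table variables are numeric so that table2array yields a plain double matrix. The file name numeric_store.mat, the variable names data and varNames, and the tiny example tables are all hypothetical. The variable names are stored separately so a slice can be turned back into a table later.

```matlab
% Hypothetical MAT-file; start from a clean file so the sketch is repeatable
fname = fullfile(tempdir,'numeric_store.mat');
if exist(fname,'file'), delete(fname); end
m = matfile(fname,'Writable',true);      % matfile creates new files in v7.3 format

% First "imported" table: store its numeric content and its variable names
T1 = table([1;2],[10;20],'VariableNames',{'A','B'});
m.data = table2array(T1);
m.varNames = T1.Properties.VariableNames;

% Next table: append its rows on disk without loading the stored array
T2 = table([3;4;5],[30;40;50],'VariableNames',{'A','B'});
sz = size(m,'data');                      % size query does not load the variable
m.data(sz(1)+1:sz(1)+height(T2),:) = table2array(T2);

% Later: read only rows 2 to 4 and convert that slice back into a table
part = array2table(m.data(2:4,:),'VariableNames',m.varNames);
```

The size method is used instead of the end keyword because, according to the matfile documentation, indexing with end loads the entire variable.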
1 comment
Matt on 29 Jul 2016
If I understand correctly, you would import a text file as a table, convert that table to an array, and save it to a .mat file. Then, import the next text file as a table and convert it to an array. Finally, use "matfile" to access the array in the .mat file and concatenate the array in the workspace to the one in the .mat file.
I imagine this would work, but you would only ever be able to work with the tables when you first imported them. After importing data from all desired text files, you could access a part of the final saved array with "matfile" and convert that part to a table, but you would never be able to work on the entire data set as a table. Also, I imagine you would lose all metadata associated with a table when you convert it to an array.
That would still give me a bit more flexibility than importing directly to an array, although I don't know how it would affect memory and speed. I think I'm going to have to make some compromises anyway, because "matfile" is not compatible with the table data type. It's too bad MathWorks has not expanded the "matfile" function to work with tables.



Image Analyst on 29 Jul 2016
If a table runs out of memory, then reading it in and converting it to an array will double the memory requirements, at least until you can clear the table from memory. It will also require a separate array for each variable type, such as one array for numbers and one for strings. If all you had was numbers, you wouldn't have needed a table in the first place. Normally you'd only use a table when the columns are a mixture of data types.
Just what kind of data are you talking about? What is the data class of each column in the table, and how many gigabytes are we talking about for the table? How much RAM do you have? What does the "memory" function display for you? Can you buy more RAM?
And don't even think about cell arrays - they can take 15 times as much memory as a table, or more.
Have you considered memmapfile()?
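For reference, a small sketch of how memmapfile could be used for purely numeric data, assuming the values have first been written to a flat binary file of doubles with one record per row. The file name numeric_data.bin, the field name row, and the 3-by-4 example matrix are made up for illustration.

```matlab
% Hypothetical binary file holding a numeric matrix, one record per row
fname = fullfile(tempdir,'numeric_data.bin');
A = reshape(1:12,3,4);            % 3 rows, 4 columns of example data
nCols = size(A,2);

fid = fopen(fname,'w');
fwrite(fid, A.', 'double');       % writing the transpose stores each row of A contiguously
fclose(fid);

% Map the file as a sequence of records, each holding nCols doubles
mm = memmapfile(fname,'Format',{'double',[nCols 1],'row'});

% Only the requested record is read from disk
row2 = mm.Data(2).row.';          % second row of A as a 1-by-nCols vector
```

The mapping gives array-style access to the file without loading all of it, but it carries none of the table metadata, so variable names would have to be tracked separately.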
2 comments
Matt on 29 Jul 2016
The data is numeric (minus a header row). I wanted to use the table data type for the metadata, summary information, and potentially the ability to join tables (using the join function, not just concatenating) and use logical and categorical variables. A number of the variables are essentially enumerated, so they would be useful as categorical variables in this case. Basically, the table data type offers a lot of useful features, but, if size and speed prevent me from using it, I can always fall back to numeric arrays. Even then, size and speed may be an issue.
There are 3000-4000 columns in the text files I'm using. Each represents a variable. I obviously don't want that many variables, so some kind of container is necessary.
In terms of the text files, there can be ~20 GB of data to read in for one "data set." I can do things piecemeal, but this question came up because I'm trying to avoid that.
I am not well versed in memory optimization in MATLAB. The computer I'm using has 8 GB of RAM. More RAM is a possibility, but it would take a while. The Parallel Computing Toolbox and a cluster with the Distributed Computing Server are also possibilities, but are even farther down the road. I read about memmapfile(), as well as things like datastore and mapreduce, but I have never used any of them before.
"memory" returns the following:
Maximum possible array: 27434 MB (2.877e+10 bytes) *
Memory available for all arrays: 27434 MB (2.877e+10 bytes) *
Memory used by MATLAB: 1320 MB (1.384e+09 bytes)
Physical Memory (RAM): 8135 MB (8.530e+09 bytes)
* Limited by System Memory (physical + swap file) available.
At some point, I increased the virtual memory in an attempt to gain additional memory.
Hopefully that provides a bit more context and useful information.
Image Analyst on 30 Jul 2016
Perhaps this page http://www.mathworks.com/help/matlab/memory.html will help. See the last link on it. Also, check out memmapfile.

