What is the best way to work with large "Table" data type variables?
I find the "table" data type very useful and I would like to take advantage of its features when working with very large tables. Essentially, what I want to do is take a number of delimited text files, all with the same number of columns/variables, import each one as a table, and vertically concatenate them into one MATLAB table. The size of the tables presents a problem, as the files get very big, very fast. My initial thought was to use the "matfile" function, but it is not compatible with the table data type; you have to load the entire table variable to add rows to it, which defeats the purpose. As an example, if I have a .mat file called "test.mat" that contains a table variable called "table1," I cannot access it with "matfile."
m = matfile('test.mat','Writable',true);
m.table1(1,1);
The second line produces an error:
The variable 'table1' is of class 'table'. To use 'table1', load the entire variable.
I used that example for simplicity, but the same error is generated if I attempt to add rows from a table in the workspace to table1.
Is there a way to do what I want that does not require the entire table to be loaded? If the entire table has to be loaded, I imagine I'll run out of memory very quickly. I would also like to minimize the time required to process the data, so loading the entire table does not lend itself to that goal. It may well be that tables are not an option, but I wanted to ask to see if anyone else had any ideas. If I have to move away from tables, what would be the most efficient alternative?
Thanks, Matt
Answers (4)
Heather Gorr, PhD
on 1 Aug 2016
Hi Matt, Depending on the type of data processing you are doing, you may be able to read data from each of your files into a table and process incrementally using the datastore function. This will also allow you to specify the appropriate data type (categorical for example) before the data is brought into memory, which could potentially save overhead.
For example, if you want to only retain valid information from a select number of columns, you could do something like this: (adapted from datastore doc, see doc examples for more!)
ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
preview(ds)
% Choose data of interest and data types
ds.SelectedVariableNames = {'Year','UniqueCarrier','ArrDelay','DepDelay'};
ds.SelectedFormats{2} = '%C';
% Set the read size.
ds.ReadSize = 5000;
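% Read the first 5000 rows and keep only rows without missing values
data = read(ds);
data = data(~any(ismissing(data),2),:);
% Read the rest, again keeping only valid rows
while hasdata(ds)
    t = read(ds);
    idx = ~any(ismissing(t),2);   % true for rows with no missing entries
    t = t(idx,:);
    data = [data; t];             % append the filtered chunk
end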
This shows working with one file, but is the same for a directory of files with the same overall structure. Here is a bit more info on datastore: http://www.mathworks.com/help/matlab/import_export/what-is-a-datastore.html
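As a minimal sketch of the multi-file case, assuming every file has the same columns: the temporary folder, file names, and values below are made up for illustration, and the running sum simply stands in for whatever per-chunk processing you need.

```matlab
% Sketch: one datastore over several delimited files with identical columns.
% Folder, file names, and data are hypothetical.
folder = tempname;
mkdir(folder);
writetable(table([1;2;3],[10;NaN;30],'VariableNames',{'Id','Value'}), ...
    fullfile(folder,'part1.csv'));
writetable(table([4;5],[40;50],'VariableNames',{'Id','Value'}), ...
    fullfile(folder,'part2.csv'));

ds = datastore(fullfile(folder,'*.csv'));   % wildcard matches both files
ds.ReadSize = 2;                            % rows per read, kept tiny here

total = 0;
nValid = 0;
while hasdata(ds)
    t = read(ds);                           % next chunk, returned as a table
    t = t(~any(ismissing(t),2),:);          % drop rows with missing values
    total = total + sum(t.Value);           % accumulate instead of concatenating
    nValid = nValid + height(t);
end
```

Accumulating a result per chunk, rather than concatenating every chunk into one table, keeps memory use bounded by ReadSize.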
Edric Ellis
on 15 Sep 2016
Edited: Edric Ellis
on 15 Sep 2016
New in R2016b is the ability to create a "tall" table which lets you perform operations on the table as if it were an in-memory table, but the data is only read in on demand from a datastore (which can reference multiple delimited text files). See more in the doc here: http://www.mathworks.com/help/matlab/tall-arrays.html.
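A minimal sketch of the tall workflow, assuming R2016b or later: the temporary files and the ArrDelay values are made up for illustration, and mapreducer(0) is included only to keep the calculation in the local session if Parallel Computing Toolbox is installed.

```matlab
% Sketch: tall table backed by a datastore over several CSV files.
% Folder, file names, and data are hypothetical.
folder = tempname;
mkdir(folder);
writetable(table([1;2;3],[5;NaN;7],'VariableNames',{'Id','ArrDelay'}), ...
    fullfile(folder,'a.csv'));
writetable(table([4;5],[9;11],'VariableNames',{'Id','ArrDelay'}), ...
    fullfile(folder,'b.csv'));

mapreducer(0);                              % evaluate in the local session, no parallel pool
ds = datastore(fullfile(folder,'*.csv'));
tt = tall(ds);                              % tall table; no data is read yet

valid = tt(~isnan(tt.ArrDelay),:);          % deferred: nothing is evaluated here
meanDelay = gather(mean(valid.ArrDelay));   % gather triggers the pass through the files
```

Operations on a tall table are queued until gather is called, so it usually pays to build up the whole calculation first and gather once.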
Star Strider
on 29 Jul 2016
One possibility is to generate the table in your workspace from a cell array (or a double array) each time using cell2table or array2table, and store it in your .mat file as a cell array (or double array), using table2cell, table2array and their inverses for the conversion each time you want to work on the entire table.
You can probably add to the arrays without loading the entire table this way.
I don’t have any actual experience doing this with tables (I’ve never needed to) but have with arrays, so I leave it to you to experiment. It’s the only way I can think of to do what you want.
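A minimal sketch of that idea for a purely numeric array, offered as an untested-in-production illustration rather than a recommended design: the file name, the variable name data, and the values are hypothetical. It queries the size through the matfile object instead of using end, which the matfile documentation advises against.

```matlab
% Sketch: append rows to a numeric array on disk without loading it.
% File name, variable name, and values are hypothetical.
fname = [tempname '.mat'];
m = matfile(fname,'Writable',true);   % a new file created this way uses the v7.3 format

m.data = [1 2 3; 4 5 6];              % first chunk creates the variable on disk

newRows = [7 8 9; 10 11 12];          % next chunk, e.g. the output of table2array
[nRows, ~] = size(m,'data');          % query the size without loading the data
m.data(nRows+1:nRows+size(newRows,1), :) = newRows;   % append on disk

lastRow = m.data(4,:);                % read back a single row only
finalSize = size(m,'data');
```

This only covers homogeneous numeric data; columns of other types would each need their own variable in the file.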
Image Analyst
on 29 Jul 2016
If the table already exhausts memory, then reading it in and converting it to an array will double the memory requirement, at least until you can clear the table from memory. It will also require a separate array for each variable type, such as one array for numbers and one for strings. If all you had were numbers, you would not have used a table in the first place. Normally you'd only use a table when the columns are a mixture of data types.
Just what kind of data are you talking about? What is the data class of each column in the table, and how many gigabytes are we talking about for the table? How much RAM do you have? What does the "memory" function display for you? Can you buy more RAM?
And don't even think about cell arrays: they can take 15 times as much memory as a table, or more.
Have you considered memmapfile()?
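A minimal sketch of memmapfile on numeric data, assuming the data has first been written to a binary file (memmapfile maps raw bytes, so delimited text would need converting beforehand). The file name, the field name x, and the 3-by-4 layout are made up for illustration.

```matlab
% Sketch: map a binary file of doubles and index into it like an array.
% File name, field name, and dimensions are hypothetical.
fname = [tempname '.bin'];
A = reshape(1:12, 3, 4);              % example numeric data, 3 rows by 4 columns

fid = fopen(fname,'w');
fwrite(fid, A, 'double');             % written in column-major order
fclose(fid);

mm = memmapfile(fname, 'Format', {'double', [3 4], 'x'}, 'Repeat', 1);
block = mm.Data.x(2:3, 4);            % index into the mapped region
clear mm                              % release the mapping before deleting the file
delete(fname);
```

The Format dimensions must match how the file was written, so the row and column counts have to be known or stored separately.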
2 Comments
Image Analyst
on 30 Jul 2016
Perhaps this page http://www.mathworks.com/help/matlab/memory.html will help. See the last link on it. Also, check out memmapfile.