What is the correct way to save a large MATLAB structure?
Owen
on 20 Nov 2024 at 8:38
Commented: Steven Lord
on 21 Nov 2024 at 16:40
I have a MATLAB structure which is just over 21GB in memory (from whos) and when I save this to a MAT file with the "-v7.3" and "-nocompression" flags it takes well over an hour (on a high performance workstation with an NVMe SSD) and I get a file which is 77GB on disk. I understand that there is some overhead in saving to a MAT file and that "-nocompression" will result in larger files than with compression (but I gave up after about 3 hours waiting for a compressed version to save), but how can 56GB of "overhead" be considered acceptable?
I only need to save this one structure and I won't be adding any other data or modifying the MAT file, so all of the additional features of the v7.3 format are of no use to me; I just need support for >2GB variables. I attempted to use the undocumented getByteStreamFromArray function to get a byte array I can just dump to a file, but this just returned "Error during serialization".
Am I somehow missing a "correct" way to do this efficiently? Or are my only options to either split my data into a bunch of <2GB variables to save in a v7 format or write my own serializer? I appreciate that doing either of these isn't exactly a massive job, I'm just very surprised there isn't better native support for large files!
Answers (2)
Matt J
on 20 Nov 2024 at 16:22
Edited: Matt J
on 20 Nov 2024 at 16:44
but how can 56GB of "overhead" be considered acceptable?
It depends on what your struct contains. Field data containing handle objects, for example, will give a deceptively small memory total according to whos() because only the handle, and not what it points to, is counted. However, when you save to a .mat file, the entirety of the data pointed to by the handle is written out, resulting in a file much larger than the reported size. Example:
s.h=gcf;
whos('s').bytes
ans =
176
save file1 s
dir('file1.mat').bytes
ans =
1836
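To look for fields where this might be happening, one option is to list each field's class, its size as reported by whos, and whether it is a handle object. This is only a rough sketch: the struct `s` and its field names are made up for illustration, and `containers.Map` stands in for whatever handle objects the real data may hold.

```matlab
% Illustrative struct: one plain numeric field and one handle object
s.numbers = rand(100);
s.h = containers.Map();   % containers.Map is a handle class

names = fieldnames(s);
for k = 1:numel(names)
    value = s.(names{k});
    info = whos('value');  % size as whos sees it (the handle only, for handle objects)
    fprintf('%-10s %-20s %10d bytes  handle=%d\n', ...
        names{k}, class(value), info.bytes, isa(value, 'handle'));
end
```

Fields flagged as handles are the ones whose saved size may differ most from what whos reports.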
Or are my only options to either split my data into a bunch of <2GB variables to save in a v7 format
It is hard for me to imagine why one would ever want 21GB of data in a single file. It would block off a huge chunk of contiguous disk space and it would take forever to load.
2 Comments
Steven Lord
on 21 Nov 2024 at 16:40
What's the general layout of the struct? Is it a scalar struct with a few large fields, is it a scalar struct with many small-to-medium fields, is it a non-scalar struct, etc.? This could impact how much overhead there is.
As an analogy, consider an egg carton. You could have an egg carton that wraps each egg in a small cardboard box of its own and then ties all those small boxes together (a non-scalar struct with one element per egg and one field named egg in each element) or you could have one that stores each egg in a cup of its own, but does not completely enclose the egg like the first example (a regular array.) Both store the same eggs, but one uses more material (overhead) and takes longer to access the egg/data.
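The difference between the two cartons can be seen in a small experiment. This is a minimal sketch with made-up sizes and the field name `egg` taken from the analogy; it only compares in-memory size as reported by whos, not file size.

```matlab
n = 1e5;

% Non-scalar struct: one element per value, each element carrying its own bookkeeping
a = struct('egg', num2cell(rand(n, 1)));

% Scalar struct: a single field holding one plain array
b.egg = rand(n, 1);

infoA = whos('a');
infoB = whos('b');
fprintf('struct array:  %d bytes\nscalar struct: %d bytes\n', infoA.bytes, infoB.bytes);
```

Both variables hold the same n values, so any difference in the reported bytes is per-element overhead.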
Your suggestion of "dump this data to a binary file" is interesting, but would you expect to be able to dump the data (which I assume you mean something like the raw contents of memory) in one release of MATLAB and read in that dumped data in a different release of MATLAB? [And even if you don't, do you think other users would expect that to be a requirement for that type of feature and be annoyed/angry if it didn't support that workflow?] If so, we would need to very carefully consider how any internal change to how structs (or more generally, any arrays) are organized in memory would affect this dumping/reading process.
Rahul
on 21 Nov 2024 at 8:32
Hi Owen,
The issue you're encountering stems from the design of MATLAB's MAT-file formats and the inherent inefficiencies of the -v7.3 format for your specific use case.
- With the ‘-v7.3’ flag you can store variables larger than 2GB; data is compressed by default unless ‘-nocompression’ is specified.
- The ‘-v7’ format also compresses by default, but it cannot store variables larger than 2GB. The older ‘-v6’ format is uncompressed and has the same size limit.
MAT-file structure: The ‘-v7.3’ format uses HDF5 as its backend, which is highly versatile but not optimized for cases where you have a single, large variable. HDF5 is designed for general-purpose storage, including metadata and other overheads that can lead to excessive file sizes.
Serialization limitations: Large and complex data structures like ‘struct’ can incur significant overhead because every field and subfield is treated as a separate dataset in HDF5.
‘getByteStreamFromArray’ is limited to serializing objects that MATLAB's internal serializer can handle. Structures or arrays with greater than 4 GB of data often hit limitations in MATLAB's serialization mechanism.
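The per-field HDF5 layout mentioned above can be inspected on a small test file with h5info. This is a sketch under the assumption that a scalar struct is stored as an HDF5 group; the struct `s` and its field names are invented for illustration.

```matlab
% Small illustrative struct with plain fields and one nested struct
s.a = rand(10);
s.b = 'text';
s.c.d = uint8(1:5);

fname = [tempname '.mat'];
save(fname, 's', '-v7.3');

% Assumption: the variable s appears as the HDF5 group '/s'
info = h5info(fname, '/s');
datasetNames = {info.Datasets.Name};   % fields stored as datasets
groupNames = {info.Groups.Name};       % fields stored as nested groups
disp(datasetNames);
disp(groupNames);

delete(fname);
```

Running the same inspection on a small slice of the real structure may show how many datasets and groups it produces, which gives a feel for where the overhead comes from.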
Some of the possible solutions that could resolve this issue are as follows:
Split into multiple variables and use -v7 format
- If feasible, divide your large structure into several smaller variables, each <2GB.
- Save these in the older ‘-v7’ format, which is more space-efficient for such cases.
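% Save each field of the scalar struct to its own v7 file (each field must be <2GB)
fields = fieldnames(myStruct);
for i = 1:numel(fields)
    save(['part_' fields{i} '.mat'], '-struct', 'myStruct', fields{i}, '-v7');
end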
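A round trip for the per-field approach might look like the following. It is a self-contained sketch: the tiny `myStruct`, its field names, and the temporary output folder are placeholders, and it assumes a scalar struct whose fields are each below the 2GB limit.

```matlab
% Small stand-in for the real data; field names are illustrative
myStruct.signal = rand(1000, 1);
myStruct.label = 'run 1';

outDir = tempname;
mkdir(outDir);

% Write one v7 file per field; '-struct' saves the named field as its own variable
fields = fieldnames(myStruct);
for i = 1:numel(fields)
    save(fullfile(outDir, ['part_' fields{i} '.mat']), ...
        '-struct', 'myStruct', fields{i}, '-v7');
end

% Reassemble: every variable found in a part file becomes a field again
files = dir(fullfile(outDir, 'part_*.mat'));
restored = struct();
for k = 1:numel(files)
    part = load(fullfile(outDir, files(k).name));
    names = fieldnames(part);
    for j = 1:numel(names)
        restored.(names{j}) = part.(names{j});
    end
end
```

The field order of `restored` follows the file listing, so it may differ from the original even though the contents match.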
Writing custom serializer
- If the structure doesn't contain complex objects, you can recursively serialize it into a binary file with custom MATLAB code.
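% fwrite accepts numeric, char, or logical arrays, not structs, so write one field at a time
fid = fopen('large_struct.bin', 'w');
fwrite(fid, myStruct.data, class(myStruct.data));  % 'data' is a placeholder field name
fclose(fid);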
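Raw bytes alone are not enough to read the data back, since the array shape is lost. One possible layout, shown here only as a sketch for a single double array, is to write the number of dimensions and the size vector before the data. The header format is an assumption made up for this example, not an established file format.

```matlab
data = rand(3, 4, 2);          % stand-in for one numeric field
fname = [tempname '.bin'];

% Write: [number of dims][size vector][raw values]
fid = fopen(fname, 'w');
fwrite(fid, ndims(data), 'uint64');
fwrite(fid, size(data), 'uint64');
fwrite(fid, data, 'double');
fclose(fid);

% Read: recover the shape first, then the values
fid = fopen(fname, 'r');
nd = fread(fid, 1, 'uint64');
dims = fread(fid, nd, 'uint64')';
restored = reshape(fread(fid, prod(dims), 'double=>double'), dims);
fclose(fid);

delete(fname);
```

A full serializer would repeat this per field and also record field names and classes, which is where the recursive part comes in.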
Use low-level HDF5 tools
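- If you need an HDF5-based file, you can consider using MATLAB's high-level HDF5 functions (h5create, h5write) to write the numeric data directly without the MAT-file overhead. Note that the result is a plain HDF5 file rather than a MAT-file, and these functions accept numeric arrays rather than structs, so each field is written as its own dataset.
% Write one numeric field as its own dataset
data = myStruct.data;  % 'data' is a placeholder field name
h5create('large_struct.h5', '/data', size(data), 'Datatype', class(data));
h5write('large_struct.h5', '/data', data);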
If performance is your priority and you don't need the ‘-v7.3’ features, you can split the data into smaller parts and use the -v7 format.
Moreover, if your workstation supports parallel computing, you can consider using MATLAB's Parallel Computing Toolbox to parallelize the saving process. This might help speed up the process, especially if your structure can be split into independent parts.
To know more about the usage of HDF5 functions used in the above code, refer to the documentation link mentioned below:
Best!