Compress only selected variables when saving to .mat

I have two variables, data and meta, which I save in a compressed .mat file (version '-v7'). The data variable is typically 800 MB uncompressed, while meta is not even 1 MB. I have lots of these .mat files, and sometimes I just need to loop over all the meta variables. However, because the file is compressed, loading the meta variable alone still takes a long time, i.e. about as long as loading both variables.
Is it possible to selectively compress specific variables in a .mat file? Are there alternative data designs?
Note: I already keep an overall single meta, which is basically the concatenation of the smaller ones. However, I will need to abandon this approach because it does not scale well in size or performance.
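For concreteness, the setup is roughly the following (sizes, field names, and file names are illustrative, not taken from the question):

```matlab
% Illustrative setup: one large array plus small metadata per file.
data = rand(10000, 10000);                     % ~800 MB of doubles
meta = struct('id', 1, 'date', '2014-06-25');  % tiny

save('run001.mat', 'data', 'meta', '-v7');     % '-v7' compresses the file

% Even when asking for 'meta' only, loading reportedly takes about
% as long as loading everything:
S = load('run001.mat', 'meta');
```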

6 Comments

No, but why not use two mat-files?
dpb on 25 Jun 2014
At the expense of multiple names/files, save the two separately? It also means having to switch the version level for the smaller one, although, being small, maybe it wouldn't cost much to decompress it anyway???
Alternatively, it would seem you could give up compression entirely and go to a stream file and memmapfile, maybe?
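The stream-file + memmapfile idea could be sketched like this (the file name and array layout are assumptions for illustration):

```matlab
% Sketch: keep the big array uncompressed in a raw binary file.
data = rand(10000, 10000);
fid = fopen('run001.dat', 'w');
fwrite(fid, data, 'double');
fclose(fid);

% Map it later without reading the whole file into memory:
m = memmapfile('run001.dat', ...
               'Format', {'double', [10000 10000], 'x'});
block = m.Data.x(1:100, :);   % touches only the pages it needs
```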
Oleg Komarov on 25 Jun 2014 (edited 25 Jun 2014)
The simplest alternative would be to go with 2 files (that would make a total of ~5000 files, which is okay). However, it means changing my API in many places, with the risk of introducing bugs.
Giving up compression is not feasible/scalable: 2500 × 800 MB ≈ 1.9 TB.
"Version 7.3 MAT-files use an HDF5-based format that stores data in compressed chunks. The time required to load part of a variable from a Version 7.3 MAT-file depends on how that data is stored across one or more chunks. Each chunk that contains any portion of the data you want to load must be fully uncompressed to access the data. Rechunking your data can improve the performance of the load operation. To rechunk data, use the HDF5 command line tools, which are part of the HDF5 distribution."
Anybody with experience in setting the chunking layout with low level HDF5 functions?
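For a plain HDF5 file (not a MAT-file), the chunk layout can actually be set from MATLAB's high-level API without touching the low-level functions; a hedged sketch, with illustrative file, dataset names, and sizes:

```matlab
% Sketch: per-dataset chunking/compression in a plain HDF5 file.
h5create('runs.h5', '/data', [10000 10000], ...
         'ChunkSize', [1000 1000], 'Deflate', 6);  % chunked + compressed
h5write('runs.h5', '/data', rand(10000, 10000));

h5create('runs.h5', '/meta', [1 100]);             % contiguous, uncompressed
h5write('runs.h5', '/meta', zeros(1, 100));

m = h5read('runs.h5', '/meta');  % fast: no decompression needed
```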
dpb on 25 Jun 2014
Not I on the HDF5 stuff, no...
I don't suppose the matfile thingie has any advantage on the read side, does it? Since you're looking at the whole variable, even though it's the smaller of the two, I presume it would still have to decompress the whole thing?
per isakson on 25 Jun 2014 (edited 25 Jun 2014)
I've spent too much time experimenting with Matlab's low- and high-level HDF5 APIs and alternatives. I'm not sure my "results" are relevant to your use case.
My use case:
  • many hundred 1MB time series
  • reading performance much more important than writing performance
  • typically reading of entire time series
My conclusions:
  • the low level HDF5 API is not worth the trouble (in my case)
  • the system cache is important to the performance (buy more RAM)
  • store double only when necessary
  • chunking comes at a high price (performance)
I think it is difficult to recommend anything without knowing a bit more about the internal structure of the 800MB and the 1MB together with descriptions of some typical "queries".
"However, it means changing my API in many places, and risk of introducing bugs."
I guess the documentation refers to:

    Tool name: h5repack
    Purpose:   Copies an HDF5 file to a new file with or without
               compression and/or chunking.

Whether h5repack can help depends on your queries.
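If the HDF5 command-line tools are on the system path, h5repack can be driven from MATLAB; for a -v7.3 file one could, for example, rechunk and recompress the large dataset (the dataset name '/data', the chunk size, and the file names are assumptions about the file's layout):

```matlab
% Sketch: rechunk/recompress one dataset of a -v7.3 MAT-file with h5repack.
cmd = ['h5repack -l /data:CHUNK=1000x1000 -f /data:GZIP=6 ', ...
       'big_v73.mat big_v73_repacked.mat'];
status = system(cmd);   % 0 on success
```

One caveat to verify: a -v7.3 MAT-file keeps its MAT header in the HDF5 user block, so check that the repacked file is still readable by load (h5repack's user-block options may be needed to preserve it).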
So, matfile() does not improve loading speed (though it also depends on the setup; I am currently testing on an SSD).
From my previous link it seemed possible to use the low-level HDF5 API directly on .mat files, but that is not the case. Also, I have neither the time nor the intention to learn/rewrite everything in pure HDF5 (MAT-files being based on it).
I have not mentioned that meta is a dataset array, which might affect how it is stored, i.e. not contiguously, forcing many chunks to be unzipped/loaded. I will test serializing the dataset before saving it.
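Serializing before saving could be tested with MATLAB's byte-stream built-ins; these are undocumented, so this is a sketch to experiment with, not a recommendation:

```matlab
% Sketch: store 'meta' as a single contiguous uint8 array.
bytes = getByteStreamFromArray(meta);   % undocumented built-in
save('run001.mat', 'data', 'bytes', '-v7');

% To read it back:
S = load('run001.mat', 'bytes');
meta = getArrayFromByteStream(S.bytes); % undocumented built-in
```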


1 Answer

Jeremy on 23 Jan 2015
I know this is old, but I have been doing similar work recently, since I could tell it was taking longer than it should to load a small portion of the data using the matfile method. I also spent two days learning how to use all the low-level HDF5 commands, only to find that they did not really help on the read side.
Then I realized that the issue originates on the write side, with the compression that occurs there. The savefast utility on the File Exchange saves using the high-level HDF5 commands, and this does NOT compress the data. That didn't quite work for me, since I am saving complex numbers, but I was able to use the same approach, and I am now saving uncompressed v7.3 files and reading small portions of my matrix over 100 times faster!
If your matrix is just real numbers, you should be able to create a v7.3 file with your metadata and then use the simple high-level h5write command to save additional variables to the same file uncompressed.
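A sketch of that recipe (the file name, dataset name '/data', and sizes are illustrative); note that a dataset added this way is read back with h5read rather than load, and it is worth verifying that load still handles the file's MATLAB variables gracefully:

```matlab
% Sketch: small metadata via save, big real matrix appended uncompressed.
meta = struct('id', 1);
save('run001.mat', 'meta', '-v7.3');       % tiny, compression cost negligible

h5create('run001.mat', '/data', [10000 10000]);  % no Deflate => uncompressed
h5write('run001.mat', '/data', rand(10000, 10000));

part = h5read('run001.mat', '/data', [1 1], [100 100]);  % fast partial read
```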

