MATLAB Answers


Binary, ASCII and Compression Algorithms

Asked by Matlab2010 on 11 Jul 2014
Latest activity: edited by José-Luis on 14 Jul 2014
I have a large number (> 1e6) of ASCII files (myFile.txt) which contain time series data, all in the same format: timestamp, field 1, field 2, ..., field 20. Each data entry is one row, tab-separated. Each of fields 2-20 is a double. The timestamp is a string (HH:MM:SS.FFF). The files are each c. 5 GB in size.
I wish to reduce the hard disk storage required. How can I do this?
My thoughts so far are
1. Convert the files to binary format. How can I do this? Is it by applying dec2bin.m? However, that function seems to take only scalars. What would this look like?
2. Compress each file. Each file may be used independently of the others, so I wish to compress them individually. I know that different compression approaches suit different data structures. Given my data structure above, which is the best one to apply?
Given the importance of this, I would be happy calling code in other languages (e.g., C++) from inside MATLAB. Any standard libraries / third-party tools that can be recommended?
3. Any other suggestions?
Finally, an important point is that I wish the user to be able to quickly load and access the data in each file — i.e., the bin2dec() call must be quick, as must the decompression.
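For point 1, here is a minimal sketch of what the text-to-binary conversion could look like (the file names, the one-timestamp-plus-20-doubles layout, and the use of fwrite/fread are assumptions based on the description above):

```matlab
% Sketch: convert one tab-separated text file to a flat binary file.
% Assumes one HH:MM:SS.FFF timestamp column followed by 20 double fields;
% 'myFile.txt' / 'myFile.bin' are illustrative names.
fid = fopen('myFile.txt', 'r');
C = textscan(fid, ['%s' repmat('%f', 1, 20)], 'Delimiter', '\t');
fclose(fid);

t = datenum(C{1}, 'HH:MM:SS.FFF');   % timestamp strings -> serial date doubles
M = [t, cell2mat(C(2:end))];         % one row per sample, 21 doubles

fid = fopen('myFile.bin', 'w');
fwrite(fid, M.', 'double');          % transpose so each row is contiguous
fclose(fid);

% Reading back is a single fread, with no parsing cost:
fid = fopen('myFile.bin', 'r');
M2 = fread(fid, [21, Inf], 'double').';
fclose(fid);
```

Note this uses fwrite/fread, not dec2bin: dec2bin produces a character string of '0'/'1' digits, which would make the files larger, not smaller.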
thank you!

  3 Comments

This is something I would not use MATLAB for. Those are massive files.
You could use a database. Most of them have options to compress data and to avoid unnecessary repetition, e.g., of field names.
If you want to save as a binary file, what would you do with the text? If you have a mix of text and data, reading and saving becomes trickier. Also, if you don't document the format properly, good luck trying to read it in the future. I would point you to existing formats such as NetCDF and similar. A .mat file could work but, then again, I wouldn't use them for this.
I don't want to use a database due to I/O costs.
The binary files would contain no text, as I would convert the timestamps to a numeric format (e.g., using datenum.m).
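A round-trip sketch of that timestamp conversion (the format string is an assumption; note that datenum returns a MATLAB serial date number, a plain double, rather than a Java-style timestamp):

```matlab
% Store the timestamp as one double; recover the string only on display.
ts   = '13:45:07.250';
d    = datenum(ts, 'HH:MM:SS.FFF');   % double, safe to fwrite
back = datestr(d, 'HH:MM:SS.FFF');    % recovers the HH:MM:SS.FFF string
```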
I would use a database. Which one is mostly down to personal preference and constraints. I like MySQL because it's free.
Depending on what your data looks like, you could use the NetCDF format. It has support for being read and written in MATLAB. The same is true for HDF5. These are sort of lightweight databases, though.
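A sketch of the HDF5 route from MATLAB (the dataset name, chunk size, and compression level are arbitrary choices; the resulting file is readable from Python via h5py and from R via rhdf5):

```matlab
% Write one file's numeric matrix to HDF5 with gzip ('Deflate') compression.
data = rand(1000, 21);                 % placeholder for the parsed data
h5create('myFile.h5', '/series', size(data), ...
         'Datatype', 'double', ...
         'ChunkSize', [1000 21], ...   % compression requires chunking
         'Deflate', 6);                % gzip level 0-9
h5write('myFile.h5', '/series', data);

back = h5read('myFile.h5', '/series'); % partial reads are also possible
```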
IMO, I/O through a database would be faster than wading through the mountain of files you have, unless you plan on hard-coding file paths. I haven't tested it, though, so that's not definite.


1 Answer

Answer by Star Strider on 11 Jul 2014

I would read them in as text files, save them as ‘.mat’ files (in the default binary format), then delete the text files. Since the ‘.mat’ files have a different suffix/extension, the prefix name can be the same as for the text file. See the documentation for save and load for details.
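A sketch of that workflow for one file (variable and file names are illustrative; note that variables larger than 2 GB require the -v7.3 flag, which stores the .mat file as compressed HDF5):

```matlab
% Parse, save as compressed binary, then remove the text original.
data = rand(1000, 21);               % stands in for the parsed text data
save('myFile.mat', 'data', '-v7.3'); % v7 and v7.3 are both compressed on disk
S = load('myFile.mat');              % quick binary load later; data in S.data
delete('myFile.txt');                % only after verifying the .mat loads
```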

  2 Comments

1. I would like to be able to access the data from Python and R as well as MATLAB.
2. Does compressing .mat files help much? E.g., zip.m?
  1. If you want to access the files from other applications, your best option would be to go with something other than .mat files, since to the best of my knowledge, those are MATLAB-specific. I’m not familiar with the file types Python and R can read and write, so you would need to find a common, space-efficient file format for all three applications.
  2. Compressing them would help. You probably have to go that route anyway, considering the sizes of the files.
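One caveat worth adding here: default (v7) .mat files are already deflate-compressed on disk, so zipping them afterwards typically gains little; zipping the raw ASCII files, by contrast, helps a lot. A sketch using MATLAB's built-in wrappers (file names are illustrative):

```matlab
% zip/gzip the text original; unzip/gunzip on demand when the user loads it.
zip('myFile.zip', 'myFile.txt');   % deflate, as zip.m in the question suggests
gzip('myFile.txt');                % alternative: creates myFile.txt.gz
names = unzip('myFile.zip');       % decompress only when the data is needed
```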
