Large streaming data direct to file

Joe Davison on 16 Nov 2018
Edited: Joe Davison on 29 Nov 2018
Hi,
I would like to set up a system to log months' worth of financial JSON websocket data to a file.
  • The incoming JSON data looks like this: {"this": "that", "foo": [1,2,3], "bar": ["a", "b", "c"]}, and there are about 20 messages per second.
  • I tested writing directly to a .txt file with FPRINTF. That works, but because there is no compression the files get really big: about 2 GB per day.
  • I tested different SAVE formats ('-v7' being by far the best) to save a new variable inside a .mat file every 10 minutes. This was a little too slow to keep up with the incoming stream: each save took almost a second, and processing would not be ideal if I have to load a ton of different variables. The file size looked very good, though. (http://undocumentedmatlab.com/blog/improving-save-performance)
  • I tried MATFILE to write directly to the file, but I could only append to the end of a file with '-v7.3' .mat files. That makes the file a lot bigger than '-v7' and still takes a little too long.
  • I would like a file format with good compression that I can append a new message to quickly. Maybe the HDF5 file format?
I believe I need to serialize the incoming data and save it directly to a file in some compressed form, but I'm not exactly sure how to do that.
  • I read through this article and don't quite understand how to implement it (https://undocumentedmatlab.com/blog/serializing-deserializing-matlab-data). Since it is an older article, is there a more up-to-date way?
  • Do I use something like "h5write"? "getByteStreamFromArray"?
  • After the file has been created with months of data, how do I pull out each message, one by one, to process it?
  • Is "Fast serialize/deserialize" on the File Exchange the correct path? I can't figure out how to use it.
Thank you!
Joe

Answers (1)

Jan on 16 Nov 2018
Edited: Jan on 16 Nov 2018
You can create the text as a char vector with sprintf instead of fprintf and compress it in RAM before writing it to disk: https://www.mathworks.com/matlabcentral/fileexchange/69388-mkzip . This should avoid the overhead of compressed MAT files.
Maybe it is just the disk access that slows down the processing. In that case, try using an SSD instead.
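A minimal sketch of this suggestion, assuming the messages are buffered in RAM and compressed one batch at a time. Java's java.util.zip.GZIPOutputStream (available through MATLAB's bundled JVM) stands in here for the File Exchange submission, and the batch size and output file name are hypothetical:

```matlab
% Sketch: buffer messages in RAM, compress the batch once, write it once.
% Assumption: Java's GZIPOutputStream replaces the File Exchange compressor.
msg = '{"this": "that", "foo": [1,2,3], "bar": ["a", "b", "c"]}';
nMessages = 3000;                        % hypothetical batch size

% Build the whole batch as one char vector, one message per line.
batch = repmat(sprintf('%s\n', msg), 1, nMessages);
rawBytes = uint8(batch);

% Compress the batch in memory.
byteStream = java.io.ByteArrayOutputStream();
gzipStream = java.util.zip.GZIPOutputStream(byteStream);
gzipStream.write(rawBytes);
gzipStream.close();                      % finishes the gzip stream
compressed = typecast(byteStream.toByteArray(), 'uint8');

% One fopen/fwrite/fclose per batch instead of one per message.
outFile = [tempname '.gz'];              % hypothetical output file
fid = fopen(outFile, 'w');
fwrite(fid, compressed, 'uint8');
fclose(fid);

listing = dir(outFile);
writtenBytes = listing.bytes;
delete(outFile);                         % clean up the demo file

fprintf('raw: %d bytes, compressed: %d bytes\n', ...
        numel(rawBytes), numel(compressed));
```

Compressing a batch rather than each message lets the compressor exploit the repetition between messages, which is where most of the savings in this kind of data would come from.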
  1 comment
Joe Davison on 17 Nov 2018
Edited: Joe Davison on 29 Nov 2018
Thanks Jan,
Great comments!
I made a little test program:
string = '{"this": "that", "foo": [1,2,3], "bar": ["a", "b", "c"]}';
disp('creating compressed string');
tic
whatyeaTest = sprintf('%s', string);  % '%s' so the message is not parsed as a format string
dzipVariable = dzip(whatyeaTest);     % dzip comes from the File Exchange mkzip submission
toc
lineSize = 3000;                      % number of messages to write
disp('compressed');
tic
for i = 1:lineSize
    fid = fopen('CompressedString8.bin', 'a+'); % Comes out at about 36 KB
    fwrite(fid, dzipVariable');
    fclose(fid);
end
toc
disp('regular, no compression');
tic
for i = 1:lineSize
    fid = fopen('NonCompressedString8.txt', 'a+'); % Comes out at about 18 KB
    fwrite(fid, string);
    fclose(fid);
end
toc
Maybe I did something wrong, but the compressed file ends up about twice the size of the uncompressed one. I assume this is because there is some overhead in compressing each string separately instead of the entire file.
CompressedString8.bin - 36 KB
NonCompressedString8.txt - 18 KB
It does look like dzip (the actual function in the mkzip submission) compresses the variable to about half the size, but for some reason the resulting file is still larger.
Name            Size    Bytes   Class
string          1x56    112     char
dzipVariable    63x1    63      uint8
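A possible explanation for the sizes above, offered as a sketch rather than a diagnosis: whos reports the in-memory size, and MATLAB stores char data at 2 bytes per character, whereas fwrite without a precision argument writes one byte per element. On disk the 56-character message therefore takes 56 bytes, not 112, so a 63-byte compressed version is already larger. The snippet below checks this, and also compresses a single message with Java's GZIPOutputStream (an assumption, used here instead of dzip) to show how little a compressor can do with one short message:

```matlab
msg = '{"this": "that", "foo": [1,2,3], "bar": ["a", "b", "c"]}';

% whos reports the in-memory size: char is stored as 2 bytes per character.
info = whos('msg');
inMemoryBytes = info.bytes;

% fwrite without a precision argument writes each element as one uint8.
rawFile = [tempname '.txt'];             % hypothetical scratch file
fid = fopen(rawFile, 'w');
fwrite(fid, msg);
fclose(fid);
listing = dir(rawFile);
onDiskBytes = listing.bytes;
delete(rawFile);

% Compress the single message on its own (stand-in for dzip).
byteStream = java.io.ByteArrayOutputStream();
gzipStream = java.util.zip.GZIPOutputStream(byteStream);
gzipStream.write(uint8(msg));
gzipStream.close();
oneMessageCompressed = numel(byteStream.toByteArray());

fprintf('in memory: %d, on disk: %d, one message compressed: %d bytes\n', ...
        inMemoryBytes, onDiskBytes, oneMessageCompressed);
```

Every compressed message carries its own fixed header and trailer, so compressing messages individually tends to pay that cost thousands of times over.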
  • Good thought about the SSD. I am using an SSD and a fairly nice i7 processor in these tests.
  • Is there another way to continuously compress the incoming data and write it to the drive without the overhead?
  • The second, "regular, no compression" version works fine as far as write speed goes, but the file size is too big. Is there a way to ZIP the file on a parallel processor / thread so that it does not take up valuable processing time while the websocket data is still coming in? (Zipping takes about 30 seconds for a day's worth of data, which is about 1 GB.) I think I only have about 0.1 seconds between the websocket messages.
Thanks for your help!
-Joe
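
One possible way to compress continuously without the per-message overhead is to keep a single gzip stream open on the file and write every message into it, so that the compressor sees the whole stream rather than 56 bytes at a time. The sketch below assumes MATLAB's bundled JVM and jsondecode (R2016b or later); the file name is hypothetical and the loop stands in for the websocket callback. It also shows one way to read the messages back one by one:

```matlab
msg = '{"this": "that", "foo": [1,2,3], "bar": ["a", "b", "c"]}';
nMessages = 3000;
% Use an absolute path: Java's working directory may differ from pwd.
logFile = [tempname '.jsonl.gz'];        % hypothetical log file

% --- Writing: one long-lived gzip stream, one message per line ---
fileStream = java.io.FileOutputStream(logFile);
gzipStream = java.util.zip.GZIPOutputStream(fileStream);
for k = 1:nMessages                      % stands in for the websocket callback
    gzipStream.write(uint8([msg char(10)]));
end
gzipStream.close();                      % finishes the gzip stream and closes the file

listing = dir(logFile);
fprintf('raw: %d bytes, on disk: %d bytes\n', ...
        nMessages * (numel(msg) + 1), listing.bytes);

% --- Reading: decompress on the fly and decode line by line ---
inStream = java.util.zip.GZIPInputStream(java.io.FileInputStream(logFile));
reader = java.io.BufferedReader(java.io.InputStreamReader(inStream, 'UTF-8'));
nRead = 0;
txt = reader.readLine();                 % char vector, or [] at end of file
while ischar(txt)
    data = jsondecode(txt);              % struct with fields this, foo, bar
    nRead = nRead + 1;
    txt = reader.readLine();
end
reader.close();
delete(logFile);                         % clean up the demo file
```

Until the stream is closed, some data is still buffered and the gzip trailer is missing, so a crash could lose the tail of the file. Starting a new file periodically (hourly, say) would limit that exposure.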
