Append data to matfile using parallel method

Hi:
I have a lot of data that needs to be saved into a test.mat file. Below is my test code:
x=rand(10000,1);
save('test.mat','x');
for i=1:1:100
eval(['va_',num2str(i),'=rand(10000,1);'])
eval(['save(','''','test.mat','''',',','''','va_',num2str(i),'''',',','''','-append','''',')'])
end
The problem is that this is only test code; in my real situation:
1. The number of variables is very large (up to va_10000).
2. Each 'va_i' is very large (up to 2e6-by-1).
Even though I have upgraded my drive to a 960 EVO SSD, the saving time is still very long.
Is there any way to change the code so it saves in parallel, to reduce the overall time?
Thanks!
Yu

6 Comments

Stephen23
Stephen23 on 12 Sep 2018
Edited: Stephen23 on 12 Sep 2018
2e6 elements per array for 10000 arrays... assuming double type, this requires in the region of 160 Gigabytes of memory. Does your computer have that much memory?
Do all of those arrays need to be in MATLAB memory at the same time? How do you want to process the data in that file? Would multiple files be acceptable instead?
"is there anyway to improve the code into parallel saving?"
Avoiding eval would be a start.
The MATLAB JIT engine does a lot of optimizations if it can... but not if you use eval.
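For example, loop-indexed variable names can be written without eval by using dynamic field names on a matfile object (a sketch only; the names and sizes are illustrative, and matfile writes in the v7.3 format):

```matlab
% Sketch: append loop-indexed variables to a MAT-file without eval.
m = matfile('test.mat', 'Writable', true);    % creates/opens a v7.3 MAT-file
for i = 1:100
    m.(sprintf('va_%d', i)) = rand(10000, 1); % dynamic field name, no eval
end
```

Each assignment writes one variable to disk without reloading the whole file.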
Yu Li
Yu Li on 12 Sep 2018
Edited: Yu Li on 12 Sep 2018
Hi:
My workstation is busy at the moment; I will come back to you with new test code after it finishes the current calculation.
Thanks for your reply.
Yu
Hi:
1. I could not find a way to avoid the 'eval' function, because the variable names change with 'i' in the for-loop.
2. Yes, this is the size I'm working with; I'm using a server to handle data this large.
3. Please see the test code below:
x=rand(10000,1);
save('test.mat','x');
parfor i=1:1:1000
para_save(i)
end
function para_save(i)
eval(['va_',num2str(i),'=rand(10000,1);'])
eval(['save(','''','test.mat','''',',','''','va_',num2str(i),'''',',','''','-append','''',')'])
end
In this way the saving time is significantly reduced, but the 'test.mat' file generated by this code cannot be opened.
Thanks!
Yu
function para_save(i)
varname = sprintf('va_%d', i);
savestruct.(varname) = rand(10000,1);
save('test.mat', '-struct', 'savestruct', '-append');
However, MATLAB does not promise that you can have multiple simultaneous save() to the same file.
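Used serially, the struct-based pattern above might look like this (a sketch; the '-append' call assumes test.mat was already created):

```matlab
x = rand(10000, 1);
save('test.mat', 'x');                            % create the file first
for i = 1:100
    S = struct(sprintf('va_%d', i), rand(10000, 1));
    save('test.mat', '-struct', 'S', '-append');  % each field becomes a variable
end
```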
Walter Roberson
Walter Roberson on 13 Sep 2018
Is the size and data type of each variable the same?
Is the data likely to be compressible?
Yu Li
Yu Li on 13 Sep 2018
The data sizes are not the same, and the data are not compressible. I have to save them in the same MAT-file for future reading.


Answers (2)

Steven Lord
Steven Lord on 13 Sep 2018

0 votes

Consider writing each variable to a different file in such a way that when you want to use them later on you can construct a datastore using that collection of files and make a tall array from the datastore.
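A sketch of that idea, with hypothetical file names, assuming each part file stores a single column vector named 'data' (whether tall(fds) works as written depends on each read returning a conformant column):

```matlab
% Write each chunk to its own file (this loop could be a parfor).
for i = 1:100
    data = rand(10000, 1);
    save(sprintf('part_%05d.mat', i), 'data');
end

% Later: build a datastore over the files and a tall array from it.
fds = fileDatastore('part_*.mat', 'ReadFcn', @(f) getfield(load(f), 'data'));
t = tall(fds);   % lazily evaluated; chunks are concatenated vertically
```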

1 Comment

Yu Li
Yu Li on 13 Sep 2018
My memory is sufficient for these data; I need to append them to a single MAT-file for future reading.


Walter Roberson
Walter Roberson on 13 Sep 2018

0 votes

You cannot write to a MAT-file in parallel. If writing in parallel to a single MAT-file is a hard requirement, then your problem cannot be solved.
If computation of the items is expensive, then do the computation in parallel, writing to different mat files (though potentially one per parallel core rather than one per variable.) Afterwards, merge the files together in a serial loop.
With the data not being compressible, either write in plain binary or else use save's '-nocompression' option (available with the -v7.3 format) so time is not wasted compressing the output.
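The compute-in-parallel, merge-serially approach might be sketched as follows (the worker file names are hypothetical):

```matlab
% Each worker saved its variables into its own worker_*.mat file.
files = dir('worker_*.mat');
x = rand(10000, 1);
save('test.mat', 'x', '-v7.3');               % start the merged file
for k = 1:numel(files)
    S = load(fullfile(files(k).folder, files(k).name)); % one worker's variables
    save('test.mat', '-struct', 'S', '-append');        % re-save into the merged file
end
```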

2 Comments

Yu Li
Yu Li on 13 Sep 2018
If I save these files under separate names in parallel and then merge them, the merging process requires loading the data into memory and re-saving it, which may increase the overall saving time.
Or do you have another way to merge the data?
Walter Roberson
Walter Roberson on 13 Sep 2018
Overall saving time might not increase, under the assumption that calculating each array is expensive. If results are produced, on average, more slowly than the time required to save one variable, then computing with parfor and merging afterwards can potentially save time.
Another approach in the case where calculations are expensive is to use a pollable data queue to calculate results in parallel and send them back to the client process to do the saving.
If results are produced faster than one variable can be saved, then you are probably bandwidth-limited writing to the SSD, and increasing the number of simultaneous writers will not increase the bandwidth.
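The pollable-data-queue pattern mentioned above might be sketched like this (names are illustrative; it assumes test.mat already exists so '-append' succeeds):

```matlab
q = parallel.pool.PollableDataQueue;
N = 100;
for i = 1:N
    parfeval(@worker, 0, q, i);               % compute in parallel, no outputs
end
done = 0;
while done < N
    [msg, ok] = poll(q, 5);                   % wait up to 5 s for a result
    if ok
        S = struct(msg.name, msg.data);
        save('test.mat', '-struct', 'S', '-append');  % client does all saving
        done = done + 1;
    end
end

function worker(q, i)
    data = rand(2e6, 1);                      % stand-in for the real computation
    send(q, struct('name', sprintf('va_%d', i), 'data', data));
end
```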

