Improving speed of readtable

Question

Daniel van Huyssteen on 14 Apr 2021

https://la.mathworks.com/matlabcentral/answers/801591-improving-speed-of-readtable

Commented: Daniel van Huyssteen on 14 Apr 2021

Accepted Answer: Walter Roberson

Example.zip

I have a large array stored in a .dat file (see Example.dat attached) and I need to import the array into MATLAB.

At the moment I am using the following approach to load the table and convert it to an array.

Example_Table = readtable("Example.dat");
Example_Array = table2array(Example_Table);

This process is, however, taking much longer than I would expect, given that I have a reasonably powerful PC.

I suspect that the issue is related to the array having a large number of zero entries.

The results of Run & Time are shown below

It is clear that nearly all of the time is spent reading the table, not converting it to an array.

The timing profile of table.readTextFile>textscanReadData is shown below, where all of the time is spent on the TreatAsEmpty command (because of having many zero entries?).

Below is a snapshot of the CPU and RAM usage during the reading of table.

Here it is clear that a lot of computational power is not being used, so it should be possible to speed this process up one way or another.

How can I make this process run faster?

I have to read in lots of data like this and it is a very frustrating process.

Thanks in advance!

Answer 1

Walter Roberson on 14 Apr 2021

https://la.mathworks.com/matlabcentral/answers/801591-improving-speed-of-readtable#answer_674726

Where all of the time is spent on the TreatAsEmpty command (because of having many zero entries?)

No, that is not what is happening.

What is happening is that MathWorks coded the internal call to textscan() split across two lines:

data = textscan(fid, format, a bunch of stuff here, ...
                'TreatAsEmpty', treatasempty, a bunch more stuff here)

The time for the overall call is attributed to the last line of the call, as the textscan() call itself cannot start until all of the parameters have been evaluated.

For this purpose, "executed" for something like

treatasempty = [];
textscan('TreatAsEmpty', treatasempty)

would consist of parsing the character vector 'TreatAsEmpty', creating a temporary (unnamed) expression block for it, and pushing that into the parameters; and then parsing the variable name treatasempty, locating the variable in scope, and pushing its (named) expression block into the parameters. Those operations might not take long, but they do take some time, and that time is spent preparing to call textscan(), not inside textscan() itself. The time to parse the parameters and get ready for the call is what is shown for line 554, the data = textscan( part.

'TreatAsEmpty' is not a command in this context: it is just a literal constant to be prepared and passed in to the function.

The timing you are seeing for line 555 is the time spent executing textscan() itself.
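To illustrate that 'TreatAsEmpty' is only an option name passed to textscan(), here is a minimal sketch. The input text and the 'NA' marker are made up for illustration and are not taken from Example.dat or from the readtable internals.

```matlab
% 'TreatAsEmpty' is the name of an option; 'NA' is the value passed for it.
% Neither is executed as a command: both are arguments handed to textscan().
txt = '1,NA,3';   % made-up input text
data = textscan(txt, '%f', 'Delimiter', ',', ...
                'TreatAsEmpty', 'NA');
values = data{1};   % numeric column vector; the 'NA' field becomes NaN
```

The call is split across two lines here as well, so a line-based profile of this snippet would face the same attribution question described above.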

Walter Roberson on 14 Apr 2021

If you have a matrix of data, D, then

tic
[r, c, s] = find(D);
as_rows = [r, c, s].';
dlmwrite('Example_sparse.csv', as_rows, 'precision', 16);
toc

This takes about 1.3 seconds to write.

Restore with

tic
S = fileread('Example_sparse.csv');
rcs = str2num(S);
Drestored = sparse(rcs(1,:), rcs(2,:), rcs(3,:));
time_via_sparse = toc

This takes about 1.5 seconds to read.
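One caveat with this round trip: sparse(r, c, s) infers the matrix size from the largest indices, so trailing all-zero rows or columns would be lost. Below is a minimal sketch of a variant that also stores the dimensions. The small matrix D, the file name Example_sparse_sized.csv, and the choice of a "size line" at the top of the file are all assumptions made for illustration. It uses readmatrix (R2019a or later) in place of fileread/str2num.

```matlab
% Small made-up test matrix whose last row and last column are all zero
D = zeros(5, 6);
D(1, 2) = 3.5;
D(4, 3) = -2;

% Collect the non-zero entries as triplets (row, column, value)
[r, c, s] = find(D);

% First line stores the matrix size (third field is an unused placeholder);
% the remaining lines store one triplet each.
out = [size(D, 1), size(D, 2), 0; r, c, s];
fname = fullfile(tempdir, 'Example_sparse_sized.csv');  % hypothetical file name
dlmwrite(fname, out, 'precision', 16);

% Read back and rebuild with explicit dimensions
in = readmatrix(fname);
nrows = in(1, 1);
ncols = in(1, 2);
Drestored = sparse(in(2:end, 1), in(2:end, 2), in(2:end, 3), nrows, ncols);
```

Here the triplets are stored one per line rather than as three long rows, which is only a layout choice. Use full(Drestored) if a dense array is needed afterwards.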

Daniel van Huyssteen on 14 Apr 2021

This works great! Thanks! :D

Answer 2

Bjorn Gustavsson on 14 Apr 2021

https://la.mathworks.com/matlabcentral/answers/801591-improving-speed-of-readtable#answer_674716

It is a rather large data file to read. You might reduce the read time by using load instead of readtable, since that should avoid the overhead readtable incurs from supporting many different data formats.

If you can modify the data format of your files, that might be a far more successful way forward for very sparse data: save only the non-zero components together with their row and column indices, and rebuild the matrix when reading, instead of saving a large number of zeros. But maybe you're given the data and have to shovel zeros and zeros around...

HTH

Daniel van Huyssteen on 14 Apr 2021

Thanks for the answer.

Unfortunately I'm not able to manipulate the writing/storage of the original data.

Curiously, using load instead of readtable takes even longer!

Bjorn Gustavsson on 14 Apr 2021

That's a double bummer. I'm really surprised that load takes longer; I would've bet good money that the more general capacity of readtable would cost time. Then perhaps you can save overall processing time by following Walter's suggestion of converting the data files to a sparse format. You might be able to bulk-process all the data files overnight, when it doesn't test your patience...
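A rough sketch of such a one-off bulk conversion is given below. The folder names, the demo file, and the decision to store each matrix as a sparse variable S in a MAT-file (rather than as a text file of triplets) are all assumptions for illustration, not something specified in this thread. The slow text read still happens once per file; later sessions would only need to load the converted file.

```matlab
% Hypothetical folder layout: raw .dat files in, converted MAT-files out
inDir  = fullfile(tempdir, 'dat_in');
outDir = fullfile(tempdir, 'sparse_out');
if ~exist(inDir, 'dir'),  mkdir(inDir);  end
if ~exist(outDir, 'dir'), mkdir(outDir); end

% Create one tiny demo file so the sketch runs end to end
demo = [0 0 1.5; 0 2 0; 0 0 0];
writematrix(demo, fullfile(inDir, 'demo.dat'));

% Convert every .dat file in the input folder
files = dir(fullfile(inDir, '*.dat'));
for k = 1:numel(files)
    D = readmatrix(fullfile(files(k).folder, files(k).name));  % slow step, done once
    [r, c, s] = find(D);
    S = sparse(r, c, s, size(D, 1), size(D, 2));   % keep the original size
    [~, base] = fileparts(files(k).name);
    save(fullfile(outDir, [base '.mat']), 'S');
end

% Later: load the converted file instead of re-reading the text
T = load(fullfile(outDir, 'demo.mat'));
restored = full(T.S);
```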
