Read a small selection of data from a large file
I have several large .csv files (up to around 8 GB; the arrays have about 10^5 rows and up to 15k columns) from which I would like to read data. Most of these read operations will only pull 1000 to 10000 data points at a time (generally just a single row of data, or a subset of a row). However, it seems like dlmread is doing something inefficient, since each read operation takes several minutes. Is there a lower-level read function that can do this significantly faster? (It really needs to be orders of magnitude faster; even a 2x speedup isn't going to cut it.) Should I use another format for the data? I thought about building a MySQL database for it, but I have no experience with this. Is MATLAB even the right environment for this sort of thing? Thanks in advance.
Josh
Accepted Answer
More Answers (1)
Ashish Uthama
on 25 May 2011
If you have the option to change the source, or if you plan to use the data over and over again, it might be best to change the format to plain binary, i.e., use fwrite to write the values out as doubles rather than as a text format like CSV. (Unless, of course, each line has a varying number of entries and this structure is integral to your data.)
This would probably be the fastest approach, since it is the simplest. The file size might also be smaller.
You will be able to index into the file to read subsets more easily. You can compute the offset to the (i,j)th element directly, since you know exactly how much space a single double takes in a binary file.
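As a minimal sketch of this idea (assuming the matrix A fits in memory for a one-time conversion; the file name data.bin, the sizes, and the row index are placeholders): write the transpose so each original row is contiguous on disk (fwrite stores data column-major), then fseek to the row's byte offset and fread only that row.

```matlab
% One-time conversion: store the matrix row-by-row as raw doubles.
nRows = 1e5; nCols = 15000;          % adjust to your data
fid = fopen('data.bin', 'w');
fwrite(fid, A.', 'double');          % transpose so each row is contiguous
fclose(fid);

% Later: read row i without touching the rest of the file.
i = 12345;
fid = fopen('data.bin', 'r');
fseek(fid, (i-1)*nCols*8, 'bof');    % 8 bytes per double
row = fread(fid, nCols, 'double').';
fclose(fid);
```

Since fseek jumps straight to the byte offset, the cost of reading one row is independent of the file size, which is where the orders-of-magnitude speedup over dlmread comes from.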
2 Comments
Walter Roberson
on 25 May 2011
That's a good idea.
Josh Warren
on 25 May 2011