How can I estimate the time required by textscan and the size of the output?
Hello,
I am running MATLAB R2013b on Windows 7. I have 8 GB of RAM and I set the swap file to 20 GB.
I am trying to read a relatively large txt file that is tab separated. The size of the file on the hard disk is a little over 2 GB. There are 6 columns and approx. 64 million rows in the file. The entries are mixed (strings and numbers with some missing values).
At this point I am using:
textscan(fid,repmat('%s',1,6),'delimiter','\t');
It has been running for about 4 hours now, using about 6.5 GB of RAM.
1. I would like to know how I can estimate the time it takes to read the file and the size of the output.
2. After it is done I would like to extract the numerical values from the resulting cell matrix and save that to a .mat file. Any idea how long that would take?
3. Is there any better way of doing this? If I could extract from the file a matrix with the numerical values only (setting everything else to NaN) it would be great.
Thanks!
3 comments
per isakson
on 31 Jul 2014
Edited: per isakson
on 31 Jul 2014
"using about 6.5 GB RAM."   How did you get this number? I would have expected close to 8 GB.
Alexandru
on 31 Jul 2014
per isakson
on 31 Jul 2014
Edited: per isakson
on 31 Jul 2014
Strange! Could something be using memory that the Task Manager doesn't report? Is your system 64-bit?
Accepted Answer
More Answers (1)
dpb
on 30 Jul 2014
1 vote
I don't know of any metric for predicting the run time other than testing, since it depends so heavily on the machine's characteristics and not just on the file size.
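For what it's worth, "testing" can be as simple as timing a sample of rows and scaling up by the fraction of the file consumed. The following is only a rough sketch: the demo file, `nSample` and all variable names are made up for illustration, and the linear extrapolation is an assumption that will underestimate badly once the machine starts swapping.

```matlab
% Write a small demo file so the snippet runs on its own.
% For the real case, point fname at the actual data file instead.
fname = fullfile(tempdir, 'textscan_timing_demo.txt');
fid = fopen(fname, 'w');
for k = 1:2000
    fprintf(fid, 'id%d\t%g\t\tabc\t%d\tx\n', k, k/7, k);  % third field is empty
end
fclose(fid);

nSample = 500;                    % rows to time (assumed; larger is more reliable)
info = dir(fname);                % info.bytes is the file size on disk

fid = fopen(fname, 'r');
tic;
sample = textscan(fid, repmat('%s', 1, 6), nSample, 'Delimiter', '\t');
tSample = toc;
bytesRead = ftell(fid);           % bytes consumed by the sample
fclose(fid);

w = whos('sample');               % memory held by the sample output
scale = info.bytes / bytesRead;   % how many such samples fit in the file
estSeconds = tSample * scale;     % linear extrapolation (assumption)
estBytes = w.bytes * scale;
fprintf('Estimated time: %.1f s, estimated output: %.1f MB\n', ...
    estSeconds, estBytes / 2^20);
```

Comparing `w.bytes` with `bytesRead` also shows how much larger the all-'%s' cell output is than the text it came from, since every cell carries its own overhead.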
Two things I can think of to try:
a) Use a format that matches the data file: numeric specifiers (e.g. '%f') for numeric columns and '%s' for string columns. Skip any fields you don't need ('%*s', for example, skips a string field). Use 'CollectOutput',true to gather consecutive columns of the same type into a single array. This avoids a separate conversion step afterwards.
b) Use textscan's ability to process the file in pieces, say roughly 1 MB to a few MB per pass.
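Putting a) and b) together might look roughly like the sketch below. The column layout (text, number, number, text, number, number) is purely hypothetical, since the real layout isn't given in the question, and the demo file and `chunkRows` value are only there to make the snippet runnable. It also assumes the numeric columns contain only numbers or empty fields; textscan stops at a string it cannot convert unless it is listed in 'TreatAsEmpty'.

```matlab
% Demo file with a hypothetical layout: text, number, number, text, number, number.
fname = fullfile(tempdir, 'textscan_chunk_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, 'a\t1.5\t\tfoo\t3\t4\n');   % third field is missing
fprintf(fid, 'b\t2.5\t7\tbar\t\t5\n');   % fifth field is missing
fprintf(fid, 'c\t3.5\t8\tbaz\t6\t9\n');
fclose(fid);

fmt = '%*s%f%f%*s%f%f';   % skip the text columns, read the rest as double
chunkRows = 2;            % rows per pass; tiny here, try e.g. 1e6 on a real file
parts = {};               % one numeric block per pass

fid = fopen(fname, 'r');
while ~feof(fid)
    c = textscan(fid, fmt, chunkRows, 'Delimiter', '\t', ...
        'CollectOutput', true, 'EmptyValue', NaN);
    if isempty(c{1})
        break;            % nothing parsed: stop rather than loop forever
    end
    parts{end+1} = c{1};  %#ok<AGROW>
end
fclose(fid);

data = vertcat(parts{:}); % rows-by-4 double matrix, NaN where a field was missing
```

On the real file each chunk could be appended to disk or reduced before reading the next one, rather than kept in `parts`. If the final matrix is saved with `save`, a variable larger than 2 GB needs the '-v7.3' option.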
4 comments
Alexandru
on 31 Jul 2014
per isakson
on 31 Jul 2014
Skipping columns with %*s works regardless of which characters are in the column.
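A small illustration of that, using an inline string in place of file contents (the sample text is made up; the middle column holds arbitrary non-numeric characters):

```matlab
% Two tab-separated rows; the middle field is text that '%f' could not parse.
txt = sprintf('1\t#&!xy\t2\n3\t??\t4\n');

% '%*s' reads the middle field as a string and discards it.
c = textscan(txt, '%f%*s%f', 'Delimiter', '\t', 'CollectOutput', true);
nums = c{1};   % 2-by-2 double: [1 2; 3 4]
```

Because the skipped field is never stored, it costs no memory in the output.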
Alexandru
on 31 Jul 2014
per isakson
on 31 Jul 2014
Edited: per isakson
on 31 Jul 2014
Yes, if it is a reasonable number of different string constants and you know them beforehand.
>> cac = textscan( 'char; 23.1; 16', '%f', 'Delimiter', ';' ...
, 'treatAsEmpty', {'char'} );
>> cac{:}
ans =
NaN
23.1000
16.0000
The text "nan" in a numeric field is converted to NaN regardless of case, so it does not need to be listed in 'treatAsEmpty'.
