Is textscan ever better than readtable?
I have some legacy code that reads in data from a text file. The author seems to avoid readtable() in favour of using textscan() to get a cell array of strings and then converting the strings to the correct format afterwards. This seems like an awkward way of doing things, and it takes a long time for big files, so my questions are:
- Is there any obvious reason to do this? Is textscan somehow more flexible/robust than readtable?
- Is readtable optimised for reading data in a specified format? (i.e. faster than reading a string and converting)
3 comments
Adam
on 9 Mar 2018
Well, the obvious difference is that readtable returns a table, which may not always be what is wanted. I have never used readtable or worked with tables much myself, so I don't know whether they come with all the functionality you would want, or whether working with a raw cell array of inputs gives greater flexibility in some cases.
Tables were only introduced in R2013b though, so depending on how 'legacy' the code is, that may be the only reason readtable is not used.
Von Duesenberg
on 9 Mar 2018
textscan is probably more "low level"; but recall that readtable was introduced in version R2013b, so maybe your legacy code was written before that.
Daniel Murphy
on 9 Mar 2018
Accepted Answer
More Answers (1)
Walter Roberson
on 9 Mar 2018
1 vote
textscan() can be a bit more flexible in handling challenges such as using commas for decimal point, or odd quoting, or reading time-like things "raw" because spaces in time formats confuse both textscan and readtable.
Generally, textscan() has more control over skipping data, and more control over number of lines to be processed.
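As a rough illustration of that control, the sketch below skips header lines and limits how many rows are parsed. The sample text and variable names are made up for the example; 'HeaderLines' and the repeat count N are standard textscan arguments.

```matlab
% Hypothetical sample text: two header lines followed by four data rows.
txt = sprintf('%s\n', 'Station log', 'id,value', ...
    '1,10.5', '2,11.0', '3,12.25', '4,13.0');

% Skip the 2 header lines, then apply the format at most 3 times.
% '%*d' reads the id field but discards it; '%f' keeps the value.
C = textscan(txt, '%*d %f', 3, 'Delimiter', ',', 'HeaderLines', 2);
values = C{1};   % numeric column holding the first three values
```

The same arguments apply when the first input is a file identifier from fopen instead of a character vector.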
On the other hand with fairly recent updates adding the detectImportOptions facility, readtable can do some fixed-width reading that textscan struggles with.
7 comments
Jeremy Hughes
on 3 Aug 2021
"textscan() can be a bit more flexible in handling challenges such as using commas for decimal point,"
I find this to be the opposite.
readtable(filename,'DecimalSeparator',',')
textscan doesn't have this capability at all. It needs to be treated as a string first.
Walter Roberson
on 3 Aug 2021
readtable() permits a 'Format' parameter that is textscan compatible.
Walter Roberson
on 3 Aug 2021
DecimalSeparator was new as of R2018a, which had not been released at the time of my Answer.
textscan() is able to parse a character vector or string scalar, which readtable() cannot do. So in the R2018a timeframe,
textscan(regexprep(fileread(filename),{',', ';'}, {'.', ','}), FORMAT)
was workable code to do comma to decimal point conversion and then parse the text; at the time the only way to do the same with readtable() was to write the converted data to a file and readtable() the file (which some people did do.)
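A minimal, self-contained sketch of that in-memory conversion follows. The sample text stands in for fileread(filename), and the format string and 'Delimiter' setting are assumptions chosen to suit this made-up data.

```matlab
% Hypothetical text using ';' as field delimiter and ',' as decimal mark.
raw = sprintf('1,5;2,25\n3,75;4,0\n');

% regexprep applies the cell-array patterns one after another, in order:
% first decimal commas become points, then semicolons become commas.
fixed = regexprep(raw, {',', ';'}, {'.', ','});

% Parse the converted text directly; no temporary file is needed.
C = textscan(fixed, '%f %f', 'Delimiter', ',', 'CollectOutput', true);
data = C{1};   % 2-by-2 numeric matrix
```

The order of the two replacements matters: swapping them would turn the field delimiters into decimal points as well.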
@Walter Roberson: as far as I can tell, specifying the 'Format' option does not allow READTABLE et al to parse text (i.e. character vector or string variables). Can you please show how you can specify the 'Format' option so that READTABLE can parse a character vector? I cannot find the option anywhere:
txt = sprintf('%d %f\n',23,pi,5,sqrt(2));
out = textscan(txt,'%f%f','CollectOutput',true);
out{:}
Jeremy Hughes
on 3 Aug 2021
@Walter Roberson - Wow, I failed to look at the date. Not even sure why this popped up on my radar today.
Ahh, I see. It's not textscan doing it, but I see how textscan enables that in a succinct line of code. I tend to look at "modify the original data" as a last resort, whether it's in a file or a char array. Having to read the whole file into memory can be an issue.
Walter Roberson
on 3 Aug 2021
In cases where the file fits in memory, my experience is that reading as character and transforming the characters can be very effective -- relatively easy to code, and sometimes big performance gains compared to parsing line-by-line. This is especially true for semi-structured files, such as files that have repeated blocks of headers and data, or files that have fixed text with embedded numbers.
For example, readtable() and textscan() are not very good at reading a file that looks like
The temperature in Winnipeg at 15:17 was 93 degrees.
The temperature in Thunder Bay at 15:18 was 88 degrees.
The temperature in Newcastle On The Tyne at 15:18 was -3.8 degrees.
but reading the file as text and using regexp with the 'names' option can work really well.
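One way this could look for the three sample lines is sketched below. The pattern and the field names city, time and temp are assumptions made for illustration; in practice the text would come from fileread rather than sprintf.

```matlab
% The three sample lines, standing in for the result of fileread(filename).
txt = sprintf('%s\n', ...
    'The temperature in Winnipeg at 15:17 was 93 degrees.', ...
    'The temperature in Thunder Bay at 15:18 was 88 degrees.', ...
    'The temperature in Newcastle On The Tyne at 15:18 was -3.8 degrees.');

% Named tokens capture the city, the time and the (possibly negative,
% possibly fractional) temperature. [^\n]+? keeps each match on one line.
pat = ['The temperature in (?<city>[^\n]+?) at (?<time>\d{1,2}:\d{2})' ...
       ' was (?<temp>-?\d+(?:\.\d+)?) degrees'];

s = regexp(txt, pat, 'names');    % 1-by-3 struct array, one element per match
cities = {s.city};                % cell array of city names
temps  = str2double({s.temp});    % numeric row vector of temperatures
```

From here, struct2table(s) would give a table of the captured text if a table is the desired end result.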