Why are xlsread (and readtable) doing such a poor job at loading mixed data files?
Sjouke Rinsma
on 18 Jul 2022
Edited: Sjouke Rinsma
on 19 Jul 2022
This issue pops up regularly when loading mixed data files. For example, when importing data using [data, text, raw] = xlsread('file.xlsx'), the interpretation of the data appears strangely random. Although 'data' and 'text' contain solely numeric and text data respectively, 'text' often still contains numbers stored as text. Sometimes this is because a column contains both numeric and text data, but it also happens regularly with columns that contain only numeric data (albeit with a header in the first row). In some cases xlsread even returns a purely numeric column as text, with one or more cells left as numeric. The 'raw' output shows the same split between numeric and text cells. Similar (though not identical) results are obtained with readtable, by the way.
As an example, see the attached data, which contains a section of a full table (the 'raw' output from xlsread). Apart from the header, columns 3 to 6, 8 and 9 contain only numeric data (I verified this in the original file), while columns 59 to 62 contain mixed data. Somehow only row 3 of the six columns mentioned first is interpreted fully as numeric. More interestingly, column 4, row 7537 is interpreted as numeric whereas the rest of the column is text. Later columns (59 to 62) containing mixed data are again loaded as numeric data.
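To locate cells like the one in column 4, row 7537, it can help to tabulate the class of every cell in 'raw'. The following is a minimal sketch on a small synthetic cell array (the data and variable names are made up for illustration, they are not the attached file):

```matlab
% Synthetic stand-in for the 'raw' output of xlsread (hypothetical data).
raw = {'ID', 'Value'; ...
       1,    '3.14'; ...
       2,    2.71;   ...
       3,    '1.41'};

body   = raw(2:end, :);                 % drop the header row
isNum  = cellfun(@isnumeric, body);     % true where a cell holds a number
isText = cellfun(@ischar, body);        % true where a cell holds a char vector

% A column is "mixed" when it holds at least one numeric and one text cell.
mixed = any(isNum, 1) & any(isText, 1);

for c = find(mixed)
    rows = find(isNum(:, c)).' + 1;     % +1 restores row numbers of the full sheet
    fprintf('column %d: numeric cells in rows %s, text elsewhere\n', c, mat2str(rows));
end
```

On a real import, replacing the synthetic array with the third output of xlsread should give a quick overview of which columns were split between types.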
Now I'm sure there are reasons why this happens, and even alternatives for loading the data... but why return an ambiguous result like this in the first place? Either do it correctly or return everything as text so I can interpret the data myself, because this only adds processing time: first for the function interpreting the data, and then for having to re-interpret the data. Or is there something I'm missing here?
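For the "return everything as text" behavior, one possible route is to override the detected types with import options. This is only a sketch, assuming R2019a or later (writecell is used here just to create a small demo file; the file name and contents are hypothetical):

```matlab
% Create a small demo spreadsheet so the example is self-contained.
file = fullfile(tempdir, 'mixed_demo.xlsx');
writecell({'id', 'value'; 1, 'n/a'; 2, 3.5}, file);

opts = detectImportOptions(file);    % what MATLAB would guess for this file
opts = setvartype(opts, 'char');     % override the guess: import every column as text
T = readtable(file, opts);

% Convert only the columns you decide are numeric; unparseable text becomes NaN.
values = str2double(T.(2));

delete(file);
```

Note that str2double is itself fairly permissive (it accepts a comma as a thousands separator, for example), so a stricter check may still be needed before the conversion.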
3 comments
dpb
on 18 Jul 2022
It's only trying to do what is asked -- import the Excel sheet. If it's malformed/mistyped, that's not MATLAB's fault; it can only presume that's what was intended.
Accepted Answer
Jan
on 18 Jul 2022
Edited: Jan
on 19 Jul 2022
The automatic detection of string types is known to be a source of severe bugs. The genes "Dec1", "SEPT1" and "MARCH1" have been renamed because Excel converted them to dates automagically. 24 other genes are affected as well, but a complete list of names has not been found yet, because opening such data converts the strings according to the local language. The problem is severe: a survey showed that 20% of 3600 publications used Excel tables containing the failing names (see: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7 ).
I do not trust Matlab's automatic detection of data types for scientific projects. Any kind of intelligent guessing can and will fail for some specific input. Using Excel tables as input is a marker of bad practice. I've seen too many projects fail for such reasons. Blaming xlsread and readtable misses the point. A trustworthy study design must exclude such sources of error, because they can be extremely hard to detect.
4 comments
Jan
on 19 Jul 2022
@Sjouke Rinsma: "a clear distinction between fully numeric and not can be made without ambiguity" - This sounds like a trustworthy statement. But some details are hard to control:
- Dots or commas appear as decimal separators
- Unicode characters are used for the d or e in scientific notation
- Somewhere a hyphen appears instead of a minus
- Excel converts a NaN to "-1.#IND", "#SNAN" or "1.#QNAN"
- ... This list cannot be complete: according to Gödel there is always an input that makes the interpreter explode.
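One way to guard against such cases is to accept only an explicitly defined subset of number spellings and reject everything else before converting. The sketch below is an illustration under my own assumptions (the pattern and the test tokens are made up, not taken from any project mentioned in this thread):

```matlab
% Accepted subset (an assumption for this sketch): optional ASCII sign, ASCII digits
% with an optional '.' fraction, optional exponent introduced by ASCII 'e' or 'E'.
pattern = '^[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?$';

% char(8722) is the Unicode minus sign, which looks like '-' but is not ASCII.
tokens = {'12', '-3.5', '1e-3', '1,5', '1.234,5', [char(8722) '2'], ...
          '1d3', '-1.#IND', 'NaN', ''};

values = nan(size(tokens));
ok     = false(size(tokens));
for k = 1:numel(tokens)
    % Validate first; convert only tokens that match the accepted subset.
    ok(k) = ~isempty(regexp(tokens{k}, pattern, 'once'));
    if ok(k)
        values(k) = str2double(tokens{k});
    end
end

% Rejected tokens should be reported to the user instead of being guessed at.
rejected = tokens(~ok);
```

Here only the first three tokens pass. Rejecting 'NaN' and the 'd' exponent is a deliberate choice of this sketch: missing values would need their own explicitly documented marker.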
For a long-term scientific project for clinical decision making and international collaborations, we decided to use ASCII files with a well-defined subset of number representations. This is extremely simple and several stages below JSON and XML, but bulletproof. We deliver Matlab and C code together with a complete definition of the format.
At an earlier stage of the project we used an MS Office Access database and collided with the 2 GB limit of the data file. The export suffered from several ambiguities, and therefore we switched to the stone-age ASCII files. Well, it works reliably.
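The actual format of that project is not described here, so the following is only a hypothetical illustration of the idea: write one number per line with 17 significant digits, which is enough to identify an IEEE double uniquely, and read it back with a fixed format.

```matlab
x = [0.1; -2.5e-7; 1/3; 12345678.9];   % sample data (made up)
file = [tempname '.txt'];              % temporary ASCII file

fid = fopen(file, 'w');
fprintf(fid, '%.17g\n', x);            % one value per line, 17 significant digits
fclose(fid);

fid = fopen(file, 'r');
y = fscanf(fid, '%f');                 % read all values back as a column vector
fclose(fid);
delete(file);

% Check the round trip explicitly instead of trusting it.
roundTripOk = isequal(x, y);
```

An explicit check like roundTripOk is cheap and makes any loss of precision visible immediately.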