Why are xlsread (and readtable) doing such a poor job at loading mixed data files?

Question

Sjouke Rinsma el 18 de Jul. de 2022

0
Enlazar

Enlace directo a esta pregunta

https://la.mathworks.com/matlabcentral/answers/1762625-why-are-xlsread-and-readtable-doing-such-a-poor-job-at-loading-mixed-data-files

Editada: Sjouke Rinsma el 19 de Jul. de 2022

So, this issue has popped up regularly when loading from mixed data files. For example, when importing data using [data, text, raw] = xlsread('file.xlsx'), interpretation of the data appears strangely random. Though the 'data' and 'text' contain solely numeric and text data respectively, oftentimes the 'text' data still contains numeric data stored as text; on some occasions because a column contains both numeric and text data, but regularly also with columns that contain only numeric data (albeit with a header on the first row). In some cases xlsread even returns columns with only numeric data as text but with one or more cells as numeric. The 'raw' data in that regard shows the same distinction between numeric and text data. Similar (though different) results are obtained when using readtable by the way.

As an example, see attached data which contains a section of a full table ('raw' data from xlsread). Apart from the header, columns 3 to 6, 8 and 9 contain only numeric data (verified this in original file), while columns 59 to 62 contain mixed data. Somehow only row 3 of the mentioned first 6 colums is interpretated fully as numeric. More interestingly, column 4, row 7537 is interpreted as numeric whereas the rest of the column is text. Later columns (59 to 62) containing mixed data are again loaded as numeric data.

Now I'm sure there are reasons why this happens, and even alternatives for loading the data... but why return an ambiguous result like this in the first place? Either do it correctly or return everything as text so I can interpret the data myself, because this only adds additional processing time; first from the function interpretating the data and next from having to re-intepret the data. Or is there someting I'm missing here?

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Sjouke Rinsma el 18 de Jul. de 2022

I was going to comment that this would be an alternative, though setting each colunm to the correct type would require prior investigation of every file being loaded if I were to set all columns to the correct type,(since using this approach rows 4, 5, 6, 8 and 9 are still returned as 'TextVariableImportOptions'). Because I use these functions in a program for other users to import their data, this unfortunately is not really an option.

However, setting all columns to 'char' I can just do the interpretation myself :) so thanks for the suggestion, I'll see if I can speed up the process this way.

Also, upon further investigation it appears the interpretation Matlab does is simply related to how Excel interprets each cell. Even though all cells in the example were formatted as 'General', some numeric cells still appear as numbers, while others appear as text with a warning that 'the cell is formatted as text'.

dpb el 18 de Jul. de 2022

It's only trying to do what is asked -- import the Excel sheet. If it's malformed/mistyped, that's not MATLAB's fault; it can only presume that's what was intended.

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Jan el 18 de Jul. de 2022

1
Enlazar

Enlace directo a esta respuesta

https://la.mathworks.com/matlabcentral/answers/1762625-why-are-xlsread-and-readtable-doing-such-a-poor-job-at-loading-mixed-data-files#answer_1010100

Editada: Jan el 19 de Jul. de 2022

The automatic detection of string types is known to be a source of severe bugs. The genes "Dec1", "SEPT1" and "MARCH1" have been renamed, because Excel converted them to dates automagically. 24 other genes are affected also, but a complete list of names is not found yet, because opening such data will convert the string according to the local language. The probleme is severe, because a survey showed, that 20% of 3600 publications used Excel tables containing the failing names (See: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7 ).

I do not trust Matlab's automatic detection of data types for scientific projects. Any kind of intelligent guessing can and will fail for some specific input. Using Excel tables as input is a marker for using a bad practice. I've seen too many projects fail for such reasons. Blaming xlsread and readtable does not hit the point. A trustworthy design of a study must exclude such sources of errors, because they can be extremely hart to detect.

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

Jan el 19 de Jul. de 2022

@Sjouke Rinsma: "a clear distinction between fully numeric and not can be made without ambiguity" - This sound like a trustworthy statement. But some details are hard to control:

dots or commas appear as decimal points
Unicode characters for the d or e in scientific notation
Somewhere a hypen appears instead of a minus
Let Excel convert a NaN to "-1.#IND", "#SNAN" or "1.#QNAN"
... This list cannot be complete: according to Goedel there is always an input, which let the interpreter explode.

For a long term scientific project for clinical decision making and international collaborations we decided for using ASCII files with a well defined subset of representations of numbers. This is extremely simple and several stages below JSON and XML, but bullet proof. We deliver Matlab and C code together with a complete definition of the format.

In an earlier stage of the project we used an MS Office Access data base and collided with the 2 GB limit of the data file. The export suffered from several ambiguities and therefore we switched to the stone age ASCII files. Well, it works reliably.

Sjouke Rinsma el 19 de Jul. de 2022

Editada: Sjouke Rinsma el 19 de Jul. de 2022

Good point Jan, those exceptions in numeric inputs you raise will eventually, according Murphy, result in problems too (surely Excel auto-formatting has annoyed me in that regard on several occasions as well). Perhaps I should have phrased my statement as "inputs containing only digits can be distinguished ambiguously". In any case, I guess in the end I'm just not really charmed by the fact that these Matlab functions apparently import data according the format applied by another program, rather than do what I was expecting based on the function description. However, I think my question has been answered, so thanks for that :)

Iniciar sesión para comentar.

Why are xlsread (and readtable) doing such a poor job at loading mixed data files?

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Respuesta aceptada

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

Why are xlsread (and readtable) doing such a poor job at loading mixed data files?

3 comentarios Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Respuesta aceptada

4 comentarios Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos