textscan failing to read data in text file
UniqueWorldline
on 15 Oct 2017
Commented: Cedric
on 21 Oct 2017
I have a text file with a fileID called fidRawData that contains rows that look like this:
A BCD 99.9 9.90 9.999 99.9 0.999 0.99 9.999 9.999 99.99 99.9 9.9
A can be one of two characters ('A' or 'B'), or it can be empty (a space is inserted in its place, leaving white space at the beginning of the row). The status of this first character can vary by row. BCD is a three-letter code that can vary depending on the row. I want to treat the subsequent columns of numbers as generally as possible, but none of them will ever get large. They should all be between -9999 and 9999.
Sometimes an error occurs and '---' is inserted in place of some of the numbers in a given row, like this:
A BCD 99.9 9.90 9.999 --- --- 0.99 9.999 9.999 99.99 99.9 9.9
The only thing I can really be sure of is that there will always be at least one space between the columns; there may be more than one. The numbers can vary in sign, in where the decimal point is, and in how large or small they are.
I need to use either textscan or fscanf (I would prefer to use textscan for its greater flexibility) to store all the data in each of these columns (including the textual information in the first two columns) in whatever data type will accept such a diverse range of simpler data types and allow me to easily retrieve the data.
Whenever an 'A' is omitted and a ' ' is put in its place, I am OK with an 'N' or other character taking its place if need be, but if there is an 'A' or a 'B', I want that stored as 'A' or 'B' respectively.
When a '---' shows up, I want to replace it with NaN, an empty location in the data structure, or some other indication that no data is available.
I tried the following command on a single row where there was an 'A' at the beginning of the row and no '---' in the row:
rawData = textscan(fidRawData, '%s %s %f %f %f %f %f %f %f %f %f %f')
This command worked as expected. It returned a 1x14 cell array where all the values in the text file were stored as I wanted in rawData.
But there are plenty of rows without an 'A' or 'B', and rows where '---' is present at least once. To try to address these variations, I tried the following on a row where both conditions are true:
rawData = textscan(fidRawData, '%s %s %f %f %f %f %f %f %f %f %f %f %f %f', 'Delimiter', ' ', 'EmptyValue', 0)
This test results in a 1x14 cell array that is completely empty. Each cell is either a 1x1 cell containing a 0x0 char array or a 0x1 double.
rawData = textscan(fidRawData, '%s %s %f %f %f %f %f %f %f %f %f %f')
worked up until it hit the '---' in the row, then began returning 0x1 double cells for the remaining columns of rawData.
What can I do to get textscan to deal with these possibilities?
Accepted Answer
Cedric
on 15 Oct 2017
Edited: Cedric
on 15 Oct 2017
Here is one way. We pre-process the content before parsing, adding 'N' where the first letter is missing. Then we count the number of columns, split the content on white spaces, and reshape the output according to the number of columns. Finally we extract the header (or those first two char columns) and convert the rest to double.
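% Read the whole file into one char vector.
content = fileread( 'data.txt' ) ;
% Insert 'N' where a line starts with white space (missing first letter).
content = regexprep( content, '^\s', 'N ', 'lineanchors' ) ;
% Count the columns on the first line.
nCols = numel( strsplit( regexp( content, '[^\r\n]+', 'match', 'once' ), ' ')) ;
% Split on white space and reshape to one row per line.
data = reshape( regexp(content, '\s+', 'split'), nCols, [] ).' ;
% Separate the two text columns from the numeric ones; '---' becomes NaN.
header = data(:,1:2) ;
data = str2double( data(:,3:end) ) ;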
Applied to the file attached, we get:
>> header
header =
5×2 cell array
{'A'} {'BCD'}
{'B'} {'BCD'}
{'N'} {'BCD'}
{'B'} {'BCD'}
{'N'} {'BCD'}
>> data
data =
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 99.9000 0.9990 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
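For anyone who wants to try these steps without a file, here is a minimal sketch that runs the same pipeline on an in-memory string. The two sample rows, the variable names firstLine, tokens and values, and the strtrim call are assumptions added for illustration, not part of the answer above. The strtrim call guards against a trailing newline at the end of the content, which would otherwise produce an empty last token and make reshape fail.

```matlab
% Hypothetical sample: two rows, the second with a blank first column.
content = sprintf('A BCD 1.5 --- 2.5\n  XYZ 3.5 4.5 ---');

% Put 'N' where the first letter is missing (line starts with white space).
content = regexprep(content, '^\s', 'N ', 'lineanchors');

% Assumption: drop trailing white space so the split has no empty last token.
% This is done after the 'N' insertion so a leading blank is not lost.
content = strtrim(content);

% Count columns on the first line, then split everything on white space.
firstLine = regexp(content, '[^\r\n]+', 'match', 'once');
nCols = numel(strsplit(firstLine, ' '));
tokens = regexp(content, '\s+', 'split');
data = reshape(tokens, nCols, []).';

% Text columns stay as a cell array; str2double maps '---' to NaN.
header = data(:, 1:2);
values = str2double(data(:, 3:end));
```

Because str2double returns NaN for anything it cannot convert, the '---' markers need no special handling in this approach.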
More Answers (1)
Walter Roberson
on 17 Oct 2017
In the case where you already know the number of numeric columns (perhaps having parsed the file the way Cedric shows), then there is a trick you can use:
S = 'A BCD 99.9 9.90 9.999 99.9 0.999 0.99 9.999 9.999 99.99 99.9 9.9'; %sample input
S1 = '  BCD 99.9 9.90 9.999 99.9 --- --- 9.999 9.999 --- 99.9 9.9'; %another sample input. Leading space is important
NumNumeric = 11;
SP = '%*[ ]';
fmt = ['%c', SP, '%s', repmat([SP '%f'], 1, NumNumeric)];
textscan(S, fmt, 'treatasempty', '---', 'whitespace','')
textscan(S1, fmt, 'treatasempty', '---', 'whitespace','')
These give
ans =
1×13 cell array
{'A'} {'BCD'} {[99.9]} {[9.9]} {[9.999]} {[99.9]} {[NaN]} {[NaN]} {[9.999]} {[9.999]} {[NaN]} {[99.9]} {[9.9]}
ans =
1×13 cell array
{' '} {'BCD'} {[99.9]} {[9.9]} {[9.999]} {[99.9]} {[NaN]} {[NaN]} {[9.999]} {[9.999]} {[NaN]} {[99.9]} {[9.9]}
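To see what the format string looks like once it is assembled, it may help to build it for a small column count. This is only an illustrative sketch; the value NumNumeric = 2 is an assumption chosen to keep the result short.

```matlab
NumNumeric = 2;     % small count, chosen only to keep the result readable
SP = '%*[ ]';       % read a run of spaces and discard it
fmt = ['%c', SP, '%s', repmat([SP '%f'], 1, NumNumeric)];
disp(fmt)
```

The result is %c%*[ ]%s%*[ ]%f%*[ ]%f: one character, then a skipped run of spaces before each following field. As far as I understand it, setting 'whitespace' to '' is what lets %c pick up a leading blank as data, while the %*[ ] pieces consume the separators explicitly.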
This approach does not require pre-processing to replace a missing leading character.
I show here scanning from a string; you can fopen() the file and pass the file identifier where I show the string.
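A cautious way to apply this to a file is to read it line by line with fgetl and pass each line to textscan as a string, exactly as in the samples above. The sketch below is an assumption-laden illustration, not a tested recipe for the real data: the temporary file, its two rows, and the names fileName, rows, flags, codes and values are all hypothetical, and it assumes every row parses completely so that each numeric cell holds exactly one value.

```matlab
% Write two sample rows to a temporary file (stand-in for the real data file).
fileName = [tempname, '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%s\n', 'A BCD 99.9 9.90 9.999 99.9 0.999 0.99 9.999 9.999 99.99 99.9 9.9');
fprintf(fid, '%s\n', '  BCD 99.9 9.90 9.999 99.9 --- --- 9.999 9.999 --- 99.9 9.9');
fclose(fid);

% Same format string as in the answer above.
NumNumeric = 11;
SP = '%*[ ]';
fmt = ['%c', SP, '%s', repmat([SP '%f'], 1, NumNumeric)];

% Parse one line at a time; each textscan call returns a 1x13 cell row.
rows = cell(0, 2 + NumNumeric);
fid = fopen(fileName, 'r');
textLine = fgetl(fid);
while ischar(textLine)                  % fgetl returns -1 at end of file
    if ~isempty(strtrim(textLine))      % skip blank lines
        rows(end+1, :) = textscan(textLine, fmt, ...
            'TreatAsEmpty', '---', 'Whitespace', '');
    end
    textLine = fgetl(fid);
end
fclose(fid);
delete(fileName);

% Collect the columns into convenient types.
flags  = [rows{:, 1}].';            % char column: 'A', 'B' or ' '
codes  = vertcat(rows{:, 2});       % cell column of three-letter codes
values = cell2mat(rows(:, 3:end));  % numeric matrix, NaN where '---' was
```

Reading line by line is slower than a single textscan call on the file identifier, but a malformed row then only affects itself instead of shifting every row after it.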