Conditional textscan - How to select certain lines from a file

Hi there, I would like to read information from a file into an array for later use. Only certain rows of that file are supposed to be read in, namely rows for which the second column starts with 'S1' and is followed by two random digits. I'm having trouble with this conditional textscan. Here is the code for reading in the file (note that it starts with 13 lines that are not in column format, hence the "headline" codes at the beginning). I basically want the variables Position, Length, Channel etc. to be read in only for lines that meet the regex condition.
dataFileName=strcat('EEG_Anne_',int2str(pNumber),'.vmrk');
fid = fopen(dataFileName);
headline1=fgets(fid);
headline2=fgets(fid);
headline3=fgets(fid);
headline4=fgets(fid);
headline5=fgets(fid);
headline6=fgets(fid);
headline7=fgets(fid);
headline8=fgets(fid);
headline9=fgets(fid);
headline10=fgets(fid);
headline11=fgets(fid);
headline12=fgets(fid);
headline13=fgets(fid);
C = textscan(fid, '%s%s%d%d%d','Delimiter',',');
Stimulus=C{2};
if regexp(Stimulus{i},'S1\d*'),
Type=C{1};
Position=C{3};
Length=C{4};
Channel=C{5};
end
fclose(fid);

Accepted Answer

Stephen23 on 13 Oct 2015
Edited: Stephen23 on 13 Oct 2015
Usually the fastest and easiest way to select from a dataset is to read the complete file into MATLAB and then make the selection inside of MATLAB:
N = 117;
fileName = sprintf('EEG_Anne_%d.vmrk',N);
fid = fopen(fileName);
hdrRows = 13;
hdrData = textscan(fid,'%s',hdrRows, 'Delimiter','\n');
matData = textscan(fid,'%s%s%s%d%d%d', 'Delimiter',{',','='}, 'CollectOutput',true);
fclose(fid);
X = ~cellfun('isempty',regexp(matData{1}(:,3),'^S1\d\d$','once'));
To read the header data into a cell array I also replaced the very awkward 13 calls to fgets with one simple call to textscan. To test this code I used the file that you gave in your other answer (attached here also). The test detects these rows:
>> matData{2}(X,:)
ans =
13127 1 0
17828 1 0
22387 1 0
27429 1 0
31951 1 0
36610 1 0
51258 1 0
56417 1 0
61951 1 0
.... etc
which corresponds exactly to the rows with 'S1xx' in the second column.
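For anyone who wants to try the read-then-select approach without the original file, here is a minimal sketch. It assumes marker lines of the form name=type,code,position,length,channel; the four sample lines and the variable names (sampleLines, txt, isS1, selected) are made up for illustration and are not taken from the real .vmrk file. It relies on textscan accepting a character vector in place of a file identifier.

```matlab
% Made-up marker lines standing in for the data part of the file
sampleLines = {'Mk1=Stimulus,S112,13127,1,0', ...
               'Mk2=Response,R5,14000,1,0', ...
               'Mk3=Stimulus,S205,15500,1,0', ...
               'Mk4=Stimulus,S134,17828,1,0'};
txt = strjoin(sampleLines, '\n');   % one character vector, one line per marker

% Same format and options as in the answer, applied to text instead of a file
C = textscan(txt, '%s%s%s%d%d%d', 'Delimiter',{',','='}, 'CollectOutput',true);

% Third text column holds the code; keep rows that are exactly S1 plus two digits
isS1 = ~cellfun('isempty', regexp(C{1}(:,3), '^S1\d\d$', 'once'));

% Numeric columns (position, length, channel) of the matching rows
selected = C{2}(isS1,:);
```

The anchors ^ and $ make the pattern match the whole field, so a code such as S1123 or XS112 would be rejected, whereas the unanchored 'S1\d*' from the question would accept both.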
Bonus: if you want to practice using regular expressions (i.e. regexp), then you can try my FEX submission:
This tool lets you interactively write and change a regular expression, and updates the outputs as you type, so you can see what effect those changes have on the string parsing. It is a great way to practice using regular expressions, or to adapt a regular expression to your particular requirements.

7 Comments

Thanks for your help! I just redirected that question to my earlier one where I posted the entire code of my script. Sorry, I should have done that from the start. The code you gave me up here makes sense, but I can't seem to adapt it to my particular situation - I need the separate columns (Position, Channel etc.) for later use and they are not defined in the new code, or are they?
Stephen23 on 13 Oct 2015
Edited: Stephen23 on 13 Oct 2015
You can access the numeric data easily using indexing:
Position = matData{2}(X,1);
LengthV = matData{2}(X,2);
Channel = matData{2}(X,3);
And you can get the string data in the same way:
A = matData{1}(X,1)
B = matData{1}(X,2)
C = matData{1}(X,3)
Although actually I would not recommend doing this: rather than cluttering up your workspace and memory with copies of the same data, just keep one copy (the original cell array from textscan) and use indexing to access the parts of it only when you need to. It is considered good programming practice to keep your data together as much as is reasonable, without copying and spreading it about.
Also using the name Length is not such a good idea, as it differs only by case from the inbuilt length function.
Thanks for the tips! When I access the data (ie. Position) I now get the error: Index exceeds matrix dimensions.
I'm not sure where I assigned the dimensions of my matrix and was actually wondering what the numbers stand for in general (I'm relatively new to Matlab)
Position = matData{2}(X,1);
The number in the curly brackets refers to the column where the information can be found, correct? What does (X,1) refer to then?
regexp(matData{1}(:,2)
And here - it seems like the number in the round bracket refers to the column of matrix X - what does {1} stand for here then?
Stephen23 on 13 Oct 2015
Edited: Stephen23 on 13 Oct 2015
You need to get used to reading the documentation for every function that you use, because it answers some of your questions, and gives links to the answers for the rest. Your questions are about indexing and cell arrays (which are containers of other arrays), so let's have a look at these:
The function textscan returns a cell array. We know that it is a cell array because we read the documentation for textscan. The arrays inside a cell array can be accessed by using curly braces {} and some indexing. So this code:
matData{2}
obtains the numeric array inside the second cell of the cell array matData. I know that this is a numeric array because of the format string that I used in textscan: '%s%s%s%d%d%d'. The option CollectOutput that I used tells me the size of this numeric matrix: it must have three columns, because I used three consecutive identical numeric fields %d. The function textscan will automatically import all of the rows in the file that match this format string: we do not need to allocate the size of this output in advance.
From this numeric matrix we only want certain rows: the ones matching your requirement that the second column matches the string 'S1xx' for some digits x. These rows are identified with:
X = ~cellfun('isempty',regexp(matData{1}(:,3),'^S1\d\d$','once'));
which creates a logical index X. This logical index X is used to select only those rows from the numeric matrix using matrix indexing with parentheses ():
matData{2}(X,1);
where the 1 selects the first column. Review the link I gave for matrix indexing to know how this works.
That is how we select particular rows and columns from the numeric matrix, stored inside the cell array matData.
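The difference between the two kinds of brackets can be tried out on a tiny hand-made example. This is only a sketch: the values and the names strData, numData, inner and wrapped are invented, and the cell array is built by hand to mimic the 1x2 shape that textscan returns with CollectOutput.

```matlab
% Hand-built miniature of the textscan output (invented values)
strData = {'Mk1','Stimulus','S112'; ...
           'Mk2','Response','R5'; ...
           'Mk3','Stimulus','S134'};
numData = int32([13127 1 0; 14000 1 0; 17828 1 0]);
matData = {strData, numData};    % 1x2 cell array: text block, numeric block

inner   = matData{2};   % curly braces return the int32 matrix itself
wrapped = matData(2);   % parentheses return a 1x1 cell still wrapping the matrix

X = [true; false; true];         % logical index, one entry per row
Position = matData{2}(X,1);      % rows where X is true, first numeric column
codes    = matData{1}(X,3);      % same rows, third text column
```

So the number in curly braces picks which block inside the container, and the numbers in parentheses after it pick rows and columns inside that block.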
You might like to work through these tutorials, which are an excellent introduction to MATLAB:
Thanks a lot for taking the time to explain these things and for the links. I've been working through some of the tutorials and the links you gave me, as well as your answer. I still get an error message though and I'm still having issues fully grasping how the regex pattern extracts the information I want it to extract.
My first problem is in reading in MatData: I want a matrix with 5 columns (the first two of which contain strings and the last three of which contain numbers). The second column of this matrix is the important one (containing S1xx) although I will need to extract all of them later (column 1 reads stuff like Stimulus=MK4 etc, column 3 has response time, column 4 and 5 contain single digits). It seems to me that the code you give above only extracts the last three columns, the numerical ones (which is why I maybe get the "matrix size" error). Correct? Why do I need the "CollectOutput" in there? Could I do without?
matData = textscan(fid,'%s%s%d%d%d', 'Delimiter',{','}, 'CollectOutput',true);
The next problem lies within the regex expression used to extract the relevant lines. From what I now understand, the following code (just the regex snippet):
X = ~cellfun('isempty',regexp(matData{1}(:,3),'^S1\d\d$','once'));
accesses the first array inside the first cell of matData. Why do we access the first cell and not the second? The second is the one that contains the "S1xx" information, no? And then the code continues to access (;,3) --> and here I'm a bit lost. Does that mean it accesses all rows of column 3 of MatData?
So basically I think the numbers in the parentheses are not quite right at the moment for what I want to get. I'm just a bit confused as to which numbers I need to change in order to get it right.
Hope you can help. Thanks a lot in advance!
Ok, never mind, I finally understood it and my script works! Took a bit longer than it should have - sorry for being slow. Thanks again for your help!
Stephen23 on 16 Oct 2015
Edited: Stephen23 on 16 Oct 2015
My code certainly does not extract only the last three columns, it actually gives you all of the data in your file, even the headers!
Did you actually look at the variables that my code generates?
You should have a play with your workspace browser: there you can view a summary of every variable, and double-click them to open any variable in the workspace viewer, where you can view every variable and its elements (i.e. values). Double-click on matData and you will find both the character arrays (the first few columns) and numeric arrays (the last columns) inside it.
All of your data is there, I promise you, it just requires some exploration and an understanding of cell arrays containing other arrays.
You write that you "want a matrix with 5 columns (the first two of which contain strings and the last three of which contain numbers)" but it is not possible to store both character and numeric data in one array, although these arrays can be stored together in a cell array, which is what textscan does. A cell array is just a container of other arrays, and it does not matter what kind they are, but it adds an extra level of complexity to your code.
The character data occurs in the first columns, so it is the first array in the output cell array:
strData = matData{1}
while the numeric data are all of the remaining columns, so occurs second in the output cell array:
numData = matData{2}
This is why we access the first cell for the regexp (regular expression) call: because the first cell contains all of the character data. So "The second is the one that contains the "S1xx" information, no?" is incorrect: the string data is in the first cell. You don't need me to tell you this, have a look at the variables in your variable browser.
Your statement "And then the code continues to access (;,3).... Does that mean it accesses all rows of column 3 of MatData?" is incorrect, as should be becoming clear: matData is a cell array, it contains some other arrays. matData has size 1x2. It certainly does not have three columns. What you are interested in are the data arrays inside of matData: these are the character and numeric arrays that were specified in textscan, with as many columns as that format specification. So we can do this:
numData = matData{2}; % <- get the numeric array out of the cell array
numData(:,3) % <- get the third column of the numeric array.
Or we can do this in one go, this is equivalent:
matData{2}(:,3)
To answer your question: it is not strictly required to use the option CollectOutput, but either way the function textscan will output a cell array containing some character/numeric data, and by using this option you simply merge these character/numeric arrays together where possible. This often makes further processing much simpler, because accessing and processing data in lots of cells of a cell array is not as convenient as it sounds.
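To see what CollectOutput changes, the same text can be parsed with and without it. This is a sketch under assumptions: the two sample lines are invented, and sep and col are just illustrative names.

```matlab
% Two invented marker lines in the name=type,code,position,length,channel layout
txt = strjoin({'Mk1=Stimulus,S112,13127,1,0', ...
               'Mk2=Response,R5,14000,1,0'}, '\n');
fmt = '%s%s%s%d%d%d';

% Without CollectOutput: one cell per format field (1x6)
sep = textscan(txt, fmt, 'Delimiter',{',','='});

% With CollectOutput: consecutive fields of the same class are merged (1x2)
col = textscan(txt, fmt, 'Delimiter',{',','='}, 'CollectOutput',true);

% The same position data, reached in two different ways
posSeparate  = sep{4};
posCollected = col{2}(:,1);
```

Without the option, the position column is the fourth cell on its own; with it, the position is the first column of the merged numeric matrix in the second cell.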


More Answers (1)

Samy Alkhayat on 12 Nov 2018
Edited: Samy Alkhayat on 12 Nov 2018
Hello, I have a similar problem, where I want to concatenate 32 columns from 32 different, sequentially named files. The code works fine if the 32 files have arrays of the same size (2500 in some of the sets); however, other sets of 32 files have one file of size 2909 or more. Now I need the concatenation to be over the first 2500 rows only. Please help in editing the code below (error message is below as well):
clear all
size=2500;
for i=1 : 32
    filename=horzcat(pwd,'\Run4915_Inj',int2str(i),'.pre');
    delimiter = {'\t',','};
    startRow = 3;
    %% Format string for each line of text:
    formatSpec = '%f%f%f%*s%[^\n\r]';
    %% Open the text file.
    fileID = fopen(filename,'r');
    %% Read columns of data according to format string.
    dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'HeaderLines' ,startRow-1, 'ReturnOnError', false);
    %% Close the text file.
    fclose(fileID);
    %% Create output variable
    time(:,i)=dataArray{1:size, 1};
    P(:,i)=dataArray{1:size, 2};
    needle(:, i)=dataArray{1:size, 3};
    Pav=mean(P,2);
    nav=mean(needle,2);
    tav=mean(time,2);
    %% Clear temporary variables
    clearvars filename delimiter startRow formatSpec fileID dataArray ans;
end
I get this error as I run the code to the exceptional set: Unable to perform assignment because the size of the left side is 2500-by-1 and the size of the right side is 2909-by-1.
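One possible way to keep only the first 2500 rows from every file is to take each column out of the cell array with curly braces first and then truncate it with parentheses. The sketch below is not tested against the real .pre files: it assumes each cell of dataArray holds one numeric column, it replaces the file reading with random stand-in data, and the names nKeep, nRows and col are invented (nKeep is used instead of size so that the built-in size function is not shadowed).

```matlab
nKeep = 2500;                    % number of rows to keep from every file
nRows = [2500 2909 2500];        % invented row counts, one per pretend file
P = zeros(nKeep, numel(nRows));  % preallocate the result

for k = 1:numel(nRows)
    % Stand-in for the textscan output of file k: one numeric column per cell
    dataArray = {rand(nRows(k),1)};

    col = dataArray{1};          % curly braces: get the column out of the cell
    P(:,k) = col(1:nKeep);       % parentheses: keep only the first nKeep rows
end

Pav = mean(P,2);                 % row-wise average, computed once after the loop
```

If some file could be shorter than 2500 rows, min(nKeep, numel(col)) would be needed as the upper limit, together with a decision about how to pad the missing rows.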


Asked: 13 Oct 2015
Edited: 12 Nov 2018
