Pre-determining the number of lines in a text file
Is there any programmatic way to determine in advance the number of lines in a text file, for use with dlmread, textscan, etc.? I mean other than some brute-force approach like reading line by line in a while loop until EOF is hit.
6 comments
Guru
on 4 Jul 2013
I am guessing this is related to the question where you were trying to help someone read every nth line from a text file? The point of the solution I posted there is that there really is no reason to use textscan/dlmread, and it is simply much faster not to do so in that case.
Guru
on 4 Jul 2013
But more to your question: in general there isn't a way, other than incrementing a counter in a while loop while grabbing lines one at a time with fgetl.
Besides, the point of the higher-level file I/O routines is that you don't have to worry about how many lines there are.
Related to that: a text file is a sequence of ASCII characters. The number of lines depends on the number of newline characters, which is char(10) in the ASCII table. Without reading the characters or lines, you cannot tell how many times the newline character appears, unless the creator of the text file put that information in the header line.
Guru
on 4 Jul 2013
Honestly, that type of behavior is exactly what TEXTSCAN should be used for instead of DLMREAD. The documentation of TEXTSCAN shows nice examples of ignoring various columns, and by default it will read all rows of the file.
Matt J
on 5 Jul 2013
Chris Volpe
on 26 Apr 2019
I realize this has been dormant for 5 years, and the API/behavior may have changed since then, but dlmread does the trick for me. I have a .csv (comma-separated value) ASCII text file with 320 lines and 240 comma-separated ASCII floating-point numbers (including 'nan') on each line. I just do a plain-vanilla "M = dlmread(filename);" and get a 320x240 matrix in M.
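For the rectangular, purely numeric case Chris describes, the line count falls straight out of the matrix shape; a minimal sketch (the file name is a placeholder, and this assumes every line has the same number of numeric fields):

```matlab
% Sketch: for a purely numeric delimited file with no text headers,
% dlmread returns one matrix row per line, so size(M,1) is the line count.
% 'myfile.csv' is a placeholder, not a file from the original post.
M = dlmread('myfile.csv');
numLines = size(M, 1);
```

Of course this reads the entire file into memory, so it only answers the "how many lines" question as a side effect of doing the full import.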
Accepted Answer
More Answers (8)
Just out of boredom, here's a function:
function n = linecount(fid)
    n = 0;
    tline = fgetl(fid);
    while ischar(tline)
        tline = fgetl(fid);
        n = n + 1;
    end
Edited: thanks for the comment, Walter.
8 comments
Walter Roberson
on 4 Jul 2013
The above is off-by-one. An empty file would be counted as 1, a file with one line would count as 2, and so on.
Walter Roberson
on 5 Jul 2013
Matt, feof() never predicts end of file. The I/O routines do not know that an end of file is the next thing waiting: the I/O routines only know if you have already tried to read at the end-of-file. Because of this, your routine will count 1 for an empty file, and will consistently run 1 high. The arrangement Guru used is proper: test to see if you got end-of-file by looking at the result of an attempt to read.
Also keep in mind that in MATLAB, if a routine produces an output and you do not provide a variable (or expression) to receive it, the called routine usually allocates the memory for the value internally, and that memory is then thrown away upon return because it is not needed. This is not always the case, as routines can test varargout() and potentially not calculate values that are going to be discarded, but you should assume that the memory will be allocated and thrown away unless it is documented otherwise.
Walter Roberson
on 5 Jul 2013
Edited: Walter Roberson
on 10 Jan 2017
function n = linecount(fid)
    n = 0;
    fgetl(fid);
    while ~feof(fid)
        fgetl(fid);
        n = n + 1;
    end
This code does not explicitly store the lines, but as noted above you should assume that each line is being allocated in fgetl() and then the result thrown away.
[Note: later testing shows this can be wrong]
Walter Roberson
on 5 Jul 2013
In the text file you used to test with, does the last line end with the line terminator, or does it just end with no terminator?
Yes, feof() has overhead.
Matt J
on 5 Jul 2013
Walter Roberson
on 10 Jan 2017
Earlier I wrote that feof() never predicts end-of-file. That is true, but I was missing some information about the operation of fgetl and fgets that I just noticed today:
"After each read operation, fgetl and fgets check the next character in the file for the end-of-file marker. Therefore, these functions sometimes set the end-of-file indicator before they return a value of -1.
[...]
This behavior does not conform to the ANSI specifications for the related C language functions." (emphasis added)
Sigh.
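The documented quirk Walter quotes can be observed with a one-line file; a small sketch (the file name is a placeholder, and whether feof reports 1 at that point may depend on the release):

```matlab
% Create a one-line file that ends with a newline.
fid = fopen('demo.txt', 'w');
fprintf(fid, 'only line\n');
fclose(fid);

fid = fopen('demo.txt', 'r');
fgetl(fid);    % reads 'only line'
feof(fid)      % may already be 1 here: fgetl peeked ahead and saw EOF,
               % even though fgetl has not yet returned -1
fclose(fid);
```

This is why a loop condition of `~feof(fid)` can stop one iteration early, while testing the return value of fgetl with ischar does not.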
Informaton
on 29 Oct 2014
Another approach is to use the underlying operating system's functionality. Specifically, UNIX/Linux (and therefore also macOS) includes a command-line tool, 'wc -l [filename]', to get the line count of [filename].
To implement this in MATLAB you could do something like this:
if (~ispc)
    [status, cmdout] = system('wc -l filenameOfInterest.txt');
    if (status ~= 1)
        scanCell = textscan(cmdout, '%u %s');
        lineCount = scanCell{1};
    else
        fprintf(1, 'Failed to find line count of %s\n', 'filenameOfInterest.txt');
        lineCount = -1;
    end
else
    fprintf(1, 'Sorry, I don''t know what the equivalent is for a Windows system\n');
    lineCount = -1;
end
1 comment
Ian McInerney
on 30 Apr 2017
Edited: Ian McInerney
on 30 Apr 2017
There is actually an equivalent command for Windows-based systems using the command line. It is discussed at some length here: https://blogs.msdn.microsoft.com/oldnewthing/20110825-00/?p=9803/
The command to run in the command prompt is:
find /c /v "" filename.txt
which can then be used in the else branch of your if-check:
else
    % For Windows-based systems
    [status, cmdout] = system(['find /c /v "" ', filename]);
    if (status ~= 1)
        scanCell = textscan(cmdout, '%s %s %u');
        lineCount = scanCell{3};
        disp(['Found ', num2str(lineCount), ' lines in the file']);
    else
        disp('Unable to determine number of lines in the file');
    end
end
Walter Roberson
on 10 Jan 2017
function n = linecount(filename)
    [fid, msg] = fopen(filename);
    if fid < 0
        error('Failed to open file "%s" because "%s"', filename, msg);
    end
    n = 0;
    while true
        t = fgetl(fid);
        if ~ischar(t)
            break;
        else
            n = n + 1;
        end
    end
    fclose(fid);
fclose(fid);
I have tested this with files that end with newline and with files that do not end with newline.
6 comments
Jan
on 17 Jan 2017
I've written a C-mex function which counts DOS, Unix and old Mac line breaks. It opens the files in binary mode, because I do not want to struggle with line breaks followed by backspace characters or \Z as an end-of-file character. I admit that the handling of text mode is strange magic to me.
For counting the lines of my user-defined MATLAB files, the C-mex is 3 times faster than this M-version and gets the same results. Your M-function has the advantages of handling Unicode file names automatically and of not needing an external file to be compiled.
@Readers: For now I assume that Walter's suggestion satisfies all needs. If you have to count the lines in thousands of files and need to save time, feel free to contact me.
Peter Cook
on 5 Oct 2017
I have a need for just this function. For a large CSV file I just tested it on (2907x10295), linecount() takes only about half as long (10.4 s) as reading the file entirely with Stanislaw Adaszewski's swallow_csv() (21.2 s).
Jan
on 6 Oct 2017
@Peter: The mex function is faster in my tests, but I hesitate to publish it because I do not know how to handle these edge cases accurately:
[]: 0 lines
[a]: 1 line
[a\n]: 1 line
[a\n\n]: 2 lines
[a\nb]: 2 lines
[a\nb\n]: 2 lines
[a\r\n]: 1 line
[a\rb\n]: 2 lines
[a\n\r]: 2 lines?
[a\nb\r]: 2 lines
[\n]: 1 line
[\na]: 2 lines
[\na\n]: 2 lines
If all lines ended with \n, the solution would be easy. If you really have a need for this, I will take the time to fix the code.
Peter Cook
on 6 Oct 2017
Edited: Peter Cook
on 6 Oct 2017
I don't know that I've ever met an ASCII file that doesn't end its lines with either \n or \r\n. I can understand if you don't want to publish it on the File Exchange without exception handling; maybe you could email me the C file as-is and I can compile it on my machine?
EDIT: I found a much faster solution: read the entire file into memory and count the \n newline characters (ASCII code 10).
>> tic; fptr = fopen(fName); s = fread(fptr); numLine = sum(s==10); fclose(fptr); toc
Elapsed time is 2.996891 seconds.
>> tic; linecount(fName); toc
Elapsed time is 16.268462 seconds.
Jan
on 9 Oct 2017
@Peter: fread(fptr) reads the complete file and stores each byte in a double. Prefer fread(fptr, Inf, '*uint8'), which uses less memory.
Walter Roberson
on 9 Oct 2017
Just counting the \n can give an off-by-one error. You need to know if the final \n has any characters following it or not.
123\n456\n
has two lines.
123\n456\n7
has three lines.
123\n456\n7\n
has three lines.
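Walter's rule can be folded into the whole-file byte count; a sketch (untested here, with fName as a placeholder file name):

```matlab
% Count lines as: one per \n byte, plus one more if the file is
% non-empty and its last byte is not \n (final line lacks a terminator).
fid = fopen(fName, 'r');
s = fread(fid, Inf, '*uint8');   % whole file as bytes
fclose(fid);
n = sum(s == 10);                % newline bytes
if ~isempty(s) && s(end) ~= 10
    n = n + 1;
end
```

Checking against the examples above: "123\n456\n" gives n = 2, and both "123\n456\n7" and "123\n456\n7\n" give n = 3.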
Boris
on 10 Jan 2017
I came across this code a while ago which is reasonably fast and works well on large files:
fid = fopen(strFileName, 'rt');
chunksize = 1e6;  % read chunks of 1 MB at a time
numRows = 1;
while ~feof(fid)
    ch = fread(fid, chunksize, '*uchar');
    if isempty(ch)
        break
    end
    numRows = numRows + sum(ch == sprintf('\n'));
end
fclose(fid);
strFileName is the file name of the ASCII file, and numRows holds the total number of lines.
Now the only remaining problem is efficiently testing for blank lines before using csvread/dlmread to read (sizeable) chunks of the file (i.e., my code breaks if the csv file ends in a blank line, so it would be nice if I could test for and count the number of blank lines at the end of my files).
3 comments
Walter Roberson
on 10 Jan 2017
This code has an off-by-one error for files that end in linefeeds.
If you are using OS-X or Linux, try:
! (echo line 1; echo line 2; echo line 3) > file_ends_nl.txt
! (echo line 1; echo line 2; echo -n line 3) > file_ends_no_nl.txt
You can use !ls file*.txt to verify that they are different sizes, the one with no_nl being one byte shorter. You can use !od -cx on each file to verify that one ends in newline and the other does not.
The code will report numRows = 3 for the file with no newline at the end, but will report 4 for the file that ends in newline.
Boris
on 17 Jul 2017
Or use the code above and check whether the file ends in 0x0A:
if ch(end) == 10
    numRows = numRows - 1;
end
Richard Crozier
on 14 Aug 2019
This is a great answer, worked great for me on a 5GB text file of point cloud data.
Jan
on 10 Jan 2017
2 votes
Determining the number of lines requires reading the file and interpreting the line breaks. That is real work, and disk access is the bottleneck here. Therefore, importing the file to a cell string is not remarkably faster if you determine the number of lines first: even with the line count known, the main work would still be to "guess" a suitable buffer size for importing the lines. That requires either copying each line from the buffer to the MATLAB string, or reallocating the imported string and allocating a new input buffer for each line.
I find it disappointing that MATLAB does not have a simple tool to import a text file into a cell string. Even splitting a string (e.g. one imported by fileread) into a cell array while handling the DOS/Linux/Mac line breaks has needed tools like strread, dataread, textread, textscan and regexp('split'), which are not available in all MATLAB versions and are frequently outdated. Therefore I tried again to write an efficient C-mex for the File Exchange, but the results were grim: the best approaches were only a few percent faster than calling fread, replacing the different line breaks by char(10), and calling a "Str2Cell" C-mex. Neither counting the lines first nor smart prediction techniques for dynamically allocating per-line buffers accelerated the code sufficiently. The bottleneck of disk access dominates everything, even when the data are already in the cache; for real file access, when the data were not read and cached seconds before, all smart tricks are useless.
I think this is the reason why MATLAB and many other tools do not contain a function for determining the number of lines in a text file.
If I find the time, I will try to write a LineCount mex function, but I do not expect it to be much faster than Walter's MATLAB approach.
Ken Atwell
on 30 Oct 2014
If we can make two assumptions:
- ASCII #10 is a reliable end-of-line marker
- The entire file will fit into memory (that is, we're not talking about Big Data)
I would do the following (using the help for the plot command in this example):
txt=fileread(fullfile(matlabroot, 'toolbox', 'matlab', 'graph2d', 'plot.m'));
sum(txt==10)+1
This will be fast... certainly faster than the "fgetl" approach, but maybe not as fast as the "wc" approach Hyatt put forth above (assuming you can live without Windows platform support).
1 comment
Walter Roberson
on 10 Jan 2017
Files are not required to end with a line terminator, but they might. So a file with 3 lines might have either 2 linefeeds (separating line 1 from line 2, separating line 2 from line 3, nothing at end of file), or 3 linefeeds (one at the end of each line.) The above code would count 4 if this hypothetical file ended with linefeed (as is more common than not.)
Dr. Erol Kalkan, P.E.
on 19 May 2016
Edited: Matt J
on 19 May 2016
Here is a short and fast way. Say the file to be read is apk.txt:
fid = fopen('apk.txt','r');
Nrows = numel(textread('apk.txt','%1c%*[^\n]'));
1 comment
Walter Roberson
on 19 May 2016
textread is deprecated.
How is your routine going to treat empty lines? I think the result is going to depend on whether the file is CR/LF- or LF-delimited: in the CR/LF case the %1c is going to read the CR, leaving the LF to be matched by the %*[^\n]; but in the LF case, the %1c is going to read the LF, moving the next line into position to be matched by the %*[^\n].
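As a hedged aside for readers on newer releases: if I recall correctly, readlines (introduced around R2020b) sidesteps the deprecated textread entirely, though you should check how your release treats a trailing newline before trusting the count:

```matlab
% Sketch using readlines (newer MATLAB releases only).
% Depending on the release and the file's final newline, the returned
% string array may or may not include a trailing empty element, so
% verify the count against a known file before relying on it.
lines = readlines('apk.txt');
Nrows = numel(lines);
```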
John BG
on 10 Jan 2017
Hi Matt,
The importdata command returns a cell array of all the non-empty lines of a text file.
The number of elements of this cell array is equal to the number of lines of the text file.
file_name = 'filename.txt';
numel(importdata(file_name))
If you find these lines useful, would you please mark my answer as Accepted Answer?
To any other reader: if you find this answer of any help, please click the thumbs-up vote link.
Thanks in advance for your time and attention,
John BG
6 comments
Walter Roberson
on 10 Jan 2017
Edited: Walter Roberson
on 10 Jan 2017
Interesting approach, but it has challenges.
It does not count empty lines (including in the cases described below).
It works for purely numeric files that have a single column.
If there are multiple numeric columns, separated either by commas or whitespace, then you need to take the number of rows rather than the number of elements.
If there is text (other than the commas separating the columns), then importdata returns a struct. You have to add up the number of rows of the 'data' field and, if present, one row for the 'colheaders' field. If the number of rows of the 'textdata' field is the same as the number of rows of the 'data' field, then it is probably per-row text and you should not count it in addition to the 'data' rows; but if the number of rows of 'textdata' is anything different, then it represents text headers before the numeric data, and you need to count it (and not count colheaders in that case). It can happen that the number of rows of text headers just happens to equal the number of rows of data; you can detect that case because colheaders will be present when it otherwise would not be (I think that is what I observed).
Possibly I missed a few cases.
When it does work, it is 50 times slower than my recent code.
Walter Roberson
on 10 Jan 2017
>> timeit(@() size(importdata('somedata.csv'),1), 0)
ans =
0.0032202894005
>> timeit(@() size(importdata('somedata.csv'),1), 0)
ans =
0.0029899404005
>> timeit(@() size(importdata('somedata.csv'),1), 0)
ans =
0.0029341294005
>> timeit(@() linecount('somedata.csv'), 0)
ans =
0.0001702595455
>> timeit(@() linecount('somedata.csv'), 0)
ans =
0.0001823676885
>> timeit(@() linecount('somedata.csv'), 0)
ans =
0.0002160064885
Best of the runs of your approach: 0.0029341294005
Best of the runs of my code: 0.0001702595455
ratio: 17.2
Jan
on 10 Jan 2017
@John BG: Walter has posted a comment, and that does not prevent Matt J from deciding on his own; that is the idea of the comment section. Walter's points about the empty lines and the dependency on the contents of the file are important. The run time matters also, and of course interpreting the data must waste time when that is not needed.
For myself, I have concluded that Walter's intuitive speed estimates are based on many years of programming experience. When he claims something I do not expect, I ask myself how I can learn to see the same details Walter sees.
@Walter: Thanks for the speed test and for sharing your experiences.
Walter Roberson
on 10 Jan 2017
>> fname = 'somedata.txt';
>> r = zeros(1,100);for K = 1 : 100; r(K) = timeit(@() size(importdata(fname),1), 0); end
>> min(r)
ans =
0.0148051934005
>> mean(r)
ans =
0.0156547813305
>> r2 = zeros(1,100);for K = 1 : 100; r2(K) = timeit(@() linecount(fname), 0); end
>> min(r2)
ans =
0.000143144709071429
>> mean(r2)
ans =
0.00015156221620381
>> min(r)/min(r2)
ans =
103.428156699192
>> mean(r)/mean(r2)
ans =
103.289472288058
Walter Roberson
on 10 Jan 2017
People other than Matt J read this, so before they implement the importdata() approach they need to know about its limitations. It is a nice compact expression that works well (if perhaps less efficiently than it could) for a file containing a single column of purely numeric values; unfortunately it turns out to be fragile if that condition is not met.