File is all numeric, but csv read does not work fully
Hi, I'm trying to read in a bunch of data files one at a time as a matrix, use the find function to find a certain z location, and when I do, store that row of data in a new matrix. My problem is that no matter what I do, I get this error:
??? Error using ==> dlmread at 145
Mismatch between file and format string.
Trouble reading number from file (row 781666, field 4) ==>
Error in ==> csvread at 52
m=dlmread(filename, ',', r, c);
Error in ==> Velocity_AtPt_vs_Time at 67
datafile=csvread(fullname,1);
The data files are all identical in structure: 1 row of column headers and 14 columns of all-numeric data. I set it up so that csvread skips the first row and reads everything else. My files are approximately 1 million rows x 14 columns.
What's happening is that the code executes for 69 data files, doing exactly the steps I wish it to and filling the new matrix properly, then stops and gives me this error after the 69th. I have tried removing the 70th and 71st files to see what happens; it now stops at 67 files. Very odd. If anyone has suggestions, please let me know! Thanks for reading.
This is my loop that receives the error message:
for k = 1:numel(filenames)
    % Create full file name and partial filename
    fullname = [currentfolder filesep NEWFileNames(k).name];
    % Read in data
    datafile = csvread(fullname,1);
    [rr1,cc1] = find(datafile(:,z)==0.0075000000008515);
    firstrow1 = rr1(1,1);
    firscol1 = cc1(1,1);
    dataset1(k,:) = datafile(firstrow1,:);
end
Note: The rr1 and cc1 are just so that I may take the first instance this value shows up, but that is not where the error occurs in this code.
15 comments
dpb
6 Apr 2016
What if you read just the initially-failing file? Does it fail or not?
I was unaware that csvread would accept only R and not R,C as offset inputs but a test here shows it seemingly is ok with that.
Image Analyst
6 Apr 2016
Edited: Image Analyst
6 Apr 2016
Might not be a bad idea though to get rid of the unneeded 1:
datafile = csvread(fullname);
By the way, this is a bad idea:
[rr1,cc1] = find(datafile(:,z)==0.0075000000008515);
Why? See the FAQ: http://matlab.wikia.com/wiki/FAQ#Why_is_0.3_-_0.2_-_0.1_.28or_similar.29_not_equal_to_zero.3F
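To make the FAQ's point concrete, a minimal sketch of a tolerance-based comparison (the tolerance value here is illustrative, not from the thread, and `z` is assumed to be the column index from the original loop):

```matlab
% Compare against a tolerance instead of == (floating-point values
% rarely match exactly). 'tol' is a value you choose for your data.
tol    = 1e-9;
target = 0.0075000000008515;
rr1 = find(abs(datafile(:,z) - target) < tol, 1, 'first');
```

With this form, small representation errors in the stored values no longer cause the lookup to come back empty.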
And you forgot to attach the csv file, so we can't check row 781666, field 4. What is there? Please attach the file, with at least that many rows, or if a chunk of it, then tell us what row 781666 is now at.
Jenna P
7 Apr 2016
dpb
7 Apr 2016
If the full file was read unmodified by itself, I'm convinced it's a memory-related bug...I know it probably takes a long time to process all these, but if you start over is it consistent or seemingly random where it dies?
Image Analyst
7 Apr 2016
Can you put the line
memory
just before the csvread() line. Then after it crashes, paste what it says for memory back here.
A million lines is a lot of lines, especially if each row has lots of numbers on it, and if you're trying to store all of them in memory at the same time. It could run up into the gigabytes of RAM.
Jenna P
7 Apr 2016
I was wondering if you ran the script multiple times without changing anything does it fail in the same place every time?
Also, one thing you could try that might save just a tiny fraction of memory would be to rewrite the find operation a little...
[rr1,cc1] = find(datafile(:,z)==0.0075000000008515,1,'first');
dataset1(k,:)=datafile(rr1,:);
This will stop the lookup after the first match is found. Also, I note that z isn't defined in the code snippet; I presume it's simply a constant?
IA has mentioned the issue of floating-point comparison, but that would make the result empty, which would be a different error...
Also, since you're looking for a particular value within the file, and the file is quite large, it might be advantageous to use a little more sophistication in the process rather than simply trying to read the whole thing in one swell foop--instead use textscan to read sections and if the point were to happen to be fairly early in the file, you can truncate the read before having to process the whole thing.
Is there any way to guestimate how far into the file the location of interest is?
Jenna P
7 Apr 2016
textscan is quite different in one significant way: it operates on an open file handle and can be called repetitively to read the file sequentially, whereas csvread and the base routine it calls that does the actual work, dlmread, only operate by reading the entire file into memory at once.
It seems, as noted before, that if the file in question will in fact be read correctly on its own, there is not actually an error in the file itself but some memory interaction; that it seems to be exactly reproducible surprises me somewhat, though; I was expecting a somewhat random component to be involved.
But, as far as that goes, I'd suggest submitting the symptoms as a bug report to TMW; perhaps they would be willing to try to reproduce the issue. It's too much data to be able to provide via the Answers forum for any of us to attempt directly.
BUT, back to getting a workaround to get your project underway: look at the following doc page and consider how you might follow their guidelines there. If you could provide any algorithm by which you could estimate the location of the value in question from knowing something about how the coordinates are laid out, that would help. If you need help implementing this, attach a smaller subsection of a file and I'd be glad to help (as I'm sure many others would be as well).
Releases newer than mine also have a memory-mapping facility for text files, similar to memmapfile for stream files, that might be of help if you have a recent-enough release. I don't, so I have no real experience with it or how well it might handle your particular problem.
Any way, there's more than one way to skin the cat here... :)
ADDENDUM
If you know the location you're after is always well down in the file (I don't suppose you're so lucky that it's the identical location in every file, which would make things much simpler), try setting the initial row number to a large integer that is still less than the line being searched for. Then, while I'd expect speed to go way down, maybe you'll avoid the memory-related issue.
I'm wondering, however, if the problem isn't actually memory per se, but OS file handles or the like that aren't being released, and it's system resources that are the ultimate cause of the failure, with the error message only a red herring as to the real root cause. If that were the case, the above may also fail, but I think it's worth a try since it only requires changing one constant in the existing script/function.
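As a sketch of that row-offset idea (the starting row here is an illustrative guess, not from the thread; csvread takes zero-based row/column offsets):

```matlab
% Start reading partway down the file instead of at the top.
% R0 must be less than the (0-based) row you're searching for.
R0 = 500000;                        % illustrative; pick per your files
datafile = csvread(fullname, R0, 0);
```

Only the rows from R0 onward are loaded, so each call touches far less memory, at the cost of re-deriving where the row of interest now sits within the returned matrix.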
dpb
7 Apr 2016
"Taking files away from the 150,000 by 14 set allowed it to run through, no problem. Clearly a memory issue."
That seems to me to point even more strongly to the resources end of things rather than just memory, as that 150,000 is << the ~800k lines where the failing point was in the initial set that ran successfully through some 70 instances before the error. IOW, the amount of data you'd already processed in the case of this thread far, far exceeds the other case, unless I'm grossly misreading the description.
I'd missed the section above where you attached a section of a file; I'll see if I can get a sample working this evening if you've not come up with a solution ere then.
dpb
19 Apr 2016
Did you try the textscan solution? The first response in my earlier Answer should work simply substituting it (and the appropriate fopen|fclose pair of course) for csvread.
I'd surely suggest making that attempt before going to xlsread. If it's something peculiar about [csv|dlm]read causing the error (which I think is a resource issue), textscan is standalone, and if it also aborts, that's pretty indicative of something more fundamental.
Also, did you file a Service Request with TMW Support on the issue?
Answers (1)
OK, to separate this from the long-winded chain of comments... this isn't the full answer yet, but a "getting started" for the textscan solution.
>> fid=fopen('file70_part.csv'); % open the file
>> d=cell2mat(textscan(fid,'','headerlines',1,'delimiter',',','collectoutput',1));
>> whos d
Name Size Bytes Class Attributes
d 26x14 2912 double
>> fid=fclose(fid);
>>
The above is all that's needed to read the full file; I've done a couple of things to make note of--
- Used an empty string '' for the format string. This has the effect that MATLAB will determine the fields per record automagically and return the proper shape; otherwise you have to know the number per record and write a specific format string to match; and
- Used cell2mat around the textscan call to return the data as double array rather than the cell array otherwise returned. 'collectoutput' serves to make a single array, not 14.
What's not shown here is a counted number of records to read... That can be as simple as--after the fopen, of course:
>> fgetl(fid);      % get, throw away the header row
>> while ~feof(fid) % until we run out of data
     d=cell2mat(textscan(fid,'',5,'delimiter',',','collectoutput',1));
     d(:,14).'
   end
ans =
0.0079 0.0078 0.0076 0.0070 0.0071
ans =
0.0072 0.0074 0.0075 0.0088 0.0087
ans =
0.0085 0.0084 0.0083 0.0081 0.0080
ans =
0.0079 0.0078 0.0076 0.0075 0.0074
ans =
0.0071 0.0072 0.0087 0.0085 0.0084
ans =
0.0083
>>
This ends with a short final group since 26 lines of data isn't evenly divisible by 5. In practice you'll break out of the loop early because (hopefully) your search for the particular value will have succeeded; then you break, fclose, do whatever with the data you found, and go on to the next file.
NB Previously had forgotten to remove the 'headerlines',1 parameter so was skipping a record each loop through. Had accounted for the single header record at the beginning of the file with the fgetl call before beginning the loop.
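Putting those pieces together, a minimal sketch of the block-wise search (untested on the real files; the block size and tolerance are illustrative, and `z`, `k`, `fullname`, and `dataset1` are assumed from the original loop):

```matlab
% Scan the file in blocks of N rows and stop at the first match.
N      = 50000;               % rows per block; tune for memory
tol    = 1e-9;                % comparison tolerance, per the FAQ above
target = 0.0075000000008515;
fid = fopen(fullname);
fgetl(fid);                   % discard the header row
while ~feof(fid)
    d = cell2mat(textscan(fid,'',N,'delimiter',',','collectoutput',1));
    r = find(abs(d(:,z) - target) < tol, 1, 'first');
    if ~isempty(r)
        dataset1(k,:) = d(r,:);
        break                 % found it; no need to read further
    end
end
fclose(fid);
```

Because only one N-row block is in memory at a time, this never loads the whole million-row file, and on files where the value appears early it returns almost immediately.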