File is all numeric, but csvread does not work fully

Hi, I'm trying to read in a bunch of data files one at a time as a matrix, use the find function to locate a certain z location, and when I do, store that row of data in a new matrix. My problem is that no matter what I do, I get this error:
??? Error using ==> dlmread at 145
Mismatch between file and format string.
Trouble reading number from file (row 781666, field 4) ==>
Error in ==> csvread at 52
m=dlmread(filename, ',', r, c);
Error in ==> Velocity_AtPt_vs_Time at 67
datafile=csvread(fullname,1);
The data files are all identical in layout: one row of column headers and 14 columns of all-numeric data. I set up csvread to skip the first row and read everything else. My files are approximately 1 million rows by 14 columns.
What's happening is that the code executes for 69 data files, doing exactly the steps I want and filling the new matrix properly, and then stops and gives me this error after the 69th. I have tried removing the 70th and 71st files to see what happens; it now stops at 67 files. Very odd. If anyone has suggestions, please let me know! Thanks for reading.
This is my loop that receives the error message:
for k = 1:numel(filenames)
    % Create full file name and partial filename
    fullname = [currentfolder filesep NEWFileNames(k).name];
    % Read in data
    datafile = csvread(fullname, 1);
    [rr1,cc1] = find(datafile(:,z)==0.0075000000008515);
    firstrow1 = rr1(1,1);
    firstcol1 = cc1(1,1);
    dataset1(k,:) = datafile(firstrow1,:);
end
Note: rr1 and cc1 are just so that I can take the first instance this value shows up; they are not the cause of the error in this code.

15 comments

dpb
dpb on 6 Apr 2016
What if you read just the initially-failing file? Does it fail or not?
I was unaware that csvread would accept only R and not R,C as offset inputs, but a test here shows it is seemingly OK with that.
Image Analyst
Image Analyst on 6 Apr 2016
Edited: Image Analyst on 6 Apr 2016
Might not be a bad idea though to get rid of the unneeded 1:
datafile = csvread(fullname);
By the way, this is a bad idea:
[rr1,cc1] = find(datafile(:,z)==0.0075000000008515);
Testing floating-point values for exact equality with == is fragile; a comparison within a tolerance is far more robust. And you forgot to attach the csv file, so we can't check row 781666, field 4. What is there? Please attach the file with at least that many rows, or, if a chunk of it, tell us what row 781666 is now at.
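A tolerance-based comparison might look like the sketch below; the tolerance value is illustrative and would need to match the data's precision:

```matlab
% Sketch only: locate the first row whose z column is within a small
% tolerance of the target value, instead of testing exact equality.
target = 0.0075000000008515;   % value from the question
tol    = 1e-12;                % illustrative tolerance
idx = find(abs(datafile(:,z) - target) < tol, 1, 'first');
if ~isempty(idx)
    dataset1(k,:) = datafile(idx,:);
end
```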
Jenna P
Jenna P on 7 Apr 2016
Yes dpb, I tried running only the 3 files around where this error occurs (69, 70, and 71, just to be sure). They ran through with no problem and no error message.
Image Analyst: I have attached file 70 in "snippet" format to this post. Unfortunately, the whole file is too large, but the attached file starts from row 781664 (accounting for headers removed as in my code, although I have included the headers in this file so that it may be better understood). Since the data values are at various 3D XYZ locations (results of a CFD simulation), it's important that I get exactly the point I need and not something within a tolerance. That is why I am using ==, but perhaps there is a better way, since this one is quite slow.
Thank you for your quick responses
dpb
dpb on 7 Apr 2016
Edited: dpb on 7 Apr 2016
If her description is correct in having a header row, the 1 is mandatory. For some inexplicable reason, TMW made this offset zero-based; almost, if not, the only thing in MATLAB that is. Why? Who knows, but it is...
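In other words, a minimal sketch ('data.csv' is a stand-in filename):

```matlab
% csvread's row/column offsets are zero-based: R=1 skips one line,
% so a file with a single header row is read starting at its first data row.
M = csvread('data.csv', 1, 0);   % skip the header row, start at column 1
```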
dpb
dpb on 7 Apr 2016
If the full file was read unmodified by itself, I'm convinced it's a memory-related bug... I know it probably takes a long time to process all these, but if you start over, is it consistent or seemingly random where it dies?
Jenna P
Jenna P on 7 Apr 2016
Edited: Jenna P on 7 Apr 2016
As I mentioned earlier, it stops after file 69. My matrix is filled properly up to that point and then no more. I also tried removing files 69, 70, and 71, but this time it stopped at file 67, which is quite peculiar. Perhaps it's an issue that my code does not "close" the files? I don't think csvread has this option, though. Maybe I should just try sorting the files into two sets 50/50, running each set, grabbing my 2 rows from each one, and then combining the results. Memory could very well be an issue, although it is the same error at the same stopping point on both my personal laptop and my university desktop.
edit: This did not work
Can you put the line
memory
just before the csvread() line? Then after it crashes, paste what it says for memory back here.
A million lines is a lot of lines, especially if each row has lots of numbers on it, and if you're trying to store all of them in memory at the same time. It could run up into the gigabytes of RAM.
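A rough back-of-envelope estimate for one of these files, stored as doubles:

```matlab
% Each double takes 8 bytes, so a 1,000,000 x 14 numeric matrix needs:
rows  = 1e6;
cols  = 14;
bytes = rows * cols * 8;
fprintf('%.0f MB per file\n', bytes/2^20);   % roughly 107 MB
```

So a single file is large but well within the reported ~17 GB limit; it's holding many of them, or copies, that would add up.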
Maximum possible array: 16920 MB (1.774e+010 bytes) *
Memory available for all arrays: 16920 MB (1.774e+010 bytes) *
Memory used by MATLAB: 780 MB (8.178e+008 bytes)
Physical Memory (RAM): 12222 MB (1.282e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Maximum possible array: 17601 MB (1.846e+010 bytes) *
Memory available for all arrays: 17601 MB (1.846e+010 bytes) *
Memory used by MATLAB: 780 MB (8.178e+008 bytes)
Physical Memory (RAM): 12222 MB (1.282e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Maximum possible array: 17325 MB (1.817e+010 bytes) *
Memory available for all arrays: 17325 MB (1.817e+010 bytes) *
Memory used by MATLAB: 780 MB (8.178e+008 bytes)
Physical Memory (RAM): 12222 MB (1.282e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Maximum possible array: 17211 MB (1.805e+010 bytes) *
Memory available for all arrays: 17211 MB (1.805e+010 bytes) *
Memory used by MATLAB: 778 MB (8.162e+008 bytes)
Physical Memory (RAM): 12222 MB (1.282e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Maximum possible array: 17149 MB (1.798e+010 bytes) *
Memory available for all arrays: 17149 MB (1.798e+010 bytes) *
Memory used by MATLAB: 778 MB (8.162e+008 bytes)
Physical Memory (RAM): 12222 MB (1.282e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Maximum possible array: 17068 MB (1.790e+010 bytes) *
Memory available for all arrays: 17068 MB (1.790e+010 bytes) *
Memory used by MATLAB: 778 MB (8.162e+008 bytes)
Physical Memory (RAM): 12222 MB (1.282e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Maximum possible array: 16993 MB (1.782e+010 bytes) *
Memory available for all arrays: 16993 MB (1.782e+010 bytes) *
Memory used by MATLAB: 776 MB (8.135e+008 bytes)
Physical Memory (RAM): 12222 MB (1.282e+010 bytes)
* Limited by System Memory (physical + swap file) available.
This is the result of putting memory into the loop, although I'm not sure how to interpret this output.
dpb
dpb on 7 Apr 2016
Edited: dpb on 7 Apr 2016
I was wondering if you ran the script multiple times without changing anything does it fail in the same place every time?
Also, one thing you could try that might save just a tiny fraction of memory would be to rewrite the find operation a little...
[rr1,cc1] = find(datafile(:,z)==0.0075000000008515,1,'first');
dataset1(k,:)=datafile(rr1,:);
This will stop the lookup after the first match is found. Also, I note that z isn't defined in the code snippet; I presume it's simply a constant?
IA has mentioned the issue of floating-point comparison, but that would make the result empty, which would be a different error...
Also, since you're looking for a particular value within the file, and the file is quite large, it might be advantageous to use a little more sophistication rather than simply trying to read the whole thing in one swell foop. Instead, use textscan to read sections; if the point happens to be fairly early in the file, you can truncate the read before having to process the whole thing.
Is there any way to guesstimate how far into the file the location of interest is?
Jenna P
Jenna P on 7 Apr 2016
dpb: Yes, I have tried running the same script multiple times, and the failure is always the same and in the same place. Thank you for your suggestion to make the command a little shorter using 'first'.
z is simply a constant denoting column 14, where all the z values are.
I believe the location is about halfway down the rows somewhere. I may just need a totally different approach. I don't see why textscan would be much different, and I also don't have much experience using it.
dpb
dpb on 7 Apr 2016
Edited: dpb on 7 Apr 2016
textscan is quite different in one significant way: it operates on an open file handle and can be called repeatedly to read the file sequentially, whereas csvread (and dlmread, the base routine it calls that does the actual work) only operates by reading the entire file into memory at once.
As noted before, if the file in question is in fact read correctly on its own, there is not actually an error in the file itself but some memory interaction. That it seems to be exactly reproducible surprises me somewhat, though; I was expecting a somewhat random component to be involved.
But, as far as that goes, I'd suggest submitting the symptoms as a bug report to TMW; perhaps they would be willing to try to reproduce the issue. It's too much data to provide via the Answers forum for any of us to attempt directly.
BUT, back to getting a workaround so your project can proceed: look at the following doc page and consider how you might follow the guidelines there. If you could provide any algorithm to estimate the location of the value in question, by knowing something of the way the coordinates are laid out, that would help. If you need help implementing this, attach a smaller subsection of a file and I'd be glad to help (as I'm sure many others would be as well).
Releases newer than mine also have a memory-mapping facility for text files, similar to memmapfile for stream files, that might help if you have a recent-enough release. I don't, so I have no experience with it or how well it might handle your particular problem.
Anyway, there's more than one way to skin the cat here... :)
ADDENDUM
If you know the location you're after is always well down in the file (you aren't so lucky that it's the identical location in every file, I suppose, which would make things much simpler), try setting the initial row number to a large integer that is still less than the line being searched for. While I'd expect speed to go way down, maybe you'll sidestep the memory-related issue.
I'm wondering, however, if the problem isn't actually memory per se, but OS file handles or the like that aren't being released, and it's system resources that are the ultimate cause of the failure, with the error message just a red herring as to the real root cause. If that's the case, the above may also fail, but I think it's worth a try, as it's simple to change the one constant in the existing script/function.
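The addendum's idea might look something like this sketch; startRow is a guess and must stay below the row being searched for:

```matlab
% Sketch: skip a large chunk of leading rows so csvread holds less of
% the file in memory at once. startRow is an illustrative value only.
startRow = 400000;                          % zero-based row offset
datafile = csvread(fullname, startRow, 0);  % read from startRow to EOF
idx = find(datafile(:,z)==0.0075000000008515, 1, 'first');
```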
Jenna P
Jenna P on 7 Apr 2016
Edited: Jenna P on 7 Apr 2016
Hi dpb, thanks for your thoughtful response. I'll give textscan a shot and see how it goes. In the meantime, I was processing some other data which required a similar approach (ironically, I used tolerances even before we discussed it); it works for a small number of files and then, with many more, it stops. It seems to be a memory issue at this point, since this is an entirely separate and different code, but also with many files. This time the files are about 150,000 rows by 14 columns. The error message is very much identical, but it complains about a different row and field.
Unfortunately, the coordinates seem to be mapped randomly to the file. I had also entertained the idea of using the same exact row from each file instead of the find function as a test, but the error still persists. I'll give a lower row number a chance, although that's not the goal, of course. Perhaps textscan will be the fix, since csvread is the common factor in both codes.
Edit: Taking files away from the 150,000 x 14 set allowed it to run through, no problem. Clearly a memory issue. Since I'm trying to average results, I may have to settle on how many files I average.
dpb
dpb on 7 Apr 2016
"Taking files away from the 150,000 by 14 set allowed it to run through, no problem. Clearly a memory issue."
That seems to me to point even more strongly to the resource end of things rather than just memory: 150,000 rows is far less than the roughly 800k lines where the point in question was in the initial set, which ran successfully through some 70 files before the error. IOW, the amount of data you'd already processed in the case of this thread far, far exceeds the other case, unless I'm grossly misreading the description.
I'd missed the section above where you attached a section of a file; I'll see if I can get a sample working this evening if you've not come up with a solution ere then.
Jenna P
Jenna P on 19 Apr 2016
Edited: Jenna P on 19 Apr 2016
I still have not found a solution to this problem. It does not make sense to me.
edit: Actually... using xlsread instead of csvread may have worked... but it's painfully slow.
dpb
dpb on 19 Apr 2016
Did you try the textscan solution? The first response in my earlier Answer should work, simply substituting it (and the appropriate fopen|fclose pair, of course) for csvread.
I'd surely suggest making that attempt before going to xlsread. If it's something peculiar about [csv|dlm]read causing the error (which I think is a resource issue), textscan is standalone, and if it also aborts, that's a pretty good indication the problem is more fundamental.
Also, did you file a Service Request with TMW Support on the issue?


Answers (1)

dpb
dpb on 7 Apr 2016
Edited: dpb on 20 Apr 2016
OK, to separate this from the long-winded chain of comments... this isn't the full answer yet, but a "getting started" with a textscan solution.
>> fid = fopen('file70_part.csv');   % open the file
>> d = cell2mat(textscan(fid,'','headerlines',1,'delimiter',',','collectoutput',1));
>> whos d
  Name      Size            Bytes  Class     Attributes
  d         26x14            2912  double
>> fid = fclose(fid);
>>
The above is all that's needed to read the full file; I've done a couple of things worth noting--
  1. Used the empty string '' for the format string. This has the effect that MATLAB will determine the fields per record automagically and return the proper shape; otherwise you have to know the number of fields per record and write a specific format string to match, and;
  2. Used cell2mat around the textscan call to return the data as a double array rather than the cell array otherwise returned. 'collectoutput' serves to make a single array, not 14.
What's not shown here is a counted number of records to read... That can be as simple as (after the fopen, of course):
>> fgetl(fid); % get, throwaway the header row
>> while ~feof(fid) % until run out of data
d=cell2mat(textscan(fid,'',5,'delimiter',',','collectoutput',1));
d(:,14).',end
ans =
0.0079 0.0078 0.0076 0.0070 0.0071
ans =
0.0072 0.0074 0.0075 0.0088 0.0087
ans =
0.0085 0.0084 0.0083 0.0081 0.0080
ans =
0.0079 0.0078 0.0076 0.0075 0.0074
ans =
0.0071 0.0072 0.0087 0.0085 0.0084
ans =
0.0083
>>
This ends with a short group since the total isn't evenly divisible; there were 26 lines of data. You'll end up exiting the loop because (hopefully) your search for the particular value will have succeeded; then you break, fclose, do whatever with the data you found, and go on to the next file.
NB: Previously I had forgotten to remove the 'headerlines',1 parameter, so I was skipping a record on each pass through the loop. The single header record at the beginning of the file is already accounted for by the fgetl call before the loop begins.
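Putting the pieces together, the per-file search could look something like this sketch; blockSize is illustrative, and column 14 and the target value follow the thread's example:

```matlab
blockSize = 50000;               % rows per textscan chunk (illustrative)
target    = 0.0075000000008515;  % z value being searched for
fid = fopen(fullname);
fgetl(fid);                      % discard the single header row
while ~feof(fid)
    d = cell2mat(textscan(fid, '', blockSize, ...
                          'delimiter', ',', 'collectoutput', 1));
    idx = find(d(:,14) == target, 1, 'first');
    if ~isempty(idx)
        dataset1(k,:) = d(idx,:);
        break                    % stop reading once the row is found
    end
end
fid = fclose(fid);
```

Because each chunk is only blockSize rows, peak memory stays small regardless of the file's total size, and the read stops as soon as the target row turns up.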


Asked: 6 Apr 2016

