Analyzing a Large Amount of Data in a CSV file

Hello. For one of my projects I have been tasked with analyzing over a hundred thousand lines of data in a .csv file, and I decided to use MATLAB to help me analyze it. The data is essentially COVID-19 data for over a hundred countries; I need to group it by country and plot a graph of deaths per million versus time. I've written some code so far, but I receive an error at the end. I am not sure how to fix the problem and I really need help. I am also new to MATLAB, so my code might look very inefficient and primitive, and I would very much appreciate feedback on how I can make it better.
T=readtable("data.csv")
countries=unique(T.location);
numcount=numel(T.location);
for i=1:height(T);
for j=cellstr(countries);
A=[];
count=0;
if ismember(T.location(i),j);
count=count+1;
A(count,1)=string(T.date(i));
A(count,2)=T.total_deaths_per_million(i);
A(count,3)=T.new_deaths_per_million(i);
A(count,4)=T.hospital_beds_per_thousand(i);
A(count,5)=T.life_expectancy(i);
A(count,6)=T.human_development_index(i);
end
eval('cntry' string(j) '=A');
end
end
After running this code, I receive this error (I'm not sure how to change the text to red, sorry).
File: test1.m Line: 17 Column: 22
Invalid expression. Check for missing multiplication operator, missing or unbalanced delimiters, or
other syntax error. To construct matrices, use brackets instead of parentheses.
Could anyone please help me to solve this error? Also, I am planning to plot deaths per million versus time but I am not sure how to convert dates into a number that can be plotted in a graph. Could anyone give me some tips on how to do so? Thank you very much!

3 Comments

Stephen23 on 28 Oct 2021
"Could anyone give me some tips on how to do so?"
Your task would be much easier if you did not force meta-data (e.g. country names) into variable names.
Meta-data is data, and data belongs in variables, not in variable names. If you stored the country names as data in their own right then you could easily and efficiently loop over the countries, process, and plot the data as required.
In contrast the approach you have chosen forces you into writing slow, complex, inefficient code.
Nathan Lawira on 28 Oct 2021
So do you mean that I should instead store country names into a cell array and pull the data from that cell?
Stephen23 on 28 Oct 2021
"So do you mean that I should instead store country names into a cell array and pull the data from that cell?"
That would be one of many approaches that would be easier to work with than dynamic variable names.
Other approaches would be to use a table (see READTABLE) or a non-scalar structure.
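As a rough sketch of the table approach, and of the asker's side question about plotting against dates: `plot` accepts `datetime` values on the x-axis directly, so no manual conversion to numbers should be needed. The miniature table below is a made-up stand-in for data.csv (the values are invented; only the column names come from the question), and it assumes the date column is already `datetime`. If `readtable` imports it as text instead, something like `datetime(T.date)` would be needed first.

```matlab
% Hypothetical miniature stand-in for data.csv (values are made up).
location = {'Australia'; 'Australia'; 'Bolivia'; 'Bolivia'};
dates    = datetime(2021, 10, [1; 2; 1; 2]);
deaths   = [50.1; 50.4; 160.2; 160.9];
T = table(location, dates, deaths, ...
    'VariableNames', {'location', 'date', 'total_deaths_per_million'});

countries = unique(T.location);      % country names stay as data
figure
hold on
for k = 1:numel(countries)
    idx = strcmp(T.location, countries{k});   % rows belonging to this country
    plot(T.date(idx), T.total_deaths_per_million(idx), ...
        'DisplayName', countries{k})
end
hold off
legend('Location', 'best')
xlabel('Date')
ylabel('Total deaths per million')
```

The country name is only ever used as a value (for selecting rows and for the legend), never as a variable name.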

Accepted Answer

Yongjian Feng on 27 Oct 2021
The error is about your line 17. What do you want to do there?
Do you mean
eval(['cntry' num2str(j) '=A']);

8 Comments

Yongjian Feng on 27 Oct 2021
Edited: Yongjian Feng on 27 Oct 2021
The user just wants to store A for each iteration into a cntry1/cntry2.... If so, put it into an array like this:
T=readtable("data.csv")
countries=unique(T.location);
numcount=numel(T.location);
cntry = [];
for i=1:height(T);
    for j=cellstr(countries);
        A=[];
        count=0;
        if ismember(T.location(i),j);
            count=count+1;
            A(count,1)=string(T.date(i));
            A(count,2)=T.total_deaths_per_million(i);
            A(count,3)=T.new_deaths_per_million(i);
            A(count,4)=T.hospital_beds_per_thousand(i);
            A(count,5)=T.life_expectancy(i);
            A(count,6)=T.human_development_index(i);
        end
        cntry = [cntry A];
    end
end
count is incremented conditionally, so A will have a different number of rows for different iterations. So you need a cell array instead of horzcat.
Good catch.
T=readtable("data.csv")
countries=unique(T.location);
numcount=numel(T.location);
cntry = {};
for i=1:height(T);
    for j=cellstr(countries);
        A=[];
        count=0;
        if ismember(T.location(i),j);
            count=count+1;
            A(count,1)=string(T.date(i));
            A(count,2)=T.total_deaths_per_million(i);
            A(count,3)=T.new_deaths_per_million(i);
            A(count,4)=T.hospital_beds_per_thousand(i);
            A(count,5)=T.life_expectancy(i);
            A(count,6)=T.human_development_index(i);
        end
        cntry{end+1} = A;
    end
end
%% access cntry using cntry{1}/cntry{2}....
I tried this and it does indeed work. Thank you very much for your help! However, it took about 4 minutes to return cntry with all the data inside. Is it because my code is inefficient or is the time taken dependent on the amount of data I have (my T is a 119656x64 table)?
Also, is there a way for me to name the elements of my cell array? For example, if I want the data of Australia, then is it possible for me to make a code that will return Australia's data with cntry{Australia} instead of for example cntry{16}?
Yongjian Feng on 29 Oct 2021
Edited: Yongjian Feng on 29 Oct 2021
You can use a struct to store the data. Inside the loop, if you can figure out the name of a country, you can do:
T=readtable("data.csv")
countries=unique(T.location);
numcount=numel(T.location);
for i=1:height(T);
    for j=cellstr(countries);
        A=[];
        count=0;
        countryName = '';
        if ismember(T.location(i),j);
            count=count+1;
            A(count,1)=string(T.date(i));
            A(count,2)=T.total_deaths_per_million(i);
            A(count,3)=T.new_deaths_per_million(i);
            A(count,4)=T.hospital_beds_per_thousand(i);
            A(count,5)=T.life_expectancy(i);
            A(count,6)=T.human_development_index(i);
            % somehow you can figure out your country name
            % maybe from T?
            countryName = 'Australia';
        end
        if ~isempty(countryName)
            % good you figure out your countryName
            cntry.(countryName) = A;
        else
            % you need to decide what to do if you can't figure out the
            % countryName.
        end
    end
end
%% access cntry by field name, e.g. cntry.Australia
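For the lookup-by-name part of the question, one possible alternative (not what the answer above uses) is a `containers.Map` keyed by the country name. The sketch below runs on a made-up miniature table; the values are invented and only the column names come from the question. It also replaces the row-by-row loop with one vectorized `strcmp` per country, which should be considerably cheaper on a table with over a hundred thousand rows, though that has not been timed here.

```matlab
% Hypothetical miniature stand-in for T (values are made up).
location = {'Australia'; 'Bolivia'; 'Australia'; 'United States'};
total_deaths_per_million = [50.1; 160.2; 50.4; 2200.5];
T = table(location, total_deaths_per_million);

countries = unique(T.location);
cntry = containers.Map('KeyType', 'char', 'ValueType', 'any');
for k = 1:numel(countries)
    name = countries{k};
    % One vectorized comparison per country instead of one test per row.
    cntry(name) = T(strcmp(T.location, name), :);
end

aus = cntry('Australia');            % look up a country's rows by name
disp(aus.total_deaths_per_million)
```

Because the keys are ordinary text, names containing spaces such as 'United States' work unchanged, and `keys(cntry)` lists the stored names.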
Nathan Lawira on 29 Oct 2021
Thank you! I've found another way though by using struct() and creating another cell array. Thank you so much all for your help.

More Answers (1)

Walter Roberson on 27 Oct 2021
The line
eval('cntry' string(j) '=A');
is the problem. You have a character vector, then a space, then a string scalar, then a space, then a character vector. You do not have operators between the parts, so that is an invalid expression.
If you were to change it to
eval(['cntry' string(j) '=A']);
then you would have the problem that with the string scalar there, the result of the [] would be a 1 x 3 string object, but eval() is not designed to be able to accept a vector of string objects.
You should not do that.
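The concatenation behaviour described above can be checked in isolation. The snippet below is only an illustration of why the `[]` result is not usable as a single command, not a recommendation to build `eval` strings; `name` is a hypothetical country name.

```matlab
name = 'Australia';                  % hypothetical country name

s = ['cntry' string(name) '=A'];     % mixing char vectors with a string scalar
disp(class(s))                       % string
disp(size(s))                        % 1 3: three separate string elements

c = ['cntry' char(name) '=A'];       % char vectors only
disp(class(c))                       % char
disp(c)                              % cntryAustralia=A
```

With a string scalar in the brackets, the char vectors are converted to strings and the pieces are kept as separate elements rather than joined into one piece of text.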

6 Comments

for j=cellstr(countries);
Why are you using cellstr() there?
From outside, we cannot tell whether countries is numeric or cell array of character vectors. We can tell that countries is not string array, because readtable() defaults to recording text as cell of character vectors instead of string array; likewise, readtable() only knows about categorical if you use import options and set the variable to categorical.
The other thing we know is that T.location will be vertical, not horizontal: when readtable is able to find delimiters between parts, it creates separate column variables, not row variables.
If countries was numeric, then cellstr() would fail. So by elimination, we conclude that countries must be a cell array of character vectors and that the cell array is a column vector.
What does cellstr() do when it is passed a column vector cell array of character vectors? Answer: it returns them unchanged, just as if cellstr() had not been called. So we can deduce that the right hand side of for j=cellstr(countries); is getting a cell array column vector.
What does for do when it receives a cell array column vector? Answer: it does a single iteration and assigns the loop variable to be the entire cell array column vector at the same time. This is because for iterates over columns, not over elements.
eval('cntry' string(j) '=A');
Remember at this point j is a cell array of character vectors, a complete copy of the countries variable. string() of it is going to result in a column vector string array. What is it that you expect the statement to do in such a case?
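The column-wise behaviour of `for` is easy to confirm on a small example. The country list below is hypothetical; the point is only the iteration count in each case.

```matlab
countries = {'Australia'; 'Bolivia'; 'Chile'};   % 3x1 column (hypothetical values)

nCol = 0;
for j = countries           % column vector: the body runs once
    nCol = nCol + 1;
    sizeSeen = size(j);     % j is the whole 3x1 cell array
end

nRow = 0;
for j = countries.'         % row vector: one iteration per country
    nRow = nRow + 1;
    name = j{1};            % j is a 1x1 cell, so unwrap it to get the text
end

fprintf('column: %d iteration(s), row: %d iteration(s)\n', nCol, nRow)
```

Looping over an index, as in `for k = 1:numel(countries)` with `countries{k}` inside the body, sidesteps the orientation question entirely.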
I see... in that case, I would have to convert j into a row vector by one means or another, right?
eval('cntry' string(j) '=A');
I am holding a dummy variable A which takes all the data from T for a specific country, and after the conditional if ends, it puts all the data it is storing into dynamic variables (i.e. cntryAfghanistan, cntryAustralia). So, at the end, I would have all these variables containing the data for each country.
"So, at the end, I would have all these variables containing the data for each country."
We very much recommend that you do not do that. Instead,
jc = char(j); %in case it is scalar cell array or string scalar
cntry.(jc) = A;
That would give you a struct named cntry with one field named after each country.
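One caveat with dynamic field names is that a field name must be a valid MATLAB identifier, so a country name containing spaces or punctuation cannot be used as-is. A minimal sketch of one way around this, assuming `matlab.lang.makeValidName` is available in your release; the names and the placeholder data `A` are hypothetical.

```matlab
names = {'Australia'; 'United States'; 'Cote d''Ivoire'};   % hypothetical examples
A = magic(3);                                               % placeholder data

cntry = struct();
for k = 1:numel(names)
    % Convert the country name into a valid identifier before using it
    % as a field name; names with spaces or quotes are altered.
    field = matlab.lang.makeValidName(names{k});
    cntry.(field) = A;
end
disp(fieldnames(cntry))
```

The converted field names no longer match the original country names exactly, and two different names could in principle map to the same field, which is one more reason to keep the names as data.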
Nathan Lawira on 28 Oct 2021
I see.. I'll try it out, thank you very much!
Stephen23 on 28 Oct 2021
Edited: Stephen23 on 28 Oct 2021
"I would also very much appreciate feedback on how I can make it better."
Rather than nesting the data in fields named for each country, your data would be better (simpler, more efficient, much easier to access) if you created a flat non-scalar structure, which would look something like this:
S(1).country = 'Afghanistan';
S(1).population = 39e6;
S(1).whatever = 123;
S(2).country = 'Bolivia';
S(2).population = 12e6;
S(2).whatever = 456;
% ... etc
Then you can simply loop over all countries using indexing or easily generate comma-separated lists:
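A minimal sketch of both ideas, using the same illustrative field names and placeholder values as the structure above:

```matlab
S(1).country = 'Afghanistan';  S(1).population = 39e6;
S(2).country = 'Bolivia';      S(2).population = 12e6;

% Loop over all countries by index.
for k = 1:numel(S)
    fprintf('%s: %g\n', S(k).country, S(k).population);
end

% Comma-separated lists collect one field across all elements.
names = {S.country};          % 1x2 cell array of char vectors
pops  = [S.population];       % 1x2 numeric vector

% Look up one country by name.
bol = S(strcmp(names, 'Bolivia'));
disp(bol.population)
```

Here the country name is an ordinary field value, so it can contain any characters and can be searched with `strcmp`.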
As an alternative, a table might be suitable for your data (and also makes processing data much simpler):
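As a sketch of what per-country processing on a table could look like, `findgroups` and `splitapply` compute one result per country without any explicit loop. The miniature table is a made-up stand-in for the real data, and the name of the output column is an arbitrary choice.

```matlab
% Hypothetical miniature stand-in for T (values are made up).
location = {'Australia'; 'Bolivia'; 'Australia'; 'Bolivia'};
total_deaths_per_million = [50.1; 160.2; 50.4; 160.9];
T = table(location, total_deaths_per_million);

% G assigns a group number to each row; names lists the groups in order.
[G, names] = findgroups(T.location);

% Apply a function to each country's rows, here the maximum value.
peak = splitapply(@max, T.total_deaths_per_million, G);

summary = table(names, peak, ...
    'VariableNames', {'location', 'max_total_deaths_per_million'});
disp(summary)
```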
Nathan Lawira on 28 Oct 2021
Okay, I'll try it out. Thank you!

Release

R2021a

Asked:

27 Oct 2021

Commented:

29 Oct 2021
