Access data in a nested structure array and generate an excel file

Hello everyone,
I have a bunch of .json files in a folder called "JSON files". Each file is an article. I would like to put the paper_id, title, abstract, and authors' affiliation in an excel files.
files = dir('JSON files/*.json');
for i=1:numel(files)
filename = fullfile(files(i).folder, files(i).name);
data = jsondecode(fileread(filename));
end
the "data" in the above code for each i looks like
data =
struct with fields:
paper_id: '0015023cc06b5362d332b3baf348d11567ca2fbb'
metadata: [1×1 struct]
abstract: [2×1 struct]
body_text: [20×1 struct]
bib_entries: [1×1 struct]
ref_entries: [1×1 struct]
back_matter: [1×1 struct]
and
data.metadata =
struct with fields:
title: 'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3'
authors: [17×1 struct]
and
data. metadata.authors
ans =
17×1 struct array with fields:
first
middle
last
suffix
affiliation
email
for this specific article the affiliation is empty. However most of the .json files have an authors' affiliation. Moreover, each author may have different affiliation from her co-authors. I would like to keep all afiliations.
Any idea how I can solve this problem? Any input would be greatly appreciated!

 Respuesta aceptada

Sindar
Sindar el 6 de Mayo de 2020
Editada: Sindar el 6 de Mayo de 2020
[Edited with fixed answer, x2]
This throws no errors and appears to produce a correct excel file from your sample data:
files = dir('JSON files/*.json');
% start off a table to put info in, with the correct size and column names
mytable=table('Size',[numel(files) 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
for ind=1:numel(files)
% load data
filename = fullfile(files(ind).folder, files(ind).name);
data = jsondecode(fileread(filename));
% for the ind-th row, fill in the pid column
mytable{ind,'pid'} = {data.paper_id};
% title column
mytable{ind,'title'} = {data.metadata.title};
% abstract column
% if the abstract is empty, it will get left as <missing> and print an empty cell in excel
if ~isempty(data.abstract)
this_abstract = {data.abstract.text};
% concatenate abstract lines with spaces
mytable{ind,'abstract'} = {strjoin(this_abstract,' ')};
end
% affiliations column
% check which authors have affiliation info
tmp=false([0 0]);
for ind_A = 1:length(data.metadata.authors)
tmp(ind_A) = ~isempty(fieldnames(data.metadata.authors(ind_A).affiliation));
end
% put all affiliations in a single struct array
if ~isempty(tmp) && nnz(tmp)>0
this_affiliation = [data.metadata.authors(tmp).affiliation];
% put all institutions in a single cell array
this_affiliation = {this_affiliation.institution};
% remove duplicates, sort, and make sure the first element is an empty string
this_affiliation = unique([{''} this_affiliation]);
% combine all the affiliations, separating by '; ' (ignore empty first string)
mytable{ind,'affiliations'} = {strjoin(this_affiliation(2:end),'; ')};
end
% clear temporary variable (tmp especially)
clear tmp this_abstract this_affiliation
end
% write table to myData.xls
writetable(mytable,'myData.xls')
If you have a lot of files and don't want to store all the data in a table before writing, use the append option:
mytable=table('Size',[1 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
writetable(mytable,'myData.xls');
for ind
...
mytable{1,'pid'} = {data.paper_id};
...
writetable(mytable,'myData.xls','WriteMode','Append','WriteVariableNames',false)
end

16 comentarios

If you have a lot of files and don't want to store all the data in a table before writing, use the append option:
mytable=table([],[],[],[],'VariableNames',{'pid';'title';'abstract';'affiliations'})
writetable(mytable,'myData.xls')
for ind
...
mytable{1,'pid'} = data.paper_id;
...
writetable(mytable,'myData.xls','WriteMode','Append','WriteVariableNames',false)
end
Susan
Susan el 6 de Mayo de 2020
Editada: Susan el 6 de Mayo de 2020
Thank you very much for your reply. I appreciate your time and help.
when I execute the code you provided me I got error "Wrong number of arguments" on the line contains "mytable{ind,'pid'} = data.paper_id;". I've changed it to "convertCharsToStrings(data.paper_id)"/"cellstr(data.paper_id)". However, I am not sure whether I am right or not. I've also got another error on "mytable{ind,'abstract'} = data.abstract" which says "To assign to or create a variable in a table, the number of rows must match the height of the table." I have changed it to "mytable{ind,'abstract'} = [data.abstract.text]", however, I encounter a new error that I don't know how to handel that "The value on the right-hand side of the assignment has the wrong width. The assignment requires a value whose width is 1". Moreover, I have got some other errors that don't know how to address such as:
>> this_affiliation = unique([{''} this_affiliation])
Error using cell/unique (line 85)
Cell array input must be a cell array of character vectors.
>> mytable{ind,'affiliation'} = strjoin(ans(2:end),'; ')
Error using strjoin (line 53)
First input must be a string array or cell array of character vectors.
Moreover, the "data.metadata.authors.affiliation" is a struct as well.
I have attached 10 files (out of 1000 files that I have). Could you please kindly take a look and tell me what modification I need to make? I really appreciate your help. Thanks
To fix the "Wrong number of arguments" error, simply wrap the RHS in {}:
mytable{ind,'pid'} = {data.paper_id};
assuming you just want the text field from the abstract, that can be gathered as affiliations above. For the affiliations, first you need to make a new struct array just from the affiliation structures:
files = dir('JSON files/*.json');
% start off a table to put info in
mytable=table();
for ind=1:numel(files)
filename = fullfile(files(ind).folder, files(ind).name);
data = jsondecode(fileread(filename));
% for the ind-th row, fill in the pid column
mytable{ind,'pid'} = {data.paper_id};
% title column
mytable{ind,'title'} = {data.metadata.title};
% abstract column
this_abstract = {data.abstract.text};
mytable{ind,'abstract'} = {strjoin(this_abstract,' ')};
% put all affiliations in a single struct array
this_affiliation = [data.metadata.authors.affiliation];
% put all institutions in a single cell array
this_affiliation = {this_affiliation.institution}
% remove duplicates, sort, and make sure the first element is an empty string
this_affiliation = unique([{''} this_affiliation]);
% combine all the affiliations, separating by '; ' (ignore empty first string)
mytable{ind,'affiliation'} = {strjoin(this_affiliation(2:end),'; ')};
end
% write table to myData.xls
writetable(mytable,'myData.xls')
There is some error-checking left, for instance:
if ~isempty(data.abstract)
this_abstract = {data.abstract.text};
end
will catch empty abstracts.
Ok, I've figured out all the empty-value catching. Here's the original answer for reference, but I've edited the Answer
I don't have your data, so this may need a little tweaking, but give it a try:
files = dir('JSON files/*.json');
% start off a table to put info in
mytable=table();
for ind=1:numel(files)
filename = fullfile(files(ind).folder, files(ind).name);
data = jsondecode(fileread(filename));
% for the ind-th row, fill in the pid column
mytable{ind,'pid'} = data.paper_id;
% title column
mytable{ind,'title'} = data.metadata.title;
% abstract column
mytable{ind,'abstract'} = data.abstract; % more work here, since it's a struct
% put all affiliations in a single cell array
this_affiliation = {data.metadata.authors};
% remove duplicates, sort, and make sure the first element is an empty string
this_affiliation = unique([{''} this_affiliation]);
% combine all the affiliations, separating by '; ' (ignore empty first string)
mytable{ind,'affiliation'} = strjoin(ans(2:end),'; ')
end
% write table to myData.xls
writetable(mytable,'myData.xls')
Susan
Susan el 6 de Mayo de 2020
Editada: Susan el 6 de Mayo de 2020
Thank you VERY MUCH! I really appreciate your time. I learned a lot from your reply. Thanks again!
If you don't mind, I've got a quick questions for you.
Some files don't have authors, i.e.,
data.metadata
ans =
struct with fields:
title: ''
authors: []
So, we end up with "undefined variable 'tmp'." How to deal with that? Thanks again.
...
% check which authors have affiliation info
tmp=[];
for ind_A = 1:length(data.metadata.authors)
...
should do it
Unfortunately, no. It goes well through the first file but give me the following error on second file
"Array indices must be positive integers or logical values"
on which line?
Susan
Susan el 6 de Mayo de 2020
Editada: Susan el 6 de Mayo de 2020
I have attached a new file that gave me error that I mentione above, i.e., undefined variable 'tmp'.
data.paper_id
ans =
'036a53f0173c0e210dd5a551dc22c371e2859786'
Susan
Susan el 6 de Mayo de 2020
Editada: Susan el 6 de Mayo de 2020
On this line, even for ind = 1, when I add tmp = [];
this_affiliation = [data.metadata.authors(tmp).affiliation];
Fixed the answer
Ah, the issue was that tmp = []; makes tmp a numeric array, so it gets filled with 0s and 1s instead of falses and trues. You can't index into an array with 0
Additional issue for authors without affiliations:
this_affiliation = [data.metadata.authors([]).affiliation];
errors (you can't dot-index []).
Thank you very much! That works perfectly well. Thank you Thank you Thank you!
BTW, this was an interesting problem. I have a "catstructfield" function that does this sort of thing for nice structure arrays. I actually routinely use it on data loaded from JSON files. Luckily for me, the files I work with don't rarely have any missing data, so I didn't have to worry about that. I may eventually update the code to the point that it could be generally-usable and worth sharing. This question/dataset will certainly help with that!
I tried it on your data and it threw no errors! Because I have a try-catch block that just doesn't return any part of the structure that didn't work. So, I ended up with a mostly empty structure...
Glad to hear this data set may help with that! Thanks again for your help.

Iniciar sesión para comentar.

Más respuestas (0)

Preguntada:

el 5 de Mayo de 2020

Editada:

el 6 de Mayo de 2020

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by