Access data in a nested structure array and generate an excel file

Question

0 votos

Hello everyone,

I have a bunch of .json files in a folder called "JSON files". Each file is an article. I would like to put the paper_id, title, abstract, and authors' affiliation in an excel files.

files = dir('JSON files/*.json');
for i=1:numel(files)
    filename = fullfile(files(i).folder, files(i).name);
    data = jsondecode(fileread(filename));
end

the "data" in the above code for each i looks like

data = 
  struct with fields:
       paper_id: '0015023cc06b5362d332b3baf348d11567ca2fbb'
       metadata: [1×1 struct]
       abstract: [2×1 struct]
      body_text: [20×1 struct]
    bib_entries: [1×1 struct]
    ref_entries: [1×1 struct]
    back_matter: [1×1 struct]

and

    data.metadata =
     struct with fields:
      title: 'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3'
    authors: [17×1 struct]

and

    data. metadata.authors
ans = 
  17×1 struct array with fields:
    first
    middle
    last
    suffix
    affiliation
    email

for this specific article the affiliation is empty. However most of the .json files have an authors' affiliation. Moreover, each author may have different affiliation from her co-authors. I would like to keep all afiliations.

Any idea how I can solve this problem? Any input would be greatly appreciated!

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Follow Question

Answer 1

Sindar el 6 de Mayo de 2020

Editada: Sindar el 6 de Mayo de 2020

Abrir en MATLAB Online

0 votos

[Edited with fixed answer, x2]

This throws no errors and appears to produce a correct excel file from your sample data:

files = dir('JSON files/*.json');
% start off a table to put info in, with the correct size and column names
mytable=table('Size',[numel(files) 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
for ind=1:numel(files)
    % load data
    filename = fullfile(files(ind).folder, files(ind).name);
    data = jsondecode(fileread(filename));
    % for the ind-th row, fill in the pid column
    mytable{ind,'pid'} = {data.paper_id};
    % title column
    mytable{ind,'title'} = {data.metadata.title};
    % abstract column
    % if the abstract is empty, it will get left as <missing> and print an empty cell in excel
    if ~isempty(data.abstract)
        this_abstract = {data.abstract.text};
        % concatenate abstract lines with spaces
        mytable{ind,'abstract'} = {strjoin(this_abstract,' ')};
    end
    % affiliations column
    % check which authors have affiliation info
    tmp=false([0 0]);
    for ind_A = 1:length(data.metadata.authors)
        tmp(ind_A) = ~isempty(fieldnames(data.metadata.authors(ind_A).affiliation));
    end
    % put all affiliations in a single struct array
    if ~isempty(tmp) && nnz(tmp)>0
        this_affiliation = [data.metadata.authors(tmp).affiliation];
        % put all institutions in a single cell array
        this_affiliation = {this_affiliation.institution};
        % remove duplicates, sort, and make sure the first element is an empty string
        this_affiliation = unique([{''} this_affiliation]);
        % combine all the affiliations, separating by '; ' (ignore empty first string)
        mytable{ind,'affiliations'} = {strjoin(this_affiliation(2:end),'; ')};
    end
    % clear temporary variable (tmp especially)
    clear tmp this_abstract this_affiliation
end
% write table to myData.xls
writetable(mytable,'myData.xls')

If you have a lot of files and don't want to store all the data in a table before writing, use the append option:

mytable=table('Size',[1 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
writetable(mytable,'myData.xls');
for ind
    ...
    mytable{1,'pid'} = {data.paper_id};
    ...
    writetable(mytable,'myData.xls','WriteMode','Append','WriteVariableNames',false) 
end

16 comentarios
Mostrar 14 comentarios más antiguos Ocultar 14 comentarios más antiguos

Susan el 6 de Mayo de 2020

Editada: Susan el 6 de Mayo de 2020

JSON files.zip

Thank you very much for your reply. I appreciate your time and help.

when I execute the code you provided me I got error "Wrong number of arguments" on the line contains "mytable{ind,'pid'} = data.paper_id;". I've changed it to "convertCharsToStrings(data.paper_id)"/"cellstr(data.paper_id)". However, I am not sure whether I am right or not. I've also got another error on "mytable{ind,'abstract'} = data.abstract" which says "To assign to or create a variable in a table, the number of rows must match the height of the table." I have changed it to "mytable{ind,'abstract'} = [data.abstract.text]", however, I encounter a new error that I don't know how to handel that "The value on the right-hand side of the assignment has the wrong width. The assignment requires a value whose width is 1". Moreover, I have got some other errors that don't know how to address such as:

>> this_affiliation = unique([{''} this_affiliation])

Error using cell/unique (line 85)

Cell array input must be a cell array of character vectors.

>> mytable{ind,'affiliation'} = strjoin(ans(2:end),'; ')

Error using strjoin (line 53)

First input must be a string array or cell array of character vectors.

Moreover, the "data.metadata.authors.affiliation" is a struct as well.

I have attached 10 files (out of 1000 files that I have). Could you please kindly take a look and tell me what modification I need to make? I really appreciate your help. Thanks

Sindar el 6 de Mayo de 2020

Abrir en MATLAB Online

assuming you just want the text field from the abstract, that can be gathered as affiliations above. For the affiliations, first you need to make a new struct array just from the affiliation structures:

files = dir('JSON files/*.json');
% start off a table to put info in
mytable=table();
for ind=1:numel(files)
    filename = fullfile(files(ind).folder, files(ind).name);
    data = jsondecode(fileread(filename));
    % for the ind-th row, fill in the pid column
    mytable{ind,'pid'} = {data.paper_id};
    % title column
    mytable{ind,'title'} = {data.metadata.title};
    % abstract column
    this_abstract = {data.abstract.text};
    mytable{ind,'abstract'} = {strjoin(this_abstract,' ')};
    % put all affiliations in a single struct array
    this_affiliation = [data.metadata.authors.affiliation];
    % put all institutions in a single cell array
    this_affiliation = {this_affiliation.institution}
    % remove duplicates, sort, and make sure the first element is an empty string
    this_affiliation = unique([{''} this_affiliation]);
    % combine all the affiliations, separating by '; ' (ignore empty first string)
    mytable{ind,'affiliation'} = {strjoin(this_affiliation(2:end),'; ')};
end
% write table to myData.xls
writetable(mytable,'myData.xls')

There is some error-checking left, for instance:

if ~isempty(data.abstract)
    this_abstract = {data.abstract.text};
end

will catch empty abstracts.

Sindar el 6 de Mayo de 2020

Abrir en MATLAB Online

Ok, I've figured out all the empty-value catching. Here's the original answer for reference, but I've edited the Answer

I don't have your data, so this may need a little tweaking, but give it a try:

files = dir('JSON files/*.json');
% start off a table to put info in
mytable=table();
for ind=1:numel(files)
    filename = fullfile(files(ind).folder, files(ind).name);
    data = jsondecode(fileread(filename));
    % for the ind-th row, fill in the pid column
    mytable{ind,'pid'} = data.paper_id;
    % title column
    mytable{ind,'title'} = data.metadata.title;
    % abstract column
    mytable{ind,'abstract'} = data.abstract; % more work here, since it's a struct
    % put all affiliations in a single cell array
    this_affiliation = {data.metadata.authors};
    % remove duplicates, sort, and make sure the first element is an empty string
    this_affiliation = unique([{''} this_affiliation]);
    % combine all the affiliations, separating by '; ' (ignore empty first string)
    mytable{ind,'affiliation'} = strjoin(ans(2:end),'; ')
end
% write table to myData.xls
writetable(mytable,'myData.xls')

Sindar el 6 de Mayo de 2020

BTW, this was an interesting problem. I have a "catstructfield" function that does this sort of thing for nice structure arrays. I actually routinely use it on data loaded from JSON files. Luckily for me, the files I work with don't rarely have any missing data, so I didn't have to worry about that. I may eventually update the code to the point that it could be generally-usable and worth sharing. This question/dataset will certainly help with that!

I tried it on your data and it threw no errors! Because I have a try-catch block that just doesn't return any part of the structure that didn't work. So, I ended up with a mostly empty structure...

Susan el 6 de Mayo de 2020

Glad to hear this data set may help with that! Thanks again for your help.

Iniciar sesión para comentar.

Access data in a nested structure array and generate an excel file

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Respuesta aceptada

16 comentarios
Mostrar 14 comentarios más antiguos Ocultar 14 comentarios más antiguos

Más respuestas (0)

Categorías

Etiquetas

Community Treasure Hunt

Access data in a nested structure array and generate an excel file

0 comentarios Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Respuesta aceptada

16 comentarios Mostrar 14 comentarios más antiguos Ocultar 14 comentarios más antiguos

Más respuestas (0)

Categorías

Etiquetas

Ver también

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

16 comentarios
Mostrar 14 comentarios más antiguos Ocultar 14 comentarios más antiguos