Extracting Data field of a Series in HTML file

An HTML file contains a section like this:
series: [{
name: 'Numbers',
color: '#33CCFF',
lineWidth: 5,
data: [45,78,84,91,111,125,178,231,274,283,303,333] }],
How can I extract the 'data' field into an array in MATLAB code?
The same HTML file contains many such series with different 'name' fields, for example 'Total Value', 'Log Scale', 'Base Value', etc.

4 comments

Mohammad Sami on 7 Apr 2020
Use regexp to find the keyword in the text.
start pattern = series: \[\{
end pattern = \}\]\,
b on 7 Apr 2020
Thanks, but I only want the particular series that matches name='Numbers', not the other series.
Also, findElement does not seem to understand regexp.
Mohammad Sami on 7 Apr 2020
Are you parsing the HTML in MATLAB as a char array? regexp works on string, cellstr, or char data.
You can easily change the pattern to name: \'Numbers\'
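A minimal sketch of that suggestion, assuming the file text is already held in a char array: a single regexp can anchor on the name and capture the list that follows data: in the same block. The sample text and the variable names wanted, pat, and tok are illustrative and not taken from the original file.

```matlab
% Sample text standing in for the file contents (assumed layout).
chr = sprintf([ ...
    'series: [{\n' ...
    'name: ''Total Value'',\n' ...
    'data: [1,2,3] }],\n' ...
    'series: [{\n' ...
    'name: ''Numbers'',\n' ...
    'color: ''#33CCFF'',\n' ...
    'lineWidth: 5,\n' ...
    'data: [45,78,84] }],\n' ]);

wanted = 'Numbers';   % hypothetical variable holding the series name

% Match name: 'Numbers', stay inside the same {...} block ([^}]*?),
% then capture whatever sits between the brackets after data:
pat = [ 'name:\s*''', regexptranslate( 'escape', wanted ), ...
        '''[^}]*?data:\s*\[([^\]]*)\]' ];
tok = regexp( chr, pat, 'tokens', 'once' );

% Convert the comma-separated text to a numeric row vector.
data = str2double( strsplit( tok{1}, ',' ) );
```

With 'once', only the first block with that name is used; tok is empty if no block matches, so real code would check isempty(tok) before indexing.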
I am trying to do the following:
url="c:\finCase\case1.html";
code=webread(url);
tree=htmlTree(code);
selector="series";
subtrees=findElement(tree,selector);
The subtrees variable is empty, whereas it should contain all the series corresponding to the various names ('Numbers', 'Total Value', 'Log Scale', 'Base Value', etc.).


 Accepted Answer

per isakson on 7 Apr 2020
Edited: per isakson on 9 May 2020
I misunderstood your question. This is a bit of overkill.
Assumptions
  • the string, series:, always indicates the start of a block of interest
I created a sample file, cssm.txt, which I uploaded. (Matlab Answers doesn't allow the extension .html ).
This script reads all blocks
%% Read the file and cut out every block that follows "series:" and ends with "}],"
chr = fileread('cssm.txt');
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%% Parse the fields of each block into a struct array
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
    txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];                    % strip the single quotes
    series(jj).name = matlab.lang.makeValidName( txt );
    txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).color = txt;
    txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
    series(jj).lineWidth = str2double( txt );
    txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
    series(jj).data = str2num( txt ); %#ok<ST2NM>   % evaluates "[45,78,...]"
end
and extract "series which matches name='Numbers'. Not the other series'."
>> series(strcmp({series.name},'Numbers')).data
ans =
45 78 84 91 111 125 178 231 274 283 303 333
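One caveat with the script above, shown as a small sketch: matlab.lang.makeValidName removes spaces, so a name such as 'Total Value' is stored as 'TotalValue' and has to be looked up in that form. Using strtrim instead keeps the name as written in the file. The variable names raw, asIdentifier, and asWritten are illustrative.

```matlab
% Text as it looks after the quotes are stripped; note the leading blank.
raw = ' Total Value';

% makeValidName deletes whitespace to form a valid MATLAB identifier.
asIdentifier = matlab.lang.makeValidName( raw );

% strtrim only removes the leading and trailing blanks.
asWritten = strtrim( raw );

% A lookup by the original spelling only works on the trimmed form.
foundAsIdentifier = strcmp( asIdentifier, 'Total Value' );
foundAsWritten    = strcmp( asWritten,    'Total Value' );
```

For 'Numbers' both forms give the same result, which is why the lookup shown above works.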
In response to comment below
Assumptions
  • the string, series:, always indicates the start of a block of interest
  • the string, }], indicates the end of a block of interest
  • all html-files of interest are named index.html
  • all files named index.html are of interest
  • all html-files of interest are in subfolders under a root-folder, ...\finCase
  • every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers
The overkill is still there. However, reading and parsing four html-files (copies of cssm.txt) takes less than 10 ms.
Try
>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
  4×2 cell array
    {'anderson'       }    {1×9  double}
    {'kim-j-clijsters'}    {1×10 double}
    {'paul-judd'      }    {1×11 double}
    {'simmi'          }    {1×12 double}
>>
where (in one m-file)
function client_data = read_client_data( root, file, name )
    % Find every file named <file> in any subfolder of <root>
    sad = dir( fullfile( root, '**', file ) );
    len = length( sad );
    client_data = cell( len, 2 );
    for jj = 1 : len
        % The client name is the last part of the folder path
        cac = strsplit( sad(jj).folder, filesep );
        client = cac{end};
        series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
        client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
    end
end
function series = read_one_file_( file )
    % Cut out every block that follows "series:" and ends with "}],"
    chr = fileread( fullfile( file ) );
    cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
    len = length( cac );
    series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
    for jj = 1 : len
        txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];                % strip the single quotes
        series(jj).name = strtrim( txt );
        txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).color = txt;
        txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
        series(jj).lineWidth = str2double( txt );
        txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
        series(jj).data = str2num( txt ); %#ok<ST2NM>   % evaluates "[45,78,...]"
    end
end
TODO: add error handling and comments
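As one possible direction for the error handling mentioned in the TODO, here is a hedged sketch of a guarded lookup. It assumes a struct array shaped like the one returned by read_one_file_; the sample values and the warning identifiers are made up for illustration.

```matlab
% Stand-in for the struct array returned by read_one_file_ (sample values).
series = struct( 'name', {'Total Value', 'Numbers', 'Numbers'}, ...
                 'data', {[1 2], [45 78 84], [1.6 1.9]} );
name = 'Numbers';

% Indices of all blocks whose name matches.
ix = find( strcmp( {series.name}, name ) );

if isempty( ix )
    % No block with that name: return empty instead of erroring.
    warning( 'read_client_data:noMatch', 'No series named "%s".', name );
    data = [];
else
    if numel( ix ) > 1
        % Several blocks share the name: report it and keep the first.
        warning( 'read_client_data:multiMatch', ...
                 '%d series named "%s"; using the first.', numel( ix ), name );
    end
    data = series(ix(1)).data;
end
```

Whether the first match is the right one to keep depends on the files; the sketch only makes the choice explicit rather than letting the assignment fail.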

10 comments

b on 7 Apr 2020
Edited: per isakson on 7 Apr 2020
I really cannot thank you enough.
Although I have one follow-up query.
The files are in hundreds of client directories like this:
c:\finCase\client1\index.html
c:\finCase\client2\index.html
c:\finCase\client3\index.html
...
How can all these index.html files be batch processed so that, instead of a vector, the output of the command
series(strcmp({series.name},'Numbers')).data
is a matrix containing the data field of each client, one client per row?
Just a small caveat: every row will have a different number of elements. Client1 may have made more transactions than Client2, so the 'data' field of client1 may have more entries than the 'data' field of client2.
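Because a numeric matrix needs rows of equal length, one common workaround is to pad the shorter rows with NaN. The sketch below assumes the per-client vectors are already collected in a cell array; the client names, the values, and the variable names rows, width, and M are illustrative.

```matlab
% Sample per-client data of unequal length (illustrative values).
client_data = { 'client1', [45 78 84 91 111]; ...
                'client2', [12 30 41] };

rows  = client_data(:,2);
width = max( cellfun( @numel, rows ) );   % length of the longest row

% Start from an all-NaN matrix and copy each vector into its row.
M = nan( numel( rows ), width );
for k = 1 : numel( rows )
    M(k, 1:numel( rows{k} )) = rows{k};
end
```

Padding with NaN rather than 0 keeps missing entries distinguishable from real zero values; keeping the data in a cell array avoids padding altogether.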
b on 7 Apr 2020
Sorry, my mistake. The directory names are not 'client1', 'client2', 'client3', etc. The directory names are 'anderson', 'simmi', 'paul-judd', 'kim-j-clijsters', ...
Some names contain no hyphens, some contain a single hyphen, and some have two hyphens.
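Hyphens in folder names should not need special treatment if the folders are discovered rather than constructed. A minimal sketch, assuming MATLAB R2016b or later for the recursive '**' wildcard in dir; it builds a throwaway folder tree under tempname purely for demonstration, and the variable names are illustrative.

```matlab
% Build a temporary tree with one index.html per client folder.
root    = tempname;
clients = { 'anderson', 'paul-judd', 'kim-j-clijsters' };
for k = 1 : numel( clients )
    mkdir( fullfile( root, clients{k} ) );
    fid = fopen( fullfile( root, clients{k}, 'index.html' ), 'w' );
    fclose( fid );
end

% List every index.html below root, whatever the folder is called.
sad = dir( fullfile( root, '**', 'index.html' ) );

% The last part of each folder path is the client name.
[~, names] = cellfun( @fileparts, {sad.folder}, 'UniformOutput', false );

% Remove the temporary tree again.
rmdir( root, 's' );
```

fileparts treats a trailing '.xyz' in a folder name as an extension, so names containing dots would need the strsplit approach on filesep instead.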
per isakson on 9 Apr 2020
Edited: per isakson on 9 Apr 2020
I wrote an addition to the answer yesterday, but I failed to submit. Better luck this time.
That is a lot of effort, thank you. The assumptions themselves are great. Unfortunately, the last assumption does not hold for the given index.html. It turns out that just beneath the block named 'Numbers' there is another block with the same name, 'Numbers'. The difference between the two is that the first one is for linear plotting, and the second one is for logarithmic. There is no way that I can manually delete the second (unneeded logarithmic) block from every single index.html file.
Because of this violation, the code gives the following error:
Unable to perform assignment because the size of the left side is 1-by-2 and the size of the right side is 1-by-3.
Error in read_client_data (line 11)
client_data(jj,:) = { client, series(strcmp({series.name},name)).data }
Truly, thanks for at least bringing it to this stage.
b on 9 Apr 2020
P.S. The run time is just fine. For a list of ~200 clients (truncated from a bigger list for checking purposes), the run time is slightly less than one minute on a machine with 64 GB RAM, a 4 GHz i7, and an M.2 drive.
per isakson on 10 Apr 2020
Edited: per isakson on 10 Apr 2020
"The difference between the two is that the first one is for linear plotting, and the second one is for logarithmic." But how can I know for sure which is which? May I add to the assumptions that the first one is always "linear"? If so, replace the line
client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
by
ix_name = find( strcmp({series.name},name), 1, 'first' ); % "linear" is first
client_data(jj,:) = { client, series(ix_name).data };
"The assumptions themselves are great." Thank you for saying that. Half a century ago I was told that "the longer you keep away from the keyboard the better a program", i.e., it pays off to do a bit of planning ahead.
b on 10 Apr 2020
Yes! Thank you! No errors, and the data field is now easy to work with.
Although I am more concerned about the further analysis, this has been a great learning experience. I am so happy that you responded to this question.
This code has been a life-saver, but there is just one more issue. The 'data' field sometimes has null values:
series: [{
name: 'Numbers',
color: '#33CCFF',
lineWidth: 5,
data: [null,null,null,null,45,78,84,91,111,125,178,231,274,283,303,333] }],
in which case,
>> series(strcmp({series.name},'Numbers')).data
ans =
[]
instead of
>> series(strcmp({series.name},'Numbers')).data
ans =
0 0 0 0 45 78 84 91 111 125 178 231 274 283 303 333
Why do the null values give an empty data field?
A nice thing with standards is that there are so many to choose from. Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database [Wikipedia]. However, MATLAB doesn't honor null: str2num cannot evaluate text that contains null, so it returns an empty array.
Replace the statement
series(jj).data = str2num( txt ); %#ok<ST2NM>
by
out = textscan( txt , '%f' ...
, 'CollectOutput' , true ...
, 'Delimiter' , ',' ...
, 'EmptyValue' , 0 ...
, 'TreatAsEmpty' , 'null' ...
, 'Whitespace' , ' \t[]' );
series(jj).data = reshape( out{:}, 1,[] );
and read about textscan in the documentation.
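As an alternative sketch, not the method used in this answer, the same text can be parsed with str2double, which returns NaN for anything that is not a number. That keeps the missing entries visible until you decide how to fill them. The sample text and the variable names items, vals, and filled are illustrative.

```matlab
% Text as captured after "data:" (sample values).
txt = ' [null,null,45,78,84] ';

% Drop brackets and whitespace, then split on the commas.
items = strsplit( regexprep( txt, '[\[\]\s]', '' ), ',' );

% 'null' is not numeric text, so str2double yields NaN for it.
vals = str2double( items );

% Replace the NaN placeholders with 0 only if zeros are really wanted.
filled = vals;
filled( isnan( filled ) ) = 0;
```

Keeping NaN instead of 0 may be preferable for later analysis, because functions such as mean and sum can skip NaN via the 'omitnan' option while a 0 would be counted as a real value.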
b on 10 May 2020
LOL on the tragedy of being Null.
The code section works nicely with output as needed.
Indebted once again.


Asked: b on 7 Apr 2020

Commented: b on 10 May 2020
