Extracting Data field of a Series in HTML file

Question


In an HTML file, there is a section like this:

        series: [{
            name: 'Numbers',
            color: '#33CCFF',
            lineWidth: 5,
            data: [45,78,84,91,111,125,178,231,274,283,303,333]        }],

How can I extract the 'data' field into an array in MATLAB code?

There are many such series in the same HTML file, each with a different 'name' field, for example name: 'Total Value', 'Log Scale', 'Base Value', etc.

4 comments

Mohammad Sami on 7 Apr 2020

Are you parsing the HTML in MATLAB as a char array? regexp works on string, cellstr, or char data.

You can easily change the pattern to name: \'Numbers\'

b on 7 Apr 2020

I am trying to do the following:

url="c:\finCase\case1.html";
code=webread(url);
tree=htmlTree(code);
selector="series";
subtrees=findElement(tree,selector);

The subtrees variable is empty, whereas it should contain all the series corresponding to the various names ('Numbers', 'Total Value', 'Log Scale', 'Base Value', etc.).


Answer 1 (accepted)

per isakson on 7 Apr 2020

Edited: per isakson on 9 May 2020

cssm.txt

I misunderstood your question. This is a bit of overkill.

Assumptions

the string, series:, always indicates the start of a block of interest

I created a sample file, cssm.txt, which I uploaded. (MATLAB Answers doesn't allow the extension .html.)

This script reads all blocks:

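%% Read the file and split it into one block per "series:" entry
chr = fileread('cssm.txt');
% Each match runs from just after "series:" to the closing "}],"
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%% Parse the fields of each block into a struct array
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] ); 
for jj = 1 : len
    
    % name: drop the quotes; makeValidName also removes whitespace,
    % so 'Total Value' is stored as 'TotalValue'
    txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).name = matlab.lang.makeValidName( txt );
    
    % color: drop the quotes and the surrounding blanks
    txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).color = strtrim( txt );
    
    % lineWidth: a numeric scalar
    txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
    series(jj).lineWidth = str2double( txt );
    
    % data: the bracketed list, converted to a numeric row vector
    txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
    series(jj).data = str2num( txt );  %#ok<ST2NM>
end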

and this extracts only the series whose name is 'Numbers', not the other series:

>> series(strcmp({series.name},'Numbers')).data
ans =
    45    78    84    91   111   125   178   231   274   283   303   333
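
To try the patterns without the uploaded file, here is a minimal sketch that runs the same regular expressions on an inline string. The sample text (the second series and all the values) is made up for illustration, and the variable names blocks, names, and data are not from the script above; only the patterns are.

```matlab
% Inline sample standing in for the contents of cssm.txt (made-up values).
chr = sprintf([ ...
    'series: [{\n' ...
    '    name: ''Numbers'',\n' ...
    '    color: ''#33CCFF'',\n' ...
    '    lineWidth: 5,\n' ...
    '    data: [45,78,84]        }],\n' ...
    'series: [{\n' ...
    '    name: ''Total Value'',\n' ...
    '    color: ''#FF0000'',\n' ...
    '    lineWidth: 2,\n' ...
    '    data: [1,2,3,4]        }],\n' ]);

% One match per block: from just after "series:" up to the closing "}],"
blocks = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );

% Name of every block, with quotes and surrounding blanks removed
names = regexp( blocks, '(?<=name\:)[^,]+', 'match', 'once' );
names = strtrim( strrep( names, '''', '' ) );

% Data of the block named 'Numbers'
txt  = regexp( blocks{strcmp( names, 'Numbers' )}, '(?<=data\:)[^}]+', 'match', 'once' );
data = str2num( txt );  %#ok<ST2NM>
```

Here strtrim keeps 'Total Value' as written, whereas matlab.lang.makeValidName in the script above would store it as 'TotalValue'.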
    
    

In response to comment below

Assumptions

the string, series:, always indicates the start of a block of interest
the string, }], indicates the end of a block of interest
all html-files of interest are named index.html
all files named index.html are of interest
all html-files of interest are in subfolders under a root-folder, ...\finCase
every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers
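
The folder layout assumed above can be tried out on a throwaway tree. This is a rough sketch, assuming R2016b or later (for the '**' wildcard in dir); the temporary root and the two client folders are made up for illustration.

```matlab
root    = tempname;                    % hypothetical stand-in for the finCase root
clients = {'anderson', 'simmi'};       % made-up client folders
mkdir( root );
for k = 1 : numel( clients )
    mkdir( root, clients{k} );
    % An empty index.html is enough to test the file search
    fclose( fopen( fullfile( root, clients{k}, 'index.html' ), 'w' ) );
end

% '**' makes dir search every subfolder below root
sad = dir( fullfile( root, '**', 'index.html' ) );

% The client name is the innermost folder that holds the file
found = cell( 1, numel( sad ) );
for k = 1 : numel( sad )
    parts    = strsplit( sad(k).folder, filesep );
    found{k} = parts{end};
end
disp( found )

rmdir( root, 's' );                    % remove the throwaway tree
```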

The overkill is still there. However, reading and parsing four html-files (copies of cssm.txt) takes less than 10 ms.

Try

>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
  4×2 cell array
    {'anderson'       }    {1×9  double}
    {'kim-j-clijsters'}    {1×10 double}
    {'paul-judd'      }    {1×11 double}
    {'simmi'          }    {1×12 double}
>> 

where (in one m-file)

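function    client_data = read_client_data( root, file, name )
    % Return an N-by-2 cell array {client, data}, one row for each file
    % named FILE found below ROOT. The client is the name of the folder
    % holding the file; data is the data field of the series named NAME.
    
    sad = dir( fullfile( root, '**', file ) );  % '**' searches all subfolders
    len = length( sad );
    client_data = cell( len, 2 );
    for jj = 1 : len 
        cac = strsplit( sad(jj).folder, filesep );
        client = cac{end};                      % innermost folder name
        series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
        % Assumes exactly one series in the file is named NAME
        client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
    end
end
function    series = read_one_file_( file )
    % Parse every "series:" block of FILE into a struct array.
    
    chr = fileread( file );
    % Each match runs from just after "series:" to the closing "}],"
    cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
    
    len = length( cac );
    series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
    
    for jj = 1 : len
        
        % name: drop the quotes and the surrounding blanks
        txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).name = strtrim( txt );
        
        % color: drop the quotes and the surrounding blanks
        txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).color = strtrim( txt );
        
        % lineWidth: a numeric scalar
        txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
        series(jj).lineWidth = str2double( txt );
        
        % data: the bracketed list, converted to a numeric row vector
        txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
        series(jj).data = str2num( txt );  %#ok<ST2NM>
        
    end
end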

TODO: add error handling and comments
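
As one possible starting point for the error handling, the lookup by name could be guarded. When no series matches, series(...).data expands to nothing, so the cell row has one element instead of two and the assignment into client_data fails. A minimal sketch, with a made-up struct array and hypothetical warning/error identifiers:

```matlab
% Made-up stand-in for the struct array returned by read_one_file_
series = struct( 'name', {'Numbers', 'Total Value'}, ...
                 'data', {[45 78 84], [1 2 3 4]} );
name   = 'Base Value';                 % deliberately absent from the sample

is_match = strcmp( {series.name}, name );
switch nnz( is_match )
    case 1
        data = series(is_match).data;  % the expected case: exactly one match
    case 0
        % Hypothetical identifier; return empty so the caller can continue
        warning( 'read_client_data:noMatch', 'No series named "%s".', name );
        data = [];
    otherwise
        error( 'read_client_data:ambiguous', ...
               '%d series are named "%s".', nnz( is_match ), name );
end
```

Whether a missing series should warn and return empty, or stop with an error, depends on how the results are used; the choice here is only an illustration.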

10 comments

per isakson on 9 May 2020

A nice thing with standards is that there are so many to choose from. Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database [Wikipedia]. However, MATLAB doesn't honor Null.

Replace the statement

series(jj).data = str2num( txt ); %#ok<ST2NM>

by

out = textscan( txt             , '%f'      ...
            ,   'CollectOutput' , true      ...  
            ,   'Delimiter'     , ','       ...
            ,   'EmptyValue'    , 0         ...      
            ,   'TreatAsEmpty'  , 'null'    ...
            ,   'Whitespace'    , ' \t[]'   );
series(jj).data = reshape( out{:}, 1,[] );

and read about textscan in the documentation.
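
The textscan call above sets 'EmptyValue' to 0, so null entries are read as zeros. If a zero could be mistaken for a real value, one alternative is to keep the gap as NaN. This is a minimal sketch that is not from the thread, with a made-up input string; it relies on str2double returning NaN for text it cannot parse.

```matlab
txt   = ' [45,null,84]        ';               % made-up data field with a gap
clean = regexprep( txt, '[\[\]\s]', '' );      % strip brackets and blanks
vals  = str2double( strsplit( clean, ',' ) );  % 'null' is not a number -> NaN
```

Functions such as mean then need the 'omitnan' option, which is the price of keeping missing values distinguishable from real zeros.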

b on 10 May 2020

LOL on the tragedy of being Null.

The code section works nicely with output as needed.

Indebted once again.
