HTML Page source info

Question

b on 26 Nov 2020

Direct link to this question

https://la.mathworks.com/matlabcentral/answers/663243-html-page-source-info

Commented: Rik on 3 Dec 2020

Hello, one often comes across a series of numbered webpages

basePage.html?page=2
basePage.html?page=3

and so forth, wherein there are several fields identified by their labels:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>

and so on.

How can the "textOfInterest" of one particular parameter, say Parameter2, be collected for every Name* on every page,

basePage.html?page=1toInf

and written (exported) to one text file, say Parameter2.txt?

The "textOfInterest" is often alphanumeric and may also contain special characters such as !@#$%.

Thanks.
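The page loop and file export asked about here could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the `pages` cell array and the `fetchPage` handle are hypothetical stand-ins so the code runs offline, and the stop condition (a page with no matches ends the loop) is an assumption about how the site behaves past its last page.

```matlab
% Stand-in for the real download so the sketch runs offline. For a live site,
% fetchPage could be replaced with something like
%   fetchPage = @(k) webread(sprintf('%s?page=%d', baseUrl, k));
% where baseUrl is the (hypothetical) address of basePage.html.
pages = { ...
    ['<h2 class="category-heading">Name1</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">a!@#</div>'], ...
    ['<h2 class="category-heading">Name2</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">b$%</div>'], ...
    ''};                                   % an empty page marks the end
fetchPage = @(k) pages{k};

param = 2;
RE = sprintf('<label>Parameter%d : </label> <div class="category-related">([^<]*)</div>', param);

results = {};
k = 1;
while true
    src = fetchPage(k);
    tok = regexp(src, RE, 'tokens');       % cell array of 1x1 cells
    if isempty(tok), break, end            % assumed stop condition: no matches
    results = [results, cellfun(@(c) c{1}, tok, 'UniformOutput', false)]; %#ok<AGROW>
    k = k + 1;
end

% One entry per line; '%s' writes special characters such as !@#$% literally.
outFile = sprintf('Parameter%d.txt', param);
fid = fopen(outFile, 'w');
fprintf(fid, '%s\n', results{:});
fclose(fid);
```

Passing the text through `'%s'` rather than using it as the format string itself matters here, because a `%` inside the extracted text would otherwise be interpreted as a format specifier.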

b on 1 Dec 2020

Initially, I was hesitant to download this file because I thought it was religious or something of that sort. But I am happy to have downloaded it. It is immensely useful and 'on the money' for this thread.

My interest is in the function button_Callback in BibleDownloader.m. The webpage is saved in the variable called 'data'. Since finding <div class="pagination"> is right in the ballpark of my initial query, I was excited to see the output and experiment with the case 'NB2014' inside this function. Unfortunately, execution doesn't seem to reach this branch, since I was unable to retrieve either 'data' or the indices idx*. All of these indices (idx, idx2 and idx3) would be useful to me. How can I reach this part of the code and access them?

Also, perhaps you can suggest one regexp line to pull out 'textOfInterest' from

<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

and, better still, if you already have something like the BibleDownloader m-file that uses regexp to extract the text between <div class> and </div> in this type of structure, that would be great.

Rik on 1 Dec 2020

Edited: Rik on 1 Dec 2020

The goal of BibleDownloader is religious (although you can of course use the text of a Bible translation for non-religious purposes as well), but the code isn't.

Did you try adapting any of the code? I'll post some code as an answer.

Answer 1

Rik on 1 Dec 2020

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/663243-html-page-source-info#answer_561193

One possibility with strfind:

% d is a char vector holding the HTML source of one page
close_div=strfind(d,'</div>');
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    end_of_text=close_div(close_div>=position(n));%>= so an empty field finds its own </div>
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end

Or with a regexp:

d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
    ' : </label> <div class="category-related">',...
    '(',... % use parentheses to capture a token
    '[^<]*',... % this matches any number of characters other than <
    ')',...
    '</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)

You can also adapt the expression to use a lookahead for </div>, so that a lazy .*? can replace [^<]* (a greedy .* would run on to the last </div>).
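A minimal sketch of that lookahead variant, for illustration only. The sample string `d` and the variable names are made up here; the second field deliberately contains a `<br>` tag, which `[^<]*` would not get past. Note that the quantifier has to be lazy (`.*?`) for this to capture one field at a time.

```matlab
d = ['<label>Parameter2 : </label> <div class="category-related">first</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">line1<br>line2</div>'];
prefix = '<label>Parameter2 : </label> <div class="category-related">';

% Lazy .*? stops at the first position where the lookahead (?=</div>) holds,
% so text containing other tags such as <br> is still captured whole.
t = regexp(d, [prefix '(.*?)(?=</div>)'], 'tokens');
texts = cellfun(@(c) c{1}, t, 'UniformOutput', false);

% For contrast: a greedy .* backtracks only as far as the LAST </div>,
% which collapses everything into a single long match.
greedy = regexp(d, [prefix '(.*)(?=</div>)'], 'tokens');
```

The trade-off is that the lazy form tolerates markup inside the field, while `[^<]*` is stricter and fails on any field that contains a tag.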

b on 1 Dec 2020

Thank you.

But I have run into a problem with the following:

I am trying to extract two parameters simultaneously, Parameter1 and Parameter2. It often happens that Parameter1 is present but Parameter2 is missing. That is, the structure is like this:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

The same problem occurs when I try to extract all three parameters.

When all three parameters are extracted, the objective is to get ' ' (no value) wherever a parameter is missing, rather than skipping it completely. Skipping it would misalign the entries, whereas a placeholder simply leaves the corresponding entry blank when exported to the output text file.

In the first (strfind) code, I tried to replicate the 'for loop' three times for the three parameters, but quickly ran into problems.

b on 2 Dec 2020

Thanks for the link.

Downloaded readfile from GitHub. The 'elements' output seems promising, except: what are those ->->-> arrows in front of all the fields of interest? Anyway, glad to have got to this point.

But the same situation arises with all three approaches: when the mail field is missing, how can I write 'NULL' in the output file and continue with the loop?

Name1    mail1
Name2    missing
Name3    mail3
Name4    mail4

The strfind and regexp approaches give

Name{1}='Name1'
Name{2}='Name2'
Name{3}='Name3'
Name{4}='Name4'

and

Parameter{1}='mail1'
Parameter{2}='mail3'
Parameter{3}='mail4'

How can I bypass the 'for loop' and at the same time print 'NULL' in the corresponding Excel row-column entry? In this example, (row=2,col=2) will be 'NULL', and (row=3,col=2) will be Parameter{2}.

It is not a question of 'skipping if not found', because numel(position) has already been evaluated (4 here for the Name field and 3 for the Parameter), so the loop length seems to be hardcoded.

Rik on 2 Dec 2020

Those arrows are probably newline characters. What release are you using?

I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
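Parsing each element separately might look like the sketch below. It assumes every entry starts with an `<h2 class="category-heading">` tag, as in the samples in this thread; the sample data, the `placeholder` value and the `out` layout are illustrative choices, not code from the thread.

```matlab
d = ['<h2 class="category-heading">Name1</h2>' ...
     '<label>Parameter1 : </label> <div class="category-related">p1a</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">p2a</div>' ...
     '<h2 class="category-heading">Name2</h2>' ...
     '<label>Parameter1 : </label> <div class="category-related">p1b</div>' ...
     '<h2 class="category-heading">Name3</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">p2c</div>'];

% Split the source into one chunk per heading. The piece before the first
% heading carries no entry, so it is dropped.
blocks = regexp(d, '<h2 class="category-heading">', 'split');
blocks = blocks(2:end);

params = 1:2;
placeholder = 'NULL';                      % assumed marker for a missing field
out = cell(numel(blocks), 1 + numel(params));
for b = 1:numel(blocks)
    % Each chunk now starts with the name, up to the closing </h2>.
    out{b,1} = regexp(blocks{b}, '^[^<]*', 'match', 'once');
    for p = 1:numel(params)
        RE = sprintf('<label>Parameter%d : </label> <div class="category-related">([^<]*)</div>', params(p));
        tok = regexp(blocks{b}, RE, 'tokens', 'once');   % empty if absent
        if isempty(tok)
            out{b,1+p} = placeholder;
        else
            out{b,1+p} = tok{1};
        end
    end
end
```

Because every row of `out` is filled from a single chunk, a missing field can never shift the later entries. On R2019a or newer, `writecell(out,'out.xlsx')` would be one way to export the result to a spreadsheet.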

Answer 2

b on 3 Dec 2020

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/663243-html-page-source-info#answer_563538

That is exactly how I am doing it. By parsing it separately, there is no way to correlate which Name-field has the corresponding Mail-field missing. It parses all the Name-fields, then it parses all the mail-fields, as a sequential process.

What modification should be made to the code so that it prints 'Not Found' when the mail field is missing in the corresponding iteration? Is there a way to get the index values of the missing Mail-fields?
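One way to correlate the two sequential passes is to keep the match positions that regexp can return and assign each mail to the heading it follows. The sketch below is only an illustration: it treats Parameter2 as the mail field (an assumption), and the sample data and variable names are made up.

```matlab
d = ['<h2 class="category-heading">Name1</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">mail1</div>' ...
     '<h2 class="category-heading">Name2</h2>' ...
     '<h2 class="category-heading">Name3</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">mail3</div>' ...
     '<h2 class="category-heading">Name4</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">mail4</div>'];

% Parse names and mails in two passes, but keep where each match starts.
[names, nameStart] = regexp(d, '<h2 class="category-heading">([^<]*)</h2>', 'tokens', 'start');
mailRE = '<label>Parameter2 : </label> <div class="category-related">([^<]*)</div>';
[mails, mailStart] = regexp(d, mailRE, 'tokens', 'start');
names = cellfun(@(c) c{1}, names, 'UniformOutput', false);
mails = cellfun(@(c) c{1}, mails, 'UniformOutput', false);

% Each heading owns the text up to the next heading (or the end of d).
blockEnd = [nameStart(2:end) - 1, numel(d)];
mailOf = repmat({'Not Found'}, size(names));
found = false(size(names));
for n = 1:numel(names)
    hit = find(mailStart > nameStart(n) & mailStart <= blockEnd(n), 1);
    if ~isempty(hit)
        mailOf{n} = mails{hit};
        found(n) = true;
    end
end
missingIdx = find(~found);                 % indices of names without a mail
```

The loop still runs once per name, but a name without a mail inside its own span keeps the 'Not Found' default, and `missingIdx` lists exactly those entries.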

b on 3 Dec 2020

I am overwhelmed by the way you have patiently worked with me on this thread. I think I will close this elaborate thread here, but not before posting this limerick:

There was once a man named Rik, 
Who wrote matlab codes so quick, 
To the topic, they were relevant
The codes themselves so elegant, 
His m-files, sir, were completely sick!

Enjoy your freedom from this thread.

Rik on 3 Dec 2020

You're welcome (and thanks for the limerick XD).

If you have a follow-up question, feel free to post a link to it here.
