How to access itemprop = "name" from within a data structure in HTML code using Matlab?

HTML code
<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>
<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>
I want to extract only the text of the element with itemprop="name".
Using the selector feature of the Text Analytics Toolbox,
I can set selector = "DIV.itemHeader".
itemHeader is the class of the div that contains both of these div elements, so the text inside both of them is extracted.
I only want the text from the element with itemprop="name".
How do I go about doing that?

3 comments

Yup, that's correct
Unfortunately I do not have that toolbox to test with.
My own implementation would probably be to use regexp with named tokens and the 'names' option.


Accepted Answer

TADA on 26 Mar 2019
Edited: TADA on 27 Mar 2019
I don't have the toolbox you mentioned, but it most likely uses XPath to parse the HTML...
I think the best options are XPath or regular expressions.
As far as I know, to use XPath in MATLAB you have to use Java classes, but regular expressions are built into MATLAB and they are very convenient.
The regex pattern could be something like this:
str = ['<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>'...
'<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>'];
match = regexp(str, '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>', 'names')
match =
struct with fields:
data: ' Information which I want to extract '
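If a plain character vector is wanted rather than a struct, the named token can be unpacked and trimmed. The following is only a minimal sketch using built-in functions; the variable names html, pattern, and name are illustrative, and it assumes the markup keeps the attribute layout shown above (a regular expression is not a general HTML parser).

```matlab
% Sample markup from the question, joined into one character vector
html = ['<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>' ...
        '<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>'];

% Same pattern as above: a div whose attributes include itemprop="name"
pattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';

match = regexp(html, pattern, 'names');

if isempty(match)
    name = '';                      % no element with itemprop="name" was found
else
    name = strtrim(match(1).data);  % first match, surrounding whitespace removed
end

disp(name)
```

Checking isempty(match) before reading match(1).data avoids an indexing error when the pattern does not match.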

11 comments

Anyway, I edited my answer with a regular expression solution to your problem.
function [name] = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
end
Sorry, but based on this function, how do I integrate your code into it?
I realize you don't have it, but this all uses the Text Analytics Toolbox.
Would I not still need some sort of selector to point to the correct tag and class before going further to extract the information?
What do you get back from this line?
name = extractHTMLText(nameSection);
Do you get the HTML you mentioned?
If so, you can simply run the regexp line on this string:
function name = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
% I'm not sure, because I have neither your original HTML nor that toolbox,
% but I suspect that this line returns the HTML you mentioned
html = extractHTMLText(nameSection);
regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
name = regexp(html, regexPattern, 'names');
end
Come to think of it, "DIV.itemHeader" is a CSS selector.
If you post the original HTML document you are mining data from, it will help.
This is a wild guess, but if I'm right you can try this instead:
selector = "DIV.itemHeader.itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
or this (although I'm not sure findElement supports attribute selectors):
selector = "DIV.itemHeader[itemprop=""name""]";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
N/A on 28 Mar 2019
Edited: N/A on 28 Mar 2019
NOTE: When the functions were run, the output lines had no trailing semicolons, so the results were displayed. Please ignore the semicolons shown in the code below.
When I run this
function [name] = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
end
I get this in the command window
name =
Information I want
Information I don't want
When I run this
selector = "DIV.itemHeader.itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
I get this in the command window
name =
0×1 empty double column vector
When I run this
selector = "DIV.itemHeader[itemprop=""name""]";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
I get this in the command window
Error using htmlTree/findElement (line 99)
Attribute selector 'itemprop="name"' is not supported.
When I run this
function name = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
html = extractHTMLText(nameSection);
regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
name = regexp(html, regexPattern, 'names');
end
I get this in the command window
name =
0×0 empty struct array with fields:
data
I want the output of
name = regexp(html, regexPattern, 'names');
to give me this in the command window
name =
Information I want
Here is the website I am trying to get HTML information from
You will notice that
Manticore of Darkness - IOC-067 - Ultra Rare Unlimited
and
Invasion of Chaos [IOC] Unlimited Singles
are both in the
<div class="itemHeader"><div class="itemName largestFont" itemprop="name">Manticore of Darkness - IOC-067 - Ultra Rare Unlimited</div><div class="itemCategory largeFont"><a href="/invasion-of-chaos-ioc-unlimited-singles/11257">Invasion of Chaos [IOC] Unlimited Singles</a></div></div>
I just want
Manticore of Darkness - IOC-067 - Ultra Rare Unlimited
NOT
Invasion of Chaos [IOC] Unlimited Singles
Thanks!!
this was a real long shot:
selector = "DIV.itemHeader[itemprop=""name""]";
The regex doesn't work because extractHTMLText returns the extracted text as a string array, not the HTML...
Can you post your HTML document so I can at least try the CSS selectors?
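The empty result reported above can be reproduced without the toolbox. In this small sketch, plainText is only a stand-in (an assumption) for what extractHTMLText returns, and rawHtml is a shortened version of the markup; the pattern is the one used throughout this thread.

```matlab
pattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';

% Raw markup still contains the tags the pattern looks for
rawHtml = '<div class="itemName largestFont" itemprop="name">Information I want</div>';

% Hypothetical stand-in for extracted text: the tags are already gone
plainText = 'Information I want';

matchRaw  = regexp(rawHtml,   pattern, 'names');
matchText = regexp(plainText, pattern, 'names');

fprintf('Match on raw HTML:   %d\n', ~isempty(matchRaw));
fprintf('Match on plain text: %d\n', ~isempty(matchText));
```

Because the pattern starts with a literal <div, it can only match text that still contains the markup.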
Also, I made a mistake with the selector earlier.
Try this instead:
% this CSS selector is now valid if I got the structure of your HTML right
% and if MATLAB handles CSS selectors correctly
selector = "DIV.itemHeader .itemName";
or this (probably won't work either):
selector = "DIV.itemHeader [itemprop=""name""]"
or maybe this (I'm not sure, as htmlTree is only available starting with R2018b, so I don't have it):
function name = getTitle(tree)
selector = "DIV.itemHeader";
nameSection = findElement(tree, selector);
html = nameSection.Content; % hopefully this will return the inner HTML
regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
match = regexp(html, regexPattern, 'names');
name = match.data;
end
OK,
so the element you want to find is the only one with the "itemName" CSS class.
The simplest CSS selector for it would be ".itemName".
This should work:
function name = getTitle(tree)
selector = ".itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);
end
HALLELUJAH! :D
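For reference, the ".itemName" selector can be tried against the snippet posted earlier in this thread without fetching the page. This is a sketch that assumes the Text Analytics Toolbox is installed (htmlTree requires R2018b or later); the variable name code is illustrative.

```matlab
% Requires Text Analytics Toolbox (htmlTree, findElement, extractHTMLText)
code = ['<div class="itemHeader">' ...
        '<div class="itemName largestFont" itemprop="name">Manticore of Darkness - IOC-067 - Ultra Rare Unlimited</div>' ...
        '<div class="itemCategory largeFont"><a href="/invasion-of-chaos-ioc-unlimited-singles/11257">Invasion of Chaos [IOC] Unlimited Singles</a></div>' ...
        '</div>'];

tree = htmlTree(code);

% Class selector: picks only the element carrying the itemName class
nameSection = findElement(tree, ".itemName");
name = extractHTMLText(nameSection);

disp(name)
```

Selecting by class sidesteps the attribute selector, which findElement rejected in the error message quoted above.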


More Answers (1)

Sean de Wolski on 28 Mar 2019
Edited: Sean de Wolski on 28 Mar 2019
Using htmlTree, this is trivial:
tree = htmlTree(fileread('yourfile.html'))
div = tree.findElement('div')
item = div.getAttribute("itemprop")
names = item == "name"
div(names).extractHTMLText

4 comments

This also worked. However, while the output of
div(names).extractHTMLText
was what I wanted, when I wrapped the code in a function and assigned its return value to a variable with
name = getName(tree);
the output was
name =
377×1 logical array
and then it spat out a column of 377 zeros
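A 377×1 logical array is what the comparison item == "name" produces, so one possible explanation (an assumption, since the body of getName was not posted) is that the function returned the logical mask instead of the extracted text. A hedged sketch of a getName that returns the text, tried on the snippet from this thread, might look like this; it assumes the Text Analytics Toolbox and a MATLAB release that allows local functions at the end of a script (R2016b or later).

```matlab
% Requires Text Analytics Toolbox
code = ['<div class="itemHeader">' ...
        '<div class="itemName largestFont" itemprop="name">Manticore of Darkness - IOC-067 - Ultra Rare Unlimited</div>' ...
        '<div class="itemCategory largeFont"><a href="/invasion-of-chaos-ioc-unlimited-singles/11257">Invasion of Chaos [IOC] Unlimited Singles</a></div>' ...
        '</div>'];
tree = htmlTree(code);

name = getName(tree);
disp(name)

function name = getName(tree)
    % Hypothetical reconstruction of the asker's function
    divs = findElement(tree, "div");             % every div in the document
    itemprops = getAttribute(divs, "itemprop");  % one attribute value per div
    isName = itemprops == "name";                % logical mask, one entry per div
    name = extractHTMLText(divs(isName));        % return the text, not the mask
end
```

The key point is that the output variable is assigned from extractHTMLText on the last line rather than from the comparison.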
N/A on 28 Mar 2019
Edited: N/A on 28 Mar 2019
You gotta give TADA and Walter a raise, they've been helping me over literally the past few days. At this point, I might as well throw them on my script as co-authors :D
Neither Walter Robertson nor I (as far as I know) work for MathWorks... I'd gladly take that raise though :)
@TADA, we're always hiring into MathWorks and have a distributor in Israel who may or may not be looking for MATLAB users.
@Shivam, this returns exactly what you want from your comment above:
s = string(webread("https://beta.trollandtoad.com/yugioh/invasion-of-chaos-ioc-unlimited-singles/manticore-of-darkness-ioc-067-ultra-rare-unlimited/1155511", weboptions('Timeout', 15)));
%%
tree = htmlTree(s)
%%
div = tree.findElement('div')
%%
item = div.getAttribute("itemprop")
%%
names = item == "name"
%%
div(names).extractHTMLText
ans =
"Manticore of Darkness - IOC-067 - Ultra Rare Unlimited"

Release

R2019a

Asked:

N/A
on 26 Mar 2019

Commented:

on 29 Mar 2019
