Is there a way to pull a specific link after using webread() to get the content from a page?

Essentially, I'm using webread() to obtain the contents of a Google search. If there's a Wikipedia link in the contents, I want to extract it. I've been using regexp(content,exp,'match'), but I'm confused about how to create an expression that will match the Wikipedia link. I know that doing something such as:
regexp(content,'https?://en\.?\w*\.?\w*','match')
will get me the 'https://en.wikipedia.org' portion of the link, but this expression seems overly complicated for just that part. I could continue in the same way for the whole link, but the number of words in the Wikipedia link will vary, so I'm unsure how to capture just the link without accidentally taking the text that follows it
(e.g. https://en.wikipedia.org/wiki/List_of_landmark_court_decisions_in_the_United_States or https://en.wikipedia.org/wiki/Banana).
In the text that is read, the link appears to be followed by '&amp;'. Perhaps I can take all the characters from 'http' up to '&amp;', but it would be nice to get some tips on how to create an expression for that!
Thanks for the help!
1 comment
Matthew Cao on 1 May 2018
Edited: Matthew Cao on 1 May 2018
OK, I could simply replace the '\.?\w*\.?\w*' part of the expression with '\S+', which matches any run of consecutive non-whitespace characters. This pulls the Wikipedia link, but also a lot of what comes after it:
https://en.wikipedia.org/wiki/List_of_landmark_court_decisions_in_the_United_States&amp;sa=U&.............
I need it to stop right at the '&amp;'!
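
The overshoot described above can be reproduced on a small made-up string. This is only a sketch: the `content` value below is a hypothetical stand-in for what webread() returns, not real search output.

```matlab
% Hypothetical sample text; real webread() output will differ.
content = '<a href="/url?q=https://en.wikipedia.org/wiki/Banana&amp;sa=U&amp;ved=abc">Banana</a> more text';

% \S+ is greedy and only stops at whitespace, so the match
% runs past the end of the link and into the query parameters.
greedy = regexp(content, 'https?://en\.\S+', 'match');
disp(greedy{1})
```

Because '\S' also matches '&', ';' and '=', nothing in this pattern tells the match where the link ends; only the first whitespace character does.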


Accepted Answer

Matthew Cao on 1 May 2018
I think I've solved it by keeping '\S+' in the expression and adding a lookahead, '(?=&amp;sa)'. That way the expression matches all the characters following 'https?://en.' but stops just before the '&amp;sa' that follows the link.
regexp(content,'https?://en\.\S+(?=&amp;sa)','match')
This returns everything up to, but not including, the '&amp;sa'! If there's a more efficient way of doing this, let me know!
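
As a hedged illustration of the lookahead approach, and of one possible alternative, here is a minimal sketch. The `content` string is a hypothetical stand-in for webread() output, and the variable names are made up for this example. The lazy quantifier '\S+?' is a small variation on the expression above: it stops at the first '&amp;sa' rather than the last one in the same run of non-whitespace characters. The second pattern assumes the link itself contains no literal '&' or double quote.

```matlab
% Hypothetical sample text standing in for webread() output.
content = ['<a href="/url?q=https://en.wikipedia.org/wiki/Banana&amp;sa=U&amp;ved=abc">' ...
           'Banana</a> <a href="/url?q=https://example.com/&amp;sa=U">x</a>'];

% Lookahead version: lazy \S+? stops just before the first "&amp;sa".
viaLookahead = regexp(content, 'https?://en\.\S+?(?=&amp;sa)', 'match');

% Alternative: allow only characters that cannot end the link
% (anything except "&", whitespace, or a double quote).
viaNegClass = regexp(content, 'https?://en\.wikipedia\.org/wiki/[^&\s"]+', 'match');

disp(viaLookahead{1})
disp(viaNegClass{1})
```

The negated character class does not depend on the exact text that follows the link ('sa' here), which may make it less sensitive to changes in how the search page formats its URLs.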
