Extract documents from a website's hyperlinks

23 views (last 30 days)
dsmalenb
dsmalenb on 22 May 2019
Commented: Adrian on 25 Jul 2023
Hello folks!
I have a few websites from which I am trying to pull the files linked by their embedded hyperlinks. Does Matlab have a way to do this? For example, if we look at the website
https://en.wikipedia.org/wiki/Quantum_mechanics, we notice several hyperlinks at the bottom as references. In this case, disregard the earlier hyperlinks that lead to other articles or to those references.
Is there a way to extract these documents automatically via Matlab?
  1 comment
Adrian
Adrian on 25 Jul 2023
Yes, Matlab has the capability to extract files from websites, including those embedded in hyperlinks. To achieve this, you can use the "webread" function in Matlab. This function allows you to read the content of a webpage and then you can parse the HTML to extract the links you're interested in.
Here is a general outline of the steps you can follow:
  1. Use the "webread" function to retrieve the content of the webpage (e.g., https://en.wikipedia.org/wiki/Quantum_mechanics).
  2. Parse the HTML content to identify and extract the hyperlinks you want. For this, you can use regular expressions via "regexp", or the "htmlTree" function from the Text Analytics Toolbox if you have it.
  3. Filter out the relevant links you need, based on specific criteria (e.g., filtering out links that don't lead to references).
  4. Use the "webread" function again to download the files pointed to by the extracted hyperlinks.
It's worth noting that the process may vary depending on the structure of the webpage and how the hyperlinks are embedded in the HTML. Also, make sure to be respectful of website terms of service and check if there are any restrictions on web scraping or downloading files from the site.
Below is a basic example of how you can get started with the process using Matlab's "webread" function to retrieve the webpage's content:
% Step 1: Read the content of the webpage
url = 'https://en.wikipedia.org/wiki/Quantum_mechanics';
html_content = webread(url);

% Step 2: Parse the HTML to extract hyperlinks. For example, this
% regular expression captures the href attribute of each anchor tag;
% adjust it to the specific HTML structure you are dealing with.
links = regexp(html_content, '<a\s+[^>]*href="([^"]+)"', 'tokens');
links = [links{:}];

% Step 3: Filter the links down to the ones you need, based on your
% criteria (e.g. keep only links that point to reference documents).

% Step 4: Download the files pointed to by the hyperlinks. "websave"
% is better suited than "webread" for saving files to disk. Make sure
% to handle file names and saving appropriately.

% Additional steps:
% - Implement error handling for webread/websave and file downloads.
% - Be mindful of website policies and restrictions to avoid any legal issues.
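Steps 3 and 4 could be sketched end to end as follows. Note that the ".pdf" filter, the "downloads" folder name, and the assumption that the extracted links are absolute URLs are all illustrative choices, not part of the original answer; adapt them to the actual page you are scraping:

```matlab
% Sketch: extract anchor hrefs, keep only PDF links, and download each one.
url = 'https://en.wikipedia.org/wiki/Quantum_mechanics';
html_content = webread(url);

% Capture the href attribute of each anchor tag
links = regexp(html_content, '<a\s+[^>]*href="([^"]+)"', 'tokens');
links = [links{:}];                            % flatten to a cell array of char

pdf_links = links(endsWith(links, '.pdf'));    % step 3: filter (assumed criterion)

if ~exist('downloads', 'dir')
    mkdir('downloads');                        % folder for the saved files
end
for k = 1:numel(pdf_links)
    [~, name, ext] = fileparts(pdf_links{k});  % derive a file name from the URL
    try
        websave(fullfile('downloads', [name ext]), pdf_links{k});
    catch err
        warning('Failed to download %s: %s', pdf_links{k}, err.message);
    end
end
```

Deriving file names with fileparts is fragile for URLs that carry query strings, so a real implementation may need a more careful naming scheme.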
Please keep in mind that the actual implementation may be more involved, and you might need to tweak it based on the structure of the webpages you're dealing with. Additionally, web scraping is a complex topic, so it's essential to be mindful of the website's terms of service and to be respectful of their resources and bandwidth.


Accepted Answer

Koundinya
Koundinya on 29 May 2019
That could be done using webread to retrieve data from the webpage and regexp to extract all the hyperlinks in the page by parsing through the retrieved text.
html_text = webread('https://en.wikipedia.org/wiki/Quantum_mechanics');
hyperlinks = regexp(html_text,'<a.*?/a>','match');
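Building on that, the href targets can be pulled out of the matched anchors, and site-relative links resolved to absolute URLs. The capture pattern and the Wikipedia base URL below are assumptions for this particular page:

```matlab
html_text = webread('https://en.wikipedia.org/wiki/Quantum_mechanics');

% Capture the href attribute of each anchor tag
hrefs = regexp(html_text, '<a\s+[^>]*href="([^"]+)"', 'tokens');
hrefs = [hrefs{:}];

% Resolve site-relative links (those starting with '/') against the domain
is_relative = startsWith(hrefs, '/');
hrefs(is_relative) = strcat('https://en.wikipedia.org', hrefs(is_relative));
```

The resolved URLs can then be passed to websave to download the documents themselves.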

More Answers (0)

Categories

Find more on Call Web Services from MATLAB Using HTTP in Help Center and File Exchange.
