How to download multiple files from a website

Chad Greene on 21 Nov 2023
Commented: Dyuman Joshi on 22 Nov 2023
This question has been asked many times in various ways on this forum, but I've never found a simple answer to this very simple question:
It seems like there should be a two-line solution along the lines of:
url_list = get_urls('https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html','extension','.nc');
websave(url_list)
That would work if get_urls were a real function and websave accepted a list of file URLs, saving each file to the current directory.
3 Comments
Chad Greene on 21 Nov 2023
Wow, thank you @Dyuman Joshi!
Dyuman Joshi on 22 Nov 2023
You are welcome!


Accepted Answer

Voss on 21 Nov 2023
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';
% webread() the main page and parse out the links to .nc files
% (the dot in "\.nc" is escaped so it matches a literal period):
data = webread(url);
C = regexp(data,'<a href=".*?(\?[^"]*\.nc)">','tokens');
temp_urls = strcat(url,vertcat(C{:}));
% webread() each linked url:
data = cell(size(temp_urls));
for ii = 1:numel(temp_urls)
    data{ii} = webread(temp_urls{ii});
end
% get the download link in each of those pages:
C = regexp(data,'<a href="([^"]*)">\s*<b>HTTPServer','tokens','once');
% append them to the (sub-)domain of the main URL to get the actual URLs
% for downloading the .nc files:
idx = find(url == '/',3);
nc_urls = strcat(url(1:idx(end)-1),vertcat(C{:}));
% construct file names to save to locally:
[~,filenames,ext] = fileparts(nc_urls);
filenames = strcat(filenames,ext);
% download all the files into the current directory:
for ii = 1:numel(nc_urls)
    websave(filenames{ii},nc_urls{ii});
end
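
To make the URL arithmetic in the last two steps concrete, here is a small offline sketch. Only the main catalog URL comes from the answer; the relative link is a made-up stand-in for one token returned by the second regexp, so treat it as an assumption rather than real page content.

```matlab
% Main catalog URL from the answer above.
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';

% Hypothetical relative download link, standing in for one regexp token.
rel_link = '/thredds/fileServer/global/example_tile.nc';

% The third '/' marks the end of the scheme-plus-host part of the URL.
idx = find(url == '/',3);
host = url(1:idx(end)-1);

% Prepend the host to get an absolute URL, then rebuild the local file name.
nc_url = strcat(host,rel_link);
[~,name,ext] = fileparts(nc_url);
filename = strcat(name,ext);

fprintf('%s\n%s\n%s\n',host,nc_url,filename);
```

Nothing here touches the network, so it can be run to check the string handling before starting a long download.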
3 Comments
Voss on 21 Nov 2023
You're welcome!
Each link on the main page goes to a distinct intermediate page which contains the link to download the actual .nc file.
The first webread/regexp pair collects the URLs of those intermediate pages. The loop then calls webread on each intermediate page, and a second regexp pulls out the download URL, which is the link immediately preceding 'HTTPServer' on each page. Those pages contain several other URLs, and anchoring on 'HTTPServer' was the only way I could think of to be sure of getting the right one.
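
The two regexp steps can be tried offline on small strings. The HTML snippets below are invented for illustration (they are assumptions about the general shape of the pages, not copies of the NOAA markup), and the first pattern is written with the dot escaped as "\.nc".

```matlab
% Hypothetical main-page link: the token is the query string ending in ".nc".
main_page = '<a href="catalog.html?dataset=scan/file_A.nc"><code>file_A.nc</code></a>';
C = regexp(main_page,'<a href=".*?(\?[^"]*\.nc)">','tokens');
query = C{1}{1};

% Hypothetical intermediate page with two links; only the second one is
% followed by the bold 'HTTPServer' label, so only it should match.
page = ['<a href="/thredds/dodsC/global/file_A.nc.html">OPENDAP</a>' newline ...
        '<a href="/thredds/fileServer/global/file_A.nc"> <b>HTTPServer</b></a>'];
tok = regexp(page,'<a href="([^"]*)">\s*<b>HTTPServer','tokens','once');
download_link = tok{1};

disp(query)
disp(download_link)
```

Because [^"]* cannot cross a closing quote, the second pattern can only capture the href that sits directly in front of 'HTTPServer', which is why the other links on the page are skipped.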
Chad Greene on 22 Nov 2023
Ooh, okay, that makes a lot of sense. Thanks @Voss!


Release: R2023b
