How to download multiple files from a website

This question has been asked many times in various ways on this forum, but I've never found a simple answer to this very simple question:
It seems like there should be a two-line solution along the lines of:
url_list = get_urls('https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html','extension','.nc');
websave(url_list)
That is, if get_urls were a real function, and if websave accepted a list of file URLs and saved each of them in the current directory.

3 comments

Dyuman Joshi on 21 Nov 2023
Edited: Dyuman Joshi on 22 Nov 2023
This method works, but it seems extremely slow (probably due to the large file sizes and my poor internet connection at the moment).
webpageurl = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';
%Read the webpage
str = webread(webpageurl);
%Get the hyperlinks from the webpage data
hl = regexp(str,'<a.*?/a>','match')'
hl = 294×1 cell array
    {'<a class="static" href="https://www.ngdc.noaa.gov/thredds/catalog/catalog.html">NCEI THREDDS Data Server</a>'}
    {'<code>ETOPO_2022_v1_15s_N00E000_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E015_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E030_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E045_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E060_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E075_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E090_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E105_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E120_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E135_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E150_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E165_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W015_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W030_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W045_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W060_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W075_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W090_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W105_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W120_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W135_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W150_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W165_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W180_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E000_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E015_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E030_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E045_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E060_surface.nc</code>'}
The pages that the hyperlinks on the given webpage lead to are not the same as the URLs from which the data is downloaded.
%Prefix (base URL) for the download links
fileurl = 'https://www.ngdc.noaa.gov/thredds/fileServer/global';
Please note that I have taken the link corresponding to the HTTP Server download option.
%Ignore the header hyperlink (k = 1) and download the next four files
for k=2:5
    %Extract the part of the hyperlink between 'Scan' and the closing quote
    z = extractBetween(hl{k}, 'Scan', '"');
    %Combine the base URL with that path and download the file with websave()
    new = strcat(fileurl, z{:});
    yo = websave(sprintf('File%d.nc', k-1), new);
end
ls
File1.nc File2.nc File3.nc File4.nc
ncdisp('File3.nc')
Source:
    /users/mss.system.b2I8c7/File3.nc
Format:
    netcdf4_classic
Global Attributes:
    GDAL_AREA_OR_POINT            = 'Area'
    node_offset                   = 1
    GDAL_TIFFTAG_COPYRIGHT        = 'DOC/NOAA/NESDIS/NCEI > National Centers for Environmental Information, NESDIS, NOAA, U.S. Department of Commerce'
    GDAL_TIFFTAG_DATETIME         = 20220929130858
    GDAL_TIFFTAG_IMAGEDESCRIPTION = 'Topography-Bathymetry; EGM2008 height'
    Conventions                   = 'CF-1.5'
    GDAL                          = 'GDAL 3.3.2, released 2021/09/01'
    NCO                           = 'netCDF Operators version 4.9.1 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)'
Dimensions:
    lon = 3600
    lat = 3600
Variables:
    crs
        Size:       1x1
        Dimensions:
        Datatype:   char
        Attributes:
            grid_mapping_name           = 'latitude_longitude'
            long_name                   = 'CRS definition'
            longitude_of_prime_meridian = 0
            semi_major_axis             = 6378137
            inverse_flattening          = 298.2572
            spatial_ref                 = 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'
            GeoTransform                = '30 0.004166666666666667 0 0 0 -0.004166666666666667 '
    lat
        Size:       3600x1
        Dimensions: lat
        Datatype:   double
        Attributes:
            standard_name = 'latitude'
            long_name     = 'latitude'
            units         = 'degrees_north'
    lon
        Size:       3600x1
        Dimensions: lon
        Datatype:   double
        Attributes:
            standard_name = 'longitude'
            long_name     = 'longitude'
            units         = 'degrees_east'
    z
        Size:       3600x3600
        Dimensions: lon,lat
        Datatype:   single
        Attributes:
            long_name     = 'z'
            _FillValue    = -99999
            grid_mapping  = 'crs'
            units         = 'meters'
            positive      = 'up'
            standard_name = 'height'
            vert_crs_name = 'EGM2008'
            vert_crs_epsg = 'EPSG:3855'
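The URL manipulation in this comment can be tried offline on a single hyperlink. The sketch below is only an illustration: the `tag` string is a hypothetical anchor tag whose `href` layout is assumed (it is not copied from the live page), and `download_url` and `local_name` are names introduced here. It also shows one possible way to keep the original file name instead of File1.nc, File2.nc, and so on.

```matlab
% Hypothetical anchor tag; the href layout is an assumption for illustration.
tag = ['<a href="catalog.html?dataset=globalDatasetScan/ETOPO2022/15s/' ...
       '15s_surface_elev_netcdf/ETOPO_2022_v1_15s_N00E000_surface.nc">' ...
       '<code>ETOPO_2022_v1_15s_N00E000_surface.nc</code></a>'];

% Prefix for the HTTP Server download links, as used in the comment above.
fileurl = 'https://www.ngdc.noaa.gov/thredds/fileServer/global';

% extractBetween returns a cell array for char input; take its first entry.
z = extractBetween(tag, 'Scan', '"');
download_url = strcat(fileurl, z{1});

% Reuse the remote file name as the local file name.
[~, name, ext] = fileparts(download_url);
local_name = [name ext];

disp(download_url)
disp(local_name)
```

With a real hyperlink from `hl`, `websave(local_name, download_url)` would then save the file under its original name.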
Chad Greene on 21 Nov 2023
Wow, thank you @Dyuman Joshi!
Dyuman Joshi on 22 Nov 2023
You are welcome!


Accepted Answer

Voss on 21 Nov 2023
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';
% webread() the main page and parse out the links to .nc files:
data = webread(url);
C = regexp(data,'<a href=".*?(\?[^"]*\.nc)">','tokens');
temp_urls = strcat(url,vertcat(C{:}));
% webread() each linked url:
data = cell(size(temp_urls));
for ii = 1:numel(temp_urls)
    data{ii} = webread(temp_urls{ii});
end
% get the download link in each of those pages:
C = regexp(data,'<a href="([^"]*)">\s*<b>HTTPServer','tokens','once');
% append them to the (sub-)domain of the main URL to get the actual URLs
% for downloading the .nc files:
idx = find(url == '/',3);
nc_urls = strcat(url(1:idx(end)-1),vertcat(C{:}));
% construct file names to save to locally:
[~,filenames,ext] = fileparts(nc_urls);
filenames = strcat(filenames,ext);
% download all the files:
for ii = 1:numel(nc_urls)
    websave(filenames{ii},nc_urls{ii});
end
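Since the files are large, it may help to make the final download loop restartable. The following is a minimal sketch, not part of the answer above: it skips files that already exist in the current folder, sets a longer timeout, and keeps going if one download fails. The two example URLs are assembled from the fileServer prefix and file names quoted in this thread and should be treated as assumptions, `do_download` is a hypothetical dry-run switch, and the 120-second timeout is an arbitrary guess.

```matlab
% Example download URLs (assumed; assembled from names quoted in this thread).
base = ['https://www.ngdc.noaa.gov/thredds/fileServer/global/' ...
        'ETOPO2022/15s/15s_surface_elev_netcdf/'];
nc_urls = strcat(base, {'ETOPO_2022_v1_15s_N00E000_surface.nc'; ...
                        'ETOPO_2022_v1_15s_N00E015_surface.nc'});

do_download = false;                % hypothetical dry-run switch; set true to fetch
opts = weboptions('Timeout', 120);  % 120 s is an arbitrary guess for large files

% Local file names taken from the URLs.
[~, names, exts] = fileparts(nc_urls);
filenames = strcat(names, exts);

% Indices of files that are not yet in the current folder.
todo = find(~isfile(filenames));

for ii = reshape(todo, 1, [])
    fprintf('Pending: %s\n', filenames{ii});
    if do_download
        try
            websave(filenames{ii}, nc_urls{ii}, opts);
        catch err
            % Report the failure and continue with the next file.
            warning('Download failed for %s: %s', filenames{ii}, err.message);
        end
    end
end
```

With `do_download` left as `false`, the loop only lists the files that would be fetched, which is a cheap way to check the URL list before starting a long transfer.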

3 comments

Chad Greene on 21 Nov 2023
Awesome, thank you! Your solution works, and I want to make sure I understand it. What exactly is the first loop doing? I'm having trouble understanding why we need to call webread and regexp twice. Isn't all the information in temp_urls after the first call to webread?
Voss on 21 Nov 2023
You're welcome!
Each link on the main page goes to a distinct intermediate page which contains the link to download the actual .nc file.
The first webread/regexp gets the set of URLs to those intermediate pages. The loop then calls webread on each intermediate page, and the second regexp searches each page's contents for the download URL, which is the URL immediately preceding 'HTTPServer' on that page. There are several other URLs on those pages, and that was the only way I could think of to be sure to get the right one.
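To see how the second regexp picks out only the link that precedes 'HTTPServer', it can be run offline on a small sample. The `page` text below is a hypothetical stand-in for an intermediate page (its layout and the `sample.nc` paths are assumptions, not the real page contents); the pattern and the domain extraction are the ones from the answer.

```matlab
% Hypothetical excerpt of an intermediate page with several links.
page = ['<a href="/thredds/dodsC/global/sample.nc.html"><b>OPENDAP</b></a>' newline ...
        '<a href="/thredds/fileServer/global/sample.nc"> <b>HTTPServer</b></a>' newline ...
        '<a href="/thredds/wcs/global/sample.nc"><b>WCS</b></a>'];

% Capture the href of the link whose label starts with HTTPServer.
tok = regexp(page, '<a href="([^"]*)">\s*<b>HTTPServer', 'tokens', 'once');

% Keep everything before the third '/' of the main URL, i.e. the domain.
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';
idx = find(url == '/', 3);
nc_url = [url(1:idx(end)-1) tok{1}];

disp(nc_url)
```

Because `[^"]*` cannot cross a closing quote, the OPENDAP and WCS links cannot match, so the only token returned is the path in front of HTTPServer.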
Chad Greene on 22 Nov 2023
Ooh, okay, that makes a lot of sense. Thanks @Voss!


Release: R2023b

Asked: 21 Nov 2023

Commented: 22 Nov 2023
