How do I extract the contents of an HTML table on a web page into a MATLAB table?

I'd like to plot and analyze the TSA traveler data from this website: https://www.tsa.gov/coronavirus/passenger-throughput
The data is embedded on the page as an HTML table element.
How do I extract the table content into a MATLAB table?

Accepted Answer

Pat Canny on 23 Jun 2020
You can extract the <table> content, which is stored in a set of <td> elements, as a string array and work from there.
First, use findElement and extractHTMLText on an htmlTree object (these functions are part of Text Analytics Toolbox).
Then use reshape to arrange the cells into rows and array2table to convert the result to a table.
Here is one approach:
% Download the page source and parse it into an HTML tree
travel_data = webread('https://www.tsa.gov/coronavirus/passenger-throughput');
travel_data_tree = htmlTree(travel_data);
% Select every <td> element and extract its text as a string array
selector = "td";
subtrees = findElement(travel_data_tree,selector);
str = extractHTMLText(subtrees);
table_data = str(4:end); % first three elements are just the column names
% reshape fills column by column, so build a 3-by-nrows array and transpose it
reshape_ncols = 3;
reshape_nrows = length(table_data)/reshape_ncols;
table_data_reshaped = reshape(table_data,reshape_ncols,reshape_nrows)';
% Convert to table
traveler_data_table = array2table(table_data_reshaped,'VariableNames',["Date" "Travelers_Today" "Travelers_Last_Year"]); % I got lazy with VariableNames, I know.
% Convert data types from strings to appropriate types
traveler_data_table.Date = datetime(traveler_data_table.Date);
traveler_data_table.Travelers_Today = str2double(traveler_data_table.Travelers_Today);
traveler_data_table.Travelers_Last_Year = str2double(traveler_data_table.Travelers_Last_Year);
traveler_data_table.Traveler_Ratio = traveler_data_table.Travelers_Today ./ traveler_data_table.Travelers_Last_Year;
% Plot the results
figure
plot(traveler_data_table.Date,traveler_data_table.Traveler_Ratio)
title("TSA Traveler Ratio by Date (2020 vs. 2019)")
grid on
% Some more fun analysis
% When did it bottom out?
[min_ratio,idx] = min(traveler_data_table.Traveler_Ratio);
min_ratio_pct = 100*min_ratio;
min_date = traveler_data_table.Date(idx);
disp("The minimum traveler ratio of " + min_ratio_pct + "% occurred on " + string(min_date))
latest_pct = 100*traveler_data_table.Traveler_Ratio(1);
disp("The current ratio is " + latest_pct + "%")
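
The extract, reshape, and convert steps can also be tried without depending on the live page. The following is only a minimal sketch: the HTML string and every number in it are made up for illustration (they are not TSA data), the names cells, cell_grid, and T are arbitrary, and it assumes the header row also uses <td> elements, as the code above does. Like the code above, it needs Text Analytics Toolbox.

```matlab
% Made-up HTML standing in for the downloaded page source (not real TSA data)
html = "<table>" + ...
    "<tr><td>Date</td><td>Travelers_Today</td><td>Travelers_Last_Year</td></tr>" + ...
    "<tr><td>6/2/2020</td><td>1,000</td><td>4,000</td></tr>" + ...
    "<tr><td>6/1/2020</td><td>1,500</td><td>3,000</td></tr>" + ...
    "</table>";

% Parse the HTML and pull the text out of every <td> element
tree = htmlTree(html);
cells = extractHTMLText(findElement(tree,"td"));

% Guard: reshape only works if the cells fill complete rows
ncols = 3;
assert(mod(numel(cells),ncols) == 0,'Cell count is not a multiple of %d',ncols)

% reshape fills column by column, so build ncols-by-nrows and transpose
cell_grid = reshape(cells,ncols,[])';

% First row holds the column names, the remaining rows hold the data
T = array2table(cell_grid(2:end,:),'VariableNames',cell_grid(1,:));

% Convert the string columns to dates and numbers
T.Date = datetime(T.Date,'InputFormat','M/d/yyyy');
T.Travelers_Today = str2double(T.Travelers_Today);       % accepts "1,000"
T.Travelers_Last_Year = str2double(T.Travelers_Last_Year);
T.Traveler_Ratio = T.Travelers_Today ./ T.Travelers_Last_Year;
disp(T)
```

Taking the variable names from the first row instead of hard-coding them, and checking the cell count before reshape, makes the approach a little less fragile if the page layout changes.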

More Answers (1)

Christopher Creutzig on 7 Jun 2022
Starting in R2021b, you can directly use readtable for HTML tables:
readtable("https://www.tsa.gov/coronavirus/passenger-throughput",...
FileType="html",ReadVariableNames=true,ThousandsSeparator=",")
ans = 364×5 table

       Date         2022          2021          2020          2019   
    __________    __________    __________    __________    __________

    06/05/2022    2.3872e+06    1.9847e+06    4.4126e+05    2.6699e+06
    06/04/2022    1.9814e+06    1.6812e+06    3.5302e+05     2.226e+06
    06/03/2022    2.3326e+06    1.8799e+06    4.1968e+05    2.6498e+06
    06/02/2022    2.2132e+06    1.8159e+06    3.9188e+05    2.6239e+06
    06/01/2022    1.9991e+06    1.5879e+06    3.0444e+05    2.3702e+06
    05/31/2022    2.1081e+06    1.6828e+06    2.6774e+05    2.2474e+06
    05/30/2022    2.3122e+06    1.9002e+06    3.5326e+05     2.499e+06
    05/29/2022    2.0965e+06    1.6505e+06    3.5295e+05    2.5556e+06
    05/28/2022    1.9942e+06    1.6058e+06    2.6887e+05    2.1172e+06
    05/27/2022    2.3847e+06    1.9596e+06    3.2713e+05    2.5706e+06
    05/26/2022    2.3799e+06    1.8545e+06    3.2178e+05    2.4858e+06
    05/25/2022    2.1477e+06    1.6182e+06    2.6117e+05     2.269e+06
    05/24/2022    2.0207e+06    1.4708e+06    2.6484e+05    2.4536e+06
    05/23/2022     2.329e+06    1.7474e+06    3.4077e+05    2.5122e+06
    05/22/2022    2.3509e+06    1.8637e+06    2.6745e+05    2.0707e+06
    05/21/2022    1.9888e+06      1.55e+06    2.5319e+05    2.1248e+06
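
To experiment with the readtable call offline, one option is to write a small HTML file and read it back with the same options. This is a rough sketch only: the file contents and numbers are made up (not TSA data), the column names Travelers_2020 and Travelers_2019 are hypothetical, and it assumes R2021b or later, as noted above.

```matlab
% Made-up HTML table written to a temporary file (not real TSA data)
html = "<table>" + ...
    "<tr><th>Date</th><th>Travelers_2020</th><th>Travelers_2019</th></tr>" + ...
    "<tr><td>06/02/2020</td><td>1,000</td><td>4,000</td></tr>" + ...
    "<tr><td>06/01/2020</td><td>1,500</td><td>3,000</td></tr>" + ...
    "</table>";
htmlFile = [tempname '.html'];
fid = fopen(htmlFile,'w');
fprintf(fid,'%s',html);
fclose(fid);

% Same options as the call above, pointed at the local file instead of the URL
T = readtable(htmlFile,FileType="html",ReadVariableNames=true,ThousandsSeparator=",");

% Ratio of the two traveler columns, computed directly on the imported table
T.Traveler_Ratio = T.Travelers_2020 ./ T.Travelers_2019;
disp(T)

% Clean up the temporary file
delete(htmlFile)
```

The header names here are valid MATLAB identifiers on purpose. With headers such as 2022 or 2021, as in the output above, the columns can instead be addressed by position, for example T.(2), or with T.("2022") when the names are preserved as shown.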

Release

R2020a