Process Big Data in the Cloud

This example shows how to access a large data set in the cloud and process it in a cloud cluster using MATLAB® capabilities for big data.

Learn how to:

  • Access a publicly available large data set on Amazon Cloud.

  • Find and select an interesting subset of this data set.

  • Use datastores, tall arrays, and Parallel Computing Toolbox to process this subset in less than 20 minutes.

The public data set in this example is part of the Wind Integration National Dataset Toolkit, or WIND Toolkit [1], [2], [3], [4]. For more information, see Wind Integration National Dataset Toolkit.

Requirements

To run this example, you must set up access to a cluster in Amazon AWS. You can create clusters in Amazon AWS directly from the MATLAB desktop: on the Home tab, in the Parallel menu, select Create and Manage Clusters, and then, in the Cluster Profile Manager, click Create Cloud Cluster. Alternatively, you can use MathWorks Cloud Center to create and access compute clusters in Amazon AWS. For more information, see Getting Started with Cloud Center.

Set Up Access to Remote Data

The data set used in this example is the Techno-Economic WIND Toolkit. It contains 2 TB (terabytes) of data: wind power estimates and forecasts, along with atmospheric variables, from 2007 to 2013 within the continental U.S.

The Techno-Economic WIND Toolkit is available via Amazon Web Services, in the location s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data. It contains two data sets:

  • s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data - Meteorological Data

  • s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/fcst_data - Forecast Data

To work with remote data in Amazon S3, you must define environment variables for your AWS credentials. For more information on setting up access to remote data, see Work with Remote Data. In the following code, replace YOUR_AWS_ACCESS_KEY_ID and YOUR_AWS_SECRET_ACCESS_KEY with your own Amazon AWS credentials. If you are using temporary AWS security credentials, also set the environment variable AWS_SESSION_TOKEN.

setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID");
setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY");

This data set requires you to specify its geographic region, and so you must set the corresponding environment variable.

setenv("AWS_DEFAULT_REGION","us-west-2");

To give the workers in your cluster access to the remote data, add these environment variable names to the EnvironmentVariables property of your cluster profile. To edit the properties of your cluster profile, use the Cluster Profile Manager, in Parallel > Create and Manage Clusters. For more information, see Set Environment Variables on Workers.
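
For example, this minimal sketch sets the property programmatically, assuming your cluster profile is named myAWSCluster as in the rest of this example:

% Add the credential variable names to the cluster profile (sketch)
c = parcluster("myAWSCluster");
c.EnvironmentVariables = {'AWS_ACCESS_KEY_ID','AWS_SECRET_ACCESS_KEY','AWS_DEFAULT_REGION'};
saveProfile(c);   % save the modified property back to the profile

If you use temporary credentials, include 'AWS_SESSION_TOKEN' in this list as well.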

Find Subset of Big Data

The 2 TB data set is too large to process all at once. This example shows how to find a subset of the data set to analyze, focusing on data for the state of Massachusetts.

First, obtain the IDs that identify the meteorological stations in Massachusetts, and determine which files contain their data. Metadata for each station is in a file named three_tier_site_metadata.csv. Because this metadata is small and fits in memory, you can read it on the MATLAB client with the readtable function, which accesses open data in S3 buckets directly without any special code.

tMetadata = readtable("s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/three_tier_site_metadata.csv",...
    "ReadVariableNames",true,"TextType","string");

To find out which states are listed in this data set, use unique.

states = unique(tMetadata.state)
states = 50×1 string array
    ""
    "Alabama"
    "Arizona"
    "Arkansas"
    "California"
    "Colorado"
    "Connecticut"
    "Delaware"
    "District of Columbia"
    "Florida"
    "Georgia"
    "Idaho"
    "Illinois"
    "Indiana"
    "Iowa"
    "Kansas"
    "Kentucky"
    "Louisiana"
    "Maine"
    "Maryland"
    "Massachusetts"
    "Michigan"
    "Minnesota"
    "Mississippi"
    "Missouri"
    "Montana"
    "Nebraska"
    "Nevada"
    "New Hampshire"
    "New Jersey"
    "New Mexico"
    "New York"
    "North Carolina"
    "North Dakota"
    "Ohio"
    "Oklahoma"
    "Oregon"
    "Pennsylvania"
    "Rhode Island"
    "South Carolina"
    "South Dakota"
    "Tennessee"
    "Texas"
    "Utah"
    "Vermont"
    "Virginia"
    "Washington"
    "West Virginia"
    "Wisconsin"
    "Wyoming"

Identify which stations are located in the state of Massachusetts.

index = tMetadata.state == "Massachusetts";
siteId = tMetadata{index,"site_id"};

The data for a given station is contained in a file that follows this naming convention: s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data/folder/site_id.nc, where folder is floor(site_id/500), the greatest integer less than or equal to site_id/500. For example, a station with site_id 26669 resides in folder 53, because floor(26669/500) = 53. Using this convention, compose a file location for each station.

folder = floor(siteId/500);
fileLocations = compose("s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data/%d/%d.nc",folder,siteId);

Process Big Data

You can use datastores and tall arrays to access and process data that does not fit in memory. When performing big data computations, MATLAB accesses smaller portions of the remote data as needed, so you do not need to download the entire data set at once. With tall arrays, MATLAB automatically breaks the data into smaller blocks that fit in memory for processing.

If you have Parallel Computing Toolbox, MATLAB can process these blocks in parallel. This parallelization lets you run an analysis on a single desktop with local workers, or scale up to a cluster for more resources. When you use a cluster in the same cloud service as the data, the data stays in the cloud and you benefit from faster data transfer. Keeping the data in the cloud is also more cost-effective. This example ran in less than 20 minutes using 18 workers on a c4.8xlarge machine in Amazon AWS.
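
If you do not have access to a cloud cluster, the same analysis runs with local workers instead; the remote data then transfers from S3 to your machine, which is slower. A minimal sketch, assuming the default local profile:

% Alternative: use local workers on your desktop instead of a cloud cluster
p = parpool("local");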

If you use a parallel pool in a cluster, MATLAB processes the data using the cluster workers. Create a parallel pool in the cluster. In the following code, replace myAWSCluster with the name of your own cluster profile. Attach the script to the pool, because the parallel workers need access to a helper function defined in it.

p = parpool("myAWSCluster");
Starting parallel pool (parpool) using the 'myAWSCluster' profile ...
connected to 18 workers.
addAttachedFiles(p,mfilename("fullpath"));

Create a datastore with the meteorological data for the stations in Massachusetts. The data is in the form of Network Common Data Form (NetCDF) files and requires a custom read function to interpret them. In this example, this function is named ncReader and reads the NetCDF data into timetables. You can explore its contents in the Define Custom Read Function section at the end of this script.

dsMetrology = fileDatastore(fileLocations,"ReadFcn",@ncReader,"UniformRead",true);
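
To confirm that your credentials and the read function work before building the tall array, you can optionally preview the first file in the datastore. This reads a single station file:

previewData = preview(dsMetrology)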

Create a tall timetable with the meteorology data from the datastore.

ttMetrology = tall(dsMetrology)
ttMetrology =

  M×6 tall timetable

            Time            wind_speed    wind_direction    power     density    temperature    pressure
    ____________________    __________    ______________    ______    _______    ___________    ________

    01-Jan-2007 00:00:00       5.905          189.35        3.3254    1.2374       269.74        97963  
    01-Jan-2007 00:05:00      5.8898          188.77        3.2988    1.2376       269.73        97959  
    01-Jan-2007 00:10:00      5.9447          187.85         3.396    1.2376       269.71        97960  
    01-Jan-2007 00:15:00      6.0362          187.05        3.5574    1.2376       269.68        97961  
    01-Jan-2007 00:20:00      6.1156          186.49        3.6973    1.2375       269.83        97958  
    01-Jan-2007 00:25:00      6.2133          185.71        3.8698    1.2376       270.03        97952  
    01-Jan-2007 00:30:00      6.3232          184.29        4.0812    1.2379       270.19        97955  
    01-Jan-2007 00:35:00      6.4331          182.51        4.3382    1.2382        270.3        97957  
             :                  :               :             :          :            :            :
             :                  :               :             :          :            :            :

Get the mean temperature per month using groupsummary, and sort the resulting tall table. For performance, MATLAB defers most tall operations until the data is needed. In this case, plotting the data triggers evaluation of deferred calculations.

meanTemperature = groupsummary(ttMetrology,"Time","month","mean","temperature");
meanTemperature = sortrows(meanTemperature);
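
The grouped result has only one row per month, so it fits easily in client memory. As an optional alternative to triggering evaluation through the plot, you can gather the result explicitly:

meanTemperatureLocal = gather(meanTemperature);   % forces evaluation of the deferred operations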

Plot the results.

figure;
plot(meanTemperature.mean_temperature,"*-");
ylim([260 300]);
xlim([1 12*7+1]);
xticks(1:12:12*7+1);
xticklabels(["2007","2008","2009","2010","2011","2012","2013","2014"]);
title("Average Temperature in Massachusetts 2007-2013");
xlabel("Year");
ylabel("Temperature (K)")

Many MATLAB functions support tall arrays, so you can perform a variety of calculations on big data sets using familiar syntax. For more information on supported functions, see Supporting Functions.
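
For instance, this minimal sketch computes the overall mean wind speed across the Massachusetts stations using the same tall timetable:

meanWindSpeed = gather(mean(ttMetrology.wind_speed))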

Define Custom Read Function

The data in the Techno-Economic WIND Toolkit is saved in NetCDF files. Define a custom read function to read its data into a timetable. For more information on reading NetCDF files, see NetCDF Files.

function t = ncReader(filename)
% NCREADER Read NetCDF File (.nc), extract data set and save as a timetable

% Get information about NetCDF data source
fileInfo = ncinfo(filename);

% Extract variable names and datatypes
varNames = string({fileInfo.Variables.Name});
varTypes = string({fileInfo.Variables.Datatype});

% Transform variable names into valid names for table variables
if any(startsWith(varNames,["4","6"]))
    strVarNames = replace(varNames,["4","6"],["four","six"]);
else
    strVarNames = varNames;
end

% Extract the length of each variable
fileLength = fileInfo.Dimensions.Length;

% Extract initial timestamp, sample period and create the time axis
tAttributes = struct2table(fileInfo.Attributes);
startTime = datetime(cell2mat(tAttributes.Value(contains(tAttributes.Name,"start_time"))),"ConvertFrom","epochtime");
samplePeriod = seconds(cell2mat(tAttributes.Value(contains(tAttributes.Name,"sample_period"))));

% Create the output timetable 
numVars = numel(strVarNames);
tableSize = [fileLength numVars];
t = timetable('Size',tableSize,'VariableTypes',varTypes,'VariableNames',strVarNames,'TimeStep',samplePeriod,'StartTime',startTime);

% Fill in the timetable with variable data
for k = 1:numVars
    t(:,k) = table(ncread(filename,varNames{k}));
end
end

References

[1] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. Overview and Meteorological Validation of the Wind Integration National Dataset Toolkit (Technical Report, NREL/TP-5000-61740). Golden, CO: National Renewable Energy Laboratory, 2015.

[2] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. "The Wind Integration National Dataset (WIND) Toolkit." Applied Energy. Vol. 151, 2015, pp. 355-366.

[3] King, J., A. Clifton, and B. M. Hodge. Validation of Power Output for the WIND Toolkit (Technical Report, NREL/TP-5D00-61714). Golden, CO: National Renewable Energy Laboratory, 2014.

[4] Lieberman-Cribbin, W., C. Draxl, and A. Clifton. Guide to Using the WIND Toolkit Validation Code (Technical Report, NREL/TP-5000-62595). Golden, CO: National Renewable Energy Laboratory, 2014.
