How to train LSTM net on very large dataset.

Davey Gregg on 2 Mar 2021
Answered: Udit06 on 22 Feb 2024
I am trying to train a standard LSTM net, and I have about 225 GB of data that I want to feed it. The data come from a binary neural recording file containing 32 channels. I pull the 60 seconds of data before each timestamp I am trying to predict and slice it into 1-second chunks, giving 32x1000 arrays since my sampling frequency is 1 kHz. My plan is to train the network with 60 classes, one for each second, and have the net output its confidence about when my event of interest may occur. I got halfway decent results with this method using a small subset of the data that fits into RAM, but I really want to give it the whole thing so it can see more possible autocorrelations hidden in the full data.
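For reference, a rough sketch of the slicing step described above. The file name, the int16 channel-interleaved sample layout, and the eventSample variable are assumptions for illustration only, not details from the recording format:

fs = 1000;                                          % sampling rate in Hz
nChannels = 32;
fid = fopen('recording.bin', 'r');                  % placeholder file name
startSample = eventSample - 60*fs;                  % eventSample: sample index of the event (assumed known)
fseek(fid, startSample*nChannels*2, 'bof');         % 2 bytes per int16 sample, channels interleaved
window = fread(fid, [nChannels, 60*fs], 'int16');   % 32 x 60000 block: the 60 s before the event
fclose(fid);

chunks = mat2cell(window, nChannels, repmat(fs, 1, 60));   % 1x60 cell of 32x1000 sequences
labels = categorical(1:60);                          % one class per second before the event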
What I am doing currently is reading the data from the binary file, transforming it into cell arrays, and storing them in a matObj using matfile(). The matObj variables are structured in the format needed as inputs for the training function: a 1xn cell array of sequences and a 1xn categorical array of labels. Calling net = trainNetwork(matObj.XTrain,matObj.YTrain,layers,options) works well for smaller datasets, but MATLAB still loads the data into RAM, and if I try to use my 225 GB matObj file it throws "Out of memory". So I don't know a good way to pass this data to the training function. fileDatastore only seems to work with a large collection of smaller files, and I really don't want to save each sequence as its own file just to use a file datastore. It takes long enough to save all the sequences into one file, and that is much better than having a folder with 10K+ files.

Answers (1)

Udit06 on 22 Feb 2024
Hi Davey,
If you don't want to create multiple smaller files, you can create a custom datastore in MATLAB that reads the data directly from the large file in chunks small enough to fit in memory. By iterating over these smaller chunks, the network can learn from a large data set without needing to load all of the data into memory at once. You can also leverage Parallel Computing Toolbox to scale up the network training.
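As an illustration only, here is a minimal sketch of such a custom mini-batch datastore that pulls sequences and labels out of one large v7.3 MAT-file a batch at a time. The variable names XTrain and YTrain follow the question, the class and file names are placeholders, and the read pattern assumes that parenthesized indexing into the stored cell arrays works acceptably for your file layout:

classdef BinaryChunkDatastore < matlab.io.Datastore & ...
        matlab.io.datastore.MiniBatchable
    % Sketch only: reads mini-batches of sequences/labels from one large
    % v7.3 MAT-file so the full data set never has to sit in RAM.
    % 'XTrain' (1-by-n cell of 32x1000 arrays) and 'YTrain' (1-by-n
    % categorical) are the variable names described in the question.

    properties
        MiniBatchSize            % observations returned per read
    end
    properties (SetAccess = protected)
        NumObservations          % total number of stored sequences
    end
    properties (Access = private)
        MatObj                   % matfile handle to the large file
        CurrentIndex             % index of the next observation to read
    end

    methods
        function ds = BinaryChunkDatastore(matFileName, miniBatchSize)
            ds.MatObj = matfile(matFileName);
            ds.NumObservations = size(ds.MatObj, 'XTrain', 2);
            ds.MiniBatchSize = miniBatchSize;
            reset(ds);
        end

        function tf = hasdata(ds)
            tf = ds.CurrentIndex <= ds.NumObservations;
        end

        function data = read(ds)
            % Load only the next mini-batch from disk.
            idx = ds.CurrentIndex : ...
                min(ds.CurrentIndex + ds.MiniBatchSize - 1, ds.NumObservations);
            X = ds.MatObj.XTrain(1, idx);     % 1-by-k cell of 32x1000 arrays
            Y = ds.MatObj.YTrain(1, idx);     % 1-by-k categorical labels
            % trainNetwork expects a table with predictors in the first
            % column and responses in the second.
            data = table(X(:), Y(:), 'VariableNames', {'Predictors', 'Responses'});
            ds.CurrentIndex = idx(end) + 1;
        end

        function reset(ds)
            ds.CurrentIndex = 1;
        end
    end

    methods (Hidden = true)
        function frac = progress(ds)
            % Fraction of the data read so far.
            frac = (ds.CurrentIndex - 1) / ds.NumObservations;
        end
    end
end

Save the class in its own file (here BinaryChunkDatastore.m). If you also want shuffling between epochs or parallel/multi-GPU training, the datastore additionally needs the matlab.io.datastore.Shuffleable and matlab.io.datastore.PartitionableByIndex mixins.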
Refer to the MathWorks documentation on creating a custom datastore and on scaling up deep learning in parallel. You can also refer to the documentation on training deep learning models with big data.
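Training could then read from the datastore instead of from in-memory arrays. A short usage sketch, with placeholder option values to adapt to your problem:

ds = BinaryChunkDatastore('bigdata.mat', 128);   % placeholder file name and batch size

options = trainingOptions('adam', ...
    'MiniBatchSize', 128, ...                    % keep equal to the datastore batch size
    'MaxEpochs', 10, ...
    'Shuffle', 'never', ...                      % shuffling needs the Shuffleable mixin
    'Plots', 'training-progress');

net = trainNetwork(ds, layers, options);         % layers: the same LSTM layer array as before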
I hope this helps.

Version: R2020b
