Video length is 29:30

Deep Learning for Signals and Sound

Overview

Deep learning networks are proving to be versatile tools. Originally intended for image classification, they are increasingly being applied to a wide variety of other data types. In this webinar, we will explore deep learning fundamentals which provide the basis to understand and use deep neural networks for signal data. Through two examples, you will see deep learning in action, providing the ability to perform complex analyses of large data sets without being a domain expert. 

Explore how MATLAB addresses the common challenges encountered using CNNs and LSTMs to create systems for signals and sound, and see new capabilities for deep learning for signal data.

Highlights

We will demonstrate deep learning to denoise speech signals and generate musical tunes. You will see how you can use MATLAB to:

  • Train neural networks from scratch using LSTM and CNN network architectures
  • Use spectrograms and wavelets to create 3D representations of signals
  • Access, explore, and manipulate large amounts of data
  • Use GPUs to train neural networks faster

About the Presenters

Emelie Andersson is an application engineer at MathWorks focusing on MATLAB applications such as data analytics, machine learning, and deep learning. In her role she supports customers in adopting MATLAB products across the entire data analytics workflow. She has been with MathWorks for 2 years and holds an M.Sc. degree from Lund University in image analysis and signal processing.

Johanna Pingel joined the MathWorks team in 2013, specializing in Image Processing and Computer Vision applications with MATLAB. She has an M.S. degree from Rensselaer Polytechnic Institute and a B.A. degree from Carnegie Mellon University. She has been working in the Computer Vision application space for over 5 years, with a focus on object detection and tracking.

Recorded: 6 Dec 2018

Welcome to Deep Learning for Signals and Sound. My name is Johanna Pingel from Product Marketing, and I'm joined by Emelie Andersson from Application Engineering. Today we will cover when and why you could use deep learning for signal and audio data, and, of course, show you how it's done through two new examples. 

There are different kinds of data that engineers work with when it comes to deep learning and machine learning. Most of the data falls into one of three types-- tabular data, time series data, or image data. Deep learning has historically been used for image data, where the technology first took off and got very promising results. Deep learning has now become a tool for other types of data as well. This seminar will focus on the middle category-- deep learning for signal and audio data. 

So you have signal and/or audio data. Should you use deep learning, or classical machine learning? This is a question that comes up a lot. Everyone's talking about deep learning, but is it really good for signal and audio data, and when should I use machine learning instead of deep learning? Let me take 30 seconds and define the difference before answering this question. We define deep learning as a subset of machine learning, and classical machine learning as another subset. In both cases, we have input data, and we want an algorithm to learn to identify unique qualities between the signals. You have input data, and you want the algorithm to produce an output from the data. 

Let's start with classical machine learning, where you manually extract the relevant features from the data. These features get added to a machine learning model for classification, which then produces an output. With deep learning, you feed the raw data directly into a deep neural network that learns the features automatically. Deep learning is usually implemented using a neural network architecture, which we'll explain in more detail using the examples. That's a high level difference between machine learning and deep learning. But the question is, which one should I use? 

The answer is, it depends. Perhaps this is your situation. I have a lot of data, but not as much knowledge about what that data entails. If this is the case, you should try deep learning. Since the model learns its own features, you, as the engineer, do not have to know as much about the data, and more data can mean a better model. Perhaps you have a smaller data set. Can you still use deep learning? The answer is, sure, go ahead and try. But it's likely that a machine learning model will probably give you better results. In classical machine learning, you can utilize your knowledge about the data, and extract the best features and the right amount of them. 

You may ask, can't I just try both methods-- deep learning and machine learning-- and compare the results? The answer is, yes, you can definitely try both. The methods can easily be compared by looking at the accuracy. You can stay in the MATLAB environment to create both machine learning and deep learning models. In this seminar, we'll just be talking about deep learning methods. But at the end of the seminar, you'll find resources on how to do similar things using machine learning. The best part is, in MATLAB you don't have to be a signal processing engineer to work with signal data, and you don't have to be a data scientist to work with machine learning and deep learning, either. 

So, let's get started. Regardless of your specific deep learning problem, you'll most likely benefit from following the typical deep learning workflow. First, you need access to your audio and signal data files from wherever they are-- whether that's MP3 files, historical data in a database, or live streaming data. You then need lots of labeled data. You need input data to give the neural network lots of examples, so it can learn and understand the features to make decisions. All of that data makes the training of the network quite computationally intense. To be able to do these computations faster, you can and should utilize GPUs, which can make the training significantly faster. 

Once you have a trained model, you more than likely want to incorporate it into a larger system, either allowing users to call the model through an app, or exporting the model to external hardware. You may be considering targeting an embedded GPU, and MATLAB offers products to automatically generate CUDA code from your model. And as you may know, deep learning is an iterative process, where you'll most likely need to go back and forth between these steps many times for a variety of reasons-- maybe to bring in more data to increase the model accuracy, or to change the parameters of your model before deploying. The good news is, MATLAB provides all the tools to go back and forth between these steps. 

So, following the standard workflow, we'll look at two demos today, both of which will showcase different deep learning techniques you can try on your data. The first demo is music generation, which uses an LSTM network architecture. Emelie will walk through designing a network that can generate melodies, so in the end, by playing a short intro jingle, the network will be able to continue the song based on what has been played. The second demo is about speech denoising with severe background noise, which will use a CNN with images. We will train a network so that it will be able to remove or reduce the noise from the input, leaving a clean speech signal. 

And now, I'll turn it over to Emelie, who will walk you through both examples. 

Thank you, Johanna. So, let's get started with the first demo, music generation. The data that we have to train the network is a data set of old folk song melodies. The network type we will use is called an LSTM, which stands for Long Short-Term Memory. As the name implies, the network has an innate memory, in contrast to almost all other neural network types, which have no memory in time. The trained network will be called folkNet. We will then use folkNet by playing it a short song and letting it tag along, creating a follow-up melody. 

So let's take a closer look at how these kinds of networks work. "I was born in Sweden. I speak--" I know that the missing word is Swedish, based on the context. LSTMs are a kind of recurrent neural network, and recurrent neural networks take previous data in time into account when making predictions. These kinds of networks are especially good for signal, audio, text, and time series data. Information can be kept by sending it back into the network, and these loops can be unfolded for easier visualization. 

So let's go back to the example, "I was born in Sweden," and then we can have, let's say, five pages of text. To still have Sweden in the context requires a much longer memory than an ordinary recurrent network provides. So, Long Short-Term Memory networks are really good at this, since they carry a memory cell throughout the whole process. And in my demo, I'm going to show you how easy it is to create these kinds of networks in MATLAB. 

So, we'll start with looking at what we could create in the end with this trained neural network. Here, we have a piano, looking like this. This app has been created in MATLAB, and in here, I can record a tune. This tune, or melody, will then be used to generate a longer follow-up melody. I can change the temperature up here. Temperature, in this case, is a technical term that describes how similar the future tunes will be to the notes I play. So a lower temperature means that the tune will be similar to the notes I played, and a higher temperature means that the tune will vary more from the original notes played. Let's leave it at this. I'll press Rec and start to play "Old MacDonald Had a Farm." 

Now I'll press "Make Me A Tune!" and we can see that it's starting to generate a new melody. On the y-axis, we have the pitch, or the note. On the x-axis, we have the time in beats. So in red, we have the tunes that we just played, and in blue, we have the newly generated tunes. Let's play the tune. 

I'll just pause it right there. So one reason why the beginning didn't really sound like we're used to is because the notes are elongated when I play the same note several times. So I played three Gs in the beginning, and they became one long tone instead of three separate beats. But pretty cool. Here we actually have a computer-generated song. 

Let's take a look at the script that trains the network that is used to generate the music. What we will do in this script is to train a step-ahead prediction network, which is a network that can predict the next note when given the current note. All in all, the network will learn 35 different notes. I'll start by reading in the folk song data from the melody folder. If I go to the folder, we can see that the data is in a MIDI format, which is a format for communicating musical note data between audio devices and music software. So basically, these files are just data, not a recording of sound. But the computer can use the data like notes to play out the song. 

Let's play one of the melodies outside of MATLAB, and let Windows Media Player convert the MIDI data to sound. 

These are the kinds of tones the network will be trained on. And you might recognize how it sounds from the kind of tones that were generated in the app we looked at. 

In the next section, we will make a melody vector of each data matrix that we get from the MIDI data. We will also get rid of the files that contain harmony pieces, and remove some longer songs. All in all, we go from having 1,034 melodies to 985. So, now we have the melody vectors, but no predictors or response for the model. 

In this section, we will create the response variable by converting the 35 different notes in each of the 985 melodies into a categorical data type. And we will create the predictors by taking a dummy variable for each note, which is also a way of working with categorical data but keeping it as a numerical data type. We will also, in this section, add something called an end token to the end of the response data melodies. In this case, we'll take the highest note plus 1, which is 89. This gives us a very interesting feature. Now, the trained network can decide on its own when it's finished generating a tune. By adding the end token to every piece, we're basically teaching the network when the music has come to an end. 
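In code, this step looks roughly like the following sketch. It assumes melody is one melody vector of note numbers and noteValues lists the distinct note values including the end token; the variable names are illustrative, and dummyvar requires the Statistics and Machine Learning Toolbox.

  endToken = 89;                                             % highest note + 1
  melody   = [melody, endToken];                             % mark where the piece ends
  Y = categorical(melody(2:end), noteValues);                % response: the next note at each step
  X = dummyvar(categorical(melody(1:end-1), noteValues))';   % predictors: one column of dummies per time step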

This next section partitions the predictors and the response into training and validation data. With the settings we have here, we set aside 10% of the data for testing. So the training data will be used to train the network, and the test data will be used to check the accuracy of the network on new data. Here, we can detect if the model is overfitting to the training set. 
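A minimal sketch of such a hold-out split, assuming XAll and YAll are cell arrays with one melody per cell (the variable names are assumptions):

  n      = numel(XAll);
  idx    = randperm(n);                           % shuffle the melodies
  numVal = round(0.1 * n);                        % set aside 10% of the data
  XValidation = XAll(idx(1:numVal));      YValidation = YAll(idx(1:numVal));
  XTrain      = XAll(idx(numVal+1:end));  YTrain      = YAll(idx(numVal+1:end));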

Now, we have finally come to the part where we design the architecture of this step-ahead LSTM predictor. The first layer in the network is a sequence input layer. Here, we have the number 35, which is how many notes we have in our melodies. After the input layer, we will use two LSTM layers with 250 hidden units each. This is a rather arbitrary choice. However, we're hoping that with two LSTM layers and a modest number of hidden units, the network will be able to learn sufficiently complex behavior to capture musical relationships. The dropout layers that come after each LSTM layer provide a guard against overfitting. 

After these layers, we have a fully connected layer, which, as its name implies, connects to every unit in the previous layer and has as many outputs as there are classes. In this case, we have the number 35 here, since we have 35 classes for the notes. The softmax layer that comes next outputs 35 values, each between 0 and 1, which we can think of as the probability distribution over the different classes. And lastly, we have a classification layer, which just outputs the class with the highest probability. 
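Put together in code, the architecture described here looks roughly like this; the dropout probability is an assumption, the rest follows the narration:

  layers = [
      sequenceInputLayer(35)                     % 35 note classes per time step
      lstmLayer(250, 'OutputMode','sequence')
      dropoutLayer(0.2)                          % guard against overfitting
      lstmLayer(250, 'OutputMode','sequence')
      dropoutLayer(0.2)
      fullyConnectedLayer(35)                    % one output per class
      softmaxLayer                               % class probabilities between 0 and 1
      classificationLayer];                      % outputs the most probable class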

In this next section, we will choose the training options that we will use to train the network. The Adam optimization algorithm and a gradient threshold of 1 are chosen to stabilize the training process. And if you're not familiar with them, you can learn all about the layers and the training options in the MATLAB documentation. 
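As a sketch, the options named here can be set up like this (the epoch count and validation settings are assumptions):

  options = trainingOptions('adam', ...
      'GradientThreshold', 1, ...
      'MaxEpochs', 30, ...
      'ValidationData', {XValidation, YValidation}, ...
      'Plots', 'training-progress');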

Now we have reached the last section, where we do the training of the network. All we need to add is the x-train and the y-train data, together with the architecture we set in layers and the training options we chose. As output, we will have the network folkNet. Here, we can see the network training progress, and in blue, we have the accuracy, which is going upwards, luckily. In red, below, we have the cross-entropy loss for multiclass classification, which is going down. That's also good news. We want the loss to move towards 0. In black, we have the validation data results. Using this, we can see if the model is starting to overfit to the training data. 
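The training call itself is a single line, given the data, layers, and options from the steps above:

  folkNet = trainNetwork(XTrain, YTrain, layers, options);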

So this training you see now has actually been sped up many times over-- it takes much longer to run in reality. You can see that the performance on the training set is improving over time. If we allowed the training to continue, the performance could get better. But this doesn't guarantee we'll get a network which generates nice music. Instead, what's more likely is that we'll have trained a network which just closely mimics the training set. 

At around 20 epochs, we have a turning point in the validation accuracy. I interpret this to be the point where the network is at its most general, so the network can reasonably well anticipate the musicality of unseen pieces. Up here in the corner, we have a stop button. If I press this, I will get the network weights as they are at the very moment that I stop the training. 

So here are the results. It doesn't look great in terms of accuracy, but in this case, we don't really mind. The goal is not to train an LSTM which perfectly reproduces the training data. Instead, the goal is to train an LSTM which generates reasonable sounding pieces. Now, we have finished training, and this is the network that has been put into our music-generating piano app. So here, we have looked at an example of using deep learning to generate new data through one-step-ahead prediction. 

The next thing we'll do is to look at how deep learning can be used to denoise an already existing signal. In this demo, instead of using LSTMs, we'll be using Convolutional Neural Networks, or CNNs, which is exactly how you would train a network on images. So in this example, we have audio data of people talking. The duration of each sequence is about the length of a sentence. To train a denoiser model, we will add washing machine noise to the clean speech data. The speech data disturbed with noise will be the predictors, and the clean speech data, which we now have, will be the response. And through that knowledge, the network will learn how to denoise a disturbed signal. 

So, as I said, we will use a network type constructed for image input data. However, right now, we have audio data. Let's use a common technique for creating images from signal data called spectrograms. We'll take both the clean data and the noisy data and convert it into spectrogram images. Spectrograms are a Fourier transform on short snippets of data, also known as the short-time Fourier transform, and this outputs frequency on one axis and time on the other. We will take each spectrogram of the noisy audio and shift it backwards in time, so that we end up with eight consecutive spectrograms. We will train the network using images based on these spectrograms. Basically, these 8-by-129-sized images are just eight frequency vectors at eight consecutive times. The prediction, or estimate, will be based on the latest frequency vector in time. 

Before we jump into MATLAB again, let's just take a short look at how CNNs actually work. On a very high level, convolutional neural networks work like we can see in the image here. We have multiple layers, where in many of the steps, we do convolution with different-sized filters. The very first layers of filters learn to identify low-level features, such as colors and very simple shapes. The later layers will learn to identify more advanced features. 

So let's hop back into MATLAB. I'm going to start with showing you an example of what we could get in the end with this denoising network. This is an app where I can choose which signal to listen to up here. Let's start by playing the noisy signal. 

So this is what the noisy signal sounds like. It has a signal-to-noise ratio of 0 decibel, which is a very heavy distortion of the original signal. Let's listen to the denoised signal-- and remember that what we're trying to remove is the washing machine noise. 

If the red of the second bow falls upon the green of the first, the result is to give a bow of abnormally wide-- 

So what has happened here is that the noisy signal has been passed through the network, and here we have the output of the network. We can hear that the washing machine noise is almost completely gone. Let's play the clean audio data. 

Many complicated ideas about the rainbow have been formed. 

This is the original sound that we will use as ground truth. I'll move to the script where we train the network that performs this denoising. To do deep learning, we need a quite large data set. In the folder sound data, we have 121,000 sound files in an MP3 format. To easily work with these files, we'll create a data store, which is basically a pointer to the data. The data store objects are very small-- as can be seen here, only 8 bytes. Specifically, this is an audio data store, which has audio-specific functionality. 
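Creating the datastore takes one line; a minimal sketch, assuming the files sit in a local folder (the folder name is an assumption):

  ads = audioDatastore('soundData', 'IncludeSubfolders', true);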

Next, I'm going to choose a subset of the data to work with, which is just 1,000 files. It is, of course, better to work with a larger data set, but for the sake of time, we will use 1,000 for now. Let's read in one of the 1,000 files in the data store, and name it cleanAudio. 
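A sketch of those two steps, using the standard datastore functions subset and read (variable names are assumptions):

  adsSubset = subset(ads, 1:1000);              % keep 1,000 of the files
  [cleanAudio, info] = read(adsSubset);         % read one file
  fs = info.SampleRate;                         % 48 kHz for these files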

This data is sampled at 48 kilohertz, but 8 kilohertz will actually be enough. We create a sample rate converter object, and then I'll downsample the clean audio file to reduce some computational load. To this signal we will add some noise-- in this case, washing machine noise. We choose a random location in the noise file, and then we calculate the noise power and the speech power so that the noisy audio will have a signal-to-noise ratio of 0 decibels. As I said earlier, this is a quite heavy distortion of the original signal, so the clean audio is almost indistinguishable. So let's hear what that sounds like. 
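A sketch of the downsampling and noise mixing, assuming a washing machine recording on disk (the file name and variable names are assumptions):

  src = dsp.SampleRateConverter('InputSampleRate', 48000, 'OutputSampleRate', 8000);
  D = 48000/8000;                                            % decimation factor
  cleanAudio = cleanAudio(1 : D*floor(numel(cleanAudio)/D)); % trim to a multiple of D
  cleanAudio = src(cleanAudio);                              % downsample to 8 kHz

  noise = audioread('WashingMachineNoise.mp3');              % assumed noise recording
  ix    = randi(numel(noise) - numel(cleanAudio));           % random start in the noise
  noiseSegment = noise(ix : ix + numel(cleanAudio) - 1);

  speechPower = sum(cleanAudio.^2);
  noisePower  = sum(noiseSegment.^2);
  noisyAudio  = cleanAudio + sqrt(speechPower/noisePower) * noiseSegment;  % 0 dB SNR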

I'll go ahead and plot the two signals we have now, and we can see a big difference between the signals here as well. The objective of the neural network is to take the noisy audio as input and output something as close to the clean audio as possible. I mentioned in the slides that we will use a technique called the short-time Fourier transform, also known as spectrograms, to create images from the audio signals. Let's define the parameters needed for our spectrograms. We will use a Hamming window of length 256 and an overlap of 75% of the window length. After this, we create the short-time Fourier transform by taking the spectrogram of both the clean data and the noisy data using all the parameters we just defined. 
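A sketch of that spectrogram step; for a real-valued signal, spectrogram returns 256/2 + 1 = 129 frequency rows, which matches the 129 features used later (variable names are assumptions):

  win     = hamming(256, 'periodic');             % window of length 256
  overlap = round(0.75 * 256);                    % 75% overlap
  cleanSTFT = abs(spectrogram(cleanAudio, win, overlap, 256));
  noisySTFT = abs(spectrogram(noisyAudio, win, overlap, 256));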

We don't have the eight-segment frequency matrix yet, so let's create that. As I showed in the slides, we just create eight copies of the spectrogram, each shifted by one spectrogram time step. This is how we create the target and the predictors. 
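A sketch of building those predictors: pad the start by repeating the first columns, then take a sliding block of eight consecutive spectra (variable names are assumptions):

  numFeatures = 129;  numSegments = 8;
  padded = [noisySTFT(:, 1:numSegments-1), noisySTFT];     % repeat-pad the start
  predictors = zeros(numFeatures, numSegments, size(noisySTFT, 2));
  for k = 1:size(noisySTFT, 2)
      predictors(:, :, k) = padded(:, k : k+numSegments-1);
  end
  targets = cleanSTFT;          % response: the clean spectrum at each time step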

Even though we created the data store a while back, we're still not working with all the data. Now let's start working with all the 1,000 files that we chose to keep. To do that easily, we convert the data store into a tall array. So what's a tall array? Tall arrays are arrays that have more rows than actually fit into memory-- perfect for big data sets. When I perform a set of commands on the tall array, they will remain unevaluated until I call the gather function. This deferred evaluation enables us to work quickly with large data sets. When we eventually request the output using gather, MATLAB combines the queued calculations where possible and takes the minimum number of passes through the data. This can also be performed in parallel if you have multiple cores in your computer. 

In this next section, we will do the entire procedure which we did for that one file-- from reading it in, to getting the target and the predictors-- all in one line in this helper function. And it's not until we run the gather command that the function actually runs. We can see that MATLAB has identified that we only need to do one pass over the data to be able to do the calculations in the helper function. Now we just normalize the predictor and the target data, which is quite standard procedure in deep learning. And before we start training the network, let's set aside data for validation to be able to spot overfitting. 
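As a sketch of that tall-array workflow, with a hypothetical helper function standing in for the read / resample / add-noise / spectrogram steps shown above for a single file:

  T = tall(adsSubset);
  [targets, predictors] = cellfun(@(x) helperDenoisingFeatures(x, noise, src), T, ...
      'UniformOutput', false);                            % queued, not evaluated yet
  [targets, predictors] = gather(targets, predictors);    % one pass through the data

  % Normalize to zero mean and unit standard deviation.
  predictors = cat(3, predictors{:});   targets = cat(2, targets{:});
  predictors = (predictors - mean(predictors(:))) / std(predictors(:));
  targets    = (targets    - mean(targets(:)))    / std(targets(:));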

So, now we have arrived at the network architecture. As I mentioned earlier, we will be creating image input data so that we can utilize the network types that are available for images. So let's start designing a convolutional neural network. I can write it in code like we have here, but I can also quickly design it in an app called Deep Network Designer. In this app, I can drag and drop the layers I'm interested in. I will start with an image input layer that has the following size-- the number of features, which is the frequency resolution, of 129, and the number of time segments, which is 8. After that, I grab a convolutional layer and add some filter settings, like filter size and number of filters. The conv layer is followed by a normalization layer, and then an activation layer. 

We have a repetition of the conv layers, and I can go ahead and copy-paste the ones that should be the same. If I count, in the end we will have 16 conv layers in total. 

We'll end with a regression layer, which computes the half-mean-squared-error loss. I can press Auto Arrange, so that the layers align neatly, and then I'll press Analyze to see if there are any errors in the architecture. I get three errors, and they say "missing input" and "missing output," and I can see that they are right at the beginning of the network. So if I go back to the graph, I can see that I forgot to connect the very first layers. I connect them and analyze again, and this time, everything is fine. 
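Written out in code rather than in the app, the repeating pattern looks roughly like this; the filter sizes, filter counts, and the stride that collapses the time axis are assumptions, not the exact exported 48-layer network:

  layers = [
      imageInputLayer([129 8 1], 'Normalization','none')  % 129 frequencies x 8 time segments
      convolution2dLayer([9 8], 18, 'Stride',[1 100], 'Padding','same')
      batchNormalizationLayer
      reluLayer
      convolution2dLayer([5 1], 30, 'Padding','same')
      batchNormalizationLayer
      reluLayer
      % ... the same conv / batchnorm / ReLU pattern repeats, 16 conv layers in total ...
      convolution2dLayer([129 1], 1, 'Padding','same')
      regressionLayer];                                    % half-mean-squared-error loss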

In total, we have 48 layers in this network. And to utilize the architecture I designed here, I will just go ahead and press Export. As in the previous example, we will set some training options on how we want to train this model. We set the epochs so that we take three passes through the data, and so that each mini-batch consists of 128 images. After this, we just need to do the training. I'll pass in the training data, the layers, and the options. Since it's a regression problem this time, we have the root-mean-square error going down in blue instead of the accuracy going up. We can see that the error is steadily going down, and then flattening out at an error just below 4. The validation data in black shows that the model is not overfitting. 
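A sketch of that training setup, assuming the training and validation arrays come from the normalized predictors and targets, with predictors arranged as a 129-by-8-by-1-by-numObservations array (the solver choice is also an assumption):

  options = trainingOptions('adam', ...
      'MaxEpochs', 3, ...                          % three passes through the data
      'MiniBatchSize', 128, ...                    % 128 images per mini-batch
      'Shuffle', 'every-epoch', ...
      'ValidationData', {XValidation, YValidation}, ...
      'Plots', 'training-progress');
  denoiseNet = trainNetwork(XTrain, YTrain, layers, options);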

Now we have a trained network that is ready to be exposed to new data. So let's read in a new file, add some noise to it, and run it through the network. Let's plot the clean speech, the noisy speech, and the denoised speech. We can see that the appearance of the denoised speech is closer to the clean speech than to the noisy speech. If we play it out, we can hear that it sounds just like in the app. 

So that's one example of how you can create a denoising network using convolutional neural networks. That was all for me. 

We hope you found this webinar useful, and there are many more examples for you to explore on our website. For more information on deep learning for signal data, and to try out examples, you can go to our documentation page. And you can learn more about machine learning and deep learning solutions in MATLAB by following these links. Thanks for listening.