Feature selection in sequence-to-one regression

Michal Slezak on 26 Jun 2023
Commented: Michal Slezak on 30 Jun 2023
I have a dataset with about 3000 observations. Each observation consists of 28 time-series variables (pressures in particular areas of the cardiovascular system) and a single scalar value (the resistance of the heart valve).
My goal is to train a model (a neural network) that takes some of those time series as input and regresses that single-valued parameter.
The question is how to do feature selection so that I can choose about 3-6 of those 28 time series as inputs. I don't need finished code, just an idea or a clue.
If I had a sequence-to-sequence regression problem instead, I could simply use the Pearson correlation coefficient. If I had categorical data, I think I could use the chi-square test. But I cannot figure out what to do for a sequence-to-one regression problem.

Accepted Answer

Kautuk Raj on 27 Jun 2023
For a sequence-to-one regression problem, where you have multiple time-series features and a single-valued target variable, there are several feature selection techniques you can try. Here are a few ideas:
  1. Correlation analysis: Calculate the Pearson correlation coefficient between each time-series feature and the target variable, and select the features with the highest absolute correlation. This identifies the features with the strongest linear relationship to the target, but it may miss more complex nonlinear relationships.
  2. Feature importance from a trained model: Train a model using all the available time-series features, then use feature-importance techniques to determine which features matter most for its predictions. For example, you can use the importance scores from a random forest or gradient boosting model; note that these come from a tree ensemble trained for the purpose, not from the neural network itself. This approach can capture both linear and nonlinear relationships between the features and the target variable.
  3. Principal component analysis (PCA): PCA is a dimensionality reduction technique that finds the directions explaining the most variance in the data. You can apply PCA to the time-series features and use the top principal components as inputs for your model. Each component is a linear combination of all the original features, so this reduces dimensionality rather than picking a subset of the 28 series. It can be useful when the time-series features are highly correlated with one another.
  4. Forward feature selection: Use a forward selection algorithm to iteratively add the most informative time-series features until the desired number is reached. It starts with an empty feature set and, at each iteration, adds the feature that most improves a predefined criterion, such as model performance on held-out data. This can be computationally expensive, but because it evaluates features jointly it can yield a better feature set than ranking them one at a time.
  5. Lasso regression: Lasso is a sparse regression technique whose L1 penalty shrinks some coefficients exactly to zero, so it selects features and regularizes the model at the same time. This reduces the risk of overfitting and is particularly useful when there are many features relative to the number of observations.
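As a rough illustration of idea 4, here is a minimal forward-selection sketch in Python (standard library only). It assumes each time series has already been reduced to one scalar per observation (for example its mean or peak value), giving a hypothetical feature matrix `F` with one row per observation. The least-squares model is only a cheap stand-in for the neural network, and all function names and the synthetic data are made up for this example. In MATLAB, `sequentialfs` from the Statistics and Machine Learning Toolbox performs this kind of search.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    No safeguard against singular matrices; fine for a sketch."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + r for r in X_train]
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * y for r, y in zip(rows, y_train)) for i in range(p)]
    w = solve(A, b)
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X_test]

def holdout_mse(F, y, cols, n_train):
    """Fit on the first n_train rows using only `cols`; return MSE on the rest."""
    pick = lambda rows: [[r[c] for c in cols] for r in rows]
    pred = fit_predict(pick(F[:n_train]), y[:n_train], pick(F[n_train:]))
    return sum((p - t) ** 2 for p, t in zip(pred, y[n_train:])) / len(pred)

def forward_select(F, y, n_select, n_train):
    """Greedily add the column that gives the lowest hold-out error."""
    selected, remaining = [], list(range(len(F[0])))
    while len(selected) < n_select:
        best = min(remaining,
                   key=lambda c: holdout_mse(F, y, selected + [c], n_train))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic stand-in data (assumption): 300 observations, 8 summary features,
# with the target built from columns 1 and 4 only.
random.seed(1)
F = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(300)]
y = [2.0 * f[1] - 1.5 * f[4] + random.gauss(0.0, 0.1) for f in F]
chosen = forward_select(F, y, n_select=2, n_train=200)
```

On this synthetic data the search should recover columns 1 and 4, the two the target was built from. With real data, a cross-validated error and the actual network (or a faster surrogate) would be the more sensible criterion than a single hold-out split.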
1 Comment
Michal Slezak on 30 Jun 2023
Thanks for the help, but I'm not sure that calculating a Pearson correlation between a time series and a single value is actually possible. I think it is only applicable to two (or more) series of data.
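On the point raised in this comment: one common workaround is to collapse each time series to one scalar per observation first. Across the roughly 3000 observations this gives two equal-length vectors (the summary values and the targets), and a Pearson coefficient between those is well defined. The sketch below, in Python with the standard library only, ranks channels that way. The choice of the mean as the summary, the function names, and the synthetic data are all assumptions for illustration; in MATLAB, `corrcoef` computes the same coefficient.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def rank_channels(data, target, summary):
    """Rank channels by |r| between a per-observation summary and the target.

    data[i][c] is the time series (list of floats) of channel c
    in observation i; target[i] is the scalar to be predicted.
    """
    n_channels = len(data[0])
    scores = []
    for c in range(n_channels):
        # One scalar per observation, so the vector has the same length as target.
        feature = [summary(obs[c]) for obs in data]
        scores.append((abs(pearson(feature, target)), c))
    return sorted(scores, reverse=True)

# Synthetic stand-in data (assumption): 200 observations, 5 channels,
# 50 samples per series.
random.seed(0)
target = [random.uniform(0.5, 2.0) for _ in range(200)]
data = []
for r in target:
    obs = [[random.gauss(0.0, 1.0) for _ in range(50)] for _ in range(5)]
    # Channel 2 is made to depend on the target: its mean level scales with r.
    obs[2] = [3.0 * r + random.gauss(0.0, 1.0) for _ in range(50)]
    data.append(obs)

mean = lambda series: sum(series) / len(series)
ranking = rank_channels(data, target, mean)
best_score, best_channel = ranking[0]
```

Here channel 2 should rank first, since it is the only one constructed to depend on the target. The mean is just one possible summary; peak value, range, or slope could be passed as `summary` instead, and a single summary may hide information that only a combination of summaries would reveal.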

Release

R2023a
