Optimizing Interpretability in Gaussian Process Regression Models: A Strategic Approach to Preprocessing and Testing Data

Hi
I am using the Regression Learner app to develop a model that corrects my raw data so that it can make accurate predictions. My question pertains more to the general usage of the tool.
1. When setting my input data, there is an option to reserve a portion of the data for testing. Does this process allocate the learning and testing data randomly, or does it do so sequentially, e.g., using the first few weeks of data for training and the remaining for testing?
2. I have discovered that Gaussian Process Regression (GPR) models yield the best results for my dataset. However, this type of model lacks interpretability. My inputs include Signal Data, Temperature, and Humidity.
If I wish to assess the individual impact of each input on the overall signal, in terms of applying a linear or polynomial correction before the GPR model processing, is this possible? By doing so, I can minimize the amount of data fed into the GPR model, which in turn might provide some interpretability for my overall modeling process.

Answers (1)

  1. Regression Learner partitions the test data randomly. In Classification Learner, the partition is random and stratified. (https://www.mathworks.com/help/stats/cvpartition.html). Stratification is based on the class labels. That is, an attempt is made to keep the class frequency similar in the training and test sets. If you want to control your test partition, you could (1) first partition your data into train and test outside of the Learner app, (2) load the training data into the Learner app at the session start dialogue, and (3) later load the separate test data into the Learner app.
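To make step (1) concrete, here is a minimal sketch of building the partition yourself at the command line before loading anything into the app. The table name `tbl` and the 80/20 split are assumptions for illustration; the data is assumed to be sorted by time.

```matlab
% Chronological hold-out: first 80% of rows train, the rest test.
n        = height(tbl);
nTrain   = floor(0.8 * n);
trainTbl = tbl(1:nTrain, :);        % e.g. the first few weeks
testTbl  = tbl(nTrain+1:end, :);    % the remaining weeks

% For comparison, a random hold-out (what the app does internally)
% can be built with cvpartition:
c         = cvpartition(n, "HoldOut", 0.2);
trainRand = tbl(training(c), :);
testRand  = tbl(test(c), :);
```

You would then load `trainTbl` at the session start dialogue and import `testTbl` later via the Test tab.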
  2. You can use model-agnostic interpretability techniques such as Partial Dependence Plot (PDP), Shapley, and LIME on your GPR models. In R2023b, you can use these techniques inside the Learner app using the "Explain" tab within the Learner app.
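The same techniques are also available at the command line once the model is exported from the app. The sketch below assumes the model was exported as `trainedModel` (the `RegressionGP` field name can vary with the export) and that a training table `trainTbl` is available.

```matlab
% Model-agnostic interpretability on an exported GPR model.
mdl  = trainedModel.RegressionGP;
pred = trainTbl(:, mdl.PredictorNames);   % predictor columns only

% Partial dependence of the prediction on Temperature
plotPartialDependence(mdl, "Temperature", pred);

% Shapley values explaining the first observation
s = shapley(mdl, pred);
s = fit(s, pred(1, :));
plot(s);

% LIME: a local linear surrogate around the same observation
l = lime(mdl, pred);
l = fit(l, pred(1, :), 3);   % explain with the 3 most important predictors
plot(l);
```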
If this answer helps you, please remember to accept the answer.
Example screenshot from the Regression Learner app, within the Explain tab, for a GPR model on fisheriris data:

4 Comments

Great, thanks
I have just upgraded to R2023b and noticed the additional tabs.
So that I understand partial dependence plots: how can I extract the data from one, so the signal could be conditioned before going into the model?
For example, this is the plot I get, with the x-axis being temperature.
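On extracting the data behind the plot: the `partialDependence` function returns the numbers that `plotPartialDependence` draws. A sketch, again assuming an exported model `trainedModel` and a training table `trainTbl`:

```matlab
% Get the partial dependence curve for Temperature as data.
mdl = trainedModel.RegressionGP;
[pd, xTemp] = partialDependence(mdl, "Temperature", trainTbl(:, mdl.PredictorNames));
% pd(i) is the average predicted response with Temperature fixed at xTemp(i).

% That curve can then be fit to derive a correction function, e.g. linear:
p = polyfit(xTemp(:), pd(:), 1);   % p(1) = slope, p(2) = intercept
```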
The partial dependence plot that you provided indicates that the predicted response generally rises as the temperature rises. Is this consistent with your expectations? Is this a sufficient model explanation for you, or are you looking for some other information from the model interpretability?
Can you provide more info about your overall goals for this analysis? It is not clear to me why you want to "adjust your RAW data" before passing it to the model. Perhaps this step is not needed. In general, if the model can produce a good regression result with the inputs as they are, perhaps that is simpler and more desirable. You mention two possible reasons: "By doing so, I can minimize the amount of data fed into the GPR model, which in turn might provide some interpretability for my overall modeling process."
(1) The first reason that you mention is to "minimize the amount of data fed into the GPR model". That sounds like you want to do dimensionality reduction, or feature selection. For that, you could use the PCA and/or feature selection capabilities within Classification Learner. If some of the current input predictors are not needed, those can be removed using feature selection, then you can rebuild, re-test, and re-interpret the model. If you choose to apply PCA, that is a way to reduce the dimensionality of the model (if you don't use all PCA components) in order to save computation, but it would generally make your model harder to interpret, because you would be feeding principal components into the model, rather than easily interpretable predictors like "Temperature". If you go down that road, perhaps you could use the model without PCA for model explainability (for those wanting to understand how the original predictors generally affect the model output), and use the model with PCA for efficiency (if you are reducing to fewer PCA components).
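Feature selection can also be explored outside the app before rebuilding the model. A sketch using univariate F-test ranking for regression (the response variable name `"Response"` is a placeholder):

```matlab
% Rank predictors by their individual importance for the response.
[idx, scores] = fsrftest(trainTbl, "Response");

predNames = setdiff(trainTbl.Properties.VariableNames, "Response", "stable");
ranked    = predNames(idx);    % most to least important predictor names
bar(scores(idx));              % visualize the ranking
```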
(2) The second reason that you mention is that a "linear or polynomial correction" might "provide some interpretability for my overall modeling process." I don't think this goal is needed, because you can already interpret the GPR model as it is.
For more background on using PDP within Classification Learner, here is a MathWorks video (this shows an earlier version of Classification Learner, before model explainability was on a separate tab): "Use Classification Learner App to Interpret Machine Learning Models with Partial Dependence Plots"
Okay, let me explain a little more about what I am trying to model, even though this post is more about the actual tool rather than the techniques I need to use. I am trying to model an NO2 sensor, which is affected by both temperature and humidity. Ideally, I would like to correct all temperature issues outside the GPR model, as the issue with temperature is more linear or polynomial, while humidity has more to do with patterns and rise/fall rates, which I would like to be exclusively handled by the GPR model until I fully understand the chemistry behind the sensor and its interaction with humidity.
Yes, as the temperature increases, my sensor, which I am trying to model, reduces its signal level; therefore, we need to increase the output of the model to compensate for this. What I was hoping to do was to correct this temperature effect using a linear correction, based on what is learned in this GPR model, as the output is very accurate. Meanwhile, if I were to use a linear model, my R-squared is very low, and therefore the temperature part of the correction might not be as accurate as the GPR is outputting.
So to answer your question about why I am adjusting the RAW data, it is because I want the GPR model to exclusively handle humidity-related issues only. Thus, I am trying to break the model into smaller blocks.
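The two-stage idea described above could be sketched at the command line as follows. All names here (`tbl`, `Target`, `HumidityRate`) are placeholders, and a simple linear temperature model is assumed:

```matlab
% Stage 1: fit and remove the (roughly linear) temperature effect.
tempMdl      = fitlm(tbl, "Target ~ Temperature");
tbl.Residual = tbl.Target - predict(tempMdl, tbl);

% Stage 2: let GPR model only what remains (humidity patterns, rate changes).
gprMdl = fitrgp(tbl, "Residual", ...
    "PredictorNames", ["Humidity", "HumidityRate"]);   % assumed names

% Prediction recombines both stages.
yhat = predict(tempMdl, tbl) + predict(gprMdl, tbl);
```

One caveat with this design: if temperature and humidity effects interact, splitting them into separate stages can lose accuracy relative to a single GPR over all inputs.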
So, what is PCA? Am I correct in understanding it is an optimization option that needs to be enabled? My inputs into my model, in addition to Signal, Temperature, and Humidity, include additional hourly rate changes from 1 hour to 120 hours, so there are over 100 inputs representing humidity rate changes. Including rate changes provides better results in the model. When I run PCA, it indicates only 5 inputs are used but does not specify which 5 inputs they are. Have I understood this option correctly?
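For context on the "which 5 inputs" question: PCA does not select 5 of the original inputs; it builds 5 new components, each a weighted mix of all inputs. The weights can be inspected at the command line. A sketch, assuming `X` is the numeric predictor matrix:

```matlab
% Principal component analysis on standardized predictors.
[coeff, ~, ~, ~, explained] = pca(zscore(X));

cumsum(explained)   % cumulative % of variance captured by the first k components

% Rows of coeff correspond to the original predictors, columns to components.
% Large-magnitude entries in coeff(:, 1:5) show which original inputs
% dominate the 5 retained components.
```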
I am using the Regression Learner app; would I also need to use the Classification Learner app? What would be the advantage?
In addition to my previous post, I have another question related to the Regression Learner app. Would I be better off using the app?
I have sensor data from June to November, which includes sensor signals, humidity, and humidity change rates.
I have divided my data into two tables: one for training and the other for testing. The training table covers data from June to September, while the testing table spans from September to November. These are the results I'm getting; as you can see, the training performance is very good, but the testing performance is fairly low.
Am I correct in understanding that for the Regression Learner app to produce accurate results, the training needs to be conducted with data that has a similar combination of variables as the test data or future incoming data? Is it possible to configure my data in such a way that the Regression Learner app comprehends more of the underlying principles during its training, rather than relying solely on absolute values?
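One way to quantify the gap described in the comments above is to compare training and test error directly on the chronological split. A sketch, with table and variable names assumed:

```matlab
% Compare error on the June-September training set vs the
% September-November test set.
rmseTrain = sqrt(mean((trainTbl.Target - predict(gprMdl, trainTbl)).^2));
rmseTest  = sqrt(mean((testTbl.Target  - predict(gprMdl, testTbl)).^2));

% A large gap suggests the training months do not cover the conditions
% seen later (distribution shift), rather than a misconfigured app setting.
```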


Version: R2023b

Asked: 3 Nov 2023
Commented: 6 Nov 2023
