nonlinear fit of experimental data
Dear MatLab Experts,
I would like to generate a nonlinear regression model to fit my experimental data 'Mk_Superf_FSF' as a function of the independent variables 'MaxFDiam' and 'MinFDiam', which are respectively the maximum and minimum diameters of an arbitrarily shaped, closed and connected 2D surface. I also have the variable 'Area', which is obviously correlated with the max and min diameters, so I think it is not wise to use it as well.
A linear fit was suggested to me for the experimental data (see attached picture). The 7th order polynomial p(x) fits the data very well but the suggested formula is non-physical. In fact, the variable used is a sum of quantities with different units:
x = MaxFDiam * MinFDiam + Area / MaxFDiam + Area / MinFDiam + MaxFDiam / MinFDiam
I cannot assign units to the resulting sum because the product of the two diameters has units [mm^2], whereas Area/MaxFDiam has units [mm] and the ratio of the two diameters is unitless.
I tried to fit a sum of two negative exponentials with the Area and the product of the diameters, respectively, in the exponents. MatLab complained that the Jacobian has a column of all zeros. I tried some other combinations of exponential functions. Again MatLab complained, stating that the model returns "NaN" or "Infinity".
At other times MatLab reported that the maximum number of iterations had been exceeded.
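One plausible cause of the all-zero Jacobian column, offered as a guess and not as a diagnosis of the actual run, is floating-point underflow: if the exponent contains an unscaled area or diameter product with a starting rate of order 1, the exponential and its derivative are numerically zero. The sketch below uses made-up predictor values (the names `A`, `b`, `col` and `colScaled` are hypothetical) to show the effect and how rescaling the predictor avoids it.

```matlab
% Hypothetical predictor values, NOT the data from this thread:
% areas of a few hundred mm^2.
A = [150; 300; 600; 900];
b = 1;                                   % assumed starting rate constant

% Jacobian column of the term exp(-b*A) with respect to b
col = -A .* exp(-b*A);

% Same derivative after rescaling the predictor to the range (0, 1]
As = A / max(A);
colScaled = -As .* exp(-b*As);

disp(max(abs(col)))        % numerically negligible
disp(max(abs(colScaled)))  % of order 0.1 to 1
```

Rescaling the predictors, or starting from a rate constant of roughly 1/max(A), keeps the derivative at a usable magnitude for the solver.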
I tried a power-law fit as follows:
coeffs0 = [0.8672 1 1]
opts = statset('fitnlm');
opts.RobustWgtFcn = 'bisquare';
X = [MaxFDiam' MinFDiam'];
mdlfun = @(coeff, X) coeff(1)* X(:,1).*X(:,2).^coeff(2) + coeff(3);
mdl = fitnlm(X,Mk_Superf_FSF',mdlfun, coeffs0,'Options', opts, 'CoefficientNames', {'a' , 'b', 'c'});
This time MatLab did not complain, but the resulting model is anything but good. The R^2 value is awful. The p-values are very high except for one.
mdl =
Nonlinear regression model:
    y ~ a*x1*x2^b + c

Estimated Coefficients:
         Estimate       SE         tStat        pValue
         ________    ________    ________    __________
    a    0.007407     0.02121     0.34922       0.73352
    b    -0.26657     0.87603    -0.30429       0.76658
    c     0.93603    0.048648      19.241    8.0919e-10

Number of observations: 14, Error degrees of freedom: 11
Root Mean Squared Error: 0.0363
R-Squared: 0.215, Adjusted R-Squared 0.0721
F-statistic vs. constant model: 1.51, p-value = 0.264
Maybe the model is not right. Maybe the initial parameter values are not good.....
I would greatly appreciate some help at getting a decent fit. Above all, I would like to learn techniques to:
(1) devise the model formula
(2) choose the initial parameter values
Thank you so much for any suggestion and help.
Best regards,
Maura E. M.
14 Comments
dpb
on 19 Aug 2019
Whassup w/the first observation? Looks like complete outlier.
The response versus min diameter also looks peculiar w/ a couple of points in the middle that are very far out of line with adjacent before/after.
The response versus the area variable is far more well behaved than either of the diameter variables with again, the exception of the first point just isn't even close to the rest.
I'm not at all surprised the model doesn't fit well...you talk of the linear combination factors model not being physical; what does the system represent and is there any physical correlation that it should follow to guide the model? That's always the best thing to have if there is anything one can use.
As far as units on the other expression, while it's not the question you asked nor necessarily the best way to fit, one simply assigns units to the coefficients as needed to match the independent variables such that the fitted response does have the right units. Granted, the resulting coefficients may have no real physical interpretation, but you can make the units work arbitrarily as well as the choice of terms in the fit.
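As an illustration of that bookkeeping, assume a model that is linear in the four terms of x. This is a simplification of the polynomial in x discussed above, and the coefficient names β0 to β4 are made up here. Each coefficient then carries whatever unit cancels its term:

```latex
\begin{align*}
y &= \beta_0
   + \beta_1\,(\mathrm{MaxFDiam}\cdot\mathrm{MinFDiam})
   + \beta_2\,\frac{\mathrm{Area}}{\mathrm{MaxFDiam}}
   + \beta_3\,\frac{\mathrm{Area}}{\mathrm{MinFDiam}}
   + \beta_4\,\frac{\mathrm{MaxFDiam}}{\mathrm{MinFDiam}} \\
[\beta_0] &= [y], \qquad
[\beta_1] = [y]\,\mathrm{mm}^{-2}, \qquad
[\beta_2] = [\beta_3] = [y]\,\mathrm{mm}^{-1}, \qquad
[\beta_4] = [y]
\end{align*}
```

Here [y] stands for the unit of the response, whatever it is. The units balance, but nothing about that makes the individual coefficients physically meaningful.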
Maura E. Monville
on 20 Aug 2019
I hadn't yet fitted anything; I was just exploring the dataset by plotting various ways...fitting blindly w/o visualizing first is fools' errand. I plotted against each of the independent variables (after sorting by the variable) and those were enough to make me ask some questions before trying to go further...here are a couple of the plots--


As noted, these are plotted by sorting on the independent variable and using the sorted index for the response variable. What seems peculiar with the min diameter is that it is more jagged, with a drop in the response for the two cases around 11-12. That is a very difficult detail to fit with precision and just raises questions as to whether it is or is not an artifact of the measurement or real.
This is an even more detailed look than I had done last night--what this shows is that the max diam and the area are nearly surrogates for each other altho one must remember in these "one at a time" plots the order isn't quite identical; this was done simply to visualize whether there was an apparent correlation with the independent variable of the desired predicted response.
I don't know the definition of the FSF nor how much precision can be presumed to be associated with the observation but with area there would appear to be a peak then a somewhat exponential decrease as area increases. That is pretty-much the gross shape with the two "diameters" and, of course, the area is going to be a function of those albeit given the arbitrary shape there's no direct simple relation there.
I suspect the problem here is that arbitrariness in the shape -- and even if you developed an almost perfect correlation from these data there would be no reason to believe it would hold for another set of observations for which the shapes weren't the same or very similar. Possibly it is that there is a unique feature there that distinguishes the two "funny" cases with the minimum diameters that isn't present in the rest of the samples.
I would wonder if other measures of geometry that try to represent the shapes in more categorical terms might produce better predictors -- like measures of curvature or lobes or such--maybe measures of perimeter might be an indicator of that difference from being just a circular opening, who knows. I'd probably study the outlines of those shapes against the response and see if I could pick out any pattern that seemed to correlate...
dpb
on 20 Aug 2019
Oh...intended to note. If, indeed, there is a reason that the first observation measurement is non-representative of the rest, then I'd throw it out of this dataset entirely--it's just confounding things to even look at it, what more with only 14 points an outlier has high influence on estimation.
That latter is another risk in this data and trying to read too much into it -- there really isn't that much data here and being more or less happenstance (in that the shapes while not truly arbitrary are what was needed by patient and not part of a design study of specific shapes, they may not be a very good choice for trying to estimate the effects).
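To make the point about influence concrete, here is a minimal sketch with synthetic numbers (not the measurements from this thread; `x`, `y`, `pAll` and `pKeep` are hypothetical names). A single low first observation among 14 points is enough to change the sign of a fitted slope.

```matlab
% Synthetic illustration only; these numbers are NOT the thread's data.
x = (6:19)';                      % 14 hypothetical predictor values
y = 1.00 - 0.002*(x - 6);         % gentle downward linear trend
y(1) = 0.85;                      % make the first observation an outlier

pAll = polyfit(x, y, 1);          % straight-line fit using all 14 points

keep = true(size(x));
keep(1) = false;                  % logical mask that drops the first point
pKeep = polyfit(x(keep), y(keep), 1);

fprintf('slope with outlier: %.4f, without: %.4f\n', pAll(1), pKeep(1));
```

Using a logical mask such as `keep` leaves the original vectors untouched, so the excluded point can still be plotted alongside the fit.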
One other re: max diameter and area correlation -- Rsq = 0.948 which is quite high; as evidenced by plot

There does appear to be possibly a slight upward curvature from a purely linear relationship, but there are only those two points past the middle range of six that have the most variability so it's possible it's more just an appearance the eye wants to make than real.
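For a straight-line fit with an intercept, R-squared equals the squared Pearson correlation, so a figure like the one quoted above can be checked with `corrcoef` alone. The vectors below are hypothetical stand-ins (`maxD`, `area`), not the thread's data, so the printed value will not be 0.948.

```matlab
% Hypothetical stand-ins; substitute the real MaxFDiam and Area vectors.
maxD = [8 10 12 14 16 18 20]';
area = [40 66 95 130 175 215 270]';

R = corrcoef(maxD, area);     % 2x2 correlation matrix
Rsq = R(1,2)^2;               % squared Pearson correlation

fprintf('Rsq = %.3f\n', Rsq);
```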
Maura E. Monville
on 21 Aug 2019
dpb
on 21 Aug 2019
I didn't say you could make the coefficients physically meaningful, only that you can assign arbitrary units to them such that the coefficient times the independent variable(s) ends up with units of the response. That's almost self evident as the response is in a given set of units so the prediction is reproducing those whatever the terms in the correlation are.
I also didn't say it was going to be easy (or even necessarily, possible) to develop a correlation given the data you have.
I would wonder which of the observations goes with which of the representative collimator shapes? I'd like to be able to compare those to which observation they generated.
Of those, there are two basic shapes, one basically an ellipsoid while the others are what I'd call a kidney-like shape as was the picture you sent earlier. A collection of those images with their associated response would be interesting.
I don't yet know whether can find a model that would predict these results or not -- probably could if made it specific enough with respect to each individual case but I still wonder if such would ever be of any value as far as drawing conclusions from regarding basic relationships.
Is it possible to take measurements with theoretical shapes without actually using patients so one could start with defined geometric shapes and then bring in the eccentricity factors? If so, I think I'd try to start with such a designed experiment where made very defined changes in shapes that are computable and classifiable and see if making changes there would produce predictable results. Then one could perturb those idealized shapes into approximations of the real ones and see how the results were affected. Just a thought--"you can't control what isn't controlled" and happenstance variables are the bane of statistics and modelling.
Maura E. Monville
on 21 Aug 2019
dpb
on 21 Aug 2019
1) The 6 mm is the outlier so it doesn't help. The 15 mm is right in the middle of the responses outside those that are the five or so that are the "peak" values. The curious thing would be to try to isolate why those are outstanding.
2) So the 11,12,13 do correlate with the same sequence in the original dataset? I'll have to study that some. Still think it would be worthwhile to line up the images with the response to observe side by side what shape yields what response that don't have enough data to do yet here.
3) A seventh-order poly with 14 (and really should just be 13) points is well over-fitted--the goodness of the fit will be coming as much from the fact the solution is constrained so much as from the chosen model actually representing the functional form. As noted, it's still possible the other shapes I haven't seen are different enough that there could be a categorical variable to incorporate as grouping variable rather than quantitative. May not be, too, but I'd want to investigate further that direction.
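One way to check for this kind of over-fitting is leave-one-out cross-validation: refit with each point held out in turn and compare the prediction error with the in-sample error. The sketch below uses synthetic data with a fixed pseudo-noise vector (not the thread's measurements; all names are hypothetical) and compares polynomial degrees 1, 3 and 7.

```matlab
% Synthetic data for illustration only; NOT the measurements from this thread.
x = (6:19)';
noise = 0.01*[1 -2 0 1 -1 2 0 -1 1 0 -2 1 2 -1]';  % fixed pseudo-noise, no RNG
y = 1.00 - 0.002*x + noise;

z = (x - mean(x))/std(x);         % standardize so high powers stay well conditioned
degrees = [1 3 7];
trainRmse = zeros(size(degrees));
looRmse = zeros(size(degrees));

for k = 1:numel(degrees)
    d = degrees(k);
    p = polyfit(z, y, d);
    trainRmse(k) = sqrt(mean((y - polyval(p, z)).^2));   % in-sample error

    err = zeros(size(y));
    for i = 1:numel(y)
        idx = true(size(y));
        idx(i) = false;                                  % hold out observation i
        pHold = polyfit(z(idx), y(idx), d);
        err(i) = y(i) - polyval(pHold, z(i));            % error on the held-out point
    end
    looRmse(k) = sqrt(mean(err.^2));
end

disp([degrees' trainRmse' looRmse'])   % columns: degree, in-sample RMSE, LOO RMSE
```

The in-sample error can only go down as the degree rises, so it says little on its own; a leave-one-out error that grows while the in-sample error shrinks is the usual sign that the extra terms are fitting noise.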
I've not had the time to dig into the additional info in the pdf files as yet...have to go do some personal errands at the moment; maybe tonight could get back and look some more. It is an interesting problem and am just beginning to get enough to have a clue about it....
Maura E. Monville
on 21 Aug 2019
Maura E. Monville
on 22 Aug 2019
You need to pull that .zip file and edit it to remove the actual patient data in the header before reloading it...
The data in the image would be a lot more useful if you could save that as text file to be read...would save a lot of typing and building of lookup tables...
Maura E. Monville
on 23 Aug 2019
dpb
on 23 Aug 2019
Ah! OK...I only opened and looked at the first one and presumed all the rest were the same...if weren't same number of header lines then I guess will need to do it again. Mayhaps that's why some seemed to have gaps in the perimeters...
Answers (1)
In your example you find a fit to a function of one variable, and are somehow looking for a combination of terms to form that one variable. Do you need to get it into this form or is it ok to have the predicted value, y be a function of two variables?
Assuming the latter, in case it is helpful I just tried a somewhat simplistic approach of considering the response to be a quadratic function of the two inputs MinFDiam and MaxFDiam.
Regarding motivation for choosing this form, I guess you could consider this to be a low-order Taylor series representation (one up from linear, which I tried and which didn't fit very well). I'm not sure of the precise mathematical statement of this, but the general notion is that over a small enough region any sufficiently smooth function is well approximated by just the low-order terms of its Taylor series, and in particular functions that show some curvature are well approximated by quadratics in a small enough region.
I am not familiar with using fitnlm so I just used fitlm as follows
x1 = MinFDiam(:)
x2 = MaxFDiam(:)
y = Mk_Superf_FSF(:)
mdl = fitlm([x1 x2 x1.*x2 x1.^2 x2.^2],y)  % five predictor columns; inside [], "x1.^2 +x2.^2" also parses as two columns
This gave the following statistics
mdl =
Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4 + x5

Estimated Coefficients:
                    Estimate          SE         tStat        pValue
                   ___________    __________    ________    __________
    (Intercept)        0.60854      0.075803       8.028    4.2585e-05
    x1                0.057919      0.025395      2.2808      0.052008
    x2               0.0015331      0.014779     0.10374       0.91993
    x3              0.00067369     0.0014413     0.46741       0.65268
    x4              -0.0024835     0.0012742     -1.9491      0.087118
    x5             -0.00031494    0.00071727    -0.43908       0.67222

Number of observations: 14, Error degrees of freedom: 8
Root Mean Squared Error: 0.0204
R-squared: 0.82, Adjusted R-Squared: 0.708
F-statistic vs. constant model: 7.3, p-value = 0.00746
Which does not seem too bad.
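For readers without the toolbox, the same six-term quadratic surface can be sketched with a plain design matrix and the backslash operator. The inputs below are synthetic and noise-free (not the thread's data), and `betaTrue` is a made-up coefficient vector used only to generate a response to fit.

```matlab
% Synthetic inputs for illustration only; NOT the data from this thread.
x1 = [6 8 9 10 11 12 13 14 15 16]';       % stand-in for MinFDiam
x2 = [12 11 15 18 14 20 17 22 19 25]';    % stand-in for MaxFDiam
betaTrue = [0.6; 0.05; 0.002; 0.0007; -0.002; -0.0003];  % made-up coefficients

% Full quadratic design matrix: intercept, linear, interaction, pure squares
X = [ones(size(x1)) x1 x2 x1.*x2 x1.^2 x2.^2];
y = X*betaTrue;                            % noise-free synthetic response

beta = X \ y;                              % ordinary least-squares solution
res = y - X*beta;                          % residuals
Rsq = 1 - sum(res.^2)/sum((y - mean(y)).^2);
```

With the Statistics and Machine Learning Toolbox, `fitlm([x1 x2], y, 'quadratic')` should build the same six-term model from the two raw predictors and report standard errors and p-values as well.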
16 Comments
Maura E. Monville
on 22 Aug 2019
Looking at these p values we can see that the terms that include MaxFDiam have high p values which suggests that changes in MaxFDiam do not produce a large response in Mk_Superf_FSF. In other words it appears that your output is primarily driven by just the one variable MinFDiam.
Following up on that I tried fitting Mk_Superf_FSF = c2*MinFDiam^2 + c1*MinFDiam + c0
This had a similar R-squared as the previous fit above. Which is not surprising as the terms that involved MaxFDiam had not contributed much to the fit.
Continuing then to think of this as just fitting to the one variable MinFDiam, I looked at adding a cubic term
Mk_Superf_FSF = c3*MinFDiam^3 + c2*MinFDiam^2 + c1*MinFDiam + c0
as follows, which gave quite a high R-squared and low p values. So perhaps this is a useful model. Sorry when I copy and paste the MATLAB output it seems to wrap, but please run it yourself to see it better.
x1 = MinFDiam(:)
x2 = MaxFDiam(:)
y = Mk_Superf_FSF(:)
mdl = fitlm([x1,x1.^2,x1.^3],y)
mdl =
Linear regression model:
    y ~ 1 + x1 + x2 + x3

Estimated Coefficients:
                    Estimate         SE          tStat       pValue
                   __________    __________    _______    __________
    (Intercept)       0.11948       0.06442     1.8547       0.09333
    x1                 0.2005      0.017762     11.288    5.1814e-07
    x2              -0.014639     0.0015451    -9.4745    2.6012e-06
    x3             0.00034586    4.2364e-05     8.1642    9.8515e-06

Number of observations: 14, Error degrees of freedom: 10
Root Mean Squared Error: 0.00677
R-squared: 0.975, Adjusted R-Squared: 0.968
F-statistic vs. constant model: 131, p-value = 2.53e-08
Looking deeper into this, I plotted the resulting fit together with the original data.
Subjectively, this looks "overfit" to me.
I would suggest (as I think did @dpb) that you need to check if the response for MinFDiam=6 is an outlier. It clearly drives the fit below. If we eliminated that one point, it looks like the output doesn't even depend on MinFDiam.
In general, if possible, it is best to have some form of theoretical model that gives you an equation with some unknown coefficients. Then just use the regression to fit the unknown coefficients.

Maura E. Monville
on 22 Aug 2019
dpb
on 23 Aug 2019
Can't disagree w/ Jon on overfitting and illustrates the problem of polynomials distinctly with the curvature at the RH end. The whole shape from the maximum area between 8-12 is an artifact of the 6 mm point being included (thought we had established earlier it didn't belong) and then building in enough curvature to hit the high values but also run thru near the mean of the other two groups...has to undershoot in the middle to curve back up to pick up the last.
Maura E. Monville
on 23 Aug 2019
Edited: dpb
on 23 Aug 2019
Jon
on 23 Aug 2019
At this point, is there a specific question or problem that you have with using MATLAB that someone can help you with?
For general questions about the theory and practice of fitting experimental data, there may be better resources (text books, other websites).
If you find a specific approach that you are trying to apply, but are having problems with how to implement that in MATLAB, I'm sure you will get lots of help here.
That being said, maybe someone is interested in working with your data set and seeing what kind of fits they can come up with. That would be great, but in terms of setting expectations, I think it is somewhat out of scope for MATLAB answers.
Maura E. Monville
on 23 Aug 2019
dpb
on 23 Aug 2019
I've had family emergency the past week so have had very little time to actually do more than glance at the updates since the first response above, sorry. I do find it an interesting topic (while not medical am NE by training so the radiation end of things is in bailiwick) and will try to find some time over the weekend to see if I can look into the questions that came to my mind simply from a shielding problem standpoint regarding areas/shapes. Plus, I'll have to have a little time to digest the nomenclature that's all new to me...
Can you identify anything at all from the measurements, patients, collimator, external conditions, sequence or conditions at time of measurements, etc., etc., etc., of the third through sixth measurements (excluding the 6 mm one so there are only 13 total, not 14)?
Outside of those, there is essentially a linear relationship with the min diameter--those four are significantly higher response than any of the others.
I've looked at where those are with respect to min/max and ratio of max/min and there's nothing there that really segregates those measurements but something is different with them than all the rest. Finding that probably uncontrolled corollary variable might be the answer to the whole enigma.
I've not had time yet to study the shape data...
Maura E. Monville
on 23 Aug 2019
dpb
on 23 Aug 2019
" please tell me the identifiers "AXXXXXX" of the 4 collimators whose FSF is the highest value according to your observations?"
Just do
find(Mk_Superf_FSF(2:end)>1.01)
If you've not resorted from the order in the .mat file(s), then you'll get the same (3:6) that I did. If you don't toss out the 6mm case that is the first, then they're 4:7
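The offset arises because `find` applied to a slice returns positions within the slice, not within the full vector. A small sketch with a made-up response vector (`resp` is hypothetical, with its first entry standing in for the 6 mm case):

```matlab
% Hypothetical response vector, NOT the thread's data.
resp = [0.85 0.99 0.98 1.02 1.03 1.02 1.04 0.98 0.97];

idxSub = find(resp(2:end) > 1.01);   % positions within resp(2:end)
idxFull = idxSub + 1;                % shift by one to index the full vector

disp(idxSub)
disp(idxFull)
```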
I was in the process of building the lookup table to answer the specifics and compare the shapes when I posted the above to (hopefully) not have to do the lookup by hand. :) besides the privacy issue.
What is the source configuration relative to the position and size of the collimator? Is it a point, line, distributed, ... source?
And, configuration of target/measurement?
What does the Monte Carlo simulation look like? It seems to have some general features altho it also appears to underpredict the larger magnitude responses. Is it sure thing that those measurements aren't somehow biased?
Maura E. Monville
on 24 Aug 2019
"I think the[r]e is a physics explanation for the higher measurements."
And well may be but I think it highly unlikely that explanation is in the variables controlled/measured here.(*)
The MC simulation misses those specific points by far more than the others in a consistent direction so whatever it is isn't included in that model, either.
(*) And note that even if you were successful at building a model by some magic transformation of variables or nonlinear curve-fitting stratagem that did manage to fit the observations from these measurements, to infer that it would be the physical reason behind the values would be a gross misrepresentation of such a fit even if you could make it happen with a set of coefficients with consistent units.
Maura E. Monville
on 9 Sep 2019