How do I do Regression for multiple subjects
Mostrar comentarios más antiguos
Hi
This a stats question for regression analysis
I have data for 19 subjects who did 8 blocks of a task.
I want to see if data increase "significantly" across blocks.
Which option is correct (statistically speaking)
They give different results!!
Option 1: Fit every subject and then t-test on the coefficient of rate of change (slope)
coeff = [];
for s=1:19
fit = fitlm(1:8, data(s,:)) ;
coeff(s,1) = fit.Coefficients.Estimate(2) ;
end
[h,p,ci,stats] = ttest(coeff)
p = 0.1030
Option 2: Average across all subject and then run fitlm (on the group average) and use the stats in the fit LinearModel
M = mean(data) ;
fit = fitlm(1:8, M) ;
fit.Coefficients.pValue(2) = 0.155
thanks
pb
Respuestas (2)
Strictly speaking, you're supposed to know the test for hypothesis before designing and running the experiment.
The Q? here is what is the actual null hypothesis to test? If the hypotheis is one that there is a difference in difficulty between tasks, then the better test would be Friedman...if the question is whether there's a trend in difficulty versus test, that's different. Then it's a Q? of whether it's for the population as a whole or for each individual.
Without additional background on the test and objectives, I'd suggest Friedman.
data=readmatrix('data.txt');
stackedplot([data mean(data,2)])
friedman(data);
It is pretty low for significance, but you have no replications. As the bottom plot shows, there does appear to be a slight curvature in the average versus block although it would be difficult to show significance I suspect (I didn't try).
If the Friedman test were to turn out significant there are then post-hoc tests that can be utilized to identify which specifically differ.
I considered using friedman however there are no actual reeititions, consideering that there are 19 subjects and 8 tasks. However, the t-test is inappropriate, since the data appear not to be normally distributed. If none of the task scores can be zero or negative, then the lognormal distibution may be more appropriate.
A1 = readmatrix('data.txt')
figure
plot((1:size(A1,2)), A1)
grid
xlabel('Task')
ylabel('Sobject Score')
legend(compose('Subject #%2d', 1:size(A1,1)), Location='northoutside', NumColumns=4)
xlim('padded')
figure
boxchart(A1, Notch='on')
xlabel('Task')
ylabel('Sobject Score')
figure
boxplot(A1, Notch='on')
xlabel('Task')
ylabel('Sobject Score')
figure
tiledlayout(4,5)
for k = 1:size(A1,1)
nexttile
histfit(A1(k,:), 5, 'lognormal')
title(sprintf('Subject #%2d',k))
end
sgtitle('Lognormal Fit')
figure
tiledlayout(4,5)
for k = 1:size(A1,1)
nexttile
histfit(A1(k,:), 5, 'normal')
title(sprintf('Subject #%2d',k))
end
sgtitle('Normal Fit')
tmean = mean(A1) % Task Means
tmed = median(A1) % Task Medians
Looking at the median values (also displayed in the boxchart and boxplot figures), there does not appear to be any specific trend across tasks.
I am not certain what to suggest, however there does not appear to be anything significant across tasks in these results.
.
15 comentarios
You have 19 subjects each of whom did eight different tasks once. That's not replication unless some of the eight tasks are the same task presented again?
I don't think regression for slope is meaningful here; the task numbers aren't a suitable predictor variable.
Star Strider
hace alrededor de 18 horas
Movida: dpb
hace alrededor de 3 horas
I'm not certain what your oriignal design or intent is.
There doesn't appear to be any specific trend in the median values for each task block.
A1 = readmatrix('data.txt');
tmed = median(A1)
mdl = fitlm(1:numel(tmed), tmed)
figure
plot(mdl)
grid
.
dpb
hace alrededor de 3 horas
The null hypothesis is yet to be clearly stated. If the "block" is considered an ordinal sequence number and the hypothesis is that there is an overall increase in duration with time/repitition, then the question is whether the conclusion is whether the change is for the population as a whole or for any specific individual. Presuming the population, the answer is obvious, use a measure of the durations of the population. Given @Star Strider's observation the observations are not normally distributed, the median may be a better choice or applying a Box transformation first.
If one were to do each individually, one has the isse of deriving a suitable correction for the problem of significance inflation for multiple comparisons.
Pat
hace alrededor de 3 horas
As I noted, it depends upon whether you're wanting to draw a conclusion regarding the population of subjects or by subject...the latter option leading to the issue of multiple comparisons across however many subjects there may be to extend from each individual to the population.
Since this case has an equal number of observations (lacking any missing) for each subject, the average of the slopes of all individual regressions will be the same as the slope of all subjects fitted together.
However, there's another possible problem with using regression for this...it may turn out to not be an issue, but by observation of the data sample, at least some of the traces appear to have a shape and a quadratic may have zero slope but still have a significant curvature. Would such a trend that won't be discovered by using a trend line alone be an issue? As noted, probably won't happen, but it's possible.
Pat
hace alrededor de 2 horas
data=readmatrix('data.txt');
M = mean(data) ;
fit = fitlm(1:8, M)
X=repmat([1:8],height(data),1);
fit = fitlm(X(:),data(:))
N=height(data);
coeff=nan(N,2);
for s=1:N
fit = fitlm(1:8, data(s,:));
coeff(s,:)=fit.Coefficients.Estimate;
end
mean(coeff)
[~,p] = ttest(coeff(:,2))
[nnz(coeff(:,2))==height(coeff) all(coeff(:,2))>0]
As shown, the estimates are identical; the p-values are different owing to the DOF reduction in collapsing to the 8 observations with the means versus all data.
The t-test on the array of coefficients is not meaningful as there's no error variance of the quality of fit; I'm not at all certain how you must have gotten the number you posted initially.
Overall, I think the better approach is the mean.
Pat
hace 10 minutos
DOH! <Head slap!> I whiffed on somehow having not gotten the fit inside the loop, indeed.
It is a biased estimator for the aforementioned reasons, however; you've completely removed the likelihood of any being non-significant by keeping only the coefficient itself. Without the variability in the observations and thereby the uncertainty in the coefficients, the conclusion will always be there's a trend unless it turns out to be identically zero or totally random such that the mean turns out to be zero. It's not a meaningful test statistic for the stated hypothesis.
From augmenting the above comment we find that
[nnz(coeff(:,2))==height(coeff) all(coeff(:,2))>0]
returns True for both tests. The latter is a confirmatory indication of the hypothesis but Ihe uncertainty in the estimates of the coefficients themselves isn't reflected in that test statistic.
I still think for the composite group the mean estimator would be the better choice although I'm still not totally convinced it's the best approach--but otomh a better alternative isn't coming to me.
NOTA BENE:
Remember that the t-test for coefficients in fitlm is a 2-sided test that the estimate is non-zero. In your case, looking for the positive trend, that can be converted to a one-sided upper test. For the case using the mean fit above, that would return
>> p95U = 1-tcdf(1.6261,6)
p95U =
0.077527
>>
from local -- which is, of course, half the returned p value. That gets you in the ballpark with the desired result, anyways.
Pat
hace alrededor de 3 horas
"...the RT on a task "
I don't know what "RT" is?
"I don't think I removed "the likelihood of any being non-significant by keeping only the coefficient itself." ...I ran 1000 regression with random data"
Ah, but these aren't random such that the expected mean slope would be zero; these data are the result of a series of physical tests that were deliberately designed to be strenuous so that the effort to accomplish subsequent steps is virtually certain to be more difficult and thereby take longer (the measured variable). As the sample data shows, individuals can and do show occasional drops in one versus the prior since they can make a concentrated effort to do so, but that isn't possible to be kept up indefinitely. And, that makes the estimate biased to being >0, but the individual variability is such that the estimation (at least for this admittedly limited data set doesn't quite make the 95% level.
I still do not think that is an appropriate test statistic for the actual experiment.
dpb
hace 38 minutos
That's the problem with remotely trying to answer such; one almost never does get the necessary background to have sufficient information. The time to have consulted a statistician was during the design of the experiment such that the test to be applied would be known a priori.
In this case, my suggestion is you need to get that advice now rather than relying on a newsgroup or your own devices.
Categorías
Más información sobre Gaussian Process Regression en Centro de ayuda y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!







