How can I extract the time length (in milliseconds) between two audio signals?

I have a psychology experiment paradigm that asks participants to give a verbal response immediately after they hear a beep sound. Participants may or may not respond to the beep, and their response could be quick or slow. I need to extract the time length between the end of the beep sound and the start of their verbal response. This time length should be measured in milliseconds, as the total time allowed for each response was 3 seconds (3000 ms). There are hundreds of trials, so I would like to find a way to do the extraction automatically. How should I achieve this? Many thanks for any suggestions!

2 comments

Which toolboxes do you have available to use?
I recommend using the third-party Psychtoolbox for this kind of work.

Sign in to comment.

Answers (2)

Considering the nature of this problem, probably the best option is to estimate the signal envelopes with the Signal Processing Toolbox envelope function (use the 'peak' option with an appropriate window), decide on a threshold, and measure the time the envelope crosses the threshold.
It may be necessary to use a filter to eliminate noise. If you are using the lowpass function (or any of its relatives) for this, use the ImpulseResponse='iir' name-value pair for best results.
This approach has worked for me in the past.
It will probably be necessary to experiment to get the result you want.
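Condensed into a minimal sketch (synthetic data standing in for a recording; the window length and threshold are assumptions you would tune on real trials):

```matlab
% Onset detection via envelope + threshold (sketch, not a turnkey solution)
Fs = 44100;                                       % sampling frequency (Hz)
t  = (0:5*Fs-1).'/Fs;                             % 5-second time vector
x  = randn(size(t)) .* exp(-(t-2.2).^2*10);       % stand-in for a voice burst near t = 2.2 s
x  = lowpass(x, 4000, Fs, ImpulseResponse='iir'); % optional noise reduction
[envUp,~] = envelope(x, 1000, 'peak');            % upper signal envelope ('peak' option)
thr   = 0.25;                                     % detection threshold (set empirically)
onset = t(find(envUp > thr, 1, 'first'));         % first envelope crossing = response onset
```

The reaction time is then the difference between this onset and the (similarly detected) end of the beep.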

13 comments

Wade
Wade on 26 Oct 2025
Edited: Wade on 26 Oct 2025
Hi there,
Thank you so much for your answer. If I understand you correctly, I should find the peak of the beep sound and of the response voice, and then extract the length between the two peaks. I have no question about the former, as the beep sounds are identical and have the same frequency, envelope, etc. But I wonder whether this applies to the response voice signal, which may vary hugely and have different peaks. For example, if a participant offered an answer first and then immediately corrected that answer with a louder voice and higher frequency, the peak would fall on the latter part of the voice segment. But in that instance what I want to obtain is the distance between the beep peak and the first part of the voice segment, because it was the first utterance that reflects the reaction time. How should I solve this? Much appreciated!
My pleasure!
I would not use the peak values themselves; however, you do need to use the 'peak' option in your envelope call. If you want to know when the voice response begins, set a threshold and then determine the time the voice response envelope (I use the upper envelope here) first crosses that threshold.
Try something like this --
Fs = 44100; % Sampling Frequency (Hz)
L = 5;
t = linspace(0, Fs*L, Fs*L+1).'/Fs;
ts = seconds(t); % Time Vector ('duration' Here)
s = randn(size(t)) .* exp(-(t-2.2).^2*10); % Voice Response Signal
[et,eb] = envelope(s, 1000, 'peak'); % Use 'peak' Option
thrshld = 0.25; % Detection Threshold Value
tidx = find(diff(sign(et - thrshld))); % Approximate Indices of Threshold Crossing
idxrng = tidx(1)+[-1 0 1]; % Index Range For Interpolation
t_exact = interp1(et(idxrng), ts(idxrng), thrshld); % 'Exact' Value Of Upper Envelope Crossing Threshold Value
fprintf('\nResponse envelope crosses detection threshold level at %.3f seconds\n', seconds(t_exact))
Response envelope crosses detection threshold level at 1.690 seconds
figure
plot(ts, s, DisplayName='Response Signal')
hold on
plot(ts, [et eb], LineWidth=2, DisplayName="Envelope")
hold off
grid
xlabel("Time (s)")
ylabel("Amplitude")
yline(thrshld, '--k', "Detection Threshold", DisplayName='Detection Threshold')
xline(t_exact, '-.r', "Response Onset Time", DisplayName="Response Onset Time")
text(t_exact, 1.5, sprintf('%.3f s \\rightarrow',seconds(t_exact)), Horiz='right')
legend(Location='best')
Thanks a lot for the reply! I can roughly make sense of your code. But I have 3 questions:
  1. How should I determine the threshold?
  2. 1.690 s is the result in a window that starts from 0 s at the beep onset, but if another beep falls at a non-zero position on the x-axis, will the code return a time index or the actual duration?
  3. Do I need to do some pre-processing to remove noise? If so, how should I do that?
As a complete layman in audio signal processing, please forgive me if any of these questions looks stupid to you :)
My pleasure!
  1. I defined the threshold empirically here. There is usually some noise, even in a filtered signal, so the threshold needs to be greater than that level. Beyond that, the lowest value that gives the best results (the fastest detection) would be best. I doubt that there is a mathematical way to determine the best threshold.
  2. I do not fully understand your experiment. My code measures the time to voice response onset from the beginning of a specific record. It has no idea where the beeps are, so it simply returns the time to the voice response. (This is a simple example, and it could be made as comprehensive as necessary to give you the result you want.) If the beeps are recorded in the same record as the voice response, and all the beeps have the same frequency characteristics (ideally a single frequency), it would be relatively straightforward to separate them from the voice response and compute the times of the beeps and the time of the voice response separately. I would need representative data to explore this.
  3. I do not have a sample of your signal, so I cannot determine the noise characteristics. I usually use a Fourier transform of a signal to design the filter cutoffs, and determine the sort of filter I want (usually lowpass or bandpass).
I do not consider any questions to be 'stupid'! I will do my best to answer any that you have.
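One empirical approach to point 1 is to derive the threshold from a stretch of the record known to contain only background noise. This is a sketch, not part of the answer above; the 0.5-second silent lead-in and the factor of 5 are assumptions to adjust for real data:

```matlab
% Sketch: set the detection threshold from a known-silent segment
Fs = 44100;
t  = (0:5*Fs-1).'/Fs;
x  = 0.02*randn(size(t)) + randn(size(t)).*exp(-(t-2.2).^2*10); % noise floor + burst
[et,~]  = envelope(x, 1000, 'peak');    % upper envelope, as in the code above
noise   = et(1 : round(0.5*Fs));        % envelope over the assumed-silent first 0.5 s
thrshld = mean(noise) + 5*std(noise);   % noise floor plus a safety margin
```

Any threshold chosen this way should still be sanity-checked against a few trials by eye.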
Here's the sample audio file. I found I made a mistake in my initial description of the signal, so here's the correction: the stimulus was a spoken English number, the immediate response was a spoken Chinese number, and the final part was the beep sound reminding the participant to stop. These three sound segments made up a complete cycle (or trial, in psychological terms). Then another cycle follows immediately. I want to extract the time length between the first and the second voice segments. If the second segment does not exist (i.e., the participant failed to provide an answer), then extract the length between the first segment and the beep.
I don't know if it is feasible but I noticed that the stimulus and the response have drastically different frequencies and amplitudes. Maybe they could be separated according to freq/amplitudes?
Thank you for the file.
I am having a bit of a problem understanding the signal contents. There are three roughly 250 Hz-wide frequency bands, beginning at the low end at about 500, 2000, and 2500 Hz, according to the 'pspectrum' spectrogram plot. Since they overlap minimally, they can be separated from each other relatively efficiently by filtering, and then timed appropriately. (I filtered and plotted them individually. The bandpass filter cutoff frequencies can easily be changed as necessary.) What are they, and what should I do with them?
UZ = unzip('sample-1.zip')
UZ = 1×1 cell array
{'sample-1.wav'}
[s,Fs] = audioread(UZ{1});
L = size(s,1)
L = 867153
t = linspace(0, L-1, L).'/Fs;
figure
plot(t, s(:,1), DisplayName='Left Channel')
hold on
% plot(t, s(:,2), DisplayName='Right Channel')
% plot(t, s(:,1)-s(:,2), DisplayName='Channel Difference')
hold off
grid
legend(Location='best')
[FTs1,Fv] = FFT1(s(:,1),t);
figure
plot(Fv, abs(FTs1)*2)
grid
xlabel('Frequency (Hz)')
ylabel('Magnitude')
xlim([0 6]*1E+3)
[p,f,tps] = pspectrum(s(:,1), Fs, 'spectrogram');
figure
surfc(tps,f,p, 'EdgeColor','none')
colormap(turbo)
colorbar
xlabel('Time (s)')
ylabel('Frequency (Hz)')
zlabel('Magnitude')
title('''pspectrum'' spectrogram')
ylim([0 3E+3])
view(0,90)
s500 = bandpass(s(:,1), [250 1000], Fs, ImpulseResponse='iir');
s2000 = bandpass(s(:,1), [1800 2200], Fs, ImpulseResponse='iir');
s2500 = bandpass(s(:,1), [2500 2750], Fs, ImpulseResponse='iir');
figure
tiledlayout(3,1)
nexttile
plot(t, s500)
grid
xlabel('Time (s)')
ylabel('250 - 1000 Hz')
nexttile
plot(t, s2000)
grid
xlabel('Time (s)')
ylabel('1800 - 2200 Hz')
nexttile
plot(t, s2500)
grid
xlabel('Time (s)')
ylabel('2500 - 2750 Hz')
sgtitle('Bandpass-Filtered s(:,1)')
% Fs = 44100; % Sampling Frequency (z)
% L = 5;
% t = linspace(0, Fs*L, Fs*L+1).'/Fs;
% ts = seconds(t); % Time Vector ('duration' Here)
% s = randn(size(t)) .* exp(-(t-2.2).^2*10); % Voice Response Signal
abs1 = abs(s(:,1));
[et,eb] = envelope(abs1, 1000, 'peak'); % Use 'peak' Option
thrshld = 0.15; % Detection Threshold Value
tidx = find(diff(sign(et - thrshld))); % Approximate Indices of Threshold Crossing
for k = 1:numel(tidx)-1
idxrng = max(tidx(k)-1,1) : min(tidx(k)+1,L); % Index Range For Interpolation
t_exact(k) = interp1(et(idxrng), t(idxrng), thrshld); % 'Exact' Value Of Upper Envelope Crossing Threshold Value
% fprintf('\nResponse envelope crosses detection threshold level at %.3f seconds\n', seconds(t_exact))
end
figure
plot(t, s, DisplayName='Response Signal')
hold on
plot(t, [et eb], LineWidth=2, DisplayName="Envelope")
hold off
grid
xlim([0 5])
xlabel("Time (s)")
ylabel("Amplitude")
yline(thrshld, '--k', "Detection Threshold", DisplayName='Detection Threshold')
% xline(t_exact, '-.r', "Response Onset Time", DisplayName="Response Onset Time")
% text(t_exact, 1.5, sprintf('%.3f s \\rightarrow',seconds(t_exact)), Horiz='right')
% legend(Location='best')
function [FTs1,Fv] = FFT1(s,t)
% One-Sided Numerical Fourier Transform
% Arguments:
% s: Signal Vector Or Matrix
% t: Associated Time Vector
t = t(:);
L = numel(t);
if size(s,2) == L
s = s.';
end
Fs = 1/mean(diff(t));
Fn = Fs/2;
NFFT = 2^nextpow2(L);
FTs = fft((s - mean(s)) .* hann(L).*ones(1,size(s,2)), NFFT)/sum(hann(L));
Fv = Fs*(0:(NFFT/2))/NFFT;
% Fv = linspace(0, 1, NFFT/2+1)*Fn;
Iv = 1:numel(Fv);
Fv = Fv(:);
FTs1 = FTs(Iv,:);
end
I guess the three frequency ranges correspond to the three sound segments: the stimulus (an English number made with text-to-speech software), the response (real human voice, from one and the same person), and a beep cut and pasted from a third-party audio file. As they came from different sources and were not processed to match in frequency, I think that may explain the problem.
I do not believe there is a problem. The signals can easily be separated by filtering them, and that is a significant advantage.
This is the best I can do with your data. The code is unfortunately fragile because of the nature of the signals, and while it should work with other records, it may not, without some tweaking.
I am not certain what the data actually are, and what you want to do with them.
The start and stop times of the segments are in the tables, however only the start times are plotted.
Try this --
UZ = unzip('sample-1.zip')
UZ = 1×1 cell array
{'sample-1.wav'}
[s,Fs] = audioread(UZ{1});
L = size(s,1)
L = 867153
t = linspace(0, L-1, L).'/Fs;
figure
plot(t, s(:,1), DisplayName='Left Channel')
hold on
% plot(t, s(:,2), DisplayName='Right Channel')
% plot(t, s(:,1)-s(:,2), DisplayName='Channel Difference')
hold off
grid
legend(Location='best')
[FTs1,Fv] = FFT1(s(:,1),t);
figure
plot(Fv, abs(FTs1)*2)
grid
xlabel('Frequency (Hz)')
ylabel('Magnitude')
xlim([0 6]*1E+3)
[p,f,tps] = pspectrum(s(:,1), Fs, 'spectrogram');
figure
surfc(tps,f,p, 'EdgeColor','none')
colormap(turbo)
colorbar
xlabel('Time (s)')
ylabel('Frequency (Hz)')
zlabel('Magnitude')
title('''pspectrum'' spectrogram')
ylim([0 3E+3])
view(0,90)
s500 = bandpass(s(:,1), [250 750], Fs, ImpulseResponse='iir');
s2000 = bandpass(s(:,1), [1900 2100], Fs, ImpulseResponse='iir');
s2500 = bandpass(s(:,1), [2600 2700], Fs, ImpulseResponse='iir');
smtx = [s500 s2000 s2500];
figure
tiledlayout(3,1)
nexttile
plot(t, s500)
grid
xlabel('Time (s)')
ylabel('250 - 750 Hz')
nexttile
plot(t, s2000)
grid
xlabel('Time (s)')
ylabel('1900 - 2100 Hz')
nexttile
plot(t, s2500)
grid
xlabel('Time (s)')
ylabel('2600 - 2700 Hz')
sgtitle('Bandpass-Filtered s(:,1)')
ttlmtx = ["250 - 750 Hz", "1900 - 2100 Hz", "2600 - 2700 Hz"];
figure
tiledlayout(3,1)
for k1 = 1:size(smtx,2)
[et,eb] = envelope(smtx(:,k1), 4500, 'peak'); % Use 'peak' Option
thrshld = max(abs(smtx(:,k1)))*0.6; % Detection Threshold Value
tidx = find(diff(sign(et - thrshld))); % Approximate Indices of Threshold Crossing
% for k2 = 1:numel(tidx)-1
% idxrng = max(tidx(k2)-1,1) : min(tidx(k2)+1,L); % Index Range For Interpolation
% t_exact(k2,:) = interp1(et(idxrng), t(idxrng), thrshld) % 'Exact' Value Of Upper Envelope Crossing Threshold Value
% tseg = t(idxrng)
% % fprintf('\nResponse envelope crosses detection threshold level at %.3f seconds\n', seconds(t_exact))
% end
% disp(t_exact)
% t_exact2 = t_exact(1:floor(numel(t_exact)/2)*2)
% t_exactr = reshape(t_exact2.', 2, []).'
% Tss{k1} = array2table(t_exactr, VariableNames=["Segment Start","Segment End"])
tidx2 = reshape(tidx, 2, []).';
dmt2 = 1./diff([0; tidx2(:,1)]);
Lv = isoutlier(dmt2,'movmedian',4); % Find & Eliminate 'Double Start' Entries
tidx2 = tidx2(~Lv,:);
sstimesr = t(tidx2);
Tss{k1} = array2table(sstimesr, VariableNames=["Segment Start","Segment End"]);
nexttile
plot(t, smtx(:,k1), DisplayName='Response Signal')
hold on
plot(t, [et eb], LineWidth=1.5, DisplayName="Envelope")
hold off
grid
% xlim([0 5])
xlabel("Time (s)")
ylabel("Amplitude")
title(ttlmtx(k1))
yline(thrshld, '--k', "Detection Threshold", DisplayName='Detection Threshold')
xline(sstimesr(:,1), '-m')
ylim(ylim+[-1 1])
end
Tss{:}
ans = 5×2 table
Segment Start    Segment End
_____________    ___________
      0.83998        0.98923
       4.5582         4.7937
       8.3503         8.5734
       12.085         12.156
       16.094         16.291
ans = 5×2 table
Segment Start    Segment End
_____________    ___________
       3.8804         4.2529
       7.6504         8.0385
       11.475         11.865
       15.249         15.672
       19.167         19.573
ans = 5×2 table
Segment Start    Segment End
_____________    ___________
       3.4014         3.5273
       7.1484         7.2903
       11.036         11.131
       14.759         14.927
       18.698         18.825
function [FTs1,Fv] = FFT1(s,t)
% One-Sided Numerical Fourier Transform
% Arguments:
% s: Signal Vector Or Matrix
% t: Associated Time Vector
t = t(:);
L = numel(t);
if size(s,2) == L
s = s.';
end
Fs = 1/mean(diff(t));
Fn = Fs/2;
NFFT = 2^nextpow2(L);
FTs = fft((s - mean(s)) .* hann(L).*ones(1,size(s,2)), NFFT)/sum(hann(L));
Fv = Fs*(0:(NFFT/2))/NFFT;
% Fv = linspace(0, 1, NFFT/2+1)*Fn;
Iv = 1:numel(Fv);
Fv = Fv(:);
FTs1 = FTs(Iv,:);
end
If you want to extract the time difference, simply subtract one set of start times from another.
I am still not certain what differences you want to compute.
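For example, assuming the first table (Tss{1}) holds the stimulus segments and the second (Tss{2}) the responses, with one row per trial in each (an assumption — verify which band is which by listening first), the per-trial gap could be computed like this; the numbers below are taken from the tables above:

```matlab
% Hypothetical example: stimulus and response segment times, one row per trial
Tstim = array2table([0.83998 0.98923; 4.5582 4.7937], ...
    VariableNames=["Segment Start","Segment End"]);
Tresp = array2table([3.8804 4.2529; 7.6504 8.0385], ...
    VariableNames=["Segment Start","Segment End"]);
% Reaction time per trial: response onset minus stimulus offset, in ms
RT_ms = (Tresp.("Segment Start") - Tstim.("Segment End")) * 1000;
```

With the real data you would substitute Tss{…} for the hand-built tables, after handling trials with no response.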
I looked up the numbers in the table and compared them with the actual audio signal, and found that they do not correspond. For example, in the table the first segment starts at 0.83998 seconds and ends at 0.98923 seconds, but the actual data start at 0.619 seconds (see below).
For the other segment boundaries, the values also did not correspond to the actual data. What might be the problem here?
(My computer crashed and it took a few minutes to get it back up. This is unusual for Ubuntu, so I have to see what caused it.)
There is no actual 'problem' with my code. It has to use a non-zero threshold ('Detection Threshold') to detect the onset of a signal segment, because of noise in the signal that it is not possible to eliminate completely. The detection threshold has to be low enough to detect the onset of a signal, and high enough to not detect noise as a false-positive. The 'Detection Threshold' is calculated from the signal characteristics for each signal, and has to be the same value for the entire signal in order to trust the results.
This is the same with the bandpass filters. It might be possible to narrow the passbands considerably, however that risks eliminating possibly necessary information from the filtered output.
This is the problem with real-world data -- it never behaves the way I want it to, so I can never produce the ideal result. I have done extensive biomedical signal processing, and noise and unwanted signal characteristics are always a problem. The best I can ever hope for is consistency, so that the derived data actually make some sense.
There is never an ideal solution to real-world problems. There are always compromises.
Since you've separated the audio into three different signal fragments according to their frequency, I wonder if there is a way to play each of them so that I can double-check which is which.
You would have to run my code to separate the signals, do the filtering, and then listen to each one separately.
This only works with Google Chrome with MATLAB Online (I will not use Google Chrome), so I ran it on my desktop instead.
This works --
wavfile = websave('sample-1.zip','https://www.mathworks.com/matlabcentral/answers/uploaded_files/1842543/sample-1.zip')
UZ = unzip(wavfile)
[s,Fs] = audioread(UZ{1});
L = size(s,1)
t = linspace(0, L-1, L).'/Fs;
s500 = bandpass(s(:,1), [250 750], Fs, ImpulseResponse='iir');
s2000 = bandpass(s(:,1), [1900 2100], Fs, ImpulseResponse='iir');
s2500 = bandpass(s(:,1), [2600 2700], Fs, ImpulseResponse='iir');
% sound(s500, Fs) % Voice
% sound(s2000, Fs) % Squeak
% sound(s2500, Fs) % Squeak
That should work as written. (I just tested it.) I commented out the sound calls. When you run it, un-comment them one at a time to listen to that particular vector.
The two that I labelled 'Squeak' sound similar to me, although they are obviously different in the pspectrum 'spectrogram' plot (they are not much different in frequency). I do not recognize much in the 'Voice' vector.
I also experimented with several different ways of finding the envelope (using a lowpass filter) and of finding the beginning of the signal (finding the peak and then finding the last lowest value of the preceding 10E+3 index range). None of those worked satisfactorily because of the noise in the signal.
These data are extremely difficult to work with, largely because I rarely work with speech signals, only with signals from various sorts of biomedical instrumentation.

Sign in to comment.

Version

R2021a

Asked:

on 25 Oct 2025

Commented:

on 30 Oct 2025
