Finding longest subvector of equal elements in 5000000 length int-vector.

4 visualizaciones (últimos 30 días)
Hi there!
Heads up: Technically this is a homework-problem, but I'm done and just want your thoughts on if the code could be faster/better.
The task is to find how many same-digit-sequences of different lengths that appear in the data. I.E. in data=[1,1,1,2,2,3,3,4,4] sequcens of length 3 appears once, sequences of length 2 appears thrice. So i want appearances=[0,3,1,0,...] (indices indicate length of sequence)
The data is typically a vector of length 5000000 and you don't know how long the longest sequences are (otherwise I'd probably use strfind and decimate the data). Further this is done in a context where the starting-index of the sequences might later become important, even though I don't extract them atm.
The code submitted below should only have to run over the data vector once to withdraw all neccesary information. But still it's at least two times slower than what I've got reason to belive that it could be.
Do you have any ideas on how to solve this task in a quicker/cleaner fashion? Would love to see your take on it. /FB
%find longest similar sequence in data
%data size typically 1x5000000 (first five million digits in pi or e for example)
data=[1,1,1,2,1,1,3,3,2,4,4,4,5,6,7,8,8,9,9,9,8,8,7,7,5,5,4,6,6,2,5,9,9,9,9,7,7,7,7,8,8,4,4,5,5,6,6,4,4,5,5,1,1,2,1,3,3,1];
appearances=zeros(10,1); %count how many times sequences of specific lengths 1:10 appear.
%assumes that we wont find similar sequences longer than 10 such as 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 in data
i=2; %counter s.t while approximate a for-loop. Not using for since can't manually change i within for-loop
%starting at 2 since looking at 2 numbers at a time to compare if equal or not
while i<length(data)
isEqual=(data(i-1)==data(i)); %true if in a sequence of equal digits. 1,1 gives true, 1,2 gives false etc.
switch isEqual
case true
%save where sequence started and keep incrementing while checking if equal
start_i=i-1;
while isEqual&&(i<length(data))
i=i+1;
isEqual=(data(i-1)==data(i));
end
%calculate length of sequence and record as new appearance of that length
end_i=i-1;
sequenceLength=(end_i-start_i+1);
appearances(sequenceLength)=appearances(sequenceLength)+1;
case false
%keep incrementing and look for sequence
while~isEqual&&(i<length(data))
i=i+1;
isEqual=(data(i-1)==data(i));
end
end
end
%to find out sequences of length one subtract the sum of all other
%sequences from length of data
appearances(1)=length(data)-sum(appearances);

Respuesta aceptada

Matt J
Matt J el 23 de En. de 2021
Editada: Matt J el 23 de En. de 2021
data=[1,1,1,2,2,3,3,4,4];
sequenceLengths = diff( find( diff([inf,data,inf]) ) ) ;
result = histcounts(sequenceLengths, 1:max(sequenceLengths)+1)
result = 1×3
0 3 1
  2 comentarios
Filip Berntsson
Filip Berntsson el 23 de En. de 2021
Super neat! Deals excellently with the large dataset. Now I just have to sit down and try and understand just why it works...
Thanks!

Iniciar sesión para comentar.

Más respuestas (1)

Image Analyst
Image Analyst el 23 de En. de 2021
Here's how I thought of to do it, using regionprops to count the length of each consequtive run of numbers:
data = [1,1,1,2,1,1,3,3,2,4,4,4,5,6,7,8,8,9,9,9,8,8,7,7,5,5,4,6,6,2,5,9,9,9,9,7,7,7,7,8,8,4,4,5,5,6,6,4,4,5,5,1,1,2,1,3,3,1]
uniqueData = unique(data)
% Preallocate an array for what we think the longest possible run length would be, say a 1000 or so
appearances = zeros(1000, 2); %count how many times sequences of specific lengths 1:10 appear.
% Put lengths into column 2 for convenience for when we look at our output matrix.
appearances(:, 1) = [1 : size(appearances, 1)]';
for k = 1 : length(uniqueData)
% Get this one unique number.
thisValue = uniqueData(k);
% Find regions with k
props = regionprops(data == thisValue, 'Area');
% Increment the count of each length of data.
for k2 = 1 : length(props)
appearances(props(k2).Area, 2) = appearances(props(k2).Area, 2) + 1;
end
end
appearances % Show in command window.
% Crop off bins that has no runs of that length.
lastBin = find(appearances(:, 2) > 0, 1, 'last');
appearances = appearances(1:lastBin, :) % Show in command window.
You'll see the final result in the command window:
appearances =
1 11
2 15
3 3
4 2
So there were 11 sequences of length 1, 15 sequences of length 2, 3 sequences of length 3, and 2 sequences of length 4. Same result as Matt's (if you use the same vector) - it's just a different approach.
  2 comentarios
Filip Berntsson
Filip Berntsson el 23 de En. de 2021
Unfortunately this code seems to take about 20 times longer than my initial submission. Am I missing something or does the toolbox need some sort of setup after installation for this to function properly?
Yours took about 19 seconds when dealing with the 5000000 length data set (whereas mine was about 0.9s, and Matt Js was about 0.13s). It's primarily regionprops that drains time (approx 80% of time).
Image Analyst
Image Analyst el 23 de En. de 2021
Like you, I also thought Matt's answer was somewhat cryptic so I gave a more intuitive, easy to understand way. But if mine takes longer and speed is what you're after, then use Matt's approach.

Iniciar sesión para comentar.

Categorías

Más información sobre Logical en Help Center y File Exchange.

Productos


Versión

R2020b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by