Finding longest subvector of equal elements in 5000000 length int-vector.

Question

Filip Berntsson el 23 de En. de 2021

1
Enlazar

Enlace directo a esta pregunta

https://la.mathworks.com/matlabcentral/answers/724663-finding-longest-subvector-of-equal-elements-in-5000000-length-int-vector

Comentada: Image Analyst el 23 de En. de 2021

Hi there!

Heads up: Technically this is a homework-problem, but I'm done and just want your thoughts on if the code could be faster/better.

The task is to find how many same-digit-sequences of different lengths that appear in the data. I.E. in data=[1,1,1,2,2,3,3,4,4] sequcens of length 3 appears once, sequences of length 2 appears thrice. So i want appearances=[0,3,1,0,...] (indices indicate length of sequence)

The data is typically a vector of length 5000000 and you don't know how long the longest sequences are (otherwise I'd probably use strfind and decimate the data). Further this is done in a context where the starting-index of the sequences might later become important, even though I don't extract them atm.

The code submitted below should only have to run over the data vector once to withdraw all neccesary information. But still it's at least two times slower than what I've got reason to belive that it could be.

Do you have any ideas on how to solve this task in a quicker/cleaner fashion? Would love to see your take on it. /FB

%find longest similar sequence in data
%data size typically 1x5000000 (first five million digits in pi or e for example)
data=[1,1,1,2,1,1,3,3,2,4,4,4,5,6,7,8,8,9,9,9,8,8,7,7,5,5,4,6,6,2,5,9,9,9,9,7,7,7,7,8,8,4,4,5,5,6,6,4,4,5,5,1,1,2,1,3,3,1];
appearances=zeros(10,1);  %count how many times sequences of specific lengths 1:10 appear.
%assumes that we wont find similar sequences longer than 10 such as 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 in data
i=2;       %counter s.t while approximate a for-loop. Not using for since can't manually change i within for-loop
%starting at 2 since looking at 2 numbers at a time to compare if equal or not
while i<length(data)
    isEqual=(data(i-1)==data(i)); %true if in a sequence of equal digits. 1,1 gives true, 1,2 gives false etc.
    
    switch isEqual
        case true
            %save where sequence started and keep incrementing while checking if equal
            start_i=i-1;
            while isEqual&&(i<length(data))
                i=i+1;
                isEqual=(data(i-1)==data(i));
            end            
            %calculate length of sequence and record as new appearance of that length
           
            end_i=i-1;
            sequenceLength=(end_i-start_i+1);
            appearances(sequenceLength)=appearances(sequenceLength)+1;
            
            
        case false
            %keep incrementing and look for sequence
            while~isEqual&&(i<length(data))
                i=i+1;
                isEqual=(data(i-1)==data(i));
            end
            
    end  
end
%to find out sequences of length one subtract the sum of all other
%sequences from length of data
appearances(1)=length(data)-sum(appearances);

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Matt J el 23 de En. de 2021

0
Enlazar

Enlace directo a esta respuesta

https://la.mathworks.com/matlabcentral/answers/724663-finding-longest-subvector-of-equal-elements-in-5000000-length-int-vector#answer_604283

Editada: Matt J el 23 de En. de 2021

Abrir en MATLAB Online

data=[1,1,1,2,2,3,3,4,4];
sequenceLengths = diff( find( diff([inf,data,inf]) ) ) ;
result = histcounts(sequenceLengths, 1:max(sequenceLengths)+1)
result = 1×3
     0     3     1

2 comentarios
Mostrar NingunoOcultar Ninguno

Filip Berntsson el 23 de En. de 2021

Super neat! Deals excellently with the large dataset. Now I just have to sit down and try and understand just why it works...

Thanks!

Filip Berntsson el 23 de En. de 2021

Now I get it. Clever!

Iniciar sesión para comentar.

Answer 2

Image Analyst el 23 de En. de 2021

0
Enlazar

Enlace directo a esta respuesta

https://la.mathworks.com/matlabcentral/answers/724663-finding-longest-subvector-of-equal-elements-in-5000000-length-int-vector#answer_604308

Abrir en MATLAB Online

Here's how I thought of to do it, using regionprops to count the length of each consequtive run of numbers:

data = [1,1,1,2,1,1,3,3,2,4,4,4,5,6,7,8,8,9,9,9,8,8,7,7,5,5,4,6,6,2,5,9,9,9,9,7,7,7,7,8,8,4,4,5,5,6,6,4,4,5,5,1,1,2,1,3,3,1]
uniqueData = unique(data)
% Preallocate an array for what we think the longest possible run length would be, say a 1000 or so
appearances = zeros(1000, 2);  %count how many times sequences of specific lengths 1:10 appear.
% Put lengths into column 2 for convenience for when we look at our output matrix.
appearances(:, 1) = [1 : size(appearances, 1)]';
for k = 1 : length(uniqueData)
	% Get this one unique number.
	thisValue = uniqueData(k);
	% Find regions with k
	props = regionprops(data == thisValue, 'Area');
	% Increment the count of each length of data.
	for k2 = 1 : length(props)
		appearances(props(k2).Area, 2) = appearances(props(k2).Area, 2) + 1;
	end
end
appearances % Show in command window.
% Crop off bins that has no runs of that length.
lastBin = find(appearances(:, 2) > 0, 1, 'last');
appearances = appearances(1:lastBin, :) % Show in command window.

You'll see the final result in the command window:

appearances =

1 11

2 15

3 3

4 2

So there were 11 sequences of length 1, 15 sequences of length 2, 3 sequences of length 3, and 2 sequences of length 4. Same result as Matt's (if you use the same vector) - it's just a different approach.

2 comentarios
Mostrar NingunoOcultar Ninguno

Filip Berntsson el 23 de En. de 2021

Unfortunately this code seems to take about 20 times longer than my initial submission. Am I missing something or does the toolbox need some sort of setup after installation for this to function properly?

Yours took about 19 seconds when dealing with the 5000000 length data set (whereas mine was about 0.9s, and Matt Js was about 0.13s). It's primarily regionprops that drains time (approx 80% of time).

Image Analyst el 23 de En. de 2021

Like you, I also thought Matt's answer was somewhat cryptic so I gave a more intuitive, easy to understand way. But if mine takes longer and speed is what you're after, then use Matt's approach.

Iniciar sesión para comentar.

Finding longest subvector of equal elements in 5000000 length int-vector.

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

2 comentarios
Mostrar NingunoOcultar Ninguno

Más respuestas (1)

2 comentarios
Mostrar NingunoOcultar Ninguno

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

Finding longest subvector of equal elements in 5000000 length int-vector.

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

2 comentarios Mostrar NingunoOcultar Ninguno

Más respuestas (1)

2 comentarios Mostrar NingunoOcultar Ninguno

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

2 comentarios
Mostrar NingunoOcultar Ninguno

2 comentarios
Mostrar NingunoOcultar Ninguno