Efficient overlapping block processing with nested sub blocks (similar to im2col with stride)

Hello, I recently have been doing a side project to teach myself some more efficient and performance optimized MATLAB and I ran into the following problem. My solution feels "ugly" to me and I highly suspect that there is a better version. I am intentionally straying from builtin functions like im2col and blockproc but I definitely am not asking you to restrict your solutions especially if they are high performance.
Problem:
An example of the input to the problem is generated by the following code. It consists of square matrix with nxn blocks. I use ascending integers to make it easier to keep track of the format of the data.
% example input matrix A with nxn blocks
n = 2;
N = 2*n^2;
A = kron((1:n^2).',ones(1,2*N)) + (1:n^2:N^2) - 1;
[row,col] = ind2sub([n n],1:n^2);
A((row-1).*n+col,:) = A((col-1).*n+row,:);
A = col2im(A, [n n], [N N], 'distinct').'
A = 8×8
1 3 5 7 9 11 13 15 2 4 6 8 10 12 14 16 17 19 21 23 25 27 29 31 18 20 22 24 26 28 30 32 33 35 37 39 41 43 45 47 34 36 38 40 42 44 46 48 49 51 53 55 57 59 61 63 50 52 54 56 58 60 62 64
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
The desired output from the input is to take a sliding block of 2n x 2n with a stride of n in both the x and y directions. This generates a block with 4 subblocks of size n x n. I would like to generate a similar result of im2col, but with the caveat that the column result keeps the individual sub blocks together (as if I had taken im2col of each of the sub blocks and then stacked them into a single column). Normally im2col scans the matrix top to bottom and left to right, but the final desired requirement is that the columns come from a sweep from left to right and then top to bottom.
My solution is:
% intermediate matrix B with 2n patches with n stride
M = length(A(:,1)) + n;
A(end+1:M,end+1:M) = 0;
[x,y] = meshgrid(1:2*n);
[xs,ys] = meshgrid(0:n:(M-2*n));
idx = (reshape(y,4*n^2,[]) + ys(:)') + (M*reshape(x-1,4*n^2,[]) + M*xs(:)');
colB = A(idx);
% these 2 lines not part of my solution but may help visualize the
% intermediate step of expanding the matrix
szB = sqrt(numel(colB));
B = col2im(colB, [2*n 2*n], [szB szB], 'distinct');
% reshape and permute chain for desired format and ordering
C = reshape(colB,n,2,[]);
C = permute(C,[1 3 2]);
C = reshape(C, 4*n^2, n, []);
C = permute(C, [1 3 2]);
C = reshape(C, 4*n^2,[]) % final desired output
C = 16×16
1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 5 9 13 0 21 25 29 0 37 41 45 0 53 57 61 0 6 10 14 0 22 26 30 0 38 42 46 0 54 58 62 0 7 11 15 0 23 27 31 0 39 43 47 0 55 59 63 0 8 12 16 0 24 28 32 0 40 44 48 0 56 60 64 0 17 21 25 29 33 37 41 45 49 53 57 61 0 0 0 0 18 22 26 30 34 38 42 46 50 54 58 62 0 0 0 0 19 23 27 31 35 39 43 47 51 55 59 63 0 0 0 0 20 24 28 32 36 40 44 48 52 56 60 64 0 0 0 0 21 25 29 0 37 41 45 0 53 57 61 0 0 0 0 0 22 26 30 0 38 42 46 0 54 58 62 0 0 0 0 0 23 27 31 0 39 43 47 0 55 59 63 0 0 0 0 0
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
The final result is good for general n and the solution runs decently fast. However it just looks like there is still performance to squeeze out of this. I feel like there is a more direct method than constructing the intermediate matrix and then decomposing it. I also feel like there is a better reordering process than my reshape->permute->reshape->permute->reshape. After too many hours trying I am forced to concede but hopefully someone here can teach me something new. Thanks in advance.

1 comentario

Accidentally put "non" in the title of the post, to be clear the 2n x 2n blocks are overlapping.

Iniciar sesión para comentar.

 Respuesta aceptada

Stephen23
Stephen23 el 23 de Mzo. de 2026 a las 6:06
Editada: Stephen23 el 23 de Mzo. de 2026 a las 7:51
Your reshape/permute chain is working backwards: you're extracting in the "wrong" order and then correcting it. The cleaner approach is to build an index array that maps directly to the desired output layout in one shot, eliminating the intermediate B matrix and the entire reshape chain. The linear index of each output element is the sum of a local offset (position within a window, in subblock order) and a global offset (window position).
First lets create your example matrix in a more direct way:
n = 2;
N = 2*n^2;
nw = N/n;
A = reshape(permute(reshape(1:N^2, n,n, nw,nw), [1,4,2,3]), N,N)
A = 8×8
1 3 5 7 9 11 13 15 2 4 6 8 10 12 14 16 17 19 21 23 25 27 29 31 18 20 22 24 26 28 30 32 33 35 37 39 41 43 45 47 34 36 38 40 42 44 46 48 49 51 53 55 57 59 61 63 50 52 54 56 58 60 62 64
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
Now determine the required indices:
% Pad A just once:
Nn = N + n; % padded dimension
Ap = zeros(Nn,Nn,class(A));
Ap(1:N,1:N) = A;
% Local offsets within a 2n×2n window, laid out in subblock order:
% Column-major indices within one n×n subblock
rb = kron(ones(n,1), (0:n-1).'); % n^2 × 1 row within subblock
cb = kron((0:n-1).', ones(n,1)); % n^2 × 1 col within subblock
% Four subblocks: (TL, TR, BL, BR) -> (sr,sc) = (0,0),(0,1),(1,0),(1,1)
sr4 = [0;0;1;1];
sc4 = [0;1;0;1];
row_off = kron(n*sr4, ones(n^2,1)) + repmat(rb,4,1); % 4n^2 × 1
col_off = kron(n*sc4, ones(n^2,1)) + repmat(cb,4,1); % 4n^2 × 1
% Window positions: wx (column) varies fast -> left-to-right, top-to-bottom
[wx,wy] = ndgrid(0:nw-1, 0:nw-1); % wx fast, wy slow
row_win = n * wy(:).'; % 1 × nw^2
col_win = n * wx(:).'; % 1 × nw^2
% Single index expression via broadcast outer-sum
idx = (row_off+row_win) + Nn*(col_off+col_win) + 1; % 4n^2 × nw^2
C = Ap(idx)
C = 16×16
1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 5 9 13 0 21 25 29 0 37 41 45 0 53 57 61 0 6 10 14 0 22 26 30 0 38 42 46 0 54 58 62 0 7 11 15 0 23 27 31 0 39 43 47 0 55 59 63 0 8 12 16 0 24 28 32 0 40 44 48 0 56 60 64 0 17 21 25 29 33 37 41 45 49 53 57 61 0 0 0 0 18 22 26 30 34 38 42 46 50 54 58 62 0 0 0 0 19 23 27 31 35 39 43 47 51 55 59 63 0 0 0 0 20 24 28 32 36 40 44 48 52 56 60 64 0 0 0 0 21 25 29 0 37 41 45 0 53 57 61 0 0 0 0 0 22 26 30 0 38 42 46 0 54 58 62 0 0 0 0 0 23 27 31 0 39 43 47 0 55 59 63 0 0 0 0 0
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>

3 comentarios

Thank you for your quick and detailed answer, I will study and return to accept it. I'm going to attempt to maybe extend it just a bit further to the case where A is square but not necessarily 4*n^2. For instance, if I did A = repmat(A,3) as input instead then I think row_off and col_off would need to be extended to reach the full width of the matrix if I am reading it correctly.
I also note that in your rewriting of my example input that you permute with 4 dimensions. With 3 dimensions it is easy for me to visualize and predict the output of my permutation but with more than 3 my intuition is muted. Can you share how you came up with this permutation?
"With 3 dimensions it is easy for me to visualize and predict the output of my permutation but with more than 3 my intuition is muted"
Attributed to John von Neumann: "In mathematics you don't understand things. You just get used to them". Intuition only takes you so far, I don't think of arrays as having a physical representation (although they can). For me using PERMUTE is mostly about 1) keeping track of the parts/features of the array that you are trying to keep together, 2) keeping track of the dimension requirements (e.g. input order, output order, etc). Keeping a firm grasp on the order of data in memory is critical, because a lot of times (as used here) PERMUTE is just setting things up for RESHAPE.
So to answer your question "how you came up with this permutation?": 1) RESHAPE into useful blocks, 2) Identify/confirm the feaures/groups/whatever, 3) PERMUTE the data in those blocks to get them into the required memory order for the next step 4) RESHAPE into the ouput array.
n = 2;
nw = 2;
N = n*nw; % N = 4, N^2 = 16
V = 1:N^2
V = 1×16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
R = reshape(V, n,n, nw,nw) % RESHAPE into useful blocks
R =
R(:,:,1,1) = 1 3 2 4 R(:,:,2,1) = 5 7 6 8 R(:,:,1,2) = 9 11 10 12 R(:,:,2,2) = 13 15 14 16
We can use these blocks to map the required features to the dimensions: for the 1st column 1,2 are already along dim1, the next required elements 9,10 mean we require dim4. For the 2nd column we need 3,4 which requires dim2. That only leaves dim3 which goes at the end.
B = permute(R, [1,4,2,3]) % PERMUTE the blocks so that they suit memory order
B =
B(:,:,1,1) = 1 9 2 10 B(:,:,2,1) = 3 11 4 12 B(:,:,1,2) = 5 13 6 14 B(:,:,2,2) = 7 15 8 16
Note after PERMUTE we can already see the data is in the required order in memory: the elements are shown in order 1, 2, 9, 10, 3, 4, 11, 12, ... etc.
C = reshape(B, N,N) % RESHAPE to the required output size
C = 4×4
1 3 5 7 2 4 6 8 9 11 13 15 10 12 14 16
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
Experiment! Most likely it won't be correct on the first attempt.
Try smaller examples: my tip is to use dimensions that do not have common divisors.
Thanks for your insightful responses, I've learned a quite a bit here

Iniciar sesión para comentar.

Más respuestas (1)

Matt J
Matt J el 24 de Mzo. de 2026 a las 14:28
Editada: Matt J el 24 de Mzo. de 2026 a las 14:48
If the reordering is something you want to repeat on multiple matrices of the same size and block structure, then you can use your current method once, and recycle some intermediate results to make subsequent processing faster. This generalizes very easily to other dimensions and other kinds of reorderings:
clc,
n = 2;
N = 2*n^2;
A = kron((1:n^2).',ones(1,2*N)) + (1:n^2:N^2) - 1;
[row,col] = ind2sub([n n],1:n^2);
A((row-1).*n+col,:) = A((col-1).*n+row,:);
A = col2im(A, [n n], [N N], 'distinct').'*10
A = 8×8
10 30 50 70 90 110 130 150 20 40 60 80 100 120 140 160 170 190 210 230 250 270 290 310 180 200 220 240 260 280 300 320 330 350 370 390 410 430 450 470 340 360 380 400 420 440 460 480 490 510 530 550 570 590 610 630 500 520 540 560 580 600 620 640
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
tic; %pre-computation
[tmp,mask,idx]=Reordering(size(A),n);
toc;
Elapsed time is 0.003690 seconds.
tic; %subsequent passes
C=tmp;
C(mask)=A(idx);
toc;
Elapsed time is 0.001299 seconds.
C
C = 16×16
10 50 90 130 170 210 250 290 330 370 410 450 490 530 570 610 20 60 100 140 180 220 260 300 340 380 420 460 500 540 580 620 30 70 110 150 190 230 270 310 350 390 430 470 510 550 590 630 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600 640 50 90 130 0 210 250 290 0 370 410 450 0 530 570 610 0 60 100 140 0 220 260 300 0 380 420 460 0 540 580 620 0 70 110 150 0 230 270 310 0 390 430 470 0 550 590 630 0 80 120 160 0 240 280 320 0 400 440 480 0 560 600 640 0 170 210 250 290 330 370 410 450 490 530 570 610 0 0 0 0 180 220 260 300 340 380 420 460 500 540 580 620 0 0 0 0 190 230 270 310 350 390 430 470 510 550 590 630 0 0 0 0 200 240 280 320 360 400 440 480 520 560 600 640 0 0 0 0 210 250 290 0 370 410 450 0 530 570 610 0 0 0 0 0 220 260 300 0 380 420 460 0 540 580 620 0 0 0 0 0 230 270 310 0 390 430 470 0 550 590 630 0 0 0 0 0
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
function [C,mask,idx]=Reordering(dims,n)
A=reshape(1:prod(dims),dims);
M = length(A(:,1)) + n;
A(end+1:M,end+1:M) = 0;
[x,y] = meshgrid(1:2*n);
[xs,ys] = meshgrid(0:n:(M-2*n));
idx = (reshape(y,4*n^2,[]) + ys(:)') + (M*reshape(x-1,4*n^2,[]) + M*xs(:)');
colB = A(idx);
% reshape and permute chain for desired format and ordering
C = reshape(colB,n,2,[]);
C = permute(C,[1 3 2]);
C = reshape(C, 4*n^2, n, []);
C = permute(C, [1 3 2]);
C = reshape(C, 4*n^2,[]); % final desired output
mask=logical(C);
idx=C(mask);
end

1 comentario

Interesting, the original impetus of this block rearrangement was the realization that a couple for loops could be replaced by a precomputation of all these vectors. Constant memory calls into the original matrix was slowing down the process. So in my use case, this subblock processing is a single computation step that is not repeated but I definitely see the value in repeatability of the process.
The matrix A in my code is on the order of about 1500x1500 so with n=5 the colB array can have quite a large number of columns. One saving grace is that I only need the uppertriangular portion of the result so I can remove a lot of column indices.
Also I should thank you further as I remember digging through some of your block manipulation codes on the FEX prior to posting this question orginally.

Iniciar sesión para comentar.

Categorías

Productos

Versión

R2025b

Preguntada:

el 23 de Mzo. de 2026 a las 4:29

Comentada:

hace alrededor de 19 horas

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by