How do I find the original indices of a text array after adding new elements?

Art on 30 Sept. 2025 at 0:00
Commented: dpb about 21 hours ago
I'm working on a project that reads a text file and searches for user-defined phrases. At first glance this sounds rather easy, but because of complications such as phrases wrapping across lines (with newlines between search words) or multiple spaces between words, I found I have to search in several different ways.
I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search. I keep a record of the indices of added spaces and removed newlines, all relative to the updated text array. I then map each index of the updated array to the index of the original text array using two loops (one for the added spaces, one for the removed newlines). Note that I first remove any newlines, then add spaces to create the updated search text. I believe the order is important.
I use regexp to search the updated text and return findings and starting indices. These starting indices of findings can then be easily mapped to the original text location, which is what I need to output.
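As a small illustration of that last step, here is a sketch of looking up match positions through a map vector. The sample text, the pattern, and the names starts and origStarts are made up for the example; SearchTextMap is assumed to already hold, for each character of the updated text, its index in the original text (NaN for an added space).

```matlab
% Hypothetical sample: the original text has a CR/LF pair between "quick" and "brown"
OriginalText  = ['the quick' char(13) newline 'brown fox'];   % 20 characters
% Updated text: both line-break characters removed, one space added at index 10
SearchText    = 'the quick brown fox';                        % 19 characters
% Assumed map from updated-text index to original-text index
SearchTextMap = [1:9, NaN, 12:20];

% Search the updated text; \s+ tolerates one or more whitespace characters
[matches, starts] = regexp(SearchText, 'brown\s+fox', 'match', 'start');

% Translate each match start back to its position in the original text
origStarts = SearchTextMap(starts);
```

Here the match starts at index 11 of the updated text, and the map points it back to index 12 of the original text, which is the "b" of "brown".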
My code works, but because of the loops, the initial index mapping takes a very long time for large text files (more than 20 minutes for a 1 MB file).
I'm hoping someone can help me figure out how to do the array mapping without the loops, maybe with arrayfun or something else.
Here are the relevant mapping loops. Note that SearchText, OriginalText, AddedSpaces and DeletedNewLines are inputs from the calling function.
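% Start with an identity map: one entry per character of the updated text
SearchTextMap = 1:length(SearchText);
% For each added space, subtract 1 from the map from that position to the end
for spaceInd = 1:length(AddedSpaces)
    AddedSpaceInd = AddedSpaces(spaceInd);
    SearchTextMap(AddedSpaceInd:end) = SearchTextMap(AddedSpaceInd:end) - 1;
end
% For each deleted newline, add 1 to the map from the matching entry to the end
for newlineInd = 1:length(DeletedNewLines)
    DeletedNewLineInd = DeletedNewLines(newlineInd);
    SearchTextIndex = find(SearchTextMap == DeletedNewLineInd);
    SearchTextMap(SearchTextIndex:end) = SearchTextMap(SearchTextIndex:end) + 1;
end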
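For what it's worth, the first loop only subtracts, at each position, the number of added spaces at or before that position, so it can be written as a cumulative sum. The following is a minimal sketch with made-up sample inputs, assuming AddedSpaces holds valid indices into SearchText; it covers the first loop only, not the newline loop, and the names n, counts, MapVectorized and MapLoop are hypothetical.

```matlab
% Hypothetical sample inputs
SearchText  = 'ab  cd e';
AddedSpaces = [3 7];                 % indices into SearchText
n = numel(SearchText);

% Count the added spaces at each position, then accumulate along the text
counts = accumarray(AddedSpaces(:), 1, [n 1]);
MapVectorized = (1:n) - cumsum(counts).';

% The loop from the question, kept here only for comparison
MapLoop = 1:n;
for spaceInd = 1:numel(AddedSpaces)
    AddedSpaceInd = AddedSpaces(spaceInd);
    MapLoop(AddedSpaceInd:end) = MapLoop(AddedSpaceInd:end) - 1;
end

sameResult = isequal(MapVectorized, MapLoop);   % compare the two maps
```

The loop rewrites the tail of the array once per added space, whereas the cumulative sum makes a single pass over the text, which is where the time saving would come from.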
Any help would be greatly appreciated.
20 comments
Art about 3 hours ago
OK, good points above. I looked into \< and \>, but I'm not sure they'd work for all user input cases. I'll check them out further.
After trying to understand how to compute the location offsets as stated above, I came up with this (after plumbing through some additional variables like OriginalText):
% Define an initial array the size of the original text array, with each original index listed in order:
OrigTextMap = 1:length(OriginalText);
% Remove any deleted newline indices:
OrigTextMap(DeletedNewLines) = [];
% Initialize the map array to the length of the updated text array (after any newlines are removed and spaces added):
TextMap = zeros(size(SearchText));
% Set all TextMap entries at added-space indices to NaN:
TextMap(AddedSpaces) = NaN;
% Set the remaining entries of TextMap to the indices in OrigTextMap:
NotNanInds = ~isnan(TextMap);
TextMap(NotNanInds) = OrigTextMap;
I believe this does what I need without the loops; I just have to account for the NaNs. Thanks for helping me walk through the logic!
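One possible way to account for the NaNs is sketched below on a tiny made-up example. It assumes DeletedNewLines indexes into OriginalText and AddedSpaces indexes into SearchText, and it uses fillmissing with the 'next' method so that an added space points at the next real character; TextMapFilled is a hypothetical name, and this is just one choice among several.

```matlab
% Hypothetical sample data
OriginalText    = ['ab' newline 'cd'];   % 5 characters, newline at index 3
DeletedNewLines = 3;                     % index into OriginalText
SearchText      = 'ab cd';               % newline removed, then a space added
AddedSpaces     = 3;                     % index into SearchText

% Original indices that survive after the newlines are removed
OrigTextMap = 1:numel(OriginalText);
OrigTextMap(DeletedNewLines) = [];       % [1 2 4 5]

% Mark the added spaces, then fill every other slot with an original index
TextMap = zeros(size(SearchText));
TextMap(AddedSpaces) = NaN;
TextMap(~isnan(TextMap)) = OrigTextMap;  % [1 2 NaN 4 5]

% Replace each NaN with the original index of the next real character
TextMapFilled = fillmissing(TextMap, 'next');   % [1 2 4 4 5]
```

With 'next', an added space at the very end of the text would stay NaN, so that case would still need separate handling; 'previous' has the mirror-image issue at the start.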
dpb about 2 hours ago
So how much of a time reduction have you accomplished so far...inquiring minds, and all that! <g>


Answers (0)

Release

R2021b
