Vectorized Levenshtein distances between arrays of text labels?
12 views (last 30 days)
I have to compare N ID labels (several thousand) to each other in order to determine which are mistypings of each other. The labels have up to 20 characters. As a first approach, I am considering computing the N(N-1)/2 pairwise Levenshtein distances and then clustering to determine which labels correspond to the same ID. The work is being done in Python, but none of the Levenshtein distance implementations I have found are vectorized; that is, the NxN array of distances has to be filled in element by element.
I thought that there might be a vectorized MATLAB version of the Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as the "editDistance" function available in R2023b. None of these vectorize the calculation of the N(N-1)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
Annex: Matlab implementations of Levenshtein distance
2 comments
Stephen23
on 11 Jun 2024
"Am I missing something obvious?"
The lack of an easily vectorizable algorithm.
Accepted Answer
Nipun
on 13 Jun 2024
Hi FM,
I understand that you want to compute the Levenshtein distances between several thousand ID labels in a vectorized manner using MATLAB and then interface it with Python.
Here is MATLAB code that uses parallel processing to compute the Levenshtein distances more efficiently. The approach is not fully vectorized, given the nature of the algorithm, but it leverages MATLAB's parallel computing capabilities to speed up the computation.
function distanceMatrix = computeLevenshteinDistances(labels)
    % Number of labels
    N = numel(labels);
    % Initialize the distance matrix
    distanceMatrix = zeros(N, N);
    % Parallel loop over rows; each iteration fills one complete row so that
    % distanceMatrix is a valid sliced output variable inside parfor
    parfor i = 1:N
        row = zeros(1, N);
        for j = i+1:N
            row(j) = levenshtein(labels{i}, labels{j});
        end
        distanceMatrix(i, :) = row;
    end
    % Symmetrize: only the upper triangle was computed in the loop
    distanceMatrix = distanceMatrix + distanceMatrix.';
end
function d = levenshtein(s1, s2)
    % Calculate the Levenshtein distance between two strings
    m = length(s1);
    n = length(s2);
    % Initialize the distance matrix
    D = zeros(m+1, n+1);
    for i = 1:m+1
        D(i, 1) = i-1;
    end
    for j = 1:n+1
        D(1, j) = j-1;
    end
    % Compute the distances
    for i = 2:m+1
        for j = 2:n+1
            cost = (s1(i-1) ~= s2(j-1));
            D(i, j) = min([D(i-1, j) + 1, ...    % Deletion
                           D(i, j-1) + 1, ...    % Insertion
                           D(i-1, j-1) + cost]); % Substitution
        end
    end
    d = D(m+1, n+1);
end
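For reference, a minimal MATLAB-side usage sketch (the label values are made up; the parfor loop uses Parallel Computing Toolbox if it is available and otherwise runs serially):
% Usage sketch with made-up labels
labels = {'ABC123', 'ABC124', 'XYZ789'};
D = computeLevenshteinDistances(labels);
disp(D)   % 3-by-3 symmetric matrix of pairwise Levenshtein distances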
To integrate this MATLAB function with Python, you can use the MATLAB Engine API for Python. Here's an example of how to call the MATLAB function from Python:
import matlab.engine
import numpy as np
# Start the MATLAB engine
eng = matlab.engine.start_matlab()
# Example labels; a Python list of str is passed to MATLAB as a cell array of char
labels = ['label1', 'label2', 'label3', 'labelN']
# Call the MATLAB function
distance_matrix = eng.computeLevenshteinDistances(labels)
# Convert the returned matlab.double to a NumPy array
distance_matrix = np.array(distance_matrix)
# Stop the MATLAB engine
eng.quit()
# Print the distance matrix
print(distance_matrix)
To summarize:
- The MATLAB function computeLevenshteinDistances computes the Levenshtein distances between all pairs of labels using parallel processing.
- The levenshtein function calculates the distance between two strings.
- The Python script uses the MATLAB Engine API to call the MATLAB function and retrieve the distance matrix.
By using MATLAB's parallel computing capabilities, you can significantly speed up the computation of the Levenshtein distances for a large number of labels. I recommend using "parfor" instead of "for" loops to leverage parallel computation in MATLAB.
Refer to the following MathWorks documentation for more information on "parfor" in MATLAB: https://www.mathworks.com/help/parallel-computing/parfor.html
Hope this helps.
Regards,
Nipun
3 comments
Paul
on 13 Jun 2024
editDistance uses both recursion and arrayfun. It basically calls itself in a loop, and on each call it validates the input strings (which really only needs to be done once), not to mention the overhead of arrayfun (which I think has been discussed elsewhere on this forum) and all of the other checks it runs each time it's called. Maybe all of that overhead contributes to its slowdown by a factor of 70/2.2.
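For what it's worth, a rough timing sketch of that comparison (the strings are made up; it assumes both editDistance and the levenshtein helper from the accepted answer are on the path):
s1 = 'ABCDEFGHIJKLMNOPQRST';
s2 = 'ABCEDFGHIJKLMNOQPRST';
tBuiltin = timeit(@() editDistance(s1, s2));   % built-in, re-validates inputs on every call
tLocal   = timeit(@() levenshtein(s1, s2));    % bare dynamic-programming helper
fprintf('editDistance: %.3g s, local DP: %.3g s\n', tBuiltin, tLocal)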
More Answers (2)
Christopher Creutzig
on 14 Jun 2024
For this application, I recommend using editDistanceSearcher: you probably don't want to look at clusters larger than some size anyway, and knnsearch on an editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for amortized speed, as long as you call it often enough on the same dataset, which sounds like exactly what you want to do.
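A minimal sketch of that approach (the label values are made up; editDistanceSearcher and its knnsearch method are in Text Analytics Toolbox):
labels = ["ABC123" "ABC124" "ABD123" "XYZ789"];   % hypothetical ID labels
maxDist = 2;                                      % largest edit distance treated as a mistyping
eds = editDistanceSearcher(labels, maxDist);      % precomputes the search structure
% Three nearest neighbors of each label within the label set itself
% (the first neighbor of each label is the label itself, at distance 0)
[idx, d] = knnsearch(eds, labels, 'K', 3);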
7 comentarios
Christopher Creutzig
on 19 Jun 2024
There are many ways to define clusters. What exactly is the one you are looking for?
If your clusters are defined by “groups of words separated by edit distance,” i.e., you regard all those words as one cluster where you have stepping stones to get from A to B without any step having a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.
That is not data you would put into kmeans or kmedoids, of course. Such neighboring information transforms the clustering into a graph theoretical problem. You'd set up an undirected graph with the neighborhood information you have and ask for connected components (or block-cut components to avoid spurious connections from outliers).
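A sketch of that graph-based step, continuing from the editDistanceSearcher sketch above (labels, eds, and maxDist are the same made-up assumptions):
% Build an undirected graph from neighbor pairs and take connected components as clusters
nbrIdx = rangesearch(eds, labels, maxDist);        % neighbors within maxDist of each label
src = [];
tgt = [];
for k = 1:numel(labels)
    hits = nbrIdx{k}(:).';                         % neighbor indices as a row vector
    src = [src, repmat(k, 1, numel(hits))];        %#ok<AGROW>
    tgt = [tgt, hits];                             %#ok<AGROW>
end
G = simplify(graph(src, tgt, [], numel(labels)));  % drop duplicate edges and self-loops
clusterID = conncomp(G);                           % one cluster index per label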