Vectorized Levenshtein distances between arrays of text labels?
I have to compare N ID labels (several thousand) to each other in order to determine which are mistypings of each other. The labels have up to 20 characters. Preliminarily, I am considering calculating the N(N-1)/2 pairwise Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. This is being done in Python, but none of the Levenshtein distance implementations I have found are vectorized. That is, the NxN array of distances has to be filled in on an element-by-element basis.
I thought that there might be a vectorized MATLAB version of Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorize the calculation of the N(N-1)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
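For context, the brute-force pattern described above (all N(N-1)/2 pairs, computed one element at a time) looks like this in pure Python. This is an illustrative sketch, not a recommendation: C-backed packages such as python-Levenshtein or rapidfuzz compute each distance much faster, but they still follow the same pair-by-pair pattern rather than vectorizing over the whole array.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pairwise_distances(labels):
    """Condensed upper-triangle distances (same layout as scipy's pdist):
    one entry per unordered pair, N*(N-1)/2 entries in total."""
    return [levenshtein(labels[i], labels[j])
            for i in range(len(labels))
            for j in range(i + 1, len(labels))]
```

The inner loop is the part that no library vectorizes away: each pair's distance depends on a full dynamic-programming table for that pair.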
Annex: MATLAB implementations of Levenshtein distance
Accepted Answer
More Answers (2)
Christopher Creutzig
14 Jun 2024
For this application, I recommend using editDistanceSearcher: you probably don't want to look at clusters larger than some size anyway, and knnsearch on an editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for faster amortized time as long as you call it often enough on the same dataset, which it sounds like is exactly what you want to do.
7 comments
FM
17 Jun 2024
Christopher Creutzig
17 Jun 2024
I was thinking of taking the 30,011 labels as the base set. If you then take the 5 closest elements within an edit distance of 3 for each of them, is that a start?
FM
17 Jun 2024
> To get the 5 closest elements to each of the 30,011 labels, I need to calculate 30,010 distances for each label
No, you don't. That's exactly what editDistanceSearcher helps you avoid. If it helps to think about it that way, it precomputes "bubble-shaped detectors" around the input values, avoiding most computations of distances that are larger than the original threshold. (Those bubbles aren't perfectly "distance 3" shaped, so some distances still need to be computed, but they filter out the overwhelming majority of points further out.)
data = string(rand([1,10000]));              % 10,000 random numeric strings as test labels
searcher = editDistanceSearcher(data,3);     % precompute searcher with maximum distance 3
timeit(@() editDistance(data(1),data))       % brute force: one label against all others
timeit(@() knnsearch(searcher,data(1),K=5))  % 5 nearest neighbors via the precomputed searcher
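If you want the same pruning idea on the Python side, a BK-tree is one classic structure built on the metric triangle inequality: precompute a tree keyed by edit distance once, then each query skips every subtree that cannot contain a match within the threshold. This is an analogy only, not a claim about what editDistanceSearcher does internally; the `levenshtein` helper is a plain dynamic-programming implementation included to keep the sketch self-contained.

```python
def levenshtein(a, b):
    # Standard two-row dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over a metric (here: Levenshtein distance)."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})          # node = (word, {distance: child})
        for w in it:
            self._add(w)

    def _add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return                       # duplicate label, nothing to add
            if d not in children:
                children[d] = (word, {})
                return
            node, children = children[d]

    def query(self, word, max_dist):
        """All stored words within max_dist of `word`, as (dist, word) pairs."""
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                results.append((d, node))
            # Triangle inequality: only subtrees keyed in
            # [d - max_dist, d + max_dist] can contain matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

For typo-shaped data, most of the vocabulary falls outside the query bubble, so each lookup touches only a small fraction of the N stored labels.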
Christopher Creutzig
19 Jun 2024
There are many ways to define clusters. What exactly is the one you are looking for?
If your clusters are defined by "groups of words separated by edit distance," i.e., you regard all those as a cluster where you have stepping stones to get from A to B without making any steps with a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.
That is not data you would put into kmeans or kmedoids, of course. Such neighborhood information turns the clustering into a graph-theoretic problem: set up an undirected graph from the neighbor information you have and ask for its connected components (or block-cut components, to avoid spurious connections through outliers).
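The graph step above needs only a few lines once the neighbor pairs are known. A sketch in Python using union-find, where `pairs` stands for whatever index pairs your neighbor search reports as being within the edit-distance threshold:

```python
def connected_components(labels, pairs):
    """Cluster labels by connected components of the neighbor graph.

    labels: list of strings.
    pairs:  iterable of (i, j) index pairs whose edit distance is
            within the chosen threshold (from any neighbor search).
    """
    parent = list(range(len(labels)))

    def find(x):
        # Find the representative of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union: merge the two components of each neighboring pair.
    for i, j in pairs:
        parent[find(i)] = find(j)

    # Group labels by their component representative.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(find(idx), []).append(label)
    return list(clusters.values())
```

This realizes exactly the "stepping stones" definition: two labels land in the same cluster whenever a chain of within-threshold neighbors connects them, even if the two endpoints themselves are far apart.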
FM
20 Jun 2024