Approximate String Matching

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965

Q-grams are typically used in approximate string matching by “sliding” a window of length q over the characters of a string to create a number of 'q' length grams for matching a match is then rated as number of q-gram matches within the second string over possible q-grams.


Cosine similarity is a common vector based similarity measure similar to dice coefficient.Whereby the input string is transformed into vector space so that the Euclidean cosine rule can be used to determine similarity. The cosine similarity is often paired with other approaches to limit the dimensionality of the problem. For instance with simple strings at list of stopwords are used to exclude from the dimensionality of the comparison.

Dice coefficient is a term based similarity measure (0-1) whereby the similarity measure is defined as twice the number of terms common to compared entitys divided by the total number of terms in both tested entities. The Coefficient result of 1 indicates identical vectors as where a 0 equals orthogonal vectors.

Dices coefficient = (2*Common Terms) / (Number of terms in String1 + Number of terms in String2)