This article describes a simple implementation of the **string search**. It can be used for approximate string matching (for more information, see http://en.wikipedia.org/wiki/Fuzzy_string_searching).

Other algorithms for approximate string searching exist (e.g., **Soundex**), but those aren’t as easy to implement. The **algorithm** in this article is easy to implement, and can be used for tasks where approximate string searching is used in an easy way.

The algorithm used the Levenshtein-distance for determining how exact a string from a word list matches the word to be found. Information about the Levenshtein-distance can be found at http://en.wikipedia.org/wiki/Levenshtein_distance.

## C# implementation

using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Threading.Tasks; namespace PSC.Search.Demo { class Program { static void Main(string[] args) { string word = "Pure Source Code"; List<string> wordList = new List<string> { "Code Project", "Pure SourceCOde", "sourcecode", "puresourcecode", "Source Kode", "Kode Project", "Other Source" }; List<wordrank> foundWords = FuzzySearch.Search(word, wordList, 0.30); foundWords.ForEach(i => Console.WriteLine(i.Word + " (" + i.Rank.ToString() + ")")); Console.ReadKey(); } } }

Output:

A basic approach is shown. Instead of the Levenshtein-distance, a more optimized algorithm could be used – but here, a quite simple implementation is given for clarity reasons.

public static int LevenshteinDistance(string src, string dest) { int[,] d = new int[src.Length + 1, dest.Length + 1]; int i, j, cost; char[] str1 = src.ToCharArray(); char[] str2 = dest.ToCharArray(); for (i = 0; i <= str1.Length; i++) { d[i, 0] = i; } for (j = 0; j <= str2.Length; j++) { d[0, j] = j; } for (i = 1; i <= str1.Length; i++) { for (j = 1; j <= str2.Length; j++) { if (str1[i - 1] == str2[j - 1]) cost = 0; else cost = 1; d[i, j] = Math.Min( d[i - 1, j] + 1, // Deletion Math.Min( d[i, j - 1] + 1, // Insertion d[i - 1, j - 1] + cost)); // Substitution if ((i > 1) && (j > 1) && (str1[i - 1] == str2[j - 2]) && (str1[i - 2] == str2[j - 1])) { d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost); } } } return d[str1.Length, str2.Length]; }

In the search process, for each word in the wordlist, the Levenshtein-distance is computed, and with this distance, a score. This score represents how good the strings match. The input argument **fuzzyness** determines how much the strings can differ.

public static List<wordrank> Search(string word, ListwordList, double fuzzyness) { List<wordrank> foundWords = new List<wordrank>(); foreach (string s in wordList) { // Calculate the Levenshtein-distance: int levenshteinDistance = LevenshteinDistance(word, s); // Length of the longer string: int length = Math.Max(word.Length, s.Length); // Calculate the score: double score = 1.0 - (double)levenshteinDistance / length; // Match? if (score > fuzzyness) foundWords.Add(new WordRank() { Rank = score, Word = s }); } return foundWords; }

## SQL implementation

Also I’ve implemented a version for SQL.

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999)) RETURNS int AS BEGIN DECLARE @s1_len int, @s2_len int DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000) SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0 WHILE @j <= @s2_len SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1 WHILE @i <= @s1_len BEGIN SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1 WHILE @j <= @s2_len BEGIN SET @c = @c + 1 SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END IF @c > @c_temp SET @c = @c_temp SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1 IF @c > @c_temp SET @c = @c_temp SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1 END SELECT @cv1 = @cv0, @i = @i + 1 END RETURN @c END

Usage:

select dbo.edit_distance('Fuzzy String Match','fuzzy string match'), dbo.edit_distance('fuzzy','fuzy'), dbo.edit_distance('Fuzzy String Match','fuzy string match'), dbo.edit_distance('levenshtein distance sql','levenshtein sql server'), dbo.edit_distance('distance','server')

You must log in to post a comment.