_ja, jav, ava, vas, asc, scr, cri, rip, ipt, pt_

Next you create an index of all the tags as trigrams and include the position of the tag they came from, so that you can refer back to it later. For example, suppose you want to match any tags that contain java anywhere in the tag.
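To make the idea concrete, here is a minimal sketch of a trigram index over a list of tags. It is not the code used by Stack Overflow or by Google Code Search; the class name `TrigramIndexSketch` and the methods `Trigrams`, `BuildIndex` and `FindContaining` are hypothetical names chosen for illustration, and it only handles "contains" style queries of at least 3 characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: all names here are assumptions, not the real Tag Engine API.
public static class TrigramIndexSketch
{
    // Splits a tag into 3 letter chunks, using '_' to mark the start/end of the word.
    public static IEnumerable<string> Trigrams(string tag)
    {
        string padded = "_" + tag + "_";
        for (int i = 0; i + 3 <= padded.Length; i++)
            yield return padded.Substring(i, 3);
    }

    // Inverted index: trigram -> positions (in allTags) of the tags that contain it.
    public static Dictionary<string, HashSet<int>> BuildIndex(IReadOnlyList<string> allTags)
    {
        var index = new Dictionary<string, HashSet<int>>();
        for (int position = 0; position < allTags.Count; position++)
        {
            foreach (var trigram in Trigrams(allTags[position]))
            {
                if (!index.TryGetValue(trigram, out var positions))
                {
                    positions = new HashSet<int>();
                    index[trigram] = positions;
                }
                positions.Add(position);
            }
        }
        return index;
    }

    // Finds tags containing 'text' anywhere, by intersecting the index entries
    // for each trigram of 'text' and then confirming each candidate.
    public static List<string> FindContaining(
        IReadOnlyList<string> allTags,
        Dictionary<string, HashSet<int>> index,
        string text)
    {
        if (text.Length < 3)
            throw new ArgumentException("This sketch needs at least 3 characters.", nameof(text));

        var postings = new List<HashSet<int>>();
        for (int i = 0; i + 3 <= text.Length; i++)
        {
            // A trigram that is missing from the index means no tag can match.
            if (!index.TryGetValue(text.Substring(i, 3), out var posting))
                return new List<string>();
            postings.Add(posting);
        }

        var candidates = new HashSet<int>(postings[0]);
        foreach (var posting in postings.Skip(1))
            candidates.IntersectWith(posting);

        // Sharing every trigram is necessary but not sufficient, so verify each candidate.
        return candidates
            .OrderBy(p => p)
            .Select(p => allTags[p])
            .Where(tag => tag.Contains(text))
            .ToList();
    }
}
```

The final `Contains` check matters because a tag can hold all of the query's trigrams without holding the query itself. A made-up tag such as avajav contains both jav and ava, but not java. The saving comes from only running that check against the handful of candidates the index returns, rather than against every tag.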
I’m not going to explain all the details, the linked page has a very readable explanation. But basically what you do is create an inverted index of the tags and search the index instead. That way you aren’t affected so much by the number of wildcards, because you are only searching via an index rather than doing a full search that runs over the whole list of tags.

For instance, when using Trigrams, the tags are initially split into 3 letter chunks. The expansion for the tag javascript is shown below (‘_’ is added to denote the start/end of a word):
    var expandedTags = new HashSet<string>();
    foreach (var wildcard in wildcardsToExpand)

This works fine with a few wildcards, but it’s not very efficient. Even on a relatively small data-set containing 32,000 tags, it’s slow when comparing them against 210 wildcardsToExpand, taking over a second. After chatting to a few of the Stack Overflow developers on Twitter, I learned that they consider a Tag Engine query that takes longer than 500 milliseconds to be slow, so a second just to apply the wildcards is unacceptable.

So can we do any better? Well, it turns out that there is a really nice technique for doing Regular Expression Matching with a Trigram Index that is used in Google Code Search.
If you want to see the wildcard expansion in action you can visit the URLs below:

Now, a simple way of doing these matches is to loop through the wildcards and compare each one with every single tag to see if it could be expanded to match that tag. (IsActualMatch(..) is a simple method that does a basic string StartsWith, EndsWith or Contains as appropriate.)
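The brute-force approach described above could look something like the following. This is a sketch under assumptions rather than the actual Tag Engine code: the class name `NaiveWildcardExpansion`, the `Expand` signature and the body of `IsActualMatch` are all guesses, and it assumes wildcards only ever take the forms java*, *script or *java*.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only: names and signatures are assumptions.
public static class NaiveWildcardExpansion
{
    // Assumed behaviour: '*' may appear at the start, the end, or both.
    public static bool IsActualMatch(string tag, string wildcard)
    {
        bool leading = wildcard.StartsWith("*", StringComparison.Ordinal);
        bool trailing = wildcard.EndsWith("*", StringComparison.Ordinal);
        string text = wildcard.Trim('*');

        if (leading && trailing) return tag.Contains(text);                  // *java*
        if (trailing) return tag.StartsWith(text, StringComparison.Ordinal); // java*
        if (leading) return tag.EndsWith(text, StringComparison.Ordinal);    // *script
        return tag == text;                                                  // no wildcard at all
    }

    // Compares every wildcard with every tag, so the cost grows with
    // (number of wildcards) x (number of tags).
    public static HashSet<string> Expand(
        IEnumerable<string> allTags,
        IEnumerable<string> wildcardsToExpand)
    {
        var expandedTags = new HashSet<string>();
        foreach (var wildcard in wildcardsToExpand)
        {
            foreach (var tag in allTags)
            {
                if (IsActualMatch(tag, wildcard))
                    expandedTags.Add(tag);
            }
        }
        return expandedTags;
    }
}
```

With 210 wildcards and 32,000 tags, the nested loops perform roughly 6.7 million string comparisons per query, which is why the cost climbs so quickly as wildcards are added.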
The Stack Overflow Tag Engine – Part 2

I’ve added a Resources and Speaking page to my site, check them out if you want to learn more. There’s also a video available of my NDC London 2014 talk “Performance is a Feature!”.

This is the long-delayed part 2 of a mini-series looking at what it might take to build the Stack Overflow Tag Engine. If you haven’t read part 1, I recommend reading it first.