Typeahead Searching using Solr

in solr •  7 years ago 

TL;DR
    Typeahead search boils down to how you set up the analyzer chain (the tokenizer plus its filters) for the field. Here's a way to achieve true typeahead search; it goes in the schema.xml of the core in question:

<fieldType name="text" class="solr.TextField">
  <!-- Index time: keep the whole value as one token, normalize it, then emit its prefixes -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <!-- Query time: same normalization, no n-grams; cap the term at maxGramSize characters -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{1,50}).*" replacement="$1"/>
  </analyzer>
</fieldType>



Reason
    So why didn't simply searching for "keyword*" work with the normal schema.xml? Because the default analyzer splits each value into multiple tokens, all of which point back to the document. As long as any one of those tokens matches, the document is returned, even when the match sits in the middle of the text.


For example:

Original: "The quick brown fox jumped over the lazy dog"
Tokenizer: "The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
Search term: "quick*" matches the token "quick", which pulls in our document.
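
The effect can be imitated in a few lines of Python. This is only a rough sketch: it lowercases and splits on whitespace, which approximates but does not reproduce Solr's default analyzer, and the function names are made up for illustration.

```python
def default_like_tokens(text):
    """Roughly imitate a word-splitting analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def prefix_query_matches(prefix, text):
    """A query such as "quick*" matches if ANY token starts with the prefix."""
    prefix = prefix.lower()
    return any(token.startswith(prefix) for token in default_like_tokens(text))


doc = "The quick brown fox jumped over the lazy dog"
print(prefix_query_matches("quick", doc))  # True: matches a word in the middle
print(prefix_query_matches("the q", doc))  # False: no single token starts with "the q"
```

So a word-based analyzer gives the opposite of what typeahead needs: it matches words in the middle of the text but not a prefix that spans two words.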

With the configuration above, here is what happens instead:

During indexing:

Original: "The quick brown fox jumped over the lazy dog"
Tokenizer: "The quick brown fox jumped over the lazy dog"
Filter 1: "The quick brown fox jumped over the lazy dog"
Filter 2: "the quick brown fox jumped over the lazy dog"
Filter 3: "thequickbrownfoxjumpedoverthelazydog"
Filter 4: "the", "theq", "thequ", "thequi", "thequic", "thequick", ... You get the gist.
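
The index-time steps above can be sketched in Python to see exactly which terms end up in the index. This is an approximation under stated assumptions: `ascii_fold` uses Unicode decomposition as a stand-in for ASCIIFoldingFilterFactory (the real filter covers more characters), and all function names are hypothetical.

```python
import unicodedata


def ascii_fold(text):
    # Stand-in for ASCIIFoldingFilter: decompose accents, drop non-ASCII marks.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def edge_ngrams(token, min_gram=3, max_gram=50):
    # Prefixes of the token, from min_gram up to max_gram characters long.
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text, min_gram=3, max_gram=50):
    token = text                    # KeywordTokenizer: whole input is one token
    token = ascii_fold(token)       # Filter 1
    token = token.lower()           # Filter 2
    token = token.replace(" ", "")  # Filter 3
    return edge_ngrams(token, min_gram, max_gram)  # Filter 4


terms = index_terms("The quick brown fox jumped over the lazy dog")
print(terms[:4])   # ['the', 'theq', 'thequ', 'thequi']
print(terms[-1])   # thequickbrownfoxjumpedoverthelazydog
print(len(terms))  # 34
```

The squashed string is 36 characters long, so with minGramSize 3 the index holds 34 prefixes, the last one being the whole string.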

During search, the terms become:

Search: "The q"
Tokenizer: "The q"
Filter 1: "The q"
Filter 2: "the q"
Filter 3: "theq"

"theq" matches "theq" from the EdgeNGramFilter, so we pick that up.

But if we use "quick", here's what we get:

Search: "quick"
Tokenizer: "quick"
Filter 1: "quick"
Filter 2: "quick"
Filter 3: "quick"

Since "quick" is not a prefix of the indexed string, it doesn't match any of our final terms, so we get nothing back. This is how we achieve true typeahead searching using Solr.
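
Both searches can be checked with a small Python sketch of the matching step. It assumes a match means the normalized query is exactly equal to one of the indexed prefixes; ASCII folding is left out for brevity, and the function names are hypothetical.

```python
def normalize(text):
    # Lowercase and strip spaces, as both analyzers do (ASCII folding omitted).
    return text.lower().replace(" ", "")


def index_terms(text, min_gram=3, max_gram=50):
    # Edge n-grams of the single normalized token.
    token = normalize(text)
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def matches(query, text):
    # The query term must equal one of the indexed prefixes.
    return normalize(query) in index_terms(text)


doc = "The quick brown fox jumped over the lazy dog"
print(matches("The q", doc))  # True:  "theq" is an indexed prefix
print(matches("quick", doc))  # False: "quick" is not a prefix of the title
print(matches("Th", doc))     # False: shorter than minGramSize=3
```

The last line shows a side effect of minGramSize="3": nothing matches until the user has typed at least three characters, not counting spaces.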

Now, why did we use each of those filters?

<filter class="solr.ASCIIFoldingFilterFactory"/>
This one converts accented characters such as á to their plain ASCII equivalent, a. We don't expect users to type accented characters when performing typeahead, so it is much better to stick with the letters found on a US keyboard.

<filter class="solr.LowerCaseFilterFactory"/>
This one simply converts all characters to lower case, which makes searches case-insensitive.

<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
This one removes all the spaces between the words, so a user can search with or without spaces. Why force the user to enter spaces? Just type the words you intend to search for and we know what you mean.

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50"/>
We apply EdgeNGram to break our token up into prefixes, always starting from the first character: the shortest prefix is minGramSize characters long and the longest is maxGramSize. Of course, if the search term is longer than maxGramSize, no indexed prefix is long enough to match it and we get nothing back. To avoid this, you can add the extra PatternReplaceFilterFactory:

<filter class="solr.PatternReplaceFilterFactory" pattern="^(.{1,50}).*" replacement="$1"/>
Here we add this additional filter to the query analyzer to keep our search term from going past the maxGramSize of the EdgeNGramFilterFactory. Just make sure to update the 50 to whatever value you set for maxGramSize. The filter truncates the search term to that maximum length, so the search keeps working even when the user types more characters than the limit.
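
The truncation can be tried out in Python, whose `re` module accepts the same pattern. This is a sketch: `truncate_query` and `MAX_GRAM` are hypothetical names, and Python writes the back-reference as `\1` where Solr uses `$1`.

```python
import re

MAX_GRAM = 50  # keep in sync with maxGramSize in the index analyzer


def truncate_query(term, max_gram=MAX_GRAM):
    # Same idea as pattern="^(.{1,50}).*" replacement="$1":
    # keep the first max_gram characters and drop the rest.
    pattern = r"^(.{1,%d}).*" % max_gram
    return re.sub(pattern, r"\1", term)


title = "x" * 60                  # a normalized title longer than maxGramSize
longest_gram = title[:MAX_GRAM]   # longest prefix the index will contain
query = title                     # the user typed the whole title

print(query == longest_gram)                  # False: no match without truncation
print(truncate_query(query) == longest_gram)  # True: truncated term matches
print(truncate_query("theq"))                 # theq (short terms are unchanged)
```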


Welcome to my blog. I'll be posting information related to technology and possibly music, since these are the things I am passionate about. I hope you find my posts interesting and helpful. If you notice any errors, whether grammatical or just plain incorrect, do drop me a note. This is my first time blogging, so there will be many mishaps along the way, but hopefully you'll stick around and bear with me. Cheers!
