Lucene Analyzer Explained

Introduction

Analysis is an integral part of both indexing and searching in Lucene. It involves several steps, including splitting the text into tokens, removing stop words, and handling synonyms and phonetics. Documents and queries are usually analyzed with the same analyzer, although there are exceptions.

Anatomy of an Analyzer  

An analyzer is responsible for taking in a Reader and returning a TokenStream.


An analyzer mainly consists of three components:
  1. CharFilter
  2. Tokenizer
  3. TokenFilter

CharFilter

A CharFilter performs operations on the text before it is tokenized. It operates on the Reader object rather than on the TokenStream object. Some of the operations that a CharFilter can perform are:
  1. Strip HTML elements from the input.
  2. Replace specific characters.
  3. Pattern match and replace.

A CharFilter is itself a subclass of Reader, while Tokenizer and TokenFilter are subclasses of TokenStream.
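To make the "a char filter is just another Reader" idea concrete, here is a minimal sketch in plain Java. It does not use Lucene's API; ReplacingReader and CharFilterSketch are hypothetical names made up for this illustration, and the only point is that the filter wraps a Reader and rewrites characters before any tokenizer sees them.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a char filter: a Reader that wraps another Reader. */
final class ReplacingReader extends FilterReader {
    private final char from;
    private final char to;

    ReplacingReader(Reader in, char from, char to) {
        super(in);
        this.from = from;
        this.to = to;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == from ? to : c; // -1 (end of stream) never equals a char
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, so the loop body is skipped in that case
        for (int i = off; i < off + n; i++) {
            if (buf[i] == from) {
                buf[i] = to;
            }
        }
        return n;
    }
}

final class CharFilterSketch {
    /** Reads the filtered text back into a String so the effect is visible. */
    static String replaceChars(String text, char from, char to) {
        try (Reader reader = new ReplacingReader(new StringReader(text), from, to)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints: state of the art
        System.out.println(replaceChars("state-of-the-art", '-', ' '));
    }
}
```

Because the wrapper is itself a Reader, several of them can be stacked, each one wrapping the previous, before the text reaches the tokenizer.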



Tokenizer

A tokenizer is responsible for splitting the text into tokens. Various tokenizers are available, each using a different strategy to tokenize the text. For instance, a WhitespaceTokenizer splits the text on whitespace. StandardTokenizer should suffice for most users.
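As an illustration of the whitespace strategy, here is a small plain-Java sketch. It is not Lucene's WhitespaceTokenizer; WhitespaceTokenizerSketch is a hypothetical class that only mimics the idea of consuming a Reader and emitting one token per run of non-whitespace characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a whitespace tokenizer; not Lucene's implementation. */
final class WhitespaceTokenizerSketch {

    /** Reads characters one by one and cuts a token at every whitespace run. */
    static List<String> tokenize(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    flush(current, tokens);
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, tokens); // emit the last token if the text did not end in whitespace
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints: [Hello,, Lucene, world!]
        System.out.println(tokenize(new StringReader("Hello,  Lucene\tworld!")));
    }
}
```

Note that punctuation stays attached to the tokens ("Hello," and "world!"), which is typical of splitting purely on whitespace.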

TokenFilter

A TokenFilter operates on tokens. The tokens emitted by the Tokenizer are passed through a series of token filters. Built-in token filters are available for a wide variety of jobs, such as:
  • Synonym handling
  • Trimming tokens
  • Handling phonetics
  • Removing stopwords
  • Producing ngrams
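A few of the jobs above can be sketched in plain Java as functions from one token list to another. This is only an illustration under simplifying assumptions: TokenFilterSketch and its methods are hypothetical, they work on whole lists rather than on a stream, and they are not Lucene's built-in filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical token filters, each mapping a token list to a new token list. */
final class TokenFilterSketch {

    /** Trimming tokens: strips leading and trailing whitespace from each token. */
    static List<String> trim(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.trim());
        }
        return out;
    }

    /** Removing stopwords: drops every token found in the stop word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    /** Producing ngrams: emits every character n-gram of length n per token. */
    static List<String> charNGrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // tokens shorter than n produce no n-grams
            for (int i = 0; i + n <= token.length(); i++) {
                out.add(token.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        List<String> tokens = trim(Arrays.asList(" the", "quick ", "fox"));
        tokens = removeStopWords(tokens, stopWords);
        System.out.println(tokens);                // prints: [quick, fox]
        System.out.println(charNGrams(tokens, 3)); // prints: [qui, uic, ick, fox]
    }
}
```

The order of the filters matters: here the stop word check only works because the tokens were trimmed first.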

In a typical application, a Reader is passed through a series of char filters, then a single tokenizer, and finally a series of token filters.
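The flow described above can be sketched end to end in plain Java. This is a conceptual model, not Lucene's Analyzer API: PipelineSketch is a hypothetical class, the tag-stripping char filter is a naive regex that buffers the whole input (a real char filter would stream), and the stop word list is made up for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical model of an analyzer: char filters, one tokenizer, token filters. */
final class PipelineSketch {
    private final List<UnaryOperator<Reader>> charFilters;
    private final Function<Reader, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> tokenFilters;

    PipelineSketch(List<UnaryOperator<Reader>> charFilters,
                   Function<Reader, List<String>> tokenizer,
                   List<UnaryOperator<List<String>>> tokenFilters) {
        this.charFilters = charFilters;
        this.tokenizer = tokenizer;
        this.tokenFilters = tokenFilters;
    }

    List<String> analyze(Reader input) {
        Reader reader = input;
        for (UnaryOperator<Reader> filter : charFilters) {
            reader = filter.apply(reader);          // each char filter wraps or replaces the Reader
        }
        List<String> tokens = tokenizer.apply(reader); // exactly one tokenizer
        for (UnaryOperator<List<String>> filter : tokenFilters) {
            tokens = filter.apply(tokens);          // token filters run in order
        }
        return tokens;
    }

    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds an example pipeline: strip tags, split on whitespace, lowercase, drop "the". */
    static PipelineSketch example() {
        // Naive assumption: anything between < and > is a tag.
        UnaryOperator<Reader> stripTags =
            r -> new StringReader(readAll(r).replaceAll("<[^>]*>", " "));
        Function<Reader, List<String>> whitespace = r -> {
            List<String> out = new ArrayList<>();
            for (String t : readAll(r).split("\\s+")) {
                if (!t.isEmpty()) out.add(t);
            }
            return out;
        };
        UnaryOperator<List<String>> lowercase = ts -> {
            List<String> out = new ArrayList<>();
            for (String t : ts) out.add(t.toLowerCase(Locale.ROOT));
            return out;
        };
        UnaryOperator<List<String>> dropThe = ts -> {
            List<String> out = new ArrayList<>(ts);
            out.removeIf("the"::equals);
            return out;
        };
        return new PipelineSketch(Arrays.asList(stripTags), whitespace,
                                  Arrays.asList(lowercase, dropThe));
    }

    public static void main(String[] args) {
        // prints: [quick, fox]
        System.out.println(example().analyze(new StringReader("<p>The Quick <b>Fox</b></p>")));
    }
}
```

Lowercasing runs before the stop word filter so that "The" is matched by the lowercase stop word "the"; swapping the two filters would let it through.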






