Introduction

Analysis is an integral part of both indexing and searching in Lucene. It involves several steps, including splitting the text into tokens, removing stop words, and handling synonyms and phonetics. Documents and queries are normally analyzed with the same analyzer, with exceptions in some cases.

Anatomy of an Analyzer

An analyzer is responsible for taking in a Reader and returning a TokenStream. An Analyzer mainly consists of three components:

CharFilter
Tokenizer
TokenFilter

CharFilter

A CharFilter can be used to perform operations on the data before tokenizing it. It operates on the Reader object rather than on the TokenStream object. Some of the operations that can be performed using a CharFilter are:

Strip HTML elements from the input.
Replace specific characters.
Pattern match and replace.

A CharFilter is itself a subclass of Reader, while Tokenizer and TokenFilter are subclasses of TokenStream.

Tokenizer

A tokenizer is responsible for splitting the
Developer at Microsoft