Apache Lucene: Search Engine Library for Modern Apps


Technology: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It can be embedded in any application to add full-text search capability, and it is suitable for nearly any application that requires full-text search, especially cross-platform ones. Lucene works by adding content to a full-text index and then letting you run queries against that index, returning results ranked either by relevance to the query or sorted by an arbitrary field such as a document's last-modified date. The content you add to Lucene can come from various sources: a SQL/NoSQL database, a filesystem, or even websites.

Apache Lucene Analysis Pipeline:

Parsing:
Lucene only supports plain text, but applications can implement parsers that convert other file formats to plain text. Formats such as XML, Word, and PDF can be converted to plain text by these parsers before the data is sent to Apache Lucene.
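For example, Apache Tika (a separate Apache project, not part of Lucene) provides ready-made parsers for many such formats. A minimal sketch, assuming Tika is on the classpath and a hypothetical file sample.pdf exists:

import java.io.File;
import org.apache.tika.Tika;

// Extract plain text from a binary format before handing it to Lucene
Tika tika = new Tika();
String plainText = tika.parseToString(new File("sample.pdf"));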

Tokenization:
Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process of breaking input text into small indexing elements called tokens. The way input text is broken into tokens heavily influences how people will then be able to search for that text. Sometimes simply splitting on words is not sufficient and deeper analysis is needed: for example, we may want to add synonyms for a word, or remove stop words like "a", "an", and "the".

Some work is also needed before tokenization, such as removing HTML markup.

We can add any number of token filters after tokenization. Some of the most commonly used filters:

  • Stemming – Replacing words with their stems. For instance with English stemming “books” is replaced with “book”; now query “book” can find both documents containing “book” and those containing “books”.
  • Stop Words Filtering – Common words like “the”, “and” and “a” rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some “noise” and actually improve search quality.
  • Text Normalization – Stripping accents and other character markings can make for better searching.
  • Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
Analysis: before the indexing process starts, the document is analyzed to determine which parts of the text are candidates for indexing. This whole pipeline is called analysis; the sketch below shows what it produces for a small piece of text.
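To make this concrete, here is a minimal sketch that runs a string through StandardAnalyzer and prints the resulting tokens (the field name "content" is arbitrary here, and exact stop-word behavior depends on the Lucene version):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Run text through the analyzer and print each token it produces
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream stream = analyzer.tokenStream("content", "The Books of Apache Lucene");
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString()); // lowercased tokens, e.g. "books", "apache", "lucene"
}
stream.end();
stream.close();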

Analyzer is the base class defined in Lucene Core for analysis; it is used both when indexing documents and when parsing queries. An Analyzer is responsible for supplying a TokenStream that can be consumed by the indexing and searching processes, and Lucene ships with a wide range of ready-made analyzers. See below for more information on implementing your own Analyzer; most of the time, you can use an anonymous subclass of Analyzer.

The relationship between Analyzer and CharFilters, Tokenizers, and TokenFilters: an Analyzer builds a pipeline of these components to convert text into a TokenStream: zero or more CharFilters, exactly one Tokenizer, and zero or more TokenFilters.

A CharFilter is a subclass of Reader that supports offset tracking.

A Tokenizer is only responsible for breaking the input text into tokens.

A TokenFilter modifies a stream of tokens and their contents.

A Tokenizer is itself a TokenStream.
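Putting these pieces together, here is a minimal sketch of a custom Analyzer, assuming Lucene 5+ with the lucene-analyzers-common module on the classpath (exact import paths vary slightly between Lucene versions). It strips HTML markup with a CharFilter, tokenizes with StandardTokenizer, then lowercases, removes English stop words, and stems with a Porter filter:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer customAnalyzer = new Analyzer() {
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // CharFilter stage: strip HTML markup before tokenization
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenizer stage: break the text into tokens
        Tokenizer tokenizer = new StandardTokenizer();
        // TokenFilter stages: lowercase, drop stop words, stem
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new StopFilter(stream, EnglishAnalyzer.getDefaultStopSet());
        stream = new PorterStemFilter(stream);
        return new TokenStreamComponents(tokenizer, stream);
    }
};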

The standard Lucene analysis pipeline:

Some of the most common terms in Lucene:

Documents: In Lucene, a Document is the unit of indexing and search. An index consists of one or more Documents. Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.

Fields: A Document consists of one or more Fields. A Field is simply a name-value pair. For example, a Field commonly found in applications is title: the field name is title and the value is the title of that content item. Indexing in Lucene involves creating Documents comprising one or more Fields and adding these Documents to an IndexWriter.

Document Search and Search Ranking: The Lucene search API takes a search query and returns a set of documents ranked by relevancy with documents most similar to the query having the highest score. Lucene provides a highly configurable hybrid form of search that combines exact boolean searches with softer, more relevance-ranking-oriented vector-space search methods. All searches are field specific because Lucene indexes terms and a term is composed of a field name and a token.
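Because a term is a (field name, token) pair, the most basic programmatic query is a TermQuery against a specific field. A minimal sketch:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// A term is (field, token); this matches documents whose "content" field
// produced the token "lucene" during analysis
Query q = new TermQuery(new Term("content", "lucene"));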

Sample code for indexing text:

Writing the index to the index store:

StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);

// First document
Document document = new Document();
document.add(new TextField("content", "Hello World", Field.Store.YES));
writer.addDocument(document);

// Second document: create a fresh Document instead of reusing the first one,
// otherwise the second addDocument call would index both fields together
document = new Document();
document.add(new TextField("content", "Hello Lucene", Field.Store.YES));
writer.addDocument(document);

writer.close();

Here we are using RAMDirectory to hold the index in memory for testing purposes; in general, a file-system directory is used to store indexes:

Directory index = FSDirectory.open(Paths.get("<index-dir>"));

We added TextFields to the Lucene document, but a document can hold many fields of different types: IntPoint to store integers, LongPoint, DoublePoint, and so on. We can also store ranges using IntRange, which holds min and max values for the specified field.

Lucene analyzes all document fields and does not store the exact value by default; specifying Field.Store.YES tells Lucene to also store the original content so it can be retrieved from search results.
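A small sketch of a document mixing field types (the field names here are illustrative). Note that point fields are index-only, so a separate StoredField is needed if the value should also be retrievable:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.IntRange;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
doc.add(new TextField("content", "Hello Lucene", Field.Store.YES));
doc.add(new IntPoint("year", 2020));         // indexed for exact/range queries
doc.add(new StoredField("year", 2020));      // stored copy so "year" can be retrieved
doc.add(new IntRange("pages", new int[]{10}, new int[]{250})); // min/max range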

Querying Apache Lucene:

QueryParser is the class that builds a Lucene search query; the attached analyzer parses our search terms and converts them into a TokenStream that Lucene understands.

QueryParser parser = new QueryParser("content", standardAnalyzer);
String querystr = "lucene";
Query query = parser.parse(querystr);

Here "content" is the name of the field on which we want to perform the search, querystr is the search term, and query is the Lucene Query object that Lucene understands.

Searching:

Create an IndexReader and an IndexSearcher by passing the Directory where we indexed the documents:

IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

Then pass the query, together with a result collector (created in the next step), to the searcher:

searcher.search(query, collector);

Displaying Search Results:

We collect the search results using a TopScoreDocCollector and read them into an array of ScoreDoc. This array can then be used to display the results to the user with a proper user interface as needed.

TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

Final code for searching with Lucene:

IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

QueryParser parser = new QueryParser("content", standardAnalyzer);
String querystr = "lucene";
Query query = parser.parse(querystr);

int hitsPerPage = 10;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

System.out.println("Query string: " + querystr);
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    // Only fields stored with Field.Store.YES can be retrieved here;
    // our sample documents only have a "content" field
    System.out.println((i + 1) + ". " + d.get("content"));
}

Lucene also supports boosting some search terms. There are two types of boosting:

Index-Time Boosting: index-time boosting means programmatically setting the score contribution of one or more fields (and thus influencing that of the overall document) at indexing time. The relevant concept here is the "norm", short for normalized value: a number stored per field that affects the document's score and thus its position in the search-result pecking order. Norm values are written to the index, which can potentially (again, potentially) improve scoring performance at query time.
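A hedged sketch, assuming an older Lucene version (4.x-6.x) where per-field index-time boosts were still supported via Field.setBoost; this method was removed in Lucene 7, where similar effects are achieved with a custom Similarity or with query-time boosting:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

// Lucene 4.x-6.x only: boost the "title" field at index time
Document doc = new Document();
TextField titleField = new TextField("title", "Apache Lucene", Field.Store.YES);
titleField.setBoost(2.0f); // recorded via norms; removed in Lucene 7+
doc.add(titleField);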

Query-Time Boosting: the boost value is specified directly at query time. You can do this using the setBoost method of the various query objects (in older Lucene versions) or directly in the query string.

Example: title:lucene^2 OR content:apache

If the title field contains "lucene", those documents will be shown at the top of the search results because of the boost factor of 2.
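In recent Lucene versions (6+), query objects are immutable, so the programmatic equivalent wraps a query in a BoostQuery. A minimal sketch of the example above:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// title:lucene^2 OR content:apache
Query title = new BoostQuery(new TermQuery(new Term("title", "lucene")), 2.0f);
Query content = new TermQuery(new Term("content", "apache"));
Query boosted = new BooleanQuery.Builder()
        .add(title, BooleanClause.Occur.SHOULD)
        .add(content, BooleanClause.Occur.SHOULD)
        .build();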

Conclusion: Lucene is a powerful, built-for-purpose full text search library that takes a raw stream of characters, bundles them into tokens, and persists them as terms in an index. It can quickly query that index and provide ranked results, and provides ample opportunity for extension while maintaining efficiency. By using Lucene directly in our applications, or as part of a server, we can perform full text searches in real-time over gigabytes of content. Moreover, by way of custom analysis and scoring, we can take advantage of domain-specific features in our documents to improve the relevance of results or custom queries.

