Apache Lucene - Indexing - Part 2

I have been going through some interesting parts of Apache Lucene lately. The project is very popular, and it has made it possible for many web applications to integrate complex search modules. Some of you may know that the JIRA tracker by Atlassian uses Lucene to search through huge bug lists, comments, code, documents and so on.
For a vanilla search tool, comparing the search key with the strings in each file is very slow, so an index such as an inverted index comes in handy. When Lucene indexes, it creates a document id (docId) for each document. It collects all the words and associates each word with every docId in which it occurs, so each docId carries a list of the positions of the word in that document. The index data structure, a store of documents with their associated fields, is built to provide random-access data retrieval. The Lucene inverted index can be opened either to add more documents or to delete existing ones, but only for one of the two at a time.
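To update a document you must delete it first, close the index, and add the document again. Newer versions of Lucene wrap this delete-then-add sequence in a single IndexWriter.updateDocument call.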
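To make the idea concrete, here is a minimal sketch of a positional inverted index in plain Java. It is not Lucene's implementation: the class TinyInvertedIndex and its methods are hypothetical names made up for illustration, and the tokenizing (lowercase, split on non-word characters) is an assumption. It only shows the shape of the mapping term -> docId -> positions, and update done as delete followed by add.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index (hypothetical, not Lucene code): term -> (docId -> positions). */
class TinyInvertedIndex {
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Indexes a text under a fresh docId and returns that id. */
    int add(String text) {
        int docId = nextDocId++;
        int position = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue; // split may yield a leading empty string
            postings.computeIfAbsent(word, w -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
            position++;
        }
        return docId;
    }

    /** Removes every posting of the given document. */
    void delete(int docId) {
        postings.values().forEach(docs -> docs.remove(docId));
        postings.values().removeIf(Map::isEmpty); // drop terms that no document contains any more
    }

    /** Update modelled as delete followed by add; the document gets a new docId. */
    int update(int docId, String newText) {
        delete(docId);
        return add(newText);
    }

    /** Returns docId -> positions for a term, or an empty map if the term is unknown. */
    Map<Integer, List<Integer>> lookup(String term) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term.toLowerCase());
        return docs == null ? new TreeMap<>() : docs;
    }
}
```

A lookup touches only the entry for the search term, which is why it avoids scanning every file. Note that in this sketch an updated document comes back under a new docId.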
The Analyzer, specified in the IndexWriter, extracts the tokens to be indexed. There is a default analyzer for English text (multilingual text needs custom analyzers). Before analysis is done, documents such as PDF or DOC files have to be parsed into text.
A Term is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the value of that field, i.e. <fieldname,text>. A term vector is a collection of terms. The inverted index maps terms to documents: for each term T, it stores the set of all documents containing that term. So the duty of the analyzer is to look for the terms in documents and create a token stream so that they can be mapped. Terms are stored in segments, in sorted order.
The term frequency tells how well a term describes the contents of a document, but terms that appear in many documents are not very useful for filtering. The Kth most frequent term has a frequency of approximately 1/K, i.e. the 100 most frequent tokens make up roughly 50% of the text.
The indexing strategy can be chosen from:
  • Batch based - simple file parsing and sorting
  • B-tree based - similar to the indexing done by file systems and databases; as it is a tree, updates can be done in place
  • Segment based - the common choice, built from lots of small indexes
The algorithm used for Lucene indexing can be
  • indexing a single document and merging a set of indexes
  • an incremental algorithm that keeps a stack of segments and pushes new indexes onto the stack (segment based)
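The incremental, segment-based approach can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Lucene's actual merge code: the class SegmentStack, the mergeFactor parameter and the rule "merge when mergeFactor segments of the same size sit on top of the stack" are all assumptions chosen for the example. Each new document becomes a one-document segment that is pushed onto the stack, and a search asks every segment and merges the answers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy segment stack (hypothetical, not Lucene code). */
class SegmentStack {
    /** One small index: sorted terms, each mapped to the docIds that contain it. */
    static final class Segment {
        final TreeMap<String, TreeSet<Integer>> terms = new TreeMap<>();
        int docCount = 0;
    }

    private final Deque<Segment> stack = new ArrayDeque<>();
    private final int mergeFactor; // assumed tuning knob: how many equal-sized segments trigger a merge
    private int nextDocId = 0;

    SegmentStack(int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        this.mergeFactor = mergeFactor;
    }

    /** Indexes one document as its own segment, pushes it, then merges if possible. */
    void addDocument(String text) {
        Segment seg = new Segment();
        int docId = nextDocId++;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) seg.terms.computeIfAbsent(w, k -> new TreeSet<>()).add(docId);
        }
        seg.docCount = 1;
        stack.push(seg);
        mergeWhilePossible();
    }

    private void mergeWhilePossible() {
        for (int size = 1; ; size *= mergeFactor) {
            List<Segment> top = new ArrayList<>();
            for (Segment s : stack) { // ArrayDeque iterates from the top of the stack
                if (top.size() == mergeFactor || s.docCount != size) break;
                top.add(s);
            }
            if (top.size() < mergeFactor) return; // not enough equal-sized segments on top
            Segment merged = new Segment();
            for (Segment s : top) {
                stack.pop();
                merged.docCount += s.docCount;
                s.terms.forEach((term, docs) ->
                        merged.terms.computeIfAbsent(term, k -> new TreeSet<>()).addAll(docs));
            }
            stack.push(merged);
        }
    }

    /** Searches every segment independently and merges the results. */
    TreeSet<Integer> search(String term) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (Segment s : stack) {
            TreeSet<Integer> docs = s.terms.get(term.toLowerCase());
            if (docs != null) hits.addAll(docs);
        }
        return hits;
    }

    /** Document counts of the segments, from the bottom of the stack to the top. */
    List<Integer> segmentSizes() {
        List<Integer> sizes = new ArrayList<>();
        for (Iterator<Segment> it = stack.descendingIterator(); it.hasNext(); ) {
            sizes.add(it.next().docCount);
        }
        return sizes;
    }
}
```

With a merge factor of 2, adding five documents leaves one merged segment of four documents and one fresh single-document segment. Existing segments are never modified in place; they are only replaced by a merged one.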

Apache Lucene - Indexing - Part 1

"Information retrieval (IR) is the science of searching for documents, for information within documents and for metadata about documents, as well as that of searching relational databases and the World Wide Web."

Most applications use search features. If you are looking to add a powerful text search engine to your application, use Lucene, which can add advanced search engine capabilities to it. It is a really powerful Java API that gave birth to powerful tools such as Nutch, Hadoop, Hibernate Search and so on. Lucene was started in 1997 and adopted by Apache in 2001. Its main function is powerful full-text indexing of data.
Indexing with Lucene breaks down into three main operations: converting data to text, analyzing it, and saving it to the index. Lucene works on strings only, so documents have to be parsed into text before they are indexed.
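The three operations can be sketched in plain Java as below. This is only an illustration, not the Lucene API: IndexingPipeline, toText, analyze and save are hypothetical names, the tag-stripping regex stands in for a real document parser, and the stop-word list is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy three-step indexing pipeline (hypothetical, not Lucene code). */
class IndexingPipeline {
    // Assumed stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of", "to");

    // Term -> docIds containing it, kept sorted by term.
    private final Map<String, TreeSet<Integer>> index = new TreeMap<>();

    /** Step 1: convert data to text. Stripping tags stands in for a real parser. */
    static String toText(String markup) {
        return markup.replaceAll("<[^>]*>", " ");
    }

    /** Step 2: analyze the text into lowercase tokens, dropping stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) tokens.add(word);
        }
        return tokens;
    }

    /** Step 3: save the tokens to the index under the given docId. */
    void save(int docId, List<String> tokens) {
        for (String token : tokens) {
            index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Runs all three steps for one raw document. */
    void indexDocument(int docId, String rawDocument) {
        save(docId, analyze(toText(rawDocument)));
    }

    /** Returns the docIds whose text contains the given word. */
    Set<Integer> search(String word) {
        return index.getOrDefault(word.toLowerCase(), new TreeSet<>());
    }
}
```

Keeping the steps separate mirrors the split described above: the parser depends on the document type, the analyzer depends on the language, and the index only ever sees tokens.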
To search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index. Searching is then done on this index to find the relevant data, at the cost of the space needed to store the index.
These index files can be stored in a directory. A Lucene index is divided into segments, each made up of several index files that hold the Lucene Documents. An index can cover multiple documents, so when new documents are indexed they are added to new segments rather than by modifying the existing index files. Lucene uses a feature called incremental indexing: there is a global index, and newly added documents are indexed incrementally so that they become searchable. As for its structure, a Lucene index is an inverted index. While searching, Lucene loads the index into memory. Its high-performance indexing produces an index roughly 20-30% of the size of the indexed text, which keeps memory use low.
A document in an index is a collection of fields, and a field is a named collection of terms, like <field,term>. These fields are independent search spaces defined at run time. The segments, or sub-indexes, are independently searchable, and the results from the segments are merged. Suppose a wiki article is indexed: we can set the field properties so that a field object holds the article data as indexed content, as stored content, or both.
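The difference between an indexed field and a stored field can be sketched like this. The code is a plain-Java illustration and not Lucene's Field API: FieldedIndex, Field, and the indexed and stored flags are hypothetical names, and encoding a term as the string "field:text" is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy index with per-field "indexed" and "stored" flags (hypothetical, not Lucene code). */
class FieldedIndex {
    static final class Field {
        final String name;
        final String value;
        final boolean indexed; // searchable through the inverted index
        final boolean stored;  // original value can be read back from the index

        Field(String name, String value, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // Term "<field>:<text>" -> docIds; each field is its own search space.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // docId -> stored field values.
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    /** Adds a document made of fields and returns its docId. */
    int addDocument(List<Field> fields) {
        int docId = storedFields.size();
        Map<String, String> stored = new HashMap<>();
        for (Field f : fields) {
            if (f.indexed) {
                for (String token : f.value.toLowerCase().split("\\W+")) {
                    if (token.isEmpty()) continue;
                    postings.computeIfAbsent(f.name + ":" + token, k -> new TreeSet<>()).add(docId);
                }
            }
            if (f.stored) stored.put(f.name, f.value);
        }
        storedFields.add(stored);
        return docId;
    }

    /** Finds documents whose given field contains the text. */
    TreeSet<Integer> search(String field, String text) {
        return postings.getOrDefault(field + ":" + text.toLowerCase(), new TreeSet<>());
    }

    /** Returns the stored value of a field, or null if the field was not stored. */
    String storedValue(int docId, String field) {
        return storedFields.get(docId).get(field);
    }
}
```

For a wiki article, the title could be both indexed and stored so it can be shown in results, while the body is indexed only: it stays searchable, but its full text is not kept in the index.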



More about Lucene index file formats - here