Lucene by default indexes only the first 10,000 words of a field

Posted by ChEthAn


In my last project I used Lucene for indexing and searching documents. To parse the documents I used external libraries such as PDFBox, POI and JTidy, which parse each document and extract its text; Lucene then added that extracted text to the index.

But when searching, I couldn't find certain words that appeared near the bottom of a document. For almost a day I couldn't work out why.

Finally I found out that, by default, Lucene indexes only the first 10,000 terms (words) of a field. That's why I couldn't find some of the words. Because no error is raised when the rest of the text is dropped, it is difficult for users and developers to spot the real problem.
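To see why nothing looks wrong until a search fails, here is a minimal sketch of the effect in plain Java. It is only a simulation, not Lucene's actual code: the class `TruncationDemo` and the method `indexTerms` are hypothetical names made up for this illustration, and splitting on whitespace stands in for a real analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Plain-Java simulation of a per-field term limit; this is NOT Lucene code.
class TruncationDemo {

    // Keeps only the first maxFieldLength whitespace-separated terms.
    // Everything after the limit is dropped silently, with no error.
    static Set<String> indexTerms(String text, int maxFieldLength) {
        String[] tokens = text.trim().split("\\s+");
        int limit = Math.min(tokens.length, maxFieldLength);
        return new HashSet<>(Arrays.asList(tokens).subList(0, limit));
    }

    public static void main(String[] args) {
        // Build a document with 10,000 filler terms followed by one extra term.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            doc.append("filler ");
        }
        doc.append("needle"); // this is term number 10,001

        Set<String> truncated = indexTerms(doc.toString(), 10000);
        Set<String> full = indexTerms(doc.toString(), 100000);

        System.out.println(truncated.contains("needle")); // false
        System.out.println(full.contains("needle"));      // true
    }
}
```

With the limit at 10,000 the last word never reaches the "index", so a search for it finds nothing, which matches what I saw with words at the bottom of my documents.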

So, to index the whole document, you have to set the maximum field length of the IndexWriter explicitly with setMaxFieldLength; otherwise Lucene will index only the first 10,000 terms of each field.

Sample code snippet:

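// Maximum number of terms indexed per field; user configurable (Lucene's default is 10,000).
private static final int MAX_FIELD_LENGTH = 100000;

// false = open the existing index at LUCENE_DIR instead of creating a new one.
IndexWriter indexWriter = new IndexWriter(LUCENE_DIR, new StandardAnalyzer(), false);
indexWriter.setMaxFieldLength(MAX_FIELD_LENGTH);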

This entry was posted on Monday, December 29, 2008.

1 comment

Anonymous  

Hi Chethan, you scored a century with your very first post :)

January 8, 2009 at 4:38 AM
