Sort by date in Solr/Lucene performance problems
  • Time: 2009-11-30 11:07:34
  • Tags:
  • lucene
  • solr

We have set up a Solr index containing 36 million documents (~1K-2K each), and we query for a maximum of 100 documents matching a single simple keyword. This works pretty fast, as we had hoped. However, if we add "&sort=createDate+desc" to the query (thus asking for the 100 newest documents matching the query), it runs for a very long time and finally fails with an OutOfMemoryError. From what I've understood from the manual, this happens because Lucene needs to load all the distinct values for this field (createDate) into memory (the FieldCache, afaik) before it can execute the query. As the createDate field contains both date and time, the number of distinct values is pretty large. Also important to mention: we update the index frequently.

Perhaps someone can provide some insights and directions on how we can tune Lucene / Solr or change our approach in such a way that query times become acceptable? Your input will be much appreciated! Thanks.

Best answer

The problem is that Lucene stores numbers as strings. Some utilities split the date into separate YYYY, MM, and DD fields, which gives much better results.
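To make the date-splitting idea concrete, here is a minimal sketch in plain Java. It is not one of the utilities mentioned above: the class `DateParts` and the field names `createYear`, `createMonth`, and `createDay` are assumptions made up for illustration, and UTC is an arbitrary choice. Each resulting field has only a handful of distinct values, compared with one value per timestamp.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class DateParts {
    // Splits an epoch-millisecond timestamp into coarse UTC components.
    // The field names are illustrative assumptions, not a Solr convention.
    static Map<String, Integer> split(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        Map<String, Integer> fields = new LinkedHashMap<>();
        fields.put("createYear", t.getYear());        // e.g. 2009
        fields.put("createMonth", t.getMonthValue()); // 1..12
        fields.put("createDay", t.getDayOfMonth());   // 1..31
        return fields;
    }
}
```

Each entry would then be indexed as its own field next to the original createDate. Note that sorting on year, month, and day alone only orders documents down to the day.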

Newer versions of Lucene (2.9 onwards) support numeric fields, and the performance improvements are significant (a couple of orders of magnitude, IIRC). Check this article about numeric queries.

Other answers

You can sort the results by index order instead. The sort specification for descending by document number is:

new SortField(null, SortField.DOC, true)

You should also partition the index directories by the date field. Lucene examines all matching documents when collecting the top N results, so partitioning splits the examined set. You don't need to examine the older partitions if the newest partition already yields N results.
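The newest-first strategy could look roughly like the sketch below. It is plain Java with no Lucene calls: `Partition` is a hypothetical stand-in for one date-partitioned index, and `PartitionedSearch.topN` is an invented name. The sketch assumes each partition returns its own hits newest first, so a date sort still happens, but only over one partition's matches at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one date-partitioned index (not a Lucene type).
interface Partition {
    // Returns up to `limit` matching document ids, newest first.
    List<String> search(String keyword, int limit);
}

final class PartitionedSearch {
    // Walks the partitions from newest to oldest and stops as soon as
    // n hits have been collected, so older partitions are never touched.
    static List<String> topN(List<Partition> partitionsNewestFirst, String keyword, int n) {
        List<String> hits = new ArrayList<>();
        for (Partition p : partitionsNewestFirst) {
            if (hits.size() >= n) {
                break; // enough results; skip all remaining (older) partitions
            }
            hits.addAll(p.search(keyword, n - hits.size()));
        }
        return hits;
    }
}
```

For a common keyword the loop usually ends after the first partition. A rare keyword may still walk through every partition.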

Try converting your Date type data into a String type (such as milliseconds).
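One caveat if you go the string route: strings compare lexicographically, so the millisecond value needs a fixed width or "999" sorts after "1000". A minimal sketch, assuming non-negative epoch milliseconds (`SortableDate` is a made-up helper, not a Lucene or Solr class):

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class SortableDate {
    // Zero-pads epoch milliseconds to 13 digits so that string order
    // matches chronological order. Present-day timestamps have 13 digits.
    static String toSortableString(long epochMillis) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("pre-1970 timestamps are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%013d", epochMillis);
    }
}
```

This keeps one distinct value per millisecond, so on its own it does not reduce the number of distinct terms in the field.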




