English 中文(简体)
ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage? [closed]
原标题:
最佳回答

As the creator of ElasticSearch, maybe I can give you some reasoning on why I went ahead and created it in the first place :).

Using pure Lucene is challenging. There are many things that you need to take care for if you want it to really perform well, and also, its a library, so no distributed support, it s just an embedded Java library that you need to maintain.

In terms of Lucene usability, way back when (almost 6 years now), I created Compass. Its aim was to simplify using Lucene and make everyday Lucene simpler. What I came across time and time again is the requirement to be able to have Compass distributed. I started to work on it from within Compass, by integrating with data grid solutions like GigaSpaces, Coherence, and Terracotta, but it s not enough.

At its core, a distributed Lucene solution needs to be sharded. Also, with the advancement of HTTP and JSON as ubiquitous APIs, it means that a solution that many different systems with different languages can easily be used.

This is why I went ahead and created ElasticSearch. It has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through JSON DSL.

Solr is also a solution for exposing an indexing/search server over HTTP, but I would argue that ElasticSearch provides a much superior distributed model and ease of use (though currently lacking on some of the search features, but not for long, and in any case, the plan is to get all Compass features into ElasticSearch). Of course, I am biased, since I created ElasticSearch, so you might need to check for yourself.

As for Sphinx, I have not used it, so I can t comment. What I can refer you is to this thread at Sphinx forum which I think proves the superior distributed model of ElasticSearch.

Of course, ElasticSearch has many more features than just being distributed. It is actually built with a cloud in mind. You can check the feature list on the site.

问题回答

I have used Sphinx, Solr and Elasticsearch. Solr/Elasticsearch are built on top of Lucene. It adds many common functionality: web server api, faceting, caching, etc.

If you want to just have a simple full text search setup, Sphinx is a better choice.

If you want to customize your search at all, Elasticsearch and Solr are the better choices. They are very extensible: you can write your own plugins to adjust result scoring.

Some example usages:

  • Sphinx: craigslist.org
  • Solr: Cnet, Netflix, digg.com
  • Elasticsearch: Foursquare, Github

We use Lucene regularly to index and search tens of millions of documents. Searches are quick enough, and we use incremental updates that do not take a long time. It did take us some time to get here. The strong points of Lucene are its scalability, a large range of features and an active community of developers. Using bare Lucene requires programming in Java.

If you are starting afresh, the tool for you in the Lucene family is Solr, which is much easier to set up than bare Lucene, and has almost all of Lucene s power. It can import database documents easily. Solr are written in Java, so any modification of Solr requires Java knowledge, but you can do a lot just by tweaking configuration files.

I have also heard good things about Sphinx, especially in conjunction with a MySQL database. Have not used it, though.

IMO, you should choose according to:

  • The required functionality - e.g. do you need a French stemmer? Lucene and Solr have one, I do not know about the others.
  • Proficiency in the implementation language - Do not touch Java Lucene if you do not know Java. You may need C++ to do stuff with Sphinx. Lucene has also been ported into other languages. This is mostly important if you want to extend the search engine.
  • Ease of experimentation - I believe Solr is best in this aspect.
  • Interfacing with other software - Sphinx has a good interface with MySQL. Solr supports ruby, XML and JSON interfaces as a RESTful server. Lucene only gives you programmatic access through Java. Compass and Hibernate Search are wrappers of Lucene that integrate it into larger frameworks.

We use Sphinx in a Vertical Search project with 10.000.000 + of MySql records and 10+ different database . It has got very excellent support for MySQL and high performance on indexing , research is fast but maybe a little less than Lucene. However it s the right choice if you need quickly indexing every day and use a MySQL db.

My sphinx.conf

source post_source 
{
    type = mysql

    sql_host = localhost
    sql_user = ***
    sql_pass = ***
    sql_db =   ***
    sql_port = 3306

    sql_query_pre = SET NAMES utf8
    # query before fetching rows to index

    sql_query = SELECT *, id AS pid, CRC32(safetag) as safetag_crc32 FROM hb_posts


    sql_attr_uint = pid  
    # pid (as  sql_attr_uint ) is necessary for sphinx
    # this field must be unique

    # that is why I like sphinx
    # you can store custom string fields into indexes (memory) as well
    sql_field_string = title
    sql_field_string = slug
    sql_field_string = content
    sql_field_string = tags

    sql_attr_uint = category
    # integer fields must be defined as sql_attr_uint

    sql_attr_timestamp = date
    # timestamp fields must be defined as sql_attr_timestamp

    sql_query_info_pre = SET NAMES utf8
    # if you need unicode support for sql_field_string, you need to patch the source
    # this param. is not supported natively

    sql_query_info = SELECT * FROM my_posts WHERE id = $id
}

index posts 
{
    source = post_source
    # source above

    path = /var/data/posts
    # index location

    charset_type = utf-8
}

Test script:

<?php

    require "sphinxapi.php";

    $safetag = $_GET["my_post_slug"];
//  $safetag = preg_replace("/[^a-z0-9-_]/i", "", $safetag);

    $conf = getMyConf();

    $cl = New SphinxClient();

    $cl->SetServer($conf["server"], $conf["port"]);
    $cl->SetConnectTimeout($conf["timeout"]);
    $cl->setMaxQueryTime($conf["max"]);

    # set search params
    $cl->SetMatchMode(SPH_MATCH_FULLSCAN);
    $cl->SetArrayResult(TRUE);

    $cl->setLimits(0, 1, 1); 
    # looking for the post (not searching a keyword)

    $cl->SetFilter("safetag_crc32", array(crc32($safetag)));

    # fetch results
    $post = $cl->Query(null, "post_1");

    echo "<pre>";
    var_dump($post);
    echo "</pre>";
    exit("done");
?>

Sample result:

[array] => 
  "id" => 123,
  "title" => "My post title.",
  "content" => "My <p>post</p> content.",
   ...
   [ and other fields ]

Sphinx query time:

0.001 sec.

Sphinx query time (1k concurrent):

=> 0.346 sec. (average)
=> 0.340 sec. (average of last 10 query)

MySQL query time:

"SELECT * FROM hb_posts WHERE id = 123;"
=> 0.001 sec.

MySQL query time (1k concurrent):

"SELECT * FROM my_posts WHERE id = 123;" 
=> 1.612 sec. (average)
=> 1.920 sec. (average of last 10 query)

Lucene is nice and all, but their stop word set is awful. I had to manually add a ton of stop words to StopAnalyzer.ENGLISH_STOP_WORDS_SET just to get it anywhere near usable.

I haven t used Sphinx but I know people swear by its speed and near-magical "ease of setup to awesomeness" ratio.

Try indextank.

As the case of elastic search, it was conceived to be much easier to use than lucene/solr. It also includes very flexible scoring system that can be tweaked without reindexing.





相关问题
solr problem to get the field names

Ive got a problem. In each document I ve got fields: threads.id and posts.id. I want to get the field name value for them so i can get data from the database. Between the lines beneath i have marked ...

Which is the better client for Solr + PHP?

I have two options http://www.php.net/manual/en/book.solr.php http://code.google.com/p/solr-php-client/ I read it somewhere that that 2) use JSON as output types whereas 1) use XML doc. Isn t ...

Geronimo vs Glassfish

For a production environment, is Apache Geronimo better for applications that uses ActiveMQ, Derby, Solr?

Sort by date in Solr/Lucene performance problems

We have set up an Solr index containing 36 million documents (~1K-2K each) and we try to query a maximum of 100 documents matching a single simple keyword. This works pretty fast as we had hoped for. ...

SOLR - delta import not with last_modified

I saw only ways using delta import with last_modified. Is there some other ways to do delta_imports withut using timestamps? For example, if i have unique key(integer), can i tell SOLR to index only ...

SOLR How to return only limited matched content

ok guys, say in my Schema I have 4 fields: <field name="SiteIdentifier" type="string" indexed="true" stored="true" required="true"/> <field name="Title" type="text" indexed="true" stored="...

Solr - character substitution

I have Solr with indexed database. In my database all data is in Latvian. The problem is, I need to be able to search word Riga as if it is word Rīga. Of course, i can define synonym - Rīga = Riga, ...

热门标签