Development

sfLucenePlugin/1.0

You must first sign up to be able to contribute.

Introduction

sfLucenePlugin integrates symfony and Zend Search Lucene to instantly add a search engine to your application. The plugin will auto-detect your ORM layer, but currently only supports Propel.

For symfony 1.1, please see sfSearchPlugin.

Requirements

  • symfony 1.0.x
  • Propel

Main Features

  • Configured all by YAML files
  • Complete integration with symfony
  • i18n ready
  • Keyword highlighting
  • Stop words, short words
  • Index optimization
  • Custom indexers
  • Multiple indexes

Development Status

The 1.0 sfLucene branch is in maintenance mode and is generally pretty stable.

Installation

  • Install the plugin:
    symfony plugin-install http://plugins.symfony-project.com/sfLucenePlugin
    
  • Initialize configuration files (ignore this if you are upgrading):
    symfony lucene-init myapp
    
  • Clear the cache
    symfony cc
    
  • Configure sfLucene per the instructions below.

Configuring Lucene

The entire plugin is configured by search.yml files placed throughout your application. You must be careful that you are aware of what search.yml file you are working in because each one has a different purpose. As you will later learn, the project level search.yml file controls the entire engine while a module's search.yml defines indexing parameters.

Open your project's search.yml file, located in myproject/config/search.yml . If you followed the installation instructions above, you will see:

MyIndex:
  models:

  index:
    encoding: UTF-8

Similar to your schema.yml file, you can have multiple indexes. The only requirement is that you must name them! So, enter a name for the first index, where "MyIndex?" goes. This is used internally only by the plugin. If you require a different encoding to be used, enter it. Note, however, that UTF-8 is generally the best charset to store your indexes in.

If you require i18n support, you must define the cultures that you support under index. Use the following syntax:

index:
  cultures: [en_US, fr_FR]

(If you receive an exception saying "Culture XXX is not enabled" then define the culture even if you do not use i18n.)

By default, the plugin will not index or search on common words, such as "the" and "a". Further, it ignores single characters. If you require different behavior, you can define them like so:

index:
  stop_words: [the, an, it]
  short_words: 2

The plugin is also capable of different indexing filters, which determine what content is indexed. For instance, if you need your index to be case-sensitive, you need to use a different indexing filter. The same goes for utf-8 support and numbers. To change these, open the project's search.yml file and add:

index:
  analyzer: utf8num
  case_sensitive: off
  mb_string: on

"analyzer" can either by text, textnum, utf8, or utf8num. If you choose text, all numbers will be ignored. If you require the index to be case sensitive, set "case_sensitive" to "on". mb_string determines whether to use the mb_* string functions instead of the standard PHP string filters. This adds a huge performance bottleneck to indexing when turned on, so use with care. You only need to turn mb_string on if you are working with a case-insensitive utf8 index.

By default, in the command line sfLucene operates in the "search" environment. This has the advantage to that if you require a different configuration setting for the search environment, you can easily set it up. But, if your database is only selectively configured per environment (ie, not for "all"), then you will quickly run into trouble. To get around this, define a database for either "all" or the "search" environment.

Indexing

sfLucene currently supports two ways to add information to the index:

  1. Through the ORM layer
  2. Through symfony actions

Through the ORM layer is the recommended method to add information to the index. The plugin can keep the index synchronized if you use the ORM layer. Through symfony actions is intended only for static content, such as the privacy policy.

ORM layer method

Open your project's search.yml file and you will find a model declaration towards the top. This is where you put the models you wish to index. For each model, you define the fields you want to index and other parameters. The syntax is:

MyIndex:
  models:
    BlogPost:
      fields:
        id: unindexed
        title:
          boost: 1.5
          type: text
        content: unstored
        description: text
    BlogComment:
      fields:
        id: unindexed
        summary: text
        message: text
      description: message
      title: summary

In the above example, two models are set to index: BlogPost? and BlogComment?. In BlogPost?, the fields title, content, and description are stored, but the title fields holds the most weight with a boost factor of 1.5.

When search results are displayed, the system intelligently guesses which field should be displayed as the result "title" and which field is the result "description." However, to be explicit, you can specify a description and title field, as in BlogComment?.

Note that the fields do not have to exist as fields in your database. As long as it has a getter on the model, you can use it in your index.

See the Zend_Search_Lucene documentation for more about the field types.

Further, you can specify a transformation function to put the value through before it is indexed. This is useful if you have HTML code being returned and you need strip it out. Define this like so:

MyIndex:
  models:
    BlogPost:
      fields:
        title: text
        content:
          type: text
          transform: strip_tags
          boost: 1.5

When this model is indexed, the content field will be automatically routed through strip_tags() before being stored in the index.

Next, you must tell your application where to route the model when it is returned. You do this by opening your application's config/search.yml file and defining a route:

MyIndex:
  models:
    BlogPost:
      route: blog/showPost?id=%id%
    BlogComment:
      route: blog/showComment?id=%id%

In routes, %xxx% is a token and will be replaced by the appropriate field value. So, %id% will be the value returned by the ->getId() method. Warning: You must also define the field in the project's search.yml to be indexed or unexpected results will occur!

Finally, you must register the model with the system. If you are using Propel, you must use Propel's behaviors.

Propel

You can do this by opening up the model's file and putting

sfLucenePropelBehavior::getInitializer()->setupModel('MyModel');

after the class declaration. So, for a blog, you would open project/lib/model/BlogPost.php and append the above, replacing "MyModel" with "BlogPost".

Note: Be sure to enable Propel behaviors in your project by changing the file config/propel.ini from

propel.builder.addBehaviors = false

to

propel.builder.addBehaviors = true

and then rebuilding your model with

$ symfony propel-build-model

Advanced Model Settings

You can configure the model even more. If the peer does not follow symfony's naming conventions, you can specify a new one with in the project level search.yml:

MyIndex:
  models:
    BlogPost:
      peer: OtherPeer

Further, sfLucene optimizes memory usage when rebuilding the index from the database by using both an internal pager and hydrating objects on demand. By default, rows are selected in batches of 250, but if you require this to be different, you can customize it like so:

MyIndex:
  models:
    BlogPost:
      rebuild_limit: 250

If only some of your objects should be stored in the index, you can define an validating method on the model that can return a boolean indicating whether the model should be indexed. If this method returns true, the indexer proceeds with indexing. If the method returns false, the indexer ignores that particular instance. By default, the indexer looks for an "isIndexable" method and calls it if it is available. However, you can specify your own method like so:

MyIndex:
  models:
    BlogPost:
      validator: shouldIndex

symfony actions method

To setup an action to be indexed, you must create a file in the module's config directory named search.yml. Inside this file, you define the actions you want indexed:

MyIndex:
  privacy:
  tos:
    security:
      authenticated: true
      credentials: [admin]
  disclaimer:
    params:
      advanced: true
    layout: true

Remember to prefix each one with the name of the index.

As you can see, it is possible to define request parameters, manipulate authentication, and toggle decorating the response. By default, the response is not decorated, the user is not authenticated without any credentials, and there aren't any request parameters.

Building the Index

After you have defined the indexing parameters, you must build the initial index. You do this on the command line:

$ symfony lucene-rebuild myapp

replacing myapp with the name of your application you want to rebuild. This will build the index for all cultures.

Searching

sfLucene ships with a basic search interface that you can use in your application. Like the rest of the plugin, it is i18n ready and all you must do is define the translation phrases.

To enable the interface, open your application's settings.yml file and add "sfLucene" to the enabled_modules section:

all:
  .settings:
    enabled_modules: [default, sfLucene]

You are free to define your own routes in the routing.yml file.

If you have specified multiple indexes in your search.yml files, you need to configure which index that you want to search. You do this by opening the app.yml file and adding the configuration setting:

all:
  lucene:
    index: MyIndex

If you need to configure which index to use on the fly, you can use sfConfig:

sfConfig::set('app_lucene_index', 'MyIndex');

Customizing the Interface

As every application is different, it is easy to customize the search interface to fit the look and feel of your site. Doing this is easy as all you must do is overload the templates and actions.

To get started, simply run the following on the command line:

$ symfony lucene-init-module myapp

If you look in myapp's module folder. you will see a new sfLucene module. Use this to customize your interface.

Often, when writing a search engine, you need to display a different result template for each model. For instance, a blog post should show differently than a forum post. You can easily customize your results by changing the "partial" value in your application's search.yml file. For example:

models:
  BlogPost:
    route: blog/showPost?slug=%slug%
    partial: blog/searchResult
  ForumPost:
    route: forum/showThread?id=%id%
    partial: forum/searchResult

For ForumPost?, the partial apps/myapp/modules/forum/templates/_searchResult.php is loaded. This partial is given a $result object that you can use to build that result. The API for this object is pretty simple:

  • $result->getInternalTitle() returns the title of the search result.
  • $result->getInternalUri() returns the route to the search result.
  • $result->getScore() returns the score / ranking of the search result.
  • $result->getXXX() returns the XXX field.

In addition to the $result object, it is also given a $query string, which was what the user searched for. This is useful for highlighting the results.

If you wish to disable the advanced search interface, open the application's app.yml file and add the following:

all:
  lucene:
    advanced: off

This will prevent sfLucene from giving the user the option to use the advanced mode.

Routing

sfLucene will automatically register friendly routes with symfony. For example, surfing to http://example.org/search will route to sfLucene. If you would like to customize these routes, you can disable them in the app.yml file with:

all:
  lucene:
    routes: off

It will then be up to you configure the routing.

Pagination

You can customize pages by using the same logic as above (defaults to 10):

all:
  lucene:
    per_page: 10

To customize the pager widget that is displayed, change the pageradius key (defaults 5):

all:
  lucene:
    pager_radius: 5

Results

You can configure the presentations of the search results. If you require more fine-tuned customizations, you are encouraged to create your own templates.

To change the number of characters displayed in search results, edit the "resultsize" key:

all:
  lucene:
    result_size: 200

To change the highlighter used to highlight results, edit the "resulthighlighter" key:

all:
  lucene:
    result_highlighter: |
      <strong class="highlight">%s</strong>

Highlighting Pages

The plugin has an optional highlighter than will attempt to highlight keywords from searches. The highlighter will hook into this search engine and also attempts to hook into external search engines, such as Google and Yahoo!.

To enable this feature, open the application's config/filters.yml file and add the highlight filter before the cache filter:

rendering: ~
web_debug: ~
security:  ~

# generally, you will want to insert your own filters here

highlight:
  class: sfLuceneHighlightFilter

cache:     ~
common:    ~
flash:     ~
execution: ~

By default, the highlighter will also attempt to display a notice to the user that automatic highlighting occured. The filter will search the result document for <!--[HIGHLIGHTER_NOTICE]--> and replace it with an i18n-ready notice (note: this is case sensitive).

To highlight a keyword, it must meet the following criteria:

  • must be X/HTML response content type
  • response must not be headers only
  • must not be an ajax request
  • be inside the <body> tag
  • be outside of <textarea> tags
  • be between html tags and not in them
  • not have any other alphabet character on either side of it

To efficiently achieve this, the highlighter parser assumes that the content is well formed X/HTML. If it is not, unexpected highlighting will occur.

The highlighter is also highly configurable. The following filter listing shows the default configuration settings and briefly explains them:

highlight:
  class: sfLuceneHighlightFilter
  param:
    check_referer: on # if true, results from Google, Yahoo, etc will be highlighted.
    highlight_qs: sf_highlight # the querystring to check for highlighted results, NOTE: Deprecated
    highlight_strings: [<strong class="highlight hcolor1">%s</strong>] # how to highlight terms.  %s is replaced with the term
    notice_tag: "<!--[HIGHLIGHTER_NOTICE]-->" # this is replaced with the notice (removed if highlighting does not occur)
    notice_string: > # the notice string for regular highlighting.  %keywords% is replaced with the keywords.  i18n ready.
      <div>The following keywords were automatically highlighted: %keywords%</div>
    notice_referer_string: > # the notice string for referer highlighting.  %keywords% is replaced with the keywords, %from% with where they are from,.  i18n ready
      <div>Welcome from <strong>%from%</strong>!  The following keywords were automatically highlighted: %keywords%</div>

As of version 0.1.4 the preferred method for adding the 'highlight_qs' field is through your project's app.yml file. Defining it in the filters.yml file may cause your application to build routes inconsistently. Here's an example app.yml:

all:
	lucene:
	  highlight_qs:  highlight 

Values defined for 'highlight_qs' in the filter will still be honored, but only if there is nothing defined in the app.yml.

If you need to configure it more, it is possible to extend the highlighting class. Refer to the API documentation for this.

Categories

Starting with 0.0.5 Alpha, each document in the index can be tied to one or more categories. It is then possible to limit your search results to that category in the provided interface. To enable this, you must define a "categories" key to your models or actions. For instance, an example model:

models:
  Blog:
    fields:
      title: text
      post: text
      category: text
    categories: [%category%, Blog]

The "Blog" model above will be placed both into the blog category and the string returned by ->getCategory() on the model. After you rebuild your model, a category drop down will automatically appear on the search interface.

To disable category support all-together, open the application level app.yml file and add:

all:
  lucene:
    categories: off

This will prevent sfLucene from giving the user an option to search by category.

Using the search Criteria API

sfLucene ships with a basic criteria API for easily constructing queries. The API is ideal for most uses, but if you require more advanced functionality, you should use Zend's API. This section will just document the most common ways to interface with the API:

  • You can either use $c = new sfLuceneCriteria; or $c = sfLuceneCriteria::newInstance() . The latter is ideal for method chaining.
  • To add a search criteria, use the ->add() method. The first argument takes either a Zend API object, a string, or another instance of sfLuceneCriteria. The second argument determines how Lucene should handle this criteria. If you give true (default) to the second argument, then document *must* match that criteria. If you give null, then the document *may* match. If you give false, then the document *may not* match. For example, the following: $c = sfLuceneCriteria::newInstance()->add('symfony plugins')->add('cakephp', false); will return documents that contain "symfony plugins" but not "cakephp".
  • If you need to match a field, then you can use ->addField(). ->addField() takes 4 arguments, but only the first one is required. The first one is either a string or an array of values to search under. The second argument is the field name to search under, but if the field is null, then it searches under all fields. The third argument is boolean indicating whether it must match all of the values given. The final argument is how Lucene should handle it (same as above).
  • Use ->addAscendingSortBy() and ->addDescendingSortBy() to sort. Beware that these will drastically slow down your application.

Integrating sfLucene with another plugin

It is possible to integrate sfLucene with other plugins. To add support to your Propel models, you must append the following:

if (class_exists('sfLucene', true))
{
  sfLucenePropelBehavior::getInitializer()->setupModel('MyModel');
}

The conditional lets your plugin function should the user not have this plugin installed.

Then, you must configure sfLucene with your plugin. In project/plugins/sfMyPlugin/config/search.yml, you can define the settings for your models. You can also create a search.yml file in your modules file. But, be warned that these files can be overloaded by the user.

Updating a model's index when a related model changes

If a model's index should be updated based on the modification of a related model, you can override the save method of the related objects to directly call the sfLucene saveIndex and/or deleteIndex methods as in the example below:

class Bicycle extends BaseBicycle
{
  public function save()
  {
    parent::save();
    
    foreach ($this->getWheels() as $wheel)
    {
      $wheel->saveIndex();
    }
  }
}

Custom Indexers

For Individual Models

sfLucene supports custom indexers. Custom indexers are great for complicated data models where the standard indexer would not work. To make a custom Propel indexer, create a class that extends sfLucenePropelIndexer. In this class, you optionally define insert(), shouldIndex(), delete(), and validate() methods. A sample indexer for sfSimpleCMS is below:

class sfSimpleCMSIndexer extends sfLucenePropelIndexer
{
  /**
   * Inserts the model into the index.
   */
  public function insert()
  {
    if (!$this->shouldIndex())
    {
      return $this;
    }

    $doc = $this->getNewDocument();

    $slots = $this->getModel()->getSlots( $this->getCulture() );

    $slotText = '';

    foreach ($slots as $slot)
    {
      $slotText .= strip_tags($slot->getValue()) . "\n\n";
    }

    $doc->addField( $this->getLuceneField('text', 'description', $slotText) );
    $doc->addField( $this->getLuceneField('text', 'title', $this->getModel()->getSlotValue('title', $this->getCulture()) ));
    $doc->addField( $this->getLuceneField('unindexed', 'slug', $this->getModel()->getSlug() ));

    $categories = $this->getModelCategories();

    if (count($categories))
    {
      foreach ($categories as $category)
      {
        $this->addCategory($category);
      }

      $doc->addField( $this->getLuceneField('text', 'sfl_category', implode(', ', $categories)) );
    }

    if ($this->shouldLog())
    {
      $this->echoLog(sprintf('Inserted model "%s" with PK = %s', $this->getModelName(), $this->getModel()->getPrimaryKey()));
    }

    $doc->addField($this->getLuceneField('unindexed', 'sfl_model', $this->getModelName()));
    $doc->addField($this->getLuceneField('unindexed', 'sfl_type', 'model'));

    $this->addDocument($doc, $this->getModelGuid());
  }

  /**
   * Determines if we should index this.
   */
  protected function shouldIndex()
  {
    return $this->getModel()->getIsPublished() ? true : false;
  }

  /**
   * Validates the model to make sure we really can process it.
   */
  protected function validate()
  {
    $response = parent::validate();

    if ($response)
    {
      return $response;
    }

    if (!($this->getModel() instanceof sfSimpleCMSPage))
    {
      return __CLASS__ . ' can only process sfSimpleCMSPage instances';
    }

    return null;
  }
}

To register this indexer with the plugin, open your project's search.yml and define it within the models:

models:
  sfSimpleCMSPage:
    fields:
      id: unindexed
      title: text
    indexer: sfSimpleCMSIndexer

The system will automatically use that indexer for that point forward.

Indexing Other Mediums

sfLucene is extensible and supports indexing other types of mediums, such as PDFs or images. You can hook your custom indexers into sfLucene by defining them in the factories declaration in the search.yml file.

To do this, open your project level search.yml. Add a "factories" key to one of your indexes like so:

MyIndex:
  models:
    ...
  index:
    ...
  factories:
    indexers:
      pdf: [MyPdfIndexerHandler, MyPdfIndexer]

In the above example, when you rebuild the index, in addition to indexing the models and actions, the PDF indexers will also run. When registering new indexers with the system, you must register both a handler and an indexer. The handler is responsible for managing its respective indexer during the rebuilding process. The indexer does the actual indexing. See sfLucene source for more on this.

You can also override the default indexers or disable them all together. In the below example, models are managed by a custom system and actions are not indexed:

MyIndex:
  models:
    ...
  index:
    ...
  factories:
    indexers:
      model: [MyHandler, MyIndexer]
      action: ~

The best way to write your own handlers and indexers is to examine the sfLucene source.

Command Line Reference

The plugin ships with a handful of command line utilities for managing your index. They are listed below:

  • $ symfony lucene-about [application] provides information about the plugin and the index. [application] is optitonal.
  • $ symfony lucene-init application initializes the search configuration files.
  • $ symfony lucene-init-module application initializes a base module for you to customize.
  • $ symfony lucene-optimize application [environment] optimizes the index for all cultures.
  • $ symfony lucene-rebuild application [environment] rebuilds the index for all cultures.

Note: "environment" option added in 0.1.4.

Contribute

All contributions and suggestions are welcome. The project lives in symfony's SVN repository in plugins/sfLucenePlugin. Feel free to commit your suggestions anytime.

Active Tickets

Please report all bugs or suggestions to the trac ticket system.

#2729
sfLuceneHighlighter should hook into Zend_Search_Lucene queries for complete keyword matching
#2757
sfLucene : highlight errors with utf8 accentuated chars when escaping is on
#4013
change zend external
#4138
Unnecessary </label> tag in advancedSearchControls.php
#4231
highlight error: sfLuceneHighlighter "Undefined offset" - for repeted search keys
#4496
sfLucenePlugin11 documentation page is missed!
#5119
The exception message throw an exception
#6409
[PATCH] Field name is not camelized
#6533
Why incorporate Solar in 1.2-Doctrine Branch?
#6989
sfLucenePropelIndexer doesn't restore $old_culture when null
#7165
[PATCH] Unnecessary loaded ProjectConfiguration in sfLuceneBaseTask
#7781
Catchable fatal error when using find() in sfLucene with an input string

Changelog

Version 0.1.7

  • Fixed category notice

Version 0.1.6

  • Fixed sfLucene pagination does not pass categories to pages
  • Backported Propel behavior locking mechanism

Version 0.1.5

  • Fixed lowercase filter for utf8 values (closes #2922)

Version 0.1.4

  • Added option to specify environment in pake tasks
  • Added ability to define 'highlight_qs' in app.yml which fixes routing problems when a value other than the default was used (closes #2873)
  • Upgraded Zend_Search_Lucene to version 1.5.0PR

Version 0.1.3

  • BC: sfLuceneCriteria constructor now requires sfLucene instance
  • Fixed sfLucene throws a notice and returns incorrect search results with UTF8 search query
  • Possible fix for accented characters

Version 0.1.2 Beta

  • Fixed wrong html options in publicControls component (closes #2589).
  • Fixed type-o in documentation (closes #2591) + more

Version 0.1.1 Beta

  • Fixed i18n category "All" value
  • Fixed id tag in categories (thanks Eric)
  • Fixed highlighting filter for UTF8 characters (thanks Eric)
  • Improved Propel batch index memory usage.
  • Upgraded ZSL to trunk (includes support for wild cards, range queries, etc)
  • Added sfMixer hooks
  • Added advanced query types to sfLuceneCriteria
  • Added default routes to sfLucene module
  • Added many more customization options

Version 0.1 Beta

  • BC: Refactored sfLucene to support multiple indexes. The class is now accessed by the singleton constructor:
          sfLucene::getInstance(string $name [, string $culture [, bool $rebuild]]);
    
    Configuration files have also changed and requires manual updating. You now must start each search.yml with the name of the index. See README for more.
  • Added transform field configuration from Thomas Rabaix
  • Consolidated highlighting, all highlighters use common library now
  • Added support for custom types of indexer factories

Version 0.0.7 Alpha

  • Fixed bug with anchors and highlighting filter.
  • Improved performance with highlighting filter and large response contents.
  • Fixed loading unnecessary Zend libraries from francois
  • Added custom analyzer and case sensitivity support
  • Added/fixed encoding support in indexers (thanks Daniel Staver)

Version 0.0.6 Alpha

  • Refactored indexing system & added custom indexer support
  • Propel memory indexing suggestion from Thomas Rabaix
  • BC: Changed behavior based off forum thread #8888: fields are no longer camel-cased. If you had "my_method" in your config, you must make it "myMethod" now.
  • Upgraded to Zend Framework 1.0.2 (build 6503)
  • Fixed category support

Version 0.0.5 Alpha

  • Made titles in interface i18n-ready.
  • Fixed E_STRICT errors in indexers
  • Search query is now auto-populated in interface.
  • Fixed type-o in README
  • Fixed bug in duplicate indexes in i18n models
  • Added category support
  • Added a handful of unit tests
  • Added sfLuceneCriteria API for easily constructing queries

Version 0.0.4 Alpha

  • Added exception suite and updated throws.
  • Removed required dependency on Propel
  • Made room for Doctrine port
  • Changed name to sfLucene
  • Added access keys to the search interface
  • Made all paths use DIRECTORY_SEPARATOR constant.
  • Started writing unit tests.
  • Added "Basic" button to advanced search page
  • BC: Changed routing tokens to %XXX% -- you must manually update (fixes forum topic 8403).

Version 0.0.3 Alpha

  • Added [remove highlighting] button to sfPZSLHighlightFilter
  • Added integration support for plugins.
  • BC: Removed many poltergeist anti-patterns
  • BC: Changed sfPZSL::getInstance() method signature & removed sfPZSL::getCultureInstance()
  • BC: sfPZSLModelIndexer separated to allow for other ORM layers
  • Added better automatic i18n detection
  • Made browser a little bit more robust

Version 0.0.2 Alpha

  • Full i18n support.
  • Fixed index permissions bug between Apache and user.
  • Fixed explicit title and description bug.
  • Revamped & improved CLI interface with more feedback to user.
  • Added boost support and expanded field syntax.
  • Added highlighting to titles in search interface.
  • Revamped config handlers.
  • Improved indexing performance 3-fold.
  • Added generic advanced search page
  • Added highlighting filter with hooks into Google, Yahoo, MSN, and Ask.
  • sfPZSL goes into a reduced state when there are syntax errors in the query.
  • Added stop word filter support
  • Added short word filter support

Version 0.0.1 "Proof of Concept"

  • First public release

Attachments