Friday, October 10, 2008

Parameters in Apache Solr Schema.xml

Indexing using Apache Solr:

In schema.xml we can create various data types (field types). Every data type is associated with its own Analyzer, optionally with separate ones for index time and query time. The number of data types and their associated Analyzers define how we want to index the content. We can even have one Analyzer for each of the columns in our database!

1. The main DATA CLASSES:

1.1 Text

Class: predefined Java classes that define the content data type

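sortMissingLast="true" (a sort on this field will cause documents without the field to come after documents with the field)
sortMissingFirst="true" (a sort on this field will cause documents without the field to come before documents with the field)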

omitNorms="true"
omitNorms is set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
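
As a rough sketch of how these attributes might appear in schema.xml, the fragment below follows the layout of the example schema that ships with Solr. The type names (string, text) are only illustrative choices, not something required by Solr.

```xml
<!-- String type: documents missing the field sort after those that have it.
     Norms are omitted because no length normalization or index-time boost is needed. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- Full-text type: omitNorms is left unset here, so norms are kept and
     length normalization and index-time boosts still apply. -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```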

1.2 Numeric:

class="solr.IntField" OR
class="solr.SortableIntField"

If you place a value in an int field of the sortable type, it is converted into an internal form whose ordering matches numeric ordering, so the field sorts as an integer and range queries behave numerically.
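
A minimal sketch of the two numeric declarations side by side. The type names (integer, sint) mirror the Solr example schema, and the field name "price" is a hypothetical one chosen for illustration.

```xml
<!-- Plain integer type -->
<fieldType name="integer" class="solr.IntField" omitNorms="true"/>

<!-- Sortable integer type: suited to fields that need sorting or range queries -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>

<!-- Hypothetical field that uses the sortable type -->
<field name="price" type="sint" indexed="true" stored="true"/>
```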

1.3 Date:

Format for the date field:
1995-12-31T23:59:59Z (the trailing "Z" designates UTC time and is mandatory)

Reference: http://www.w3.org/TR/xmlschema-2/#dateTime

You can apply date math expressions to a date field, and the computed value is what gets stored in the index. For example:
NOW/HOUR
... Round to the start of the current hour
NOW-1DAY
... Exactly 1 day prior to now
NOW/DAY+6MONTHS+3DAYS
... 6 months and 3 days in the future from the start of the current day

For details on the date field, refer to the DateField Javadocs. A probable use case is time-based faceted search.
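
The following is a sketch of how a date field could be declared and populated, assuming the solr.DateField class and the XML update message format. The field names (timestamp, expires) and the id value are hypothetical.

```xml
<!-- schema.xml fragment: a date type and two fields that use it -->
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>
<field name="expires" type="date" indexed="true" stored="true"/>

<!-- Update message: one explicit UTC value and one date math expression -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="timestamp">1995-12-31T23:59:59Z</field>
    <field name="expires">NOW/DAY+6MONTHS+3DAYS</field>
  </doc>
</add>
```

In this sketch the date math expression is resolved when the document is indexed, so the stored value is a fixed point in time rather than something that moves with the clock.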

3. Analyzers: Tokenizers and Token Filters

If you want different columns in your database to use different Tokenizers, they must be associated with different data types in Solr. Over and above the tokenizers, the text can be processed further using Token filters. We can also use predefined analyzer classes written in Java by simply referencing them. They are:

BrazilianAnalyzer, ChineseAnalyzer, CJKAnalyzer, CzechAnalyzer, DutchAnalyzer, FrenchAnalyzer, GermanAnalyzer, GreekAnalyzer, KeywordAnalyzer, PatternAnalyzer, PerFieldAnalyzerWrapper, QueryAutoStopWordAnalyzer, RussianAnalyzer, ShingleAnalyzerWrapper, SimpleAnalyzer, SnowballAnalyzer, StandardAnalyzer, StopAnalyzer, ThaiAnalyzer, WhitespaceAnalyzer

A custom analyzer is defined as a combination of a Tokenizer and Token filters.
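
To make the two options concrete, here is a hedged sketch: the first type references a predefined Lucene analyzer class from the list above, and the second builds a custom analyzer from a tokenizer plus token filters. The type names (text_greek, text_custom) and the stopwords.txt file name are assumptions made for illustration.

```xml
<!-- Option 1: reference a predefined analyzer class directly -->
<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

<!-- Option 2: custom analyzer = one tokenizer followed by token filters -->
<fieldType name="text_custom" class="solr.TextField">
  <analyzer>
    <!-- Split the text on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase every token -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop tokens listed in the stop word file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

The filters run in the order they are listed, so the order can change the resulting tokens.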

2. PositionIncrementGap:

A position increment gap controls the virtual space between the last token of one field instance and the first token of the next instance, that is, between successive values of a multi-valued field. With a gap of 100, this prevents phrase queries (even with a modest slop factor) from matching across instances.
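
A small sketch of where the gap is configured, assuming a multi-valued text field. The field name "features" and the sample values in the comment are hypothetical.

```xml
<!-- The gap is set on the field type -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical multi-valued field. With the values "red car" and "parked outside",
     the token "parked" is placed about 100 positions after "car", so the
     phrase query "car parked" does not match across the two values. -->
<field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
```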

Which Tokenizer do I use? Which Token filters should I apply? How should I create my Analyzers using Tokenizers and Token filters? The answers to these questions depend on the business rules of the search engine.
