Indexing using Apache Solr:
In the schema.xml we can create various data types (field types). Each data type is associated with one analyzer chain (Solr also allows a separate index-time and query-time analyzer to be defined for the same type). The number of data types and their associated analyzers define how we want to index the content. We can have one analyzer for each of the columns in our database!
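As a minimal sketch, a field type with its analyzer and a field using it might be declared like this in schema.xml (the names `text_general` and `title` are hypothetical; the tokenizer and filter factory classes are standard Solr ones):

```xml
<!-- A data type with one analyzer: a tokenizer followed by a filter -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- A field (e.g. a database column) bound to that data type -->
<field name="title" type="text_general" indexed="true" stored="true"/>
```

Each database column you want analyzed differently simply gets its own `<field>` pointing at a different `<fieldType>`.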
1. The main DATA CLASSES:
Class: predefined Java classes used to define the content's data type
sortMissingLast="true" (a sort on this field will cause documents without the field to come after documents with the field)
omitNorms is set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
If you place text inside an int field of the sortable type, it will be converted to an integer representation and can be sorted correctly.
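The two attributes above are typically set together on sortable numeric types; a sketch (the type name `sint` and field name `popularity` are illustrative, `solr.SortableIntField` is the classic sortable int class):

```xml
<!-- Sortable int: missing values sort last, norms omitted to save memory -->
<fieldType name="sint" class="solr.SortableIntField"
           sortMissingLast="true" omitNorms="true"/>

<field name="popularity" type="sint" indexed="true" stored="true"/>
```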
Format for the date field:
1995-12-31T23:59:59Z trailing "Z" designates UTC time and is mandatory
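A date field declaration and a document value in that format might look like this (the field name `timestamp` is hypothetical; `solr.DateField` is the stock date class):

```xml
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>

<!-- In a document being indexed: -->
<field name="timestamp">1995-12-31T23:59:59Z</field>
```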
You can perform date math operations on a date field value when indexing or querying, for example:
... Round to the start of the current hour
... Exactly 1 day prior to now
... 6 months and 3 days in the future from the start of the current day
For further DateField details refer to the Javadocs; a probable use case is time-based faceted search.
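The three descriptions above correspond to Solr's standard date math syntax (rounding with `/`, offsets with `+`/`-` relative to `NOW`); a sketch of how they are written and used in a range query:

```
NOW/HOUR                -- round to the start of the current hour
NOW-1DAY                -- exactly 1 day prior to now
NOW/DAY+6MONTHS+3DAYS   -- 6 months and 3 days in the future from the start of the current day

timestamp:[NOW/DAY-1YEAR TO NOW/DAY]   -- example range query on a date field
```

The `timestamp` field name here is hypothetical; the date math operators themselves are part of Solr's DateField syntax.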
3. Analyzers: Tokenizers and Token Filters
If you want different columns in your database to use different tokenizers, they must be associated with different data types in Solr. On top of the tokenizers, the text can be further processed using token filters. We can also use predefined analyzer classes in Java and just include them; they are:
BrazilianAnalyzer, ChineseAnalyzer, CJKAnalyzer, CzechAnalyzer, DutchAnalyzer, FrenchAnalyzer, GermanAnalyzer, GreekAnalyzer, KeywordAnalyzer, PatternAnalyzer, PerFieldAnalyzerWrapper, QueryAutoStopWordAnalyzer, RussianAnalyzer, ShingleAnalyzerWrapper, SimpleAnalyzer, SnowballAnalyzer, StandardAnalyzer, StopAnalyzer, ThaiAnalyzer, WhitespaceAnalyzer
Defining a custom analyzer is a matter of combining a tokenizer with token filters.
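A sketch of such a combination in schema.xml, with separate index-time and query-time chains (the type name `text_custom` is hypothetical; the factory classes and the `words` attribute on the stop filter are standard Solr ones, and `stopwords.txt` is an assumed file in the conf directory):

```xml
<fieldType name="text_custom" class="solr.TextField">
  <!-- Index-time chain: tokenize, drop stop words, lowercase, stem -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <!-- Query-time chain: same steps minus stop-word removal -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

The filters run in the order listed, each consuming the token stream produced by the previous stage.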
A position increment gap (the positionIncrementGap attribute) controls the virtual space between the last token of one field instance and the first token of the next instance in a multi-valued field. With a gap of 100, phrase queries (even with a modest slop factor) are prevented from matching across instances.
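The gap is set on the field type; a sketch (the names `text_gap` and `author` are illustrative):

```xml
<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- multiValued: the 100-position gap is inserted between each value -->
<field name="author" type="text_gap" indexed="true" stored="true" multiValued="true"/>
```

With two values "John Smith" and "Mary Jones", a phrase query for "Smith Mary" will not match, because 100 virtual positions separate the two instances.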
Which tokenizer do I use? Which token filters should I apply? How should I compose my analyzers from tokenizers and token filters? The answers depend on the business rules of the search engine.