70
Suchen und Finden mit Lucene und Solr Florian Hopf 04.07.2012

Lucene Solr talk at Java User Group Karlsruhe

Embed Size (px)

DESCRIPTION

German slides introducing Lucene and Solr.

Citation preview

Page 1: Lucene Solr talk at Java User Group Karlsruhe

Suchen und Finden mit Lucene und Solr

Florian Hopf04.07.2012

Page 2: Lucene Solr talk at Java User Group Karlsruhe

http://techcrunch.com/2010/08/04/schmidt-data/

Page 3: Lucene Solr talk at Java User Group Karlsruhe
Page 4: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Page 5: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

Page 6: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

SQL

Page 7: Lucene Solr talk at Java User Group Karlsruhe
Page 8: Lucene Solr talk at Java User Group Karlsruhe

Talktitlecontenttalkdate

Speaker

name

Category

name

*

*

*

*

Page 9: Lucene Solr talk at Java User Group Karlsruhe

mysql> select * from talk where title like "%apache%";+----+-------------------------------------------+---------+------------+| id | title | content | talkdate |+----+-------------------------------------------+---------+------------+| 1 | Apache Karaf | ... | 2012-04-25 || 2 | Integration ganz einfach mit Apache Camel | ... | 2012-04-04 |+----+-------------------------------------------+---------+------------+2 rows in set (0.00 sec)

Page 10: Lucene Solr talk at Java User Group Karlsruhe

mysql> select * from talk t join talk_category tc join category c where t.id = tc.talk and tc.category = c.id and (c.name like '%OSGi%' or t.title like '%OSGi%' or t.content like '%OSGi%');+----+-------------------------------------------+---------+------------+------+----------+----+------+| id | title | content | talkdate | talk | category | id | name |+----+-------------------------------------------+---------+------------+------+----------+----+------+| 1 | Apache Karaf | ... | 2012-04-25 | 1 | 1 | 1 | OSGi || 2 | Integration ... | ... | 2012-04-04 | 2 | 1 | 1 | OSGi |+----+-------------------------------------------+---------+------------+------+----------+----+------+2 rows in set (0.00 sec)

Page 11: Lucene Solr talk at Java User Group Karlsruhe

● Performance?● Ranking?● False Positives● Ähnlichkeitssuche (Meyer <=> Meier)● Flexibilität?● Wartbarkeit?

Page 12: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

Page 13: Lucene Solr talk at Java User Group Karlsruhe

File directory = new File(dir); File[] textFiles = directory.listFiles(new TextFiles());

for (File probableMatch : textFiles) { String text = readText(probableMatch); if (text.matches(".*" + Pattern.quote(term) + ".*")) { filesWithMatches.add(probableMatch.getAbsolutePath()); } }

Page 14: Lucene Solr talk at Java User Group Karlsruhe

● Skalierbarkeit?● Unterschiedliche Formate?● Ranking?● Kombination mit Datenbank?

Page 15: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

SQL

Page 16: Lucene Solr talk at Java User Group Karlsruhe

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

Page 17: Lucene Solr talk at Java User Group Karlsruhe

Die 1

Stadt 1,2

liegt 1

in 1

den 1

Bergen 1

Vom 2

Berg 2

kann 2

man 2

die 2

sehen 2

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

1. Tokenization

Page 18: Lucene Solr talk at Java User Group Karlsruhe

die 1,2

stadt 1,2

liegt 1

in 1

den 1

bergen 1

vom 2

berg 2

kann 2

man 2

sehen 2

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

1. Tokenization

2. Lowercasing

Page 19: Lucene Solr talk at Java User Group Karlsruhe

die 1,2

stadt 1,2

liegt 1

in 1

den 1

berg 1,2

vom 2

kann 2

man 2

seh 2

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

1. Tokenization

2. Lowercasing

3. Stemming

Page 20: Lucene Solr talk at Java User Group Karlsruhe
Page 21: Lucene Solr talk at Java User Group Karlsruhe

● Java-Bibliothek● Invertierter Index● Analyzer● Query-Syntax● Relevanz-Algorithmus● KEIN Crawler● KEIN Document-Extractor

Page 22: Lucene Solr talk at Java User Group Karlsruhe

Quelle: http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/

Page 23: Lucene Solr talk at Java User Group Karlsruhe

● Indexieren:● Erstellen eines Documents● Festlegen des Analyzers● Indexieren über IndexWriter

● Suchen:● Verwendung des selben Analyzers● Parsen der Query mit QueryParser ● Auslesen über IndexSearcher/IndexReader ● Ausgabe über Document

Page 24: Lucene Solr talk at Java User Group Karlsruhe

Document

Field Name 1 Value 1title Value

Document

Field Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Christian Schneider

Field Name 1 Value 1title ValueField Name 1 Value 1date 20120404

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache Camel

Document

Field Name 1 Value 1title Value

Document

Field Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Christian Schneider

Field Name 1 Value 1title ValueField Name 1 Value 1date 20120425

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1title Apache Karaf

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Achim Nierbeck

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Christian Schneider

Page 25: Lucene Solr talk at Java User Group Karlsruhe

● Index● ANALYZED● NOT_ANALYZED● NO

● Store● YES/NO

● Feldtyp● String, Numeric, Boolean

Page 26: Lucene Solr talk at Java User Group Karlsruhe
Page 27: Lucene Solr talk at Java User Group Karlsruhe

Tokenizer

TokenFilter

TokenFilter

TokenFilter

TokenFilter

Analyzer

Page 28: Lucene Solr talk at Java User Group Karlsruhe

StandardTokenizer

GermanNormalizationFilter

LowercaseFilter

StandardFilter

GermanLightStemFilter

Analyzer

Tokenizer

TokenFilter

TokenFilter

TokenFilter

TokenFilter

Analyzer

Page 29: Lucene Solr talk at Java User Group Karlsruhe

Directory

IndexWriter

Analyzer

Document

Page 30: Lucene Solr talk at Java User Group Karlsruhe
Page 31: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 32: Lucene Solr talk at Java User Group Karlsruhe

Directory

IndexWriter

Analyzer

Document

IndexReaderIndexSearcher

QueryParser

Query

Page 33: Lucene Solr talk at Java User Group Karlsruhe
Page 34: Lucene Solr talk at Java User Group Karlsruhe

● TermQuery ● Apache ● title:Apache

● Boolean Query ● Apache AND Karaf

● PhraseQuery ● "Apache Karaf"

Page 35: Lucene Solr talk at Java User Group Karlsruhe

● WildcardQuery ● Integ* ● Te?t

● RangeQuery ● date:[20120705 TO 20121231]

● FuzzyQuery ● Schneyder~

Page 36: Lucene Solr talk at Java User Group Karlsruhe

title:Apache AND speaker:schneyder~ AND date:[20120401 TO 20120430]

Page 37: Lucene Solr talk at Java User Group Karlsruhe

title:Apache AND speaker:schneyder~ AND date:[20120401 TO 20120430]

BooleanQueryAND

RangeQuerydate:[...]

FuzzyQueryspeaker:schneyder

TermQuerytitle:apach

Page 38: Lucene Solr talk at Java User Group Karlsruhe

title:Apache AND speaker:schneyder~ AND date:[20120401 TO 20120430]

BooleanQueryAND

RangeQuerydate:[...]

FuzzyQueryspeaker:schneyder

TermQuerytitle:apach

Page 39: Lucene Solr talk at Java User Group Karlsruhe

● FilterQueries● Ausschlusskriterium, kann gecacht werden

● Sortierung● Boosting

● Indexing-Time● Query-Time

Page 40: Lucene Solr talk at Java User Group Karlsruhe

score(q ,d )=coord (q ,d )∗queryNorm (q )∗∑t∈q

(tf (t , d )∗idf (t)2∗t.boost∗norm (t , d ))

Page 41: Lucene Solr talk at Java User Group Karlsruhe

score(q ,d )=coord (q ,d )∗queryNorm (q )∗∑t∈q

(tf (t , d )∗idf (t)2∗t.boost∗norm (t , d ))

Anzahl Term im Dokument

Feldlänge,Index-Boost

Invers zu Anzahl Dokumente, die

den Term enthalten

Anzahl der Matches imDokument

Query-Boost

Page 42: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 43: Lucene Solr talk at Java User Group Karlsruhe
Page 44: Lucene Solr talk at Java User Group Karlsruhe

● Parser API● Zahlreiche Formate● Integriert OpenSource-Libs● Betrieb embedded oder über Server

Page 45: Lucene Solr talk at Java User Group Karlsruhe
Page 46: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 47: Lucene Solr talk at Java User Group Karlsruhe
Page 48: Lucene Solr talk at Java User Group Karlsruhe

● Enterprise Search Server● Basiert auf Lucene● HTTP API● Index-Schema● Integriert häufig verwendete Lucene-Module● Facettierung● Dismax Query Parser● Admin-Interface

Page 49: Lucene Solr talk at Java User Group Karlsruhe

WebappWebapp

Client Solr Lucenehttp

XML, JSON, JavaBin, Ruby, ...

Page 50: Lucene Solr talk at Java User Group Karlsruhe

schema.xml solr-config.xml

dataconf

Solr Home

Lucene

Page 51: Lucene Solr talk at Java User Group Karlsruhe

schema.xml

Field Types

Fields

Page 52: Lucene Solr talk at Java User Group Karlsruhe
Page 53: Lucene Solr talk at Java User Group Karlsruhe
Page 54: Lucene Solr talk at Java User Group Karlsruhe

SearchHandler

DIH

SearchComp.

Monitoring

Lucene

SolrCell

UpdateRequestHandler

Replication

Search Index Index StartIndexing

DBURLFiles

Caches

Page 55: Lucene Solr talk at Java User Group Karlsruhe

solrconfig.xml

Lucene ConfigCaches

Request Handler

Search Components

Page 56: Lucene Solr talk at Java User Group Karlsruhe
Page 57: Lucene Solr talk at Java User Group Karlsruhe

WebappWebapp

Client Solr Lucenehttp

XML, JSON, JavaBin, Ruby, ...

Page 58: Lucene Solr talk at Java User Group Karlsruhe
Page 59: Lucene Solr talk at Java User Group Karlsruhe
Page 60: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 61: Lucene Solr talk at Java User Group Karlsruhe
Page 62: Lucene Solr talk at Java User Group Karlsruhe

Master

SlaveSlave

Loadbalancer

Search

Replicate

Index

Page 63: Lucene Solr talk at Java User Group Karlsruhe

● Geospatial Search● More like this● Spellchecker● Suggester● Result Grouping● Function Queries● Sharding

Page 64: Lucene Solr talk at Java User Group Karlsruhe
Page 65: Lucene Solr talk at Java User Group Karlsruhe

● Suchserver basierend auf Apache Lucene● RESTful API● Dokumentenorientiert (JSON)● Schemafrei● Distributed Search● Near Realtime Search● No Commits (Transaction Log)

Page 66: Lucene Solr talk at Java User Group Karlsruhe

curl -XPOST 'http://localhost:9200/jugka/talk/' -d '{ "speaker" : "Florian Hopf", "date" : "2012-07-04T19:30:00", "title" : "Suchen und Finden mit Lucene und Solr"}'

{"ok":true,"_index":"jugka","_type":"talk","_id":"CeltdivQRGSvLY_dBZv1jw","_version":1}

Page 67: Lucene Solr talk at Java User Group Karlsruhe

curl -XGET 'http://localhost:9200/jugka/talk/_search?q=solr'{"took":29,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.054244425,"hits":[{"_index":"jugka","_type":"talk","_id":"CeltdivQRGSvLY_dBZv1jw","_score":0.054244425, "_source" : { "speaker" : "Florian Hopf", "date" : "2012-07-04T19:30:00", "title" : "Suchen und Finden mit Lucene und Solr"}

Page 68: Lucene Solr talk at Java User Group Karlsruhe

● http://lucene.apache.org● http://tika.apache.org● http://lucene.apache.org/solr/● http://elasticsearch.org● http://github.com/fhopf/lucene-solr-talk

Page 69: Lucene Solr talk at Java User Group Karlsruhe

http://nlp.stanford.edu/IR-book/

Page 70: Lucene Solr talk at Java User Group Karlsruhe

Vielen Dank!

http://www.florian-hopf.de@fhopf