Text Classification


The Scirus Classifier

Overview

• Complex program:
– Pornography filter
– Extraction of names
– Classification based on text
– Classification by URL/title
– Fixed classification based on a URL list
– Extraction of title/author/abstract etc. from articles
– Output of refinement terms

• Of interest here: only the classification based on textual content

Text Classification

• Lexicon-based:
– Phrases or words
– Each receives a weight for every category
– Strong indicators

• Classification by computing a score:
– For every occurrence, a counter is incremented for each category
– Normalization by document length
– Threshold

Configuration

Specifier (with example value): Meaning

LEXICONS (n.b. all file names are relative paths from the location of the configuration file)

– AUT=phrases.Aut: Phrases dictionary file name.
– SPLX=urltitle.lex or SPAU=urltitle.Aut: Dictionary for words and phrases in the URL and title of the document.
– OCLX=porn.lex or OCAU=porn.Aut: Dictionary for offensive content detection. If OCLX: text file (will be loaded into a trie); if OCAU: automaton file. Only one lexicon is to be specified.
– NAMES_LEX=lexicon.min: Specifies the lexicon used for Type Classification (it used to contain only names at some time, hence the name of the specifier).
– NWDS=400: Maximum number of words per document to use for Content Classification. This can also be set through the options class (a specification at runtime overrides the specification in the configuration file).
– SMD=100: Maximum number of words allowed for a "small" document. Topic identification for small documents is done by the URL/Title classifier if the normal Classifier isn't able to identify any reliable subject (due to the lack of enough words in the text).
– SUBJ=genetics main 0 0 0.9: Defines a topic. The name has two words, next is the topic code, then the subtopic code; for main topics, the subtopic code is zero. The last float is a correction factor for the subject score, depending on the coverage of the domain dictionary.
– SUBJ=genetics molecular 0 3: Another example, here a subtopic definition. Note how the topic code is the same, because molecular genetics is a subtopic of genetics. The correction factor can be omitted (default 1.0).

Configuration File

//Number of words to process for subject identification
NWDS=2000000
MINWORDS=100
THRESHOLD=1
SUBJ=gen all 0 0
SUBJ=chem all 1 0
SUBJ=comp all 2 0
SUBJ=eng all 3 0
SUBJ=env all 4 0
SUBJ=geo all 5 0
SUBJ=astro all 6 0
SUBJ=life all 7 0
SUBJ=math all 8 0
SUBJ=mat all 9 0
SUBJ=med all 10 0
….
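To make the KEY=VALUE layout and the SUBJ fields concrete, here is a minimal Python sketch of how such a file could be read. It is not the classifier's own parser: the names `Subject`, `Config`, and `parse_config` are hypothetical, and the handling of comments and unknown lines is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    name: str                # two-word topic name, e.g. "genetics main"
    topic: int               # topic code
    subtopic: int            # subtopic code (0 for main topics)
    correction: float = 1.0  # correction factor, 1.0 when omitted


@dataclass
class Config:
    settings: dict = field(default_factory=dict)
    subjects: list = field(default_factory=list)


def parse_config(text: str) -> Config:
    """Collect SUBJ definitions and all other KEY=VALUE settings."""
    cfg = Config()
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: '//' starts a comment; lines without '=' are skipped.
        if not line or line.startswith("//") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "SUBJ":
            parts = value.split()
            correction = float(parts[4]) if len(parts) > 4 else 1.0
            cfg.subjects.append(
                Subject(" ".join(parts[:2]), int(parts[2]), int(parts[3]), correction)
            )
        else:
            cfg.settings[key] = value  # kept as a string, e.g. "2000000"
    return cfg


example = """//Number of words to process for subject identification
NWDS=2000000
THRESHOLD=1
SUBJ=gen all 0 0
SUBJ=genetics main 0 0 0.9
"""
cfg = parse_config(example)
print(cfg.settings["NWDS"], [s.name for s in cfg.subjects])
```

Values are left as strings because the slides do not state the type of every specifier.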

Invocation

CIS Subject Identifier and Content Extractor Version 5.0
USAGE: classifier [-h[elp]] [-os|l[A]] [-it|f|h] [-s[ilent]] [-c CONFIG_FILE] [-nout] [-uat] [-URL<filename>] [-smd<number>] [-ps] [-t FILES_TO_IDENTIFY]

-h: print help
-c CONFIG_FILE: Name of the configuration file. Default is ././config.txt
-os|l[A]: Output format
    -os: Short: only print well identified subjects (default)
    -ol: Long: print all subjects
    -ot: Topics only are output; one line
         Format: filename:WORDCOUNT#GENERALSCIENCESCORE#TOPICSWITHSCORE
    -oA: Store and print all phrases for a topic
    -oT: Print all phrases found in the dictionary
         (Used for dictionary testing only)
-T[t][i][o]: Tasks to carry out and to output (default: all are set)
    t: Topic identification
    i: Information from content extractor
    o: Offensive content filter
-it|h|f: Input format
    -it: Plain text
    -ih: HTML-file
    -if: HTML-file preceded by header
-nINTEGER: Minimum number of words in a document
-MINTEGER: Maximum number of words to be processed in a document
    tokenizer stops after INTEGER words
    Documents with less words will get tag 'not_enough_data'
-mINTEGER: Minimum score for accepted documents
-rINTEGER: Maximum relative count for phrase form/thousand
    In thousand phrases one phrase form will only be counted INTEGER times.
-NINTEGER: Maximum number of phrases to output in results for topics
-t FILES_TO_IDENTIFY: List of files for which subject should be identified. Default: stdin.
-D[r] D1|D2[:F1|F2[:FB1|FB2]]: Process all files in directory and recurse
    Dr: descend recursively into subdirectories
    D1: name of directory to list or recurse
    F1...: filename patterns (may contain *)
    FB1: patterns for forbidden directories (not recursed)
-s: Print only some important messages, not all.
-nout: Turn off URL/Title classifier.
-uat: Use all titles for classification (not just those enclosed in <head>).
-URL<filename>: Filename of the URL list (format: <file><tab><url><newline>).
-smd<number>: Maximum number of words for small documents (default see config file).
-ps: Print title and url scores
-xml: Print XML output

Processing Steps

• Read the text up to the specified number of words

• Match against the lexicon

• Compute the score

• Output the result depending on the threshold
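The first two steps can be sketched as follows. This is only an illustration under stated assumptions: the tokenization rule, the greedy longest-match strategy, and the names `read_words` and `match_lexicon` are all hypothetical, and a plain Python set stands in for whatever lookup structure the classifier really uses.

```python
import re


def read_words(text: str, max_words: int) -> list:
    """Step 1: lowercase word tokens, cut off after max_words.

    Assumption: hyphenated forms such as "a-d" count as one word.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())[:max_words]


def match_lexicon(words: list, lexicon: set, max_phrase_len: int = 4) -> list:
    """Step 2: greedy longest-match lookup of lexicon phrases."""
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_phrase_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in lexicon:
                found.append(phrase)
                i += n  # continue after the matched phrase
                break
        else:
            i += 1  # no phrase starts at this word
    return found


lexicon = {"a-d converter", "crystallin gene"}
words = read_words("The A-D converter samples the crystallin gene signal.", 400)
print(match_lexicon(words, lexicon))
```

The matched phrases would then feed the score computation and the threshold check.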

Scoring Formula

• Let:
– d be a document,
– c a category,
– t a term,
– l(t) the length of t,
– wn(t) the number of words in t,
– q(t,c) the weight of t for c,
– s(t,c) the strong-indicator value of t for c,
– T(c) the classification threshold for c,
– W = min(words in the document, max. processed words)

• $\mathrm{Score}(d,c) = \frac{1}{W} \sum_{t \in d} \left( \frac{l(t)}{2} + (wn(t) - 1) \cdot 2 \right) q(t,c)$
• $\mathrm{Si\text{-}score}(d,c) = \sum_{t \in d} s(t,c)$
• d is classified as c iff $\mathrm{Si\text{-}score}(d,c) > 1$ and $\mathrm{Score}(d,c) > T(c)$
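As a worked illustration of the formula, the following Python sketch computes both scores for one category. Several points are assumptions rather than facts from the slides: l(t) is taken as the character length of the term, s(t,c) as a 0/1 flag, and the names `Hit`, `term_value`, `score`, `si_score`, and `is_classified`, as well as the weights in the example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    term: str      # matched lexicon term t
    weight: float  # q(t, c)
    strong: int    # s(t, c), assumed to be 1 for a strong indicator, else 0


def term_value(term: str) -> float:
    """l(t)/2 + (wn(t) - 1) * 2, with l(t) assumed to be the character length."""
    return len(term) / 2 + (len(term.split()) - 1) * 2


def score(hits, doc_words: int, max_words: int) -> float:
    """Weighted sum over all term occurrences, normalized by W."""
    w = min(doc_words, max_words)
    if w == 0:
        return 0.0
    return sum(term_value(h.term) * h.weight for h in hits) / w


def si_score(hits) -> int:
    """Sum of the strong-indicator values of all term occurrences."""
    return sum(h.strong for h in hits)


def is_classified(hits, doc_words: int, max_words: int, threshold: float) -> bool:
    return si_score(hits) > 1 and score(hits, doc_words, max_words) > threshold


# Two occurrences in a 50-word document, at most 400 words processed:
hits = [Hit("a-d converter", 1, 1), Hit("crystallin gene", 8, 1)]
print(score(hits, 50, 400), si_score(hits), is_classified(hits, 50, 400, 1.0))
```

In the example, "a-d converter" contributes (13/2 + 2) x 1 = 8.5 and "crystallin gene" contributes (15/2 + 2) x 8 = 76, so the score is 84.5 / 50 = 1.69.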

Classification Lexicon

• Format: TERM.INFO1/INFO2/...
• INFO: TOPICS#FREQUENCY#QUALITY#LENGTH#TYPE#ALONE#OUTPUT
– TOPICS: MAIN:SUB
– FREQUENCY: 1 (not used)
– QUALITY: 0...9
– LENGTH (number of words)
– TYPE: 0..3
    • 0: genuine topic-subtopic indicator
    • 1: only to distinguish between subtopics, not indicating topic itself
    • 2: as 0, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
    • 3: as 1, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
– ALONE: 0/1: strong indicator
– OUTPUT: Ø, $, PHRASE

Classification Lexicon

• Example
– a vinculo matrimonii.18:0#1#0#3#0#0#$
– a-37 aircraft.14:0#1#1#3#0#1#a 37 aircraft
– a-address register.2:0#1#1#3#0#1#a address register
– a-bomb survivors.7:0#1#8#3#0#1#a bomb survivors
– a-c substitutions.15:0#1#8#3#0#1#a c substitutions/7:0#1#8#3#0#1#a c substitutions
– a-calcium-calmodulin kinase.11:0#1#8#4#0#1#a calcium-calmodulin kinase
– a-chromanoxyl radical.7:0#1#8#3#0#1#a chromanoxyl radical
– a-crystallin gene.15:0#1#8#3#0#1#a crystallin gene/7:0#1#8#3#0#1#a crystallin gene
– a-d conversion.3:0#1#1#3#0#1#a d conversion
– a-d converter.13:0#1#1#3#0#1#a d converter/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter
– a-deficient mice.11:0#1#7#3#0#1#a deficient mice/15:0#1#8#3#0#1#a deficient mice
– a-delta activity.11:0#1#8#3#0#1#a delta activity
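The entry format can be made concrete with a small parsing sketch in Python. The names `Info` and `parse_entry` are hypothetical, and the splitting rule is an assumption: the term is taken to end at the first "." that is directly followed by a TOPICS field of the form MAIN:SUB#, and a "/" separates INFO records only when it is followed by such a field.

```python
import re
from dataclasses import dataclass


@dataclass
class Info:
    main: int        # TOPICS: main topic code
    sub: int         # TOPICS: subtopic code
    frequency: int   # FREQUENCY (not used)
    quality: int     # QUALITY, 0...9
    length: int      # LENGTH, number of words
    type_code: int   # TYPE, 0..3
    alone: int       # ALONE, 1 marks a strong indicator
    output: str      # OUTPUT, e.g. "$" or a phrase


# Assumption: every INFO record starts with "MAIN:SUB#".
INFO_START = r"\d+:\d+#"


def parse_entry(line: str):
    """Split 'TERM.INFO1/INFO2/...' into the term and a list of Info records."""
    term, infos_raw = re.split(r"\.(?=" + INFO_START + ")", line, maxsplit=1)
    infos = []
    for raw in re.split(r"/(?=" + INFO_START + ")", infos_raw):
        topics, freq, quality, length, type_code, alone, output = raw.split("#", 6)
        main, sub = topics.split(":")
        infos.append(Info(int(main), int(sub), int(freq), int(quality),
                          int(length), int(type_code), int(alone), output))
    return term, infos


term, infos = parse_entry(
    "a-d converter.13:0#1#1#3#0#1#a d converter/3:0#1#1#3#0#1#a d converter"
)
print(term, [(i.main, i.sub, i.quality, i.alone) for i in infos])
```

A term with several INFO records, as in this example, carries one record per topic it indicates.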