European Patent Office: European Machine Translation Programme. Wolfgang Täger, December 2006.

Page 1: Wolfgang Täger

European Patent Office

European Machine Translation Programme

Wolfgang Täger

December 2006

Page 2: Wolfgang Täger

The European Patent Office

Programme Partners and Goals

• Trigger: Success of JP-EN patent translation

• Agreement EPO - Member States

1. MT of patents/abstracts/communications to/from English

2. Three language pairs per year

3. First three languages: FR - DE - ES

• Candidates for next year: Swedish, Dutch, Italian, Romanian, Greek

Page 3: Wolfgang Täger

MT engine

Trial with SMT system (Language Weaver)

Call for tender: Winner Worldlingo (Systran)

Going public (esp@cenet): December 2006

Needed: improve translation quality with specific dictionaries

Page 4: Wolfgang Täger

Dictionary format

Desiderata

• open standard

• XML-Unicode

• support features of MT engines

• support conditional translations (e.g. based on IPC)

The format is not intended for terminology work (no definitions; the focus is lexical, not semantic).

OLIF format was chosen
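The desideratum of conditional translations based on IPC can be illustrated with a minimal Python sketch. It does not use the OLIF schema: the classes `ConditionalTranslation` and `DictionaryEntry`, the example entry and its IPC assignment are all hypothetical and serve only to show the idea of choosing a translation by patent class.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConditionalTranslation:
    """One target-language candidate, optionally restricted to IPC classes."""
    target: str
    ipc_prefixes: Tuple[str, ...] = ()  # empty tuple = default translation


@dataclass
class DictionaryEntry:
    """Hypothetical in-memory entry; not the OLIF data model."""
    source: str
    translations: List[ConditionalTranslation] = field(default_factory=list)

    def translate(self, ipc_class: str) -> Optional[str]:
        """Return the candidate whose IPC prefix matches, else the default."""
        default = None
        for cand in self.translations:
            if any(ipc_class.startswith(p) for p in cand.ipc_prefixes):
                return cand.target
            if not cand.ipc_prefixes and default is None:
                default = cand.target
        return default


# Illustrative DE->EN entry; the IPC assignment is an assumption.
entry = DictionaryEntry(
    source="Mutter",
    translations=[
        ConditionalTranslation("mother"),
        ConditionalTranslation("nut", ipc_prefixes=("F16B",)),
    ],
)

print(entry.translate("F16B 37/00"))  # nut
print(entry.translate("A61K 31/00"))  # mother
```

A prefix match on the IPC symbol keeps the condition coarse or fine as needed: a section letter, a subclass or a full group can all serve as prefixes.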

How to get dictionaries? By bilingual term extraction!

Page 5: Wolfgang Täger

Available corpora

560,000 EP-B publications => claims in EN, DE, FR

300,000 DE-T2 publications

37,000 ES-B3/T3 publications

=> Align corpora for term extraction, concordancing, translation memory (and SMT)

[Diagram: text parts available per publication type; CL = claims, DESC = description]

EP-B1: CL EN, CL FR, CL DE; DESC in EN or FR or DE

DE-T2: (CL DE); DESC DE

ES B3/T3 (LaTeX): CL ES; DESC ES

Page 7: Wolfgang Täger

Alignment & Extraction

Alignment: trial at the EPO with internally developed software

External companies did not improve on this result during the call for tender.

Page 8: Wolfgang Täger

Alignment & Extraction

Call for tender for bilingual term extraction

Winner: DFKI

1. Alignment of corpora, POS tagging, Identification of terms

2. Pairing of terms using clues like co-occurrence score, string similarity, grammatical clues, position, available dictionaries, ...

3. Providing further information like gender, inflection, transitivity, countable, ...
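Step 2 above can be sketched in Python by combining two of the listed clues, a co-occurrence score and string similarity. This is only an illustration under stated assumptions, not the DFKI method: the Dice coefficient, the `difflib` similarity ratio, the weights and the toy data are all choices made for this sketch.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import product


def dice_scores(aligned_terms):
    """Dice co-occurrence score for every (source, target) term pair.

    aligned_terms: list of (source_terms, target_terms), one per aligned segment.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in aligned_terms:
        src_set, tgt_set = set(src_terms), set(tgt_terms)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def string_similarity(a, b):
    """Surface similarity in [0, 1]; helps with cognates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_pairs(aligned_terms, w_cooc=0.7, w_string=0.3):
    """Rank candidate pairs by a weighted sum of both clues (weights assumed)."""
    scored = [
        (w_cooc * dice + w_string * string_similarity(s, t), s, t)
        for (s, t), dice in dice_scores(aligned_terms).items()
    ]
    return sorted(scored, reverse=True)


def best_translation(aligned_terms, source_term):
    """Highest-ranked target term for source_term, or None if unseen."""
    for _score, s, t in rank_pairs(aligned_terms):
        if s == source_term:
            return t
    return None


# Toy aligned segments with already identified terms (hypothetical data).
segments = [
    (["Ventil", "Kolben"], ["valve", "piston"]),
    (["Ventil", "Feder"], ["valve", "spring"]),
    (["Kolben", "Zylinder"], ["piston", "cylinder"]),
]

print(best_translation(segments, "Ventil"))  # valve
print(best_translation(segments, "Kolben"))  # piston
```

Further clues from the list (grammatical agreement, position, existing dictionaries) could enter the same weighted sum as additional terms.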

Page 9: Wolfgang Täger

Validation & Concordancing

Development of OLIF editor at EPO

• Remove noise

• Correct entries

• Use concordancer (provides statistics based on parallel corpora)

=> DEMO
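What a concordancer over parallel corpora provides can be sketched as follows. This is a hypothetical minimal version, not the EPO tool: the function `concordance`, the naive substring matching and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def concordance(aligned_segments, source_term, candidates):
    """Return aligned segments containing source_term, plus counts of
    which candidate translations appear on the target side."""
    hits, stats = [], Counter()
    needle = source_term.lower()
    for src, tgt in aligned_segments:
        if needle in src.lower():  # naive substring match, no lemmatisation
            hits.append((src, tgt))
            tgt_lower = tgt.lower()
            for cand in candidates:
                if cand.lower() in tgt_lower:
                    stats[cand] += 1
    return hits, stats


# Toy DE-EN parallel corpus (hypothetical data).
corpus = [
    ("Das Ventil ist geschlossen.", "The valve is closed."),
    ("Ein Ventil mit einer Feder.", "A valve with a spring."),
    ("Der Kolben bewegt sich.", "The piston moves."),
]

hits, stats = concordance(corpus, "Ventil", ["valve", "tap"])
print(len(hits), dict(stats))  # 2 {'valve': 2}
```

Counts of this kind let a validator see how often a proposed translation is actually attested before accepting or correcting a dictionary entry.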

Page 10: Wolfgang Täger

OLIF format

• Support of more languages

• Clarification of inflection scheme

• Clarification of term vs lex approach

• Tools

Page 11: Wolfgang Täger

Relational database ??

[Entity diagram. Boxes: Concept, Term, SurfForm, Lemma, InflForm, LexType, RegEx, Infl. Further labels: SemRelTransl, Naming]

Page 12: Wolfgang Täger

Relational database ??

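[Same entity diagram, filled in with an example]

Concept: „hot drink ...“

Term: grüner Tee

SurfForm: grüner

Lemma: grün

InflForm: Nom. Sg. str. f. pos.

LexType: DE, Adj

RegEx: -er

Infl: iLike „klein“

Further labels: SemRelTransl, Naming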
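One way to read the diagram is as a relational schema, sketched here with Python's built-in `sqlite3` module. The slides do not give the actual table design, so every table, column and foreign key below is an assumption; in particular, storing the "-er" pattern as a suffix and reading the Infl value as "inflects like klein" are interpretations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; names and relations are assumptions, not the EPO design.
conn.executescript("""
CREATE TABLE concept   (id INTEGER PRIMARY KEY, gloss TEXT);
CREATE TABLE lex_type  (id INTEGER PRIMARY KEY, language TEXT, pos TEXT);
CREATE TABLE lemma     (id INTEGER PRIMARY KEY, form TEXT,
                        lex_type_id INTEGER REFERENCES lex_type(id),
                        infl_like TEXT);
CREATE TABLE term      (id INTEGER PRIMARY KEY, text TEXT,
                        concept_id INTEGER REFERENCES concept(id));
CREATE TABLE surf_form (id INTEGER PRIMARY KEY, form TEXT,
                        term_id INTEGER REFERENCES term(id),
                        lemma_id INTEGER REFERENCES lemma(id),
                        infl_form TEXT, suffix TEXT);
""")

# Populate with the example from the slide.
conn.execute("INSERT INTO concept VALUES (1, 'hot drink ...')")
conn.execute("INSERT INTO lex_type VALUES (1, 'DE', 'Adj')")
conn.execute("INSERT INTO lemma VALUES (1, 'grün', 1, 'klein')")
conn.execute("INSERT INTO term VALUES (1, 'grüner Tee', 1)")
conn.execute(
    "INSERT INTO surf_form VALUES (1, 'grüner', 1, 1, "
    "'Nom. Sg. str. f. pos.', '-er')"
)

# Resolve a surface form to its term, lemma and lexical type.
row = conn.execute("""
    SELECT t.text, l.form, lt.language, lt.pos, s.infl_form
    FROM surf_form s
    JOIN term t      ON t.id = s.term_id
    JOIN lemma l     ON l.id = s.lemma_id
    JOIN lex_type lt ON lt.id = l.lex_type_id
    WHERE s.form = ?
""", ("grüner",)).fetchone()

print(row)  # ('grüner Tee', 'grün', 'DE', 'Adj', 'Nom. Sg. str. f. pos.')
```

Keeping lemma and surface form in separate tables lets several inflected forms share one lemma and one inflection pattern, which seems to be the point of the diagram.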

Page 13: Wolfgang Täger

End

Thank you!