
Statistical Models for Hierarchical Phrase-based Machine Translation

Von der Fakultät für Mathematik, Informatik und Naturwissenschaften der RWTH Aachen University zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigte Dissertation

vorgelegt von

Diplom-Informatiker Matthias Huck

aus Ludwigshafen am Rhein

Berichter: Universitätsprofessor Dr.-Ing. Hermann Ney
           Professor Dr. Alexander M. Fraser

Tag der mündlichen Prüfung: 1. August 2018

Diese Dissertation ist auf den Internetseiten der Universitätsbibliothek online verfügbar.


Acknowledgments

I would like to thank Prof. Dr.-Ing. Hermann Ney for supervising this thesis, for his kind support, and for giving me the opportunity to work in an excellent research environment as a member of his Human Language Technology and Pattern Recognition Group (Chair of Computer Science 6) at RWTH Aachen University.

I also thank Prof. Dr. Alexander M. Fraser for reviewing the thesis and for serving on my doctoral committee.

Many colleagues at RWTH Aachen University have given me encouragement and have influenced my work with fruitful discussions. I am particularly grateful to Daniel Stein and David Vilar, who initially triggered my interest in hierarchical phrase-based statistical machine translation, and who originally started the work on Jane, the RWTH Aachen University Statistical Machine Translation Toolkit. Daniel and David have both been exceptionally helpful and cooperative when I first joined the group and the machine translation research community.

My more experienced colleagues never hesitated to take the time to answer my questions and give their invaluable advice. I have learned a lot thanks to the guidance of Saša Hasan, Gregor Leusch, Arne Mauser, Maja Popović, and many others. Christian Plahl—though not an expert in machine translation himself, but a speech recognition researcher—always provided help with general issues, including system administration; as did the other sysadmins and the secretaries at the institute. Thanks a lot to all of you.

I have collaborated with many people at RWTH Aachen University over the years and want to thank all co-authors of publications for the productive collaboration. I have enjoyed working with you. I furthermore thank all collaborators from partner institutes, notably KIT and LIMSI.

During my time at RWTH Aachen University, I had the honor of supervising three highly gifted students: Jan-Thorsten Peter (Soft String-to-Dependency Hierarchical Machine Translation, Diplom, 2011), Erik Scharwächter (Discontinuous Phrases for Statistical Machine Translation, Bachelor, 2012), and Munkhzul Erdenedash (Pruning Strategies for Phrase-based Statistical Machine Translation, Bachelor, 2013). It has been a great pleasure to work with the three of you. I wish you the very best and hope to see you again someday.

Thanks for a good time together at work and/or after work to all those who have been mentioned already, and: Tamer Alkhouli, Arnaud Dagnelies, Minwei Feng, Jens Forster, Markus Freitag, Pavel Golik, Andreas Guta, Yannick Gweth, Stefan Hahn, Mahdi Hamdani, Carmen Heger, Georg Heigold, Björn Hoffmeister, Oscar Koller, Patrick Lehnen, Jonas Lööf, Saab Mansour, Evgeny Matusov, David Nolden, Malte Nuhn, Markus Nußbaum-Thom, Christian Oberdörfer, Stephan Peitz, Martin Ratajczak, David Rybach, Ralf Schlüter, Christoph Schmidt, Volker Steinbiß, Martin Sundermeyer, Muhammad Tahir, Zoltán Tüske, Jörn Wübker, Jia Xu, Yuqi Zhang, as well as everybody else who has been at RWTH’s Lehrstuhl für Informatik 6 at the time. Special thanks to Simon Wiesler and Daniel Stein for the close friendship.


My best regards to friends and colleagues at the Institute for Language, Cognition and Computation at the University of Edinburgh, where I was employed from 2013 to 2016, and at the Center for Information and Language Processing at LMU Munich, where I am employed now.

I am grateful to Fabienne Braune, Maja Popović, and David Vilar, who proofread my thesis.

Der größte Dank aber gilt meiner Familie, die mich stets unterstützt hat und für mich da ist, wenn ich sie brauche.

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement № 287658 (EU-BRIDGE). This work was partly achieved as part of the Quaero Programme, funded by OSEO, French State agency for innovation. This material is also partly based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract № HR0011-08-C-0110 (GALE) and Contract № HR0011-12-C-0015 (BOLT). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of any of the funding agencies DARPA, the European Commission, and OSEO.


Abstract

Machine translation systems automatically translate texts from one natural language to another. The dominant approach to machine translation has been phrase-based statistical machine translation for many years. In statistical machine translation, probabilistic models are learned from training data, and a decoder conducts a search to determine the best translation of an input sentence based on model scores. Phrase-based systems rely on elementary translation units that are continuous bilingual sequences of words, called phrases.

The hierarchical approach to statistical machine translation allows for phrases with gaps. Formally, the hierarchical phrase inventory can be represented as a synchronous context-free grammar that is induced from bilingual text, and hierarchical decoding can be carried out with a parsing-based procedure. The hierarchical phrase-based machine translation paradigm enables modeling of reorderings and long-distance dependencies in a consistent way. The typical statistical models that guide hierarchical search are fairly similar to those employed in conventional phrase-based translation.

In this work, novel extensions with statistical models for hierarchical phrase-based machine translation are developed, with a focus on methods that do not require any syntactic annotation of the data. Specifically, enhancements of hierarchical systems with extended lexicon models that take global source sentence context into account are investigated; various lexical smoothing variants are examined; reordering extensions and a phrase orientation model for hierarchical translation are introduced; word insertion and deletion models are presented; techniques for training of hierarchical translation systems with additional synthetic data are suggested; and a training method is proposed that utilizes additional synthetic data which is created via a pivot language. The beneficial impact of the extensions on translation quality is verified by means of empirical evaluation on various language pairs, including Arabic→English, Chinese→English, French→German, English→French, and German→French.


Kurzfassung

Maschinelle Übersetzungssysteme übersetzen Texte automatisch aus einer natürlichen Sprache in eine andere. Der dominierende Ansatz zur maschinellen Übersetzung war für viele Jahre die phrasenbasierte statistische maschinelle Übersetzung. In der statistischen maschinellen Übersetzung werden probabilistische Modelle aus Trainingsdaten gelernt, und ein Dekoder führt eine Suche durch, um basierend auf den Modellbewertungen die beste Übersetzung eines Eingabesatzes zu bestimmen. Phrasenbasierte Systeme stützen sich auf elementare Übersetzungseinheiten, die aus zusammenhängenden bilingualen Sequenzen von Wörtern bestehen, sogenannten Phrasen.

Der hierarchische Ansatz zur statistischen maschinellen Übersetzung erlaubt Phrasen mit Lücken. Formal kann das hierarchische Phraseninventar als eine synchrone kontextfreie Grammatik repräsentiert werden, die aus bilingualem Text induziert wird, und das hierarchische Dekodieren kann mit einer parsingbasierten Prozedur durchgeführt werden. Das Paradigma der hierarchischen phrasenbasierten maschinellen Übersetzung ermöglicht eine konsistente Art und Weise der Modellierung von Umordnungen und Abhängigkeiten über weite Distanzen. Die üblichen statistischen Modelle, die die hierarchische Suche leiten, sind recht ähnlich zu denjenigen, die in der konventionellen phrasenbasierten Übersetzung eingesetzt werden.

In der vorliegenden Arbeit werden neuartige Erweiterungen der hierarchischen phrasenbasierten maschinellen Übersetzung mit statistischen Modellen entwickelt, mit einem Hauptaugenmerk auf Methoden, für die keinerlei syntaktische Annotation der Daten erforderlich ist. Es werden im Einzelnen Verbesserungen hierarchischer Systeme mittels erweiterter lexikalischer Modelle erforscht, welche den gesamten Quellsatz als Kontext berücksichtigen. Es werden verschiedene Varianten der lexikalischen Glättung untersucht. Umordnungserweiterungen und ein Modell der Phrasenorientierung für die hierarchische Übersetzung werden eingeführt. Modelle der Worteinfügung und -löschung werden präsentiert. Techniken zum Training hierarchischer Übersetzungssysteme mit Hilfe zusätzlicher synthetischer Daten werden vorgestellt. Und eine Trainingsmethode wird vorgeschlagen, die zusätzliche synthetische Daten verwendet, welche ausgehend von einer Zwischensprache erzeugt wurden. Die Nützlichkeit der Erweiterungen zur Verbesserung der Übersetzungsqualität wird anhand empirischer Evaluation an mehreren Sprachpaaren verifiziert, darunter Arabisch→Englisch, Chinesisch→Englisch, Französisch→Deutsch, Englisch→Französisch und Deutsch→Französisch.


Contents

Introduction
    Statistical Machine Translation
        Training Corpora
        Preprocessing and Postprocessing
        Word Alignments
        Evaluation Metrics
        Optimization
        Public Evaluation Campaigns
    Jane: an Open Source Statistical Machine Translation Toolkit
        Other Statistical Machine Translation Toolkits
    Thesis Outline
    Prior Publications
    Individual Contributions vs. Team Work

Scientific Goals

I Fundamentals

1 Phrase Extraction
    1.1 Continuous Phrases
    1.2 Notational Conventions for SCFG Rules
    1.3 Hierarchical Phrase Inventory
        1.3.1 Deep Grammar
        1.3.2 Shallow Grammar

2 Baseline Models
    2.1 Terminology
    2.2 Decision Rule
    2.3 Feature Functions
        2.3.1 Phrase Translation Models
        2.3.2 Lexical Translation Models
        2.3.3 Phrase Penalty
        2.3.4 Word Penalty
        2.3.5 Phrase Length Ratios
        2.3.6 Glue Rule Indicator
        2.3.7 Hierarchical Indicator
        2.3.8 Paste Indicator
        2.3.9 Count Indicators
        2.3.10 Language Model

3 Hierarchical Search with Cube Pruning
    3.1 The Cube Pruning Algorithm
        3.1.1 k-best Generation Size
        3.1.2 Hypothesis Recombination
    3.2 Performance of Cube Pruning for Hierarchical Search
        3.2.1 Experimental Setup
        3.2.2 Experimental Results
        3.2.3 Discussion
        3.2.4 Summary
    3.3 Related Work

II Enhancing Hierarchical Translation with Additional Models

4 Extended Lexicon Models
    4.1 Motivation
    4.2 Related Work
    4.3 Triplet Lexicon
        4.3.1 Triplet Feature Functions with Global Source Context
    4.4 Discriminative Word Lexicon
        4.4.1 DWL Feature Function with Global Source Context
    4.5 Experiments
        4.5.1 Experimental Setup
            4.5.1.1 Hierarchical Systems
            4.5.1.2 Phrase-based Systems
            4.5.1.3 Extended Lexicon Model Training
        4.5.2 Experimental Results
    4.6 Summary

5 Lexical Smoothing Variants
    5.1 Motivation
    5.2 Related Work
    5.3 Lexicon Models
        5.3.1 Word Lexicon from Word-aligned Data
        5.3.2 IBM Model 1
        5.3.3 Scoring Variants
        5.3.4 Regularized IBM Model 1
        5.3.5 Triplet Lexicon
        5.3.6 Discriminative Word Lexicon
    5.4 Experiments
        5.4.1 Experimental Setup
        5.4.2 Experimental Results
    5.5 Discussion
    5.6 Summary

6 Reordering Extensions
    6.1 Motivation
    6.2 Related Work
    6.3 Reordering Rules
        6.3.1 Swap Rule
            6.3.1.1 Swap Rule for Deep Grammars
            6.3.1.2 Swap Rule for Shallow Grammars
        6.3.2 Jump Rules
            6.3.2.1 Jump Rules for Deep Grammars
            6.3.2.2 Jump Rules for Shallow Grammars
    6.4 Discriminative Lexicalized Reordering Model
    6.5 Experiments
        6.5.1 Experimental Setup
        6.5.2 Experimental Results
            6.5.2.1 Dropping Length Constraints
            6.5.2.2 Monotonic Concatenation Rule
            6.5.2.3 Distance-based Distortion Feature
            6.5.2.4 Discriminative Reordering for Reordering Rules Only
        6.5.3 Investigation of the Rule Usage
        6.5.4 Translation Examples
    6.6 Summary

7 Phrase Orientation Model
    7.1 Motivation
    7.2 Related Work
    7.3 Modeling Phrase Orientation for Hierarchical Translation
    7.4 Phrase Orientation Scoring in Hierarchical Decoding
        7.4.1 Determining Orientations
        7.4.2 Scoring Orientations
        7.4.3 Boundary Non-Terminals
    7.5 Experiments
        7.5.1 Experimental Setup
        7.5.2 Chinese→English Experimental Results
        7.5.3 French→German Experimental Results
    7.6 Summary

8 Insertion and Deletion Models
    8.1 Insertion Models
    8.2 Deletion Models
    8.3 Lexicon Models
        8.3.1 Word Lexicon from Word-aligned Data
        8.3.2 IBM Model 1
    8.4 Thresholding Methods
    8.5 Experiments
        8.5.1 Experimental Setup
        8.5.2 Experimental Results
    8.6 Summary

9 Hierarchical Translation for Large-scale English→French News and Talk Tasks
    9.1 WMT News Translation Task
        9.1.1 Overview
        9.1.2 Experimental Setup
        9.1.3 Experimental Results
        9.1.4 Comparison with Other Groups
    9.2 IWSLT TED Talk Translation Task
        9.2.1 Overview
        9.2.2 Experimental Setup
        9.2.3 Experimental Results

III Lightly-supervised Training

10 Lightly-supervised Training for Hierarchical Translation
    10.1 Synthetic Training Data
    10.2 Combining Phrase Tables for Lightly-supervised Training
    10.3 Experiments
        10.3.1 Baseline System
        10.3.2 Arabic–English Synthetic Data
        10.3.3 Phrase Tables
        10.3.4 Arabic→English Experimental Results
    10.4 Related Work
    10.5 Summary

11 Pivot Lightly-supervised Training
    11.1 Motivation
    11.2 Related Work
    11.3 Synthetic Training Data by Pivoting
    11.4 Parallel Resources
    11.5 Phrase-based Translation System
    11.6 Experiments
        11.6.1 Systems for Producing Synthetic Data
        11.6.2 Human-generated and Synthetic Training Corpora
        11.6.3 German→French Experimental Results in Phrase-based Translation
            11.6.3.1 Analysis
        11.6.4 German→French Experimental Results in Hierarchical Translation
    11.7 Summary

Scientific Achievements

Conclusions

List of Figures

List of Tables

Bibliography


Introduction

The purpose of a machine translation system is to automatically translate natural language data from a source language to a target language. The input data is typically a written text in a foreign language (say, French), and the system outputs a translated text in a language which is familiar to the end user or target audience (say, English). The research field of machine translation (MT) is a sub-topic of natural language processing, a broader scientific area which generally covers methods to computationally deal with natural language. The most successful approaches for tackling the complex problems in natural language processing are nowadays data-driven: The systems implement algorithms to learn statistical models from training data and—given some input—search for the best solution according to the learned models. Statistical systems continuously improve with the availability of increasing amounts of training data, as larger amounts of training data allow for learning of more reliable models.

Statistical Machine Translation

In the statistical approach to machine translation, the objective is to find the target language sentence which is the translation of a given source language sentence with maximum probability. We can formalize this problem as

$$f_1^J \rightarrow \hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \left\{ \Pr(e_1^I \mid f_1^J) \right\} \qquad (0.1)$$

A source sentence $f_1^J = f_1 \ldots f_j \ldots f_J$ of length $J$ (i.e., consisting of $J$ words) is given as input and has to be translated into a target sentence $e_1^I = e_1 \ldots e_i \ldots e_I$, where $I$ denotes the length of the target sentence. The system has to find the translation $e_1^I$ which maximizes the conditional probability. For that purpose, we need to

1. model the posterior probability distribution $\Pr(e_1^I \mid f_1^J)$ over all target language sentences $e_1^I$ given the source sentence $f_1^J$,

2. learn the model from training data, and

3. search the space of all possible translations for the best one according to the model (the top-scoring hypothesis).

While the posterior probability was decomposed via Bayes’ theorem in early work on statistical machine translation [Brown & Cocke+ 90], it is modeled in a log-linear framework [Och & Ney 02] in many state-of-the-art systems today:

$$\Pr(e_1^I \mid f_1^J) = p_{\lambda_1^M}(e_1^I \mid f_1^J) \qquad (0.2)$$

$$p_{\lambda_1^M}(e_1^I \mid f_1^J) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I)\right)}{\sum_{e'^{I'}_1} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e'^{I'}_1)\right)} \qquad (0.3)$$

The log-linear model allows for weighted combination of $M$ feature functions (or simply: features) $h_m(\cdot,\cdot)$, $1 \le m \le M$. Each feature $h_m(\cdot,\cdot)$ is weighted with a corresponding scaling factor $\lambda_m$.

As the denominator in Equation (0.3) depends on the given source sentence only, and the exponential function is monotone, we end up with a linear combination of individual features:

$$f_1^J \rightarrow \hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \left\{ \Pr(e_1^I \mid f_1^J) \right\} \qquad (0.4)$$

$$= \operatorname*{argmax}_{I,\, e_1^I} \left\{ \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I)\right)}{\sum_{e'^{I'}_1} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e'^{I'}_1)\right)} \right\} \qquad (0.5)$$

$$= \operatorname*{argmax}_{I,\, e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I) \right\} \qquad (0.6)$$

In the following, we will frequently refer to the individual features in the log-linear model as models. An overview of the baseline models used throughout this work is given in Chapter 2.
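The decision rule in Equation (0.6) is simply a weighted sum of feature scores followed by an argmax over candidate translations. The following Python sketch illustrates this mechanically; the two feature functions, the toy lexicon, and the explicitly enumerated candidate list are hypothetical placeholders rather than the actual models of Chapter 2.

    # Hypothetical feature functions h_m(f, e): a length ratio feature and a
    # toy lexical feature; real systems use the models described in Chapter 2.
    def length_ratio_feature(source, candidate):
        return -abs(len(candidate.split()) - len(source.split()))

    def toy_lexical_feature(source, candidate):
        # Placeholder: count source words whose (toy) lexicon translation
        # appears in the candidate.
        lexicon = {"maison": "house", "bleue": "blue"}
        return sum(1.0 for w in source.split() if lexicon.get(w) in candidate.split())

    FEATURES = [length_ratio_feature, toy_lexical_feature]
    LAMBDAS = [0.3, 1.0]  # scaling factors lambda_m, normally tuned (cf. Optimization)

    def model_score(source, candidate):
        # Linear combination sum_m lambda_m * h_m(f, e) from Equation (0.6).
        return sum(lam * h(source, candidate) for lam, h in zip(LAMBDAS, FEATURES))

    def decide(source, candidates):
        # argmax over the candidate translations.
        return max(candidates, key=lambda e: model_score(source, e))

    print(decide("maison bleue", ["blue house", "the blue house is big", "house"]))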

In order to facilitate learning of translation models from training data, sentences are broken up into smaller elementary units. The elementary units can be source–target pairs of words (as in early work on statistical machine translation [Brown & Della Pietra+ 93]), continuous bilingual sequences of words denoted as phrases (as in many state-of-the-art machine translation systems throughout the last decade [Zens & Och+ 02, Koehn & Och+ 03]), or phrases with gaps such as in the case of hierarchical rules [Chiang 05]. We will focus on the latter in this work. Details about hierarchical rules will be given in Chapter 1.

Modeling translation on smaller elementary units decreases sparsity and thus allows for more reliable probability estimates and better generalization capabilities. The feature scores $h_m(f_1^J, e_1^I)$ over sentence pairs from Equation (0.6) decompose into local scores over individual elementary units, or scores that take a limited context beyond them into account.¹ Phrase-based models can capture more local context within their elementary units than models based on single words. Hierarchical rules can model dependencies across longer distances within a sentence.

A decoder implements the argmax operation in Equation (0.6). Given input data in the foreign language, the decoder has to produce translations in the target language and search the space of translations for the model-best hypothesis, which it outputs. Conceptually, the decoder segments the input sentences into parts that allow it to apply the elementary units from the model. It then translates the parts separately by applying single word translation, phrase translation, or translation by means of hierarchical rules, respectively, depending on the paradigm. The full target sentence is composed from the parts.

Searching the full space of possible translations is not feasible in practice due to its large size.² Modern phrase-based decoders therefore constrain reordering [Zens & Ney+ 04b] to reduce computational complexity from exponential to polynomial, apply techniques such as recombination [Och & Tillmann+ 99, Och & Ney 04] for practical speed-ups, and perform dynamic programming beam search [Zens & Ney 08] to prune out partial hypotheses with low model scores early on while extending only the most promising candidates. Most hierarchical decoders employ the cube pruning algorithm [Chiang 07]. Hierarchical decoding with cube pruning will be presented in Chapter 3.

¹ n-gram language models are an example of features with context beyond the elementary units.
² [Knight 99] proves that the decoding problem with unrestricted reorderings is NP-complete.
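To illustrate the pruning and recombination ideas mentioned above (this is not the cube pruning algorithm of Chapter 3), the sketch below keeps only the best-scoring partial hypothesis per recombination state, here approximated by the last n-1 target words, and then applies histogram pruning; the Hyp structure, the state definition, and the beam size are assumptions for illustration.

    from collections import namedtuple

    Hyp = namedtuple("Hyp", ["words", "score"])

    def recombination_state(hyp, lm_order=3):
        # Hypotheses agreeing in the last n-1 target words (and, in a real
        # decoder, in coverage) can be recombined: only the better one survives.
        return tuple(hyp.words[-(lm_order - 1):])

    def prune_and_recombine(hyps, beam_size=100):
        best_per_state = {}
        for h in hyps:
            s = recombination_state(h)
            if s not in best_per_state or h.score > best_per_state[s].score:
                best_per_state[s] = h
        # Histogram pruning: keep only the beam_size best surviving hypotheses.
        return sorted(best_per_state.values(), key=lambda h: h.score, reverse=True)[:beam_size]

    hyps = [Hyp(("the", "house"), -2.1), Hyp(("a", "house"), -2.5), Hyp(("the", "house"), -1.9)]
    print(prune_and_recombine(hyps, beam_size=2))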

Training Corpora

Statistical machine translation crucially relies on parallel corpora. A parallel corpus is a collection of bitexts, i.e. content that is available in two different languages. The statistical learning algorithms require such available translations as training data.

Parallel corpora emerge at organizations with multilingual demands such as the United Nations, the European Union, the Canadian Parliament, the Legislative Council of the Hong Kong Special Administrative Region, or the World Intellectual Property Organization, where large volumes of content are recorded in written form and transferred into additional languages by professional translators. Companies often have internal documents such as technical documentation available in many languages. Parallel corpora may also be crawled from the web, for instance from multilingual news websites, or created via crowdsourcing with the help of amateur translators.

For the purpose of using multilingual data for training a machine translation system, the bilingual text is sentence-segmented on both sides and then sentence-aligned, for instance with methods based on sentence lengths and lexical co-occurrences [Gale & Church 93, Moore 02, Braune & Fraser 10]. After sentence alignment, the parallel corpus has a line-by-line correspondence of sentences in the one language with sentences in the other. As an example, Table 0.1 shows sentence-aligned French–English parallel data from a section of the European Parliament Proceedings Parallel Corpus, or briefly Europarl [Koehn 05].

Parallel corpora with millions of sentence pairs exist for a few language pairs such as Arabic–English, Chinese–English, or several European languages such as Czech, French, or German paired with English. Combinations of languages with no English involved are rarer, possibly not only due to English being the default language for international communication, but also due to a focus of research funding on English and only a handful of other languages in the past. At the other end of the spectrum, little or no parallel data exists for many language pairs, which makes building a statistical machine translation system for them difficult.

Besides parallel corpora, statistical machine translation systems can also benefit from the utilization of monolingual corpora. Monolingual data in the target language can be used for learning a language model. Monolingual data can be collected in larger volumes, and large amounts of training data in turn allow for more reliable model estimation.

Many publicly available corpora have been prepared and are distributed by research laboratories at universities. The corpora employed for the empirical work presented in this thesis can mostly either be obtained from the Linguistic Data Consortium (LDC)³ or have been released for public machine translation evaluation campaigns and can be downloaded from the respective websites.

Preprocessing and Postprocessing

Prior to training, preprocessing is applied to the data in order to harmonize it and to ease model learning and decoding. Preprocessing steps that are applied to the source side of the training corpus also need to be applied to any input data that will be decoded. Preprocessing steps on the target side of the training data, on the other hand, need to be reverted on the output of the decoder via postprocessing after decoding to produce the final translation. Preprocessing typically includes segmentation steps such as tokenization (punctuation marks are separated from words by whitespace, and conjoined tokens such as “it’s” are split into separate ones, e.g. “it ’s” or “it is”) on both source and target language side.

³ http://www.ldc.upenn.edu


Table 0.1: Example sentences from a French–English parallel corpus [Koehn 05].

French:  Reprise de la session
English: Resumption of the session

French:  Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vœux en espérant que vous avez passé de bonnes vacances.
English: I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.

French:  Comme vous avez pu le constater, le grand ”bogue de l’an 2000” ne s’est pas produit. En revanche, les citoyens d’un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
English: Although, as you will have seen, the dreaded ’millennium bug’ failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.

French:  Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
English: You have requested a debate on this subject in the course of the next few days, during this part-session.

French:  En attendant, je souhaiterais, comme un certain nombre de collègues me l’ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l’Union européenne qui ont été touchés.
English: In the meantime, I should like to observe a minute’s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.

French:  Je vous invite à vous lever pour cette minute de silence.
English: Please rise, then, for this minute’s silence.

French:  (Le Parlement, debout, observe une minute de silence)
English: (The House rose and observed a minute’s silence)

French:  Madame la Présidente, c’est une motion de procédure.
English: Madam President, on a point of order.

French:  Vous avez probablement appris par la presse et par la télévision que plusieurs attentats à la bombe et crimes ont été perpétrés au Sri Lanka.
English: You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka.

French:  L’une des personnes qui vient d’être assassinée au Sri Lanka est M. Kumar Ponnambalam, qui avait rendu visite au Parlement européen il y a quelques mois à peine.
English: One of the people assassinated very recently in Sri Lanka was Mr Kumar Ponnambalam, who had visited the European Parliament just a few months ago.

French:  Ne pensez-vous pas, Madame la Présidente, qu’il conviendrait d’écrire une lettre au président du Sri Lanka pour lui communiquer que le Parlement déplore les morts violentes, dont celle de M. Ponnambalam, et pour l’inviter instamment à faire tout ce qui est en son pouvoir pour chercher une réconciliation pacifique et mettre un terme à cette situation particulièrement difficile.
English: Would it be appropriate for you, Madam President, to write a letter to the Sri Lankan President expressing Parliament’s regret at his and the other violent deaths in Sri Lanka and urging her to do everything she possibly can to seek a peaceful reconciliation to a very difficult situation?

French:  Oui, Monsieur Evans, je pense qu’une initiative dans le sens que vous venez de suggérer serait tout à fait appropriée.
English: Yes, Mr Evans, I feel an initiative of the type you have just suggested would be entirely appropriate.

French:  Si l’Assemblée en est d’accord, je ferai comme M. Evans l’a suggéré.
English: If the House agrees, I shall do as Mr Evans has suggested.


Words at the beginning of sentences with upper-case first characters are frequent-cased by rewriting them in the casing that appears most frequently in the training corpus. Postprocessing applies desegmentation and recases sentence-initial words.
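A minimal sketch of the frequent-casing step, assuming whitespace-tokenized sentences; the counting and recasing below illustrate the idea and are not the exact implementation used for the experiments.

    from collections import Counter, defaultdict

    def learn_casing(corpus_sentences):
        # Count surface forms per lowercased word over the training corpus.
        counts = defaultdict(Counter)
        for sent in corpus_sentences:
            for tok in sent.split():
                counts[tok.lower()][tok] += 1
        # Most frequent casing per word.
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def frequent_case_sentence_initial(sentence, casing):
        toks = sentence.split()
        if toks:
            toks[0] = casing.get(toks[0].lower(), toks[0])
        return " ".join(toks)

    casing = learn_casing(["the house is big .", "The House agreed .", "the vote passed ."])
    print(frequent_case_sentence_initial("The house is big .", casing))  # -> "the house is big ."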

Additional processing can be necessary or at least helpful depending on the involved languages, for instance:

• Written Chinese does not indicate word boundaries. Chinese word segmentation would introduce them in preprocessing [Xu & Zens+ 04].

• Compound splitting is commonly applied to the source side when translating from a language with many compound lexemes such as German [Koehn & Knight 03, Popović & Stein+ 06, Fritzinger & Fraser 10], and sometimes to the target side when translating into a compounding language [Stymne & Cancedda+ 13, Cap & Fraser+ 14, Sennrich & Williams+ 15].

• Syntactic pre-reordering draws on linguistic knowledge to restructure source language sentences to look more similar to sentences in the target language, thus simplifying statistical translation by making cross-lingual dependencies more locally bound [Collins & Koehn+ 05, Popović & Ney 06, Wang & Collins+ 07].

• Deeper morphological analysis can be beneficial for morphological simplification or disambiguation, depending on the task [Habash & Rambow 05, Avramidis & Koehn 08, Mansour & Ney 12].

For the empirical work conducted as part of this thesis, we also replace any numerical quantities in the data by a single special category symbol. In decoding, rather than translating numbers in the input data, we simply carry them over to the target side. We may apply straightforward transformations on numbers, such as substituting a decimal comma (as used e.g. in German) by a decimal point (as used e.g. in English).
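A possible realization of this number handling; the category symbol $number, the regular expression, and the restoration step are illustrative assumptions, not the exact procedure used in the experiments.

    import re

    NUM_RE = re.compile(r"^\d+(?:[.,]\d+)*$")
    CATEGORY = "$number"  # special category symbol (assumed name)

    def categorize_numbers(tokens):
        # Replace numerical tokens by the category symbol; remember the
        # originals so that they can be carried over to the target side.
        originals, out = [], []
        for tok in tokens:
            if NUM_RE.match(tok):
                originals.append(tok)
                out.append(CATEGORY)
            else:
                out.append(tok)
        return out, originals

    def restore_numbers(tokens, originals, decimal_comma_to_point=True):
        it = iter(originals)
        restored = []
        for tok in tokens:
            if tok == CATEGORY:
                num = next(it)
                if decimal_comma_to_point:
                    num = num.replace(",", ".")  # e.g. German "3,14" -> English "3.14"
                restored.append(num)
            else:
                restored.append(tok)
        return restored

    toks, nums = categorize_numbers("der Preis beträgt 3,14 Euro".split())
    print(toks, restore_numbers(["the", "price", "is", CATEGORY, "euros"], nums))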

Word Alignments

A word alignment of a sentence pair is a set of position index pairs $A \subseteq \{1, \ldots, I\} \times \{1, \ldots, J\}$, with $I$ being the length of the target sentence and $J$ the length of the source sentence. The word alignment indicates translation correspondences of words on the target side and words on the source side. Figure 0.1 shows a visualization of a word alignment for one of the sentences from the French–English Europarl corpus (cf. Table 0.1) after preprocessing.

Word alignments can be obtained on parallel data with unsupervised learning. Most commonly, the Expectation-Maximization (EM) algorithm [Dempster & Laird+ 77] with maximum likelihood as the training criterion is employed to train a sequence of probabilistic generative models initially proposed by a group at IBM [Brown & Della Pietra+ 93] (hence denoted as IBM models). The original IBM sequence of models is typically extended with a Hidden Markov Model (HMM) [Vogel & Ney+ 96]. Since the IBM/HMM word alignment models only allow for one-to-many correspondences, training is conducted in both source-to-target and target-to-source direction, and the two resulting alignments are merged with a simple symmetrization heuristic in order to allow for many-to-many correspondences as well [Och & Ney 03, Koehn & Och+ 03].
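The sketch below shows a simplified grow-diag style symmetrization of the two alignment directions; it conveys the idea behind the refined heuristic of [Och & Ney 03] but is not that exact algorithm.

    def symmetrize(s2t, t2s):
        """s2t, t2s: sets of (i, j) links from the two training directions,
        with i a target position and j a source position."""
        intersection = s2t & t2s
        union = s2t | t2s
        alignment = set(intersection)
        neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
        added = True
        while added:
            added = False
            for (i, j) in sorted(alignment):
                for (di, dj) in neighbors:
                    cand = (i + di, j + dj)
                    if cand in union and cand not in alignment:
                        # Simplified growing condition: only add a link from the
                        # union if it covers a so-far unaligned word on some side.
                        if all(p[0] != cand[0] for p in alignment) or all(p[1] != cand[1] for p in alignment):
                            alignment.add(cand)
                            added = True
        return alignment

    s2t = {(1, 1), (2, 2), (3, 2)}
    t2s = {(1, 1), (2, 2), (2, 3)}
    print(sorted(symmetrize(s2t, t2s)))  # many-to-many links emerge from the merge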

Word alignment training is still a crucial first step in the training pipeline of modern phrase-based and hierarchical machine translation systems. In such systems, the decoder requires an inventory of phrases or hierarchical rules, called the phrase table or rule table. The inventory of phrases (or hierarchical rules) is extracted from the word-aligned parallel training data in the training phase of the system. The word alignments over the parallel data are necessary to determine the boundaries of valid phrases or hierarchical rules. A phrase extraction algorithm reads the word-aligned training data and adds bilingual units to the phrase inventory if they are seen in the training data and comply with a consistency criterion wrt. the word alignment.
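A minimal sketch of the consistency criterion for continuous phrase pairs (hierarchical rules add gaps on top of this, see Chapter 1): a source span and a target span form a valid phrase pair if no alignment link connects a word inside one of the spans to a word outside the other, and at least one link lies inside.

    def is_consistent(alignment, j_start, j_end, i_start, i_end):
        """alignment: set of (i, j) links, i target position, j source position.
        Spans are inclusive. Returns True if the spans form a consistent phrase pair."""
        has_link_inside = False
        for (i, j) in alignment:
            inside_src = j_start <= j <= j_end
            inside_tgt = i_start <= i <= i_end
            if inside_src != inside_tgt:
                return False  # a link crosses the phrase boundary
            if inside_src and inside_tgt:
                has_link_inside = True
        return has_link_inside  # require at least one alignment link inside

    # Example alignment for a 3-word source / 3-word target sentence pair.
    alignment = {(1, 1), (2, 3), (3, 2)}
    print(is_consistent(alignment, 2, 3, 2, 3))  # True: words 2-3 swap among themselves
    print(is_consistent(alignment, 1, 2, 1, 2))  # False: link (2, 3) leaves the source span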


Figure 0.1 (word alignment matrix, not reproduced here): A word-aligned French–English sentence pair, the tokenized and frequent-cased sentence “si l’ Assemblée en est d’ accord , je ferai comme M. Evans l’ a suggéré .” aligned to “if the House agrees , I shall do as Mr Evans has suggested .”. Both the French and the English side have been preprocessed with tokenization and with frequent-casing of the sentence-initial word.

Implementations for word alignment training with the IBM/HMM sequence of models have been released as open source software in the GIZA++ [Och & Ney 03] and MGIZA++ [Gao & Vogel 08] tools.⁴,⁵ For the experiments in this thesis, word alignments are created by aligning the data in both directions with GIZA++ and symmetrizing the two trained alignments following the refined symmetrization heuristic by [Och & Ney 03].

⁴ https://github.com/moses-smt/giza-pp/
⁵ https://github.com/moses-smt/mgiza/

Evaluation Metrics

Reliable human evaluation of translation quality is time consuming and expensive. It is therefore not suitable for system development purposes. Automatic evaluation metrics solve this problem by calculating a quality score by means of comparing the hypothesis translations with reference translations.

For Bleu, the most widely used automatic evaluation metric, matches of n-grams in the hypothesis with n-grams in the reference are counted to compute a modified n-gram precision. The geometric mean of the modified n-gram precisions of n-grams up to order 4 is multiplied by a brevity penalty factor to obtain the overall score. Bleu scores are in the interval from 0 to 1 according to the original definition [Papineni & Roukos+ 02]. However, it is common practice to report a percentage instead (i.e., to multiply the score of the original definition by 100).
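A compact sketch of the Bleu computation just described, for the single-reference case; real evaluations aggregate the counts over a whole test set and, for the NIST sets, use multiple references.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(hypotheses, references, max_n=4):
        match, total = [0] * max_n, [0] * max_n
        hyp_len = ref_len = 0
        for hyp, ref in zip(hypotheses, references):
            h, r = hyp.split(), ref.split()
            hyp_len += len(h)
            ref_len += len(r)
            for n in range(1, max_n + 1):
                h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
                # Modified precision: clip hypothesis counts by reference counts.
                match[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
                total[n - 1] += sum(h_ngrams.values())
        precisions = [m / t if t > 0 else 0.0 for m, t in zip(match, total)]
        if min(precisions) == 0.0:
            return 0.0
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
        brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
        return 100.0 * brevity_penalty * geo_mean  # reported as a percentage

    print(bleu(["the house is very big indeed"], ["the house is very big ."]))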

A vital property of any machine translation evaluation metric is good correlation with human judgment. Popular automatic metrics such as Bleu and Ter [Snover & Dorr+ 06] have proved to be reliable on different tasks for different languages [Macháček & Bojar 13, Macháček & Bojar 14, Stanojević & Kamran+ 15]. In this thesis, we primarily use Bleu to measure translation quality, but in most cases also report Ter.⁶

⁶ Note that Bleu is an accuracy measure (with higher values indicating better translation quality), whereas Ter is an error measure based on edit distance and block movements (with lower values indicating better translation quality).

Optimization

Just as an evaluation metric for machine translation should correlate well with human judgment of translation quality, we require model scores to correlate well with the objective function, since the decoder identifies good translations based on them. The log-linear model from Equation (0.6) allows for direct tuning of a statistical machine translation system towards an objective function such as Bleu that measures translation quality. An iterative optimization procedure can be used to find scaling factors $\lambda_m$, $1 \le m \le M$, that maximize the objective function. The scaling factors are first initialized with some default values. Next, a development set is decoded using a log-linear model with the initial scaling factors. Rather than just producing a single-best output, n-best lists are written by the decoder. The n-best list entries are scored according to the objective function. The optimizer then chooses new scaling factors that improve the correlation of the model score of the log-linear model with the objective function. The procedure of n-best list creation and choice of updated scaling factors is iterated until a maximum number of iterations or some stopping criterion is reached.
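The inner step of such n-best-based tuning can be sketched as follows; here a naive random search over the scaling factors stands in for MERT's exact line search, and the n-best lists, feature values, and toy objective are hypothetical placeholders.

    import random

    def rescore(nbest, lambdas):
        # nbest: list of (candidate, [h_1, ..., h_M]) entries written by the decoder.
        return max(nbest, key=lambda entry: sum(l * h for l, h in zip(lambdas, entry[1])))[0]

    def tune(nbest_lists, references, lambdas, objective, iterations=200):
        """Pick scaling factors that maximize the objective (e.g. Bleu) on fixed
        n-best lists; a crude stand-in for MERT's line search over lambdas."""
        best = lambdas
        best_score = objective([rescore(n, best) for n in nbest_lists], references)
        for _ in range(iterations):
            cand = [l + random.uniform(-0.5, 0.5) for l in best]
            score = objective([rescore(n, cand) for n in nbest_lists], references)
            if score > best_score:
                best, best_score = cand, score
        # In a full tuning run, the development set is then re-decoded with the
        # updated weights and the n-best lists are merged before the next iteration.
        return best

    nbest_lists = [[("the house", [0.1, -2.0]), ("a house", [0.3, -2.5])]]
    references = ["the house"]
    toy_objective = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs))
    print(tune(nbest_lists, references, [1.0, 1.0], toy_objective))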

We use Minimum Error Rate Training (MERT) [Och 03] with 100-best lists for optimization of model weights. Alternatives to MERT are MIRA [Cherry & Foster 12], PRO [Hopkins & May 11], and Lattice MERT [Macherey & Och+ 08].

Public Evaluation Campaigns

Public evaluation campaigns with machine translation tasks have driven progress in statistical machine translation in recent years. Major evaluation campaigns have been organized by the National Institute of Standards and Technology (NIST) as well as in conjunction with the Workshop on Statistical Machine Translation (WMT) and the International Workshop on Spoken Language Translation (IWSLT). The NIST Open Machine Translation Evaluation campaigns have mostly focused on Chinese→English and Arabic→English tasks. The WMT shared translation tasks typically cover European language pairs with a focus on the news domain. IWSLT provides a testbed for translation of speech and talks.

Public evaluation campaigns supply standardized test conditions for machine translation. NIST OpenMT, WMT, and IWSLT each release common development and test sets and define a set of training corpora, thus facilitating replicability and comparability of results. Empirical results in this thesis will be reported on NIST, WMT, and IWSLT test sets. Four reference translations are available for the NIST sets and are taken into account for automatic evaluation. The WMT and IWSLT sets have only one reference translation.

Jane: an Open Source Statistical Machine Translation Toolkit

The work on hierarchical machine translation presented in this thesis has been conducted with Jane, the RWTH Aachen University Statistical Machine Translation Toolkit, and the source code has been released.⁷ Jane [Vilar & Stein+ 10, Stein & Vilar+ 11, Vilar & Stein+ 12] is implemented in C++ and freely available for non-commercial use.

⁷ http://www.hltpr.rwth-aachen.de/jane/


At RWTH Aachen, we initially developed Jane as a toolkit for hierarchical machine translation [Vilar & Stein+ 10, Stein & Vilar+ 11, Vilar & Stein+ 12] with a few reordering extensions and syntactic enhancements [Stein & Peitz+ 10]. Later on, we added more features for hierarchical translation [Peter & Huck+ 11, Huck & Peter+ 12], before we went on to also implement phrase-based translation in Jane [Wuebker & Huck+ 12], along with forced alignment phrase model training [Wuebker & Mauser+ 10] and source-side discontinuous phrases [Galley & Manning 10, Huck & Scharwächter+ 13]. Jane now also includes tools for system combination [Freitag & Huck+ 14]. The basic usage of Jane is described in the user’s manual [Vilar & Stein+ 13].

Other Statistical Machine Translation Toolkits

Several other research groups have also released the source code of their statistical machine translation toolkits. In the following, we give a brief (and very likely incomplete) list of related publicly released statistical machine translation toolkits.

Moses [Koehn & Hoang+ 07] is a widely used open source toolkit for statistical machine translation. It was originally designed for phrase-based decoding, but now also supports the hierarchical paradigm [Hoang & Koehn+ 09] and syntax-based models [Williams & Koehn 12]. Moses provides tools for the complete machine translation pipeline, contains implementations for a wide variety of different models, and is well documented.

Joshua [Li & Callison-Burch+ 09a] is written in Java and implements the full pipeline for hierarchical machine translation. In addition to standard hierarchical rule tables, it is capable of extracting syntax augmented machine translation (SAMT) grammars [Zollmann & Venugopal 06].

cdec [Dyer & Lopez+ 10] is a flexible decoding framework with a unified representation for translation forests.

Phrasal [Cer & Galley+ 10] is an open source machine translation package with a Java implementation of the phrase-based machine translation paradigm. Phrasal is capable of extracting and translating with discontinuous phrases [Galley & Manning 10].

HiFST [de Gispert & Iglesias+ 10] is a lattice-based decoder for hierarchical phrase-based translation using OpenFST. An open source package has been released as the Cambridge Statistical Machine Translation system (UCAM-SMT).

Ncode [Crego & Yvon+ 11] implements the n-gram-based approach to machine translation [Mariño & Banchs+ 06]. Reordering is performed by creating a lattice in a preprocessing step, which is passed on to the monotone decoder.

Kriya [Sankaran & Razmara+ 12] is a toolkit for hierarchical phrase-based machine translation which is written in Python.

NiuTrans [Xiao & Zhu+ 12] is developed in C++ and supports phrase-based, hierarchical phrase-based, and syntax-based models.

Thesis Outline

This thesis presents work on novel models that extend the hierarchical approach to statistical machine translation [Chiang 05, Chiang 07].

In Part I of the thesis, we review the fundamentals of hierarchical phrase-based machine translation. We discuss phrase extraction for hierarchical translation systems (Chapter 1), present standard features that are used in the log-linear model of a baseline hierarchical system (Chapter 2), and describe hierarchical search with the cube pruning algorithm, including an in-depth performance study on Chinese→English and Arabic→English translation tasks (Chapter 3).


In Part II, we propose enhancements of hierarchical phrase-based translation with additional models and evaluate them on various translation tasks in order to empirically validate their effectiveness at improving translation quality.

We first investigate discriminative and trigger-based lexicon models with sentence-level context in hierarchical translation (Chapter 4). Previous research has shown that such extended lexicon models are beneficial in phrase-based translation. We demonstrate that comparable gains can be achieved when integrating extended lexicon models into a hierarchical decoder.

We next employ different types of lexicon models for smoothing of hierarchical phrase tables (Chapter 5), including IBM model 1, a regularized variant of IBM model 1, discriminative lexicon models, and the trigger-based triplet lexicon models. Several scoring variants are examined as well.

Default hierarchical systems incorporate neither explicit lexicalized reordering models nor additional mechanisms to perform reorderings. Both are important components of state-of-the-art phrase-based systems. We implement two reordering extensions for hierarchical phrase-based translation: additional non-lexicalized reordering rules and a discriminative lexicalized reordering model. These are described in Chapter 6.

A more widely used type of lexicalized reordering model for phrase-based systems scores monotone, swap, and discontinuous phrase orientations. We develop a phrase orientation model for hierarchical machine translation which resembles this idea (Chapter 7).

In Chapter 8, we address the problem of modeling word insertions and deletions. We propose a simple method based on thresholding of lexical probabilities.

In the final chapter of Part II, we present results that we have achieved with our Jane implementation of the hierarchical paradigm and its extensions under competitive conditions in WMT and IWSLT evaluation campaigns. The language pair for the experiments in Chapter 9 is English→French.

In Part III of the thesis, we apply lightly-supervised training—a form of self-training of statistical machine translation systems—to hierarchical machine translation. We outline effective techniques of lightly-supervised training for hierarchical phrase-based translation (Chapter 10) and propose a novel pivot lightly-supervised training approach (Chapter 11).

Prior Publications

Excerpts of the research conducted as part of this thesis have been presented at international conferences or workshops and have been published in the associated peer-reviewed proceedings already (either in print or electronically). Some of the findings and results which have been achieved during the course of this work have also been accepted for prior publication in scientific journals.

The major correspondences of parts of the thesis and pre-published papers are (see the bibliography for full references):

Chapter 3: [Huck & Vilar+ 13] “A Performance Study of Cube Pruning for Large-Scale Hierarchical Machine Translation” (published at NAACL-SSST 2013)

Chapter 4: [Huck & Ratajczak+ 10] “A Comparison of Various Types of Extended Lexicon Models for Statistical Machine Translation” (published at AMTA 2010)

Chapter 5: [Huck & Mansour+ 11] “Lexicon Models for Hierarchical Phrase-Based Machine Translation” (published at IWSLT 2011)


[Huck & Peter+ 12] “Hierarchical Phrase-Based Translation with Jane 2” (published in PBML 98, October 2012)
Chapter 6 [Huck & Peitz+ 12a] “Discriminative Reordering Extensions for Hierarchical Phrase-Based Machine Translation” (published at EAMT 2012)
[Huck & Peter+ 12] “Hierarchical Phrase-Based Translation with Jane 2” (published in PBML 98, October 2012)
Chapter 7 [Huck & Wuebker+ 13] “A Phrase Orientation Model for Hierarchical Machine Translation” (published at WMT 2013)
Chapter 8 [Huck & Ney 12a] “Insertion and Deletion Models for Statistical Machine Translation” (published at NAACL 2012)
[Huck & Peter+ 12] “Hierarchical Phrase-Based Translation with Jane 2” (published in PBML 98, October 2012)
Chapter 9 [Huck & Peitz+ 12b] “The RWTH Aachen Machine Translation System for WMT 2012” (published at WMT 2012)
[Wuebker & Huck+ 11] “The RWTH Aachen Machine Translation System for IWSLT 2011” (published at IWSLT 2011)
Chapter 10 [Huck & Vilar+ 11b] “Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation” (published at EMNLP 2011 Workshop on Unsupervised Learning in NLP)
Chapter 11 [Huck & Ney 12b] “Pivot Lightly-Supervised Training for Statistical Machine Translation” (published at AMTA 2012)

Overall, I have authored or co-authored the following publications (arranged by publication type), some of which contain pre-published content of this thesis, as stated above:

Basic research on aspects of phrase-based, hierarchical, syntax-based, and neural machine translation:

[Conforti & Huck+ 18] “Neural Morphological Tagging of Lemma Sequences for Machine Translation”
[Huck & Riess+ 17] “Target-side Word Segmentation Strategies for Neural Machine Translation”
[Huck & Tamchyna+ 17] “Producing Unseen Morphological Variants in Statistical Machine Translation”
[Huck & Birch+ 15b] “Mixed-Domain vs. Multi-Domain Statistical Machine Translation”
[Sennrich & Williams+ 15] “A tree does not make a well-formed sentence: Improving syntactic string-to-tree statistical machine translation with more linguistic knowledge”
[Huck & Hoang+ 14b] “Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation”


[Huck & Hoang+ 14a] “Augmenting String-to-Tree and Tree-to-String Translation with Non-Syntactic Phrases”
[Freitag & Feng+ 13] “Reverse Word Order Models”
[Huck & Wuebker+ 13] “A Phrase Orientation Model for Hierarchical Machine Translation”
[Huck & Vilar+ 13] “A Performance Study of Cube Pruning for Large-Scale Hierarchical Machine Translation”
[Huck & Scharwächter+ 13] “Source-Side Discontinuous Phrases for Machine Translation: A Comparative Study on Phrase Extraction and Search”
[Huck & Ney 12b] “Pivot Lightly-Supervised Training for Statistical Machine Translation”
[Huck & Ney 12a] “Insertion and Deletion Models for Statistical Machine Translation”
[Huck & Peitz+ 12a] “Discriminative Reordering Extensions for Hierarchical Phrase-Based Machine Translation”
[Huck & Mansour+ 11] “Lexicon Models for Hierarchical Phrase-Based Machine Translation”
[Peter & Huck+ 12] “Soft String-to-Dependency Hierarchical Machine Translation”
[Peter & Huck+ 11] “Soft String-to-Dependency Hierarchical Machine Translation”
[Huck & Vilar+ 11b] “Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation”
[Huck & Vilar+ 11a] “Advancements in Arabic-to-English Hierarchical Machine Translation”
[Huck & Ratajczak+ 10] “A Comparison of Various Types of Extended Lexicon Models for Statistical Machine Translation”

Jane – the RWTH Aachen University open source statistical machine translation toolkit:

[Freitag & Huck+ 14] “Jane: Open Source Machine Translation System Combination”
[Wuebker & Huck+ 12] “Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation”
[Huck & Peter+ 12] “Hierarchical Phrase-Based Translation with Jane 2”
[Vilar & Stein+ 12] “Jane: an advanced freely available hierarchical machine translation toolkit”
[Stein & Vilar+ 11] “A Guide to Jane, an Open Source Hierarchical Translation Toolkit”
[Vilar & Stein+ 10] “Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models”


System descriptions in the context of participations in machine translation evaluation campaigns:

[Huck & Braune+ 17] “LMU Munich’s Neural Machine Translation Systems for News Articles and Health Information Texts”
[Huck & Fraser+ 16] “The Edinburgh/LMU Hierarchical Machine Translation System for WMT 2016”
[Williams & Sennrich+ 16] “Edinburgh’s Statistical Machine Translation Systems for WMT16”
[Peter & Alkhouli+ 16] “The QT21/HimL Combined Machine Translation System”
[Huck & Birch 15a] “The Edinburgh Machine Translation Systems for IWSLT 2015”
[Haddow & Huck+ 15] “The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015”
[Williams & Sennrich+ 15] “Edinburgh’s Syntax-Based Systems at WMT 2015”
[Freitag & Wuebker+ 14] “Combined Spoken Language Translation”
[Birch & Huck+ 14] “Edinburgh SLT and MT System Description for the IWSLT 2014 Evaluation”
[Williams & Sennrich+ 14] “Edinburgh’s Syntax-Based Systems at WMT 2014”
[Freitag & Peitz+ 14] “EU-BRIDGE MT: Combined Machine Translation”
[Freitag & Peitz+ 13] “EU-BRIDGE MT: Text Translation of Talks in the EU-BRIDGE Project”
[Peitz & Mansour+ 13a] “Joint WMT 2013 Submission of the QUAERO Project”
[Peitz & Mansour+ 13b] “The RWTH Aachen Machine Translation System for WMT 2013”
[Peitz & Mansour+ 12] “The RWTH Aachen Speech Recognition and Machine Translation System for IWSLT 2012”
[Freitag & Peitz+ 12] “Joint WMT 2012 Submission of the QUAERO Project”
[Huck & Peitz+ 12b] “The RWTH Aachen Machine Translation System for WMT 2012”
[Wuebker & Huck+ 11] “The RWTH Aachen Machine Translation System for IWSLT 2011”
[Huck & Wuebker+ 11] “The RWTH Aachen Machine Translation System for WMT 2011”
[Heger & Wuebker+ 10a] “The RWTH Aachen Machine Translation System for WMT 2010”


Overview papers, published as a co-organizer of the EMNLP 2017 Second Conference on Machine Translation, the ACL 2016 First Conference on Machine Translation, and the EMNLP 2015 Tenth Workshop on Statistical Machine Translation:

[Bojar & Chatterjee+ 17] “Findings of the 2017 Conference on Machine Translation (WMT17)”
[Bojar & Chatterjee+ 16] “Findings of the 2016 Conference on Machine Translation (WMT16)”
[Bojar & Chatterjee+ 15] “Findings of the 2015 Workshop on Statistical Machine Translation”

Project presentations for research and innovation projects funded by the European Union:

[Haddow & Birch+ 17] “HimL: Health in my Language”
[Kordoni & van den Bosch+ 16] “Enhancing Access to Online Education: Quality Machine Translation of MOOC Content”

Individual Contributions vs. Team Work

The work presented in this thesis has been conducted while the author was embedded into a larger research group as a team member, collaborating with his colleagues. As mentioned above, parts of the thesis correspond with pre-published papers that have been authored jointly with collaborators. For the parts in question, a brief overview is given here that highlights specific contributions by my collaborators within the context of team efforts.

Chapter 3 [Huck & Vilar+ 13]. The core implementation of the cube pruning decoder for hierarchical translation in Jane was written by David Vilar. David Vilar and Markus Freitag also assisted with discussions on experimental design and with data preparation for the experiments reported in this chapter. The realization of the experimental study and the analysis of the results are my individual work.

Chapter 4 [Huck & Ratajczak+ 10]. The concept of a sparse version of the discriminative word lexicon model (DWL) is due to Martin Ratajczak, who also wrote the code for training such sparse DWL models. (The paper reports on a feature selection approach for DWL models, which is omitted in this thesis because it is work by Martin Ratajczak.) Patrick Lehnen helped with the conceptual design of the machine translation experiments in this chapter, and with important discussions of the empirical results. The practical realization of all machine translation experiments in this part is my individual work, except that the sparse DWL was trained by Martin Ratajczak. I individually implemented the support for triplet lexicon models and discriminative word lexicons in Jane.

Chapter 5 [Huck & Mansour+ 11]. Saab Mansour and Simon Wiesler have come up with the regularization for IBM Model 1, and have implemented it. Most of the code in Jane for different lexical scoring methods was contributed by me, including support for GIZA++-trained IBM models, noisy-or and Moses-style scoring, sentence-level scoring, phrase-level scoring with triplets and DWLs, and precomputation of the latter lexical scores. The Chinese→English experiments reported in the thesis have been conducted by me individually. Arabic→English experiments from the paper are Saab Mansour’s work and therefore omitted here.


Chapter 6 [Huck & Peitz+ 12a]. Stephan Peitz and Markus Freitag helped with the experimental design and with the analysis. Specifically, the rule usage statistics were mostly prepared from the log files by Stephan Peitz, who also selected the translation example for the figures. I am also very grateful to David Vilar for useful discussions on the work in this section. All machine translation experiments in this chapter are my individual work. Older code for training discriminative lexicalized reordering models [Zens & Ney 06] was ported to Jane, and the feature scoring in the hierarchical decoder was implemented, both by myself.

Chapter 7 [Huck & Wuebker+ 13]. The phrase orientation model for hierarchical translation extends an initial implementation of phrase orientation for plain phrase-based translation, written by Felix Rietig and Joern Wuebker. The generalization of the orientation model extraction to hierarchical phrases was conceptualized and implemented by myself individually, as was the feature scoring for Jane’s hierarchical decoder. Detailed discussions with Joern Wuebker have led to refinements. Joern has also contributed to a clear formulation of the model. All machine translation experiments in this chapter are my individual work.

Chapter 8 [Huck & Ney 12a]. All implementation and empirical work was done by myself.

Chapter 9 [Huck & Peitz+ 12b, Wuebker & Huck+ 11]. Many people have contributed to RWTH’s WMT and IWSLT evaluation submissions. The thesis chapter reports only on the machine translation engines developed by myself. Omitted in the thesis is content from the papers that describes work on other tasks or language pairs which I have not been responsible for individually.

Chapter 10 [Huck & Vilar+ 11b]. David Vilar and Daniel Stein have helped with the conceptual design and scripting for the work presented in this chapter. All other work was done by myself individually, including all practical experiments to produce the empirical results.

Chapter 11 [Huck & Ney 12b]. The work in this chapter was accomplished individually by myself.


Scientific Goals

In this thesis, we will investigate techniques to improve machine translation quality with hierarchical phrase-based systems by means of extensions with novel statistical models. We will develop enhancements that can be employed on top of the known methods and will empirically evaluate the enhanced systems against the baseline.

Performance of hierarchical search with cube pruning. First, we will establish a state-of-the-art baseline hierarchical machine translation system for our work (Part I). We will employ a standard set of features and use the cube pruning algorithm for hierarchical search. Certain key parameters of the baseline setup can have a critical impact on translation quality as well as on computational efficiency, i.e. translation speed and memory consumption. Efficient decoding is important for practical applications of machine translation technology. We will conduct a performance study of hierarchical search with cube pruning and contribute a detailed empirical analysis of the effect of different parameter settings in relation to translation quality and resource requirements (Chapter 3).

Extended lexicon models. We will then propose enhancements with additional models (Part II), the first ones being extended lexicon models. Discriminative word lexicons and the trigger-based triplet lexicon models have been shown to improve translation quality when added to conventional phrase-based systems in previous work. Global source sentence context can be taken into account using these models. We will integrate extended lexicon models into hierarchical decoding in order to promote a better lexical choice (Chapter 4).

Lexical smoothing variants. The basic phrase translation probabilities of phrases that are rarely seen in the training data are typically overestimated. Lexical smoothing is a commonly adopted method to counteract this effect, also in hierarchical phrase-based translation. A widely used approach to obtain a lexicon model for smoothing is estimation of lexical translation probabilities from word-aligned data with symmetrized word alignments. Phrases are then scored with the lexicon model based on one of several different scoring techniques that have been presented in the literature. Comparative studies of lexical smoothing methods are however rare in the literature and typically cover standard phrase-based translation, not hierarchical translation. We will fill this gap by evaluating multiple different lexicon models and scoring techniques for lexical smoothing in hierarchical translation. We will implement lexical smoothing with triplet lexicon models and discriminative word lexicons and explore whether the extended lexicon models are suitable for this purpose (Chapter 5).

Reordering extensions. The hierarchical phrase-based model provides an integrated reordering mechanism through non-terminals that are linked on the source side and on the target side of hierarchical phrases. Hierarchical decoders usually neither utilize any explicit lexicalized reordering models, nor do they incorporate any mechanisms to perform reorderings that do not result from the application of hierarchical phrases. We argue that hierarchical systems can benefit from designated reordering extensions. We will augment hierarchical phrase-based translation with a discriminative lexicalized reordering model that is inspired by previous work for standard phrase-based translation and complement it with additional non-lexicalized reordering rules (Chapter 6).

Phrase orientation model. The most pervasive type of lexicalized reordering model in standard phrase-based translation is a phrase orientation model that estimates the probabilities of monotone, swap, and discontinuous phrase orientation classes and assesses the adequacy of phrase reordering during search. The orientation probabilities are conditioned on phrases. We will show that the notion of phrase orientation can be generalized to hierarchical phrases. We will model phrase orientation for hierarchical translation and develop the scoring procedures that are required to apply the orientation model in hierarchical decoding (Chapter 7).

Insertion and deletion models. Statistical translation systems are prone to omission of content words in the translation. Content words should not be dropped, whereas sometimes other types of words that do not exist in the source need to be added in the translation. We will investigate features that specifically score word insertions and deletions in hierarchical decoding. A target language word is considered inserted or deleted based on lexical probabilities with the words on the foreign language side of the phrase (Chapter 8).

Hierarchical translation for large-scale English→French news and talk tasks. We will build state-of-the-art hierarchical machine translation systems for the translation of news texts and talk transcripts from English into French (Chapter 9).

Lightly-supervised training. In lightly-supervised training for machine translation (Part III), synthetic parallel data is produced by automatically translating monolingual corpora. The synthetic parallel data is used as additional training data. Lightly-supervised training has been successfully applied to standard phrase-based translation before. We will propose effective approaches for lightly-supervised training of hierarchical phrase-based machine translation systems (Chapter 10).

Pivot lightly-supervised training. We will introduce a novel pivot lightly-supervised training approach (Chapter 11). Pivot lightly-supervised training combines lightly-supervised training with ideas from pivot translation. Synthetic parallel data between source and target is produced by translating from a third “pivot” language. This is possible whenever a source–pivot or target–pivot parallel corpus is available. Pivot lightly-supervised training carries information which is present in such additional resources over to the translation system.


Part I

Fundamentals


1. Phrase Extraction

As distinguished from standard phrase-based translation, hierarchical translation [Chiang 05, Chiang 07] does not only allow for continuous bilingual sequences of words as its elementary translation units, but also makes use of phrases with gaps.

The phrase inventory is extracted from the word-aligned parallel training data. For hierarchical phrase extraction, continuous bilingual sequences of words are considered valid phrases based on the very same consistency criterion as in standard phrase-based translation. We will specify the consistency criterion for continuous phrases in Section 1.1. In addition to continuous lexical phrases as in standard phrase-based translation, hierarchical phrases with usually up to two gaps are extracted from the data. A hierarchical phrase can be extracted from a training sentence pair whenever a larger valid phrase in the training instance fully contains one or more smaller valid phrases. A smaller valid phrase is cut out of the larger phrase and replaced by a placeholder symbol that marks the gap. We will give a formal description of this process in Section 1.3.

For hierarchical phrase-based translation, the phrase inventory is represented as a synchronous context-free grammar (SCFG) that is induced from bilingual text. Placeholder symbols that mark gaps are non-terminals and words are terminals in the grammar. Phrases are SCFG rules, which also need a left-hand side non-terminal that we add. We will have to define a few notational conventions (Section 1.2), but will not recapitulate any basic concepts of formal language theory.1

Hierarchical decoding is then typically carried out with a parsing-based procedure. The input sentence can be parsed using the source sides of the SCFG rules, and target-side parses can be constructed from source-side parses.

By using non-terminals in a specific way to define an SCFG for hierarchical translation, the parsing-based decoding can be restricted to prohibit recursive applications of hierarchical phrases [de Gispert & Iglesias+ 10], resulting in a better practical efficiency of the overall decoding procedure. Modifications of the decoder are not necessary. We refer to this kind of grammar and the parses produced with it as shallow, as opposed to the standard grammar and parses which we denote as deep. In Section 1.3.1, we describe the deep grammars, which are usually employed by default in most work presented in the machine translation literature. We describe shallow grammars in Section 1.3.2.

1.1 Continuous Phrases

In the standard phrase-based approach, only continuous phrases are extracted [Och & Tillmann+ 99, Och 02, Zens & Och+ 02]. The set of continuous bilingual phrases BP(f_1^J, e_1^I, A), given a training instance consisting of a source sentence f_1^J, a target sentence e_1^I, and a word alignment A ⊆ {1, ..., I} × {1, ..., J}, is defined as follows:

$$
\mathrm{BP}(f_1^J, e_1^I, A) = \Big\{ \langle f_{j_1}^{j_2}, e_{i_1}^{i_2} \rangle \;\Big|\; \exists (i,j) \in A : i_1 \le i \le i_2 \wedge j_1 \le j \le j_2 \;\wedge\; \forall (i,j) \in A : i_1 \le i \le i_2 \Leftrightarrow j_1 \le j \le j_2 \Big\} \qquad (1.1)
$$

f_{j_1}^{j_2} is the sequence of source words from source position j_1 to source position j_2 in the sentence f_1^J, and e_{i_1}^{i_2} is the sequence of target words from target position i_1 to target position i_2 in the sentence e_1^I. Consistency for continuous phrases is based upon two constraints in this definition: (1.) At least one source and target position within the phrase must be aligned, and (2.) words from inside the source phrase may only be aligned to words from inside the target phrase and vice versa.2

Figure 1.1 shows the French–English sentence pair that we already used to illustrate word alignments in the introduction (Figure 0.1), but this time with two valid phrases highlighted, the smaller one being ⟨M. Evans, Mr Evans⟩ and the larger one being ⟨comme M. Evans l’ a suggéré, as Mr Evans has suggested⟩. Both fulfill the consistency criterion with respect to the two aspects: a position within each phrase is aligned, and no positions in the intervals covered by each phrase are aligned outside of the phrase. The shaded areas indicate the areas that are not allowed to contain any word alignments for the phrases to be valid wrt. the second part of the consistency criterion.

[Figure 1.1: Phrases highlighted in a word-aligned French–English sentence pair. (alignment matrix not reproduced)]

1 A concise review of SCFGs, hypergraphs, and parsing within the context of hierarchical machine translation can be found in [Vilar 11]. [Hopcroft & Motwani+ 01] provide a general introduction to formal language theory.
2 An extension that enables discontinuous phrases for phrase-based translation has been proposed [Galley & Manning 10, Huck & Scharwächter+ 13] but is not in common use.
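The consistency criterion of Equation (1.1) translates almost directly into code. The following is a minimal sketch (not the extraction code used in this thesis, without the usual handling of unaligned boundary words, and with hypothetical names), enumerating consistent continuous phrase pairs for one word-aligned sentence pair:

```python
# Minimal sketch of consistent continuous phrase extraction (cf. Equation (1.1)).
# 'alignment' is a set of (i, j) pairs: target position i aligned to source position j (0-based).

def extract_continuous_phrases(src, tgt, alignment, max_len=10):
    phrases = []
    for j1 in range(len(src)):
        for j2 in range(j1, min(j1 + max_len, len(src))):
            # target positions aligned to some source position inside [j1, j2]
            aligned_tgt = [i for (i, j) in alignment if j1 <= j <= j2]
            if not aligned_tgt:
                continue  # constraint (1.): at least one alignment point inside the phrase
            i1, i2 = min(aligned_tgt), max(aligned_tgt)
            # constraint (2.): no target position in [i1, i2] may be aligned outside [j1, j2]
            if all(j1 <= j <= j2 for (i, j) in alignment if i1 <= i <= i2):
                phrases.append((tuple(src[j1:j2 + 1]), tuple(tgt[i1:i2 + 1])))
    return phrases
```

Applied to the sentence pair of Figure 1.1 with its word alignment, such a routine would yield, among others, the two highlighted phrases ⟨M. Evans, Mr Evans⟩ and ⟨comme M. Evans l’ a suggéré, as Mr Evans has suggested⟩.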


1.2 Notational Conventions for SCFG Rules

In hierarchical grammars, we deal with SCFG rules X → ⟨α, β, ∼⟩ where α ∈ (N ∪ V_F)^+ and β ∈ (N ∪ V_E)^+. V_F denotes the source vocabulary and V_E the target vocabulary. N is a non-terminal set which is shared by source and target, and the left-hand side of the rule is a non-terminal symbol X ∈ N, common to source and target.

The non-terminals on the source side and on the target side of hierarchical rules are linked in a one-to-one correspondence. The ∼ relation defines this one-to-one correspondence between the non-terminals within the source part α and the non-terminals within the target part β.

Let J_α denote the number of terminal symbols in α and I_β the number of terminal symbols in β. Indexing α with j, i.e. the symbol α_j, 1 ≤ j ≤ J_α, denotes the j-th terminal symbol on the source side α of the phrase pair, and analogously with β_i, 1 ≤ i ≤ I_β, on the target side β.
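For concreteness, a rule X → ⟨α, β, ∼⟩ can be represented in code roughly as follows. This is only an illustrative sketch; the names do not correspond to the actual Jane data structures:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SCFGRule:
    lhs: str                              # left-hand side non-terminal, e.g. "X"
    alpha: Tuple[str, ...]                # source side: terminals and non-terminals
    beta: Tuple[str, ...]                 # target side: terminals and non-terminals
    links: Tuple[Tuple[int, int], ...]    # the ~ relation as (source index, target index) pairs

# Example: X → ⟨comme X~0 l' a suggéré, as X~0 has suggested⟩
rule = SCFGRule("X",
                ("comme", "X", "l'", "a", "suggéré"),
                ("as", "X", "has", "suggested"),
                links=((1, 1),))
```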

1.3 Hierarchical Phrase Inventory

We follow [Vilar 11] in formally describing the hierarchical phrase inventory by means of a recursive definition. We first define a set H_0(f_1^J, e_1^I, A) of continuous lexical phrases analogous to the set BP(f_1^J, e_1^I, A) from Equation (1.1), but with phrases formulated as SCFG rules for the hierarchical grammar, with left-hand side non-terminals:

$$
H_0(f_1^J, e_1^I, A) = \Big\{ X \rightarrow \langle f_{j_1}^{j_2}, e_{i_1}^{i_2} \rangle \;\Big|\; \exists (i,j) \in A : i_1 \le i \le i_2 \wedge j_1 \le j \le j_2 \;\wedge\; \forall (i,j) \in A : i_1 \le i \le i_2 \Leftrightarrow j_1 \le j \le j_2 \Big\} \qquad (1.2)
$$

The phrases in this set do not have gaps, i.e. there are no right-hand side non-terminals in any of the SCFG rules in H_0(f_1^J, e_1^I, A).

Sets H_n(f_1^J, e_1^I, A) containing hierarchical phrases with n > 0 gaps are defined recursively:

$$
H_n(f_1^J, e_1^I, A) = \Big\{ X \rightarrow \langle \alpha X^{\sim n-1} \gamma,\; \beta X^{\sim n-1} \delta \rangle \;\Big|\; \alpha, \gamma \in (N \cup V_F)^\star \wedge \beta, \delta \in (N \cup V_E)^\star \;\wedge\; \exists j_1, j_2, i_1, i_2 : \big( X \rightarrow \langle \alpha f_{j_1}^{j_2} \gamma,\; \beta e_{i_1}^{i_2} \delta \rangle \in H_{n-1}(f_1^J, e_1^I, A) \;\wedge\; X \rightarrow \langle f_{j_1}^{j_2}, e_{i_1}^{i_2} \rangle \in H_0(f_1^J, e_1^I, A) \big) \Big\} \qquad (1.3)
$$

In the notation above, we are indicating the ∼ one-to-one relation of non-terminals on source and target side as a numerical index in a superscript directly with each non-terminal (with indices starting at 0). Non-terminals on the source side and on the target side are linked if they have the same index. The general definition in Equation (1.3) also permits SCFG rules without terminal symbols. In practice, however, we apply an additional restriction that requires extracted rules to have a terminal symbol on both source and target side.

The complete set H(f_1^J, e_1^I, A) of hierarchical phrases extractable from (f_1^J, e_1^I, A) with up to the maximum number of N gaps (usually N = 2) is the union of all sets H_n(f_1^J, e_1^I, A) with 0 ≤ n ≤ N:

$$
H(f_1^J, e_1^I, A) = \bigcup_{n=0}^{N} H_n(f_1^J, e_1^I, A) \qquad (1.4)
$$

Finally, the overall phrase inventory H obtained from the full parallel training corpus is the union of the complete sets extracted from the individual sentences in the corpus.

Going back to the example in Figure 1.1, we note that the SCFG rules X → ⟨M. Evans, Mr Evans⟩ and X → ⟨comme M. Evans l’ a suggéré, as Mr Evans has suggested⟩ are both in H_0(f_1^J, e_1^I, A), and that the shorter phrase can be cut out of the longer phrase to produce an element of H_1(f_1^J, e_1^I, A) in accordance with Equation (1.3). The hierarchical phrase with one gap is X → ⟨comme X∼0 l’ a suggéré, as X∼0 has suggested⟩. An advantage of hierarchical phrases like this one is that within elementary translation units they can capture distant dependencies between words that appear together in training sentences, while at the same time providing better generalization capabilities than standard continuous phrases by not being restricted to a specific context in the gap between the distant words. Another aspect is that during decoding, via the ∼ relation, a hierarchical translation rule implicitly specifies the placement of word sequences substituting one of its right-hand side non-terminals in a partial hypothesis. The hierarchical phrase-based approach hence already includes an integrated reordering mechanism.

Some heuristics that vary the phrase extraction procedure for specific cases—in particular regarding situations with unaligned words—are discussed by [Stein & Vilar+ 11].

For experiments reported in this thesis, we apply several restrictions when extracting phrases, in particular a maximum length of ten on source and target side for lexical phrases, a length limit of five on source and ten on target side (including non-terminal symbols) for hierarchical phrases with gaps, and no more than two gaps per phrase. Right-hand side non-terminals are not allowed to be adjacent on the source side. Hierarchical phrases with gaps which are not observed more than once in the training data are discarded.
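To make the recursive definition of Equation (1.3) more tangible, the following sketch (simplified and hypothetical, not the Jane extraction code; restricted to a single gap for brevity) creates one-gap rules from the spans of consistent continuous phrases by cutting a smaller phrase out of a larger one, keeping at least one terminal on both sides:

```python
def cut_one_gap_rules(src, tgt, spans):
    """src, tgt: lists of words; spans: list of (j1, j2, i1, i2) source/target spans
    of consistent continuous phrases of the same training sentence pair."""
    rules = set()
    for (J1, J2, I1, I2) in spans:            # larger phrase
        for (j1, j2, i1, i2) in spans:        # smaller phrase to be cut out
            inside = J1 <= j1 and j2 <= J2 and I1 <= i1 and i2 <= I2
            if not inside or (j1, j2, i1, i2) == (J1, J2, I1, I2):
                continue
            src_side = tuple(src[J1:j1] + ["X~0"] + src[j2 + 1:J2 + 1])
            tgt_side = tuple(tgt[I1:i1] + ["X~0"] + tgt[i2 + 1:I2 + 1])
            # require a terminal symbol on both source and target side (Section 1.3)
            if len(src_side) > 1 and len(tgt_side) > 1:
                rules.add(("X", src_side, tgt_side))
    return rules
```

Rules with two gaps would be obtained analogously by cutting a second sub-phrase out of the one-gap rules, subject to the length and adjacency restrictions listed above.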

1.3.1 Deep Grammar

The non-terminal set of a standard hierarchical grammar is made up of two symbols: an initial symbol S and one generic non-terminal symbol X. The generic non-terminal X is used as a placeholder for the gaps within the right-hand side of hierarchical translation rules as well as on all left-hand sides of the translation rules that are extracted from the training corpus. In addition to the extracted rules, a non-lexicalized initial rule

S → ⟨X∼0, X∼0⟩ (1.5)

is incorporated into the grammar, as well as a special glue rule

S → ⟨S∼0X∼1, S∼0X∼1⟩ (1.6)

that the system can use for serial concatenation of phrases as in monotonic phrase-based translation. The initial symbol S is the start non-terminal symbol of the grammar. We denote standard hierarchical grammars as deep grammars here.

1.3.2 Shallow Grammar

[Iglesias & de Gispert+ 09a] and in a later journal publication [de Gispert & Iglesias+ 10] present a way to limit the recursion depth for hierarchical rules by means of a modification to the hierarchical grammar, referred to as shallow-n grammar. The main benefit of the limitation is a gain in decoding efficiency. Moreover, the modification of the grammar to a shallow version restricts the search space of the decoder and may be convenient to prevent overgeneration. In this thesis, we will investigate hierarchical systems with deep grammars and systems with shallow-1 grammars, i.e. grammars which limit the depth of the hierarchical recursion to one.

In a shallow-1 grammar, the generic non-terminal X of the standard deep grammar is replaced by two distinct non-terminals XH and XP. By changing the left-hand sides of the rules, lexical phrases are allowed to be derived from XP only, hierarchical phrases from XH only. On all right-hand sides of hierarchical rules, the X is replaced by XP. Gaps within hierarchical phrases can thus be filled with continuous lexical phrases only, not with hierarchical phrases that have gaps. The initial rule is substituted with

S → ⟨XP∼0, XP∼0⟩
S → ⟨XH∼0, XH∼0⟩ ,   (1.7)

and the glue rule is substituted with

S → ⟨S∼0XP∼1, S∼0XP∼1⟩
S → ⟨S∼0XH∼1, S∼0XH∼1⟩ .   (1.8)


2. Baseline Models

We now present standard features that are employed in the log-linear model for our hierarchical baseline setups. Most of the features listed in the current chapter are in common use in typical systems. Later, in the core research parts of this thesis, we will propose novel models that we have developed to augment this baseline feature set (Part II), and techniques for advanced model learning with lightly-supervised training (Part III). To examine their impact on translation quality, the enhancements will be evaluated against plain baselines.

2.1 Terminology

During decoding, the log-linear model score informs the search algorithm about the quality of different candidate translations. The aim of the search algorithm is to identify the model-best hypothesis. We introduce some terminology to help us define how hypotheses are scored with the different features.

A derivation is a sequence of context-free grammar rule applications [Hopcroft & Motwani+ 01]. In this thesis, we will typically use the term to denote sequences of SCFG rule applications that transform the start non-terminal of the grammar into a string of terminal symbols. For sequences of rule applications that do not begin with the start non-terminal or end with a string of terminal symbols, we use the terms partial derivation or sub-derivation. The string of terminal symbols is called the yield of the derivation. For an SCFG derivation d we distinguish the source yield σ(d) and the target yield τ(d). The source is given to the decoder, but parsing can produce many different source trees over it (comparable to phrase segmentation in phrase-based decoding). The SCFG may contain many rules with the same source side but different target sides. This implies that a single source-side parse can correspond with many different synchronous derivations that induce different target yields. Model scores are calculated for each (partial) derivation during decoding. They are ranked and the most promising hypotheses are expanded to construct the search space. With R(d) we denote the set of rule applications in a derivation d. Since many features can be decomposed to the level of individual rule applications, we generally use the symbol t(·) to define phrase scoring functions, with some subscript to signify the feature. For other notation in the subsequent sections, please recall the conventions from Section 1.2; in particular, recall that r = X → ⟨α, β, ∼⟩.

2.2 Decision Rule

An exact computation of Equation (0.6) in decoding would require the summation over all derivations with the same yield. In practice, this requirement is eliminated by a maximum approximation, resulting in a decision rule that selects as the best translation the target yield τ(d) of the derivation d with maximum model score. Formally, the decision rule for hierarchical search is

$$
f_1^J \rightarrow e_1^I(f_1^J) = \operatorname*{argmax}_{I, e_1^I} \; \max_{\substack{d : \sigma(d) = f_1^J \\ \tau(d) = e_1^I}} \left\{ \sum_{m=1}^{M} \big( \lambda_m h_m(f_1^J, e_1^I, d) \big) \right\} . \qquad (2.1)
$$

Feature functions thus also take the derivation as an argument. For the practical implementation, we may minimize an overall model cost rather than maximizing a model score, which gives us the (equivalent) decision rule

$$
f_1^J \rightarrow e_1^I(f_1^J) = \operatorname*{argmin}_{I, e_1^I} \; \min_{\substack{d : \sigma(d) = f_1^J \\ \tau(d) = e_1^I}} \left\{ \sum_{m=1}^{M} \big( -\lambda_m h_m(f_1^J, e_1^I, d) \big) \right\} . \qquad (2.2)
$$

Scaling factors λ_m can be negative, meaning that it is up to the optimizer to determine whether the contribution of a feature function to the log-linear model should act as a penalty or as a reward, by choosing a scaling factor with the respective sign. The optimizer normalizes the absolute values of the scaling factors λ_m to sum to 1.
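In code, the quantity minimized in Equation (2.2) is just a weighted sum of feature values. A minimal sketch, assuming the feature values h_m of one derivation have already been computed:

```python
def model_cost(feature_values, scaling_factors):
    """feature_values, scaling_factors: dicts mapping feature names to h_m and lambda_m."""
    return sum(-scaling_factors[m] * h for m, h in feature_values.items())

# A negative lambda_m flips the sign of -lambda_m * h_m, i.e. the optimizer decides whether
# a feature effectively acts as a penalty or as a reward.
```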

2.3 Feature Functions

2.3.1 Phrase Translation Models

We employ two phrase translation models in our systems, one modeling translation probabilities of the target side of phrases given the source side (source-to-target) and the other one modeling translation probabilities of the source side of phrases given the target side (target-to-source). The feature function for phrase translation model scoring in source-to-target direction is

$$
h_{\mathrm{s2tPhr}}(f_1^J, e_1^I, d) = \log \prod_{r \in R(d)} p_{\mathrm{s2tPhr}}(r) = \sum_{r \in R(d)} \log p_{\mathrm{s2tPhr}}(r) = \sum_{r \in R(d)} t_{\mathrm{s2tPhr}}(r) \qquad (2.3)
$$

with

$$
t_{\mathrm{s2tPhr}}(r) = \log p_{\mathrm{s2tPhr}}(r) \,. \qquad (2.4)
$$

Similarly, for the target-to-source direction, the feature function is

$$
h_{\mathrm{t2sPhr}}(f_1^J, e_1^I, d) = \log \prod_{r \in R(d)} p_{\mathrm{t2sPhr}}(r) = \sum_{r \in R(d)} \log p_{\mathrm{t2sPhr}}(r) = \sum_{r \in R(d)} t_{\mathrm{t2sPhr}}(r) \qquad (2.5)
$$

with

$$
t_{\mathrm{t2sPhr}}(r) = \log p_{\mathrm{t2sPhr}}(r) \,. \qquad (2.6)
$$

The probability distributions p_s2tPhr(r) and p_t2sPhr(r) are estimated with relative frequency from the parallel training data. Denoting with N(·) the count of the argument in the training corpus, the probability estimate for the source-to-target model is

$$
p_{\mathrm{s2tPhr}}(r) = p(\beta|\alpha) = \frac{N(\alpha, \beta)}{N(\alpha)} \,, \qquad (2.7)
$$

and for the target-to-source model it is

$$
p_{\mathrm{t2sPhr}}(r) = p(\alpha|\beta) = \frac{N(\alpha, \beta)}{N(\beta)} \,. \qquad (2.8)
$$

Rather than counting occurrences monolingually in the corpus for the denominators in Equations (2.7) and (2.8), we may use marginal counts of occurrences within valid phrases that are consistent with the word alignment of the parallel training data, i.e. $N(\alpha) = \sum_{\beta'} N(\alpha, \beta')$ and $N(\beta) = \sum_{\alpha'} N(\alpha', \beta)$.


2.3.2 Lexical Translation Models

We employ word-based translation models for lexical scoring. Again, as with phrase translation models, we use one model in source-to-target and one in target-to-source direction, with feature functions

$$
h_{\mathrm{s2tLex}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{s2tLex}}(r) \qquad (2.9)
$$

and

$$
h_{\mathrm{t2sLex}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{t2sLex}}(r) \qquad (2.10)
$$

and the phrase scoring functions being

$$
t_{\mathrm{s2tLex}}(r) = \sum_{i=1}^{I_\beta} \log \left( \frac{p(\beta_i|\mathrm{NULL}) + \sum_{j=1}^{J_\alpha} p(\beta_i|\alpha_j)}{1 + J_\alpha} \right) \qquad (2.11)
$$

and

$$
t_{\mathrm{t2sLex}}(r) = \sum_{j=1}^{J_\alpha} \log \left( \frac{p(\alpha_j|\mathrm{NULL}) + \sum_{i=1}^{I_\beta} p(\alpha_j|\beta_i)}{1 + I_\beta} \right) . \qquad (2.12)
$$

NULL is a special empty word token. For the actual word-based lexicons p(e|f) and p(f|e), we estimate the probabilities by relative frequency from word-aligned data (with symmetrized word alignments), similarly to [Koehn & Och+ 03]. We will investigate lexical scoring in depth in Chapter 5 and will show that we can improve over our baseline approach.
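A minimal sketch of the source-to-target scoring function of Equation (2.11); p_s2t is assumed to be a dictionary of word translation probabilities p(e|f), and unseen pairs are backed off to a small floor value, which is a simplification rather than the exact handling in our systems:

```python
import math

def t_s2t_lex(src_terminals, tgt_terminals, p_s2t, floor=1e-7):
    """src_terminals, tgt_terminals: terminal words of one phrase; p_s2t: dict (e, f) -> p(e|f)."""
    J = len(src_terminals)
    score = 0.0
    for e in tgt_terminals:
        mass = p_s2t.get((e, "NULL"), floor)
        mass += sum(p_s2t.get((e, f), floor) for f in src_terminals)
        score += math.log(mass / (1.0 + J))
    return score
```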

2.3.3 Phrase Penalty

The phrase penalty feature fires on every rule application and is defined as

$$
h_{\mathrm{pp}}(f_1^J, e_1^I, d) = |R(d)| = \sum_{r \in R(d)} t_{\mathrm{pp}}(r) \qquad (2.13)
$$

with

$$
t_{\mathrm{pp}}(r) = 1 \,. \qquad (2.14)
$$

With its associated scaling factor, the phrase penalty can give the system control over whether to translate sentences using fewer, but longer phrases, or the reverse.

2.3.4 Word Penalty

The word penalty with feature function

$$
h_{\mathrm{wp}}(f_1^J, e_1^I, d) = I = \sum_{r \in R(d)} t_{\mathrm{wp}}(r) \qquad (2.15)
$$

with

$$
t_{\mathrm{wp}}(r) = I_\beta \qquad (2.16)
$$

counts the number of words on the target side. Depending on the scaling factor of the word penalty feature, the system will produce shorter or longer translations.


2.3.5 Phrase Length Ratios

We also use two phrase length ratio features with

$$
h_{\mathrm{s2tLength}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{s2tLength}}(r) \qquad (2.17)
$$

and

$$
h_{\mathrm{t2sLength}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{t2sLength}}(r) \qquad (2.18)
$$

and the phrase scoring functions being

$$
t_{\mathrm{s2tLength}}(r) = \frac{I_\beta}{J_\alpha} \qquad (2.19)
$$

and

$$
t_{\mathrm{t2sLength}}(r) = \frac{J_\alpha}{I_\beta} \,. \qquad (2.20)
$$

2.3.6 Glue Rule Indicator

The glue rule indicator feature fires on applications of the glue rule (Eq. (1.6)) and is for deep grammars defined as

$$
h_{\mathrm{glue}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{glue}}(r) \qquad (2.21)
$$

with

$$
t_{\mathrm{glue}}(r) = \big[ r = S \rightarrow \langle S^{\sim 0} X^{\sim 1},\; S^{\sim 0} X^{\sim 1} \rangle \big] \,. \qquad (2.22)
$$

Here, [·] denotes a true or false statement: The result is 1 if the condition is true and 0 if the condition is false. For shallow grammars, we modify the feature accordingly.

2.3.7 Hierarchical Indicator

The hierarchical indicator distinguishes hierarchical phrases (with gaps) from lexical phrases (without gaps). The feature function is

$$
h_{\mathrm{hierarchical}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{hierarchical}}(r) \qquad (2.23)
$$

with

$$
t_{\mathrm{hierarchical}}(r) = [\, r \in H \setminus H_0 \,] \qquad (2.24)
$$

where H is the overall inventory of extracted phrases and H_0 denotes the union of extracted lexical phrase sets H_0(f_1^J, e_1^I, A) over all parallel training instances, i.e. any lexical phrase in the phrase inventory. See Section 1.3 for the definitions of H and H_0(f_1^J, e_1^I, A).

2.3.8 Paste Indicator

Another indicator function, the paste indicator, counts the number of applications of hierarchical phrases with non-terminals positioned bilingually at the phrase boundary. The feature function is

$$
h_{\mathrm{paste}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{paste}}(r) \qquad (2.25)
$$

with

$$
t_{\mathrm{paste}}(r) = \big[ r \in R_{\mathrm{pasteLeft}} \vee r \in R_{\mathrm{pasteRight}} \big] \qquad (2.26)
$$

and R_pasteLeft and R_pasteRight defined as

$$
R_{\mathrm{pasteLeft}} = \big\{ X \rightarrow \langle Y^{\sim 0} \gamma,\; Y^{\sim 0} \delta \rangle \;\big|\; \gamma \in (N \cup V_F)^\star \wedge \delta \in (N \cup V_E)^\star \wedge X \in N \wedge Y \in N \big\} \qquad (2.27)
$$

and

$$
R_{\mathrm{pasteRight}} = \big\{ X \rightarrow \langle \gamma Y^{\sim 0},\; \delta Y^{\sim 0} \rangle \;\big|\; \gamma \in (N \cup V_F)^\star \wedge \delta \in (N \cup V_E)^\star \wedge X \in N \wedge Y \in N \big\} \,. \qquad (2.28)
$$

2.3.9 Count Indicators

Count indicators can be used to indicate phrases that have been seen more than c times in the training data. We employ the feature function template

$$
h_{\mathrm{countBin}}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{\mathrm{countBin}}(r) \qquad (2.29)
$$

with

$$
t_{\mathrm{countBin}}(r) = [\, N(\alpha, \beta) > c \,] \qquad (2.30)
$$

and activate it for several values of c, e.g. marking phrases with absolute counts larger than one, two, three, and five times, respectively.

2.3.10 Language Model

We assess the fluency of the output with an n-gram language model (LM). The feature function takes the form

$$
h_{\mathrm{LM}}(f_1^J, e_1^I, d) = \log \prod_{i=1}^{I} p(e_i | e_{i-n+1}^{i-1}) = \sum_{i=1}^{I} \log p(e_i | e_{i-n+1}^{i-1}) \,. \qquad (2.31)
$$

Whereas we have seen that all the above features can be decomposed to the rule level and are thus independent of the context of individual rule applications, this is not the case for the language model. When decoding with a language model, target-side context needs to be taken into account. A rule application may score higher or lower depending on the current context. The decoder stores a state that preserves the context and prohibits recombination of (partial) hypotheses with different states. For experiments reported in this thesis, we utilize 4-gram language models with modified Kneser-Ney smoothing [Kneser & Ney 95, Chen & Goodman 98] trained with the SRILM toolkit [Stolcke 02].
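A minimal sketch of the feature function in Equation (2.31), assuming a hypothetical ngram_logprob(word, context) lookup that returns log p(e_i | context) from a trained n-gram model (e.g. via SRILM or comparable bindings; the lookup itself is not shown):

```python
def h_lm(target_words, ngram_logprob, order=4):
    total = 0.0
    for i, word in enumerate(target_words):
        context = tuple(target_words[max(0, i - order + 1):i])
        total += ngram_logprob(word, context)   # log p(e_i | e_{i-n+1}^{i-1})
    return total
```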


3. Hierarchical Search with Cube Pruning

Cube pruning is the most widely used search strategy in state-of-the-art hierarchical decoders. For hierarchical search, a parsing algorithm is extended to handle translation candidates and to incorporate language model scores via cube pruning. The cube pruning algorithm for hierarchical search was introduced by [Chiang 07] and is basically an adaptation of one of the k-best parsing algorithms by [Huang & Chiang 05].

The essentials of hierarchical search with the cube pruning algorithm are explained in Section 3.1. We emphasize two important aspects: the k-best generation size (Section 3.1.1) and the hypothesis recombination scheme (Section 3.1.2). We then empirically investigate the performance of cube pruning on Chinese→English and Arabic→English translation tasks (Section 3.2). A number of comparative experiments with varied configuration parameters are conducted, and the findings are analyzed and discussed. We review related work in Section 3.3.

3.1 The Cube Pruning Algorithm

Cube pruning operates on a hypergraph which represents the whole parsing space. This hypergraph is built employing a customized version of the cyk+ parsing algorithm [Chappelier & Rajman 98]. With cyk+, the input sentence is monolingually parsed using the source sides of the SCFG rules. The hypergraph is a compact representation of the cyk+ parse forest. Conceptually, hypernodes correspond with distinct non-terminals in cells of the cyk+ parse chart. Target-side parses can be constructed from source-side parses, and cube pruning allows us to only explore the most promising target candidates in a beam search manner. Given the hypergraph, cube pruning expands at most k bilingual derivations at each hypernode.1

The pseudocode of the k-best generation step of the cube pruning algorithm is shown in Figure 3.1. This function is called in bottom-up topological order for all hypernodes. A heap of active derivations H is maintained. H initially contains the first-best derivations for each incoming hyperedge (line 1). Active derivations are processed in a loop (line 3) until a limit k is reached or H is empty. If a candidate derivation d is recombinable, the Recombine auxiliary function recombines it and returns true; otherwise (for non-recombinable candidates) Recombine returns false. Non-recombinable candidates are appended to the list D of k-best derivations (line 6). This list will be sorted before the function terminates (line 8). The PushSucc auxiliary function (line 7) updates H with the next best derivations following d along the hyperedge. PushSucc determines the cube order by processing adjacent derivations in a specific sequence (of predecessor hypernodes along the hyperedge and phrase translation options).2

1 The hypergraph on which cube pruning operates can be constructed based on other techniques, such as tree automata, but cyk+ parsing is the dominant approach.

2 [Vilar 11] presents a more detailed outline of the algorithm.


Input: a hypernode and the size k of the k-best list
Output: D, a list with the k-best derivations

1  let H ← heap({(e, 1^|e|) | e ∈ incoming edges})
2  let D ← [ ]
3  while |H| > 0 ∧ |D| < k do
4      d ← pop(H)
5      if not Recombine(D, d) then
6          D ← D ++ [d]
7      PushSucc(d, H)
8  sort D

Figure 3.1: k-best generation with the cube pruning algorithm.
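For illustration, the following runnable sketch mirrors the pseudocode of Figure 3.1 for a single hypernode. It is a strong simplification of the actual decoder: an incoming hyperedge is modeled as a list of ranked option lists (the k-best lists of predecessor hypernodes plus the ranked translation options), a derivation is identified by its rank vector, and user-supplied score and state functions stand in for model scoring and for the recombination state (cf. Section 3.1.2); recombined candidates are simply dropped rather than merged:

```python
import heapq

def kbest_cube_pruning(edges, k, score, state):
    """edges: list of hyperedges, each a list of ranked option lists (best first).
    score(e, ranks) -> model cost (lower is better); state(e, ranks) -> recombination state."""
    heap, queued, kept_states, D = [], set(), set(), []
    for e in range(len(edges)):                              # line 1: first-best per hyperedge
        ranks = (0,) * len(edges[e])
        heapq.heappush(heap, (score(e, ranks), e, ranks))
        queued.add((e, ranks))
    while heap and len(D) < k:                               # line 3
        cost, e, ranks = heapq.heappop(heap)                 # line 4
        if state(e, ranks) not in kept_states:               # line 5: Recombine
            kept_states.add(state(e, ranks))
            D.append((cost, e, ranks))                       # line 6
        for pos in range(len(ranks)):                        # line 7: PushSucc
            succ = ranks[:pos] + (ranks[pos] + 1,) + ranks[pos + 1:]
            if succ[pos] < len(edges[e][pos]) and (e, succ) not in queued:
                queued.add((e, succ))
                heapq.heappush(heap, (score(e, succ), e, succ))
    D.sort()                                                 # line 8
    return D
```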

3.1.1 k-best Generation Size

Candidate derivations are generated by cube pruning best-first along the incoming hyperedges. A problem results from the language model integration, though. As soon as language model context is considered, monotonicity properties of the derivation score can no longer be guaranteed. Thus, even for single-best translation, k-best derivations are collected to a buffer in a beam search manner and finally sorted according to their score. The k-best generation size is consequently a crucial parameter to the cube pruning algorithm.

3.1.2 Hypothesis Recombination

Partial hypotheses with states that are indistinguishable from each other are recombined during search. We define two notions of when to consider two derivations as indistinguishable, and thus when to recombine them:

Recombination T: The T recombination scheme recombines derivations that produce identical translations.

Recombination LM: The LM recombination scheme recombines derivations with identical language model context.

Recombination is conducted within the loop of the k-best generation step of cube pruning. Recombined derivations do not increment the generation count; the k-best generation limit is thus effectively applied after recombination.3 In general, more phrase translation candidates per hypernode are being considered (and need to be rated with the language model) in the recombination LM scheme compared to the recombination T scheme. The more partial hypotheses can be recombined, the more iterations of the inner code block of the k-best generation loop are possible. The same internal k-best generation size results in a larger search space for recombination LM. In the next section, we will examine how the overall number of loop iterations relates to the k-best generation limit. By measuring the number of derivations as well as the number of recombination operations on our test sets, we will be able to give an insight into how large the fraction of recombinable candidates is for different configurations.

3 Whether recombined derivations contribute to the generation count or not is a configuration decision (or implementation decision). Please note that some publicly available toolkits count recombined derivations by default.
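Expressed as state functions for the sketch above, the two schemes differ only in what part of the target yield of a (partial) derivation they retain; a minimal illustration (for a 4-gram LM, the three boundary words on each side serve as context):

```python
def state_T(target_words):
    # recombination T: derivations with identical translations are recombined
    return tuple(target_words)

def state_LM(target_words, lm_order=4):
    # recombination LM: only the language model context has to be identical
    n = lm_order - 1
    return (tuple(target_words[:n]), tuple(target_words[-n:]))
```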


3.2 Performance of Cube Pruning for Hierarchical Search

We investigate the performance of hierarchical phrase-based translation with cube pruning and look into the key aspects:

• k-best generation size.
• Hypothesis recombination scheme.
• Deep vs. shallow grammar.

Specifically, we study how the choice of the k-best generation size affects translation quality and resource requirements in hierarchical search. We furthermore examine the influence of the two different granularities of hypothesis recombination (recombination T and recombination LM). Besides standard hierarchical grammars, we also explore search with restricted recursion depth of hierarchical rules based on shallow-1 grammars (cf. Section 1.3 for more details about deep vs. shallow grammars). We conduct a comparative empirical study of all combinations of these three factors in hierarchical decoding and present experimental results on the Chinese→English and Arabic→English 2008 NIST tasks.4

Performance is evaluated in terms of both translation quality and computational efficiency, i.e. translation speed and memory consumption.

3.2.1 Experimental Setup

We work with parallel training corpora of 3.0 M Chinese–English sentence pairs (77.5 M Chinese / 81.0 M English running words after preprocessing) and 2.5 M Arabic–English sentence pairs (54.3 M Arabic / 55.3 M English running words after preprocessing), respectively. More training data statistics are given in Tables 3.1 and 3.2. Phrase tables are not prepruned to contain a maximum number of translation options per distinct source side, but we configured the decoder to load only the best options with respect to the weighted phrase-level model scores (100 for Chinese, 50 for Arabic).

During decoding, we expand the initial symbol S in the leftmost cells in each row of the cyk+ parse chart only. A maximum length constraint of ten is applied to the source yields of sub-derivations headed by any non-terminal except S. The optimized feature weights are obtained (separately for deep and for shallow-1 grammars) with a generation size of 1 000 for Chinese→English and of 500 for Arabic→English in MERT and kept for all setups. We employ MT06 as development sets. Translation quality is measured with Bleu on the MT08 test sets. We report cased Bleu scores. Data statistics of the preprocessed source sides of the MT06 development sets and the MT08 test sets for Chinese→English and Arabic→English are given in Tables 3.3 and 3.4.

The Jane toolkit is used for our translation experiments, compiled with GCC version 4.4.3 and its -O2 optimization flag. The language models are 4-grams with modified Kneser-Ney smoothing which were trained on a large collection of English data including the target side of the parallel corpus. In binarized version, the language models have a size of 3.6G (Chinese→English; 1105 K 1-grams, 51.8 M 2-grams, 96.9 M 3-grams, 186.9 M 4-grams) and 6.2G (Arabic→English; 361 K 1-grams, 70.9 M 2-grams, 165.3 M 3-grams, 342.4 M 4-grams). We employ the SRILM libraries to perform language model scoring in the decoder. Language models and phrase tables have been copied to the local hard disks of the machines. In all experiments, the language model is completely loaded beforehand. Loading time of the language model and any other initialization steps are not included in the measured translation time. Phrase tables are in the Jane toolkit’s binarized format. The decoder initializes the prefix tree structure, required nodes get loaded from secondary storage into main memory on demand, and the loaded content is being cleared each time a new input sentence is to be parsed. There is nearly no overhead due to unused data in main memory. We do not rely on memory mapping. Memory statistics are with respect to virtual memory. The hardware was equipped with RAM well beyond the requirements of the tasks, and sufficient memory has been reserved for the processes.

Table 3.1: Data statistics of the preprocessed Chinese–English parallel training corpus.

                   Chinese    English
  Sentences            3.0 M
  Running words       77.5 M     81.0 M
  Vocabulary             83 K      213 K
  Singletons             21 K       96 K

Table 3.2: Data statistics of the preprocessed Arabic–English parallel training corpus.

                    Arabic    English
  Sentences            2.5 M
  Running words       54.3 M     55.3 M
  Vocabulary            265 K      208 K
  Singletons            115 K       91 K

Table 3.3: Data statistics of the preprocessed source sides of the Chinese→English NIST MT06 and MT08 sets.

  Chinese          MT06 (Dev)    MT08 (Test)
  Sentences             1 664          1 357
  Running words        40 740         34 463
  Vocabulary            6 138          6 209

Table 3.4: Data statistics of the preprocessed source sides of the Arabic→English NIST MT06 and MT08 sets.

  Arabic           MT06 (Dev)    MT08 (Test)
  Sentences             1 797          1 360
  Running words        49 677         45 095
  Vocabulary            9 274          9 387

4 http://www.itl.nist.gov/iad/mig/tests/mt/2008/

3.2.2 Experimental Results

Figures 3.2 and 3.3 depict how the Chinese→English and Arabic→English setups behave in terms of translation quality. The k-best generation size in cube pruning is varied between 10 and 10 000. The four graphs in each plot illustrate the results with combinations of deep grammar and recombination scheme T, deep grammar and recombination scheme LM, shallow grammar and recombination scheme T, as well as shallow grammar and recombination scheme LM. Figures 3.4 and 3.5 show the corresponding translation speed in words per second for these settings. The maximum memory requirements in gigabytes are given in Figures 3.6 and 3.7. In order to better visualize the trade-offs between translation quality and resource consumption, we plotted translation quality against time requirements in Figures 3.8 and 3.9 and translation quality against memory requirements in Figures 3.10 and 3.11. Translation quality and model score (averaged over all sentences; higher is better) are nicely correlated for all configurations, as can be concluded from Figures 3.12 through 3.15.

3.2.3 Discussion

Chinese→English. For Chinese→English translation, the system with deep grammar performs generally a bit better with respect to quality than the shallow one, which accords with the findings of other groups [de Gispert & Iglesias+ 10, Sankaran & Razmara+ 12]. The LM recombination scheme yields slightly better quality than the T scheme, and with the shallow-1 grammar it outperforms the T scheme at any given fixed amount of time or memory allocation (Figures 3.8 and 3.10).

[Figure 3.2: Translation quality (cased) with the cube pruning hierarchical decoder for the NIST Chinese→English translation task (MT08). (plot not reproduced)]

[Figure 3.3: Translation quality (cased) with the cube pruning hierarchical decoder for the NIST Arabic→English translation task (MT08). (plot not reproduced)]

Shallow-1 translation is up to roughly 2.5 times faster than translation with the deep grammar. However, the shallow-1 setups are considerably slowed down at higher k-best sizes as well, while the effort pays off only very moderately. Overall, the shallow-1 grammar at a k-best size between 100 and 1 000 seems to offer a good compromise of quality and efficiency. Deep translation with k = 2000 and the LM recombination scheme promises high quality translation, but note the rapid memory consumption increase beyond k = 1000 with the deep grammar. At k ≤ 1 000, memory consumption is not an issue in both deep and shallow systems, but translation speed starts to drop at k > 100 already.

Arabic→English. Shallow-1 translation produces very competitive quality for Arabic→English translation [de Gispert & Iglesias+ 10, Huck & Vilar+ 11a]. The LM recombination scheme boosts the Bleu scores slightly.

The systems with deep grammar are slowed down strongly with every increase of the k-best size. Their memory consumption likewise inflates early. We actually stopped running experiments with deep grammars for Arabic→English at k = 7 000 for the T recombination scheme, and at k = 700 for the LM recombination scheme, because 124 GB of memory did not suffice any more for higher k-best sizes. The memory consumption of the shallow systems stays nearly constant across a large range of the surveyed k-best sizes, but Figure 3.11 reveals a plateau where more resources do not improve translation quality. Increasing k from 100 to 2 000 in the shallow setup with LM recombination provides half a Bleu point, but reduces speed by a factor of more than 10.

Actual number of derivations. We measured the number of hypernodes (Table 3.5), the number of actually generated derivations after recombination, and the number of generated candidate derivations including recombined ones (or, equivalently, loop iterations in the algorithm from Figure 3.1) for selected limits k (Tables 3.6 and 3.7). The ratio of the average number of derivations per hypernode after and before recombination remains consistently at low values for all recombination T setups. For the setups with LM recombination scheme, this recombination factor (the ratio of candidate derivations including recombined ones to derivations retained after recombination, e.g. 72 008.4 / 9 418.1 ≈ 7.65 for the deep Chinese→English system at k = 10 000) rises with larger k, i.e. the fraction of recombinable candidates increases. The increase is remarkably pronounced for Arabic→English with deep grammar. The steep slope of the recombination factor may be interpreted as an indicator for undesired overgeneration of the deep grammar on the Arabic→English task.


3.2.4 Summary

We systematically studied three key aspects of hierarchical phrase-based translation with cube pruning: deep vs. shallow-1 grammars, the k-best generation size, and the hypothesis recombination scheme. In a series of empirical experiments, we revealed the trade-offs between translation quality and resource requirements to a more fine-grained degree than is typically done in the literature.

3.3 Related Work

Some alternatives and extensions to the classical cube pruning algorithm for hierarchical decoding as proposed by [Chiang 07] have been presented in the literature since, e.g. cube growing [Huang & Chiang 07], lattice-based hierarchical translation [Iglesias & de Gispert+ 09b, de Gispert & Iglesias+ 10], and source cardinality synchronous cube pruning [Vilar & Ney 12]. Heafield et al. have developed techniques to speed up hierarchical search by means of an improved language model integration [Heafield & Hoang+ 11, Heafield & Koehn+ 12, Heafield & Koehn+ 13]. Standard cube pruning remains the commonly adopted decoding procedure in hierarchical machine translation research at the moment, though.

The algorithm has meanwhile been implemented in many publicly available toolkits, as for example in Moses [Koehn & Hoang+ 07, Hoang & Koehn+ 09], Joshua [Li & Callison-Burch+ 09a], cdec [Dyer & Lopez+ 10], Kriya [Sankaran & Razmara+ 12], and NiuTrans [Xiao & Zhu+ 12]. Good descriptions of the cube pruning implementation in the Joshua decoder have been provided by [Li & Khudanpur 08] and [Li & Callison-Burch+ 09b]. [Xu & Koehn 12] implemented hierarchical search with the cube growing algorithm in Moses and compared its performance to Moses’ cube pruning implementation.

While the plain hierarchical approach to machine translation is only formally syntax-based, cube pruning can also be utilized for decoding with syntactically or semantically enhanced models, for instance those by [Zollmann & Venugopal 06], [Marton & Resnik 08], [Venugopal & Zollmann+ 09], [Shen & Xu+ 10], [Hoang & Koehn 10], [Chiang 10], [Xie & Mi+ 11], [Almaghout & Jiang+ 12], [Li & Tu+ 12], [Williams & Koehn 12], or [Baker & Bloodgood+ 10]. An interesting empirical comparison of phrase-based, hierarchical phrase-based, and syntax-augmented systems on the Chinese→English and Arabic→English NIST tasks was presented by [Zollmann & Venugopal+ 08]. [Auli & Lopez+ 09] carried out a valuable analysis of the search spaces and model expressiveness of phrase-based and hierarchical phrase-based setups.


Figure 3.4: Translation speed with the cube pruning hierarchical decoder for the NIST Chinese→English translation task.

Figure 3.5: Translation speed with the cube pruning hierarchical decoder for the NIST Arabic→English translation task.

Figure 3.6: Memory requirements with the cube pruning hierarchical decoder for the NIST Chinese→English translation task.

Figure 3.7: Memory requirements with the cube pruning hierarchical decoder for the NIST Arabic→English translation task.


Figure 3.8: Trade-off between translation quality and speed with the cube pruning hierarchical decoder for the NIST Chinese→English translation task.

Figure 3.9: Trade-off between translation quality and speed with the cube pruning hierarchical decoder for the NIST Arabic→English translation task.

Figure 3.10: Trade-off between translation quality and memory requirements with the cube pruning hierarchical decoder for the NIST Chinese→English translation task.

Figure 3.11: Trade-off between translation quality and memory requirements with the cube pruning hierarchical decoder for the NIST Arabic→English translation task.


Figure 3.12: Relation of translation quality and average model score with the cube pruning hierarchical decoder for the NIST Chinese→English translation task (deep grammar).

Figure 3.13: Relation of translation quality and average model score with the cube pruning hierarchical decoder for the NIST Arabic→English translation task (deep grammar).

Figure 3.14: Relation of translation quality and average model score with the cube pruning hierarchical decoder for the NIST Chinese→English translation task (shallow-1 grammar).

Figure 3.15: Relation of translation quality and average model score with the cube pruning hierarchical decoder for the NIST Arabic→English translation task (shallow-1 grammar).


Table 3.5: Average number of hypernodes per sentence and average length of the preprocessed input sentences on the NIST Chinese→English (MT08) and Arabic→English (MT08) tasks.

                                     Chinese→English          Arabic→English
                                     deep       shallow-1     deep       shallow-1
  avg. #hypernodes per sentence      480.5      200.7         896.4      308.4
  avg. source sentence length             25.4                     33.2

Table 3.6: Detailed statistics about the actual number of derivations on the NIST Chinese→English translation task (MT08).

  deep
                          recombination T                                recombination LM
          avg. #derivations       avg. #derivations              avg. #derivations       avg. #derivations
          per hypernode           per hypernode                  per hypernode           per hypernode
  k       (after recombination)   (incl. recombined)   factor    (after recombination)   (incl. recombined)   factor
  10            10.0                    11.7            1.17           10.0                    18.2            1.82
  100           99.9                   120.1            1.20           99.9                   275.8            2.76
  1000         950.1                  1142.3            1.20          950.1                  4246.9            4.47
  10000       9429.8                 11262.8            1.19         9418.1                 72008.4            7.65

  shallow-1
                          recombination T                                recombination LM
          avg. #derivations       avg. #derivations              avg. #derivations       avg. #derivations
          per hypernode           per hypernode                  per hypernode           per hypernode
  k       (after recombination)   (incl. recombined)   factor    (after recombination)   (incl. recombined)   factor
  10             9.7                    11.3            1.17            9.6                    13.6            1.41
  100           90.8                   105.2            1.16           90.4                   168.6            1.86
  1000         707.3                   811.3            1.15          697.4                  2143.4            3.07
  10000       6478.1                  7170.4            1.11         6202.8                 34165.6            5.51

Table 3.7: Detailed statistics about the actual number of derivations on the NIST Arabic→English translation task (MT08).

  deep
                          recombination T                                recombination LM
          avg. #derivations       avg. #derivations              avg. #derivations       avg. #derivations
          per hypernode           per hypernode                  per hypernode           per hypernode
  k       (after recombination)   (incl. recombined)   factor    (after recombination)   (incl. recombined)   factor
  10            10.0                    18.3            1.83           10.0                    71.5            7.15
  100           98.0                   177.4            1.81           98.0                  1726.0           17.62
  500          482.1                   849.0            1.76          482.1                 14622.1           30.33
  1000         961.8                  1675.0            1.74              –                       –               –

  shallow-1
                          recombination T                                recombination LM
          avg. #derivations       avg. #derivations              avg. #derivations       avg. #derivations
          per hypernode           per hypernode                  per hypernode           per hypernode
  k       (after recombination)   (incl. recombined)   factor    (after recombination)   (incl. recombined)   factor
  10             9.6                    12.1            1.26            9.6                    16.6            1.73
  100           80.9                   105.2            1.30           80.2                   193.8            2.42
  1000         690.1                   902.1            1.31          672.1                  2413.0            3.59
  10000       5638.6                  7149.5            1.27         5275.1                 31283.6            5.93


Part II

Enhancing Hierarchical Phrase-based Translation with Additional Models


4. Extended Lexicon Models

Previous research has demonstrated how two types of extended lexicon models called triplet lexicon model (we will abbreviate this simply as triplets in many cases) and discriminative word lexicon (DWL) can improve the translation quality of standard phrase-based systems when applied in n-best reranking [Hasan & Ganitkevitch+ 08] as well as directly in beam search decoding [Hasan & Ney 09, Mauser & Hasan+ 09]. Both types of extended lexicon models consider global source sentence context in order to predict context-specific target words. Their main advantage is that they promote a better lexical choice than the baseline features alone are able to achieve.

In this chapter, we investigate the impact of an integration of discriminative and trigger-based lexicon models in hierarchical machine translation. We have implemented DWL and triplet model scoring for hierarchical decoding in Jane [Vilar & Stein+ 10]. For comparison, we also present empirical results with extended lexicon models in standard phrase-based translation. Experiments are conducted on the NIST Arabic→English task.

Since training the extended lexicon models can be computationally demanding and the trained models can become fairly large, we furthermore look into certain restrictions that discard some of the less useful information. We show how these restrictions facilitate the training of the extended lexicon models and compare the utility of the model variants in translation.

4.1 Motivation

Taking long-range dependencies into account is still one of the main problems in today’s statistical machine translation (SMT). State-of-the-art systems incorporate components like a phrase translation model and n-gram language models that act effectively within a local context and give reliable results as long as only information from a limited window is required. But reordering in translation between different languages, recursive embedding of subphrases, as it is common in natural language, and distant lexical interconnections are hard to model and difficult to handle in a computationally efficient way.

The hierarchical phrase-based approach to machine translation is not restricted to consecutive words, but allows for hierarchical phrase embeddings and can capture lexical dependencies that cross gaps. The phrase table of a hierarchical phrase-based translation (HPBT) system can be considered to be the rule set of a synchronous context-free grammar. This formal grammar usually does not comply with a linguistically motivated grammar, but since the decoding procedure is realized as a probabilistic parser, the hierarchical phrase-based paradigm connects more closely to linguistics-oriented work in natural language processing than standard phrase-based translation (PBT). Following a natural path of research, several efforts have been made to engineer syntactically more informed hierarchical systems based on syntactic annotation of the data [Zollmann & Venugopal 06, Venugopal & Zollmann+ 09, Shen & Xu+ 08, Shen & Xu+ 10, Marton & Resnik 08, Vilar & Stein+ 08, Stein & Peitz+ 10, Peter & Huck+ 11, Chiang 10, Hoang & Koehn 10]. Appropriate features can be introduced into the log-linear framework of modern statistical machine translation systems without having to impose any hard constraints on the translation process. The primary aim of augmenting the hierarchical translation model with syntactic knowledge is to produce output sentences with a better syntactic structure. On the other hand, standard phrase-based translation with continuous phrases and left-to-right target generation has proven very successful and robust by relying on statistics learned purely on surface forms from huge corpora. Such systems still outperform plain hierarchical systems on many translation tasks. In this work, we complement the efforts that are being made on a deep structural level within the hierarchical paradigm by including additional models which are trained on surface forms only, without any syntactic annotation that would have to be made available, but still operate at a scope that extends beyond the capabilities of a standard feature set for hierarchical systems. We argue that better statistical modeling is required for the systems to truly take advantage of hierarchical phrases.

Hierarchical translation as well as syntax-augmented statistical machine translation share a shortcoming concerning lexical choice with standard phrase-based translation. Bilingual lexical context beyond the phrase boundaries is not taken into account by the commonly applied features. As we have seen in Chapter 2, none of the standard features for hierarchical translation except the language model consider any context beyond the phrase boundaries. The language model considers only a limited history of previous words on the target side. We enhance our log-linear model with feature functions for the trigger-based triplet lexicon model and the discriminative word lexicon. Triplets and the discriminative word lexicon can both model distant lexical dependencies. During decoding, rule applications are scored conditioned on the full source sentence context, i.e. given all words in the input sentence. Efficient implementations of feature functions that consider the full source sentence context are less of a problem because the input remains fixed throughout decoding. We can cache score values that have already been calculated for reuse.

We compare hierarchical and standard phrase-based setups, either of them enriched with extended lexicon models, against each other on the NIST Arabic→English translation task. Enhanced with extended lexicon models, we find that the translation quality of our best hierarchical system slightly surpasses the translation quality of the best standard phrase-based system (by +0.2 Bleu). Variants of the extended lexicon models with certain restrictions can help reduce the computational demands for training and decrease memory requirements and runtime during decoding. We examine a path-constrained triplet model in addition to the full variant and apply count cutoffs to the triplets. For the discriminative word lexicon, we investigate a full model and a sparse model.

The remainder of the chapter is structured as follows. In Section 4.2, we give a short overview of related previously published work. We introduce triplet lexicons and discriminative word lexicons in Sections 4.3 and 4.4, show how they are used in hierarchical decoding, and describe the modifications to reduce their computational demands in training, and their final size. The experimental evaluation is presented in Section 4.5. We first give a characterization of the experimental setup. We then report on the different extended lexicon models we trained and proceed with a comparison of the translation results using these models in phrase-based and hierarchical translation.

4.2 Related Work

[Hasan & Ganitkevitch+ 08] proposed triplet lexicon models for statistical machine translation for the first time. Triplet lexicon models are related to the well-known IBM model 1 [Brown & Della Pietra+ 93] but extend it with a second trigger. [Hasan & Ganitkevitch+ 08] also introduced the restrictions that are applied to triplets in this work; they did however apply the models only in an n-best list reranking framework. They evaluated their methods on a small Chinese→English task and on Spanish→English / English→Spanish tasks. [Hasan & Ney 09] investigated triplet lexicon models for standard phrase-based translation on a large-scale Chinese→English task. They compared translation quality using a path-constrained triplet model variant applied in n-best reranking to an integrated application in decoding. [Hasan 11] studied triplet lexicon models in depth, including many model variants.

The discriminative word lexicon model was presented by [Mauser & Hasan+ 09]. Their work is inspired by the approach described by [Bangalore & Haffner+ 07]. [Mauser & Hasan+ 09] also compared the effect of a triplet lexicon model and a DWL model in phrase-based decoding on a Chinese→English task and on the NIST Arabic→English task that we likewise work on. [Jeong & Toutanova+ 10] utilized a discriminatively trained lexicon model within a treelet translation system for languages with complex morphology.

Different from previous work, we integrate triplet lexicon models and discriminative word lexicons into hierarchical phrase-based decoding, and we compare different variants of the two extended lexicon models in hierarchical translation, including a new sparse variant of the discriminative word lexicon model.

4.3 Triplet Lexicon

The triplet lexicon relies on triplets which are composed of two source language words triggering one target language word, i.e. it models probabilities p(e|f, f'). The probability of a whole target sentence e_1^I given the source sentence f_1^J is calculated as

p(e_1^I | f_1^J) = \prod_{i=1}^{I} p(e_i | f_1^J) = \prod_{i=1}^{I} \frac{2}{J \cdot (J+1)} \sum_{j=0}^{J} \sum_{j'=j+1}^{J} p(e_i | f_j, f_{j'})     (4.1)

with f_0 = NULL being the empty word.

The path-constrained (or path-aligned) triplet model variant restricts the first trigger f to source words that are aligned to the target word e. The second trigger f' is allowed to range over all remaining words of the source sentence. When {a_ij} denotes the word alignment of the sentence pair e_1^I and f_1^J, the probability of a whole target sentence results in

p(e_1^I | f_1^J, \{a_{ij}\}) = \prod_{i=1}^{I} p(e_i | f_1^J, \{a_{ij}\}) = \prod_{i=1}^{I} \sum_{j \in \{a_i\}} \frac{1}{Z_i} \sum_{j'=1}^{J} p(e_i | f_j, f_{j'})     (4.2)

with a normalization factor Z_i = J \cdot |\{a_i\}|. j ∈ {a_i} expresses that f_j is aligned to the target word e_i.

To further reduce the size of a triplet model, count cutoffs can be applied. This means that triplets that occur less than a fixed number of times in the corpus are not considered in the training of the model.

Like the IBM model 1, triplet lexicon models are trained iteratively with the EM algorithm. [Hasan 11] derives the equations for EM training of triplet lexicon models.
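The sentence-level probability of Equation (4.1) is straightforward to compute from a trained triplet table. The following minimal Python sketch illustrates this; the function names and the dictionary layout of the table are hypothetical and not taken from the Jane implementation.

    import math

    NULL = "<null>"
    FLOOR = 1e-7   # floor value for triplets that are not in the model

    def triplet_prob(table, e, f, f_prime):
        # table maps (f, f', e) to p(e | f, f')
        return table.get((f, f_prime, e), FLOOR)

    def sentence_log_prob(table, source, target):
        """log p(e_1^I | f_1^J) following Eq. (4.1): for each target word, average
        p(e_i | f_j, f_j') over all source word pairs j < j', with f_0 = NULL."""
        f = [NULL] + list(source)
        J = len(source)
        norm = 2.0 / (J * (J + 1))     # there are J*(J+1)/2 pairs (j, j') with 0 <= j < j' <= J
        log_p = 0.0
        for e_i in target:
            acc = 0.0
            for j in range(J + 1):
                for j_prime in range(j + 1, J + 1):
                    acc += triplet_prob(table, e_i, f[j], f[j_prime])
            log_p += math.log(norm * acc)
        return log_p

A trained model would be loaded into table as a mapping from (f, f', e) to a probability; unseen triplets fall back to the same kind of small floor value that is used in decoding.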


4.3.1 Triplet Feature Functions with Global Source Context

Using the notational conventions from Sections 1.2 and 2.1, we define the feature function for the unconstrained triplet lexicon model with global source sentence context in hierarchical translation as

h_{s2tTriplet}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{s2tTriplet}(r, f_1^J)     (4.3)

with the scoring function

t_{s2tTriplet}(r, f_1^J) = \sum_{i=1}^{I_\beta} \log \left( \frac{2}{J \cdot (J+1)} \sum_{j=0}^{J} \sum_{j'=j+1}^{J} p(\beta_i | f_j, f_{j'}) \right)     (4.4)

and f_0 = NULL being the empty word.

The feature function for the path-constrained triplet lexicon model takes a slightly different form. We need to take the word alignment into account. For the decoder to have access to the word alignment, we keep track of the most frequent alignment for each phrase during phrase extraction and store it with the entry in the phrase table. The decoder will thus be aware of within-phrase word alignments.

With {a_ij} we denote a word alignment of a phrase pair that is stored in the phrase table, i.e. here we index the terminal symbols on source and target side of a rule r, 1 ≤ j ≤ J_α and 1 ≤ i ≤ I_β. We will use the {a_ij} notation in later chapters of this thesis, but introduce another notational convention for the definition of the feature function for the path-constrained triplet lexicon model with global source sentence context.

Let {ā_ij} denote the word alignment between the target side of the phrase and the words in the source sentence, i.e. now target-side indices i are still for terminal symbols on the target side of the rule, whereas source-side indices j are for positions in the full source sentence f_1^J, 1 ≤ j ≤ J and 1 ≤ i ≤ I_β. Knowing the within-phrase word alignment {a_ij} that is stored in the phrase table, the offset of the beginning of the source side of the applied phrase from the sentence start, the ∼ relation, and the span width of non-terminals in the source sentence, we can calculate {ā_ij} in search. All this information is actually available to the decoder.

We define the feature function for the path-constrained triplet lexicon model with global source sentence context in hierarchical translation as

h_{s2tConstrainedTriplet}(f_1^J, e_1^I, d, \{\bar{a}_{ij}\}) = \sum_{r \in R(d)} t_{s2tConstrainedTriplet}(r, \{\bar{a}_{ij}\}, f_1^J)     (4.5)

with the scoring function

t_{s2tConstrainedTriplet}(r, \{\bar{a}_{ij}\}, f_1^J) = \sum_{i=1}^{I_\beta} \log \sum_{j \in \{\bar{a}_i\}} \frac{1}{Z_i} \sum_{j'=j+1}^{J} p(\beta_i | f_j, f_{j'})     (4.6)

and normalization factor Z_i = J \cdot |\{\bar{a}_i\}|. j ∈ {ā_i} expresses that f_j is aligned to β_i. In fact, we score with NULL as a trigger as well. In favor of notational convenience, we omitted this in the formula.

A small floor value is used for probabilities of triplets that are encountered in decoding but are not in the model.
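Since the source sentence f_1^J is fixed throughout decoding, the inner double sum of Equation (4.4) depends only on the target word β_i, so its value can be cached per input sentence and reused across rule applications, as noted in Section 4.1. A rough Python sketch of this idea with hypothetical helper names (not the actual decoder code):

    import math
    from functools import lru_cache

    NULL = "<null>"
    FLOOR = 1e-7

    def make_rule_scorer(table, source):
        """Score target terminals of a rule according to Eq. (4.4); the per-word
        sums are cached because the source sentence stays fixed during decoding."""
        f = [NULL] + list(source)      # f_0 = NULL, f_1 .. f_J
        J = len(source)
        norm = 2.0 / (J * (J + 1))

        @lru_cache(maxsize=None)
        def word_score(e):
            # inner double sum of Eq. (4.4); depends only on the target word e
            acc = sum(table.get((f[j], f[jp], e), FLOOR)
                      for j in range(J + 1) for jp in range(j + 1, J + 1))
            return math.log(norm * acc)

        def rule_score(target_terminals):
            # t_s2tTriplet(r, f_1^J): sum of the cached word scores
            return sum(word_score(e) for e in target_terminals)

        return rule_score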

4.4 Discriminative Word Lexicon

Given a source sentence f_1^J, the discriminative word lexicon model estimates the probability of the vocabulary that is used in the translation being a set of target words e ⊆ V_E. The sequential order of the words in the source sentence is not taken into account, and interdependencies between target words are ignored. The set of target words e can be coded in a binary vector E = (..., E_e, ...). The indicator variable E_e is set to 1 if the word e is contained in the target sentence, otherwise it is set to 0. In a similar way, the counts F_f of the source words occurring in the source sentence f_1^J can be represented as a count vector F = (..., F_f, ...). The probability for the set e is composed of the individual and independent probabilities over the target vocabulary V_E,

p(E | F) = \prod_{e \in V_E} p_e(E_e | F) .     (4.7)

The probability for a single target word is modeled as a log-linear model

p_e(E_e | F) = \frac{\exp(g_e(E_e, F))}{\sum_{E_e \in \{0,1\}} \exp(g_e(E_e, F))}     (4.8)

with the function

g_e(E_e, F) = \lambda_{e,\cdot} + \sum_{f \in V_F} F_f \, \lambda_{f,e,E_e}     (4.9)

where λ_{f,e,·} represent lexical weights and λ_{e,·} are prior weights. Because statistical independence of the occurrence of words is assumed, it is easy to parallelize the training procedure. The models are trained with the improved RProp algorithm [Igel & Hüsken 03] in our work, in contrast to [Mauser & Hasan+ 09] where the L-BFGS method is used. We carry out 100 iterations of the training algorithm for each target word. Regularization is done using Gaussian priors [Chen & Rosenfeld 99].

The final size of a DWL model can be reduced in a straightforward way by applying threshold pruning to the learned weights. After training, we can discard weights that are below a threshold τ, e.g. τ = 0.1. The computational resources that are required for training are not decreased because the lexical weights have to be obtained first.

In order to accelerate DWL training, we can formulate a model variant where lexical weights λ_{f,e,·} are trained only for pairs (f, e) of source and target words that appear together in some sentence pair in the training corpus. The function from Equation (4.9) is changed to

g_e(E_e, F) = \lambda_{e,\cdot} + \sum_{f : (f,e) \in W} F_f \, \lambda_{f,e,E_e}     (4.10)

with W being the set of seen word pairs. We refer to the discriminative word lexicon model variant that is trained only for seen pairs as sparse DWL.
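To make Equations (4.8)–(4.10) concrete, the following sketch evaluates the per-word probability p_e(E_e|F) as a binary log-linear model over source word counts. All identifiers are hypothetical; the actual models are trained with the improved RProp algorithm and Gaussian priors as described above.

    import math
    from collections import Counter

    def dwl_word_prob(lambdas, prior_e, source_counts, E_e):
        """p_e(E_e | F) for one target word e, Eqs. (4.8)/(4.9).
        lambdas: dict (f, E_e) -> lambda_{f,e,E_e}; prior_e: prior weight lambda_{e,.};
        source_counts: counts F_f of the source words in the input sentence."""
        def g(value):
            return prior_e + sum(count * lambdas.get((f, value), 0.0)
                                 for f, count in source_counts.items())
        num = math.exp(g(E_e))
        return num / (math.exp(g(0)) + math.exp(g(1)))

    # The sparse DWL of Eq. (4.10) simply stores lambdas only for word pairs (f, e)
    # that co-occur in some training sentence pair; unseen pairs contribute nothing.
    F = Counter(["der", "mann", "liest", "ein", "buch"])
    p_book = dwl_word_prob({("buch", 1): 1.7, ("liest", 1): 0.4}, -2.0, F, 1)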

4.4.1 DWL Feature Function with Global Source Context

To apply discriminative word lexicon models in hierarchical search, we need to be able to calculate feature scores for partial hypotheses, where the full set of target words that will be included in the complete translation is not known yet.

The term in Equation (4.7) can be decomposed into two components,

p(E | F) = \prod_{e \in V_E : E_e = 0} p_e(0 | F) \cdot \prod_{e \in V_E : E_e = 1} p_e(1 | F) ,     (4.11)

and then further rewritten as

p(E | F) = \prod_{e \in V_E} p_e(0 | F) \cdot \prod_{e \in V_E : E_e = 1} \frac{p_e(1 | F)}{p_e(0 | F)} .     (4.12)


For a given source sentence, the term on the left side of the product in Equation (4.12) remains constant during decoding and can be ignored.

Using the notational conventions from Sections 1.2 and 2.1 again, we define the feature function for the discriminative word lexicon model with global source sentence context in hierarchical translation as

h_{s2tDWL}(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t_{s2tDWL}(r, f_1^J)     (4.13)

with the scoring function

t_{s2tDWL}(r, f_1^J) = \sum_{i=1}^{I_\beta} \log \frac{p_{\beta_i}(1 | F)}{p_{\beta_i}(0 | F)} .     (4.14)

In practice we transform the scoring function further, simplify it, and conduct all computations as summations of logarithmic values for better numerical stability.
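A sketch of the corresponding rule-level scoring: the log-ratio log p_e(1|F) − log p_e(0|F) is computed at most once per target word for the fixed source sentence, and t_s2tDWL(r, f_1^J) is then a plain sum over the target terminals of the rule. The names are hypothetical; the per-word probability could be supplied, for example, by the sketch shown earlier in Section 4.4.

    import math

    def make_dwl_rule_scorer(word_prob):
        """word_prob(e, E_e) returns p_e(E_e | F) for the current source sentence.
        The returned function computes t_s2tDWL(r, f_1^J), Eq. (4.14), for the
        target terminals of a rule, caching the log-ratios per target word."""
        cache = {}

        def log_ratio(e):
            if e not in cache:
                cache[e] = math.log(word_prob(e, 1)) - math.log(word_prob(e, 0))
            return cache[e]

        def rule_score(target_terminals):
            return sum(log_ratio(e) for e in target_terminals)

        return rule_score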

4.5 Experiments

4.5.1 Experimental Setup

We build both hierarchical systems and standard phrase-based systems using a training corpus of 2.5 M Arabic–English sentence pairs. The training corpus is the same as in Section 3.2; see Table 3.2 for data statistics. Also, all of the systems employ the same 4-gram language model as used for our previous Arabic→English experiments. The NIST MT06 set is used for optimization; MT08 is used as a test set. See Table 3.4 in Section 3.2 for data statistics of the Arabic MT06 and MT08 sets.

4.5.1.1 Hierarchical Systems

A shallow grammar and our default baseline features are utilized for the hierarchical translation system. We run the cube pruning algorithm with a k-best limit of 500 and the LM recombination scheme. Observation histogram pruning is configured at a value of 50, i.e. we load only the 50 best phrase translation options with respect to the weighted phrase-level model scores. During decoding, a maximum length constraint of ten is applied to all non-terminals except the initial symbol S. Overall, the hierarchical system setup for the experiments in the current section is the same as for the previous experiments from Section 3.2.

4.5.1.2 Phrase-based Systems

Our standard phrase-based machine translation system operates in the way described by [Zens & Ney 08]. Phrase translation and lexical translation models in both directions, phrase penalty and word penalty, a binary feature that indicates a source phrase length of 1, a distance-based distortion model, and the language model are incorporated into the log-linear model combination. Phrase reorderings are restricted with IBM reordering constraints [Zens & Ney+ 04b].

4.5.1.3 Extended Lexicon Model Training

We train triplet models and sparse DWL models on a manually selected high-quality subset of the parallel data of 717 K sentence pairs. The full DWL model is trained on a smaller subset of 277 K sentence pairs.


Triplet models. We experiment with unconstrained triplet lexicon models and path-constrained triplet lexicon models with different count cutoffs. The number of iterations for EM training is six in all cases. Two different unconstrained triplet models are considered: the first one is trained on all triplets that are seen at least seven times in the training corpus (min. count 7), the second one only on triplets that occur ten or more times (min. count 10). A path-constrained triplet model is trained without any count cutoff, and three other path-constrained triplet models with triplets discarded that appear less than two, three, and four times in the training data, respectively. We use symmetrized GIZA++ alignments in the training of the path-constrained triplet models, just as for phrase extraction. Details on the sizes of the models and on the computational requirements for their training are shown in Table 4.1.

DWL models. We train a full DWL model and a sparse DWL model. Because training runtimes for the full model are considerably higher than for the sparse model variant, we employ a smaller training corpus for the full model, as already mentioned above (277 K sentences instead of 717 K). After training, the full DWL model is pruned with a threshold of 0.1. Details on the average number of features of each model and on the computational requirements for their training are given in Table 4.2. The full model is denoted simply as DWL, the sparse model as sparse DWL.

4.5.2 Experimental Results

The translation results of all our systems on the test set and also on the development set are listed in Table 4.3. The standard phrase-based baseline has a small advantage over the hierarchical baseline, with the latter performing 0.3 points Bleu worse on the test set. If we integrate a (full or sparse) DWL model, the situation is reversed: we see better translation quality with the hierarchical system. The hierarchical system with full DWL model outperforms the phrase-based system with full DWL model by 0.6 points Bleu, indicating that the hierarchical system benefits more from the enhancement with a discriminative word lexicon.

The sparse DWL is less effective in decoding than the full DWL that has been trained involving unseen features and pruned after training with a threshold of 0.1. But even the sparse DWL yields +1.4 Bleu over the hierarchical baseline, which is a decent improvement.

Looking at the experimental results with triplet models, we see a more mixed picture, with path-constrained triplets possibly being minimally more helpful in standard phrase-based translation than in hierarchical translation. Unconstrained triplets with a minimum count of ten give the largest improvement in hierarchical translation (+1.1 Bleu). The same triplet model does not perform quite as well in standard phrase-based translation, where the largest gain is achieved using a path-constrained triplet model without singleton triplets (+0.7 Bleu). The best triplet-augmented hierarchical system and the best triplet-augmented phrase-based system are pretty much on par (45.5 Bleu vs. 45.4 Bleu).

By adding both a discriminative word lexicon and a triplet lexicon model, we obtain a further gain over integrating either of the two types of lexicon models in isolation in the phrase-based system. In hierarchical translation, on the other hand, we see good results when adding both types of models, but on the test set we do not improve beyond the Bleu and Ter scores with the full DWL only, though the development set scores get minimally better. The best result with a hierarchical setup is 46.2 Bleu on the test set (+1.8 Bleu improvement over the hierarchical baseline); the best result with a phrase-based setup is 46.0 Bleu (+1.3 Bleu improvement over the phrase-based baseline, but 0.2 points Bleu behind the best hierarchical setup).


4.6 Summary

We applied two types of extended lexicon models, the triplet lexicon and the discriminative word lexicon, with global source sentence context in hierarchical decoding. Our experimental results show that both types of extended lexicon models yield nice improvements in hierarchical phrase-based translation. We have been able to achieve translation quality that improves over a comparable standard phrase-based system by a small margin.

In addition, we demonstrated that restricted variants of the triplet lexicon and of the discriminative word lexicon model require less computational effort in training but still considerably raise translation quality.

¹ Note that the results with hierarchical systems in Table 4.3 differ from those reported in [Huck & Ratajczak+ 10]. We reran all hierarchical experiments based on a slightly different hierarchical baseline to be consistent with other experiments in this thesis. The main difference between the experiments in [Huck & Ratajczak+ 10] and the ones reported here is that in [Huck & Ratajczak+ 10] we added special sentence start and sentence end symbols to the training data before extracting the hierarchical phrase table, whereas this is not done here.


Table 4.1: Sizes and computational demands in training for the triplet models.

                                               no. of triplets   training time   training mem.
                                                                 [h:min]         [GB]
  triplets (min. count 7)                      140.4 M           34:48            7.1
  triplets (min. count 10)                      98.8 M           32:53            4.8
  path-constrained triplets                    128.6 M            3:11           11.0
  path-constrained triplets (min. count 2)      45.0 M            2:27            3.8
  path-constrained triplets (min. count 3)      27.1 M            2:29            2.2
  path-constrained triplets (min. count 4)      20.2 M            2:27            1.6

Table 4.2: Average number of features per target word and average training time per target word for the DWL models. Note that the model denoted as DWL has been pruned after training with a threshold of 0.1. The number of features per target word which have to be considered during the training of this model is equal to the size of the source vocabulary of the training corpus, i.e. 122 592 in this case.

                                            avg. no. of features      avg. training time [s]
                                            per target word           per target word
  DWL (full, pruned with threshold 0.1)     80 (unpruned: 122 592)    225
  sparse DWL                                510                        64

Table 4.3: Experimental results (cased) with extended lexicon models for the NIST Arabic→English translation task.¹

                                                      MT06 (Dev)                 MT08 (Test)
                                                      HPBT         PBT           HPBT         PBT
                                                      Bleu   Ter   Bleu   Ter    Bleu   Ter   Bleu   Ter
  NIST Arabic→English                                 [%]    [%]   [%]    [%]    [%]    [%]   [%]    [%]
  Baseline                                            44.1   49.9  44.1   49.4   44.4   49.4  44.7   49.1
  DWL                                                 45.8   48.3  45.1   48.4   46.2   48.3  45.6   48.4
  sparse DWL                                          45.3   48.6  44.8   48.8   45.8   48.4  45.3   48.7
  triplets (min. count 7)                             45.3   48.3  44.8   48.8   45.1   48.5  45.2   48.6
  triplets (min. count 10)                            45.3   48.7  44.6   49.2   45.5   48.7  44.9   49.0
  path-constrained triplets                           44.7   49.6  44.7   49.1   44.9   49.3  45.3   48.7
  path-constrained triplets (min. count 2)            44.7   49.4  44.8   48.9   44.9   49.4  45.4   48.8
  path-constrained triplets (min. count 3)            44.5   49.6  44.5   49.3   45.0   49.5  45.0   49.1
  path-constrained triplets (min. count 4)            44.6   49.7  44.5   49.5   44.8   49.6  44.9   49.3
  DWL + triplets (min. count 10)                      45.9   48.3  45.1   48.5   46.0   48.3  45.5   48.5
  DWL + path-constrained triplets                     45.8   48.2  45.1   48.6   46.2   48.3  45.8   48.3
  DWL + path-constrained triplets (min. count 2)      46.0   48.2  45.4   48.4   46.2   48.3  46.0   48.3


5. Lexical Smoothing Variants

In this chapter, we further investigate lexicon models for hierarchical phrase-based statistical machine translation, but now with a special focus on lexical smoothing. We study five types of lexicon models: a model which is extracted from word-aligned training data and, given the symmetrized word alignment, relies on pure relative frequencies [Koehn & Och+ 03]; the IBM model 1 lexicon [Brown & Della Pietra+ 93]; a regularized version of IBM model 1; a triplet lexicon model variant [Hasan & Ney 09]; and the discriminative word lexicon model [Mauser & Hasan+ 09]. We explore source-to-target models with phrase-level as well as sentence-level scoring and target-to-source models with scoring on phrase level only. For the first two types of lexicon models, we compare several scoring variants. All models are used during search, i.e. they are incorporated directly into the log-linear model combination of the decoder.

Phrase table smoothing with triplet lexicon models and with discriminative word lexicons is a novel contribution. We also propose a new regularization technique for IBM model 1 by means of the Kullback-Leibler divergence with the empirical unigram distribution as a regularization term.

Experiments are carried out on the NIST Chinese→English translation task. We achieve the best results by using the discriminative word lexicon to smooth our phrase tables.

5.1 Motivation

Lexical scoring on phrase level is the standard technique for phrase table smoothing in statistical machine translation [Koehn & Och+ 03, Zens & Ney 04a]. Since most of the longer phrases appear only sparsely in the training data, their translation probabilities are overestimated when using relative frequencies to obtain conditional probabilities. One way to counteract overestimation of phrase pairs for which little evidence in the training data exists is to score phrases with word-based models and to interpolate these lexical probabilities with the phrase translation probabilities. Interpolation of the models is usually done log-linearly as part of the combination of feature functions of the translation system. In this way the interpolation parameter can be tuned directly towards the metric of translation quality, e.g. Bleu or Ter, on a development set.

Lexicon models in both source-to-target and target-to-source direction are thus a crucial component of standard phrase-based systems, and likewise hierarchical phrase-based systems.

In addition to phrase table smoothing, lexicon models are often applied on sentence level to rerank the n-best candidate translations of the decoder [Och & Gildea+ 04, Mauser & Zens+ 06, Hasan & Ganitkevitch+ 08]. In reranking, the complete translation is available and global target sentence context can be taken into account. Both source-to-target and target-to-source models may be used.

Lexicon models in source-to-target direction are sometimes applied to score the target side of phrases given the whole source sentence during decoding already [Mauser & Hasan+ 09], as we have done in the previous chapter. This can be accomplished quite efficiently since the given source sentence does not change. Phrase-level scoring, on the other hand, has the advantage that the scores do not have to be calculated on demand for each hypothesis expansion, but can be precomputed in advance and written to the phrase table.

The triplet lexicon model and the discriminative word lexicon model have only been applied using sentence-level context before. For the DWL model, results in target-to-source direction have never been reported. We demonstrate that especially the DWL model performs very well on phrase level in both directions compared to the other types of lexicon models, and, surprisingly, that limiting the context to phrase level does not harm translation quality in the hierarchical system.

While phrase table smoothing with the DWL model performs better than with IBM model 1 with respect to both metrics we use (Bleu and Ter), the conceptually appealing approach of extending IBM model 1 with a regularization term reduces the errors made by the system with regard to our secondary metric (Ter) only. We show that on the NIST Chinese→English task, the DWL model and both standard and regularized IBM model 1 clearly outperform the lexicon model which is extracted from word-aligned training data, though the latter one is probably most commonly used in setups reported in the literature.

5.2 Related Work

The well-known IBM model 1 lexicon was introduced by [Brown & Della Pietra+ 93]. IBM model 1 is still employed within the widely used GIZA++ toolkit [Och & Ney 03] as part of the word alignment training, which is the basis of modern phrase-based machine translation. Besides, it can be helpful as an additional model in the log-linear combination or in n-best reranking [Och & Gildea+ 04, Mauser & Zens+ 06, Hasan & Ganitkevitch+ 08]. [Moore 04] suggested improvements to IBM model 1 parameter estimation, including an add-n smoothing technique which could be modeled within our IBM model 1 regularization framework. [Toutanova & Galley 11] pointed out that the optimization problem for IBM model 1 is not strictly convex.

Word lexicon models extracted from the alignment have been proposed by [Koehn & Och+ 03] and [Zens & Ney 04a] and applied in their respective translation systems for phrase table smoothing. [Foster & Kuhn+ 06] compared several strategies for phrase table smoothing, including the former two. [Chiang & DeNeefe+ 11] suggested morphology-based and provenance-based improvements to the [Koehn & Och+ 03] method.

5.3 Lexicon Models

We describe the source-to-target directions of the models in the following sections. The reverse models and scoring functions are computed similarly.

5.3.1 Word Lexicon from Word-aligned Data

Given a word-aligned parallel training corpus, we are able to estimate single-word based translation probabilities p_RF(e|f) by relative frequency.

Let [f_1^{J_s}; e_1^{I_s}; \{a_{ij}\}_s], 1 ≤ s ≤ S, be training samples of S word-aligned sentence pairs, where {a_ij}_s denotes the word alignment of the s-th sentence pair. Let j ∈ {a_i} express that f_j is aligned to the target word e_i.

We can now define (possibly fractional) counts

N_s(e, f) = \sum_{e_{is} :\, e_{is} = e} \; \sum_{f_{js} :\, f_{js} = f,\, j \in \{a_i\}_s} \frac{1}{|\{a_i\}_s|}     (5.1)

for 1 ≤ s ≤ S. If an occurrence e_i of e has multiple aligned source words, each of the |{a_i}| > 1 alignment links contributes with a fractional count of 1/|{a_i}|.

e and source word f

N(e, f) =∑s

Ns(e, f). (5.2)

The probabilities pRF(e|f) can then be computed as

pRF(e|f) =N(e, f)∑e′ N(e′, f)

. (5.3)

This model is most similar to the one presented by [Koehn & Och+ 03]. One difference we makeis that we do not assume unaligned words to be aligned to the empty word (NULL). Probabilitieswith the empty word are thus not included in our lexicon. If scoring with the empty word isdesired, we use a constant value of 0.05. The model does not apply the discounting technique of[Zens & Ney 04a]. We are going to denote it as relative frequency (RF) word lexicon throughoutthis chapter.
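A compact sketch of Equations (5.1)–(5.3): fractional counts are accumulated from the word alignment and normalized per source word. The corpus representation and function names are hypothetical, not the actual extraction code.

    from collections import defaultdict

    def rf_lexicon(corpus):
        """corpus: iterable of (source, target, alignment) with alignment a set of
        (i, j) pairs meaning target word target[i] is aligned to source word source[j].
        Returns p_RF(e|f) by relative frequency, Eqs. (5.1)-(5.3)."""
        counts = defaultdict(float)       # N(e, f)
        marginals = defaultdict(float)    # sum over e' of N(e', f)
        for source, target, alignment in corpus:
            links = defaultdict(list)
            for i, j in alignment:
                links[i].append(j)
            for i, js in links.items():
                frac = 1.0 / len(js)      # each of the |{a_i}| links contributes 1/|{a_i}|
                for j in js:
                    counts[(target[i], source[j])] += frac
                    marginals[source[j]] += frac
        return {(e, f): c / marginals[f] for (e, f), c in counts.items()}

    # toy example with a single sentence pair
    lex = rf_lexicon([(["das", "haus"], ["the", "house"], {(0, 0), (1, 1)})])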

5.3.2 IBM Model 1

The IBM model 1 lexicon (IBM-1) is the first and simplest one in a sequence of probabilistic generative models [Brown & Della Pietra+ 93]. The following assumptions are made for IBM-1: the target length I depends on the length J of the source sentence only, each target word is aligned to exactly one source word, the alignment of the target word depends on its absolute position and the sentence lengths only, and the target word depends on the aligned source word only. The alignment probability is in addition assumed to be uniform for IBM-1.

The probability of a target sentence e_1^I given a source sentence f_0^J (with f_0 = NULL) can thus be decomposed as

p(e_1^I | f_1^J) = p(I|J) \cdot \prod_{i=1}^{I} p(e_i | f_1^J)     (5.4)
                 = p(I|J) \cdot \prod_{i=1}^{I} \sum_{j=0}^{J} p(j, e_i | f_j)     (5.5)
                 = p(I|J) \cdot \prod_{i=1}^{I} \sum_{j=0}^{J} p(j | i, I, J) \cdot p_{ibm1}(e_i | f_j)     (5.6)
                 = p(I|J) \cdot \prod_{i=1}^{I} \sum_{j=0}^{J} \frac{1}{J+1} \cdot p_{ibm1}(e_i | f_j)     (5.7)
                 = p(I|J) \cdot \frac{1}{(J+1)^I} \cdot \prod_{i=1}^{I} \sum_{j=0}^{J} p_{ibm1}(e_i | f_j) .     (5.8)

The IBM-1 lexical translation probabilities p_ibm1(e|f) can then be trained iteratively with the EM algorithm using the maximum likelihood criterion. The iterative equation for training is for instance given by [Vogel & Ney+ 96].
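For illustration, one EM iteration for the IBM-1 probabilities p_ibm1(e|f) under the uniform alignment assumption and with the empty word can be sketched as follows (hypothetical identifiers; the IBM-1 models used in this work are produced with GIZA++):

    from collections import defaultdict

    NULL = "<null>"

    def ibm1_em_iteration(corpus, p):
        """One EM iteration for p_ibm1(e|f). corpus: iterable of (source, target)
        word lists; p: dict (e, f) -> current probability (e.g. initialized uniformly).
        Returns the re-estimated lexicon."""
        counts = defaultdict(float)
        totals = defaultdict(float)
        for source, target in corpus:
            f_words = [NULL] + list(source)
            for e in target:
                # E-step: posterior over source positions under the uniform alignment prior
                denom = sum(p.get((e, f), 1e-10) for f in f_words)
                for f in f_words:
                    gamma = p.get((e, f), 1e-10) / denom
                    counts[(e, f)] += gamma
                    totals[f] += gamma
        # M-step: renormalize per source word
        return {(e, f): c / totals[f] for (e, f), c in counts.items()}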

5.3.3 Scoring Variants

Several methods to score phrase pairs with RF word lexicons or IBM-1 lexicons have been suggested in the literature and are in common use. We apply and compare four of them. Recall the notational conventions from Sections 1.2 and 2.1 for the following formulas. We define phrase scoring functions t(·) in an analogous manner as for most of the baseline feature function definitions in Chapter 2. The basic principle for obtaining an actual feature function value h(f_1^J, e_1^I, d) from the respective phrase scores t(r) is h(f_1^J, e_1^I, d) = \sum_{r \in R(d)} t(r). We can pass additional arguments to phrase scoring functions, such as the word alignments on phrase level {a_ij}, or the source sentence f_1^J, or both. Phrase scoring functions for features that take such information into account will then be of the form t(r, {a_ij}), t(r, f_1^J), or t(r, {a_ij}, f_1^J). We have seen examples in Sections 4.4.1 and 4.3.1 when we considered global source sentence context for extended lexicon model scoring, and when we scored with the path-constrained triplet lexicon model.

Our first scoring variant t_Norm(·) uses an IBM-1 or RF lexicon model p(e|f) to rate the quality of a target side β given the source side α of a hierarchical rule r with an included length normalization:

t_{Norm}(r) = \sum_{i=1}^{I_\beta} \log \left( \frac{p(\beta_i | NULL) + \sum_{j=1}^{J_\alpha} p(\beta_i | \alpha_j)}{1 + J_\alpha} \right)     (5.9)

This variant has e.g. been used by [Vilar & Stein+ 10]. By dropping the length normalization we arrive at our second variant t_NoNorm(·):

t_{NoNorm}(r) = \sum_{i=1}^{I_\beta} \log \left( p(\beta_i | NULL) + \sum_{j=1}^{J_\alpha} p(\beta_i | \alpha_j) \right)     (5.10)

Among others, [Mauser & Zens+ 06] apply this variant in their standard phrase-based system. Our third scoring variant t_NoisyOr(·) is the noisy-or model proposed by [Zens & Ney 04a]:

t_{NoisyOr}(r) = \sum_{i=1}^{I_\beta} \log \left( 1 - \prod_{j=1}^{J_\alpha} (1 - p(\beta_i | \alpha_j)) \right)     (5.11)

The fourth scoring variant t_Moses(·) is due to [Koehn & Och+ 03] and is the standard method in the open source Moses system [Koehn & Hoang+ 07]:

t_{Moses}(r, \{a_{ij}\}) = \sum_{i=1}^{I_\beta} \log \begin{cases} \frac{1}{|\{a_i\}|} \sum_{j \in \{a_i\}} p(\beta_i | \alpha_j) & \text{if } |\{a_i\}| > 0 \\ p(\beta_i | NULL) & \text{otherwise} \end{cases}     (5.12)

where j ∈ {a_i} expresses that α_j is aligned to β_i. This last variant requires the availability of word alignments {a_ij} for phrase pairs (cf. Section 4.3.1). We store the most frequent alignment during phrase extraction and use it to compute t_Moses(·).

Note that all of these scoring methods generalize to hierarchical phrase pairs which may be only partially lexicalized. Unseen events are scored with a small floor value.
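The four phrase scoring variants of Equations (5.9)–(5.12) differ only in how the lexicon probabilities are combined per target terminal. A Python sketch with hypothetical names: p is a lexicon stored as a dict from (e, f) to a probability, alpha and beta are the terminal symbols of the source and target side of a rule, and links maps each target position to its set of within-phrase alignment points.

    import math

    FLOOR = 1e-7   # floor value for unseen events, as described above
    NULL = "<null>"

    def lookup(p, e, f):
        return p.get((e, f), FLOOR)

    def t_norm(p, alpha, beta):
        # Eq. (5.9): sum over source terminals plus NULL, with length normalization
        return sum(math.log((lookup(p, e, NULL) + sum(lookup(p, e, f) for f in alpha))
                            / (1 + len(alpha))) for e in beta)

    def t_no_norm(p, alpha, beta):
        # Eq. (5.10): as above, but without the length normalization
        return sum(math.log(lookup(p, e, NULL) + sum(lookup(p, e, f) for f in alpha))
                   for e in beta)

    def t_noisy_or(p, alpha, beta):
        # Eq. (5.11): noisy-or combination over the source terminals
        return sum(math.log(max(1.0 - math.prod(1.0 - lookup(p, e, f) for f in alpha), FLOOR))
                   for e in beta)

    def t_moses(p, alpha, beta, links):
        # Eq. (5.12): average over the source words aligned to each target terminal;
        # links[i] is the set {a_i} of source positions aligned to beta[i]
        score = 0.0
        for i, e in enumerate(beta):
            aligned = links.get(i, set())
            if aligned:
                score += math.log(sum(lookup(p, e, alpha[j]) for j in aligned) / len(aligned))
            else:
                score += math.log(lookup(p, e, NULL))
        return score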

If not stated otherwise explicitly, we score with t_Norm(·) (Eq. (5.9)) in our experiments. Source-to-target sentence-level scores are calculated analogously to Equation (5.9), but with the difference that the quality of the target side β of a rule currently chosen to expand a partial hypothesis is rated given the whole input sentence f_1^J instead of the source side α of the rule only:

t_{NormSentence}(r, f_1^J) = \sum_{i=1}^{I_\beta} \log \left( \frac{p(\beta_i | NULL) + \sum_{j=1}^{J} p(\beta_i | f_j)}{1 + J} \right)     (5.13)


5.3.4 Regularized IBM Model 1

Despite the wide use of the IBM model 1, basic modeling problems such as non-strict convexity, overfitting, and the use of heuristics for unseen events have so far not been resolved algorithmically. We propose extending IBM-1 with the Kullback-Leibler (KL) divergence of the IBM-1 probabilities with respect to a smooth reference distribution p_ref as a regularization term:

r(p) = \sum_f D_{KL}(p_{ref}(\cdot|f) \,\|\, p(\cdot|f)) = \sum_f \sum_e p_{ref}(e|f) \log \frac{p_{ref}(e|f)}{p(e|f)}     (5.14)

For p_ref we choose the empirical unigram distribution

p_{ref}(e|f) = p(e) .     (5.15)

An advantage of the KL regularization term is that it can be easily integrated into the EM algorithm. Taking the derivative of the new auxiliary function which includes the regularization term, we obtain a weighted average of the reference distribution and the unregularized update as the EM update formula of the regularized IBM model 1:

p(e|f) = \frac{1}{Z(f)} \left( \sum_s c_s(e|f) + C \cdot p_{ref}(e|f) \right) ,     (5.16)

where

Z(f) = \sum_{e'} \sum_s c_s(e'|f) + C .     (5.17)

With s we denote the training samples, c_s(e|f) is the expected count of e given f calculated exactly as in the original IBM model 1, and C > 0 denotes the regularization constant.

By using regularization, we gain two advantages: (1.) Over-fitting is avoided and training can be performed until convergence, and (2.) the use of small floor values for unseen events is not required anymore because unseen event probabilities can be computed on the fly when the model is applied during decoding.
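The regularized update of Equations (5.16)/(5.17) changes only the M-step of IBM-1 training: expected counts are collected as before, and unseen events receive probability mass from the reference distribution without any floor value. A sketch under these assumptions (hypothetical identifiers):

    from collections import defaultdict

    def regularized_m_step(expected_counts, p_ref, C):
        """expected_counts: dict (e, f) -> sum over s of c_s(e|f), collected in the
        usual IBM-1 E-step; p_ref: empirical unigram distribution p(e); C > 0.
        Returns a function p(e|f) following Eqs. (5.16)/(5.17)."""
        totals = defaultdict(float)            # sum over e' and s of c_s(e'|f)
        for (e, f), c in expected_counts.items():
            totals[f] += c

        def prob(e, f):
            Z = totals[f] + C                  # Eq. (5.17)
            # Eq. (5.16); also defined for pairs (e, f) without expected counts,
            # so no separate floor value is needed for unseen events
            return (expected_counts.get((e, f), 0.0) + C * p_ref.get(e, 0.0)) / Z

        return prob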

5.3.5 Triplet Lexicon

The triplet lexicon relies on triplets which are composed of two source language words triggering one target language word, i.e. it models probabilities p(e|f, f'). We use the path-constrained triplet model variant as described in Section 4.3.

For source-to-target phrase-level scoring with a path-constrained triplet lexicon model, we can modify Equation (4.2) to

t_{s2tConstrainedTripletPhrase}(r, \{a_{ij}\}) = \sum_{i=1}^{I_\beta} \log \sum_{j \in \{a_i\}} \frac{1}{Z_i} \sum_{j'=j+1}^{J_\alpha} p(\beta_i | \alpha_j, \alpha_{j'})     (5.18)

with normalization factor Z_i = J_\alpha \cdot |\{a_i\}|. j ∈ {a_i} expresses that α_j is aligned to β_i.¹ We score with NULL as a trigger as well but omitted this in the formula for notational convenience.

Probabilities p(f|e, e') of two target words triggering a source word are modeled by a target-to-source triplet lexicon. With phrase-level scoring, we can employ a target-to-source triplet feature in search.

¹ Note that in Equation (5.18) we use the within-phrase word alignment {a_ij}, whose indices refer to the terminal symbols of the rule, rather than an alignment to the full source sentence as previously in Equation (4.2). See Section 4.3.1 for the different meanings of the two notations.


Table 5.1: Experimental results (cased) with different lexicon models in lexical smoothing for the NIST Chinese→English translation task. s2t denotes source-to-target scoring, t2s target-to-source scoring.

                                                       MT06 (Dev)       MT08 (Test)
                                                       Bleu    Ter      Bleu    Ter
  NIST Chinese→English                                 [%]     [%]      [%]     [%]
  Baseline 1 (no phrase table smoothing)               32.0    62.2     24.3    67.8
  + phrase-level s2t+t2s RF word lexicons              32.6    61.2     25.2    66.6
  + phrase-level s2t+t2s IBM-1                         33.9    60.5     26.7    65.6
  + phrase-level s2t+t2s regularized IBM-1             33.7    60.2     26.6    65.2
  + phrase-level s2t+t2s path-constrained triplets     32.6    61.8     25.5    66.7
  + phrase-level s2t+t2s DWL                           33.7    60.5     27.0    65.6

The scoring function in target-to-source direction is a straightforward modification of the scoring function in source-to-target direction,

t_{t2sConstrainedTripletPhrase}(r, \{a_{ij}\}) = \sum_{j=1}^{J_\alpha} \log \sum_{i \in \{a_j\}} \frac{1}{Z_j} \sum_{i'=i+1}^{I_\beta} p(\alpha_j | \beta_i, \beta_{i'})     (5.19)

with normalization factor Z_j = J_\alpha \cdot |\{a_j\}|. i ∈ {a_j} expresses that β_i is aligned to α_j.

5.3.6 Discriminative Word Lexicon

The discriminative word lexicon model acts as a classifier that predicts the words contained in the translation from the words given on the source side, as described in Section 4.4. Despite the computational cost in training, we utilize the full model here and not the sparse variant. Translation quality was a bit better with the full variant in our previous experiments with global source sentence context.

When applying a discriminative word lexicon in source-to-target direction with phrase-level scoring, we do not modify the model training, only the scoring function. The basic formula for scoring essentially remains the same as in Equation (4.14), but we condition on the terminal symbols on the source side of the phrase rather than on the words in the full source sentence. For target-to-source phrase-level scoring with a discriminative word lexicon, we have to train a model in reverse direction. The words contained in the source sentence are predicted from the words on the target side. We omit the formulas because they are simple modifications of the ones given in Section 4.4.

5.4 Experiments

We present empirical results obtained with the different lexicon models and scoring variants on the NIST Chinese→English task.

5.4.1 Experimental Setup

We build hierarchical systems using a training corpus of 3.0 M Chinese–English sentence pairs, just as in Section 3.2; see Table 3.1 for data statistics. The 4-gram language model is also the same as in our previous Chinese→English experiments. We translate with a deep grammar. The cube pruning algorithm with a k-best limit of 1000 and the LM recombination scheme is used to carry out the search. Observation histogram pruning is configured at a value of 100.


Table 5.2: Experimental results (cased) with various scoring functions in lexical smoothing for the NIST Chinese→English translation task. s2t denotes source-to-target scoring, t2s target-to-source scoring.

                                                                     MT06 (Dev)          MT08 (Test)
NIST Chinese→English                                               Bleu [%]  Ter [%]   Bleu [%]  Ter [%]
Baseline 1 (no phrase table smoothing)                               32.0      62.2       24.3      67.8
+ phrase-level s2t+t2s RF word lexicons, Eq. (5.9): tNorm(·)         32.6      61.2       25.2      66.6
+ phrase-level s2t+t2s RF word lexicons, Eq. (5.10): tNoNorm(·)      32.7      61.8       25.6      66.7
+ phrase-level s2t+t2s RF word lexicons, Eq. (5.11): tNoisyOr(·)     32.4      61.2       25.5      66.4
+ phrase-level s2t+t2s RF word lexicons, Eq. (5.12): tMoses(·)       32.7      61.8       25.4      66.9
+ phrase-level s2t+t2s IBM-1, Eq. (5.9): tNorm(·)                    33.9      60.5       26.7      65.6
+ phrase-level s2t+t2s IBM-1, Eq. (5.10): tNoNorm(·)                 33.8      60.5       26.6      65.7
+ phrase-level s2t+t2s IBM-1, Eq. (5.11): tNoisyOr(·)                33.7      60.5       26.7      66.0
+ phrase-level s2t+t2s IBM-1, Eq. (5.12): tMoses(·)                  33.2      61.3       26.0      66.0

Table 5.3: Experimental results (cased) with lexical enhancements considering source sentence context or phrase context in each direction for the NIST Chinese→English translation task. s2t denotes source-to-target scoring, t2s target-to-source scoring.

                                                      MT06 (Dev)          MT08 (Test)
NIST Chinese→English                                Bleu [%]  Ter [%]   Bleu [%]  Ter [%]
Baseline 2 (with s2t+t2s RF word lexicons)            32.6      61.2       25.2      66.6
+ sentence-level s2t IBM-1                            32.9      61.6       25.7      66.6
+ sentence-level s2t path-constrained triplets        33.1      61.1       26.0      66.3
+ sentence-level s2t DWL                              33.0      61.0       26.2      65.5
+ phrase-level s2t IBM-1                              33.0      61.4       26.4      66.1
+ phrase-level s2t path-constrained triplets          33.1      61.3       26.0      66.3
+ phrase-level s2t DWL                                33.4      61.3       26.4      66.3
+ phrase-level t2s IBM-1                              33.4      60.7       26.5      65.7
+ phrase-level t2s path-constrained triplets          33.0      61.5       26.3      66.3
+ phrase-level t2s DWL                                33.8      60.5       26.5      65.7
+ phrase-level s2t+t2s IBM-1                          33.8      60.5       26.9      65.4
+ phrase-level s2t+t2s path-constrained triplets      33.3      61.3       26.3      66.1
+ phrase-level s2t+t2s DWL                            34.0      60.2       27.2      65.2


The symmetrized GIZA++ alignment is used to compute the counts for the RF lexicon model, to train path-constrained triplets, and for phrase extraction. All our lexicon models are trained on the full parallel data; the DWL models are pruned after training with a threshold of 0.1. The IBM-1 models are produced with GIZA++. Phrase-level scores are precomputed and added to the phrase table.

We optimize on the NIST MT06 set and test on MT08. See Table 3.3 in Section 3.2 for data statistics of the Chinese MT06 and MT08 sets. The performance of the systems is evaluated using the two metrics Bleu and Ter. Since Bleu is the optimized measure, Ter mainly serves as an additional metric to verify the consistency of our improvements.

5.4.2 Experimental Results

The empirical evaluation of all our Chinese→English setups is presented in Tables 5.1, 5.2 and 5.3. In the experiments shown in Table 5.1, we applied each one of the five types of lexicon models separately for phrase table smoothing—i.e. on phrase level in both directions—over a baseline that does not utilize any lexical features (Baseline 1). The impact of the scoring variant on the performance of RF word lexicons and IBM-1 models is examined in the series of experiments presented in Table 5.2. In Table 5.3, we took a standard setup including lexical smoothing with the RF word lexicon as a baseline (Baseline 2) to which we added IBM-1, path-constrained triplet and DWL models separately in either source-to-target direction or target-to-source direction or both. For the source-to-target direction, we also set up systems with sentence-level scoring for each of these three models.

Applying IBM-1 for phrase table smoothing brings about a considerably better result than relying on lexical smoothing with the RF lexicon model (+1.5 Bleu / −1.0 Ter). The regularized IBM-1 yields improvements over standard IBM-1 in Ter only (−0.4 Ter). Path-constrained triplets perform slightly better than the RF lexicon. The best phrase table smoothing result is obtained with the DWL model (+1.8 Bleu / −1.0 Ter over the RF lexicon model and +0.3 Bleu over IBM-1).

For the RF word lexicon, scoring with tNorm(·) is a bit worse than the other scoring variants. For IBM-1, tMoses(·) does not perform very well, which could be explained by the fact that this scoring variant is inconsistent with the training conditions of IBM-1.

Source-to-target sentence-level scoring is not better than phrase-level scoring in any of our experiments. Adding target-to-source triplet or DWL models to a standard baseline (Baseline 2), which was not done in any previous work, results in notably better translation quality. The best hypotheses are produced with the system that includes phrase-level DWLs in both directions in addition to lexical smoothing with RF lexicon models (+2.0 Bleu / −1.4 Ter over Baseline 2). Note that, though they perform worse in the phrase table smoothing experiments, RF lexicon models are still valuable in combination with IBM-1, triplets, or DWL models.

5.5 Discussion

Strictly speaking, our improvements over well-known models—more precisely, over source-to-target and target-to-source IBM-1 on phrase level—are rather small (e.g. up to +0.3 Bleu / −0.2 Ter with DWL models instead of IBM-1 on top of Baseline 2). The potentially large gain by simply utilizing a stronger lexical smoothing method is however easily overlooked. As an example, phrase table smoothing with the method we found to perform weakest on our Chinese→English task—word lexicons obtained with relative frequencies from the word alignment and phrase scoring according to Equation (5.9)—used to be the standard technique at RWTH Aachen and has been applied by several system builders in their baseline setups. We thus do not only give a survey and a comparison of known as well as several novel lexical smoothing techniques here,


but also point out the weakness of established lexical feature functions that have been widely used in previous machine translation systems.

5.6 Summary

We investigated five types of lexicon models in source-to-target and target-to-source direction with sentence-level or phrase-level context in a hierarchical phrase-based decoder. For triplet and discriminative word lexicon models, we presented a novel restriction to the phrase level. Restricting the scoring to phrase level has the advantage that the model scores can be precomputed and written to the phrase table. It furthermore enables the efficient use of extended lexicon models in target-to-source direction in addition to the source-to-target direction. In our translation experiments on the NIST Chinese→English task, we were able to obtain the same or better results by means of phrase-level scoring as when considering sentence-level lexical context.

We showed that phrase table smoothing with IBM model 1 or discriminative word lexicons clearly outperforms smoothing with lexicon models which are extracted from parallel training data with symmetrized word alignments. Our novel lexical smoothing with discriminative word lexicon models also yields improvements over IBM model 1. Our best Chinese→English system scores +2.0 Bleu / −1.4 Ter better than the hierarchical system with baseline lexical smoothing.

We gave an empirical comparison of several commonly applied scoring variants. We finally suggested a regularization technique for IBM model 1 and evaluated it within our system, obtaining a reduced error rate with respect to Ter.


6. Reordering Extensions

In this chapter, we propose novel extensions of hierarchical phrase-based systems with a discriminative lexicalized reordering model. We compare different feature sets for the discriminative reordering model and investigate combinations with three types of non-lexicalized reordering rules which are added to the hierarchical grammar in order to allow for more reordering flexibility during decoding. All extensions are evaluated in both hierarchical systems with a deep grammar and hierarchical systems with a shallow grammar. We achieve improvements of up to +1.2 Bleu on the NIST Chinese→English translation task.

6.1 Motivation

In standard phrase-based translation with continuous phrases only and left-to-right hypothesis generation [Koehn & Och+ 03, Zens & Ney 08], reordering is implemented by jumps within the input sentence. The choice of the best order for the target sequence is made based on the language model score of this sequence and a distortion penalty that is computed from the source-side jump distances. Though the space of admissible reorderings is in most cases constrained by a maximum jump width or coverage-based restrictions [Zens & Ney+ 04b] for efficiency reasons, the basic approach of arbitrarily jumping to uncovered positions on the source side is still very permissive. Lexicalized reordering models assist the decoder in making good reordering decisions.

In hierarchical phrase-based machine translation, reordering is modeled implicitly as part of the hierarchical grammar. The hierarchical phrase-based model provides an integrated reordering mechanism through non-terminals that are linked (co-indexed via the ∼ relation) on the source side and on the target side of hierarchical phrases. Hierarchical phrase-based decoders carry out reordering based on this one-to-one relation. Usually neither explicit lexicalized reordering models nor additional mechanisms to perform reorderings that do not result from the application of hierarchical rules are integrated into hierarchical decoders. The reorderings which are being conducted by the hierarchical decoder are a result of the application of SCFG rules, which generally means that there must have been some evidence in the training data for each reordering operation. At first glance one might be tempted to believe that, as a consequence of this, any additional designated reordering extensions would be futile in hierarchical translation. We argue that such a conclusion is false, and we will provide empirical evidence in this work that lexicalized reordering models as well as additional non-lexicalized reordering rules can be highly beneficial in hierarchical systems.

We augment the grammar with more flexible reordering mechanisms based on additional non-lexicalized reordering rules and integrate a discriminative lexicalized reordering model. This kind of model has been shown to perform well when being added to the log-linear model combination of standard phrase-based systems [Zens & Ney 06]. We present an extension of a hierarchical decoder with the discriminative reordering model and evaluate it in setups with a deep grammar and in setups with a shallow grammar. Two different feature sets for the discriminative reordering


model are examined. We report experimental results on the NIST Chinese→English translation task. The best translation quality is achieved with combinations of the extensions with additional reordering rules and with the discriminative reordering model. The overall improvement over the respective baseline system is +1.2 Bleu / −0.6 Ter with a deep grammar and +1.2 Bleu / −0.5 Ter with a shallow grammar.

6.2 Related Work

Lexicalized reordering models are a commonly included component of standard phrase-based machine translation systems. A discriminatively trained lexicalized reordering model like the one employed by us has been examined in a standard phrase-based setting by [Zens & Ney 06] before.

For hierarchical translation, [He & Meng+ 10a] combined an additional BTG-style swap rule with a maximum entropy based lexicalized reordering model and achieved improvements on the NIST Chinese→English task. Their approach is comparable to ours, but their reordering model requires the training of different classifiers for different rule patterns [He & Meng+ 10b]. Extracting training instances separately for several patterns of hierarchical rules yields a dependence on the phrase segmentation. In the more general approach we propose, the definition of the features is independent of the phrase boundaries on the source side. Other groups have attempted to attain superior modeling of reordering effects in their hierarchical systems by examining syntactic annotation, e.g. [Gao & Koehn+ 11] and [Kazemi & Toral+ 15]. [Hayashi & Tsukada+ 10] explored the word-based reordering model by [Tromble & Eisner 09] in hierarchical translation.

The limitation of the recursion depth with shallow grammars [Iglesias & de Gispert+ 09a, de Gispert & Iglesias+ 10] affects the search space and in particular the reordering capabilities of the system. It is therefore basically antipodal to some of the techniques presented here, which allow for even more flexibility during the search process by extending the grammar with specific non-lexicalized reordering rules. Combinations of both techniques are possible, though, and in fact [Iglesias & de Gispert+ 09a] also investigated a maximum phrase jump of 1 (MJ1) reordering model. In the MJ1 experiment, they include a swap rule, but at the same time exclude all hierarchical phrases with gaps.

In [Vilar & Stein+ 10], we previously extended Jane with non-lexicalized rules that permit jumps across whole blocks of symbols and reported improvements on a German→English Europarl task. The technique is inspired by conventional phrase-based IBM-style reordering [Zens & Ney+ 04b].

6.3 Reordering Rules

In this section we describe two types of reordering extensions to the hierarchical grammar. Both of them add specific non-lexicalized reordering rules which facilitate a more flexible arrangement of phrases in the hypotheses. We first present a simple swap rule extension (Section 6.3.1), then we suggest an extension with several additional rules which allow for block jumps (Section 6.3.2). Variants for deep and for shallow grammars are proposed.

6.3.1 Swap Rule

6.3.1.1 Swap Rule for Deep Grammars

In a deep grammar, we can bring in more reordering capabilities by adding a single swap rule

X→ ⟨X∼0X∼1,X∼1X∼0⟩ (6.1)

supplementary to the standard initial rule (Eq. (1.5)) and glue rule (Eq. (1.6)). The swap rule allows adjacent phrases to be transposed.


An alternative with a comparable effect would be to remove the standard glue rule and to add two rules instead, one of them being as in Equation (6.1) and the other a monotonic concatenation rule for the non-terminal X which is symmetric to the swap rule. The latter rule acts as a replacement for the glue rule. This is the approach [He & Meng+ 10a] take. Our approach of keeping the standard glue rule has one advantage, however: we are still able to apply a maximum length constraint to X. The maximum length constraint limits the source-side length of the yield of (sub-)derivations headed by the non-terminal. The lexical span covered by X is typically restricted to ten to make decoding less demanding in terms of computational resources. We would still be able to add a monotonic concatenation rule to our grammar in addition to the standard glue rule. Its benefit is that it entails more symmetry in the grammar. In our variant, sub-derivations which result from applications of the swap rule can fill the gap within hierarchical phrases, while no mechanism to carry out the same in a monotonic manner is available. We refrain from adding a monotonic concatenation rule, since recursive embeddings are possible anyway with the deep grammar. We nevertheless tried the variant with the additional monotonic concatenation rule in a supplementary experiment (cf. Section 6.5.2.2) to make sure that our assumption that this rule is dispensable is correct. We were not able to obtain improvements over the setup with the swap rule only.

6.3.1.2 Swap Rule for Shallow Grammars

In a shallow grammar, several directions of integrating swaps are possible. We decided to add a swap rule and a monotonic concatenation rule

XP → ⟨XP∼0 XP∼1, XP∼1 XP∼0⟩
XP → ⟨XP∼0 XP∼1, XP∼0 XP∼1⟩     (6.2)

supplementary to the standard shallow initial rules (Eq. (1.7)) and glue rules (Eq. (1.8)). The swap rule allows adjacent lexical phrases to be transposed, but not hierarchical phrases. Here, we could as well have used XH as the left-hand side of the rules. Because we chose XP and thus allow for embedding of sub-derivations resulting from applications of the swap rule into hierarchical phrases, which is not possible with sub-derivations resulting from applications of hierarchical rules in a shallow grammar, we also include the monotonic concatenation rule for symmetry reasons. A constraint can again be applied to the number of source terminals spanned by both XP and XH. With a length constraint, building sub-derivations of arbitrary length by applying the rules from Equation (6.2) is impossible.

6.3.2 Jump Rules

Instead of employing a swap rule that transposes adjacent phrases, we can adopt more complex extensions to the grammar that implement jumps across blocks of symbols. Our jump rules allow an arbitrary number of blocks per sentence to be jumped across. We however do not allow for a recursive application of block jumps, or for block jumps within gaps of hierarchical rules. Reorderings with the jump rules are thus more restricted than with the swap rule. Our jump rule set keeps the convenient property of the hierarchical grammar that the start non-terminal S needs to be expanded in the leftmost cells of the CYK+ parse chart only.


6.3.2.1 Jump Rules for Deep Grammars

In a deep grammar, to enable block jumps, we include the rules

S → ⟨B∼0 X∼1, X∼1 B∼0⟩              †
S → ⟨S∼0 B∼1 X∼2, S∼0 X∼2 B∼1⟩      †
B → ⟨X∼0, X∼0⟩
B → ⟨B∼0 X∼1, B∼0 X∼1⟩              ‡     (6.3)

in addition to the standard initial rule and glue rule. The rules marked with † are jump rules that put jumps across blocks (B) on the source side into effect. The rules with B on their left-hand side enable blocks that are skipped by the jump rules to be translated, but without further jumps. Reordering within these windows is only possible with hierarchical rules.

A binary jump feature for the two jump rules (†) may be added to the log-linear model combination of the decoder, as well as a binary feature that fires for the rule that acts analogously to the glue rule, but within blocks that are being jumped across (‡). A maximum jump width can be established by applying a length constraint to the non-terminal B. A distance-based distortion feature can also easily be implemented by computing the length of the source yield of sub-derivations with root non-terminal B on the right-hand side of the jump rules in decoding.
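For concreteness, the deep-grammar jump rule set from Equation (6.3) can be written down as plain data, e.g. as Python tuples of (left-hand side, source right-hand side, target right-hand side, feature tag). The feature tags only indicate which rules would fire the binary jump feature (†) and the within-block feature (‡) described above; the representation itself is illustrative and not the actual grammar format used by the decoder:

    # Deep-grammar block jump rules of Equation (6.3); '~k' marks co-indexed non-terminals.
    JUMP_RULES_DEEP = [
        ("S", ["B~0", "X~1"],        ["X~1", "B~0"],        "jump"),      # (dagger) jump across a block
        ("S", ["S~0", "B~1", "X~2"], ["S~0", "X~2", "B~1"], "jump"),      # (dagger) further jumps to the right
        ("B", ["X~0"],               ["X~0"],               None),        # translate a skipped block
        ("B", ["B~0", "X~1"],        ["B~0", "X~1"],        "in_block"),  # (double dagger) concatenation inside a block
    ]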

6.3.2.2 Jump Rules for Shallow Grammars

In a shallow grammar, block jumps are realized in the same way as in a deep one, but the number of rules that are required is doubled. We include

S → ⟨B∼0 XP∼1, XP∼1 B∼0⟩             †
S → ⟨B∼0 XH∼1, XH∼1 B∼0⟩             †
S → ⟨S∼0 B∼1 XP∼2, S∼0 XP∼2 B∼1⟩     †
S → ⟨S∼0 B∼1 XH∼2, S∼0 XH∼2 B∼1⟩     †
B → ⟨XP∼0, XP∼0⟩
B → ⟨XH∼0, XH∼0⟩
B → ⟨B∼0 XP∼1, B∼0 XP∼1⟩             ‡
B → ⟨B∼0 XH∼1, B∼0 XH∼1⟩             ‡     (6.4)

in addition to the standard shallow initial rules and glue rules.

6.4 Discriminative Lexicalized Reordering Model

Our discriminative reordering extensions for hierarchical phrase-based machine translation systems integrate a discriminative reordering model that predicts the orientation of neighboring blocks. We use two orientation classes left and right, in the same manner as described by [Zens & Ney 06]. The reordering model is applied at the phrase boundaries only, where words which are adjacent to gaps within hierarchical phrases are defined as boundary words as well. The orientation probability is modeled in a maximum entropy framework. We investigate two models that differ in the set of feature functions:

discrim. RO (src word): The feature set of this model consists of binary features based on the source word at the current source position.


Figure 6.1: Illustration of an embedding of a lexical phrase (light) in a hierarchical phrase (dark), with orientations scored with the neighboring blocks.

discrim. RO (src+tgt word+class): The feature set of this model consists of binary features based on the source word and word class at the current source position and the target word and word class at the current target position.

Using features that depend on word classes provides generalization capabilities. We employ 100 automatically learned word classes which are obtained with the mkcls tool on both source and target side.¹ The reordering model is trained with the Generalized Iterative Scaling (GIS) algorithm [Darroch & Ratcliff 72] with the maximum class posterior probability as training criterion, and it is smoothed with a Gaussian prior [Chen & Rosenfeld 99].
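To make the two feature sets concrete, the sketch below enumerates the binary features for one orientation decision and evaluates the resulting maximum entropy model over the two classes left and right. The feature names, the weight container, and the word-class lookup are illustrative assumptions; the actual GIS training with the Gaussian prior is not shown:

    import math

    def reordering_features(src_word, tgt_word, src_class, tgt_class, feature_set):
        # binary feature names for one orientation decision
        feats = ["src_word=" + src_word]
        if feature_set == "src+tgt word+class":
            feats += ["src_class=" + src_class,
                      "tgt_word=" + tgt_word,
                      "tgt_class=" + tgt_class]
        return feats

    def orientation_probability(weights, feats, orientation):
        # maximum entropy model over the orientation classes 'left' and 'right';
        # weights maps (feature name, class) to a learned parameter
        def activation(c):
            return sum(weights.get((f, c), 0.0) for f in feats)
        z = sum(math.exp(activation(c)) for c in ("left", "right"))
        return math.exp(activation(orientation)) / z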

For each rule application during hierarchical decoding, we apply the reordering model at all boundaries where lexical blocks are placed side by side within the partial hypothesis. For this purpose, we need to access neighboring boundary words and their aligned source words and source positions. Note that, as hierarchical phrases are involved, several block joinings may take place at once during a single rule application. Figure 6.1 gives an illustration with an embedding of a lexical phrase (light) in a hierarchical phrase (dark). The gap in the hierarchical phrase ⟨f1f2X∼0, e1X∼0e3⟩ is filled with the lexical phrase ⟨f3, e2⟩. The discriminative reordering model scores the orientation of the lexical phrase with regard to the neighboring block of the hierarchical phrase which precedes it within the target sequence (here: right orientation), and the block of the hierarchical phrase which succeeds the lexical phrase with regard to the latter (here: left orientation).

The way we interpret reordering in hierarchical phrase-based translation keeps our model simple. We are basically able to treat the orientation of continuous lexical blocks in almost exactly the same way as the orientation of phrases in standard phrase-based translation. We avoid the usage of multiple reordering models for different source and target patterns of rules that is done by [He & Meng+ 10b].

6.5 Experiments

We present empirical results obtained with the additional swap rule, the jump rules, and the discriminative reordering model on the NIST Chinese→English task.

¹ mkcls is distributed along with the GIZA++ package.


Table 6.1: Experimental results (cased) with reordering extensions for the NIST Chinese→English translation task.

                                                    MT06 (Dev)                       MT08 (Test)
                                                deep          shallow            deep          shallow
                                              Bleu  Ter     Bleu  Ter          Bleu  Ter     Bleu  Ter
NIST Chinese→English                          [%]   [%]     [%]   [%]          [%]   [%]     [%]   [%]
Baseline                                      32.6  61.2    31.4  61.8         25.2  66.6    24.9  66.6
+ discrim. RO (src word)                      32.9  61.3    31.6  61.8         25.4  66.3    25.2  66.6
+ discrim. RO (src+tgt word+class)            33.0  61.3    31.6  61.6         25.8  66.0    25.1  66.3
+ swap rule                                   32.8  61.7    31.8  62.1         25.8  66.6    25.0  67.0
  + discrim. RO (src word)                    33.0  61.2    32.5  61.4         25.8  66.1    26.0  66.2
  + discrim. RO (src+tgt word+class)          33.1  61.2    32.6  61.4         26.0  66.1    26.1  66.3
  + binary swap feature                       33.2  61.0    32.1  61.8         25.9  66.2    25.7  66.5
    + discrim. RO (src word)                  33.1  61.3    32.4  61.4         26.0  66.1    26.1  66.3
    + discrim. RO (src+tgt word+class)        33.2  61.3    32.9  61.0         26.2  66.1    26.1  66.1
+ jump rules                                  32.9  61.3    32.1  62.4         25.6  66.4    25.1  67.5
  + discrim. RO (src word)                    32.9  61.1    31.9  62.0         25.8  66.0    25.1  66.9
  + discrim. RO (src+tgt word+class)          33.2  61.0    32.1  62.0         25.9  66.1    25.6  66.5
  + binary jump feature                       32.8  61.3    31.9  61.7         25.7  66.3    25.2  66.7
    + discrim. RO (src word)                  32.8  61.3    32.2  61.9         25.8  66.1    25.2  66.7
    + discrim. RO (src+tgt word+class)        33.1  61.2    32.3  62.0         26.0  66.1    25.5  66.7

6.5.1 Experimental Setup

We work with our usual baseline configuration, employing the Chinese–English training corpus of 3.0 M sentence pairs. During decoding, a maximum length constraint of ten is applied to all non-terminals but the initial symbol S. Model weights are optimized towards Bleu with MERT on 100-best lists. We employ MT06 as the development set; MT08 is used as the test set. Translation quality is evaluated using the two metrics Bleu and Ter.

6.5.2 Experimental Results

The empirical evaluation of our reordering extensions is presented in Table 6.1. We report translation results on both the development and the test corpus. The figures with deep and with shallow rules are set side by side in separate columns to facilitate a direct comparison between them. All the setups given in separate rows exist in a deep and a shallow variant.

The shallow baseline is a bit worse than the deep baseline. Adding discriminative reordering models to the baselines without additional reordering rules results in an improvement of up to +0.6 Bleu / −0.6 Ter (in the deep setup). The src+tgt word+class feature set for the discriminative reordering model altogether seems to perform slightly better than the src word feature set. Adding reordering rules in isolation can also improve the systems, in particular in the deep setup. However, extensions with both reordering rules and the discriminative lexicalized reordering model provide the best results, e.g. +1.0 Bleu / −0.5 Ter with the system with deep grammar, swap rule, binary swap feature and discrim. RO (src+tgt word+class) and +1.2 Bleu / −0.5 Ter with the system with shallow grammar, swap rule, binary swap feature and discrim. RO (src+tgt word+class). With a shallow grammar, the combinations of the discrim. RO with the swap rule clearly outperform the jump rules.

We proceed with discussing some supplementary results obtained with the deep grammar that are not included in Table 6.1. The results for Sections 6.5.2.2 to 6.5.2.4 can be found in Table 6.2.


6.5.2.1 Dropping Length Constraints

In order to find out if we lose performance by applying the maximum length constraint of ten to all non-terminals but the initial symbol S during decoding, we optimized systems with no length constraints. When we drop the length constraint in the baseline setup, we observe no improvement on the development set and +0.3 Bleu improvement on the test set. Dropping the length constraint in the system with deep grammar, swap rule, binary swap feature and discrim. RO (src+tgt word+class) results in +0.2 Bleu / −0.2 Ter on the development set, but no improvement on the test set.

6.5.2.2 Monotonic Concatenation Rule

In this experiment, we add a monotonic concatenation rule

X→ ⟨X∼0X∼1,X∼0X∼1⟩ (6.5)

as discussed in Section 6.3.1.1 to the system with deep grammar, swap rule, binary swap feature and discrim. RO (src+tgt word+class). As we presumed, the monotonic concatenation rule does not improve the performance of our deep system.

6.5.2.3 Distance-based Distortion Feature

We have already added binary features which fire on applications of the swap rule or the jump rules, respectively. With an additional distance-based distortion feature, we can take the lengths of the source sequences into account that are swapped or jumped across. For the jump rules, whenever a block jump occurs, the distance-based distortion feature counts the number of words in the source yield of sub-derivations with root non-terminal B. A comparable feature can be defined for the swap rule by just computing the source span width of the left-hand side non-terminal at each swap rule application.

On our test set, adding the distance-based distortion feature gives an improvement neither with the jump rules nor with the swap rule.

6.5.2.4 Discriminative Reordering for Reordering Rules Only

Instead of applying the discriminative reordering model at all rule applications, the model can as well be used to score the orientation of blocks only if they are placed side by side within the target sequence by selected rules. We conducted experiments in which the discriminative reordering scoring is restricted to the swap rule or the explicit jump rules (marked as † in Equation (6.3)), respectively. The result is in both setups slightly worse than the result with the discriminative reordering model applied to all rules.

6.5.3 Investigation of the Rule Usage

To figure out the influence of the swap rule on the usage of different types of rules in the translation process, we compare in Table 6.3 the baseline systems (deep and shallow) with the systems using the swap rule, binary swap feature and discrim. RO (denoted as Best Swap System in the table). We give statistics on the rule usage for the single best translation of the test set (MT08). As expected, the systems with deep grammar apply more hierarchical phrases with gaps compared to the systems with shallow grammar. With both grammars, adding the swap rule causes an increased usage of hierarchical phrases and fewer applications of the glue rule. The swap rule itself accounts for the smallest fraction of rule applications, but it is employed in 22% (deep) and 33% (shallow) of the 1357 test sentences.


Table 6.2: Supplementary experimental results (cased) with reordering extensions using the deep grammar.

                                                                         deep
                                                                MT06 (Dev)          MT08 (Test)
NIST Chinese→English                                          Bleu [%]  Ter [%]   Bleu [%]  Ter [%]
Baseline                                                        32.6      61.2       25.2      66.6
+ no length constraints                                         32.6      61.5       25.5      66.6
+ swap rule + bin. swap feat. + discrim. RO (src+tgt word+class)
                                                                33.2      61.3       26.2      66.1
  + no length constraints                                       33.4      61.1       26.2      66.3
  + monotonic concatenation rule                                33.2      61.6       26.0      66.4
  + dist. feature                                               33.4      61.4       26.2      66.2
  + discrim. RO scoring restricted to swap rule                 33.1      61.4       26.0      66.4
+ jump rules + bin. jump feat. + discrim. RO (src+tgt word+class)
                                                                33.1      61.2       26.0      66.1
  + dist. feature                                               33.2      61.1       25.9      66.1
  + discrim. RO scoring restricted to jump rules                32.8      61.3       25.9      66.3

6.5.4 Translation Examples

Figure 6.2 depicts a translation example along with its decoding tree from our baseline system with deep grammar. The example is taken from the MT08 set, with the four reference translations “But it is actually very hard to do that.”, “However, it is indeed very difficult to achieve.”, “But to achieve this point is actually very difficult.” and “But to be truly frank is, in fact, very difficult.”. Our baseline system with deep grammar translates the sentence as “but to do this , it is in fact very difficult .”. Though the hypothesis may preserve the meaning, the fluency of the sentence is rather bad, in particular due to the choice and the position of the phrase translation “to do this” for the source segment “做 到 这 点”.

Figure 6.3 shows the translation and decoding tree for the same input sentence from our system with deep grammar, swap rule, binary swap feature and discrim. RO (src+tgt word+class). The system translates the sentence as “but , in fact , it is very difficult to achieve this .”. This hypothesis does not match any of the references either, but still is a fully convincing English translation. Note how the application of the swap rule affects the translation.²

6.6 Summary

We presented novel extensions of hierarchical phrase-based systems with a discriminative lexicalized reordering model. We investigated combinations with two variants of additional non-lexicalized reordering rules. Our approach shows large improvements (up to +1.2 Bleu) over the respective baselines with both a deep and a shallow-1 hierarchical grammar on the NIST Chinese→English translation task.

² Both examples are the raw output of the decoder. Recasing and detokenization are part of the postprocessing pipeline.


Table 6.3: Statistics on the rule usage for the single best translation of the test set (MT08).

                                            deep                              shallow
NIST Chinese→English               Baseline   Best Swap System      Baseline   Best Swap System
used hierarchical phrases           25.8 %        32.0 %             17.8 %        24.0 %
used lexical phrases                45.8 %        40.0 %             47.6 %        44.7 %
used initial and glue rules         28.4 %        26.8 %             34.6 %        29.5 %
used swap rules                        –           1.2 %                –           1.8 %
applied swap rule in sentences         –         295 (22 %)             –         446 (33 %)

Figure 6.2: Translation example from the baseline system with deep grammar.

Figure 6.3: Translation example from the system with deep grammar, swap rule, binary swap feature and discrim. RO (src+tgt word+class).


7. Phrase Orientation Model

We introduce a phrase orientation model for hierarchical phrase-based machine translation. The model scores monotone, swap, and discontinuous phrase orientations in the manner of the one presented by [Tillmann 04]. While this type of lexicalized reordering model is a valuable and widely-used component of standard phrase-based statistical machine translation systems [Koehn & Hoang+ 07], it is commonly not employed in hierarchical decoders.

We describe how phrase orientation probabilities can be extracted from word-aligned training data for use with hierarchical phrase inventories, and show how orientations can be scored in hierarchical decoding. The model is empirically evaluated on the NIST Chinese→English translation task. We achieve an improvement of +1.2 Bleu over a typical hierarchical baseline setup and an improvement of +0.7 Bleu over a syntax-augmented hierarchical setup. On a French→German translation task, we obtain a gain of up to +0.4 Bleu.

7.1 Motivation

The purpose of a phrase orientation model is to assess the adequacy of phrase reordering during search. Standard phrase-based decoding allows for a straightforward integration of lexicalized reordering models which assign different scores depending on how a currently translated phrase has been reordered with respect to its context. Popular lexicalized reordering models for phrase-based translation distinguish three orientation classes: monotone, swap, and discontinuous [Tillmann 04, Koehn & Hoang+ 07, Galley & Manning 08]. To obtain such a model, scores for the three classes are calculated from the counts of the respective orientation occurrences in the word-aligned training data for each extracted phrase. The left-to-right orientation of phrases during phrase-based decoding can be easily determined from the start and end positions of continuous phrases. Approximations may need to be adopted for the right-to-left scoring direction.

The utility of phrase orientation models in standard phrase-based translation is plausible and has been empirically established in practice. In hierarchical phrase-based translation, some other types of lexicalized reordering models have been investigated [He & Meng+ 10a, He & Meng+ 10b, Hayashi & Tsukada+ 10], but in none of them are the orientation scores conditioned on the lexical identity of each phrase individually. These models are rather word-based and applied on block boundaries. Experimental results obtained with these other types of lexicalized reordering models have been very encouraging, though.

There are certain reasons why assessing the adequacy of phrase reordering should be useful in hierarchical search:

• Although phrase reorderings are always a result of the application of SCFG rules, the decoder is still able to choose from many different parses of the input sentence.

• The decoder can furthermore choose from many translation options for each given parse, which result in different reorderings and different phrases being embedded in the reordering non-terminals.


• All other models only weakly connect an embedded phrase with the hierarchical phrase it is placed into, in particular as the set of non-terminals of the hierarchical grammar only contains two generic non-terminal symbols.

We therefore investigate phrase orientation modeling for hierarchical translation in this work.

The remainder of the chapter is structured as follows. We briefly review important related publications in the next section. Phrase orientation modeling and a way in which a phrase orientation model can be trained for hierarchical phrase inventories are explained in Section 7.3. In Section 7.4 we introduce an extension of hierarchical search which enables the decoder to score phrase orientations. Empirical results are presented in Section 7.5. We conclude the chapter by giving a summary in Section 7.6.

7.2 Related Work

For standard phrase-based translation, [Galley & Manning 08] introduced a hierarchical phrase orientation model. Similarly to previous approaches [Tillmann 04, Koehn & Hoang+ 07], it distinguishes the three orientation classes monotone, swap, and discontinuous. However, it differs in that it is not limited to modeling local reordering phenomena, but allows for phrases to be hierarchically combined into blocks in order to determine the orientation class. This has the advantage that probability mass is shifted from the rather uninformative default category discontinuous to the other two orientation classes, which model the location of a phrase more specifically. In this work, we transfer this concept to a hierarchical phrase-based machine translation system.

[Nguyen & Vogel 13] and [Xiao & Su+ 11] have independently developed orientation models for hierarchical translation that are reminiscent of ours. A feature that resembles our phrase orientation model for hierarchical machine translation has now also been implemented in the Moses toolkit, with minor implementation differences to the variant described here and included in Jane. Experiments with Moses show nice gains with the model on an English→Romanian translation task [Huck & Fraser+ 16].

7.3 Modeling Phrase Orientation for Hierarchical Translation

The phrase orientation model we are using was introduced by [Galley & Manning 08]. To model the sequential order of phrases within the global translation context, the three orientation classes monotone (M), swap (S) and discontinuous (D) are distinguished, each in both left-to-right and right-to-left direction. In order to capture the global rather than the local context, previous phrases can be merged into blocks if they are consistent with respect to the word alignment. A phrase is in monotone orientation if a consistent monotone predecessor block exists, and in swap orientation if a consistent swap predecessor block exists. Otherwise it is in discontinuous orientation.

Given a sequence of source words f_1^J and a sequence of target words e_1^I, a block ⟨f_{j_1}^{j_2}, e_{i_1}^{i_2}⟩ (with 1 ≤ j_1 ≤ j_2 ≤ J and 1 ≤ i_1 ≤ i_2 ≤ I) is consistent with respect to the word alignment A ⊆ {1, ..., I} × {1, ..., J} iff

\exists (i, j) \in A : i_1 \le i \le i_2 \wedge j_1 \le j \le j_2
\quad \wedge \quad \forall (i, j) \in A : i_1 \le i \le i_2 \Leftrightarrow j_1 \le j \le j_2 .     (7.1)

Consistency is based upon two conditions in this definition: (1.) At least one source and target position within the block must be aligned, and (2.) words from inside the source interval may only be aligned to words from inside the target interval and vice versa. These are the same conditions as those that are applied for the extraction of standard continuous phrases and which we have


seen in Equation (1.1). The only difference is that length constraints are applied to phrases, but not to blocks.
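Equation (7.1) translates directly into a small check over the alignment links. The following Python sketch (1-based positions, the alignment given as a set of (i, j) pairs) is merely an illustration of the definition:

    def is_consistent_block(alignment, i1, i2, j1, j2):
        # block <f_{j1}^{j2}, e_{i1}^{i2}> is consistent iff at least one link lies inside
        # and every link is either completely inside or completely outside (Equation (7.1))
        has_inner_link = False
        for (i, j) in alignment:
            inside_target = i1 <= i <= i2
            inside_source = j1 <= j <= j2
            if inside_target != inside_source:
                return False
            if inside_target:
                has_inner_link = True
        return has_inner_link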

Figure 7.1 illustrates the extraction of monotone, swap, and discontinuous orientation classes in left-to-right direction from word-aligned bilingual training samples. The right-to-left direction works analogously.

We found that this concept can be neatly plugged into the hierarchical phrase-based translation paradigm, without having to resort to approximations in decoding, which is necessary to determine the right-to-left orientation in a standard phrase-based system [Cherry & Moore+ 12]. To train the orientations, the extraction procedure from the standard phrase-based version of the reordering model can be used, with minor extensions. The model is trained on the same word-aligned data from which the phrases are extracted. For each training sentence, we extract all phrases of unlimited length that are consistent with the word alignment, and store their corners in a matrix. The corners are distinguished by their location: top-left, top-right, bottom-left, and bottom-right. For each bilingual phrase, we determine its left-to-right and right-to-left orientation by checking for adjacent corners.
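A naive way to determine the left-to-right orientation class of a phrase is to search directly for a consistent predecessor block that ends at the target position immediately preceding the phrase and is adjacent to it on the source side. The sketch below does exactly this, reusing is_consistent_block from above; it deliberately ignores the efficient corner bookkeeping described in the previous paragraph as well as the sentence-initial case, so it illustrates the definition rather than the implemented extraction procedure:

    def left_to_right_orientation(alignment, i1, j1, j2, J):
        # i1: target start of the phrase, [j1, j2]: its source interval, J: source sentence length
        if i1 > 1:
            for i0 in range(1, i1):              # candidate target start of the predecessor block
                for j0 in range(1, J + 1):       # candidate source interval of the block
                    for j3 in range(j0, J + 1):
                        if not is_consistent_block(alignment, i0, i1 - 1, j0, j3):
                            continue
                        if j3 == j1 - 1:
                            return "M"           # consistent monotone predecessor block exists
                        if j0 == j2 + 1:
                            return "S"           # consistent swap predecessor block exists
        return "D"                               # discontinuous otherwise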

The lexicalized orientation probability for the orientation O ∈ {M, S, D} and the phrase pair ⟨α, β, ∼⟩ is estimated as its smoothed relative frequency:

p(O) = \frac{N(O)}{\sum_{O' \in \{M,S,D\}} N(O')}     (7.2)

p(O \mid \langle \alpha, \beta, \sim \rangle) = \frac{\sigma \cdot p(O) + N(O \mid \langle \alpha, \beta, \sim \rangle)}{\sigma + \sum_{O' \in \{M,S,D\}} N(O' \mid \langle \alpha, \beta, \sim \rangle)} .     (7.3)

Here, N(O) denotes the global count and N(O | ⟨α, β, ∼⟩) the lexicalized count for the orientation O. σ is a smoothing constant.
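In code, Equations (7.2) and (7.3) amount to the following straightforward sketch (the count containers are assumptions for illustration):

    def smoothed_orientation_probabilities(lexicalized_counts, global_counts, sigma):
        # lexicalized_counts: {'M': ..., 'S': ..., 'D': ...} for one phrase pair <alpha, beta, ~>
        # global_counts: corpus-wide counts N(O); sigma: smoothing constant
        total_global = sum(global_counts.values())
        prior = {o: global_counts[o] / total_global for o in "MSD"}              # Eq. (7.2)
        total_lex = sum(lexicalized_counts.values())
        return {o: (sigma * prior[o] + lexicalized_counts[o]) / (sigma + total_lex)
                for o in "MSD"}                                                  # Eq. (7.3)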

To determine the orientation frequency for a hierarchical phrase with non-terminal symbols, the orientation counts of all those phrases are accumulated from which a sub-phrase is cut out and replaced by a non-terminal symbol to obtain this hierarchical phrase. Figure 7.2 gives an example.

Logarithms of the values are used as scores in the log-linear model combination. A value of 0 for all orientations is assigned to the special rules which are not extracted from the training data (initial and glue rule).

7.4 Phrase Orientation Scoring in Hierarchical Decoding

Our implementation of phrase orientation scoring in hierarchical decoding is based on the observation that hierarchical rule applications, i.e. the usage of rules with non-terminals within their right-hand sides, settle the target sequence order. Monotone, swap, or discontinuous orientations of blocks are each due to monotone, swap, or discontinuous placement of non-terminals which are being substituted by these blocks.

The problem of phrase orientation scoring can thus be mostly reduced to three steps which need to be carried out whenever a hierarchical rule is applied (a schematic sketch follows the list):

1. Determining the orientations of the non-terminals in the rule.

2. Retrieving the proper orientation score of the topmost rule application in the sub-derivation which corresponds to the embedded block for the respective non-terminal.

3. Applying the orientation score to the log-linear model combination for the current derivation.
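The three steps can be summarized in a short outline of the control flow; determine_orientation stands for the procedure of Section 7.4.1, and all other names are placeholders rather than the decoder code. For simplicity, the sketch shows only the left-to-right direction and ignores the delayed scoring of boundary non-terminals (Section 7.4.3):

    def phrase_orientation_score(rule, sub_derivations, weights):
        # rule: the applied hierarchical rule (with its stored word alignment)
        # sub_derivations: one sub-derivation per non-terminal on the rule's right-hand side
        # weights: log-linear scaling factors for the orientation classes 'M', 'S', 'D'
        score = 0.0
        for nt_index, sub in enumerate(sub_derivations):
            orientation = determine_orientation(rule, nt_index)          # step 1
            log_p = sub.topmost_rule.orientation_log_prob[orientation]   # step 2
            score += weights[orientation] * log_p                        # step 3
        return score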


Figure 7.1: Extraction of the orientation classes monotone, swap, and discontinuous from word-aligned training samples. The examples show the left-to-right orientation of the shaded phrases. The dashed rectangles indicate how the predecessor words are merged into blocks with regard to their word alignment. (Panels: (a) monotone, (b) swap, (c) discontinuous phrase orientation.)

Figure 7.2: Accumulation of orientation counts for hierarchical phrases during extraction. The hierarchical phrase ⟨f2X∼0f4, e2X∼0e4⟩ (dark shaded) can be extracted from all the three training samples: (a) a monotone orientation, after which N(M|f2X∼0f4, e2X∼0e4) = 1, N(S|·) = 0, N(D|·) = 0; (b) another monotone orientation, after which N(M|·) = 2, N(S|·) = 0, N(D|·) = 0; (c) a swap orientation, after which N(M|·) = 2, N(S|·) = 1, N(D|·) = 0. Its orientation is identical to the orientation of the continuous phrase (lightly shaded) which the sub-phrase is cut out of, respectively. Note that the actual lexical content of the sub-phrase may differ. For instance, the sub-phrase ⟨f3, e3⟩ is being cut out in Fig. 7.2a, and the sub-phrase ⟨f6, e6⟩ is being cut out in Fig. 7.2b.


Figure 7.3: Scoring with the orientation classes monotone, swap, and discontinuous. Each picture shows exactly one hierarchical phrase: (a) monotone, (b) swap, and (c) discontinuous non-terminal orientation. The block which replaces the non-terminal X during decoding is embedded with the orientation of this non-terminal X within the hierarchical phrase. The examples show the left-to-right orientation of the non-terminal. The left-to-right orientation can be detected from the word alignment of the hierarchical phrase, except for cases where the non-terminal is in boundary position on the target side.

The orientation of a non-terminal in a hierarchical rule is dependent on the word alignments in its context. Figure 7.3 depicts three examples.¹ We however need to deal with special cases where a non-terminal orientation cannot be established at the moment when the hierarchical rule is considered. We first describe the non-degenerate case (Section 7.4.1). Afterwards we briefly discuss our strategy in the special situation of boundary non-terminals where the non-terminal orientation cannot be determined from information which is inherent to the hierarchical rule under consideration (Section 7.4.3).

We focus on left-to-right orientation scoring; right-to-left scoring is symmetric.

7.4.1 Determining Orientations

In order to determine the orientation class of a non-terminal, we rely on the word alignments within the phrases. With each phrase, we store the word alignment that has been seen most frequently during phrase extraction as a boolean matrix. Non-terminal symbols on the target side are assumed to be aligned to the respective non-terminal symbols on the source side according to the ∼ relation. In the alignment matrix, the rows and columns of non-terminals can obviously contain only exactly this one alignment link.

Starting from the last previous aligned target position to the left of the non-terminal, the algorithm expands a box that spans across the other relevant alignment links onto the corner of the non-terminal. Afterwards it checks whether the areas on the opposite sides of the non-terminal position are non-aligned in the source and target intervals of this box. The non-terminal is in discontinuous orientation if the box is not a consistent block. If the box is a consistent block, the

¹ Note that even maximal consecutive lexical intervals (either on source or target side) are not necessarily aligned in a way which makes them consistent bilingual blocks. In Fig. 7.3a, e4 is for instance aligned both below and above the non-terminal. In Fig. 7.3c, neither ⟨f1f2, e1e2⟩ nor ⟨f1f2, e3e4⟩ would be valid continuous phrases (the same holds for ⟨f3f4, e1e2⟩ and ⟨f3f4, e3e4⟩). We actually need the generalization of the phrase orientation model to hierarchical phrases as described in Section 7.3 for this reason. Otherwise we would be able to just score neighboring consistent sub-blocks with a model that does not account for hierarchical phrases with non-terminals.


Figure 7.4: Determining the orientation class during decoding. Starting from the last previous aligned target position, a box is spanned across the relevant alignment links onto the corner of the non-terminal. The box is then checked for consistency. (Panels: (a) last previous aligned target position; (b) initial box; (c) expansion of the initial box; (d) the final box is a consistent left-to-right monotone predecessor block of the non-terminal.)

non-terminal is in monotone orientation if its source-side position is larger than the maximum of the source-side interval of the box, and in swap orientation if its source-side position is smaller than the minimum of the source-side interval of the box.

Figure 7.4 illustrates how the procedure operates. In left-to-right direction, an initial box is spanned from the last previous aligned target position to the lower (monotone) or upper (swap) left corner of the non-terminal. In the example, starting from ⟨f3, e5⟩ (Fig. 7.4a), this initial box is spanned to the lower left corner by iterating from f3 to f4 and expanding its target interval to the minimum aligned target position within these two rows of the alignment matrix. The initial box covers ⟨f3f4, e3e4e5⟩ (Fig. 7.4b). The procedure then repeatedly checks whether the box needs to be expanded—alternating to the bottom (monotone) or top (swap) and to the left—until no alignment links below or to the left of the box break the consistency. Two box expansions are conducted in the example: the first one expands the initial box below, resulting in a larger box which covers ⟨f1f2f3f4, e3e4e5⟩ (Fig. 7.4c); the second one expands this new box to the left, resulting in a final box which covers ⟨f1f2f3f4, e1e2e3e4e5⟩ (Fig. 7.4d) and does not need to be


Figure 7.5: Left boundary non-terminal symbols. The orientations the non-terminal can eventually turn out to get placed in differ depending on existing alignment links in the rest of the phrase: (a) a left boundary non-terminal that can be placed in left-to-right monotone or discontinuous orientation when the phrase is embedded into another one; (b) one that can be placed in left-to-right discontinuous or swap orientation; (c) one that can be placed in left-to-right monotone, swap, or discontinuous orientation; (d) one that can only be placed in left-to-right discontinuous orientation. Delayed left-to-right scoring is not required in cases as in Fig. 7.5d. Fractional scores for the possible orientations are temporarily applied in the other cases and recursively corrected as soon as an orientation is constituted in an upper hypernode.

expanded towards the lower left corner any more. Afterwards the procedure examines whether the final box is a consistent block by inspecting whether the areas on the opposite side of the non-terminal position are non-aligned in the intervals of the box (areas with waved lines in Fig. 7.4d). These areas do not contain alignment links in the example: the orientation class of the non-terminal is monotone as it has a consistent left-to-right monotone predecessor block. (An alignment link ⟨f5, e2⟩, for instance, would break the consistency: the orientation class would then be discontinuous as the final box would not be a consistent block.)

Orientations of non-terminals could basically be precomputed and stored in the translation table. We however compute them on demand during decoding. The computational overhead did not seem to be too severe in our experiments.

7.4.2 Scoring Orientations

Once the orientation is determined, the proper orientation score of the embedded block needs to be retrieved. We access the topmost rule application in the sub-derivation which corresponds to the embedded block for the respective non-terminal and read the orientation model score for this rule. The special case of delayed scoring for boundary non-terminals as described in the subsequent section is recursively processed if necessary. The retrieved orientation scores of the embedded blocks of all non-terminals are finally added to the log-linear model combination for the current derivation.

7.4.3 Boundary Non-Terminals

Cases where a non-terminal orientation cannot be established at the moment when the hierarchical rule is considered arise when a non-terminal symbol is in a boundary position on the target side. We define a non-terminal to be in (left or right) boundary position iff no symbols are aligned between the phrase-internal target-side index of the non-terminal and the (left or right) phrase


boundary. Left boundary positions of non-terminals are critical for left-to-right orientation scoring, right boundary positions for right-to-left orientation scoring. We denote non-terminals in boundary position as boundary non-terminals.

The procedure as described in Section 7.4.1 is not applicable to boundary non-terminals because a last previous aligned target position does not exist. If it is impossible to determine the final non-terminal orientation in the hypothesis from information which is inherent to the phrase, we are forced to delay the orientation scoring of the embedded block. Our solution in these cases is to heuristically add fractional scores of all orientations the non-terminal can still eventually turn out to get placed in (cf. Figure 7.5). We do so because not adding an orientation score to the derivation would give it an unjustified advantage over other ones. As soon as an orientation is constituted in an upper hypernode, any heuristic and actual orientation scores can be collected by means of a recursive call. Note that monotone or swap orientations in upper hypernodes can top-down transition into discontinuous orientations for boundary non-terminals, depending on existing phrase-internal alignment links in the context of the respective boundary non-terminal. In the derivation at the upper hypernode, the heuristic scores are subtracted and the correct actual scores added. Delayed scoring helps keep the risk of search errors confined; it cannot, however, fully avoid them.

7.5 Experiments

We evaluate the effect of phrase orientation scoring in hierarchical translation on the Chinese→English 2008 NIST task and on the French→German language pair, using the standard WMT newstest sets for development and testing.2

7.5.1 Experimental Setup

We work with our Chinese–English parallel training corpus of 3.0 M sentence pairs. To train the French→German baseline system, we use 2.0 M sentence pairs that are partly taken from the Europarl corpus and have partly been collected within the Quaero project.3 The standard set of models is used in the baselines. Model weights are optimized towards Bleu with MERT on 100-best lists. For Chinese→English we employ MT06 as development set, and MT08 is used as test set. For French→German we employ newstest2009 as development set, and newstest2008, newstest2010, and newstest2011 are used as test sets. During decoding, a maximum length constraint of ten is applied to all non-terminals except the initial symbol S. Translation quality is measured cased with Bleu and Ter.

7.5.2 Chinese→English Experimental Results

Table 7.1 contains all results of our empirical evaluation on the Chinese→English task. We first compare the performance of the phrase orientation model in left-to-right direction only with the performance of the phrase orientation model in left-to-right and right-to-left direction (bidirectional). In all experiments, monotone, swap, and discontinuous orientation scores are treated as coming from different feature functions in the log-linear model combination: we assign a separate scaling factor to each of the orientations. We thus have three more scaling factors than in the baseline for left-to-right direction only, and six more scaling factors for bidirectional phrase orientation scoring. As can be seen from the results table, the left-to-right model already yields a gain of 1.1 Bleu over the baseline on the test set (MT08). The bidirectional model performs just slightly better (+1.2 Bleu over the baseline).

2 http://www.statmt.org/wmt13/translation-task.html
3 http://www.quaero.org


Table 7.1: Experimental results (cased) with the phrase orientation model for the NIST Chinese→English translation task.

                                                  MT06 (Dev)          MT08 (Test)
NIST Chinese→English                            Bleu [%]  Ter [%]   Bleu [%]  Ter [%]
Baseline                                          32.6     61.2       25.2     66.6
  + discrim. RO                                   33.0     61.3       25.8     66.0
  + phrase orientation (left-to-right)            33.3     60.7       26.3     65.5
  + phrase orientation (bidirectional)            33.2     60.6       26.4     65.3
  + swap rule                                     32.8     61.7       25.8     66.6
    + discrim. RO                                 33.1     61.2       26.0     66.1
    + phrase orientation (bidirectional)          33.3     60.7       26.5     65.3
    + binary swap feature                         33.2     61.0       25.9     66.2
      + discrim. RO                               33.2     61.3       26.2     66.1
      + phrase orientation (bidirectional)        33.6     60.5       26.6     65.1
  + soft syntactic labels                         33.4     60.8       26.1     66.4
    + phrase orientation (bidirectional)          33.7     60.1       26.8     65.1
  + phrase-level s2t+t2s DWL + triplets           34.3     60.1       27.7     65.0
    + discrim. RO                                 34.8     59.8       27.7     64.7
    + phrase orientation (bidirectional)          35.3     59.0       28.4     63.7

With both models, Ter is reduced considerably as well (−1.1 / −1.3 Ter compared to the baseline). In the table, we include results with the discriminative lexicalized reordering model from Chapter 6 for comparison purposes (discrim. RO, with source and target features over words and classes). The phrase orientation model provides clearly better translation quality.

As a next experiment, we bring in more reordering capabilities by augmenting the hierarchical grammar with a single swap rule X → ⟨X∼0 X∼1, X∼1 X∼0⟩, supplementary to the initial rule and glue rule. The swap rule allows adjacent phrases to be transposed. The setup with swap rule and bidirectional phrase orientation model is about as good as the setup with just the bidirectional phrase orientation model and no swap rule. If we furthermore mark the swap rule with a binary feature (binary swap feature), we end up with an improvement of +1.4 Bleu over the baseline. The phrase orientation model again provides higher translation quality than the discriminative reordering model.

In a third experiment, we investigate whether the phrase orientation model also has a positive influence when integrated into a syntax-augmented hierarchical system. We configured a hierarchical setup with soft syntactic labels [Stein & Peitz+ 10], a syntactic enhancement in the manner of preference grammars [Venugopal & Zollmann+ 09]. On MT08, the syntax-augmented system performs 0.9 Bleu above the baseline setup. We achieve an additional improvement of +0.7 Bleu and −1.3 Ter by including the bidirectional phrase orientation model. Interestingly, the translation quality of the setup with soft syntactic labels (but without phrase orientation model) is worse than that of the setup with phrase orientation model (but without soft syntactic labels) on MT08. The combination of both extensions provides the best result, though.

In a final experiment, we take a very strong setup which improves over the baseline by 2.5 Bleu through the integration of discriminative word lexicon models and triplet lexicon models on phrase level in source-to-target (s2t) and target-to-source (t2s) direction. In this strong setup, the discriminative reordering model gives a gain on the development set which, however, barely carries over to the test set. Adding the bidirectional phrase orientation model, in contrast, results in a solid gain of +0.7 Bleu and a reduction of 1.3 points in Ter on the test set, even on top of the DWL and triplet lexicon models.


Table 7.2: Experimental results (cased) with the phrase orientation model for the French→German translation task. newstest2009 is used as development set.

                                         newstest2008    newstest2009    newstest2010    newstest2011
                                         Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
French→German                             [%]    [%]      [%]    [%]      [%]    [%]      [%]    [%]
Baseline                                 15.2   71.7     15.0   71.7     15.7   69.5     14.2   72.2
  + phrase orientation (left-to-right)   15.1   71.4     15.3   71.4     15.9   69.2     14.5   71.8
  + phrase orientation (bidirectional)   15.4   71.1     15.4   71.3     15.9   69.1     14.6   71.6

7.5.3 French→German Experimental Results

Table 7.2 contains the results of our empirical evaluation on the French→German task. The left-to-right phrase orientation model boosts the translation quality by up to 0.3 Bleu. The reduction in Ter is of a similar order of magnitude. The bidirectional model again performs a bit better, with an improvement of up to 0.4 Bleu and a maximal reduction in Ter of 0.6 points.

7.6 Summary

In this chapter, we introduced a phrase orientation model for hierarchical machine translation. The training of a lexicalized reordering model which assigns probabilities for monotone, swap, and discontinuous orientation of phrases was generalized from standard continuous phrases to hierarchical phrases. We explained how phrase orientation scoring can be implemented in hierarchical decoding and conducted a number of experiments on a Chinese→English and on a French→German translation task. The results indicate that phrase orientation modeling is a very suitable enhancement of the hierarchical paradigm.


8. Insertion and Deletion Models

We investigate insertion and deletion models for hierarchical phrase-based statistical machine translation. Insertion and deletion models are designed as a means to avoid the omission of content words in the hypotheses. In our case, they are implemented as phrase-level feature functions which count the number of inserted or deleted words. A target language word is considered inserted or deleted based on lexical probabilities with the words on the foreign language side of the phrase. Related techniques have been employed before by [Och & Gildea+ 03] in an n-best reranking framework and by [Mauser & Zens+ 06] and [Zens 08] in standard phrase-based translation systems. We propose novel thresholding methods in this work and study insertion and deletion features which are based on two different types of lexicon models. We give an extensive experimental evaluation of all these variants on the NIST Chinese→English translation task.

We define insertion and deletion models, each in both source-to-target and target-to-source direction, by giving phrase-level scoring functions for the features. In our implementation, the feature values are precomputed and written to the phrase table. The features are then incorporated directly into the log-linear model combination of the decoder.

8.1 Insertion Models

Our insertion model in source-to-target direction ts2tIns(·) counts the number of inserted words on the target side β of a hierarchical rule r with respect to the source side α of the rule:

t_{\mathrm{s2tIns}}(r) = \sum_{i=1}^{I_\beta} \prod_{j=1}^{J_\alpha} \left[\, p(\beta_i \mid \alpha_j) < \tau_{\alpha_j} \right]    (8.1)

Here, [·] denotes the indicator of a true or false statement: the result is 1 if the statement is true and 0 if it is false. The model considers an occurrence of a target word e an insertion iff no source word f exists within the phrase for which the lexical translation probability p(e|f) is greater than or equal to a corresponding threshold τf. We employ lexical translation probabilities from two different types of lexicon models: a model which is extracted from word-aligned training data and, given the word alignment, relies on pure relative frequencies, and the IBM model 1 lexicon (cf. Section 8.3). For τf, previous authors have used a fixed heuristic value which was equal for all f ∈ Vf. In Section 8.4, we describe how such a global threshold can be computed and set in a reasonable way based on the characteristics of the model. We also propose several novel thresholding techniques with distinct thresholds τf for each source word f.

In an analogous manner to the source-to-target direction, the insertion model in target-to-source direction tt2sIns(·) counts the number of inserted words on the source side α of a hierarchical rule with respect to the target side β of the rule:

t_{\mathrm{t2sIns}}(r) = \sum_{j=1}^{J_\alpha} \prod_{i=1}^{I_\beta} \left[\, p(\alpha_j \mid \beta_i) < \tau_{\beta_i} \right]    (8.2)


Target-to-source lexical translation probabilities p(f|e) are thresholded with values τe which may be distinct for each target word e. The model considers an occurrence of a source word f an insertion iff no target word e exists within the phrase with p(f|e) greater than or equal to τe.
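Equations (8.1) and (8.2) amount to simple counting over the terminal symbols of a rule. The following Python sketch illustrates both insertion counts; the representation of the lexicons as dictionaries keyed by word pairs and of the thresholds as a per-word mapping is an assumption for illustration.

```python
def s2t_insertion_count(src_terminals, tgt_terminals, p_s2t, tau):
    """Number of inserted target words: a target word e counts as inserted
    iff p(e|f) < tau[f] for every source word f of the phrase (Eq. 8.1)."""
    return sum(
        1 for e in tgt_terminals
        if all(p_s2t.get((e, f), 0.0) < tau[f] for f in src_terminals)
    )

def t2s_insertion_count(src_terminals, tgt_terminals, p_t2s, tau):
    """Number of inserted source words: a source word f counts as inserted
    iff p(f|e) < tau[e] for every target word e of the phrase (Eq. 8.2)."""
    return sum(
        1 for f in src_terminals
        if all(p_t2s.get((f, e), 0.0) < tau[e] for e in tgt_terminals)
    )
```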

8.2 Deletion Models

Compared to the insertion model, our deletion model interchanges the direction of the lexical probabilities relative to the order of source and target in the sum and product of the term. The source-to-target deletion model thus differs from the target-to-source insertion model only in that it employs a source-to-target word-based lexicon model.

The deletion model in source-to-target direction ts2tDel(·) counts the number of deleted words on the source side α of a hierarchical rule with respect to the target side β of the rule:

t_{\mathrm{s2tDel}}(r) = \sum_{j=1}^{J_\alpha} \prod_{i=1}^{I_\beta} \left[\, p(\beta_i \mid \alpha_j) < \tau_{\alpha_j} \right]    (8.3)

It considers an occurrence of a source word f a deletion iff no target word e exists within the phrase with p(e|f) greater than or equal to τf.

The target-to-source deletion model tt2sDel(·) correspondingly considers an occurrence of a target word e a deletion iff no source word f exists within the phrase with p(f|e) greater than or equal to τe:

t_{\mathrm{t2sDel}}(r) = \sum_{i=1}^{I_\beta} \prod_{j=1}^{J_\alpha} \left[\, p(\alpha_j \mid \beta_i) < \tau_{\beta_i} \right]    (8.4)

Note that the four feature functions may each assign a different score to a phrase.
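Continuing the hypothetical sketch from Section 8.1, the deletion counts reuse exactly the same counting scheme; only the thresholded lexicon direction is decoupled from the side that is summed over:

```python
def s2t_deletion_count(src_terminals, tgt_terminals, p_s2t, tau):
    """Number of deleted source words: a source word f counts as deleted
    iff p(e|f) < tau[f] for every target word e of the phrase (Eq. 8.3)."""
    return sum(
        1 for f in src_terminals
        if all(p_s2t.get((e, f), 0.0) < tau[f] for e in tgt_terminals)
    )

def t2s_deletion_count(src_terminals, tgt_terminals, p_t2s, tau):
    """Number of deleted target words: a target word e counts as deleted
    iff p(f|e) < tau[e] for every source word f of the phrase (Eq. 8.4)."""
    return sum(
        1 for e in tgt_terminals
        if all(p_t2s.get((f, e), 0.0) < tau[e] for f in src_terminals)
    )
```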

8.3 Lexicon Models

We proceed with a description of lexicon models and thresholding methods that can be employed for the computation of insertion and deletion counts. We restrict ourselves to the description of the source-to-target direction of the models.

8.3.1 Word Lexicon from Word-aligned Data

We can utilize the relative frequency (RF) word lexicon from Section 5.3.1 in the insertion and deletion scoring functions. Single-word based translation probabilities pRF(e|f) are estimated by relative frequency, given a parallel training corpus with existing word alignments. Denoting with N(e, f) the counts of aligned co-occurrences of target word e and source word f, we can compute

p_{\mathrm{RF}}(e|f) = \frac{N(e, f)}{\sum_{e'} N(e', f)} .    (8.5)

If an occurrence of e has multiple aligned source words, each of the alignment links contributes with a fractional count.
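Under the assumption that the word-aligned corpus is available as sentence pairs with explicit (source position, target position) links, the relative-frequency estimation with fractional counts can be sketched as follows (illustrative code, not the actual extraction pipeline):

```python
from collections import defaultdict

def estimate_rf_lexicon(corpus):
    """corpus: iterable of (src_words, tgt_words, alignment) triples, where
    'alignment' is a set of (j, i) links between source position j and
    target position i. Returns p_RF(e|f) as a dictionary keyed by (e, f)."""
    counts = defaultdict(float)                    # N(e, f)
    for src, tgt, alignment in corpus:
        links_per_target = defaultdict(int)
        for j, i in alignment:
            links_per_target[i] += 1
        for j, i in alignment:
            # a multiply-aligned target word occurrence contributes fractional counts
            counts[(tgt[i], src[j])] += 1.0 / links_per_target[i]
    totals = defaultdict(float)                    # sum over e' of N(e', f)
    for (e, f), c in counts.items():
        totals[f] += c
    return {(e, f): c / totals[f] for (e, f), c in counts.items()}
```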

8.3.2 IBM Model 1

As an alternative to the RF word lexicon, we can utilize IBM model 1 in our insertion and deletion scoring functions. Making the simplifying assumptions mentioned in Section 5.3.2, the probability of a target sentence e_1^I given a source sentence f_0^J (with f_0 = NULL) is modeled as

P(e_1^I \mid f_1^J) = p(I|J) \cdot \frac{1}{(J+1)^I} \cdot \prod_{i=1}^{I} \sum_{j=0}^{J} p_{\mathrm{ibm1}}(e_i \mid f_j) ,    (8.6)

and the IBM-1 lexicon model can be learned iteratively by means of EM training with maximum likelihood.
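For reference, the iterative maximum likelihood training of the IBM-1 lexicon can be sketched in a few lines. This is a minimal textbook-style EM sketch under the assumption of a tokenized bitext held in memory; our actual models are produced with GIZA++ (cf. Section 8.5.1).

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=5):
    """Minimal EM sketch for IBM model 1. 'bitext' is a list of
    (source_words, target_words) sentence pairs; a NULL token is prepended
    to every source sentence. Returns p_ibm1(e|f) keyed by (e, f)."""
    t = defaultdict(lambda: 1.0)                   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                 # expected counts c(e, f)
        total = defaultdict(float)                 # marginal counts c(f)
        for src, tgt in bitext:
            src = ["NULL"] + list(src)
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    delta = t[(e, f)] / norm       # posterior of aligning e to f
                    count[(e, f)] += delta
                    total[f] += delta
        t = defaultdict(float,
                        {(e, f): count[(e, f)] / total[f] for (e, f) in count})
    return t
```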

8.4 Thresholding Methods

We introduce thresholding methods for insertion and deletion models which set thresholds based on the characteristics of the lexicon model that is applied. For all the following thresholding methods, we disregard entries in the lexicon model with probabilities that are below a fixed floor value of 10⁻⁶. Again, we restrict ourselves to the description of the source-to-target direction.

individual: τf is a distinct value for each f, computed as the arithmetic average of all entries p(e|f) of any e with the given f in the lexicon model.

global: The same value τf = τ is used for all f. We compute this global threshold by averaging over the individual thresholds.1

histogram n: τf is a distinct value for each f. τf is set to the value of the (n + 1)-th largest probability p(e|f) of any e with the given f.

all: All entries with probabilities larger than the floor value are not thresholded. This variant may be considered as histogram ∞. We only apply it with RF lexicons.

median: τf is a median-based distinct value for each f, i.e. it is set to the value that separates the higher half of the entries from the lower half of the entries p(e|f) for the given f.
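The threshold variants can be computed directly from the lexicon model. A sketch, under the assumption that the lexicon is given as a mapping from a source word f to its target distribution {e: p(e|f)} and that every f has at least one entry above the floor value:

```python
import statistics

FLOOR = 1e-6   # lexicon entries below this floor value are disregarded

def individual_thresholds(lexicon):
    """tau_f = arithmetic average of all p(e|f) above the floor."""
    return {f: statistics.mean(p for p in dist.values() if p >= FLOOR)
            for f, dist in lexicon.items()}

def global_threshold(lexicon):
    """A single tau for all f: the average over the individual thresholds."""
    return statistics.mean(individual_thresholds(lexicon).values())

def histogram_thresholds(lexicon, n):
    """tau_f = value of the (n+1)-th largest p(e|f) above the floor
    (falls back to the smallest such entry if fewer than n+1 entries exist)."""
    thresholds = {}
    for f, dist in lexicon.items():
        probs = sorted((p for p in dist.values() if p >= FLOOR), reverse=True)
        thresholds[f] = probs[min(n, len(probs) - 1)]
    return thresholds

def median_thresholds(lexicon):
    """tau_f = median of the p(e|f) entries above the floor."""
    return {f: statistics.median(p for p in dist.values() if p >= FLOOR)
            for f, dist in lexicon.items()}
```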

8.5 Experiments

We present empirical results obtained with the different insertion and deletion model variants on the NIST Chinese→English task.

8.5.1 Experimental Setup

Again, we work with our parallel training corpus of 3.0 M Chinese–English sentence pairs and the standard set of features for the baseline. The counts for the RF lexicon models are computed from a symmetrized word alignment, and the IBM-1 models are produced with GIZA++. Model weights are optimized towards Bleu with MERT; performance is measured with Bleu and Ter. We employ MT06 as development set; MT08 is used as the test set. The empirical evaluation of all our setups is presented in Table 8.1.

1 Concrete values from our experiments are: 0.395847 for the source-to-target RF lexicon, 0.48127 for the target-to-source RF lexicon, 0.0512856 for the source-to-target IBM-1, and 0.0453709 for the target-to-source IBM-1. [Mauser & Zens+ 06] mention that they chose their heuristic thresholds for use with IBM-1 between 10⁻¹ and 10⁻⁴.


8.5.2 Experimental Results

With the best model variant, we obtain an improvement of +1.0 points Bleu over the baseline on MT08. A consistent trend towards one of the variants cannot be observed. The results on the test set with RF lexicons or IBM-1, insertion or deletion models, and (in most of the cases) with all of the thresholding methods are roughly at the same level. For comparison, we also give a result with an unaligned word count model (+0.4 Bleu).

In Chapter 5 we reported substantial improvements over typical hierarchical baseline setups by just including phrase-level IBM-1 scores. When we add the IBM-1 models directly, our baseline is outperformed by +1.7 Bleu. We again tried to obtain improvements with insertion and deletion models on top of this setup, but the positive effect was largely diminished. In one of our strongest setups, which includes discriminative word lexicon models, triplet lexicon models, and the discriminative reordering model, insertion models still yield a minimal gain, though.

8.6 Summary

Our results with insertion and deletion models for Chinese→English hierarchical machine translation are twofold. On the one hand, we achieved considerable improvements over a standard hierarchical baseline. We were also able to report a slight gain when adding the models to a very strong setup with discriminative word lexicons, triplet lexicon models, and a discriminative reordering model. On the other hand, the positive impact of the models was mainly noticeable when we exclusively applied lexical smoothing with word lexicons that are simply extracted from word-aligned training data (which is, however, the standard technique in most state-of-the-art systems). If we included phrase-level lexical scores with IBM model 1 as well, the systems barely benefited from our insertion and deletion models. Compared to an unaligned word count model, insertion and deletion models perform well.


Table 8.1: Experimental results (cased) with the insertion and deletion models for the NIST Chinese→English translation task. s2t denotes source-to-target scoring, t2s target-to-source scoring.

                                                              MT06 (Dev)        MT08 (Test)
                                                             Bleu     Ter      Bleu     Ter
NIST Chinese→English                                          [%]     [%]       [%]     [%]
Baseline (with s2t+t2s RF word lexicons)                     32.6     61.2     25.2     66.6
  + s2t+t2s insertion model (RF, individual)                 32.9     61.4     25.7     66.2
  + s2t+t2s insertion model (RF, global)                     32.8     61.8     25.7     66.7
  + s2t+t2s insertion model (RF, histogram 10)               32.9     61.7     25.5     66.5
  + s2t+t2s insertion model (RF, all)                        32.8     62.0     26.1     66.7
  + s2t+t2s insertion model (RF, median)                     32.9     62.1     25.7     67.1
  + s2t+t2s deletion model (RF, individual)                  32.7     61.4     25.6     66.5
  + s2t+t2s deletion model (RF, global)                      33.0     61.3     25.8     66.1
  + s2t+t2s deletion model (RF, histogram 10)                32.9     61.4     26.0     66.1
  + s2t+t2s deletion model (RF, all)                         33.0     61.4     25.9     66.4
  + s2t+t2s deletion model (RF, median)                      32.9     61.5     25.8     66.7
  + s2t+t2s insertion model (IBM-1, individual)              33.0     61.4     26.1     66.4
  + s2t+t2s insertion model (IBM-1, global)                  33.0     61.6     25.9     66.5
  + s2t+t2s insertion model (IBM-1, histogram 10)            33.7     61.3     26.2     66.5
  + s2t+t2s insertion model (IBM-1, median)                  33.0     61.3     26.0     66.4
  + s2t+t2s deletion model (IBM-1, individual)               32.8     61.5     26.0     66.2
  + s2t+t2s deletion model (IBM-1, global)                   32.9     61.3     25.9     66.1
  + s2t+t2s deletion model (IBM-1, histogram 10)             32.8     61.2     25.7     66.0
  + s2t+t2s deletion model (IBM-1, median)                   32.8     61.6     25.6     66.7
  + s2t insertion + s2t deletion model (IBM-1, individual)   32.7     62.3     25.7     67.1
  + s2t insertion + t2s deletion model (IBM-1, individual)   32.7     62.2     25.9     66.8
  + t2s insertion + s2t deletion model (IBM-1, individual)   33.1     61.3     25.9     66.2
  + t2s insertion + t2s deletion model (IBM-1, individual)   33.0     61.3     26.1     66.0
  + source+target unaligned word count                       32.3     61.8     25.6     66.7
  + phrase-level s2t+t2s IBM-1 word lexicons                 33.8     60.5     26.9     65.4
    + source+target unaligned word count                     34.0     60.4     26.7     65.8
    + s2t+t2s insertion model (IBM-1, histogram 10)          34.0     60.3     26.8     65.2
  + phrase-level s2t+t2s DWL + triplets + discrim. RO        34.8     59.8     27.7     64.7
    + s2t+t2s insertion model (RF, individual)               35.0     59.5     27.8     64.4


9. Hierarchical Translation for Large-scale English→French News and Talk Tasks

With their public evaluation campaigns, the Workshop on Statistical Machine Translation (WMT) and the International Workshop on Spoken Language Translation (IWSLT) provide important testbeds for machine translation technology. Hierarchical systems developed for the English–French language pair on the WMT news translation task (Section 9.1) and on the IWSLT TED talk translation task (Section 9.2) are described in this chapter. The baselines are augmented with some of the previously described enhancements, along with other techniques that prove to be effective on the tasks.

9.1 WMT News Translation Task

9.1.1 Overview

We now present empirical results with the hierarchical phrase-based translation system on a large-scale translation task for the English–French language pair with test sets from the news domain. We apply a number of the models and techniques that have been introduced in previous chapters of this thesis, specifically an insertion model, different lexical smoothing methods, and the discriminative lexicalized reordering extension. Considerable improvements over the baseline system are achieved by means of a combination of the methods.

The hierarchical phrase-based translation system which is described here has been developed for the RWTH Aachen University participation in the English→French shared translation task of the NAACL 2012 Seventh Workshop on Statistical Machine Translation.1 The system was used to produce the hypothesis translation of the newstest2012 evaluation set for RWTH's official submission to the evaluation campaign.

9.1.2 Experimental Setup

Corpus statistics for the WMT 2012 English–French parallel training data are given in Table 9.1. The parallel data originates from four different sources: the Europarl [Koehn 05],2 MultiUN [Eisele & Chen 10],3 WMT News Commentary, and 10⁹ corpora.4 We employ GIZA++ to train word alignments. The two trained alignments are heuristically merged to obtain a symmetrized word alignment for phrase extraction.

1 http://www.statmt.org/wmt12/translation-task.html
2 http://www.statmt.org/europarl/
3 http://www.euromatrixplus.net/multi-un/
4 The parallel 10⁹ corpus is often also referred to as WMT Giga French–English release 2.


Table 9.1: Data statistics of the preprocessed WMT English–French parallel training corpora.

                                                  English       French
Europarl + News Commentary   Sentences                    2.1 M
                             Running Words         57.6 M        63.3 M
                             Vocabulary           128.5 K       147.8 K
                             Singletons              5.1 K         5.4 K
  + 10⁹                      Sentences                   22.9 M
                             Running Words        624.0 M       728.6 M
                             Vocabulary              1.7 M         1.7 M
                             Singletons              0.8 M         0.8 M
  + UN                       Sentences                   35.4 M
                             Running Words        956.4 M     1 113.5 M
                             Vocabulary              2.0 M         1.9 M
                             Singletons              1.0 M         0.9 M

Table 9.2: Experimental results (cased) for the WMT English→French translation task. newstest2009 is used as development set.

                                    newstest2008    newstest2009    newstest2010    newstest2011
                                    Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
WMT English→French                   [%]    [%]      [%]    [%]      [%]    [%]      [%]    [%]
HPBT                                20.9   66.0     23.6   62.5     25.1   60.2     27.4   57.6
  + 10⁹ and UN                      22.5   63.2     25.4   59.8     27.0   57.1     29.9   53.9
    + LDC Gigaword v2               23.0   63.0     25.9   59.4     27.3   56.9     29.6   54.1
      + insertion model             23.0   62.9     26.1   59.2     27.2   56.8     30.0   53.7
        + noisy-or lexical scores   23.2   62.5     26.1   59.0     27.6   56.4     30.2   53.4
          + DWL                     23.3   62.5     26.2   58.9     27.9   55.9     30.4   53.2
            + IBM-1                 23.4   62.3     26.2   58.8     28.0   55.7     30.4   53.1
              + discrim. RO         23.5   62.2     26.7   58.5     28.1   55.9     30.8   52.8

The language model is created with the SRILM toolkit and is a standard 4-gram LM with modified Kneser-Ney smoothing. It is trained on the provided resources for the French language (Europarl, MultiUN, News Commentary, 10⁹, and monolingual News Crawl language model training data). We furthermore experiment with additional target-side data from the LDC French Gigaword Second Edition (LDC2009T28), which is an archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium. The LDC French Gigaword is permitted for constrained submissions in the WMT shared translation task.

The baseline system is a hierarchical phrase-based setup with at most two right-hand side non-terminals per hierarchical rule. We limit the recursion depth for hierarchical rules with a shallow-1 grammar (cf. Section 1.3.2).

The baseline models integrated into our system are: phrase translation probabilities and lexical translation probabilities (with RF word lexicons and tNorm(·) scoring; cf. Chapter 5) in both translation directions, word and phrase penalties, glue rule indicator, hierarchical and paste indicators, and the 4-gram language model. We utilize the cube pruning algorithm for decoding (cf. Chapter 3) and optimize the model weights with MERT towards Bleu. As a development set for MERT, we use newstest2009. We evaluate cased with Bleu and Ter.


9.1.3 Experimental Results

The experimental results for the WMT English→French news translation task are given in Table 9.2. Starting from the shallow hierarchical baseline setup on Europarl and News Commentary parallel data only (but Europarl, News Commentary, 10⁹, UN, and News Crawl data for LM training), we are able to improve translation quality considerably by first adopting more parallel (10⁹ and UN) and monolingual (French LDC Gigaword v2) training resources and then enhancing the system with additional models that are not included in the baseline already. We proceed with brief individual descriptions of the enhancements and report their respective effect in Bleu on the test sets.

10⁹ and UN (up to +2.5 points Bleu): While the amount of provided parallel data from Europarl and News Commentary sources is rather limited (around 2 M sentence pairs in total), the UN and the 10⁹ corpus each provide a substantial collection of further training material. By appending both corpora, we end up with roughly 35 M parallel sentences (cf. Table 9.1). We utilize this full amount of data in our system, but extract a phrase table with only lexical phrases (i.e. phrases without gaps) from the full parallel data. We add it as a second phrase table to the baseline system, with a binary feature that enables the system to reward or penalize the application of phrases from this table.

LDC Gigaword v2 (up to +0.5 points Bleu): The LDC French Gigaword Second Edition provides some more monolingual French resources. We include a total of 28.2 M sentences from both the AFP and APW collections in our LM training data.

insertion model (up to +0.4 points Bleu): We add an insertion model to the log-linear model combination. This model is designed as a means to avoid the omission of content words in the hypotheses. It is implemented as a phrase-level feature function which counts the number of inserted words. We apply the model in source-to-target and target-to-source direction. A target-side word is considered inserted based on lexical probabilities with the words on the foreign language side of the phrase, and vice versa for a source-side word. As thresholds, we compute individual arithmetic averages for each word from the vocabulary (cf. Chapter 8).

noisy-or lexical scores (up to +0.4 points Bleu): In our baseline system, the tNorm(·) lexical scoring variant is employed with a relative frequency (RF) lexicon model for phrase table smoothing. The single-word based translation probabilities of the RF lexicon model are extracted from word-aligned parallel training data. We exchange the baseline lexical scoring with a noisy-or lexical scoring variant tNoisyOr(·) (cf. Section 5.3.3).

DWL (up to +0.3 points Bleu): We augment our system with phrase-level lexical scores from discriminative word lexicon models in both source-to-target and target-to-source direction (cf. Chapter 5). The discriminative word lexicons are trained on News Commentary data only.

IBM-1 (up to +0.1 points Bleu): On News Commentary and Europarl data, we train IBM model 1 lexicons in both translation directions and also use them to compute phrase-level scores (cf. Chapter 5).

discrim. RO (up to +0.4 points Bleu): The modification of the grammar to a shallow-1 version restricts the search space of the decoder and is convenient to prevent overgeneration. In order not to be too restrictive, we reintroduce more flexibility into the search process by extending the grammar with dedicated reordering rules (cf. Section 6.3.1.2)

XP → ⟨XP∼0 XP∼1, XP∼1 XP∼0⟩
XP → ⟨XP∼0 XP∼1, XP∼0 XP∼1⟩ .    (9.1)

The upper rule in Equation (9.1) is a swap rule that allows adjacent lexical phrases to be transposed; the lower rule is added for symmetry reasons, in particular because sequences assembled with these rules are allowed to fill gaps within hierarchical phrases. A length constraint of ten is imposed on the number of source terminals spanned by an XP during decoding. We introduce two binary indicator features, one for each of the two rules in Equation (9.1). Supplementary to adding these rules, a discriminatively trained lexicalized reordering model is applied (cf. Section 6.4).

9.1.4 Comparison with Other Groups

The English→French hierarchical phrase-based translation system described in this chapter has been competitive with the best systems in the NAACL 2012 Seventh Workshop on Statistical Machine Translation. In a manual evaluation that was organized as part of the WMT evaluation campaign, the RWTH Aachen University submission [Huck & Peitz+ 12b] ranked second-best out of fifteen systems judged by human evaluators [Callison-Burch & Koehn+ 12]. The RWTH Aachen University submission also scored second-best in cased Bleu on the newstest2012 evaluation set. Table 9.3 gives the scores of the six primary submissions of individual participating groups which obtained the best scores with respect to cased Bleu. Scores and system outputs on newstest2012 are available at http://matrix.statmt.org/matrix/systems_list/1698.

Table 9.3: Comparison with other groups on the WMT English→French translation task (cased Bleu).

                                       newstest2012
WMT English→French                       Bleu [%]
LIMSI–CNRS                                 28.8
RWTH Aachen University                     28.6
Karlsruhe Institute of Technology          28.5
LIUM, University of Le Mans                28.1
University of Edinburgh                    28.0
Johns Hopkins University                   24.7

9.2 IWSLT TED Talk Translation Task

9.2.1 Overview

An evaluation campaign on the translation of TED talks5 has been organized at the International Workshop on Spoken Language Translation 2011 [Federico & Bentivogli+ 11]. We describe the hierarchical phrase-based translation system developed at RWTH Aachen University for the English→French MT (i.e., text translation) track of the IWSLT 2011 evaluation campaign.6

5 http://www.ted.com/talks
6 http://iwslt2011.org/doku.php?id=06_evaluation


Table 9.4: Data statistics of the preprocessed IWSLT English–French parallel training corpus. The data includes TED, Europarl, and News Commentary.

                     English      French
Sentences                   2.0 M
Running words         54.3 M      59.9 M
Vocabulary             136 K       159 K
Singletons              56 K        61 K

This system was used to produce the hypothesis translation of the 2011 evaluation set for RWTH's official submission to the evaluation campaign.

A number of enhancements are employed in order to improve on the baseline translation quality, including lexical smoothing with discriminative word lexicons, adaptation by means of a second in-domain TED phrase table, a triplet lexicon model, and the discriminative lexicalized reordering extension. Empirical results reveal that the application of these methods yields considerable gains over a hierarchical baseline system in terms of Bleu and Ter.

9.2.2 Experimental Setup

For the IWSLT English→French TED talk translation task, the translation model is trained on TED [Cettolo & Girardi+ 12],7 Europarl, and News Commentary data. Statistics on the bilingual data are presented in Table 9.4. This parallel training data contains 107 K sentences from in-domain TED sources with 2.1 M English and 2.2 M French running words. We employ symmetrized GIZA++ word alignments, a shallow-1 grammar, MERT for optimization, and cube pruning for search. A 4-gram language model is trained with the SRILM toolkit on the target part of the bilingual training data (TED, Europarl, News Commentary) plus additional monolingual News Crawl data.

9.2.3 Experimental Results

Experimental results on the IWSLT English→French TED talk translation task are given in Table 9.5. The shallow-1 baseline hierarchical system is incrementally augmented with monolingual data selection, alternative lexical smoothing using discriminative word lexicons, an improved language model smoothing technique, a second in-domain phrase table, a triplet lexicon model, and reordering extensions. We proceed with individual descriptions of these methods and their effect in terms of translation quality on the test set. Overall we are able to improve the baseline by +1.8 Bleu and −2.1 Ter on the test set.

mooreLM (+0.3 points Bleu): We apply the monolingual data selection technique by [Moore & Lewis 10]. Instead of employing all of the French News Crawl data for language model training (as in the baseline setup), we select 1/4 of it (see the sketch after this list). Monolingual data selection enables us to adapt our language model to the domain, i.e. the style and topics of TED talks, while at the same time also bringing about a reduction of the language model size, i.e. the number of n-grams.

DWL (+0.6 points Bleu): We replace the baseline lexical smoothing by phrase-level lexical scores from sparse discriminative word lexicons in source-to-target and target-to-source direction (cf. Chapter 5). We found smoothing with discriminative word lexicon models to yield the best results among several lexical smoothing methods. For comparison, the result with IBM model 1 is given in Table 9.5 as well.

7 https://wit3.fbk.eu


Table 9.5: Experimental results (cased) for the IWSLT English→French TED MT task.

                                     Dev               Test
                                  Bleu    Ter       Bleu    Ter
IWSLT English→French               [%]    [%]        [%]    [%]
HPBT                              25.7   58.6       29.3   52.8
  + mooreLM                       26.0   58.1       29.6   51.8
    + IBM-1                       26.3   58.1       30.0   52.0
    + DWL                         26.3   58.0       30.2   51.8
      + opt. KN LM                26.5   57.9       30.3   51.3
        + TED PT                  27.2   57.2       30.7   51.1
          + s2t TED triplets      27.5   57.0       30.8   50.9
            + discrim. RO         27.4   57.0       31.1   50.7


opt. KN LM (+0.1 points Bleu): [Sundermeyer & Schlüter+ 11] presented a way to optimize the values of the Kneser-Ney discount parameters with the improved RProp algorithm [Igel & Hüsken 03]. We apply their method to a machine translation task and train our French language model with optimized smoothing parameters.

TED PT (+0.4 points Bleu): One of the main challenges of the 2011 IWSLT evaluation campaign is adaptation to style and topic of the TED talks. We tackle the problem by augmenting our system with an additional phrase table trained on in-domain TED data only. Phrases from the TED phrase table are marked with a binary feature.

s2t TED triplets (+0.1 points Bleu): We also apply a path-aligned triplet lexicon model for style and topic adaptation. The TED triplet model is trained on the same parallel data as the TED TM. This model is integrated in source-to-target direction only. It takes the full source sentence context into account (cf. Chapters 4 and 5).

discrim. RO (+0.3 points Bleu): We extend the grammar with two specific reordering rules

XP → ⟨XP∼0 XP∼1, XP∼1 XP∼0⟩
XP → ⟨XP∼0 XP∼1, XP∼0 XP∼1⟩    (9.2)

and a discriminatively trained lexicalized reordering model (cf. Section 6.4), in the same way as for the WMT setup (Section 9.1.3).
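The Moore & Lewis selection referenced in the mooreLM item ranks candidate sentences by the difference of their cross-entropy under an in-domain language model and under a general-domain language model and keeps the lowest-scoring fraction. A hedged Python sketch; the cross_entropy scoring interface is an assumption for illustration:

```python
def moore_lewis_select(sentences, indomain_lm, general_lm, fraction=0.25):
    """Keep the 'fraction' of sentences with the lowest cross-entropy
    difference H_indomain(s) - H_general(s), i.e. the sentences that look
    most like the in-domain data relative to the general data."""
    def score(sentence):
        return indomain_lm.cross_entropy(sentence) - general_lm.cross_entropy(sentence)
    ranked = sorted(sentences, key=score)
    return ranked[:int(len(ranked) * fraction)]
```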


Part III

Lightly-supervised Training


10. Lightly-supervised Training for Hierarchical Phrase-based Translation

In this chapter, we apply lightly-supervised training to a hierarchical phrase-based statistical machine translation system. We employ bitexts that have been built by automatically translating large amounts of monolingual data as additional parallel training corpora. Different ways of using this synthetic data to improve the system are explored.

Our results show that integrating a second phrase table with only non-hierarchical phrases extracted from the automatically generated bitexts is a reasonable approach. The translation performance matches the result we achieve with a joint extraction on all training bitexts, while the system is kept smaller due to a considerably lower overall number of phrases.

10.1 Synthetic Training Data

In lightly-supervised training scenarios, the baseline source-to-target statistical machine translation system is augmented with additional synthetic parallel data that is produced by automatically translating either source language monolingual data to the target language or target language monolingual data to the source language. The former is done either with the baseline system itself or with another source-to-target translation system; the latter is done with a reverse (target-to-source) system. The reverse system can naturally only be trained on the same preexisting parallel resources between source and target as the source-to-target baseline system. Its language model data, however, is not composed only of the respective side of the parallel resources and thus differs. The method in fact goes by the name of lightly-supervised training because the topics that are covered in the monolingual corpora that are being translated may potentially also be covered by parts of the language model training data of the system which is used to translate them. This can be considered a form of light supervision.1 Figure 10.1 illustrates source-to-target lightly-supervised training, where source-side monolingual data is automatically translated to the target language to obtain synthetic parallel data. Figure 10.2 illustrates target-to-source lightly-supervised training, where target-side monolingual data is translated.

The standard purpose of lightly-supervised training is adaptation. With sufficient amounts of reliable in-domain monolingual data, either in source or target language, a generic or out-of-domain baseline system can be trained towards aspects of topic and style of the domain under consideration. Extracting a phrase table from the baseline parallel data plus the new synthetic data results in new phrases (due to the phrase segmentation, choice of translation options, and reordering performed by the system that is used for the production of the synthetic data) and modified scores for phrases that have already been available before (due to the different number of occurrences).

1 We loosely apply the term lightly-supervised training to mean the process of utilizing a machine translation system to produce additional bitexts and using them as training data, and refer to the automatically produced bilingual corpora as synthetic data.


Figure 10.1: Illustration of lightly-supervised training (source-to-target). Additional source-side monolingual resources are utilized to improve the translation model.

Figure 10.2: Illustration of lightly-supervised training (target-to-source). Additional target-side monolingual resources are utilized to improve the translation model.


The vocabulary size remains unchanged. In the work by [Schwenk 08], a crucial ingredient of the lightly-supervised training pipeline is consequently the integration of a large supplementary bilingual dictionary with high coverage, including morphological variants. The lightly-supervised training procedure makes it possible to acquire phrases that contain words from this dictionary, which would otherwise be out-of-vocabulary input words, and to learn reliable translation scores for phrase table entries which originate from the dictionary.

We investigate the impact of employing large amounts of synthetic parallel data as additional training data for a hierarchical machine translation system. The synthetic parallel data is created by automatically translating a monolingual source language corpus. We study several different ways of incorporating synthetic training data into the hierarchical system. The basic techniques we adopt are the use of multiple phrase tables and a distinction of the hierarchical and the non-hierarchical (i.e. lexical) part of the phrase table. We report experimental results on the NIST Arabic→English translation task and show that lightly-supervised training yields considerable gains over the baseline.

10.2 Combining Phrase Tables for Lightly-supervised Training

The most straightforward way of trying to improve the baseline with lightly-supervised training is to concatenate the human-generated parallel data and the synthetic data and to jointly extract phrases from the concatenated data (after having trained word alignments for the synthetic bitexts as well). This method is simple and can usually be expected to be effective.

There may, however, be two drawbacks. First, the reliability and the amount of parallel sentences may differ between the human-generated and the synthetic part of the training data. Second, if we incorporate large amounts of additional synthetic data, the amount of extracted phrases can become much larger. This holds in particular in the case of hierarchical phrases with gaps. For reasons of decoding efficiency, we want to avoid blowing up our phrase table size without an appropriate effect on translation quality.

We tackle the first drawback by running separate phrase extractions on the two corpora in order to be able to distinguish and weight phrases according to their origin during decoding. They can be weighted against each other with either a simple additional binary indicator feature, or by keeping the lexical and phrase translation features for the two phrase inventories separate, with separate scaling factors tuned for them. The weights are optimized with MERT along with all other scaling factors. In the experiments, we will compare a joint extraction to the usage of two separate phrase tables.

We tackle the second drawback by including only lexical phrases from the synthetic data, no hierarchical phrases with gaps. Phrase-based machine translation systems are usually able to correctly handle local context dependencies, but often have problems in producing a fluent sentence structure across long distances. Thus, since our synthetic parallel data is created with a phrase-based system, it is an intuitive supposition that using hierarchical phrases with gaps extracted from synthetic data, in addition to the hierarchical phrases extracted from the presumably more reliable human-generated bitexts, does not increase translation quality. We will empirically check whether including phrases with gaps from synthetic data is beneficial or not.
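Operationally, this amounts to filtering the synthetic phrase table down to entries without gaps and marking them with a binary indicator feature so that the decoder can weight them against the baseline phrases. A simplified sketch; the phrase table entry representation and the non-terminal notation are assumptions for illustration:

```python
def is_hierarchical(tokens, nonterminal="X"):
    """A phrase side counts as hierarchical if it contains a gap, here assumed
    to be written as a non-terminal token X, X~0, X~1, ..."""
    return any(tok == nonterminal or tok.startswith(nonterminal + "~") for tok in tokens)

def lexical_only_with_indicator(entries):
    """entries: iterable of (source_tokens, target_tokens, feature_values).
    Keep only lexical phrases (no gaps) and append a binary indicator feature
    that distinguishes synthetic phrases from baseline phrases in decoding."""
    for src, tgt, feats in entries:
        if is_hierarchical(src) or is_hierarchical(tgt):
            continue                         # drop hierarchical phrases with gaps
        yield src, tgt, feats + [1.0]        # mark as coming from the synthetic table
```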

10.3 Experiments

10.3.1 Baseline System

The baseline hierarchical system is trained using a human-generated parallel corpus of 2.5 M Arabic–English sentence pairs. Word alignments in both directions are produced with GIZA++ and symmetrized.


We integrate our standard baseline features and run cube pruning in decoding, with the depth of the hierarchical recursion restricted to one by using a shallow grammar. The scaling factors of the log-linear model combination are optimized towards Bleu with MERT on the MT06 corpus. MT08 is employed as a test set. Detailed statistics of the human-generated parallel training corpus have been given previously in Table 3.2 (Section 3.1).

10.3.2 Arabic–English Synthetic Data

The synthetic data that we integrate has been created by automatic translation of parts of the Arabic LDC Gigaword corpus (mostly from the HYT collection) with a standard phrase-based system.2 We thus in fact conduct a cross-system and cross-paradigm variant of lightly-supervised training.

The score computed by the phrase-based decoder for each translation has been normalized with respect to the sentence length and used to select the most reliable sentence pairs. Word alignments for the synthetic data are produced in the same way as for the baseline bilingual training data. We report the statistics of the synthetic data in Table 10.1.
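The selection of reliable sentence pairs by length-normalized decoder score can be sketched as follows. This is a simplified illustration; the concrete threshold and the exact normalization used for the data at hand are assumptions:

```python
def select_reliable(translations, threshold):
    """translations: iterable of (source, hypothesis, decoder_score) triples.
    Keep the sentence pairs whose decoder score, normalized by the length of
    the hypothesis, is at least as large as the chosen threshold."""
    for source, hypothesis, score in translations:
        length = max(len(hypothesis.split()), 1)
        if score / length >= threshold:
            yield source, hypothesis
```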

10.3.3 Phrase Tables

We extract three different phrase tables: one from the baseline human-generated parallel data only, one from the synthetic data only, and one joint phrase table from the concatenation of the baseline data and the synthetic data. We denote the different extractions as baseline, synthetic, and joint, respectively.

The conventional restrictions are applied for phrase extraction under all conditions, i.e. a maximum length of ten words on source and target side for lexical phrases, a length limit of five on source side and ten on target side for hierarchical phrases, and at most two right-hand side non-terminals per rule, which are not allowed to be adjacent on the source side. Singleton hierarchical phrases are discarded.

Statistics on the phrase table sizes are presented in Table 10.2.3 In total, the joint extraction results in almost three times as many phrases as the baseline extraction. The extraction from only the synthetic data results in more than twice as many hierarchical phrases as from the baseline data. The sum of the number of hierarchical phrases from separate baseline and synthetic extraction is very close to the number of hierarchical phrases from the joint extraction. If we discard the hierarchical phrases extracted from the synthetic data and use the lexical part of the synthetic phrase table (27.3 M phrases) as a second phrase table in addition to the baseline phrase table (67.0 M phrases), the overall number of phrases is increased by only 41% compared to the baseline system.

10.3.4 Arabic→English Experimental Results

The empirical evaluation of all our systems is presented in Table 10.3. When we combine the full baseline phrase table with the synthetic phrase table or the lexical part of it, we either use common scaling factors for their source-to-target and target-to-source lexical and phrase translation feature functions, or we use common scaling factors but mark entries from the synthetic table with an additional binary feature, or we optimize the four translation features separately for each of the two tables as part of the log-linear model combination.

2 Translating the monolingual Arabic data has been done by LIUM, Le Mans, France. We thank Holger Schwenk for kindly providing the translations.

3 The phrase tables have been filtered towards the phrases needed for the translation of a given collection of test corpora.


Table 10.1: Data statistics of the preprocessed Arabic–English synthetic training corpus after selection of the most reliable sentence pairs.

                     Arabic       English
Sentences                   4.7 M
Running words        121.4 M      134.2 M
Vocabulary             306 K        238 K
Singletons             131 K        102 K

Table 10.2: Phrase table statistics for lightly-supervised training of Arabic→English systems. The phrase tables have been filtered towards a larger set of test corpora containing a total of 2.3 M running words.

                                                 number of phrases
                                          lexical   hierarchical     total
extraction from baseline data              19.8 M       47.2 M      67.0 M
extraction from synthetic data             27.3 M      115.6 M     142.9 M
phrases present in both tables             15.0 M       40.1 M      55.1 M
joint extraction baseline + synthetic      32.1 M      166.5 M     198.6 M

Table 10.3: Experimental results (cased) with lightly-supervised training for the NIST Arabic→English translation task.

                                                                   MT06 (Dev)       MT08 (Test)
                                                                  Bleu    Ter      Bleu    Ter
NIST Arabic→English                                                [%]    [%]       [%]    [%]
HPBT baseline                                                     44.1   49.9      44.4   49.4
HPBT synthetic only                                               45.3   48.8      45.2   49.1
joint extraction baseline + synthetic                             45.6   48.7      45.4   49.1
baseline hierarchical phrases + synthetic lexical phrases         45.1   49.1      45.2   49.2
baseline hierarchical phrases + joint extraction lexical phrases  45.3   48.7      45.3   49.1
baseline + synthetic lexical phrases                              45.3   48.9      45.3   49.0
baseline + synthetic lexical phrases (with binary feature)        45.3   48.8      45.4   49.0
baseline + synthetic lexical phrases (separate scaling factors)   45.3   48.9      45.0   49.3
baseline + synthetic full table                                   45.6   48.6      45.1   48.9
baseline + synthetic full table (with binary feature)             45.5   48.6      45.2   48.8
baseline + synthetic full table (separate scaling factors)        45.5   48.7      45.3   49.0


Including the synthetic data leads to a substantial gain on the MT08 test set of up to +1.0 Bleu. We observe differences of around ±0.4 Bleu in translation quality amongst the different ways of enhancing the baseline with synthetic data, the two best variants being joint extraction as well as a second phrase table with lexical phrases from the synthetic data, distinguished by means of a binary feature. The phrase inventory for the system that combines the baseline phrase table with only the lexical phrases from the synthetic data contains far fewer phrases than the joint extraction phrase table and yet is able to attain the same translation quality. We compared the decoding speed of these two setups and observed that the system with fewer phrases is clearly faster (5.5 vs. 2.6 words per second, measured on MT08). The memory requirements of the systems do not differ greatly because we are using a binarized representation of the phrase table with on-demand loading. All setups consume less than 16 gigabytes of RAM.

10.4 Related Work

[Ueffing & Haffari+ 07] introduced semi-supervised learning methods for the effective use of monolingual data in order to improve the translation quality of statistical machine translation systems.

Large-scale lightly-supervised training for statistical machine translation as we define it here was first carried out by [Schwenk 08]. Schwenk translates a large amount of monolingual French data with an initial Moses baseline system into English. He uses the resulting synthetic bitexts as additional training corpora to improve the baseline French→English system. In Schwenk's original work, an additional bilingual dictionary is added to the baseline. With lightly-supervised training, Schwenk achieves improvements of around one Bleu point over the baseline. In a later work, [Schwenk & Senellart 09] applied the same method for translation model adaptation on an Arabic→French task with gains of up to +3.5 Bleu.

[Li & Eisner+ 11] presented an approach that is very similar to lightly-supervised training. They conduct their experiments with a hierarchical phrase-based system and translate monolingual target language data into the source language. [Lambert & Schwenk+ 11] investigated a large variety of lightly-supervised training settings on the French–English language pair in both directions. They draw some interesting conclusions, in particular that it is better to add automatically translated texts to the translation model training data which have been translated from the target to the source language (instead of from the source to the target language), and that using the word alignments that are produced by the decoder during the generation of the unsupervised data performs roughly as well as using GIZA++ word alignments. [Lambert & Schwenk+ 11] also proposed to make use of an automatically constructed dictionary which provides unobserved morphological forms of nouns, verbs, or adjectives. They achieve a gain of about +0.5 Bleu over a competitive baseline.

Combining multiple phrase tables has been investigated for domain adaptation by [Foster & Kuhn 07] and [Koehn & Schroeder 07] before. [Heger & Wuebker+ 10b] exploited the distinction between hierarchical and lexical phrases in a similar way as we do. They trained phrase translation probabilities with forced alignment using a conventional phrase-based system [Wuebker & Mauser+ 10] and employed them for the lexical phrases, while the hierarchical phrases remained untouched.

10.5 Summary

We presented several approaches for applying lightly-supervised training to hierarchical phrase-based machine translation. Using the additional automatically produced bitexts, we have been able to obtain considerable gains compared to the baseline on the NIST Arabic→English translation task. We showed that a joint phrase extraction from human-generated and automatically generated parallel training data is not required to achieve improvements.


The same translation quality can be reached by adding a second phrase table with only lexical phrases extracted from the automatically created bitexts. The overall amount of phrases can be kept much smaller with this method.


11. Pivot Lightly-supervised Training

In this chapter, we investigate large-scale lightly-supervised training with a pivot language. We augment a baseline statistical machine translation system that has been trained on human-generated parallel training corpora with large amounts of additional synthetic parallel data; but instead of creating this synthetic data from monolingual source language data with the baseline system itself (or another machine translation system), or from target language data with a reverse system, we employ a parallel corpus of target language data and data in a pivot language. The pivot language data is automatically translated into the source language, resulting in a trilingual corpus with a synthetic source language side. We augment our baseline system with the synthetic source–target parallel data.

Experiments are conducted for the German–French language pair using the WMT newstest sets for development and testing. Synthetic data is obtained by translating the English side of the English–French 10⁹ corpus to German. We first explore the approach in standard phrase-based translation, then extend it to hierarchical translation. With careful system design, we are able to achieve improvements of up to +0.4 points Bleu over the baseline in phrase-based translation, and up to +0.5 points Bleu in hierarchical translation.

11.1 Motivation

Pivot language approaches for statistical machine translation are typically applied in scenarios where no bilingual resources exist to build a translation system between a source and a target language. For many under-resourced language pairs, no human-generated parallel data of source and target texts is available. There may however still be bitexts at hand between both the source language and a third pivot language, as well as other bitexts between the same pivot language and the target language. Pivot translation employs such bitexts to bridge from source to target across the pivot language, thus effectively providing source-to-target translation [Utiyama & Isahara 07, Wu & Wang 09]. Figure 11.2 illustrates a typical transfer approach to pivot translation, in contrast to normal direct translation as shown in Figure 11.1. The transfer pivot approach simply pipes two machine translation systems one after another, a source-to-pivot and a pivot-to-target machine translation system.
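
The transfer scheme amounts to nothing more than piping two decoders. The following minimal Python sketch is only an illustration; translate_src_to_pivot and translate_pivot_to_trg are hypothetical stand-ins for a source→pivot and a pivot→target SMT system, not functions of any particular toolkit:

    # Minimal sketch of the transfer pivot approach: two SMT systems piped
    # one after another (hypothetical translate_* callables).
    def transfer_pivot_translate(source_sentence,
                                 translate_src_to_pivot,
                                 translate_pivot_to_trg):
        # First stage: translate the input into the pivot language (single-best).
        pivot_hypothesis = translate_src_to_pivot(source_sentence)
        # Second stage: translate the intermediate pivot-language text into the target language.
        return translate_pivot_to_trg(pivot_hypothesis)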

Beyond low-resource scenarios, the pivot translation method can also be advantageous in cases where many translation systems between a large number of languages are to be built. To save time and cost, it may be convenient to resort to a pivot approach and to set up 2(n − 1) systems between each of n − 1 languages and a common pivot language, instead of setting up n(n − 1) systems between all pairs of n languages [Koehn & Birch+ 09].

We utilize the pivot translation paradigm with a different motivation in this work. We investigate a source–target language combination—German and French—that does not suffer from a lack of resources. A noticeable amount of parallel training data and monolingual target language data exists for this language pair, from which we build a well-performing German→French baseline statistical machine translation system. We however argue that our system could be improved if we were able to also take advantage of the extensive amount of parallel resources of both of these languages with third languages, in particular with English. English–German and English–French parallel corpora may contain additional information that is not present in the German–French data. We take a pivot lightly-supervised training approach to make our German→French setup learn from English–German and English–French resources. From English–German corpora, we set up an English→German translation system. We run this system on the English side of a large English–French parallel corpus. The German output of this translation step and the French side of the English–French parallel corpus constitute a synthetic bitext which can be used as supplementary training material for our German→French system. Figure 11.3 visualizes the pivot lightly-supervised training approach with synthetic bitexts created via translation of pivot language data into the source language.

Figure 11.1: Illustration of direct translation. Source–target parallel data is employed to build a source→target direct SMT system (typical situation).

Figure 11.2: Illustration of the transfer pivot approach. No source–target parallel data is available, but parallel data of both source and target language with a third language can be used for pivoting.

Figure 11.3: Illustration of pivot lightly-supervised training (pivot-to-source). Additional parallel resources of each source and target language with a pivot language are utilized to improve the translation model.

11.2 Related Work

Our method combines techniques from two topics: pivot translation and lightly-supervised training. Many pivot translation approaches have been proposed in the past. [Wu & Wang 09] and [Utiyama & Isahara 07] provide good overviews of the field. More recent publications are e.g. [Leusch & Max+ 10], [Cettolo & Bertoldi+ 11], and [Zhu & He+ 14], to mention some. The synthetic method [Wu & Wang 09] comes closest to what is done by us. We would like to particularly point to the work by [Cohn & Lapata 07] and by [Callison-Burch & Koehn+ 06]. [Cohn & Lapata 07] adopted the pivot translation by triangulation method to improve existing baselines. This idea is quite similar to our pivot lightly-supervised training approach, which employs synthetic data. [Callison-Burch & Koehn+ 06] suggested an interesting paraphrasing technique for statistical machine translation that rests upon parallel data with a pivot language. The effect should de facto be comparable.


An overview of literature on lightly-supervised training for machine translation has been given in Section 10.4 already.

11.3 Synthetic Training Data by Pivoting

Pivot lightly-supervised training borrows the idea of improving an existing system with additional synthetic parallel data from previous lightly-supervised training approaches. In contrast to these, the synthetic data does not originate from monolingual data, but from parallel corpora of either source or target language with a pivot language. The pivot language data is being translated. This in turn resembles the synthetic method in pivot translation approaches. Existing pivot translation approaches however do not aim at improving systems, but at creating new ones from scratch for under-resourced language pairs.

A precondition for being able to perform pivot lightly-supervised training is the availability of a rich amount of multilingual data. A parallel corpus between source and target language is required in order to train the baseline system. We need parallel data between source or target (in our experiments: target) language and a pivot language, which is used to produce the synthetic data by automatically translating its pivot language side to target or source, respectively (in our experiments: source). The translation of pivot language data is done with a system that is trained on parallel data between the pivot language and the language under consideration for the side of the synthetic data that needs to be created automatically. Let us assume that a human-generated target–pivot corpus is at hand, as is the case in our experiments. The pivot data then has to be translated into the source language. We thus need pivot–source human-generated parallel data to train a system that can conduct this translation.
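
The data flow for creating the synthetic bitext can be summarized in a few lines. The sketch below is a schematic illustration under the assumptions of our setup (a human-generated target–pivot bitext and a pivot→source translation system); the function names are illustrative placeholders, not part of a specific toolkit:

    # Schematic sketch of synthetic data creation for pivot lightly-supervised
    # training: the pivot side of a target-pivot bitext is machine-translated
    # into the source language and paired with the original target side.
    def build_pivot_synthetic_bitext(target_pivot_bitext, translate_pivot_to_source):
        synthetic_bitext = []
        for target_sentence, pivot_sentence in target_pivot_bitext:
            synthetic_source = translate_pivot_to_source(pivot_sentence)
            synthetic_bitext.append((synthetic_source, target_sentence))
        return synthetic_bitext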

Just as lightly-supervised training, pivot lightly-supervised training may serve the purpose of adaptation. Adaptation is not its main goal, though. In our experiments, we will even be able to show that pivot lightly-supervised training yields improvements in a setting where the standard lightly-supervised training approach is not effective. This can easily be investigated empirically by doing a comparison with lightly-supervised training on the non-pivot side of the parallel corpus which is used for the creation of the synthetic data. The crucial key to the effectiveness of an incorporation of synthetic source–target training data that results from translation of pivot data into the source language is the pivot→source translation system, i.e. mainly the pivot–source parallel data it is trained with. Standard lightly-supervised training without pivot language can merely benefit from high-quality monolingual resources to refine the phrase translation model. In the pivot approach, translation options and vocabulary of the source language with a pivot language can be bridged via the synthetic data to the target language.¹ By adding the bridged synthetic bitexts, the source→target system not only learns from the contents of the corpus the synthetic data originates from, but also from bilingual information that is represented in the translation model of the pivot→source system.

11.4 Parallel Resources

We now specify the parallel training corpora we utilize for an empirical evaluation of pivot lightly-supervised training on a German→French translation task. The pivot language is English. To train the German→French baseline system, we use 2.0 M sentence pairs that are partly taken from the Europarl corpus and have partly been collected within the Quaero project. Statistics of the preprocessed data can be found in the direct entry of Table 11.6 (first three lines).

¹Or in general: also with "source" and "target" interchanged in this statement. We restrict our presentation to the variant we tried in our experiments.


Table 11.1: Data statistics of the preprocessed English–French WMT 10⁹ corpus. Some noisy parts of the raw corpus have been removed beforehand.

                        English      French
    Sentences                  17.4 M
    Running Words       484.4 M      573.8 M
    Vocabulary             1.4 M        1.4 M

Table 11.2: Data statistics of the preprocessed parallel training corpus for the French→German setup. Note that no compound splitting has been applied to the German target-side data.

                         French      German
    Sentences                   2.0 M
    Running Words        53.1 M       45.8 M
    Vocabulary          145.0 K      380.4 K

Table 11.3: Data statistics of the preprocessed parallel training corpus for the English→German setup. Note that no compound splitting has been applied to the German target-side data.

                        English      German
    Sentences                   1.9 M
    Running Words        50.6 M       48.4 M
    Vocabulary          123.5 K      387.6 K

Table 11.4: Translation quality (cased) with the French→German phrase-based system. newstest2009 is used as development set.

                     newstest2008    newstest2009    newstest2010    newstest2011
    French→German    Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
                     [%]     [%]     [%]     [%]     [%]     [%]     [%]     [%]
    PBT direct       15.8    69.8    15.1    70.2    15.4    68.1    15.0    70.2

Table 11.5: Translation quality (cased) with the English→German phrase-based system. newstest2009 is used as development set.

                      newstest2008    newstest2009    newstest2010    newstest2011
    English→German    Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
                      [%]     [%]     [%]     [%]     [%]     [%]     [%]     [%]
    PBT direct        14.7    68.7    14.7    68.3    15.8    64.7    14.8    67.6


Table 11.6: Data statistics of the preprocessed German–French (direct and synthetic) parallel training corpora. German compound words have been split.

                                                                   German      French
    direct                         Sentences                             2.0 M
                                   Running Words                    47.3 M      53.1 M
                                   Vocabulary                      196.3 K     145.0 K
    synthetic (non-pivot)          Sentences                            17.4 M
                                   Running Words                   494.3 M     573.8 M
                                   Running Words (w/o Unknowns)    478.8 M           –
                                   Vocabulary                        1.3 M       1.4 M
                                   Vocabulary (w/o Unknowns)       123.0 K           –
    synthetic (pivot)              Sentences                            17.4 M
                                   Running Words                   450.9 M     573.8 M
                                   Running Words (w/o Unknowns)    434.2 M           –
                                   Vocabulary                        1.3 M       1.4 M
                                   Vocabulary (w/o Unknowns)       128.6 K           –
    synthetic (non-pivot) + direct Sentences                            19.4 M
                                   Running Words                   541.6 M     626.9 M
                                   Running Words (w/o Unknowns)    526.1 M           –
                                   Vocabulary                        1.4 M       1.4 M
                                   Vocabulary (w/o Unknowns)       201.1 K           –
    synthetic (pivot) + direct     Sentences                            19.4 M
                                   Running Words                   498.2 M     626.9 M
                                   Running Words (w/o Unknowns)    481.5 M           –
                                   Vocabulary                        1.4 M       1.4 M
                                   Vocabulary (w/o Unknowns)       210.8 K           –

The preprocessing pipeline includes splitting of German compound words with a frequency-based method [Koehn & Knight 03]. We apply compound splitting to German text whenever German is the source language, but not for setups where German is the target language.
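
For illustration, a heavily simplified sketch of such a frequency-based splitting heuristic is given below: a word is split into two known parts if the geometric mean of the part frequencies exceeds the frequency of the unsplit word. This is only a toy rendering of the idea in [Koehn & Knight 03] (the actual method also considers multiple parts and filler elements such as the German "s"), and the counts and the minimum part length are made up:

    from math import sqrt

    def split_compound(word, freq, min_part_len=4):
        # Frequency-based compound splitting heuristic (simplified, single split point).
        best, best_score = [word], freq.get(word, 0)
        for i in range(min_part_len, len(word) - min_part_len + 1):
            left, right = word[:i], word[i:]
            if left in freq and right in freq:
                score = sqrt(freq[left] * freq[right])  # geometric mean of part counts
                if score > best_score:
                    best, best_score = [left, right], score
        return best

    # Toy usage with invented counts:
    counts = {"hausboot": 10, "haus": 5000, "boot": 3000}
    print(split_compound("hausboot", counts))  # -> ['haus', 'boot']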

The synthetic data is produced by translating the English side of the English–French 10⁹ corpus as provided for the translation task of the Workshop on Statistical Machine Translation. Data statistics are given in Table 11.1. Some noisy parts of the raw corpus have been removed beforehand by means of an SVM classifier in a fashion comparable to the filtering technique described by [Herrmann & Mediani+ 11].

The English→German statistical machine translation system with which we translate the English side of the 10⁹ corpus to German is trained with the English–German parallel resources that have been provided for the 2011 WMT shared translation task (constrained track). Statistics of the preprocessed corpus are given in Table 11.3.

11.5 Phrase-based Translation System

We apply a phrase-based translation system which is an in-house implementation of the state-of-the-art decoder as described by [Zens & Ney 08]. A standard set of models is used, comprising phrase translation probabilities and lexical translation probabilities in both directions, word and phrase penalty, a distance-based distortion model, an n-gram target language model, and three simple count-based binary features. Parameter weights are optimized with the downhill simplex algorithm [Nelder & Mead 65] on the word graph. The 4-gram language models in all our setups are trained over large collections of monolingual data.
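
The principle of tuning the log-linear feature weights can be illustrated with a toy n-best reranking objective. The sketch below is not our actual optimization procedure (which operates on the word graph and on the evaluation measure used for tuning); it merely shows a downhill simplex search over weights with SciPy's Nelder-Mead implementation, on made-up feature values and error counts:

    import numpy as np
    from scipy.optimize import minimize

    # Toy n-best lists: per sentence a list of (feature vector, error count) pairs.
    # All numbers are invented for illustration.
    nbest = [
        [(np.array([-4.2, -1.0, 3.0]), 2), (np.array([-3.9, -2.5, 2.0]), 5)],
        [(np.array([-6.0, -0.5, 4.0]), 1), (np.array([-5.5, -1.5, 5.0]), 4)],
    ]

    def total_error(weights):
        # Error of the first-best hypotheses selected under the current weights.
        error = 0
        for hyps in nbest:
            scores = [float(np.dot(weights, h)) for h, _ in hyps]
            error += hyps[int(np.argmax(scores))][1]
        return error

    # Downhill simplex search over the log-linear feature weights.
    result = minimize(total_error, x0=np.ones(3), method="Nelder-Mead")
    print(result.x, total_error(result.x))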


11.6 Experiments

In our experiments, we work with the standard WMT newstest sets from the years 2008 to 2011. These sets are multi-parallel corpora. Each of the sets exists in a version in each of the three languages that are of relevance to us: German, French, English. We employ newstest2009 as development set in all setups; newstest2008, newstest2010, and newstest2011 are used for testing.

To evaluate the pivot lightly-supervised training approach, we first conduct several contrastive experiments in standard phrase-based translation and then apply it to hierarchical translation.

11.6.1 Systems for Producing Synthetic Data

We first measure the translation performance of two direct phrase-based translation systems: a French→German system that we run on the French 10⁹ data to produce the non-pivot synthetic data for lightly-supervised training without pivot language (for comparison with the pivot approach), and an English→German system that we run on the English 10⁹ data to produce the pivot synthetic data.

French→German: The French→German PBT system is based on the parallel data from Table 11.2. Translation results are shown in Table 11.4.

English→German: The English→German PBT system is based on the parallel data from Table 11.3. Translation results are shown in Table 11.5.

11.6.2 Human-generated and Synthetic Training Corpora

Table 11.6 contains statistics for the following German–French corpora:

direct: The human-generated parallel data.

synthetic (non-pivot): The non-pivot synthetic data produced with the French→German system.

synthetic (pivot): The pivot synthetic data produced with the English→German system.

synthetic (non-pivot) + direct: A concatenation of non-pivot synthetic data and human-generated parallel data.

synthetic (pivot) + direct: A concatenation of pivot synthetic data and human-generated data.

For the automatically generated German data, we give the overall number of running words, but also the number of running words without unknowns. Unknowns are words that result from source-side words being out-of-vocabulary to the translation system. These are carried over to the target side by means of an identity mapping, but are marked in a special way. We keep them when we extract phrases from the synthetic data, but do not allow for the usage of phrase table entries that contain unknowns in search. We remove such entries from the phrase table. The German vocabulary of the system with which the synthetic data is created is an upper bound for the vocabulary size without unknowns on the German side of the synthetic data.
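
The exclusion of entries with marked unknowns can be implemented as a simple filter over the phrase table. The sketch below assumes, purely for illustration, a plain-text "source ||| target ||| scores" line format and a dedicated marker token; the actual marking scheme and file format in our systems differ:

    UNKNOWN_MARKER = "$UNK$"  # illustrative marker token, not the actual one

    def filter_unknown_entries(phrase_table_lines, marker=UNKNOWN_MARKER):
        # Drop entries whose source or target side contains a marked unknown,
        # so that they are never made available to the decoder in search.
        for line in phrase_table_lines:
            source, target = line.split(" ||| ")[:2]
            if marker in source.split() or marker in target.split():
                continue
            yield line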

We train word alignments with GIZA++. Reusing the alignment of the synthetic data given by the translation systems is not convenient for us for a practical reason: we have to apply compound splitting on the synthetic German data. German is going to be on the source side in the systems that make use of the synthetic data, and German compound splitting on the source side typically improves the translation quality. We thus apply the compound splitting after having created the data and word-align the compound-split synthetic German data with the corresponding French data from the 10⁹ corpus. Note that the corpus statistics in Table 11.6 have been calculated after compound splitting has been applied.


Table 11.7: Experimental results (cased) with pivot lightly-supervised training in phrase-based translation for the German→French task. newstest2009 is used as development set.

                                          newstest2008    newstest2009    newstest2010    newstest2011
    German→French                         Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
                                          [%]     [%]     [%]     [%]     [%]     [%]     [%]     [%]
    PBT direct                            19.2    66.6    18.9    66.6    20.3    65.9    19.6    65.6
    PBT transfer (En. intermediate)       17.8    67.6    17.3    67.8    19.0    66.7    18.2    66.3
    PBT synthetic (non-pivot)             17.4    69.3    16.7    70.0    17.8    69.2    17.8    68.3
    PBT synthetic (pivot)                 17.6    69.1    17.0    69.8    18.3    68.8    17.9    68.3
    PBT synthetic (non-pivot) + direct
      — joint extraction                  18.8    66.8    17.7    67.4    19.5    66.3    18.8    65.8
      — joint extraction, direct lex.     18.8    66.6    17.9    67.2    19.7    66.0    19.1    65.5
      — two phrase tables, direct lex.    19.3    67.7    18.8    67.7    19.7    67.0    19.6    66.5
    PBT synthetic (pivot) + direct
      — joint extraction                  18.7    67.0    18.2    67.5    19.6    66.5    18.9    66.1
      — joint extraction, direct lex.     19.3    67.2    18.7    67.6    19.8    66.7    19.4    66.3
      — two phrase tables, direct lex.    19.4    66.2    19.0    66.3    20.7    65.3    19.9    65.0

11.6.3 German→French Experimental Results in Phrase-based Translation

We are now in a position to examine German→French translation quality based on human-generated training data, lightly-supervised training, and pivot lightly-supervised training. The experimental results are presented in Table 11.7.

PBT direct: The German→French baseline system is trained with the human-generated parallel data.

PBT transfer (En. intermediate): This setup applies the pivot translation by transfer scheme [Wu & Wang 09], which we additionally want to compare to. We set up direct systems for German→English and English→French translation in order to be able to conduct German→English→French transfer pivoting with English as intermediate language. The German→English translation system is trained on the data from Table 11.3, but with compound splitting on the German side. Its translation performance is indicated in Table 11.8. The English→French translation system is trained on the 10⁹ data (Table 11.1). Its translation performance is indicated in Table 11.9. We translate from German to a single-best English intermediate hypothesis, which we feed into the English→French system to obtain a French output. Results are 1.3 to 1.6 points Bleu worse than direct translation.

PBT synthetic (non-pivot): This system is trained on the synthetic (non-pivot) corpus only, not on any human-generated data. When training a system only on synthetic parallel data that has been automatically translated from a target-side monolingual corpus, results are 1.8 to 2.5 points Bleu worse than direct translation.

PBT synthetic (pivot): This system is trained on the synthetic (pivot) corpus only, not on any human-generated data. It resembles the synthetic pivot translation scheme [Wu & Wang 09]. Results are 1.6 to 2.0 points Bleu worse than direct translation.

PBT synthetic (non-pivot) + direct: These systems are based on lightly-supervised training without pivoting. They make use of both the baseline human-generated data and the synthetic parallel corpus that has been automatically translated from target-side data.


Table 11.8: Translation quality (cased) with the German→English phrase-based system. newstest2009 is used as development set.

                      newstest2008    newstest2009    newstest2010    newstest2011
    German→English    Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
                      [%]     [%]     [%]     [%]     [%]     [%]     [%]     [%]
    PBT direct        21.4    63.3    21.2    62.4    23.1    60.7    21.0    62.5

Table 11.9: Translation quality (cased) with the English→French phrase-based system. newstest2009 is used as development set. The translation model of the system was trained with the 10⁹ corpus.

                      newstest2008    newstest2009    newstest2010    newstest2011
    English→French    Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
                      [%]     [%]     [%]     [%]     [%]     [%]     [%]     [%]
    PBT direct        23.1    62.8    25.5    59.2    27.1    56.6    29.5    53.6

PBT synthetic (pivot) + direct: These systems are based on pivot lightly-supervised training. They make use of both the baseline human-generated data and the synthetic parallel corpus that has been automatically translated from pivot language data.

Three different settings have been tried for both the lightly-supervised and the pivot lightly-supervised approach. The word-based lexicon model used for phrase table smoothing and separate phrase tables for human-generated and synthetic data prove to be crucial for translation quality here.

joint extraction: A single phrase table is extracted from the concatenation of synthetic and human-generated data. Lexical scores are computed with a lexicon model which is likewise extracted from the word-aligned concatenated data.

joint extraction, direct lex.: A single phrase table is extracted from the concatenation of synthetic and human-generated data. Lexical scores are computed with a lexicon model which is extracted from the word-aligned human-generated data only.

two phrase tables, direct lex.: Two separate phrase tables from the baseline human-generated data and from the synthetic data are extracted and utilized by the decoder. On both of the phrase tables, lexical scores are computed with a lexicon model which is extracted from the word-aligned human-generated data only.

The best results are obtained with the third of these settings. In the third setting, lightly-supervised training without pivoting is in terms of Bleu exactly on the level of the PBT direct baseline system, but in terms of Ter clearly worse. With pivot lightly-supervised training, we are able to outperform the baseline by up to +0.4 points Bleu / −0.6 points Ter.

11.6.3.1 Analysis

An adaptation effect towards the domain of the newstest corpora by means of the synthetic data from the 10⁹ collection does not seem to exist, according to our (negative) results with non-pivot lightly-supervised training. We analyze why pivot lightly-supervised training can still yield improvements.

Table 11.10 contains the out-of-vocabulary (OOV) rates of each of the newstest sets with regard to the vocabulary of the five training corpora from Table 11.6. The source-side OOV rates are barely reduced by adding synthetic data.


Table 11.10: Out-of-vocabulary (OOV) rates of the development and test sets with the vocabulary of each of the preprocessed German–French (direct and synthetic) parallel training data settings.

                                      newstest2008      newstest2009      newstest2010      newstest2011
    OOV [%] with training data        German  French    German  French    German  French    German  French
    direct                              2.7     2.1       2.7     2.4       2.9     2.4       3.1     2.7
    synthetic (non-pivot)               3.4     0.9       3.4     1.0       3.7     1.1       4.1     1.1
    synthetic (pivot)                   3.4     0.9       3.3     1.0       3.6     1.1       3.9     1.1
    synthetic (non-pivot) + direct      2.7     0.9       2.7     0.9       2.9     1.1       3.1     1.0
    synthetic (pivot) + direct          2.6     0.9       2.5     0.9       2.8     1.1       3.0     1.0

Table 11.11: Phrase table statistics for the German→French phrase-based systems. All phrase tables have been filtered towards the German side of the four newstest sets and pruned to contain a maximum of 400 distinct translation candidates per source side.

                                          entries    distinct source sides    avg. number of candidates
    PBT direct                             12.1 M          198.3 K                       61
    PBT synthetic (non-pivot)              24.7 M          257.5 K                       96
    PBT synthetic (pivot)                  32.9 M          245.4 K                      134
    PBT synthetic (non-pivot) + direct     28.5 M          274.1 K                      104
    PBT synthetic (pivot) + direct         36.2 M          266.6 K                      136

Table 11.12: Out-of-vocabulary (OOV) rates of the French references of the development and test sets, measured with regard to the target side vocabulary of those phrase table entries that can actually be used for the translation of each of the sets.

                                          newstest2008    newstest2009    newstest2010    newstest2011
    OOV [%] with filtered phrase voc.        French          French          French          French
    PBT direct                                2.6             2.9             3.0             3.1
    PBT synthetic (non-pivot)                 2.6             3.0             2.9             3.1
    PBT synthetic (pivot)                     1.8             2.0             2.0             2.0
    PBT synthetic (non-pivot) + direct        2.5             2.8             2.8             3.0
    PBT synthetic (pivot) + direct            1.7             1.9             1.9             2.0


Table 11.13: Experimental results (cased) with pivot lightly-supervised training in hierarchical phrase-based translation for the German→French task. newstest2009 is used as development set.

                                            newstest2008    newstest2009    newstest2010    newstest2011
    German→French                           Bleu    Ter     Bleu    Ter     Bleu    Ter     Bleu    Ter
                                            [%]     [%]     [%]     [%]     [%]     [%]     [%]     [%]
    HPBT direct                             19.3    64.9    18.8    65.2    20.5    64.1    19.6    64.0
    HPBT synthetic (non-pivot) + direct
      — two phrase tables, direct lex.      19.5    64.8    18.9    65.2    20.5    64.0    19.7    63.7
    HPBT synthetic (pivot) + direct
      — two phrase tables, direct lex.      19.8    65.0    19.4    65.3    20.8    64.4    20.0    64.0

The target-side OOV rates are reduced considerably, but these numbers are overly optimistic as most of the words will correspond to unknowns on the source side. To obtain more insightful numbers, we had a look into the phrase tables of our systems. We filtered the phrase tables towards the German side of the four newstest sets and determined the total number of entries, the number of distinct source sides, and the average number of candidates per source side. Note that our phrase tables are pruned to contain a maximum of 400 distinct translation candidates per source side. The phrase table statistics are presented in Table 11.11. Interestingly, the number of distinct source sides is slightly smaller with pivot lightly-supervised training than with non-pivot lightly-supervised training. The average number of translation candidates per source side is on the contrary about one third larger. This indicates that bilingual information that is represented in the translation model of the pivot→source system is in fact carried over to the source–target system via pivot lightly-supervised training. The richer choice of translation options pays off during search. Also, the target-side vocabulary that can actually be generated by the decoder is larger with pivot lightly-supervised training. To assess this, we filtered each phrase table towards the German side of each of the newstest sets individually. We then collected the French vocabulary present on the French side of the entries in each filtered phrase table and computed target-side OOV rates with respect to these filtered phrase vocabularies. The numbers are given in Table 11.12. The rates are roughly one third lower for pivot lightly-supervised training than for non-pivot lightly-supervised training and for the baseline.
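
The target-side OOV rates with respect to the filtered phrase vocabularies can be computed with little bookkeeping. The following sketch makes simplifying assumptions for illustration (tokenized text, the same "source ||| target ||| scores" line format as above, and an entry counted as usable if all its source words occur in the test set):

    def filtered_target_vocabulary(phrase_table_lines, test_source_vocab):
        # Collect target words of entries whose source side is covered by the
        # test set vocabulary, i.e. entries that can be used for this test set.
        vocab = set()
        for line in phrase_table_lines:
            source, target = line.split(" ||| ")[:2]
            if all(word in test_source_vocab for word in source.split()):
                vocab.update(target.split())
        return vocab

    def oov_rate(reference_tokens, vocab):
        # Fraction of running words in the reference that the vocabulary cannot produce.
        return sum(1 for word in reference_tokens if word not in vocab) / len(reference_tokens)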

11.6.4 German→French Experimental Results in Hierarchical Translation

We now apply the pivot lightly-supervised training approach to hierarchical phrase-based translation. The same data as for the phrase-based systems is used to set up the hierarchical systems. Specifically, the synthetic parallel training corpora are produced with phrase-based systems (cf. Sections 11.6.1 and 11.6.2). Default features are integrated into the hierarchical system. We build a baseline system with shallow grammar trained on human-generated parallel data only and then augment it with non-pivot and pivot lightly-supervised training, respectively. Similarly to the best phrase-based lightly-supervised training setups, lexical scores are computed with a lexicon model which is extracted from the word-aligned human-generated data only. We use two phrase tables and add a binary feature to distinguish phrases extracted from synthetic data from phrases extracted from human-generated data. As suggested in Chapter 10, no hierarchical phrases with gaps are extracted from synthetic data.
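
The preparation of the second (synthetic) phrase table can be sketched as a post-processing step that drops gapped rules and appends a binary provenance feature. The gap symbol and the line format below are illustrative assumptions only; entries of the baseline table would carry the value 0 for the same feature:

    GAP_SYMBOL = "X~"  # illustrative non-terminal notation, not the actual rule format

    def prepare_synthetic_phrase_table(lines, gap_symbol=GAP_SYMBOL):
        # Keep only lexical phrases from the synthetic data and mark their provenance
        # with a binary indicator feature (1 = extracted from synthetic data).
        for line in lines:
            source, target, scores = line.rstrip("\n").split(" ||| ")[:3]
            if gap_symbol in source or gap_symbol in target:
                continue  # skip hierarchical rules, i.e. phrases with gaps
            yield source + " ||| " + target + " ||| " + scores + " 1"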

Table 11.13 contains the experimental results. In hierarchical translation, we are able to outperform the baseline by up to +0.5 points Bleu on the test sets with pivot lightly-supervised training. Ter scores do not improve. Pivot lightly-supervised training is consistently +0.3 points Bleu better than non-pivot lightly-supervised training.


11.7 Summary

We showed how well-performing phrase-based and hierarchical phrase-based statistical machine translation systems can be improved by means of pivot lightly-supervised training. Pivot lightly-supervised training carries information which is present in additional resources that are parallel in source (or alternatively target) language and a third pivot language over to the source→target translation system. This is done via automatic generation of synthetic source–target data. Gains in translation quality can be achieved even without a domain adaptation effect, i.e. in a setting where non-pivot lightly-supervised training is not effective.

For the setup that turned out to provide the best translation quality in our series of experiments, we conducted lexical smoothing with a lexicon model which is created from human-generated data only, and employed separate phrase tables for human-generated and synthetic data.


Scientific Achievements

Our work has significantly enhanced the state of the art in hierarchical phrase-based statistical machine translation, enabling us to achieve top translation quality in open evaluation campaigns (Chapter 9; [Huck & Peitz+ 12b], [Wuebker & Huck+ 11]) and very competitive results in project-internal machine translation evaluations (Quaero, GALE).

A study on the behavior of the cube pruning algorithm for hierarchical decoding provided us and the scientific community with important insights into how to attain good translation results without compromising search efficiency too much in a baseline hierarchical setting (Chapter 3; [Huck & Vilar+ 13]). Such detailed systematic investigations of decoding efficiency in hierarchical translation have been rare in the literature previously.

As a core scientific contribution of this thesis, a number of novel additional models for hierarchical phrase-based translation have been proposed, developed, implemented, and empirically evaluated (Part II). Machine translation quality improves considerably when a hierarchical machine translation system is enhanced with any of the proposed models individually, but even more so when several of them are employed in combination (e.g., up to +3.2 Bleu over a plain Chinese→English hierarchical baseline, cf. Table 7.1). Different from many orthogonal developments in hierarchical machine translation research, the proposed enhancements do not necessitate any linguistic annotation, making them broadly applicable to any task, even when no linguistic annotation tools such as syntactic taggers or parsers are at hand. By releasing the source code as part of the Jane toolkit, we have provided public access to our implementation of the novel enhancements to hierarchical translation. In brief, we have made the following contributions, which provide progress beyond the previous state of the art, and novel scientific insights in machine translation:

• We have integrated triplet lexicon models and discriminative word lexicons directly into hierarchical decoding, providing the system with global source sentence context (Chapter 4; [Huck & Ratajczak+ 10]). In prior work, these extended lexicon models were utilized in standard phrase-based translation or in n-best reranking only. We have empirically demonstrated that extended lexicon models can yield even larger gains in hierarchical translation than in phrase-based translation, resulting in hierarchical setups that outperform comparable phrase-based setups on an Arabic→English task. We have shown that an improvement in translation quality of up to +1.8 Bleu over a standard hierarchical baseline system can be achieved (cf. Table 4.3).

• We have harnessed extended lexicon models in a novel way by utilizing them for lexical smoothing in hierarchical translation (Chapter 5; [Huck & Mansour+ 11]). We have demonstrated how this can be accomplished for hierarchical rules. While we relinquish global source sentence context by scoring within rules only, we gain two major benefits: scores can be precomputed and added to the phrase table beforehand, and we can apply models in target-to-source direction as well directly in decoding, not exclusively source-to-target models. We have empirically compared many lexical smoothing variants, including a novel regularized IBM Model 1, and found the discriminative word lexicon to be the best choice. Surprisingly, for Chinese→English hierarchical translation, we also found that global source sentence context does not seem to be particularly useful, and that an EM-trained IBM Model 1 is a much better choice for lexical smoothing than the commonly applied baseline lexicon model which is extracted from data based on the symmetrized Viterbi alignments in both directions. We have shown that lexical smoothing with discriminative word lexicons in both directions gives an improvement of up to +1.8 Bleu over a standard hierarchical baseline (which includes a common lexical smoothing method) on a Chinese→English task (cf. Table 5.3), and that the stronger EM-trained IBM Model 1 lexical smoothing is likewise outperformed by +0.3 Bleu.

• We have developed novel reordering extensions for hierarchical translation (Chapter 6; [Huck & Peitz+ 12a]). Lexicalized reordering models have originally not been integrated into hierarchical systems because researchers have previously argued that hierarchical rules do already model reordering. By means of implementing and evaluating a discriminative lexicalized reordering model in hierarchical translation, we have been able to prove that—against common belief—lexicalized reordering models indeed are very effective in a hierarchical system. Furthermore, reorderings that can be conducted by a hierarchical decoder are typically bound by the inventory of extracted hierarchical rules. We have designed dedicated reordering rules that provide the decoder with more reordering flexibility. We have shown that, by combining a discriminative lexicalized reordering model and a dedicated swap reordering rule, an improvement of up to +1.2 Bleu over a standard hierarchical baseline system can be observed on a Chinese→English task (cf. Table 6.1).

• We have designed a phrase orientation model for hierarchical translation (Chapter 7; [Huck & Wuebker+ 13]). Phrase orientation models are an effective feature of standard phrase-based translation systems. However, as with other lexicalized reordering models, they have previously typically not been available for hierarchical systems. We have formulated a generalized phrase orientation model for hierarchical translation, and solved the problem of orientation scoring in hierarchical decoding. Empirical evaluation reveals that the phrase orientation model is even more effective in hierarchical translation than our extension with the discriminative lexicalized reordering model. We have shown that an improvement of +1.2 Bleu can be achieved over a standard hierarchical baseline system on a Chinese→English task when integrating the phrase orientation model bidirectionally and standalone, and an improvement of up to +1.4 Bleu when it is combined with a swap rule (cf. Table 7.1). On a French→German task, the gain in translation quality is clearly visible as well, but less pronounced (up to +0.4 Bleu, cf. Table 7.2).

• We have introduced a simple, yet effective technique for modeling insertions and deletions in statistical machine translation, and applied it successfully in a hierarchical system (Chapter 8; [Huck & Ney 12a]). The technique is based on thresholding over lexical translation probabilities, and we have proposed and empirically compared several thresholding methods. We have shown that insertion and deletion modeling gives good gains over a plain Chinese→English hierarchical baseline (up to +0.4 Bleu, cf. Table 8.1) and a minor gain over a strong system that is extended with many of the other enhancements described in this thesis.

In another strand of research contained in this thesis, methods have been investigated that allow for better exploitation of monolingual and parallel data resources in hierarchical phrase-based translation (Part III). In brief, our scientific contributions in this part are the following:


• For lightly-supervised training, synthetic training data is created by automatically translating monolingual corpora. Harnessing the lightly-supervised training approach in a straightforward manner is prohibitive in hierarchical systems due to the resulting blowup of the phrase table. We have proposed a solution to the latter problem (Chapter 10; [Huck & Vilar+ 11b]) and have applied lightly-supervised training to hierarchical translation for the first time. We have shown that Arabic→English hierarchical translation quality benefits from lightly-supervised training by up to +1.0 Bleu (cf. Table 10.3).

• We have introduced a novel approach for improved learning from additional bilingual corpora that are parallel between a third language and either the source or target language of a machine translation system that we want to build. Ideas from pivot translation and from lightly-supervised training are combined to form a new pivot lightly-supervised training technique (Chapter 11; [Huck & Ney 12b]), which we have empirically evaluated and analyzed in both standard phrase-based translation and in hierarchical translation. We have shown that improvements of around +0.4 Bleu in German→French translation can be attained under both translation paradigms (cf. Table 11.7 and Table 11.13).


Conclusions

In this thesis, we have first given an overview of the fundamentals of hierarchical phrase-based machine translation, including a detailed empirical study of decoding efficiency on large-scale translation tasks. On a fine-grained level, we revealed the trade-offs between translation quality and resource requirements with different system settings. Our analysis allowed for an informed choice of the type of hierarchical grammar, the hypothesis recombination scheme, and the k-best generation limit in the cube pruning algorithm for hierarchical search.

We have then developed novel statistical models for hierarchical phrase-based machine translation and applied them on top of our baseline. We found many of our proposed enhancements to be very effective, yielding considerable improvements in translation quality over a default system that contains established baseline features. Specifically, we can draw the following conclusions:

• Two types of extended lexicon models—the triplet lexicon and the discriminative word lexicon model—are just as effective when employed in a hierarchical system as they are in standard phrase-based translation. When we integrated features based on extended lexicon models, taking global source sentence context into account, the hierarchical system outperformed a comparable phrase-based system by a small margin in terms of translation quality for Arabic→English. We also experimented with constrained variants of the models which allow for more efficient training, including a novel sparse discriminative word lexicon model.

• In hierarchical translation, we can improve over the widely adopted lexical smoothing technique which uses word lexicons extracted from parallel data with symmetrized word alignments. Simply employing the EM-trained IBM model 1 for lexical smoothing instead showed much better results for Chinese→English. Different scoring functions that have been proposed in the literature generally do not make much of a difference. We proposed to utilize extended lexicon models for lexical smoothing in hierarchical translation. If we consider only phrase-internal context rather than sentence context, we can efficiently precompute the lexical scores, and we can easily apply triplet and discriminative word lexicon models in target-to-source direction in decoding as well. Surprisingly, taking global sentence context into account is not crucial for the extended lexicon models to act as an effective enhancement in hierarchical translation. In our experiments, the discriminative word lexicon was the best model for lexical smoothing. A novel regularized version of IBM model 1 is conceptually appealing, but not more effective in translation.

• Hierarchical translation benefits from reordering extensions. Despite the integrated reordering mechanism via co-indexed non-terminals of hierarchical rules, it is still valuable to further guide the reordering decisions in hierarchical search by means of a lexicalized reordering model. We implemented scoring with a discriminative lexicalized reordering model in the hierarchical decoder and observed good results for Chinese→English. Enabling more permissive reordering possibilities with additional non-lexicalized reordering rules gave further gains, especially when combined with the discriminative reordering model.

• We have shown that it is possible to model phrase orientation with hierarchical phrases, and that we can determine and score phrase orientations of rule applications in hierarchical decoding. The phrase orientation model for hierarchical machine translation reliably gave us considerable improvements over the Chinese→English baseline setup, over a better setup with syntactic features, and over a setup enhanced with triplets and discriminative word lexicons in both directions. We saw positive results in a French→German translation experiment as well.

• Features that model word insertion and deletion can be designed based on thresholding of lexical probabilities. We have proposed several thresholding techniques and evaluated the features with two different lexicon models. Insertion and deletion models had a favorable impact on translation quality compared to the baseline, and they act more effectively than simple unaligned word count features. The gains diminished on top of stronger Chinese→English setups that already make use of our improved lexical smoothing variants, though.

• Applying multiple enhancements, hierarchical systems with competitive translation quality can be developed for English→French, a language pair with availability of extensive parallel and monolingual resources. Standard phrase-based systems are thought to be well-performing on English→French. With its various extensions, the hierarchical system can be on par or better.

• Adopting lightly-supervised training in hierarchical translation is viable and useful. We suggested to extract only lexical phrases from the synthetic data, no phrases with gaps, and to augment the baseline with a second phrase table containing the lexical phrases from the synthetic data. A binary feature can be added in order to distinguish baseline phrases and synthetic phrases. As our Arabic→English experiments have confirmed, this approach for lightly-supervised training of hierarchical systems is more efficient and just as effective as other approaches that include phrases with gaps from synthetic data, or that extract jointly from a concatenation of human-generated and synthetic training data.

• Lightly-supervised training can be combined with the principle of pivot translation to conduct pivot lightly-supervised training. If a parallel corpus between source language or target language and a third pivot language is available, we can automatically translate the pivot language side of that bitext to target or source language, so that we essentially create a trilingual corpus. The pivoted synthetic source–target part of the trilingual corpus can be used for lightly-supervised training. The translation system benefits from information that is present in the additional resources. For German→French, we found that pivot lightly-supervised training was more effective than non-pivot lightly-supervised training in both hierarchical and standard phrase-based translation.


List of Figures

0.1 A word-aligned French–English sentence pair . . . . . . . . . . . . . . . . . . . . . 6

1.1 Phrases highlighted in a word-aligned French–English sentence pair . . . . . . . . 20

3.1   k-best generation with the cube pruning algorithm . . . . . . . . . . . . . . . 32
3.2   Translation quality with cube pruning for the NIST Chinese→English task . . . . 35
3.3   Translation quality with cube pruning for the NIST Arabic→English task . . . . . 35
3.4   Translation speed with cube pruning for the NIST Chinese→English task . . . . . 37
3.5   Translation speed with cube pruning for the NIST Arabic→English task . . . . . . 37
3.6   Memory requirements with cube pruning for the NIST Chinese→English task . . . . 37
3.7   Memory requirements with cube pruning for the NIST Arabic→English task . . . . . 37
3.8   Trade-off between quality and speed for Chinese→English . . . . . . . . . . . . 38
3.9   Trade-off between quality and speed for Arabic→English . . . . . . . . . . . . . 38
3.10  Trade-off between quality and memory requirements for Chinese→English . . . . . 38
3.11  Trade-off between quality and memory requirements for Arabic→English . . . . . . 38
3.12  Relation of quality and model score for Chinese→English (deep grammar) . . . . . 39
3.13  Relation of quality and model score for Arabic→English (deep grammar) . . . . . 39
3.14  Relation of quality and model score for Chinese→English (shallow-1 grammar) . . 39
3.15  Relation of quality and model score for Arabic→English (shallow-1 grammar) . . . 39

6.1   An embedding of a phrase, with orientations scored with the neighboring blocks . 67
6.2   Translation example from a baseline system . . . . . . . . . . . . . . . . . . . 71
6.3   Translation example from a system with reordering extensions . . . . . . . . . . 71

7.1   Extraction of orientation classes from word-aligned training samples . . . . . . 76
7.2   Accumulation of orientation counts for hierarchical phrases during extraction . . 76
7.3   Scoring with the orientation classes monotone, swap, and discontinuous . . . . . 77
7.4   Determining the orientation class during decoding . . . . . . . . . . . . . . . . 78
7.5   Left boundary non-terminal symbols . . . . . . . . . . . . . . . . . . . . . . . 79

10.1  Lightly-supervised training (source-to-target) . . . . . . . . . . . . . . . . . 98
10.2  Lightly-supervised training (target-to-source) . . . . . . . . . . . . . . . . . 98

11.1  Direct translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11.2  The transfer pivot approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11.3  Pivot lightly-supervised training . . . . . . . . . . . . . . . . . . . . . . . . 107


List of Tables

0.1 Example sentences from a French–English parallel corpus . . . . . . . . . . . . . . 4

3.1   Data statistics of the Chinese–English parallel training corpus . . . . . . . . . 34
3.2   Data statistics of the Arabic–English parallel training corpus . . . . . . . . . 34
3.3   Data statistics of the Chinese→English NIST MT06 and MT08 sets . . . . . . . . . 34
3.4   Data statistics of the Arabic→English NIST MT06 and MT08 sets . . . . . . . . . 34
3.5   Average amount of hypernodes per sentence on the NIST tasks . . . . . . . . . . 40
3.6   Amount of derivations on the NIST Chinese→English task . . . . . . . . . . . . . 40
3.7   Amount of derivations on the NIST Arabic→English task . . . . . . . . . . . . . 40

4.1   Sizes and computational demands in training for triplet models . . . . . . . . . 51
4.2   Average number of features per target word and training time for DWL models . . 51
4.3   Experimental results with extended lexicon models for the Arabic→English task . 51

5.1   Experimental results with different lexicon models in lexical smoothing for the Chinese→English task . . 58
5.2   Experimental results with various scoring functions in lexical smoothing for the Chinese→English task . . 59
5.3   Experimental results with lexical enhancements considering source sentence context or phrase context in each direction for the Chinese→English task . . 59

6.1   Experimental results with reordering extensions for the Chinese→English task . . 68
6.2   Supplementary experimental results with reordering extensions . . . . . . . . . 70
6.3   Statistics on the rule usage . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7.1   Experimental results with phrase orientation model for the Chinese→English task  81
7.2   Experimental results with phrase orientation model for the French→German task .  82

8.1   Experimental results with insertion and deletion models for the Chinese→English task . . 87

9.1   Data statistics of the WMT English–French parallel training corpora . . . . . . 90
9.2   Experimental results for the WMT English→French task . . . . . . . . . . . . . . 90
9.3   Comparison with other groups on the WMT English→French task . . . . . . . . . . 92
9.4   Data statistics of the IWSLT English–French parallel training corpus . . . . . . 93
9.5   Experimental results for the IWSLT English→French task . . . . . . . . . . . . . 94

10.1  Data statistics of the Arabic–English synthetic training corpus . . . . . . . . 101
10.2  Phrase table statistics for lightly-supervised training of Arabic→English systems  101
10.3  Experimental results with lightly-supervised training for the Arabic→English task  101

11.1  Data statistics of the English–French WMT 10⁹ corpus . . . . . . . . . . . . . . 109
11.2  Data statistics of the parallel training corpus for the French→German setup . . 109
11.3  Data statistics of the parallel training corpus for the English→German setup . . 109
11.4  Translation quality with the French→German phrase-based system . . . . . . . . . 109
11.5  Translation quality with the English→German phrase-based system . . . . . . . . 109
11.6  Data statistics of the German–French (direct and synthetic) parallel training corpora . . 110
11.7  Experimental results with pivot lightly-supervised training in phrase-based translation for the German→French task . . 112
11.8  Translation quality with the German→English phrase-based system . . . . . . . . 113
11.9  Translation quality with the English→French phrase-based system . . . . . . . . 113
11.10 Out-of-vocabulary rates for German–French . . . . . . . . . . . . . . . . . . . . 114
11.11 Phrase table statistics for the German→French phrase-based systems . . . . . . . 114
11.12 Target-side out-of-vocabulary rates, measured with regard to those phrase table entries that can actually be used for translation . . 114
11.13 Experimental results with pivot lightly-supervised training in hierarchical translation for the German→French task . . 115


Bibliography

[Almaghout & Jiang+ 12] H. Almaghout, J. Jiang, A. Way: Extending CCG-based SyntacticConstraints in Hierarchical Phrase-Based SMT. In Proceedings of the Annual Conference of theEuropean Association for Machine Translation (EAMT), pp. 193–200, Trento, Italy, May 2012.

[Auli & Lopez+ 09] M. Auli, A. Lopez, H. Hoang, P. Koehn: A Systematic Analysis of TranslationModel Search Spaces. In Proceedings of the Workshop on Statistical Machine Translation(WMT), pp. 224–232, Athens, Greece, March 2009.

[Avramidis & Koehn 08] E. Avramidis, P. Koehn: Enriching Morphologically Poor Languages forStatistical Machine Translation. In Proceedings of the Annual Meeting of the Association forComputational Linguistics (ACL), pp. 763–770, Columbus, OH, USA, June 2008.

[Baker & Bloodgood+ 10] K. Baker, M. Bloodgood, C. Callison-Burch, B. Dorr, N. Filardo,L. Levin, S. Miller, C. Piatko: Semantically-Informed Syntactic Machine Translation: ATree-Grafting Approach. In Proceedings of the Conference of the Association for MachineTranslation in the Americas (AMTA), Denver, CO, USA, Oct./Nov. 2010.

[Bangalore & Haffner+ 07] S. Bangalore, P. Haffner, S. Kanthak: Statistical Machine Translationthrough Global Lexical Selection and Sentence Reconstruction. In Proceedings of the AnnualMeeting of the Association for Computational Linguistics (ACL), pp. 152–159, Prague, CzechRepublic, June 2007.

[Birch & Huck+ 14] A. Birch, M. Huck, N. Durrani, N. Bogoychev, P. Koehn: Edinburgh SLTand MT System Description for the IWSLT 2014 Evaluation. In Proceedings of the InternationalWorkshop on Spoken Language Translation (IWSLT), pp. 49–56, Lake Tahoe, CA, USA, Dec.2014.

[Bojar & Chatterjee+ 15] O. Bojar, R. Chatterjee, C. Federmann, B. Haddow, M. Huck, C. Hokamp, P. Koehn, V. Logacheva, C. Monz, M. Negri, M. Post, C. Scarton, L. Specia, M. Turchi: Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 1–46, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[Bojar & Chatterjee+ 16] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, M. Zampieri: Findings of the 2016 Conference on Machine Translation (WMT16). In Proceedings of the ACL 2016 First Conference on Machine Translation (WMT16), pp. 131–198, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.


[Bojar & Chatterjee+ 17] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, S. Huang, M. Huck, P. Koehn, Q. Liu, V. Logacheva, C. Monz, M. Negri, M. Post, R. Rubino, L. Specia, M. Turchi: Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 169–214, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Braune & Fraser 10] F. Braune, A. Fraser: Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 81–89, Beijing, China, Aug. 2010.

[Brown & Cocke+ 90] P.F. Brown, J. Cocke, S.A. Della Pietra, V.J. Della Pietra, F. Jelinek, J.D. Lafferty, R.L. Mercer, P.S. Rossin: A Statistical Approach to Machine Translation. Computational Linguistics, Vol. 16, No. 2, pp. 79–85, June 1990.

[Brown & Della Pietra+ 93] P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, R.L. Mercer: The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, Vol. 19, No. 2, pp. 263–311, June 1993.

[Callison-Burch & Koehn+ 06] C. Callison-Burch, P. Koehn, M. Osborne: Improved Statistical Machine Translation Using Paraphrases. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 17–24, New York City, NY, USA, June 2006.

[Callison-Burch & Koehn+ 12] C. Callison-Burch, P. Koehn, C. Monz, M. Post, R. Soricut, L. Specia: Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 10–51, Montréal, Canada, June 2012.

[Cap & Fraser+ 14] F. Cap, A. Fraser, M. Weller, A. Cahill: How to Produce Unseen Teddy Bears: Improved Morphological Processing of Compounds in SMT. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 579–587, Gothenburg, Sweden, April 2014.

[Cer & Galley+ 10] D. Cer, M. Galley, D. Jurafsky, C.D. Manning: Phrasal: A Statistical Machine Translation Toolkit for Exploring New Model Features. In Proceedings of the NAACL HLT 2010 Demonstration Session, pp. 9–12, Los Angeles, CA, USA, June 2010.

[Cettolo & Bertoldi+ 11] M. Cettolo, N. Bertoldi, M. Federico: Bootstrapping Arabic-Italian SMT through Comparable Texts and Pivot Translation. In Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT), pp. 249–256, Leuven, Belgium, May 2011.

[Cettolo & Girardi+ 12] M. Cettolo, C. Girardi, M. Federico: WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT), pp. 261–268, Trento, Italy, May 2012.

[Chappelier & Rajman 98] J.C. Chappelier, M. Rajman: A Generalized CYK Algorithm for Parsing Stochastic CFG. In Proceedings of the First Workshop on Tabulation in Parsing and Deduction, pp. 133–137, Paris, France, April 1998.

[Chen & Goodman 98] S.F. Chen, J. Goodman: An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, USA, 63 pages, Aug. 1998.


[Chen & Rosenfeld 99] S.F. Chen, R. Rosenfeld: A Gaussian Prior for Smoothing Maximum Entropy Models. Technical Report CMUCS-99-108, Carnegie Mellon University, Pittsburgh, PA, USA, 25 pages, Feb. 1999.

[Cherry & Foster 12] C. Cherry, G. Foster: Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 427–436, Montréal, Canada, June 2012.

[Cherry & Moore+ 12] C. Cherry, R.C. Moore, C. Quirk: On Hierarchical Re-ordering and Permutation Parsing for Phrase-based Decoding. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 200–209, Montréal, Canada, June 2012.

[Chiang 05] D. Chiang: A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 263–270, Ann Arbor, MI, USA, June 2005.

[Chiang 07] D. Chiang: Hierarchical Phrase-Based Translation. Computational Linguistics, Vol. 33, No. 2, pp. 201–228, June 2007.

[Chiang 10] D. Chiang: Learning to Translate with Source and Target Syntax. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1443–1452, Uppsala, Sweden, July 2010.

[Chiang & DeNeefe+ 11] D. Chiang, S. DeNeefe, M. Pust: Two Easy Improvements to Lexical Weighting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 455–460, Portland, OR, USA, June 2011.

[Cohn & Lapata 07] T. Cohn, M. Lapata: Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 728–735, Prague, Czech Republic, June 2007.

[Collins & Koehn+ 05] M. Collins, P. Koehn, I. Kucerova: Clause Restructuring for Statistical Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 531–540, Ann Arbor, MI, USA, June 2005.

[Conforti & Huck+ 18] C. Conforti, M. Huck, A. Fraser: Neural Morphological Tagging of Lemma Sequences for Machine Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), pp. 39–53, Boston, MA, USA, March 2018.

[Crego & Yvon+ 11] J.M. Crego, F. Yvon, J.B. Mariño: Ncode: an Open Source Bilingual N-gram SMT Toolkit. The Prague Bulletin of Mathematical Linguistics, Vol. 96, pp. 49–58, Oct. 2011.

[Darroch & Ratcliff 72] J.N. Darroch, D. Ratcliff: Generalized Iterative Scaling for Log-Linear Models. Annals of Mathematical Statistics, Vol. 43, pp. 1470–1480, 1972.

[de Gispert & Iglesias+ 10] A. de Gispert, G. Iglesias, G. Blackwood, E.R. Banga, W. Byrne: Hierarchical Phrase-Based Translation with Weighted Finite-State Transducers and Shallow-n Grammars. Computational Linguistics, Vol. 36, No. 3, pp. 505–533, 2010.

[Dempster & Laird+ 77] A.P. Dempster, N.M. Laird, D.B. Rubin: Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Royal Statist. Soc. Ser. B, Vol. 39, No. 1, pp. 1–22, 1977.


[Dyer & Lopez+ 10] C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, P. Resnik: cdec: A Decoder, Alignment, and Learning framework for finite-state and context-free translation models. In Proceedings of the ACL 2010 System Demonstrations, pp. 7–12, Uppsala, Sweden, July 2010.

[Eisele & Chen 10] A. Eisele, Y. Chen: MultiUN: A Multilingual Corpus from United Nation Documents. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pp. 2868–2872, Malta, May 2010.

[Federico & Bentivogli+ 11] M. Federico, L. Bentivogli, M. Paul, S. Stueker: Overview of the IWSLT 2011 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 11–27, San Francisco, CA, USA, Dec. 2011.

[Foster & Kuhn+ 06] G. Foster, R. Kuhn, H. Johnson: Phrasetable Smoothing for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 53–61, Sydney, Australia, July 2006.

[Foster & Kuhn 07] G. Foster, R. Kuhn: Mixture-Model Adaptation for SMT. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 128–135, Prague, Czech Republic, June 2007.

[Freitag & Feng+ 13] M. Freitag, M. Feng, M. Huck, S. Peitz, H. Ney: Reverse Word Order Models. In Proceedings of the MT Summit XIV, pp. 159–166, Nice, France, Sept. 2013.

[Freitag & Huck+ 14] M. Freitag, M. Huck, H. Ney: Jane: Open Source Machine Translation System Combination. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 29–32, Gothenburg, Sweden, April 2014. Association for Computational Linguistics.

[Freitag & Peitz+ 12] M. Freitag, S. Peitz, M. Huck, H. Ney, T. Herrmann, J. Niehues, A. Waibel, A. Allauzen, G. Adda, B. Buschbeck, J.M. Crego, J. Senellart: Joint WMT 2012 Submission of the QUAERO Project. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 322–329, Montréal, Canada, June 2012. Association for Computational Linguistics.

[Freitag & Peitz+ 13] M. Freitag, S. Peitz, J. Wuebker, H. Ney, N. Durrani, M. Huck, P. Koehn, T.L. Ha, J. Niehues, M. Mediani, T. Herrmann, A. Waibel, N. Bertoldi, M. Cettolo, M. Federico: EU-BRIDGE MT: Text Translation of Talks in the EU-BRIDGE Project. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Heidelberg, Germany, Dec. 2013.

[Freitag & Peitz+ 14] M. Freitag, S. Peitz, J. Wuebker, H. Ney, M. Huck, R. Sennrich, N. Durrani, M. Nadejde, P. Williams, P. Koehn, T. Herrmann, E. Cho, A. Waibel: EU-BRIDGE MT: Combined Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT), Baltimore, MD, USA, June 2014. Association for Computational Linguistics.

[Freitag & Wuebker+ 14] M. Freitag, J. Wuebker, S. Peitz, H. Ney, M. Huck, A. Birch, N. Durrani, P. Koehn, M. Mediani, I. Slawik, J. Niehues, E. Cho, A. Waibel, N. Bertoldi, M. Cettolo, M. Federico: Combined Spoken Language Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 57–64, Lake Tahoe, CA, USA, Dec. 2014.

[Fritzinger & Fraser 10] F. Fritzinger, A. Fraser: How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pp. 224–234, Uppsala, Sweden, July 2010.


[Gale & Church 93] W.A. Gale, K.W. Church: A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, Vol. 19, No. 1, pp. 75–90, 1993.

[Galley & Manning 08] M. Galley, C.D. Manning: A Simple and Effective Hierarchical Phrase Reordering Model. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 847–855, Honolulu, HI, USA, Oct. 2008.

[Galley & Manning 10] M. Galley, C.D. Manning: Accurate Non-Hierarchical Phrase-Based Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 966–974, Los Angeles, CA, USA, June 2010.

[Gao & Koehn+ 11] Y. Gao, P. Koehn, A. Birch: Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 857–868, Edinburgh, Scotland, UK, July 2011.

[Gao & Vogel 08] Q. Gao, S. Vogel: Parallel Implementations of Word Alignment Tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP '08, pp. 49–57, Columbus, OH, USA, June 2008.

[Habash & Rambow 05] N. Habash, O. Rambow: Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 573–580, Ann Arbor, MI, USA, June 2005.

[Haddow & Birch+ 17] B. Haddow, A. Birch, O. Bojar, F. Braune, C. Davenport, A. Fraser, M. Huck, M. Kašpar, K. Kovaříková, J. Plch, A. Ramm, J. Ried, J. Sheary, A. Tamchyna, D. Variš, M. Weller, P. Williams: HimL: Health in my Language. In Proceedings of the EAMT 2017 User Studies and Project/Product Descriptions, p. 33, Prague, Czech Republic, May 2017.

[Haddow & Huck+ 15] B. Haddow, M. Huck, A. Birch, N. Bogoychev, P. Koehn: The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 126–133, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[Hasan 11] S. Hasan: Triplet Lexicon Models for Statistical Machine Translation. Ph.D. thesis, RWTH Aachen University, Aachen, Germany, Nov. 2011.

[Hasan & Ganitkevitch+ 08] S. Hasan, J. Ganitkevitch, H. Ney, J. Andrés-Ferrer: Triplet Lexicon Models for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 372–381, Honolulu, HI, USA, Oct. 2008.

[Hasan & Ney 09] S. Hasan, H. Ney: Comparison of Extended Lexicon Models in Search and Rescoring for SMT. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 17–20, Boulder, CO, USA, June 2009.

[Hayashi & Tsukada+ 10] K. Hayashi, H. Tsukada, K. Sudoh, K. Duh, S. Yamamoto: Hierarchical Phrase-based Machine Translation with Word-based Reordering Model. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 439–446, Beijing, China, Aug. 2010.


[He & Meng+ 10a] Z. He, Y. Meng, H. Yu: Extending the Hierarchical Phrase Based Model with Maximum Entropy Based BTG. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Denver, CO, USA, Oct./Nov. 2010.

[He & Meng+ 10b] Z. He, Y. Meng, H. Yu: Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-based Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 555–563, Cambridge, MA, USA, Oct. 2010.

[Heafield & Hoang+ 11] K. Heafield, H. Hoang, P. Koehn, T. Kiso, M. Federico: Left Language Model State for Syntactic Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 183–190, San Francisco, CA, USA, Dec. 2011.

[Heafield & Koehn+ 12] K. Heafield, P. Koehn, A. Lavie: Language Model Rest Costs and Space-Efficient Storage. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pp. 1169–1178, Jeju Island, Korea, July 2012.

[Heafield & Koehn+ 13] K. Heafield, P. Koehn, A. Lavie: Grouping Language Model Boundary Words to Speed K–Best Extraction from Hypergraphs. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 958–968, Atlanta, GA, USA, June 2013.

[Heger & Wuebker+ 10a] C. Heger, J. Wuebker, M. Huck, G. Leusch, S. Mansour, D. Stein, H. Ney: The RWTH Aachen Machine Translation System for WMT 2010. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 93–97, Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[Heger & Wuebker+ 10b] C. Heger, J. Wuebker, D. Vilar, H. Ney: A Combination of Hierarchical Systems with Forced Alignments from Phrase-Based Systems. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 291–297, Paris, France, Dec. 2010.

[Herrmann & Mediani+ 11] T. Herrmann, M. Mediani, J. Niehues, A. Waibel: The Karlsruhe Institute of Technology Translation Systems for the WMT 2011. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 379–385, Edinburgh, Scotland, UK, July 2011.

[Hoang & Koehn+ 09] H. Hoang, P. Koehn, A. Lopez: A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 152–159, Tokyo, Japan, Dec. 2009.

[Hoang & Koehn 10] H. Hoang, P. Koehn: Improved Translation with Source Syntax Labels. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pp. 409–417, Uppsala, Sweden, July 2010.

[Hopcroft & Motwani+ 01] J.E. Hopcroft, R. Motwani, J.D. Ullman: Introduction to Automata Theory, Languages, and Computation (2nd ed.). Addison-Wesley, 2001.

[Hopkins & May 11] M. Hopkins, J. May: Tuning as Ranking. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 1352–1362, Edinburgh, Scotland, UK, July 2011.

[Huang & Chiang 05] L. Huang, D. Chiang: Better k-best Parsing. In Proceedings of the 9th International Workshop on Parsing Technologies, pp. 53–64, Oct. 2005.


[Huang & Chiang 07] L. Huang, D. Chiang: Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 144–151, Prague, Czech Republic, June 2007.

[Huck & Birch 15a] M. Huck, A. Birch: The Edinburgh Machine Translation Systems for IWSLT 2015. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 31–38, Da Nang, Vietnam, Dec. 2015.

[Huck & Birch+ 15b] M. Huck, A. Birch, B. Haddow: Mixed-Domain vs. Multi-Domain Statistical Machine Translation. In Proceedings of MT Summit XV, Volume 1: MT Researchers’ Track, pp. 240–255, Miami, FL, USA, Oct./Nov. 2015.

[Huck & Braune+ 17] M. Huck, F. Braune, A. Fraser: LMU Munich’s Neural Machine Translation Systems for News Articles and Health Information Texts. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 315–322, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Huck & Fraser+ 16] M. Huck, A. Fraser, B. Haddow: The Edinburgh/LMU Hierarchical Machine Translation System for WMT 2016. In Proceedings of the ACL 2016 First Conference on Machine Translation (WMT16), pp. 311–318, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.

[Huck & Hoang+ 14a] M. Huck, H. Hoang, P. Koehn: Augmenting String-to-Tree and Tree-to-String Translation with Non-Syntactic Phrases. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 486–498, Baltimore, MD, USA, June 2014. Association for Computational Linguistics.

[Huck & Hoang+ 14b] M. Huck, H. Hoang, P. Koehn: Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation. In Proceedings of the EMNLP 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), pp. 148–156, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.

[Huck & Mansour+ 11] M. Huck, S. Mansour, S. Wiesler, H. Ney: Lexicon Models for Hierarchical Phrase-Based Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 191–198, San Francisco, CA, USA, Dec. 2011.

[Huck & Ney 12a] M. Huck, H. Ney: Insertion and Deletion Models for Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 347–351, Montréal, Canada, June 2012. Association for Computational Linguistics.

[Huck & Ney 12b] M. Huck, H. Ney: Pivot Lightly-Supervised Training for Statistical Machine Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, CA, USA, Oct. 2012.

[Huck & Peitz+ 12a] M. Huck, S. Peitz, M. Freitag, H. Ney: Discriminative Reordering Extensions for Hierarchical Phrase-Based Machine Translation. In Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT), pp. 313–320, Trento, Italy, May 2012.

[Huck & Peitz+ 12b] M. Huck, S. Peitz, M. Freitag, M. Nuhn, H. Ney: The RWTH Aachen Machine Translation System for WMT 2012. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 304–311, Montréal, Canada, June 2012. Association for Computational Linguistics.


[Huck & Peter+ 12] M. Huck, J.T. Peter, M. Freitag, S. Peitz, H. Ney: Hierarchical Phrase-Based Translation with Jane 2. The Prague Bulletin of Mathematical Linguistics (PBML), Vol. 98, pp. 37–50, Oct. 2012.

[Huck & Ratajczak+ 10] M. Huck, M. Ratajczak, P. Lehnen, H. Ney: A Comparison of Various Types of Extended Lexicon Models for Statistical Machine Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Denver, CO, USA, Oct./Nov. 2010.

[Huck & Riess+ 17] M. Huck, S. Riess, A. Fraser: Target-side Word Segmentation Strategies for Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, pp. 56–67, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Huck & Scharwächter+ 13] M. Huck, E. Scharwächter, H. Ney: Source-Side Discontinuous Phrases for Machine Translation: A Comparative Study on Phrase Extraction and Search. The Prague Bulletin of Mathematical Linguistics (PBML), Vol. 99, pp. 17–38, April 2013.

[Huck & Tamchyna+ 17] M. Huck, A. Tamchyna, O. Bojar, A. Fraser: Producing Unseen Morphological Variants in Statistical Machine Translation. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain, April 2017. Association for Computational Linguistics.

[Huck & Vilar+ 11a] M. Huck, D. Vilar, D. Stein, H. Ney: Advancements in Arabic-to-English Hierarchical Machine Translation. In Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT), pp. 273–280, Leuven, Belgium, May 2011.

[Huck & Vilar+ 11b] M. Huck, D. Vilar, D. Stein, H. Ney: Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation. In Proceedings of the EMNLP 2011 Workshop on Unsupervised Learning in NLP, pp. 91–96, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.

[Huck & Vilar+ 13] M. Huck, D. Vilar, M. Freitag, H. Ney: A Performance Study of Cube Pruning for Large-Scale Hierarchical Machine Translation. In Proceedings of the NAACL 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 29–38, Atlanta, GA, USA, June 2013. Association for Computational Linguistics.

[Huck & Wuebker+ 11] M. Huck, J. Wuebker, C. Schmidt, M. Freitag, S. Peitz, D. Stein, A. Dagnelies, S. Mansour, G. Leusch, H. Ney: The RWTH Aachen Machine Translation System for WMT 2011. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 405–412, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.

[Huck & Wuebker+ 13] M. Huck, J. Wuebker, F. Rietig, H. Ney: A Phrase Orientation Model for Hierarchical Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 452–463, Sofia, Bulgaria, Aug. 2013. Association for Computational Linguistics.

[Igel & Hüsken 03] C. Igel, M. Hüsken: Empirical Evaluation of the Improved Rprop Learning Algorithms. Neurocomputing, Vol. 50, pp. 105–123, 2003.

[Iglesias & de Gispert+ 09a] G. Iglesias, A. de Gispert, E.R. Banga, W. Byrne: Rule Filtering by Pattern for Efficient Hierarchical Translation. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 380–388, Athens, Greece, March 2009.


[Iglesias & de Gispert+ 09b] G. Iglesias, A. de Gispert, E.R. Banga, W. Byrne: Hierarchical Phrase-Based Translation with Weighted Finite State Transducers. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 433–441, Boulder, CO, USA, June 2009.

[Jeong & Toutanova+ 10] M. Jeong, K. Toutanova, H. Suzuki, C. Quirk: A Discriminative Lexicon Model for Complex Morphology. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Denver, CO, USA, Oct./Nov. 2010.

[Kazemi & Toral+ 15] A. Kazemi, A. Toral, A. Way, A. Monadjemi, M. Nematbakhsh: Dependency-based Reordering Model for Constituent Pairs in Hierarchical SMT. In Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT), pp. 43–50, Antalya, Turkey, May 2015.

[Kneser & Ney 95] R. Kneser, H. Ney: Improved Backing-Off for M-gram Language Modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 181–184, Detroit, MI, USA, May 1995.

[Knight 99] K. Knight: Decoding Complexity in Word-Replacement Translation Models. Computational Linguistics, Vol. 25, No. 4, pp. 607–615, Dec. 1999.

[Koehn 05] P. Koehn: Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the MT Summit X, Phuket, Thailand, Sept. 2005.

[Koehn & Birch+ 09] P. Koehn, A. Birch, R. Steinberger: 462 Machine Translation Systems for Europe. In Proceedings of the MT Summit XII, pp. 65–72, Ottawa, Canada, Aug. 2009.

[Koehn & Hoang+ 07] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst: Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL 2007 Demo and Poster Sessions, pp. 177–180, Prague, Czech Republic, June 2007.

[Koehn & Knight 03] P. Koehn, K. Knight: Empirical Methods for Compound Splitting. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 187–194, Budapest, Hungary, April 2003.

[Koehn & Och+ 03] P. Koehn, F.J. Och, D. Marcu: Statistical Phrase-Based Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 127–133, Edmonton, Canada, May/June 2003.

[Koehn & Schroeder 07] P. Koehn, J. Schroeder: Experiments in Domain Adaptation for Statistical Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 224–227, Prague, Czech Republic, June 2007.

[Kordoni & van den Bosch+ 16] V. Kordoni, A. van den Bosch, K.L. Kermanidis, V. Sosoni, K. Cholakov, I. Hendrickx, M. Huck, A. Way: Enhancing Access to Online Education: Quality Machine Translation of MOOC Content. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 16–22, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).

[Lambert & Schwenk+ 11] P. Lambert, H. Schwenk, C. Servan, S. Abdul-Rauf: Investigations on Translation Model Adaptation Using Monolingual Data. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 284–293, Edinburgh, Scotland, UK, July 2011.


[Leusch & Max+ 10] G. Leusch, A. Max, J.M. Crego, H. Ney: Multi-Pivot Translation by System Combination. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 299–306, Paris, France, Dec. 2010.

[Li & Callison-Burch+ 09a] Z. Li, C. Callison-Burch, C. Dyer, S. Khudanpur, L. Schwartz, W. Thornton, J. Weese, O. Zaidan: Joshua: An Open Source Toolkit for Parsing-Based Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 135–139, Athens, Greece, March 2009.

[Li & Callison-Burch+ 09b] Z. Li, C. Callison-Burch, S. Khudanpur, W. Thornton: Decoding in Joshua: Open Source, Parsing-Based Machine Translation. The Prague Bulletin of Mathematical Linguistics, Vol. 91, pp. 47–56, Jan. 2009.

[Li & Eisner+ 11] Z. Li, J. Eisner, Z. Wang, S. Khudanpur, B. Roark: Minimum Imputed Risk: Unsupervised Discriminative Training for Machine Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 920–929, Edinburgh, Scotland, UK, July 2011.

[Li & Khudanpur 08] Z. Li, S. Khudanpur: A Scalable Decoder for Parsing-Based Machine Translation with Equivalent Language Model State Maintenance. In Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation, SSST '08, pp. 10–18, Columbus, OH, USA, June 2008.

[Li & Tu+ 12] J. Li, Z. Tu, G. Zhou, J. van Genabith: Using Syntactic Head Information in Hierarchical Phrase-Based Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 232–242, Montréal, Canada, June 2012.

[Macháček & Bojar 13] M. Macháček, O. Bojar: Results of the WMT13 Metrics Shared Task. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 45–51, Sofia, Bulgaria, August 2013.

[Macháček & Bojar 14] M. Macháček, O. Bojar: Results of the WMT14 Metrics Shared Task. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 293–301, Baltimore, MD, USA, June 2014.

[Macherey & Och+ 08] W. Macherey, F. Och, I. Thayer, J. Uszkoreit: Lattice-based Minimum Error Rate Training for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 725–734, Honolulu, HI, USA, Oct. 2008.

[Mansour & Ney 12] S. Mansour, H. Ney: Arabic-Segmentation Combination Strategies for Statistical Machine Translation. In Language Resources and Evaluation, pp. 3915–3920, Istanbul, Turkey, May 2012.

[Mariño & Banchs+ 06] J.B. Mariño, R.E. Banchs, J.M. Crego, A. de Gispert, P. Lambert, J.A.R. Fonollosa, M.R. Costa-Jussà: N-gram-based Machine Translation. Computational Linguistics, Vol. 32, No. 4, pp. 527–549, Dec. 2006.

[Marton & Resnik 08] Y. Marton, P. Resnik: Soft Syntactic Constraints for Hierarchical Phrased-Based Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1003–1011, Columbus, OH, USA, June 2008.

[Mauser & Hasan+ 09] A. Mauser, S. Hasan, H. Ney: Extending Statistical Machine Translation with Discriminative and Trigger-Based Lexicon Models. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 210–218, Singapore, Aug. 2009.

[Mauser & Zens+ 06] A. Mauser, R. Zens, E. Matusov, S. Hasan, H. Ney: The RWTH Statistical Machine Translation System for the IWSLT 2006 Evaluation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 103–110, Kyoto, Japan, Nov. 2006.

[Moore 02] R.C. Moore: Fast and Accurate Sentence Alignment of Bilingual Corpora. In S.D. Richardson, editor, Machine Translation: From Research to Real Users, 5th Conference of the Association for Machine Translation in the Americas, AMTA 2002, Tiburon, CA, USA, October 6–12, 2002, Proceedings, Vol. 2499 of Lecture Notes in Computer Science. Springer, 2002.

[Moore 04] R.C. Moore: Improving IBM Word-Alignment Model 1. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 518–525, Barcelona, Spain, July 2004.

[Moore & Lewis 10] R.C. Moore, W. Lewis: Intelligent Selection of Language Model Training Data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 220–224, Uppsala, Sweden, July 2010.

[Nelder & Mead 65] J.A. Nelder, R. Mead: A Simplex Method for Function Minimization. The Computer Journal, Vol. 7, pp. 308–313, 1965.

[Nguyen & Vogel 13] T. Nguyen, S. Vogel: Integrating Phrase-based Reordering Features into a Chart-based Decoder for Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1587–1596, Sofia, Bulgaria, Aug. 2013.

[Och 02] F.J. Och: Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen University, Aachen, Germany, Oct. 2002.

[Och 03] F.J. Och: Minimum Error Rate Training for Statistical Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 160–167, Sapporo, Japan, July 2003.

[Och & Gildea+ 03] F.J. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, D. Radev: Syntax for Statistical Machine Translation. Technical report, Johns Hopkins University 2003 Summer Workshop on Language Engineering, Center for Language and Speech Processing, Baltimore, MD, USA, 120 pages, Aug. 2003.

[Och & Gildea+ 04] F.J. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, D. Radev: A Smorgasbord of Features for Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 161–168, Boston, MA, USA, May 2004.

[Och & Ney 02] F.J. Och, H. Ney: Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 295–302, Philadelphia, PA, USA, July 2002.

[Och & Ney 03] F.J. Och, H. Ney: A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, Vol. 29, No. 1, pp. 19–51, March 2003.

[Och & Ney 04] F.J. Och, H. Ney: The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, Vol. 30, No. 4, pp. 417–449, Dec. 2004.


[Och & Tillmann+ 99] F.J. Och, C. Tillmann, H. Ney: Improved Alignment Models for Statistical Machine Translation. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP99), pp. 20–28, University of Maryland, College Park, MD, USA, June 1999.

[Papineni & Roukos+ 02] K. Papineni, S. Roukos, T. Ward, W.J. Zhu: Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318, Philadelphia, PA, USA, July 2002.

[Peitz & Mansour+ 12] S. Peitz, S. Mansour, M. Freitag, M. Feng, M. Huck, J. Wuebker, M. Nuhn, M. Nußbaum-Thom, H. Ney: The RWTH Aachen Speech Recognition and Machine Translation System for IWSLT 2012. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 69–76, Hong Kong, Dec. 2012.

[Peitz & Mansour+ 13a] S. Peitz, S. Mansour, M. Huck, M. Freitag, H. Ney, E. Cho, T. Herrmann, M. Mediani, J. Niehues, A. Waibel, A. Allauzen, Q.K. Do, B. Buschbeck, T. Wandmacher: Joint WMT 2013 Submission of the QUAERO Project. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 185–192, Sofia, Bulgaria, Aug. 2013. Association for Computational Linguistics.

[Peitz & Mansour+ 13b] S. Peitz, S. Mansour, J.T. Peter, C. Schmidt, J. Wuebker, M. Huck, M. Freitag, H. Ney: The RWTH Aachen Machine Translation System for WMT 2013. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 193–199, Sofia, Bulgaria, Aug. 2013. Association for Computational Linguistics.

[Peter & Alkhouli+ 16] J.T. Peter, T. Alkhouli, H. Ney, M. Huck, F. Braune, A. Fraser, A. Tamchyna, O. Bojar, B. Haddow, R. Sennrich, F. Blain, L. Specia, J. Niehues, A. Waibel, A. Allauzen, L. Aufrant, F. Burlot, E. Knyazeva, T. Lavergne, F. Yvon, S. Frank, M. Pinnis: The QT21/HimL Combined Machine Translation System. In Proceedings of the ACL 2016 First Conference on Machine Translation (WMT16), pp. 344–355, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.

[Peter & Huck+ 11] J.T. Peter, M. Huck, H. Ney, D. Stein: Soft String-to-Dependency Hierarchical Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 246–253, San Francisco, CA, USA, Dec. 2011.

[Peter & Huck+ 12] J.T. Peter, M. Huck, H. Ney, D. Stein: Soft String-to-Dependency Hierarchical Machine Translation. In Informatiktage der Gesellschaft für Informatik, Lecture Notes in Informatics (LNI), pp. 59–62, Bonn, Germany, March 2012. Gesellschaft für Informatik. ISBN 978-3-88579-445-5.

[Popović & Ney 06] M. Popović, H. Ney: POS-based Word Reorderings for Statistical Machine Translation. In Language Resources and Evaluation, pp. 1278–1283, Genoa, Italy, May 2006.

[Popović & Stein+ 06] M. Popović, D. Stein, H. Ney: Statistical Machine Translation of German Compound Words. In FinTAL - 5th International Conference on Natural Language Processing, Lecture Notes in Computer Science, Vol. 4139, pp. 616–624, Turku, Finland, Aug. 2006.

[Sankaran & Razmara+ 12] B. Sankaran, M. Razmara, A. Sarkar: Kriya - An end-to-end Hierarchical Phrase-based MT System. The Prague Bulletin of Mathematical Linguistics (PBML), Vol. 97, pp. 83–98, April 2012.


[Schwenk 08] H. Schwenk: Investigations on Large-Scale Lightly-Supervised Training for Statistical Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 182–189, Waikiki, HI, USA, Oct. 2008.

[Schwenk & Senellart 09] H. Schwenk, J. Senellart: Translation Model Adaptation for an Arabic/French News Translation System by Lightly-Supervised Training. In Proceedings of the MT Summit XII, Ottawa, Canada, Aug. 2009.

[Sennrich & Williams+ 15] R. Sennrich, P. Williams, M. Huck: A tree does not make a well-formed sentence: Improving syntactic string-to-tree statistical machine translation with more linguistic knowledge. Computer Speech & Language, Vol. 32, No. 1, pp. 27–45, 2015.

[Shen & Xu+ 08] L. Shen, J. Xu, R. Weischedel: A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 577–585, Columbus, OH, USA, June 2008.

[Shen & Xu+ 10] L. Shen, J. Xu, R. Weischedel: String-to-Dependency Statistical Machine Translation. Computational Linguistics, Vol. 36, No. 4, pp. 649–671, Dec. 2010.

[Snover & Dorr+ 06] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, J. Makhoul: A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), pp. 223–231, Cambridge, MA, USA, Aug. 2006.

[Stanojević & Kamran+ 15] M. Stanojević, A. Kamran, P. Koehn, O. Bojar: Results of the WMT15 Metrics Shared Task. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 256–273, Lisbon, Portugal, September 2015.

[Stein & Peitz+ 10] D. Stein, S. Peitz, D. Vilar, H. Ney: A Cocktail of Deep Syntactic Features for Hierarchical Machine Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Denver, CO, USA, Oct./Nov. 2010.

[Stein & Vilar+ 11] D. Stein, D. Vilar, S. Peitz, M. Freitag, M. Huck, H. Ney: A Guide to Jane, an Open Source Hierarchical Translation Toolkit. The Prague Bulletin of Mathematical Linguistics (PBML), Vol. 95, pp. 5–18, April 2011.

[Stolcke 02] A. Stolcke: SRILM – an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Vol. 3, Denver, CO, USA, Sept. 2002.

[Stymne & Cancedda+ 13] S. Stymne, N. Cancedda, L. Ahrenberg: Generation of Compound Words in Statistical Machine Translation into Compounding Languages. Computational Linguistics, Vol. 39, No. 4, pp. 1067–1108, Dec. 2013.

[Sundermeyer & Schlüter+ 11] M. Sundermeyer, R. Schlüter, H. Ney: On the Estimation of Discount Parameters for Language Model Smoothing. In Proceedings of Interspeech, pp. 1433–1436, Florence, Italy, Aug. 2011.

[Tillmann 04] C. Tillmann: A Unigram Orientation Model for Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 101–104, Boston, MA, USA, May 2004.


[Toutanova & Galley 11] K. Toutanova, M. Galley: Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 461–466, Portland, OR, USA, June 2011.

[Tromble & Eisner 09] R. Tromble, J. Eisner: Learning Linear Ordering Problems for Better Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 1007–1016, Singapore, Aug. 2009.

[Ueffing & Haffari+ 07] N. Ueffing, G. Haffari, A. Sarkar: Transductive learning for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 25–32, Prague, Czech Republic, June 2007.

[Utiyama & Isahara 07] M. Utiyama, H. Isahara: A Comparison of Pivot Methods for Phrase-based Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 484–491, Rochester, NY, USA, April 2007.

[Venugopal & Zollmann+ 09] A. Venugopal, A. Zollmann, N.A. Smith, S. Vogel: Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 236–244, Boulder, CO, USA, June 2009.

[Vilar 11] D. Vilar: Investigations on Hierarchical Phrase-based Machine Translation. Ph.D. thesis, RWTH Aachen University, Aachen, Germany, Nov. 2011.

[Vilar & Ney 12] D. Vilar, H. Ney: Cardinality pruning and language model heuristics for hierarchical phrase-based translation. Machine Translation, Vol. 26, No. 3, pp. 217–254, Sept. 2012.

[Vilar & Stein+ 08] D. Vilar, D. Stein, H. Ney: Analysing Soft Syntax Features and Heuristics for Hierarchical Phrase Based Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 190–197, Waikiki, HI, USA, Oct. 2008.

[Vilar & Stein+ 10] D. Vilar, D. Stein, M. Huck, H. Ney: Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 262–270, Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[Vilar & Stein+ 12] D. Vilar, D. Stein, M. Huck, H. Ney: Jane: an advanced freely available hierarchical machine translation toolkit. Machine Translation, Vol. 26, No. 3, pp. 197–216, Sept. 2012.

[Vilar & Stein+ 13] D. Vilar, D. Stein, M. Huck, J. Wuebker, M. Freitag, S. Peitz, M. Nuhn, J.T. Peter: Jane: User’s Manual, 2013. http://www.hltpr.rwth-aachen.de/jane/manual.pdf.

[Vogel & Ney+ 96] S. Vogel, H. Ney, C. Tillmann: HMM-Based Word Alignment in Statistical Translation. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 836–841, Copenhagen, Denmark, Aug. 1996.

[Wang & Collins+ 07] C. Wang, M. Collins, P. Koehn: Chinese Syntactic Reordering for Statistical Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 737–745, Prague, Czech Republic, June 2007.


[Williams & Koehn 12] P. Williams, P. Koehn: GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 388–394, Montréal, Canada, June 2012.

[Williams & Sennrich+ 14] P. Williams, R. Sennrich, M. Nadejde, M. Huck, E. Hasler, P. Koehn: Edinburgh’s Syntax-Based Systems at WMT 2014. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 207–214, Baltimore, MD, USA, June 2014. Association for Computational Linguistics.

[Williams & Sennrich+ 15] P. Williams, R. Sennrich, M. Nadejde, M. Huck, P. Koehn: Edinburgh’s Syntax-Based Systems at WMT 2015. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 199–209, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[Williams & Sennrich+ 16] P. Williams, R. Sennrich, M. Nadejde, M. Huck, B. Haddow, O. Bojar: Edinburgh’s Statistical Machine Translation Systems for WMT16. In Proceedings of the ACL 2016 First Conference on Machine Translation (WMT16), pp. 399–410, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.

[Wu & Wang 09] H. Wu, H. Wang: Revisiting Pivot Language Approach for Machine Translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 154–162, Suntec, Singapore, Aug. 2009.

[Wuebker & Huck+ 11] J. Wuebker, M. Huck, S. Mansour, M. Freitag, M. Feng, S. Peitz, C. Schmidt, H. Ney: The RWTH Aachen Machine Translation System for IWSLT 2011. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 106–113, San Francisco, CA, USA, Dec. 2011.

[Wuebker & Huck+ 12] J. Wuebker, M. Huck, S. Peitz, M. Nuhn, M. Freitag, J.T. Peter, S. Mansour, H. Ney: Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 483–491, Mumbai, India, Dec. 2012.

[Wuebker & Mauser+ 10] J. Wuebker, A. Mauser, H. Ney: Training Phrase Translation Models with Leaving-One-Out. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 475–484, Uppsala, Sweden, July 2010.

[Xiao & Su+ 11] X. Xiao, J. Su, Y. Liu, Q. Liu, S. Lin: An Orientation Model for Hierarchical Phrase-Based Translation. In Proceedings of the 2011 International Conference on Asian Language Processing, IALP '11, pp. 165–168, Penang, Malaysia, Nov. 2011. IEEE Computer Society.

[Xiao & Zhu+ 12] T. Xiao, J. Zhu, H. Zhang, Q. Li: NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation. In Proceedings of the ACL 2012 System Demonstrations, pp. 19–24, Jeju, Republic of Korea, July 2012.

[Xie & Mi+ 11] J. Xie, H. Mi, Q. Liu: A Novel Dependency-to-String Model for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 216–226, Edinburgh, Scotland, UK, July 2011.

[Xu & Koehn 12] W. Xu, P. Koehn: Extending Hiero Decoding in Moses with Cube Growing. The Prague Bulletin of Mathematical Linguistics, Vol. 98, pp. 133–142, Oct. 2012.


[Xu & Zens+ 04] J. Xu, R. Zens, H. Ney: Do We Need Chinese Word Segmentation for Statistical Machine Translation? In O. Streiter, Q. Lu, editors, ACL Workshop, Third SIGHAN Workshop on Chinese Language Learning, pp. 122–128, Barcelona, Spain, July 2004.

[Zens 08] R. Zens: Phrase-based Statistical Machine Translation: Models, Search, Training. Ph.D. thesis, RWTH Aachen University, Aachen, Germany, Feb. 2008.

[Zens & Ney 04a] R. Zens, H. Ney: Improvements in Phrase-Based Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 257–264, Boston, MA, USA, May 2004.

[Zens & Ney+ 04b] R. Zens, H. Ney, T. Watanabe, E. Sumita: Reordering Constraints for Phrase-Based Statistical Machine Translation. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 205–211, Geneva, Switzerland, Aug. 2004.

[Zens & Ney 06] R. Zens, H. Ney: Discriminative Reordering Models for Statistical Machine Translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 55–63, New York City, NY, USA, June 2006.

[Zens & Ney 08] R. Zens, H. Ney: Improvements in Dynamic Programming Beam Search for Phrase-Based Statistical Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 195–205, Waikiki, HI, USA, Oct. 2008.

[Zens & Och+ 02] R. Zens, F.J. Och, H. Ney: Phrase-Based Statistical Machine Translation. In German Conference on Artificial Intelligence, pp. 18–32, Aachen, Germany, Sept. 2002.

[Zhu & He+ 14] X. Zhu, Z. He, H. Wu, C. Zhu, H. Wang, T. Zhao: Improving Pivot-Based Statistical Machine Translation by Pivoting the Co-occurrence Count of Phrase Pairs. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pp. 1665–1675, Doha, Qatar, Oct. 2014.

[Zollmann & Venugopal 06] A. Zollmann, A. Venugopal: Syntax Augmented Machine Translation via Chart Parsing. In Proceedings of the Workshop on Statistical Machine Translation (WMT), pp. 138–141, New York City, NY, USA, June 2006.

[Zollmann & Venugopal+ 08] A. Zollmann, A. Venugopal, F.J. Och, J. Ponte: A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 1145–1152, Manchester, UK, 2008.
