
Universität des Saarlandes

Max-Planck-Institut für Informatik

AIDArabic+

Named Entity Disambiguation for

Arabic Text

Masterarbeit im Fach Informatik

Master’s Thesis in Computer Science

von / by

Mohamed Gad-Elrab

angefertigt unter der Leitung von / supervised by

Prof. Dr. Gerhard Weikum

betreut von / advised by

Mohamed Amir Yosef

begutachtet von / reviewers

Prof. Dr. Gerhard Weikum

Dr. Klaus Berberich

Saarbrücken, July 2015

Eidesstattliche Erklärung

Ich erkläre hiermit an Eides Statt, dass ich die vorliegende Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Statement in Lieu of an Oath

I hereby confirm that I have written this thesis on my own and that I have not used any

other media or materials than the ones referred to in this thesis.

Einverständniserklärung

Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die Bibliothek der Informatik aufgenommen und damit veröffentlicht wird.

Declaration of Consent

I agree to make both versions of my thesis (with a passing grade) accessible to the public

by having them added to the library of the Computer Science Department.

Saarbrücken, July 2015 Mohamed Gad-Elrab

Abstract

Named Entity Disambiguation (NED) is the problem of mapping mentions of ambiguous names in natural language text onto canonical entities, such as people or places, registered in a knowledge base. Recent advances in this field enable a semantic understanding of content in different types of text. While the problem has been extensively studied for English text, support for other languages, and Arabic in particular, is still in its infancy. At the same time, Arabic web content (e.g., in social media) has been increasing exponentially over the last few years. Therefore, we see great potential for endeavors that support entity-level analytics of these data. AIDArabic is the first work to use evidence from both the English and Arabic Wikipedia to disambiguate Arabic content against a knowledge base automatically generated from Wikipedia.

The contributions of this thesis are threefold: 1) We introduce the EDRAK resource as an automatic augmentation of AIDArabic’s entity catalog and disambiguation data components, using information beyond the manually crafted data in the Arabic Wikipedia. We build EDRAK by fusing external web resources with the output of machine translation and transliteration applied to data extracted from the English Wikipedia. 2) We incorporate an Arabic-specific input pre-processing module into the disambiguation process to handle the complex features of Arabic text. 3) We automatically build a test corpus from an existing parallel English-Arabic corpus to overcome the absence of standard benchmarks for Arabic NED systems. We evaluated the data resource as well as the full pipeline using a mix of manual and automatic assessment. Our enrichment approaches in EDRAK expand the disambiguation space from 143K entities in the original AIDArabic to 2.4M entities. Moreover, the full disambiguation process maps 94.7% of the mentions to non-null entities with a precision of 73%, compared to 87.2% non-null mappings with only 69% precision in the original AIDArabic.

Acknowledgements

الحمد لله حمداً كثيراً طيباً

During this thesis, I have learned several essential research skills that, I believe, will

shape my research career. Therefore, I would like to express my sincere gratitude to

Prof. Gerhard Weikum for giving me the opportunity to work under his supervision in such

a pioneering group, for facilitating the research and for his valuable advice throughout

the thesis.

I would also like to express my sincere gratitude and appreciation to my advisor Mohamed Amir for his continuous guidance and support on both the professional and personal levels. I really appreciate his patience in teaching me many essential research, communication and planning skills. I am extremely thankful for his generosity in sharing his valuable expertise and time. Working with him was one of the richest experiences of my life. I hope to have the opportunity to work with him again in the future.

I would like to thank Akram El-korashy, Mayank Goyal and Uzair Mahmoud for

their useful feedback regarding the thesis writing. Also, our experiments could not

have been finalized without the help of the proactive volunteers who agreed to participate in

our manual assessment. I would also like to thank the reviewers of this thesis for their

precious time and effort.

At the end of this master’s program, I am grateful to the International Max-Planck Research School for Computer Science (IMPRS-CS) family for their support throughout the program. I believe I was fortunate to be part of this big family.

On the personal level, words will never be enough to express how thankful

and indebted I am to my family (my parents, sister and brother) for their sincere support,

encouragement and prayers throughout my life and my long education journey. I

appreciate their patience towards my continuous absence.

I would like to thank all my friends in Saarbrücken. I believe I am blessed to be

surrounded by all those intelligent, caring and enthusiastic personalities.

Finally, I would like to extend my gratitude to everyone who expressed support

and/or made Dua for me.

Beijing, China. Mohamed Gad-Elrab

July, 2015


Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Problem
1.3 Proposed Solution
1.4 Outline

2 Background
2.1 Named Entity Disambiguation
2.2 Named Entity Disambiguation for Arabic
2.3 AIDArabic: Under The Hood
2.3.1 Data Sources
2.3.2 Processing

3 AIDArabic+
3.1 AIDArabic Challenges
3.2 AIDArabic+ in A Nutshell
3.3 Enriching Data Components
3.4 Language Specific Processing

4 EDRAK: Entity-Centric Resource For Arabic Knowledge
4.1 External Name Dictionaries
4.1.1 Entity-Aware Resources
4.1.2 Lexical Name Dictionaries
4.2 Named-Entities Translation
4.2.1 General Statistical Machine Translation
4.2.2 Named-Entities SMT
4.2.3 Named-Entities Light-SMT
4.2.4 Named-Entities Full-SMT
4.3 Transliteration
4.3.1 Transliteration Approaches
4.3.2 Character-Level Statistical Machine Translation
4.4 Arabic Names Splitting
4.5 Edrak as A Standalone Resource
4.5.1 Use-cases
4.5.2 Technical details

5 Evaluation and Statistics
5.1 Evaluation EDRAK
5.1.1 Statistics
5.1.2 Data Example
5.1.3 Manual Assessment
5.2 AIDArabic+ Evaluation
5.2.1 Arabic Corpus Creation
5.2.2 Experiment Setup
5.2.3 Results and Discussion

6 Conclusion and Outlook
6.1 Conclusion
6.2 Outlook

Bibliography

A Manual Assessment Interface

List of Figures

1.1 Internet Users Population
2.1 Building Name Dictionary in AIDArabic
2.2 Building Keyphrases Dictionary in AIDArabic
3.1 Building Name Dictionary in AIDArabic+
3.2 Building Keyphrases Dictionary in AIDArabic+
4.1 General Statistical Machine Translation Pipeline
4.2 Single token translation using popularity voting
4.3 Type-Aware Entity-Name Translation using full SMT system
A.1 Manual Assessment 1
A.2 Manual Assessment 2

List of Tables

4.1 Sample of Google-Word-to-Concept raw data
4.2 Sample of JRC-Names raw data
4.3 Sample of CMUQ-Arabic-NET raw data
4.4 Entity Names SMT Training Data Size
4.5 Sample of character-level training data
4.6 Arabic names splitting
4.7 Main SQL Tables in EDRAK
5.1 AIDArabic vs EDRAK
5.2 Enrichment Techniques Contribution
5.3 Number of Entities per Type in AIDArabic vs EDRAK
5.4 Contextual keyphrases dictionary AIDArabic vs EDRAK
5.5 Example from EDRAK resource
5.6 Manual Assessment Results
5.7 LDC2014T05 Annotated Benchmark
5.8 Disambiguation Results

Chapter 1

Introduction

1.1 Motivation

Entities such as persons, organizations and locations can be referred to by many different

name aliases, and similarly, the same name can be used to refer to different entities. For

example, Barack Obama can be referred to as “Barack Hussein Obama”, “Obama”, or

“USA president”, in different pieces of text. This type of ambiguity makes it challenging

for Information Extraction (IE) and Information Retrieval (IR) to retrieve information

about these entities.

Named Entity Disambiguation (NED) is the process of resolving the different mentions

of people, organizations and places that appear in text onto canonical entities in a

knowledge base [27, 67] such as DBpedia [6] and YAGO [26]. NED is essential for several

IR and Semantic Analysis tasks. It can help create accurate analytics over canonical

entities instead of ambiguous mention strings [20]. Furthermore, NED can help advance

applications such as Entity-based search, Text Summarization and News Analysis [24].

Arabic is one of the most widely spoken languages around the globe. As shown in Figure 1.1a, in December 2013 Arabic was estimated to have the 4th largest population of online users (135M), after English, Chinese and Spanish and ahead of other widely used languages such as Japanese, German, French and Russian. Moreover, Arabic had the fastest growing user population on the Internet between 2000 and 2013, with around 5000% growth. Consequently, unstructured Arabic online content such as news articles, forums, blogs and social media is growing rapidly. For instance, in March 2014, Arabic-speaking users contributed an average of 17.5M tweets/day to Twitter alone².

² http://www.arabsocialmediareport.com/

Figure 1.1: Internet Users Population¹. (a) Number of users. (b) Users growth.

On the other hand, the amount of structured or semi-structured Arabic content is

lagging behind. For example, Wikipedia is one of the main resources from which many

modern Knowledge Bases (KB) are extracted. It is heavily used in the literature for IR

and NLP tasks. However, the Arabic Wikipedia is an order of magnitude

smaller than the English one³. Furthermore, the structured data in the Arabic Wikipedia,

such as infoboxes, are on average of lower quality in terms of coverage and accuracy.

Therefore, Arabic is still considered a resource-poor language.

1.2 Problem

While NED is a well studied problem for English input, few systems have considered

extending NED to other languages such as Arabic. Adapting NED to Arabic text poses

three main challenges:

• Limited structured resources: NED systems usually require dictionaries that link candidate entities to one or more name aliases. Moreover, they require a textual representation of each entity (entity description or entity context), usually in the form of a set of keyphrases. Keyphrases are essential for estimating the context similarity between candidate entities and the retrieved mention [67]. Dictionaries built from Arabic structured resources are limited in size and quality, which restricts their ability to support a robust NED process.

• Arabic language characteristics: Arabic is a morphologically rich language whose character set and writing rules differ from those of Latin-alphabet languages. Standard English tokenization and normalization (i.e., lemmatization and stemming) techniques are not suitable for Arabic text. Incorrectly tokenized input heavily reduces the quality of name dictionary look-up and similarity measurement [67]. For example, pronouns written attached to the preceding word should be separated out for better string matching.

• Annotated NED corpus: There is no Arabic corpus with semantic/entity annotation. Annotated corpora are essential for tuning NED parameters as well as for measuring the overall performance of the NED process during the development phase.

³ In July 2015, the Arabic Wikipedia had 374,291 articles, while the English Wikipedia had 4,910,360.

1.3 Proposed Solution

In this thesis, we introduce AIDArabic+, an NED system geared for Arabic text input.

AIDArabic+ overcomes the mentioned challenges by utilizing:

• Enriched Arabic data schema (EDRAK): EDRAK is an Arabic entity-centric resource that offers comprehensive name and contextual keyphrase dictionaries for entities from both the Arabic and English Wikipedias. The dictionaries are not limited to the manually crafted data in the Arabic Wikipedia; instead, several external name dictionaries are harnessed. In addition, type-aware named-entity translation and transliteration techniques are developed to automatically compile EDRAK’s dictionaries.

• Arabic input pre-processing components: We integrated an Arabic morphology-based pre-processing component to perform deep tokenization and normalization, thereby enhancing name matching and context similarity estimation.

• Annotated NED corpus for Arabic: We automatically created an annotated Arabic corpus from a manually translated and aligned parallel English-Arabic corpus. The resulting corpus was used to evaluate the effect of the proposed components.

1.4 Outline

The following chapter discusses Named Entity Disambiguation concepts and essential

components as well as existing systems supporting Arabic. Chapter 3 describes our

general approach to create AIDArabic+. Then, the creation of EDRAK, our enriched

data schema, is explained in Chapter 4. We describe the statistics of the generated

resource EDRAK and the manual assessment performed to verify the resource quality in

Chapter 5. Finally, we discuss the creation of the annotated corpus and the effect of the

proposed approaches on the quality of the full NED pipeline.

Chapter 2

Background

2.1 Named Entity Disambiguation

Named Entity Disambiguation (NED), also called Entity Linking in some of the literature [20], is the problem of mapping ambiguous mentions of named entities such as places, organizations, and persons appearing in natural language input text onto canonical entities registered in a knowledge base (KB) [27] such as DBpedia [6], YAGO [26] or BabelNet [45]. NED is different from Named Entity Recognition (NER), which is only concerned with extracting named entities and classifying them into coarse-grained categories: LOCATION, ORGANIZATION, MISC and PERSON. NER is usually performed first to recognize the named entities that the disambiguation process then resolves.

It is worth noting that NED tasks are different from Word Sense Disambiguation (WSD) tasks. While WSD is concerned with resolving the correct meaning of words and concepts such as “bank” or “plant” in the provided context, NED focuses on mapping ambiguous names to the correct entities. For instance, the sentence “Müller plays for the German National Team” has two ambiguous names, “Müller” and “German National Team”. “Müller” can refer to any person with this name, and “German National Team” can be either the German football team or the German basketball team. Nevertheless, both tasks require rich back-end name and word-context dictionaries to map the input correctly [62].

The NED problem is well studied for English input. Several NED systems have been

developed for the English language such as DBpedia Spotlight [40], Illinois Wikifier [50],

Tagme2 [17], AIDA [27, 66], NERD-ML [61], AGDISTIS [59] and Babelfy [43]. Besides,

several annotated corpora have been developed to evaluate the performance of these

systems on English input [60] such as TAC Entity-Linking task [29], KORE-50 [25],

#Microposts2015 [51] and AIDA-CoNLL [27].


On the other hand, only a few of these systems are capable of processing input in other languages. Moreover, only a few corpora exist for other languages, such as the TAC Entity-Linking corpora for Spanish and Chinese.

2.2 Named Entity Disambiguation for Arabic

Resource-poor languages such as Arabic have limited support. Some research has attempted to support Arabic disambiguation as a Cross-Lingual Information Retrieval (CLIR) problem. McNamee et al. (2011) [38] developed a cross-language entity linking approach that maps names in text of any language to the English Wikipedia entities registered in the TAC-KBP [63, 39]. The input names and context are translated or transliterated into English before processing; the linking is then performed as a monolingual English problem. To evaluate the performance of their approach, they developed a persons-only cross-language ground truth for their experiments, using parallel corpora and crowd-sourcing to create the annotations [37]. However, this approach overlooks language- and culture-specific entities and names that may not exist in the English-only KB.

To the best of our knowledge, Babelfy and AIDArabic [67] are the only systems built to disambiguate Arabic mentions against a KB containing Arabic entities together with their potential names.

Babelfy¹ is a multilingual system that combines the WSD and NED tasks. It uses BabelNet [46] as its back-end KB. BabelNet is a multilingual resource built from Wikipedia entities and WordNet senses. The sense labels, Wikipedia titles of incoming links, outgoing anchor texts, redirects and categories serve as sources for the disambiguation context. In addition, an off-the-shelf translation service was used to translate Wikipedia concepts into other languages. Nevertheless, translation was not applied to named entities [46]. Babelfy was evaluated on English, Spanish, Italian, German, and French corpora, but not on an Arabic one.

AIDArabic [67] is an NED system that has been built specifically for Arabic on top of the YAGO3 [36] KB. While AIDArabic’s entity catalog spans a sufficiently large number of entities from the English and Arabic Wikipedias, the coverage of its entity-name and entity-description dictionaries is low. Hence, the recall of the disambiguation suffered heavily.

¹ http://babelfy.org/


2.3 AIDArabic: Under The Hood

In this section, we describe in detail the main data components of AIDArabic and how

they are used in the disambiguation process.

2.3.1 Data Sources

AIDArabic, like most NED frameworks, has three main data components:

Entity Catalog, Name-Entity Dictionary and Entity Descriptions. In addition to these

three components, AIDArabic uses an Entity-Entity relatedness model as a supporting

component.

Entity Catalog

Entity Catalog, or repository, is the source of the canonical entities known for the NED

system. During the disambiguation process, all names in the text are mapped to one of

the entities in the catalog. Names without proper mapping to any entity in the catalog

are mapped to null.

AIDArabic populates its entity catalog from the YAGO3 KB [36], built from both the English and Arabic Wikipedias. This allows capturing prominent English entities as well as culture-specific Arabic entities. For the sake of data integrity, English entity identifiers are used to represent entities that exist in both the English and the Arabic Wikipedias.

Entity-Name Dictionary

Entity-Name Dictionary contains the possible names for each entity in the catalog. Names

in the dictionary are connected to all potential entities. The dictionary is then used to

extract all possible candidates for mentions appearing in the text. Entities that do not

have any potential names cannot appear in the disambiguation candidate list.

In AIDArabic, the name dictionary is populated from Arabic Wikipedia data only (Figure 2.1), and names come from one of the following four sources:

• Titles of the Wikipedia pages. Titles are different from the page id that appears in the URL.

• Disambiguation Pages, in Arabic “صفحات التوضيح”. These pages contain all possible entities/meanings referred to by a specific name. The title of the disambiguation page is added as a potential name for all entities referenced in this page.

• Redirects, in Arabic “تحويلات”. Redirects are pages with no actual content that refer the reader to another page. Redirects are used when searching Wikipedia to route the user to the most prominent entity referred to by this name. For example, searching “أم القرى” (Um Alkora) redirects to “مكة” (Mekka).

• Anchor Text of links pointing to the entity page. Anchor texts can differ from the original title, and hence, they are harvested as potential names.

Figure 2.1: Building Name Dictionary in AIDArabic. Arabic titles, anchors, disambiguation pages and redirects from YAGO3 (EN, AR) feed an Entity/Name table, e.g. <Barack_Obama>: باراك أوباما, <Germany>: ألمانيا, <Egypt>: مصر.

As shown, only manually crafted content is used in building the name-dictionary.

This limits the size of the dictionary to the existing Arabic content.

Technically, this information is collected from YAGO3 RDF tuples, where redirects are represented with the predicate <redirectedFrom> and the remaining sources are represented under the predicate rdfs:label. In addition, the separate given and family names of persons, under <hasGivenName> and <hasFamilyName>, are added to the dictionary.
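The construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the (entity, predicate, value) tuple layout, the sample triples and the function names `build_name_dictionary` and `candidate_entities` are hypothetical and are not taken from the AIDArabic code base.

```python
from collections import defaultdict

# Predicates that carry entity names, as listed in the text above.
NAME_PREDICATES = {"rdfs:label", "<redirectedFrom>", "<hasGivenName>", "<hasFamilyName>"}

# Hypothetical sample triples (entity, predicate, value); not real YAGO3 output.
SAMPLE_TRIPLES = [
    ("<Barack_Obama>", "rdfs:label", "باراك أوباما"),
    ("<Barack_Obama>", "<hasFamilyName>", "أوباما"),
    ("<Mecca>", "rdfs:label", "مكة"),
    ("<Mecca>", "<redirectedFrom>", "أم القرى"),
    ("<Mecca>", "<isLocatedIn>", "<Saudi_Arabia>"),  # not a name predicate, ignored
]


def build_name_dictionary(triples):
    """Map each name string to the set of entities it may refer to."""
    name_to_entities = defaultdict(set)
    for entity, predicate, value in triples:
        if predicate in NAME_PREDICATES:
            name_to_entities[value].add(entity)
    return dict(name_to_entities)


def candidate_entities(name_dictionary, mention):
    """Return the candidates for a mention; an empty set means it maps to null."""
    return name_dictionary.get(mention, set())


name_dictionary = build_name_dictionary(SAMPLE_TRIPLES)
print(candidate_entities(name_dictionary, "أم القرى"))
print(candidate_entities(name_dictionary, "غوته"))
```

The second lookup returns no candidates because no triple supplies that name, which is exactly the situation in which an entity cannot appear in the disambiguation candidate list.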

Entity Description

Entity descriptions, or contextual keyphrases, are the set of keywords that describe an entity and are expected to appear in the text surrounding the entity mention. For example, when “Thomas Müller”, the German footballer, appears in some text, words related to football usually also appear in the text, such as football, match, national team, Germany, goal, etc. Contextual keyphrases are used to compute the similarity between the mention context and the candidate entity context.

Figure 2.2: Building Keyphrases Dictionary in AIDArabic. Arabic anchor texts, Arabic inlink titles and Arabic categories from YAGO3 (EN, AR), together with English inlinks and English categories mapped through an English-Arabic interwiki links lookup, feed the Entity-Keyphrases Dictionary, e.g. <Barack_Obama>: الولايات المتحدة, <Germany>: أنجيلا ميركل, <Egypt>: القاهرة.

AIDArabic utilizes an entity-description dictionary of Arabic keyphrases. Keyphrases are further split up into keywords, each with a specific weight. Keyphrases are harvested from three sources (Figure 2.2):

• Anchor Text inside the Arabic entity page that points to other pages is assigned as keyphrases of this entity.

• Inlink Titles are the titles of the pages that link to the current entity. Inlink titles of Arabic pages pointing to an entity are added directly to its keyphrase set. However, English inlink titles are translated to Arabic via the cross-language inter-Wikipedia links dictionary. For example, to include the inlink title of page “<Egypt>”, the dictionary pair “<Egypt> → <ar/مصر>” is used to get the Arabic title “مصر”.

• Categories are manually assigned classes added to each entity. As with entities, YAGO3 contains a union of the English and Arabic categories. English Wikipedia ids are used to represent the categories, unless the category only exists in Arabic. As with inlink titles, Arabic categories are added directly to the keyphrases, but English categories are translated via the cross-language inter-Wikipedia links between categories.

As mentioned above, only inlink titles and categories are translated, using the manual inter-Wikipedia links. Moreover, no other external dictionaries are used; hence, the context is still limited to the size of the Arabic Wikipedia.
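A minimal sketch of this harvesting step makes the limitation concrete: any English title without an inter-Wikipedia link contributes nothing to the Arabic context. The function `build_keyphrases`, its parameters and the sample data are hypothetical and only illustrate the lookup logic described above.

```python
def build_keyphrases(arabic_phrases, english_titles, interwiki):
    """Collect Arabic keyphrases for one entity.

    arabic_phrases: anchor texts, inlink titles and categories already in Arabic.
    english_titles: English inlink titles and categories of the same entity.
    interwiki: English title -> Arabic title, taken from manual inter-Wikipedia links.
    """
    keyphrases = set(arabic_phrases)
    untranslated = []
    for title in english_titles:
        arabic_title = interwiki.get(title)
        if arabic_title is None:
            untranslated.append(title)  # no interwiki link, so the phrase is lost
        else:
            keyphrases.add(arabic_title)
    return keyphrases, untranslated


# Hypothetical interwiki dictionary and inputs for a single entity.
interwiki = {"<Egypt>": "مصر", "<Cairo>": "القاهرة"}
keyphrases, untranslated = build_keyphrases(
    arabic_phrases=["نهر النيل"],
    english_titles=["<Egypt>", "<Cairo>", "<Goethe_Prize>"],
    interwiki=interwiki,
)
print(keyphrases)
print(untranslated)
```

In this toy input, the title without an Arabic counterpart ends up in `untranslated` and never reaches the keyphrase set.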


Entity-Entity relatedness model

It is common for a single text snippet or document to contain a small number of related entities. Therefore, AIDArabic exploits the Entity-Entity relatedness model to improve the quality of the disambiguation results. The relatedness is estimated based on the overlap in the incoming links [41], fused from both the Arabic and the English Wikipedias.
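One widely used inlink-overlap measure is the Milne-Witten relatedness, which scores two entities by how many pages link to both of them relative to the size of the collection. The sketch below assumes this measure; the exact formula and any fusion weights used in AIDArabic are not stated in this section, and the function name and arguments are hypothetical.

```python
import math


def inlink_relatedness(inlinks_a, inlinks_b, total_entities):
    """Milne-Witten style relatedness in [0, 1] computed from two sets of inlinking pages.

    Assumes both sets are non-empty whenever they overlap and that
    total_entities is at least as large as either set.
    """
    overlap = len(inlinks_a & inlinks_b)
    if overlap == 0:
        return 0.0
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    denominator = math.log(total_entities) - math.log(smaller)
    if denominator <= 0:
        # Both sets cover the whole collection, so they are identical.
        return 1.0
    distance = (math.log(larger) - math.log(overlap)) / denominator
    return max(0.0, 1.0 - distance)


# Inlink sets fused from both Wikipedias would be passed in as plain sets of page ids.
score = inlink_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_entities=1000)
print(round(score, 3))
```

Identical inlink sets give a score of 1, disjoint sets give 0, and the score is symmetric in its two arguments.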

2.3.2 Processing

AIDArabic, like most NED systems, starts by retrieving the possible name mentions from the input text. Name mentions are usually recognized via an NER system. Retrieved mentions are normalized (e.g., converting text to lowercase or uppercase in English). Then, the possible candidate entities for the mentions in the text are retrieved from the Entity Catalog using the Name Dictionary.

In order to resolve the mention-to-entity mapping, a weighted graph of the mentions

and the candidate entities is constructed. Weights on edges between mentions and their

candidates are estimated from the entity keyphrases and the mention context similarity as

well as the candidate entity popularity (i.e. prior). Weights on edges between candidate

entities are assigned according to the entity-entity relatedness scores. The disambiguation

problem is solved by iteratively reducing the graph to a dense sub-graph until each mention

is connected to exactly one candidate.
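The idea of the graph reduction can be illustrated with a deliberately simplified greedy variant: repeatedly drop the mention-candidate edge whose candidate has the lowest weighted degree, as long as the mention keeps at least one candidate. This is a sketch and not the actual AIDA algorithm; the weights, entity identifiers and the function `disambiguate` are hypothetical.

```python
def disambiguate(mention_edges, relatedness):
    """Greedy reduction of a mention-entity graph.

    mention_edges: {mention: {entity: mention-entity weight}}
    relatedness:   {frozenset({entity_a, entity_b}): entity-entity weight}
    """
    remaining = {mention: dict(cands) for mention, cands in mention_edges.items()}

    def weighted_degree(entity):
        alive = {e for cands in remaining.values() for e in cands}
        mention_part = sum(cands.get(entity, 0.0) for cands in remaining.values())
        entity_part = sum(
            relatedness.get(frozenset((entity, other)), 0.0)
            for other in alive
            if other != entity
        )
        return mention_part + entity_part

    while True:
        # Only edges of mentions that still have more than one candidate may be removed.
        removable = [
            (mention, entity)
            for mention, cands in remaining.items()
            if len(cands) > 1
            for entity in cands
        ]
        if not removable:
            break
        mention, entity = min(removable, key=lambda pair: weighted_degree(pair[1]))
        del remaining[mention][entity]

    # Mentions without any candidate are mapped to null (None).
    return {mention: next(iter(cands), None) for mention, cands in remaining.items()}


mention_edges = {
    "Muller": {"<Thomas_Muller>": 0.4, "<Gerd_Muller>": 0.5},
    "German National Team": {"<Germany_football_team>": 0.6, "<Germany_basketball_team>": 0.3},
    "Unknown": {},
}
relatedness = {
    frozenset(("<Thomas_Muller>", "<Germany_football_team>")): 0.9,
    frozenset(("<Gerd_Muller>", "<Germany_football_team>")): 0.6,
}
result = disambiguate(mention_edges, relatedness)
print(result)
```

In this toy graph the candidate with the higher local weight for "Muller" is still removed, because the other candidate is more strongly related to the surviving team entity; this is the effect the coherence edges are meant to have.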

Chapter 3

AIDArabic+

3.1 AIDArabic Challenges

The original AIDArabic introduced NED for Arabic text. Nevertheless, it exhibited a

low recall compared to the English AIDA [67]. This problem has two main roots:

First, while AIDArabic utilized a comprehensive entity catalog, the generated name and contextual keyphrase dictionaries are still limited to the manually crafted information in the Arabic Wikipedia and the cross-language inter-Wikipedia links. In turn, the Arabic Wikipedia does not have enough coverage and quality. Not only does it miss a lot of entities, but the entities it does cover also have short, non-comprehensive pages. The Arabic Wikipedia, as the most used structured resource, is not capable of covering the fast-growing Arabic content.

Secondly, AIDArabic applies the same tokenization and normalization used for Latin-script input, without any Arabic-specific pre-processing. Improperly tokenized names cannot be matched against the name dictionary using the strict matching mechanism adopted in AIDArabic; consequently, no candidate entities are retrieved for such mentions. Similarly, the entity-mention similarity computation using keyphrases is negatively affected.
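The effect of strict matching can be shown with a small sketch. It assumes a naive rule-based normalizer (diacritic removal, alef unification and stripping of one single-letter proclitic); a real system would rely on a morphological analyzer instead, and all names below are hypothetical.

```python
import re

# Arabic short-vowel marks (U+064B to U+0652) and tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Map alef with hamza above/below and alef with madda to bare alef (U+0627).
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Single-letter proclitics wa, fa, bi, li, ka (illustrative, not exhaustive).
PROCLITICS = ("\u0648", "\u0641", "\u0628", "\u0644", "\u0643")


def normalize(token):
    """Remove diacritics and unify alef variants."""
    return DIACRITICS.sub("", token).translate(ALEF_VARIANTS)


def lookup(token, name_dictionary):
    """Look up a token, retrying once with a leading proclitic stripped."""
    form = normalize(token)
    if form in name_dictionary:
        return name_dictionary[form]
    # Naive assumption: a leading proclitic letter is a clitic only if the
    # remainder is itself a known name.
    if len(form) > 2 and form.startswith(PROCLITICS) and form[1:] in name_dictionary:
        return name_dictionary[form[1:]]
    return None


EGYPT = "\u0645\u0635\u0631"                          # the name "Egypt"
AND_EGYPT = "\u0648" + EGYPT                          # "and Egypt", written as one token
EGYPT_VOWELIZED = "\u0645\u0650\u0635\u0652\u0631"    # "Egypt" with diacritics

name_dictionary = {normalize(EGYPT): "<Egypt>"}

print(AND_EGYPT in name_dictionary)        # strict matching on the raw token
print(lookup(AND_EGYPT, name_dictionary))
print(lookup(EGYPT_VOWELIZED, name_dictionary))
```

The raw token with the attached conjunction is absent from the dictionary, so strict matching would retrieve no candidates, while both normalized lookups reach the same entity.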

3.2 AIDArabic+ in A Nutshell

AIDArabic+ aims at achieving a robust NED on Arabic text. Hence, we need to target

the weak points in both the data schema and the processing components to enhance the

overall recall and precision.


In this work, we introduce the EDRAK resource as an automatic augmentation of

the AIDArabic resource. We propose two approaches to overcome the limited data of the

Arabic Wikipedia. The first approach is to collect names from other resources on the web

containing possible names using semantic and syntactic equivalence. The second is to

incorporate translation and transliteration techniques to automatically generate Arabic

content based on evidence from the English and Arabic Wikipedia beyond the direct

cross-language inter-Wikipedia mapping. In order to guarantee building an accurate

data schema, different rules are enforced on the techniques according to the type of the

entity and the source of the data as discussed in Section 3.3.

In addition, we introduce the integration of a pre-processing component into the NED

pipeline to handle Arabic-specific features. Section 3.4 illustrates the set of procedures

proposed for proper Arabic normalization and tokenization to achieve better name

matching and context similarity estimation.

3.3 Enriching Data Components

As illustrated in Section 2.3, the main three data components necessary for our NED

system are an entity catalog, name dictionary and entity description (i.e. contextual

keyphrases). The fourth component of AIDArabic, the Entity-Entity relatedness model,

depends on the topology of the KB, but not the language used. Hence, it does not

require language specific enhancements. This section discusses the idea behind applying

enrichment techniques on each of the main three components to improve the Arabic

NED process. The design decisions and the implementation of the proposed enrichment

approaches (as closed modules) are discussed in Chapter 4.

Let us consider this hypothetical Arabic sentence

قد يترشح عائض القرني لجائزة غوته عن كتابه لا تحزن

written in English for clarity as:

Aaidh Al-Qarni might get nominated for Goethe Prize for his book La Tahzan

This sentence has three named entities: a writer (Aaidh Al-Qarni/عائض القرني), a prize (Goethe Prize/جائزة غوته) and a book (La Tahzan/لا تحزن). We will illustrate how each data component of the NED framework can be adapted so that it is capable of correctly disambiguating them.


Entity Catalog

In our example, the writer is well known enough to appear in both the English and the Arabic Wikipedias. The book, however, has only an Arabic Wikipedia page, even though it has been translated. Moreover, the prize is not known enough in the Arab world to appear in the Arabic Wikipedia¹.

In order to disambiguate such a sentence, we need to make sure that the entity repository contains all of those entities. Therefore, following the same approach as AIDArabic [67], we use YAGO3 compiled from both the English and the Arabic Wikipedias as our back-end KB. This allows capturing prominent English entities as well as local entities that are only known in the Arabic culture.

Name Dictionary

Generally, the entity-name dictionary is an influential component of any NED system. An incomplete dictionary dramatically harms the disambiguation quality. If the dictionary misses a name-entity entry, either no candidates are nominated for disambiguation or, even worse, a wrong entity is picked for one or more mentions. Since modern NED systems consider coherence measures when collectively resolving all mentions, one or more wrong mappings might propagate and mislead the mapping of other mentions onto wrong canonical entities.

We started with the same sources collected in the original AIDArabic: we harvest Wikipedia page titles, anchor texts, disambiguation page titles (under the predicate rdfs:label in YAGO3 [36]) and redirects. We also include the separated given names and family names of persons. Nevertheless, the contribution of these sources to the Arabic dictionary is limited, as shown by the statistics in Section 5.1.

In order to correctly disambiguate the names in our example, we need a dictionary that is aware of the Arabic names of all three entities. Since the writer (Aaidh Al-Qarni / عائض القرني) and the book (La Tahzan / لا تحزن) exist in the Arabic Wikipedia, our name dictionary has at least one Arabic name for each of them (their page titles). On the other hand, Goethe Prize exists only in the English Wikipedia, without any potential Arabic name. Therefore, the correct entity will not be nominated as a candidate for its Arabic mention.

¹ As of 01 July 2015

Figure 3.1: Building Name Dictionary in AIDArabic+ (Arabic titles, anchors, disambiguation pages and redirects from YAGO3 (EN, AR) feed the entity-name dictionary directly, e.g. <Barack_Obama> / باراك أوباما, <Germany> / ألمانيا, <Egypt> / مصر; English titles, anchors, redirects and disambiguation pages contribute through external dictionaries, translation and transliteration)

In this work, we propose to go beyond Wikipedia content via automatic data generation. However, it is challenging to automatically build an entity-name dictionary that captures name variations for all entities in the entity catalog while preserving data precision. For example, the Arabic name of “Goethe Prize” is obtained by (1) transliterating “Goethe” into the Arabic script, (2) translating “Prize” into Arabic, and finally (3) reordering the tokens to follow Arabic writing rules. Therefore, we introduce three approaches to enrich the entity-name dictionary of AIDArabic+:

1. External Name Dictionaries: We harness existing English-Arabic name dictionaries via semantic and syntactic equivalence. For example, if two strings from one or more dictionaries link to the same canonical entity, we consider them potential name aliases. In Section 4.1, we discuss the harnessed resources as well as the procedure designed for the integration process.

2. Entity-Name Translation: While external dictionaries (e.g. gazetteers) and hyperlinks extracted from the web provide Arabic names for some English entities, many entities still lack potential names in the Arabic world. Arabic names should be generated rather than only extracted. Moreover, general-purpose translation systems exhibit problems translating/transliterating named entities [4, 21, 7] even if they appear within a context. Therefore, we introduce Entity-Name Translation to populate our dictionary with accurate, automatically generated Arabic names, as discussed in detail in Section 4.2.

3. Persons Names Transliteration: A fair number of entities obtain one or more Arabic names from external resources and/or translation. However, English names have different variants when written in the Arabic script. In addition, not all name variants can be generated by translation. Therefore, we incorporate a transliteration module geared for PERSON names. While transliteration is applicable to many NON-PERSON entities, applying it to such entities would create many inaccurate entries for names that should be either fully or partially translated. Thus, we decided to exclude them from the transliteration process.

Figure 3.2: Building Keyphrases Dictionary in AIDArabic+ (Arabic anchor texts, inlink titles and categories from YAGO3 (EN, AR) feed the entity-keyphrases dictionary directly, e.g. <Barack_Obama> / الولايات المتحدة, <Germany> / أنجيلا ميركل, <Egypt> / القاهرة; English inlinks and categories contribute through the interwiki dictionary and translation)

Technically, we applied these approaches to the English names from all sources, namely Given Name, Family Name, Redirects and rdfs:label (which includes titles, anchor texts and disambiguation pages), as shown in Figure 3.1.

Entity Contextual Keyphrases

Contextual keyphrases are used as a set of descriptions for the entity. Entity keyphrases

are matched against the input text to compute a similarity score between the expected

context of the candidate entity and the context in which the mention exists.

As shown in Figure 3.2, we use the standard approaches of the original AIDArabic to extract entity contexts from Wikipedia (Wikipedia anchor texts, Wikipedia categories, and Wikipedia inlink titles). As in AIDArabic, the English inlink titles and categories are looked up in the English-Arabic inter-Wikipedia links dictionary. In AIDArabic, however, English names without entries in the dictionary were discarded; consequently, many English entities were not supported by any context description. Entities that lack an adequate description cannot be promoted as the winning entity in the mapping process.


In AIDArabic+, we overcome the low coverage and quality of the Arabic Wikipedia by applying the Entity-Name Translation and the Persons Names Transliteration techniques to the Wikipedia inlink titles that remain unresolved after the inter-Wikipedia dictionary look-up. Furthermore, we train our category translation module on a parallel corpus extracted from the English-Arabic inter-Wikipedia links of categories. Finally, while it seems possible to translate anchor texts, we do not do so, because anchor texts are noisy and sometimes long, which would make their translations inaccurate.

3.4 Language Specific Processing

Name matching and context similarity computation are necessary for a successful disambiguation process. Achieving robust matching requires clean input. However, due to the specific morphological characteristics of Arabic text, standard English text normalization and tokenization techniques (e.g. converting to lowercase) are not directly suitable.

For example, Arabic text has several specific features such as:

• The definite article (AL) “ال” is attached to the beginning of the word (e.g. the library: المكتبة). Nevertheless, not every “ال” at the beginning of a word is a definite article.

• Some prepositions such as ك, ب, and ل and connectors (ف, و) appear attached to the beginning of the word (e.g. ب at the beginning of “in France”: “بفرنسا”).

• Several pronouns are attached to the end of the word. For example, the last two characters ها at the end of حديقتها (meaning “its park”).

• Sometimes, Arabic text is written with diacritics to express vowels and facilitate pronunciation. Arabic diacritics appear as decorations above and under the base character (e.g. بَ, بِ, بُ, بْ, and بّ). They differ according to the meaning of the word and its position in the sentence (i.e. subject or object).

In AIDArabic+, we incorporate an Arabic-specific pre-processing component for both the input text and the building of the data schema.

There are two state-of-the-art systems that perform morphology-based analysis: MADAMIRA [49] and the Stanford Arabic Word Segmenter [42]. The Stanford Word Segmenter provides a handy Java API that is easy to integrate. Hence, we use it for pre-processing. Our pre-processing component performs two main steps:


Tokenization

Besides the normal word tokenization based on punctuation and spaces, we perform word segmentation to split clitics and attached connectors. Split suffixes and prefixes are marked with the special character ’+’, indicating that they were originally connected to the next or previous word. This allows reconstructing the original input text. For example, “بفرنسا” (pronounced “BFranca”) is segmented into “ب+ فرنسا” and “حديقتها” is segmented as “حديقت +ها”. It is worth noting that the Stanford Word Segmenter does not split the definite article.
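The claim that the ’+’ markers allow the original text to be reconstructed can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `rejoin_segments` and the exact marker convention (a trailing ’+’ for a detached prefix, a leading ’+’ for a detached suffix) are assumptions made for illustration.

```python
def rejoin_segments(tokens):
    """Rebuild surface words from segmented tokens.

    Assumed marker convention (hypothetical): 'x+' is a detached prefix that
    belongs to the next token, '+x' is a detached suffix that belongs to the
    previous word.
    """
    words = []
    pending_prefix = ""
    for tok in tokens:
        if len(tok) > 1 and tok.endswith("+"):
            pending_prefix += tok[:-1]          # hold the prefix for the next word
        elif len(tok) > 1 and tok.startswith("+") and words:
            words[-1] += tok[1:]                # glue the suffix to the previous word
        else:
            words.append(pending_prefix + tok)
            pending_prefix = ""
    if pending_prefix:                          # dangling prefix at the end of input
        words.append(pending_prefix)
    return " ".join(words)


# Prefix example ("in France") and suffix example ("its park"), written with
# Unicode escapes so that the character order is unambiguous.
print(rejoin_segments(["\u0628+", "\u0641\u0631\u0646\u0633\u0627"]))
print(rejoin_segments(["\u062d\u062f\u064a\u0642\u062a", "+\u0647\u0627"]))
```

The sketch only undoes the splitting itself; any letter-level normalization applied by the segmenter would not be reverted by it.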

Splitting prefixes and suffixes, i.e. clitics, should increase name matching accuracy and hence enhance the coverage and quality of candidate retrieval. Furthermore, it allows better keyphrase matching, which is essential for achieving accurate disambiguation results.

Normalization

Unlike English, Arabic text normalization includes several steps. The applied normalization should be customized according to the application. The most common normalizations are:

• Removing Diacritics: Although removing diacritics increases the ambiguity of some words, it is important for obtaining a uniform representation of the same word. For example, the sentence “وُلِدَ أَلْبِرْت أَيْنِشْتَايْن فِي مَدِينَةِ أُولْم” after removing diacritics will be “ولد ألبرت أينشتاين في مدينة أولم”.

• Normalizing Hamza: It includes replacing the different forms of the letter Hamza (such as أ, إ, and آ) with the normalized form ا. This helps avoid common typing mistakes and confusing different states of the same word.

• Normalizing Ya: In order to avoid a common writing mistake in informal text, the character Ya ي (with dots) is replaced with ى (without dots).

• Normalizing Ta-marbutah: Similarly, to avoid different writing forms, Ta-marbutah ـة and ة (with dots) is replaced with ه or ـه (without dots).

• Removing Tatweel: Some informal text uses a series of ’ـ’ (U+0640) to extend the word. All Tatweel characters should be removed to get the pure word.

• Normalizing Punctuation: In some cases, it is useful to replace the Arabic punctuation with equivalent ASCII symbols.


In AIDArabic+, we apply Diacritics Removal, Hamza Normalization, Ya Normalization, Ta-marbutah Normalization and Tatweel Removal to achieve a decent input quality that yields higher matching recall without sacrificing precision. Recalling our example in Section 3.3, the input sentence:

قَدْ يَتَرَشَّحُ عَائِضُ القَرْنِيُّ لِجَائِزَةِ غُوتَه عَنْ كِتَابِهِ لَا تَحْزَنْ

after normalization and tokenization will be

قد يترشح عائض القرنى ل+ جائزه غوته عن كتاب +ه لا تحزن

As in the example, names and contextual keyphrases in the normalized sentence become clearer for matching against the AIDArabic+ dictionaries. The preposition ل+ attached to the beginning of the word لجائزة is detached and can be treated as a stop word. Similarly, the pronoun +ه has been detached from the end of كتابه.
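The five normalization steps applied in AIDArabic+ can be sketched in a few lines. This is an illustrative approximation, not the thesis code: the function name `normalize_arabic`, the Unicode range used for diacritics, and the restriction of Hamza normalization to the three Alef variants are assumptions.

```python
import re

# Short vowels, tanween, shadda, sukun (U+064B..U+0652) and superscript Alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"
# Assumed Hamza forms: Alef with Hamza above/below and Alef with Madda -> bare Alef.
HAMZA_FORMS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_arabic(text):
    """Apply the five normalizations named in the text, in a fixed order."""
    text = DIACRITICS.sub("", text)            # Diacritics Removal
    text = text.replace(TATWEEL, "")           # Tatweel Removal
    text = text.translate(HAMZA_FORMS)         # Hamza Normalization
    text = text.replace("\u064A", "\u0649")    # Ya (dotted) -> dotless form
    text = text.replace("\u0629", "\u0647")    # Ta-marbutah -> Ha
    return text


# "wulida" (was born), fully diacritized, reduces to its three base letters.
print(normalize_arabic("\u0648\u064F\u0644\u0650\u062F\u064E"))
```

Applying the same function to both the input text and the dictionary entries is what makes the two sides comparable; a normalization applied to only one side would reduce rather than increase matching recall.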

Chapter 4

EDRAK

Entity-Centric Resource For Arabic Knowledge

EDRAK is an entity-centric resource developed as the back-end schema for AIDArabic+ NED, as shown in Section 3.3. This chapter focuses on the automatic generation techniques beyond Wikipedia used in EDRAK, together with the decisions taken within each technique. Section 4.1 describes the integration of external dictionaries. Then, named-entity translation methods are discussed in Section 4.2, person name transliteration in Section 4.3, and Arabic name splitting in Section 4.4. Finally, technical details about EDRAK as a general-purpose standalone resource are explained in Section 4.5.

4.1 External Name Dictionaries

Wikipedia, as the largest comprehensive online encyclopedia, is the most used corpus for creating knowledge bases such as YAGO [26], DBpedia [6] and Freebase [10]. Due to the limited size of the Arabic Wikipedia, building a strong semantic resource becomes a challenge. One approach to go beyond the limits of Wikipedia is to capture possible Arabic names mentioned in other resources, such as websites and online news, and then attach them to the corresponding entities. These resources are usually harvested through automatic or semi-automatic processes. Among the generated resources, some are entity-aware, while others are purely textual name dictionaries.


4.1.1 Entity-Aware Resources

An entity-aware resource is a resource in which canonical entities are registered along with their names and, in some cases, their context descriptions. Google-Word-To-Concept (GW2C) [54] is a multilingual entity-aware resource that harnesses Wikipedia concepts, including Named Entities (NE), and their possible names from both Wikipedia and non-Wikipedia Web pages.

Resource Description: Concepts’ strings (i.e. names) are harvested from:

• English Wikipedia pages titles.

• English anchor texts from inter-Wikipedia links into the concept.

• Anchor texts from non-Wikipedia pages to Wikipedia concepts with English page.

Name-to-concept mappings are stored together with a strength score, measured as the conditional probability P(Concept|name), i.e. the fraction of links carrying this name that point to the Wikipedia concept. Nevertheless, names in GW2C are stored without any kind of post-processing or cleaning.

Name | Score | Concept | Flags
مقالة | 0.0013 | Chuck (engineering) | W08 W09 WDB Wx:1/500
كبرامج التجسس | 1 | Spyware | W08 W09 WDB Wx:8/8
ar:المدرسة العسكرية | 0.5 | Ecole Militaire | W08 W09 WDB Wx:3/3
اضغط هنا | 0.0005 | World War I | KB W08 W09 WDB Wx:4/5357
معاهدة روما | 1 | Treaties of Rome | KB W:4/4 W08 W09 WDB Wx:3/3
مرجينيما سبيولوي | 1 | Mărginimea Sibiului | W09 W08 WDB Wx:1/1

Table 4.1: Sample of Google-Word-to-Concept raw data

GW2C contains 297M multilingual name-to-concept mappings. As shown in Table 4.1, the first and third columns contain the retrieved names and their Wikipedia concept URLs, respectively. The second column contains the conditional probability computed from the witness counts presented in the flags column (fourth column).

Integrating with AIDArabic+ Resource: GW2C is created automatically without

any manual verification or post-processing. Therefore, it contains noise that should be

filtered out. In order to include GW2C names in our names dictionary, we perform the

following steps:


1. Detecting Arabic names using the off-the-shelf language detection tool developed by Shuyo (2010) [53] to filter out non-Arabic records. This leaves only 736K Arabic entries out of 297M.

2. Filtering out ambiguous names based on the provided conditional probability scores. Excluding records with low scores filters out anchor texts such as “اقرأ المزيد” (Read more), “صفحة الويكيبيديا” (Wikipedia page) or “المزيد على ويكيبيديا” (more on Wikipedia). We use 0.01 as a lower threshold on the provided scores.

3. Name-level post-processing to remove URLs, punctuation, and common prefixes and suffixes.

4. Mapping names to AIDArabic+ entities using the URLs of their Wikipedia pages.
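The first three filtering steps can be sketched as follows. This is a rough illustration under stated assumptions, not the pipeline used in the thesis: records are assumed to be already parsed into `(name, score, concept)` tuples, the Shuyo language detector is replaced by a crude Arabic-letter ratio, and the only prefix handled is a two-letter language tag such as `ar:`. All function names are hypothetical.

```python
import re

ARABIC_LETTER = re.compile("[\u0621-\u064A]")
MIN_SCORE = 0.01  # lower threshold on the GW2C score, as stated in the text


def is_mostly_arabic(name, min_ratio=0.5):
    """Crude stand-in for a language detector: share of Arabic letters."""
    letters = [ch for ch in name if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters) >= min_ratio


def clean_name(name):
    """Name-level post-processing: language prefix, URLs, punctuation."""
    name = re.sub(r"^[a-z]{2}:", "", name)      # prefix such as "ar:"
    name = re.sub(r"https?://\S+", " ", name)   # URLs
    name = re.sub(r"[^\w\s]", " ", name)        # punctuation
    return " ".join(name.split())


def filter_gw2c(rows):
    """rows: iterable of (name, score, concept); returns (name, concept) pairs."""
    kept = []
    for name, score, concept in rows:
        if not is_mostly_arabic(name):
            continue                            # step 1: keep Arabic names only
        if score < MIN_SCORE:
            continue                            # step 2: drop ambiguous names
        cleaned = clean_name(name)              # step 3: post-processing
        if cleaned:
            kept.append((cleaned, concept))
    return kept
```

With the rows of Table 4.1 as input, the generic anchor texts with scores of 0.0013 and 0.0005 would be dropped by the threshold, while the entries with scores of 0.5 and 1 would be kept.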

4.1.2 Lexical Name Dictionaries

Lexical name dictionaries are another type of resource, which contains only name variants in different languages without any notion of canonical entities. Since these dictionaries do not consider semantic differences, the name variants may map to different entities. Therefore, we use them as look-up dictionaries to translate English entity names to Arabic. We utilize two dictionaries that have undergone manual verification.

ID | Type | Lang | Name
62 | P | u | Javier+Solana
62 | P | u | خافير+سولانا
62 | P | u | خاوير+سولانا
62 | P | u | خابير+سولانا
62 | P | ar | وخافير+سولانا
62 | P | sl | Javierjem+Solano

Table 4.2: Sample of JRC-Names raw data

JRC-Names [55] is a multilingual resource of organisation and person names extracted from news articles and Wikipedia. Its creators used manually compiled lists of language-specific rules and triggers, such as person titles, ethnic groups or modifiers, to extract names of persons. In addition, a list of frequent words (e.g. club, organization, bank) was used to extract organization names.

The similarity between the names extracted from news and those from Wikipedia page titles was computed to recognize name variants. Names in non-Roman scripts were romanized, so that monolingual edit distance could be used as a unified similarity function. Names below the specified threshold were manually matched to the corresponding name cluster. Finally, names that appeared in five different news clusters, were manually validated, or were found in Wikipedia were included in the published dictionary.

The dictionary has 617k multilingual name variants, of which only 17k are Arabic. As shown in Table 4.2, variants of the same name share a unique identifier. In addition, types and partial language tagging are provided with the names.

English | Arabic | Type | Annotations
National Investors Bank | بنك الاستثمار القومي | ORGANIZATION | 0 0 0 0.3 0.38 0.5
BALTIC COUNTRIES | دول البلطيق | ORGANIZATION | 0 0 0.2 0.076
Yoli Adlestein | يولي ادلشتاين | PERSON | 1 1 0.75 0.875
Nathan Byron | ناثان بيرون | PERSON | 1 1 0.71 0.66

Table 4.3: Sample of CMUQ-Arabic-NET raw data

CMUQ Arabic-NET is an English-Arabic name dictionary compiled from Wikipedia and parallel English-Arabic news corpora [7]. Its authors ran an off-the-shelf NER system on the English side of the corpora and projected the NER results onto the Arabic side according to the word alignment information. Additionally, they included the titles of Wikipedia cross-language links in their dictionary. The dictionary was manually annotated to fit its targeted use.

The full dictionary has 62k English-Arabic name pairs. Table 4.3 shows a sample of the dictionary. The first two columns are the English-Arabic pairs. The third column contains the type of the entity name (i.e. person or organisation). The remaining columns are annotations used for the authors' target application.

Including Dictionaries in EDRAK is performed as follows:

1. Pre-processing and language detection are applied on JRC-Names.

2. English names of the entities are normalized and matched strictly against these

dictionaries to get the accurate Arabic names.

3. Only new name variants are added to the resource.
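The strict look-up in steps 2 and 3 can be sketched as below. The data layout (plain dictionaries keyed by entity id and by English name), the normalization to lowercase with collapsed whitespace, and the function names are assumptions for illustration only.

```python
def normalize_english(name):
    """Lowercase, treat '_' and '+' as spaces, collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").replace("+", " ").split())


def add_dictionary_names(entity_names, existing_arabic, lexical_dict):
    """Return {entity: set of new Arabic names} found by exact look-up.

    entity_names:    {entity_id: [English names]}
    existing_arabic: {entity_id: set of Arabic names already in the resource}
    lexical_dict:    {English name: [Arabic variants]}
    """
    lookup = {}
    for english, variants in lexical_dict.items():
        lookup.setdefault(normalize_english(english), set()).update(variants)

    added = {}
    for entity, names in entity_names.items():
        known = existing_arabic.get(entity, set())
        for english in names:
            # Strict matching: the normalized strings must be identical.
            for arabic in lookup.get(normalize_english(english), ()):
                if arabic not in known:          # only new variants are added
                    added.setdefault(entity, set()).add(arabic)
    return added
```

Exact matching trades coverage for precision: a name such as “Javier Solana” picks up its dictionary variants, while partial matches on a single token are ignored.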

4.2 Named-Entities Translation

Up to this point, English entities that do not have any possible names in the dictionary and/or context keyphrases still form a big part of our catalog. In addition, not all entity names have already appeared in Arabic text corpora; some, however, are prominent enough to appear in the near future. Accordingly, in this section we discuss using machine translation on English names and keyphrases.

Figure 4.1: General Statistical Machine Translation Pipeline (a translation model is trained on a bilingual corpus and a language model on a target-language corpus; the decoder uses both statistical models to turn source-language input into target-language output)

4.2.1 General Statistical Machine Translation

Statistical Machine Translation (SMT) is the process of generating possible translations of a text based on statistical models trained on bilingual parallel corpora [31]. Several translation systems based on SMT have recently been developed, such as Moses [32], Cdec [15], Phrasal [18], and Thot [48]. The main advantage distinguishing SMT from other translation approaches, such as rule-based and example-based translation, is that it is generic enough to be used with any language pair.

Implementations of SMT mostly follow similar steps to train the models required for the decoding process (Figure 4.1). Word-alignment information is extracted from the parallel corpora using one of the automatic statistical alignment tools such as GIZA++ [47] or Fast Aligner [14]. The generated word-alignment information is used to produce the translation tables/grammar. In addition, a statistical language model is generated from the target-language part of the parallel corpora. In some cases, other monolingual corpora are used to generate richer models. Later, both the translation table and the language model are used in the decoding phase while translating the input text. Usually, several translations are generated for each input sentence. The resulting translations are ranked based on the accumulated probability derived from the language model and the translation table.


Statistical machine translation is a viable option for translating English names into Arabic. Off-the-shelf SMT systems such as the Google¹ or Microsoft Bing² translation services are trained on large parallel corpora. However, they are not geared for translating NEs. While they achieve suitable translation quality on natural language input text, their NE translation quality suffers because:

• Most existing SMT systems do not handle named entities explicitly; only the language model is responsible for generating the weights of name translations [22]. They do not utilize any part-of-speech tagging or NER information.

• Entity names are domain specific. Hence, they can easily be missing from the parallel training corpus, which cannot cover all domains. SMT systems will then split the name and translate each token separately, resulting in wrong translations.

• Entity names tend to appear less often than other nouns and verbs in the parallel training corpora. Consequently, their translations have lower weights in the language models than normal words [33, 5, 7]. For example, “North” as a geographical direction and “Green” as a color have higher weights than their name counterparts.

• SMT systems do not take the NE type into consideration, yet different entity types should not be translated in the same way. For example, in “Nolan North” (PERSON) and “North Atlantic Treaty Organization” (ORGANISATION), both occurrences of “North” are part of proper entity names, yet the first should be transliterated while the latter should be translated.

4.2.2 Named-Entities SMT

Several research attempts focus on enhancing the quality of NE translation. Huang et al. (2004) [28] introduced the use of phonetic and semantic similarity in order to improve NE translation. Lee (2014) [34] proposed including part-of-speech tagging information in the translation process in order to enhance the translation of person names in text. Furthermore, Azab et al. (2013) [7] developed a classification technique to decide whether to translate or transliterate named entities appearing in a full text from English to Arabic. They used a combination of token-based, semantic and contextual features, including the coarse-grained type tags (PERSON, ORGANIZATION), in the classification. Nevertheless, since our problem is focused solely on NEs, we propose creating a translation module customized for NEs.

¹ https://translate.google.com
² https://www.bing.com/translator/


            PER      NON-PER   ALL
Azab        28493    34116     62609
Wikipedia   33962    79699     128790
Both        62455    113815    191399

Table 4.4: Entity Names SMT Training Data Size

Named-Entities Training Corpus

A parallel training corpus is a key factor in achieving the desired translation quality. Our proposed approach is to use training data designed purely for translating NEs. Therefore, we compile our training corpus from the NEs in the English-Arabic cross-language inter-Wikipedia links. The intuition is that if our knowledge base knows the names of “William Hook Morley” (وليم هوك مورلي) and “Edward Said” (إدوارد سعيد) in Arabic script (which is the case), this should be sufficient for our SMT system to learn the Arabic script of “Edward Morley”. By training on names only, we guarantee a suitable language model that gives higher weight to name translations.

Type-Aware Translation

Similar to the recent technique for translating named entities within natural language text proposed by Azab et al. (2013) [7], we propose utilizing type information in the name translation process. We use the well-structured type information provided in YAGO3 [26]. The training data is split into two sets, PERSON and NON-PERSON. We do not split the data on a more fine-grained type basis (e.g. ORGANIZATION vs LOCATION) in order to maintain an adequate amount of training data [57] for each type. Table 4.4 shows the size of the training data for each system.

We train three SMT systems: PERSON, NON-PERSON, and ALL. The third system is used as a fallback in case an entity of type T cannot be translated using the corresponding system. The main advantage of the type-aware architecture is that it allows person names to be translated differently. In addition, a non-person entity like “Goethe Prize” can be translated using the fallback system, which learns to translate “Goethe” from the PERSON part of the data and “Prize” from the NON-PERSON part.
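The type-aware dispatch with a fallback system can be sketched as follows. The three translators are toy look-up tables standing in for trained SMT systems, and the names `translate_type_aware` and `make_system` are hypothetical.

```python
def make_system(table):
    """Toy translator: returns the stored candidates, or [] when it fails."""
    return lambda name: list(table.get(name, []))


def translate_type_aware(name, entity_type, systems):
    """Use the system matching the entity type, then fall back to ALL.

    systems: {"PERSON": fn, "NON-PERSON": fn, "ALL": fn}; each fn maps an
    English name to a list of candidate translations (empty on failure).
    """
    primary = "PERSON" if entity_type == "PERSON" else "NON-PERSON"
    candidates = systems[primary](name)
    if not candidates:                      # type-specific system failed
        candidates = systems["ALL"](name)
    return candidates


systems = {
    "PERSON": make_system({"Edward Said": ["إدوارد سعيد"]}),
    "NON-PERSON": make_system({}),          # has never seen "Goethe"
    "ALL": make_system({"Goethe Prize": ["جائزة غوته"]}),
}
print(translate_type_aware("Goethe Prize", "ORGANIZATION", systems))
```

In this toy setting the NON-PERSON system returns nothing for “Goethe Prize”, so the result comes from the ALL system, mirroring the fallback behaviour described above.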

4.2.3 Named-Entities Light-SMT

Since our target is to translate entity names only, language models are not highly beneficial in our case. Moreover, language models may reduce the quality of translation if the name is not well represented in the target-language training data. Therefore, we propose a light-weight SMT (Light-SMT). The intuition is that if we collect all Arabic names of the entities that have the token to translate in one of their English names, the corresponding Arabic token should have the highest and most distinguished occurrence count among the Arabic tokens. Therefore, popularity voting among the Arabic tokens of these entities yields a proper translation. As a simplified example, consider a set of parallel names whose English side always contains “Muller”, such as (“Thomas Muller” / “توماس مولر”), (“Gerd Muller” / “غيرد مولر”) etc.: the Arabic translation of “Muller”, “مولر”, will be the most frequent token. The Light-SMT module is built as follows:

1. The English side of the parallel English-Arabic training data is tokenized using the Stanford tokenizer and converted to lowercase.

2. The Arabic side is normalized and tokenized as discussed in Section 3.4. Since a single English name can be written in several Arabic forms that differ only in vowels, diacritics and/or Hamza, this normalization (stemming) is important for counting all such representations as a single candidate.

3. Ambiguous tokens such as “(Disambiguation)” in English and its corresponding Arabic word “توضيح” are considered noise and removed. Furthermore, tokens with no mapping in Arabic, such as “a”, “an” and “the”, as well as punctuation are eliminated.

4. In order to follow the type-aware approach, the training data is split into PERSON and NON-PERSON. We create three indexes: one for each of the two types and a third for type ALL (which includes the whole parallel data combined). Each index is composed of:

(a) An inverted index from English tokens to their entities.

(b) An index from the entities to all tokens of their Arabic names together with their normalized versions.

In the decoding (i.e. translation) phase, the entity name passes through the following steps to generate all possible translations of the name:

1. The entity type is resolved via YAGO types.

2. The English name is tokenized and normalized.

3. The translation of each token is generated according to the entity type (Figure 4.2):

(a) The list of entities with this English token is retrieved from the inverted index.

(b) The Arabic tokens of all entities in the list are retrieved from the Arabic index.

(c) A popularity vote is performed on the distinct Arabic stems. The Arabic stem with the highest count is considered the proper translation of the English token. In order to achieve suitable translation accuracy, we impose two conditions: (i) the number of participating entities should be greater than five, and (ii) the winning stem should account for at least a fraction of 0.3 of the participating entities.

(d) In order to avoid rare or incorrect Arabic representations of English names written in Arabic script, a second popularity vote is performed among the original words contributing to the winning stem. The top two representations are chosen as possible translations of the input token.

(e) If no translation is found for the token using the data of its type, these steps are repeated with the ALL model.

4. Successfully translated tokens are then joined together to generate all possible translations.

Figure 4.2: Single token translation using popularity voting (the English token, e.g. “edward”, is looked up in the source-tokens-to-entities inverted index; the Arabic tokens of the retrieved entities, e.g. <Edward_Said> / إدوارد سعيد, are fetched from the entities-to-target-tokens index; a vote between normalized tokens must exceed the threshold T, otherwise the translation fails; a second vote between the different representations yields the top-k forms, e.g. إدوارد and إدورد)
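Steps 3(a) to 3(d) can be sketched as below. This is a simplified illustration, not the thesis implementation: each entity is assumed to cast one vote per distinct stem, the fallback to the ALL index (step 3(e)) and the joining of tokens (step 4) are left out, and the function and constant names are hypothetical.

```python
from collections import Counter, defaultdict

MIN_ENTITIES = 5   # more than five participating entities are required
MIN_SHARE = 0.3    # winning stem must cover at least this share of them
TOP_FORMS = 2      # number of surface forms kept for the winning stem


def build_indexes(pairs, normalize):
    """pairs: iterable of (entity_id, english_name, arabic_name)."""
    en_to_entities = defaultdict(set)
    entity_to_ar = defaultdict(list)
    for entity, english, arabic in pairs:
        for tok in english.lower().split():
            en_to_entities[tok].add(entity)
        for tok in arabic.split():
            entity_to_ar[entity].append((tok, normalize(tok)))
    return en_to_entities, entity_to_ar


def translate_token(token, en_to_entities, entity_to_ar):
    entities = en_to_entities.get(token.lower(), set())      # step (a)
    if len(entities) <= MIN_ENTITIES:
        return []
    stem_votes = Counter()
    forms = defaultdict(Counter)
    for entity in entities:                                  # step (b)
        voted = set()
        for surface, stem in entity_to_ar[entity]:
            forms[stem][surface] += 1
            if stem not in voted:                            # one vote per entity
                stem_votes[stem] += 1
                voted.add(stem)
    if not stem_votes:
        return []
    stem, votes = stem_votes.most_common(1)[0]               # step (c)
    if votes / len(entities) < MIN_SHARE:
        return []
    return [f for f, _ in forms[stem].most_common(TOP_FORMS)]  # step (d)


MULLER = "\u0645\u0648\u0644\u0631"
given = ["Thomas", "Gerd", "Hans", "Peter", "Karl", "Max"]
pairs = [(f"E{i}", f"{g} Muller", f"G{i} {MULLER}") for i, g in enumerate(given)]
en_idx, ar_idx = build_indexes(pairs, normalize=lambda t: t)
print(translate_token("Muller", en_idx, ar_idx))
```

In the toy data the Arabic given names are placeholders (`G0`, `G1`, ...), so only the shared surname token collects enough votes; a token seen in a single entity, such as “thomas”, is rejected by the first condition.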

Up to this point, Light-SMT is capable of generating translations of acceptable quality for person names, as they usually follow the same token order across languages. In contrast, multi-token NON-PERSON names suffer from an ordering problem. For example, translating “Max Planck Institute” to Arabic using Light-SMT yields “ماكس بلانك معهد”. Each token is correctly translated, but the word “معهد”, the Arabic translation of “Institute”, should appear at the beginning, followed by the rest of the name, as in “معهد ماكس بلانك”.

Since NEs are short and follow common patterns, we implemented a rule-based reordering approach similar to Badr et al. (2009) [8]. This approach is effective for Arabic NEs, as they are usually written in the format

( < Category > < list of genitives or proper noun > )

For English NEs, the category (e.g. “University”) appears either at the beginning (e.g. “University of Saarland”) or at the end (e.g. “Saarland University”). The first case is similar to the Arabic order, and hence no changes are required. The latter, however, should be flipped. We learned the list of categories that require reordering from all English non-person names by taking the thousand most common tokens appearing at the end of a name. Before translation, the English name is reordered such that the category name is put at the beginning to follow the Arabic naming order. For example, “Goethe Prize” becomes “Prize Goethe”.
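The reordering rule can be sketched as follows, assuming that a category is simply any token among the most frequent final tokens of multi-token non-person names. The function names and the handling of case are illustrative choices, not taken from the thesis.

```python
from collections import Counter


def learn_categories(nonperson_names, top_n=1000):
    """Most frequent final tokens of multi-token English non-person names."""
    counts = Counter(
        name.split()[-1].lower()
        for name in nonperson_names
        if len(name.split()) > 1
    )
    return {tok for tok, _ in counts.most_common(top_n)}


def reorder(name, categories):
    """Move a trailing category token to the front; leave other names as is."""
    tokens = name.split()
    if len(tokens) > 1 and tokens[-1].lower() in categories:
        tokens = [tokens[-1]] + tokens[:-1]
    return " ".join(tokens)


categories = learn_categories(["Goethe Prize", "Nobel Prize", "Saarland University"])
print(reorder("Goethe Prize", categories))           # category moved to the front
print(reorder("University of Saarland", categories)) # already in Arabic order
```

Names whose category already comes first pass through unchanged, because their last token (here “Saarland”) is not in the learned category list.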

The main advantage offered by Light-SMT is that it allows controlling the translation quality at the token level. In addition, it leverages the correct translation by combining all similar representations. On the other hand, Light-SMT does not capture dependencies between words. It also assumes a one-to-one word mapping, which does not always hold, especially for non-person names. We tried to enhance it by applying it to n-grams and then choosing the translation of the longest n-gram in the top-k as the correct one. Nevertheless, result samples do not show considerable improvement. In Section 5.2, we examine the effect of the Light-SMT module on the full disambiguation pipeline.

4.2.4 Named-Entities Full-SMT

Due to the limitations of Light-SMT, we decided to train an off-the-shelf
SMT framework. We adopted Cdec [15], a full-fledged SMT framework that
includes a decoder, an aligner, and a learning framework, and used it in combination with the
type-aware paradigm. We used the same training data, in addition to the CMUQ Arabic-NET
dictionary provided by Azab et al. (2013) [7]. We trained the three systems (PERSONS,
NON-PERSONS and ALL) as follows:

1. Parallel data is split into development data (5%) and training data.

2. English part is tokenized normally and the Arabic side is normalized and tokenized

as described in Section 3.4.


[Figure 4.3: Type-Aware Entity-Name Translation using full SMT system. Flowchart: English entity name, resolve type (is PERSON?), PERSON translation or NONPERSON translation, OOV filtering, on failure FALLBACK translation with OOV filtering, top-3 Arabic entity names.]

3. Symmetric word-alignment information is extracted from the English-Arabic
training pairs using Fast Aligner [14].

4. The word-alignment information is used to generate the translation grammar using the
Cdec Grammar Extractor. Unique grammars are kept in synchronous context-free
grammar (SCFG) format.

5. The normalized Arabic part of the corpus is used to train the language model using the
KenLM Language Model Toolkit [23] provided in the Cdec framework.

6. Translation parameters are tuned using the MIRA tuning tool [11] adopted by Cdec.

In the translation phase, the translation grammar, the tuned weights, and the language
model are loaded into the Cdec decoder. As we follow the Type-Aware framework (Figure 4.3),
we start by resolving the type of the entity to be translated. Then, the English name is
translated according to the entity type. We configure the Cdec decoder to retrieve the
top-5 translations. Translations with one or more out-of-vocabulary (OOV) words are excluded.
Finally, only the top-3 remaining translations are taken into consideration. If no
full translation is obtained, the fallback translation component is used to translate the
name with the same protocol.
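The selection protocol described above can be sketched as below. The decoders are passed in as plain callables that return ranked candidate lists; they are hypothetical stand-ins for the trained Cdec systems, and the OOV test (any remaining Latin letter) is an assumption made for illustration.

```python
def has_oov(translation):
    """Assumption: an untranslated (OOV) token still contains Latin letters."""
    return any(ch.isascii() and ch.isalpha() for ch in translation)


def select_translations(candidates, n_best=5, keep=3):
    """Take the n-best list, drop candidates with OOV tokens, keep the top ones."""
    return [c for c in candidates[:n_best] if not has_oov(c)][:keep]


def translate_name(name, is_person, person_smt, nonperson_smt, fallback_smt):
    """Type-aware translation with a fallback system, following Figure 4.3."""
    primary = person_smt if is_person else nonperson_smt
    result = select_translations(primary(name))
    if not result:
        # No full translation from the type-specific system: use the fallback.
        result = select_translations(fallback_smt(name))
    return result


# Hypothetical stub decoders returning ranked candidates.
person_smt = lambda name: ["جون سميث", "جون Smith", "جان سميث"]
nonperson_smt = lambda name: ["Goethe جائزة"]
fallback_smt = lambda name: ["جائزة غوته"]

print(translate_name("John Smith", True, person_smt, nonperson_smt, fallback_smt))
print(translate_name("Goethe Prize", False, person_smt, nonperson_smt, fallback_smt))
```

In the second call the type-specific candidate is rejected because it still contains an untranslated token, so the fallback output is returned instead.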

Using a full-fledged SMT system addresses the challenge of handling word dependencies
and reordering. Since SMT is language independent, our enrichment approach
inherits this property, which makes it easy to adapt to other languages. On the other hand,

the language model may decrease the weights of some correct translations as explained

previously. Finally, it is hard to control single token translations.

4.3 Transliteration

Transliteration is the process of converting a name from the script of a language L1 to the
script of another language L2 while preserving the pronunciation as much as possible. Most SMT systems
use transliteration as a fallback protocol for name translation failures.


Named-Entity translation allowed translating a huge number of English names as well as
contextual keyphrases. However, there are still names for which Named-Entity translation
fails to generate Arabic names, because they are not represented in the parallel
training data. Moreover, translation usually generates only the most prominent Arabic
representation of the name. Thus, this section discusses how transliteration is adopted
to enrich our resource.

4.3.1 Transliteration Approaches

English-to-Arabic transliteration is performed using several approaches [30]. The simplest
is the rule-based or grapheme-based approach, where each set of characters in the source
language is mapped to one or more sets of characters in the target language. Another
approach is phoneme-based mapping, which considers similarity at the level of sounds and
phonetics. Words are represented as phonemes, which are then mapped to the closest
phonemes in the target language. Finally, the target-language phonemes are converted
to characters [3, 56]. A widely used approach is character-level Statistical Machine Translation,
where standard SMT systems are trained on parallel corpora of
words segmented into characters [1, 44, 13, 2].

There are several transliteration services from English to Arabic. The most prominent
of these are:

Google Transliterate³ is a service for multilingual transliteration. However, since
2011 it has been integrated into the Google Translate service, which
does not guarantee producing a transliteration rather than a translation. For example,
“Green”, as a name without any context, will be translated.

Yamli⁴ is a service⁵ that aims at transforming romanized Arabic text to Arabic
characters (i.e. backward transliteration). Its engine is well trained on general
romanized Arabic text as well as person names. However, it is designed to be an
interactive service that suggests several transliterations and expects the user to
choose the correct one (i.e. it is recall oriented).

3Arrib [2] is a transliteration service targeting romanized dialectal Arabic (e.g. chat in
the Egyptian dialect). Informal Arabic text usually uses digits as an extension of the
Latin alphabet to cover Arabic sounds that have no Latin counterpart. Thus,
it is trained on SMS/chat data [9] that is not suitable for transliterating

³ https://developers.google.com/transliterate/
⁴ Yamli is the Arabic translation for “[he] dictates”
⁵ http://www.yamli.com/


a l - s a m m a n ||| ا ل س م ا ن
a l b e r t S_SPACE e i n s t e i n ||| ا ل ب ر ت T_SPACE ا ي ن ش ت ا ي ن
j o h n ||| ج و ن

Table 4.5: Sample of character-level training data

formally written English names. Furthermore, it detects English names and concepts

and excludes them from the transliteration process [16].

None of these solutions is designed specifically for names, and their output accuracy is
not satisfactory for our purposes.

4.3.2 Character-Level Statistical Machine Translation

In order to create a transliteration system geared towards names, we train our SMT
system on the character level. For training, we use the PERSON parallel data in Table 4.4
to train the Cdec SMT system as follows:

1. Spaces are replaced with the special symbols S_SPACE (source side) and T_SPACE (target side).

2. The characters of each word are separated with spaces, as shown in Table 4.5. Thus, each
name record is treated as a phrase and its characters as tokens.

3. We follow the same training steps with Cdec as in Subsection 4.2.4.

In the decoding phase, the input is segmented in the same way as the training
data. Results with out-of-vocabulary tokens (i.e. English characters) are
excluded. Then, the Arabic transliterations are reconstructed by reversing the tokenization
steps. Finally, all names with at least one English character are excluded as failures.
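The character-level segmentation and its reversal can be sketched as follows. The helper names are hypothetical, and lowercasing the English side is an assumption based on the lowercase samples in Table 4.5.

```python
S_SPACE, T_SPACE = "S_SPACE", "T_SPACE"


def segment(name, space_symbol):
    """Turn a name into space-separated characters with an explicit space symbol."""
    return " ".join(space_symbol if ch == " " else ch for ch in name.strip().lower())


def desegment(tokens, space_symbol):
    """Reverse the segmentation: join characters and restore real spaces."""
    return "".join(" " if tok == space_symbol else tok for tok in tokens.split())


def contains_latin(text):
    """Flag reconstructed outputs that still contain English characters."""
    return any(ch.isascii() and ch.isalpha() for ch in text)


source = segment("Albert Einstein", S_SPACE)
print(source)

# A decoder output in the target-side format (taken from Table 4.5).
decoded = "ا ل ب ر ت T_SPACE ا ي ن ش ت ا ي ن"
arabic = desegment(decoded, T_SPACE)
print(arabic, contains_latin(arabic))
```

Because the space symbols are restored before the Latin-character check, only genuinely untransliterated characters cause a result to be rejected.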

We applied transliteration to given names and family names only, since such names
are always transliterated. We did not apply transliteration to any other type, for the
sake of achieving high-quality results.

4.4 Arabic Names Splitting

AIDArabic uses full matching to retrieve the candidate entities for the sake of an
accurate disambiguation process. Nevertheless, person entities are not always mentioned
with their full name; only parts of their names may appear (e.g. given name and family


Type         In English script       Meaning        In Arabic
Prefixes     Abd                     Worshiper      عبد
             Abo                     Father of      ابو
             Umm                     Mother of      ام
             Al                      Family of      آل
             Gad                                    جاد
Connectors   bin/ben                 Son of         بن , ابن
             bent                    Daughter       بنت
Suffixes     Allah, Ellah, Lellah    The God        الله , الاله , لله
             Al-Dawla                State          الدولة
             Al-Deen                 Religion       الدين

Table 4.6: Arabic names common prefixes, suffixes and name connectors with their meaning

name). Therefore, splitting person names allows covering partial mentions. Unlike most
Latin names, Arabic names are not composed of just a given and a last name. Hence,
they require different splitting rules.

Person names are extracted from the rdfs:label relation for entities of type PERSON
in YAGO. After normalizing the names by removing Tatweel, normalizing Alif, and
removing diacritics, the following splitting rules are applied:

1. The Arabic name prefixes عبد , ابو , ام , آل , جاد shown in Table 4.6 are combined with the
following token as one part (e.g. Umm Kulthum/ام كلثوم, Abd-Alkareem/عبد الكريم).

2. Name connectors such as ابن , بن , بنت, which are common in names originating in
the Gulf countries as well as in old Arabic names, are considered splitters and are attached
to the part that follows them.

3. The common Arabic name suffixes الدين , الدولة , الله , الاله , لله are combined with the previous token
as one part (e.g. Noor Al-Deen/نور الدين).

4. Full names composed of two parts are split into <Given Name> <Last Name>.
For example, Mohamed Salah/محمد صلاح is divided into (Mohamed) محمد as the
given name and (Salah) صلاح as the last name.

5. Names with three or more parts are split into <Given Name> <Middle Name> <Last
Name>. For example, (Salman bin Abdulaziz Al Saud)/سلمان بن عبد العزيز آل سعود
is split into (Salman) سلمان as the given name, (bin Abdulaziz) بن عبد العزيز as the
middle name, and (Al Saud) آل سعود as the family name.


Finally, the resulting name parts (given name, middle name, and last name) are added
to the enriched resource after applying the required normalization and word segmentation.
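The splitting rules can be sketched as below. For readability the sketch works on the romanized forms from Table 4.6, whereas the thesis applies the rules to normalized Arabic script; the function names and the handling of single-part names are assumptions.

```python
# Romanized forms from Table 4.6 (the thesis applies the rules to Arabic script).
PREFIXES = {"abd", "abo", "umm", "al", "gad"}
CONNECTORS = {"bin", "ben", "bent"}
SUFFIXES = {"allah", "ellah", "lellah", "al-dawla", "al-deen"}


def split_parts(name):
    """Group tokens into name parts according to rules 1-3."""
    parts, pending = [], []
    for token in name.split():
        key = token.lower()
        if key in PREFIXES or key in CONNECTORS:
            pending.append(token)            # attach to the part that follows
        elif key in SUFFIXES and parts and not pending:
            parts[-1] += " " + token         # attach to the previous part
        else:
            parts.append(" ".join(pending + [token]))
            pending = []
    if pending:                              # dangling prefix/connector at the end
        parts.append(" ".join(pending))
    return parts


def label_parts(parts):
    """Assign given/middle/last labels according to rules 4-5."""
    if not parts:
        return {}
    if len(parts) == 1:
        return {"given": parts[0]}
    if len(parts) == 2:
        return {"given": parts[0], "last": parts[1]}
    return {"given": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}


print(label_parts(split_parts("Salman bin Abdulaziz Al Saud")))
print(label_parts(split_parts("Mohamed Salah")))
```

A connector followed by a prefix, as in a name of the form "bin Abd ...", is kept together with the token after it, which matches the middle-name example in rule 5.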

4.5 EDRAK as a Standalone Resource

In order to help advance Arabic research, we publicly released EDRAK for the
research community as an entity-centric standalone resource.

4.5.1 Use-cases

EDRAK is not only useful as a data schema for NED, but is also a valuable asset for
many Natural Language Processing (NLP) and Information Retrieval tasks. For example,
EDRAK contains a comprehensive dictionary of different potential Arabic names for
entities gathered from both the English and Arabic Wikipedias. The EDRAK dictionary can
be used for building an Arabic dictionary-based NER system [12, 52].

In addition to the name dictionary, the resource contains a large catalog of entity

Arabic textual context in the form of keyphrases. They can be used to estimate Entity-

Entity Semantic Relatedness scores as in [25].

Entities in EDRAK are classified under the type hierarchy of YAGO [26]. Together

with the keyphrases, EDRAK can be used to build an Entity Summarization system

as in [58], or to build a Fine-grained Semantic Type Classifier for named entities as in

[64, 65].

4.5.2 Technical details

EDRAK is available in the form of an SQL dump, and can be downloaded from the
Downloads section of the AIDA project page http://www.mpi-inf.mpg.de/yago-naga/aida/.
We followed the same schema used in the original AIDA framework [27] for data
storage. Highlights of the SQL dump are shown in Table 4.7. EDRAK’s comprehensive
entity catalog is stored in the SQL table entity_ids. Each entity has many potential Arabic
names, stored together in the SQL table dictionary. In addition, each entity is assigned a
set of Arabic contextual keyphrases stored in the SQL table entity_keyphrases.

It is worth noting that the sources of dictionary entries as well as entity keyphrases are
kept in the schema (YAGO3_LABEL, REDIRECT, GIVEN_NAME, or FAMILY_NAME). Furthermore,
generated data (by translation or transliteration) are differentiated from the original
Arabic data extracted directly from the Arabic Wikipedia. Different generation techniques



Table Name          Major Columns                        Description
entity_ids          id, entity                           Lists all entities together with their numerical IDs.
dictionary          mention, entity, source              Contains information about the candidate entities for a name. It keeps track of the source of the entry to allow application-specific filtering.
entity_keyphrases   entity, keyphrase, source, weight    Holds the characteristic description of entities in the form of keyphrases. The source of each keyphrase is kept for application-specific filtering.
entity_types        entity, types[]                      Stores YAGO semantic types to which this entity belongs.
entity_rank         entity, rank                         Ranks all entities based on the number of incoming links in both the English and Arabic Wikipedia. This can be used as a measure for entity prominence.

Table 4.7: Main SQL Tables in EDRAK

and data sources entail different data quality. Therefore, keeping data sources enables

downstream applications to filter data for precision-recall trade-off.
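As an illustration of source-based filtering, the sketch below builds a tiny in-memory stand-in for the dictionary table and restricts candidate lookup to selected sources. It uses SQLite only for self-containment; the released dump may target a different database system, the column types are assumptions, and the source value GENERATED_TRANSLATION is a hypothetical label for generated entries.

```python
import sqlite3

# In-memory stand-in for the EDRAK dictionary table (column types are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictionary (mention TEXT, entity INTEGER, source TEXT)")
conn.executemany(
    "INSERT INTO dictionary VALUES (?, ?, ?)",
    [
        ("القاهرة", 1, "YAGO3_LABEL"),
        ("القاهرة", 2, "REDIRECT"),
        ("القاهرة", 3, "GENERATED_TRANSLATION"),  # hypothetical source label
    ],
)


def candidates(conn, mention, allowed_sources):
    """Return candidate entity IDs for a mention, restricted to the given sources."""
    marks = ",".join("?" for _ in allowed_sources)
    sql = (
        "SELECT DISTINCT entity FROM dictionary "
        f"WHERE mention = ? AND source IN ({marks}) ORDER BY entity"
    )
    return [row[0] for row in conn.execute(sql, [mention, *allowed_sources])]


# Precision-oriented lookup: original Wikipedia-derived names only.
print(candidates(conn, "القاهرة", ["YAGO3_LABEL", "REDIRECT"]))
# Recall-oriented lookup: also accept generated names.
print(candidates(conn, "القاهرة", ["YAGO3_LABEL", "REDIRECT", "GENERATED_TRANSLATION"]))
```

Widening or narrowing the list of allowed sources is one simple way an application could move along the precision-recall trade-off.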

Chapter 5

Evaluation and Statistics

This chapter discusses the evaluation of the size and quality of EDRAK
(Section 5.1) and its effect on the quality of AIDArabic+ results (Section 5.2).

5.1 EDRAK Evaluation

It is important to evaluate the effect of the enrichment approaches on the quality and
size of the generated resource, which directly affect the overall quality and
performance of the NED system [62]. Statistics about the EDRAK resource are shown in Subsection 5.1.1. In
addition, the manual assessment performed by native Arabic speakers is discussed in
Subsection 5.1.3.

5.1.1 Statistics

EDRAK contains around 2.4M entities (with at least one name for each) classified
under the YAGO type hierarchy. With this size, EDRAK is an order of magnitude bigger than
the original AIDArabic resource, which contains 143K entities, because the latter is constrained
by the amount of Arabic names and contextual keyphrases available in the Arabic
Wikipedia.

Table 5.1 shows a comparison between AIDArabic and EDRAK in terms of the name
and contextual keyphrases dictionaries. The name dictionary size increased from less
than 0.5M entity-name pairs to 21M pairs. The number of unique names is now 20
times that in AIDArabic. In addition, the average number of names per entity increased from 2.45
to 7.75 names/entity.


                          AIDArabic         EDRAK
Unique Names                333,017     9,354,875
Entities with Names         143,394     2,400,340
Entity-Name Pairs           495,245    21,669,568
Unique Keyphrases           885,970     7,918,219
Entity-Keyphrase Pairs    5,574,375   211,681,910

Table 5.1: AIDArabic vs EDRAK: Sizes of Name and Contextual Keyphrases Dictionaries

Technique          # New Entities   # Dictionary Entries
Google W2C                 47,406                241,104
CMUQ-Arabic-NET            19,706                 23,338
JRC                         1,664                  4,148
Translation             3,549,248             11,222,876
Transliteration         3,340,921              9,578,658
Name Splitting                  0                 94,782

Table 5.2: Number of New Entities¹ and Entity-Name pairs per Generation Technique

Semantic Type    AIDArabic       EDRAK
PERSON              47,483   1,220,032
EVENT               11,065     199,846
LOCATION            34,451     360,108
ORGANIZATION        10,212     196,305
ARTIFACT            15,650     359,071

Table 5.3: Number of Entities per Type in AIDArabic vs EDRAK

The contributions of each generation technique are summarized in Table 5.2. The numbers
indicate that automatic generation (i.e. translation and transliteration) contributes
far more entries than the external name dictionaries. In addition, translation delivers more
entries than transliteration, since it is applied to all types of entities, in contrast to only
person names for transliteration. Furthermore, GW2C did not introduce many new
entities because it is not common to manually link a mention in an Arabic article to an
English Wikipedia page. CMUQ and JRC are both collected from news-wire; hence,
they contributed names only for prominent entities.

Table 5.3 lists the number of entities per high level semantic type for both AIDArabic

entity catalog and EDRAK. The highest increase is observed in type PERSON as a result

of applying both translation and transliteration.

Similarly, the contextual keyphrases dictionary increased 42 times, as shown in
Table 5.1. Although we applied the generation techniques to the categories and the
inlink titles only, the expansion in the contextual keyphrases was expected to be higher


Keyphrase Source           AIDArabic         EDRAK
citationTitle                 67,031        67,031
linkAnchor                 2,469,923     2,469,923
inlinkTitle                2,734,530     5,216,657
wikipediaCategory            302,891     4,029,483
wikipediaCategory_TRANS            -    13,842,770
inlinkTitle_TRANS                  -   186,056,046

Table 5.4: Contextual keyphrases dictionary AIDArabic vs EDRAK

than the name dictionary. New contextual keyphrases originate from: (i) new entities
that can be translated using the manual English-Arabic inter-Wikipedia links [67], or (ii)
Arabic keyphrases generated automatically using translation and transliteration. This
explains the expansion in the original sources, inlinkTitle and wikipediaCategory,
as shown in Table 5.4. Also, it is worth noting that wikipediaCategories were only
translated, while inlinkTitles were both translated and transliterated according to
their entity type.

5.1.2 Data Example

Many prominent entities do not exist in the Arabic Wikipedia, and hence do not

appear in any Wikipedia-based resource. For example, Christian Schmidt, the current

German Federal Minister of Food and Agriculture, and Edward W. Morley, a famous

American scientist, are both missing from the Arabic Wikipedia². EDRAK’s data enrichment

techniques managed to automatically generate reasonable potential names as well as

contextual keyphrases for both. Table 5.5 lists a snippet of what EDRAK knows about

those two entities.

5.1.3 Manual Assessment

The target of the manual assessment is to quantify the quality of the generated names

and contextual keyphrases using different methods.

Setup

We evaluated all aspects of data generation in EDRAK. We included entity names

belonging to First Name, Last Name, Wikipedia redirects, and rdfs:label relation which

² As of June 2015


Table 5.5: Examples for Entities in EDRAK with their Generated Arabic Names and Keyphrases (entities: Christian Schmidt and Edward W. Morley; columns: Entity, Generated Arabic Names, Generated Keyphrases)


carries names extracted from Wikipedia page titles, disambiguation pages and anchor

texts.

The data for evaluation was generated using the full SMT system trained on Named-Entities
only. In order to examine the effect of considering semantic types in translation,
we implemented two approaches: the first is the Type-Aware SMT described in Subsection 4.2.4,
and the second uses a universal SMT system for translating all names (referred
to as Combined). For each name, the top-3 successful translations or transliterations
were generated, if they exist.

The data assessment experiment covered all types of data against both translation
approaches. Additionally, we conducted experiments to assess the quality of translating
Wikipedia categories using a system trained on parallel English-Arabic categories. Finally,
we evaluated the performance of transliteration when applied to English person names.
We randomly sampled the generated data and conducted an online experiment to manually
assess the quality of the data.

Task

We asked a group of native Arabic speakers to manually judge the correctness of the
generated data through our web-based tool (as shown in Appendix A). Each participant
was presented with around 150 English names together with the top-3 potential Arabic
translations or transliterations proposed by cdec (or fewer if cdec proposed fewer than three
translations). Participants were asked to pick all correct Arabic names, or None
if all translations were incorrect. Participants had the option to skip a name (by choosing
the Don’t know option) if they needed to. The experiment was designed such that each
English name would be evaluated by three different persons.

Results and Discussion

In total, 55 participants evaluated 1646 English surface forms, which were
assigned 4463 potential Arabic translations. Each English name was annotated by
at least three participants, who either selected among the proposed translations or chose None. Participants
were native Arabic speakers based in the USA, Canada, Europe, KSA, and Egypt.
Their homelands span Egypt, Jordan, and Palestine. Manual assessment results are
shown in Table 5.6. Evaluation results are given per entity type, translation approach,
and name source. Since cdec did not return three potential translations for each name,
we computed the total number of translations added when considering up to the top one or


Type          Approach          Source         Count@Top-K          Prec@Top-K
                                               1     2     3        1       2       3
Persons       Type-Aware        First Name     8    10    12    87.50   80.00   66.67
                                Last Name     14    17    19    92.86   88.24   78.95
                                rdfs:label   156   288   383    79.49   63.19   57.44
                                redirects    113   210   285    69.91   57.62   50.18
              Combined          First Name     7    10    12   100.00   90.00   75.00
                                Last Name     16    22    25    87.50   81.82   76.00
                                rdfs:label   160   307   421    81.25   64.82   57.24
                                redirects    108   210   288    67.59   60.00   54.51
              Transliteration   First Name    26    52    76    80.77   61.54   56.58
                                Last Name     94   188   279    70.21   63.83   55.91
Non-Persons   Type-Aware        rdfs:label   269   519   742    53.16   43.16   36.66
                                redirects    191   370   526    45.55   34.86   30.99
              Combined          rdfs:label   273   533   770    49.82   41.84   36.75
                                redirects    195   378   539    46.67   39.42   34.69
Categories    Categories        Categories   118   234   340    67.80   52.99   46.18

Table 5.6: Assessment Results of Applying SMT for Translating Entities and Wikipedia Categories Names

two or three results. For each case, we computed the corresponding precision based on
the participants’ annotations.
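The Count@Top-K and Prec@Top-K columns of Table 5.6 can be computed along the following lines. This is a sketch under assumptions: each name is represented by the ranked list of correctness judgments for its returned translations (at most three), and how the judgments of several annotators are merged into one verdict is not specified here.

```python
def count_and_precision_at_k(judgments, k):
    """Count the translations within the top k per name and the share judged correct."""
    top_k = [ranked[:k] for ranked in judgments.values()]
    count = sum(len(ranked) for ranked in top_k)
    correct = sum(sum(ranked) for ranked in top_k)
    precision = 100.0 * correct / count if count else 0.0
    return count, round(precision, 2)


# Hypothetical judgments: name -> ranked correctness flags of its translations.
judgments = {
    "name_a": [True, False, True],
    "name_b": [False],            # the decoder returned a single translation
    "name_c": [True, True],
}
for k in (1, 2, 3):
    print(k, count_and_precision_at_k(judgments, k))
```

Because names with fewer than k translations contribute fewer entries, the count grows by less than the number of names at each step, as in the First Name and Last Name rows of Table 5.6.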

Data was randomly sampled from all generated data such that the size of each test

set reflects the distribution of the sources included in the original data. For example,

names originating from rdfs:label relation are an order of magnitude more than those

coming from FirstName, and LastName relations.

The quality of the generated data varies according to the entity type, name source, and
generation technique. The quality of translated Wikipedia redirects is consistently lower
than that of other sources. This is due to the nature of redirects: they are not necessarily
another variation of the entity name. In addition, redirects tend to be longer strings,
and hence are more error-prone than rdfs:labels. For example, “European Union common
passport design”, which redirects to the entity Passports of the European Union, could
not be correctly translated. Each token was translated correctly, but the final token
order was wrong. Evaluators were asked to annotate such examples as wrong. However,
such ordering problems are less critical for applications that incorporate partial matching
techniques. Similarly, categories tend to be longer than entity names; hence,
they exhibit the same problems as redirects.

Although the number of evaluated FirstName and LastName data points is small,
the assessment results are as expected. Translating a one-token name is a relatively easy
task. In addition, cdec returned only one or two translations for the majority of these
names, as shown in Table 5.6.

Results also show that the type-aware translation system does not necessarily improve

results, and one universal system can deliver comparable results for most of the cases.

Person-name transliteration unexpectedly achieved lower quality than translation.
This is because names are pronounced differently across countries. For
example, a USA-based annotator expects “Friedrich” to be written “فريدريك”, while
a Germany-based one expects it to be written as “فريدريش”. Similarly, only
Germany-based participants know that the person name “Johannes” should be written
as “يوهانس”, not as “جوهانس”. We attempted to mitigate this problem by inviting
Arabic speakers located in different regions of the world.

Finally, inter-annotator agreement, measured using Fleiss’ kappa, was 0.484,
indicating moderate agreement.
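For reference, Fleiss’ kappa can be computed as in the sketch below, which implements the standard formula. It assumes every item was rated by the same number of annotators; the toy rating matrix is hypothetical and unrelated to the study data.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who assigned item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])      # assumed identical for every item
    n_cats = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement among rater pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items      # observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Toy example: three raters, two categories (correct / incorrect).
print(fleiss_kappa([[3, 0], [0, 3]]))   # full agreement
print(fleiss_kappa([[2, 1], [1, 2]]))   # agreement below chance level
```

Values between 0.41 and 0.60 are conventionally read as moderate agreement, which is the band the reported 0.484 falls into.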

5.2 AIDArabic+ Evaluation

In this section we discuss the experiment conducted to evaluate the effect of the enriched

data resource (EDRAK) and the Arabic specific tokenization and normalization on

AIDArabic+ results.

5.2.1 Arabic Corpus Creation

The first problem we faced was the lack of an annotated Arabic benchmark. Creating a well
annotated corpus manually is a time-consuming task. Therefore, we created our
benchmark automatically.

The main idea is to use a parallel corpus and annotate the Arabic part using
automatically generated evidence from the English counterpart. This approach was
also followed by [7] to collect Named-Entities from parallel news and by [37] to create
a persons-only multilingual annotated corpus.

We used LDC2014T05 [35], a manually translated and aligned English-Arabic parallel
corpus of news and web text. LDC2014T05 was developed mainly for SMT development. The
Arabic documents of the corpus are tokenized using MADA+TOKEN [19, 49] and aligned with the
tokenized English translation on the word level (many-to-many mapping). We favored
using a manually word-aligned corpus over using an automatic alignment tool such as
GIZA++ [47] or FAST Aligner [14] to guarantee a better projection quality.


Type        #Docs   #Uniq. Entities   #Mentions   #Non-null Mentions
News-wire     702             2,009      18,240               14,413
Web            74               338       2,055                1,385

Table 5.7: LDC2014T05 Annotated Benchmark Statistics

We started by applying AIDA [27, 66], a state-of-the-art NED system, accompanied
by Named Entity Recognition, to the tokenized English side. The English AIDA
disambiguation results were projected onto the tokenized Arabic side as follows:

1. All English mentions (with their entity mapping) were projected on the Arabic

tokens using the word alignment information.

2. Tokens marked as GLUE at the boundaries of the Arabic mentions, such as attached
prepositions and conjunctions (e.g. the character “و” in “and Egypt”/ومصر) and attached pronouns at
the end, are removed from the mention string.

3. Overlapping Arabic mentions, resulting from the translation nature, were combined.

4. Mentions were filtered such that:

(a) Arabic mentions mapped to two different entities are excluded.

(b) Long Arabic mentions are also excluded.
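The projection steps above can be sketched as follows. The data layout (inclusive token-index spans, alignment pairs, a per-token GLUE flag) and the length threshold `max_len` are assumptions for illustration; the text does not state the actual cut-off for long mentions.

```python
def project_mention(en_span, alignment, glue, max_len=5):
    """Project an English token span onto Arabic tokens and trim GLUE boundaries.

    en_span: (start, end) inclusive English token indices.
    alignment: iterable of (english_index, arabic_index) pairs.
    glue: list of booleans, True where the Arabic token is marked as GLUE.
    """
    start, end = en_span
    ar = sorted({a for e, a in alignment if start <= e <= end})
    if not ar:
        return None
    lo, hi = ar[0], ar[-1]
    while lo <= hi and glue[lo]:     # strip GLUE tokens at the left boundary
        lo += 1
    while hi >= lo and glue[hi]:     # strip GLUE tokens at the right boundary
        hi -= 1
    if lo > hi or hi - lo + 1 > max_len:
        return None                  # nothing left, or mention too long
    return (lo, hi)


def merge_mentions(mentions):
    """Merge overlapping (lo, hi, entity) spans; drop spans with conflicting entities."""
    merged = []
    for lo, hi, entity in sorted(mentions):
        if merged and lo <= merged[-1][1]:
            prev_lo, prev_hi, entities = merged[-1]
            merged[-1] = (prev_lo, max(prev_hi, hi), entities | {entity})
        else:
            merged.append((lo, hi, {entity}))
    return [(lo, hi, next(iter(ents))) for lo, hi, ents in merged if len(ents) == 1]


# "and Egypt": the English token "Egypt" (index 1) is aligned to both Arabic tokens.
print(project_mention((1, 1), [(0, 0), (1, 0), (1, 1)], glue=[True, False]))
print(merge_mentions([(0, 1, "A"), (1, 2, "A"), (4, 4, "B"), (4, 5, "C")]))
```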

The Arabic documents and the produced ground truth were exported in the CoNLL
dataset format. After excluding all documents with alignment problems, our Arabic
corpus contains a total of 776 documents with a total of 15798 non-null mentions. Table 5.7
shows the details of the annotated data.

5.2.2 Experiment Setup

Systems Setup

For testing, we built AIDArabic+ including the new source (EDRAK) and the
Arabic-specific pre-processing component. We evaluated two data generation approaches: (i) using
Yamli for transliteration and the type-aware Light-SMT proposed in Subsection 4.2.3, and
(ii) using type-aware translation and transliteration with the full SMT framework.
In both, we used the external dictionaries introduced in Section 4.1.

We tested both AIDArabic+ configurations against the AIDArabic [67] and Babelfy [43]
NED systems. To our knowledge, there are no other available systems supporting NED
on Arabic input.


Dataset    System                         Mention Prec.   Document Prec.   Mapped to Entity
LDC news   AIDArabic+ (Full-SMT)                  73.23            71.34              94.69
           AIDArabic+ (Yamli & L-SMT)             70.83            68.94              92.73
           AIDArabic                              69.07            67.26              87.19
           Babelfy (Full Matching)                30.32            31.16              39.75
           Babelfy (Partial Matching)             25.24            25.84              39.48
LDC web    AIDArabic+ (Full-SMT)                  68.16            60.10              93.86
           AIDArabic+ (Yamli & L-SMT)             66.06            56.86              92.13
           AIDArabic                              62.02            52.48              85.56
           Babelfy (Full Matching)                22.33            21.13              38.62
           Babelfy (Partial Matching)             20.66            19.52              35.52

Table 5.8: Disambiguation Results for AIDArabic+ vs AIDArabic vs Babelfy

For all AIDA-based systems, we used YAGO3 as our back-end knowledge base, built
from the English Wikipedia dump of 12 Jan 2015 combined with the Arabic Wikipedia
dump of 18 Dec 2014. The same configuration was used as in the original AIDA local
similarity technique [27].

For Babelfy, we used their web service³, version 1.0. It offers two modes: named

entities full matching and partial matching. We ran both using a predefined set of

mentions. For fair comparison, we limited their candidate space to Wikipedia. We

resolved the corpus ground truth from YAGO3 to BabelNet [46] through BabelNet web

service getSynsetIdsFromWikipediaTitle.

Evaluation

We evaluated against mentions with non-null ground truth annotations. For fair comparison,
Null annotations returned by any system were considered wrong annotations.
We computed both mention precision and document average precision. Precision is
computed as the number of correct annotations divided by the number of
all annotations returned by the system.
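The two measures can be sketched as below, assuming each document is given as a list of (gold entity, system entity) pairs for mentions with non-null ground truth, with `None` standing for a Null system annotation. The document identifiers and entity labels are hypothetical.

```python
def mention_precision(docs):
    """Micro-average: correct annotations over all evaluated mentions, in percent."""
    pairs = [pair for mentions in docs.values() for pair in mentions]
    return 100.0 * sum(gold == system for gold, system in pairs) / len(pairs)


def document_precision(docs):
    """Macro-average: mean of the per-document precision values, in percent."""
    per_doc = [
        sum(gold == system for gold, system in mentions) / len(mentions)
        for mentions in docs.values()
        if mentions
    ]
    return 100.0 * sum(per_doc) / len(per_doc)


# Hypothetical toy corpus; a None system annotation counts as wrong.
docs = {
    "doc1": [("A", "A"), ("B", None)],
    "doc2": [("C", "C"), ("D", "D"), ("E", "X"), ("F", "F")],
}
print(round(mention_precision(docs), 2), round(document_precision(docs), 2))
```

The two numbers differ whenever documents have different numbers of mentions, which is why Table 5.8 reports them separately.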

5.2.3 Results and Discussion

The results of our experiments are shown in Table 5.8. AIDArabic+ consistently delivered
better results than the competitors under test. Both versions of AIDArabic+ mapped above
92% of the mentions to non-null entities. AIDArabic+ built with full SMT achieved
better precision than the Yamli & L-SMT build due to the better quality of the generated

³ http://babelfy.org/guide


dictionaries. While the improvement in precision of the latter over AIDArabic
is less than 2%, the full-SMT version achieved a 4% increase. Nevertheless, for the news corpus,
our comprehensive KB could not show its full benefit, since most of the entities are prominent
enough to appear in the Arabic Wikipedia.

On the other hand, since entities in the web corpus are less prominent than those in the
news corpus, the enriched KB showed a larger benefit for the web documents. AIDArabic+
achieved an 8% increase in document precision and 6% in mention precision.

Babelfy with full and partial matching achieved less than 35% for both the news-wire
and web corpora. The Babelfy back-end source does not apply entity name translation [46],
which explains its poor performance.

Sampling the results shows that the improvements in AIDArabic+ resulted from the following:

• New Entities: EDRAK covered new entities that were not covered in the AIDArabic
schema by introducing at least one potential name. For example, the names

“ñKñ

�Kñ»

�éJ

�¯A

®�K @” and “ÉJ

J�

�J�� á� @YKPñÊ

¯

�HðA�” were linked to “Cotonou Agreement” and “Sun-Sentinel newspaper”, respectively,
although no Arabic Wikipedia page exists for either. Thus, they were correctly disambiguated.

• Name Variants: Some entities already existed in the Arabic Wikipedia together
with their names; however, some English names have several potential forms in
Arabic. Transliteration was able to produce potential Arabic name variants. For
instance, the Arabic Wikipedia page of the Nobel prize winner “Jose Saramago” lists
his Arabic name as “جوزيه ساراماغو”. However, in our news corpus, the name
“خوسيه ساراماغو” was used instead. Our system could learn both forms and correctly
disambiguate that mention.

• New Names: Similarly, some entities lack several prominent Arabic name aliases. Translation and external dictionaries were able to expand the name dictionary with such names. For example, the entity United States Department of State may be referred to as “الخارجية”, which did not exist in the Arabic Wikipedia but was translated from the English redirect “Department of State”.
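All three effects above amount to adding name-to-entity entries to the dictionary used for candidate lookup. The following is a minimal sketch of that idea, assuming a plain in-memory mapping; the class, the entity identifiers and the comments on where each name came from are hypothetical and do not reflect EDRAK's actual schema.

```python
from collections import defaultdict
from typing import Dict, Set


class NameDictionary:
    """Maps a surface name to the set of entities it may refer to."""

    def __init__(self) -> None:
        self._candidates: Dict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, entity: str) -> None:
        self._candidates[name].add(entity)

    def lookup(self, name: str) -> Set[str]:
        # .get avoids creating empty entries for unseen names.
        return set(self._candidates.get(name, set()))


names = NameDictionary()
# Name as listed on the Arabic Wikipedia page (entity ids are hypothetical).
names.add("جوزيه ساراماغو", "Jose_Saramago")
# Variant of the same name, as a transliteration module might produce it.
names.add("خوسيه ساراماغو", "Jose_Saramago")
# Alias obtained by translating an English redirect.
names.add("الخارجية", "United_States_Department_of_State")
```

With both spellings registered, a mention written in either form yields the same candidate entity, which is what allows the disambiguation step to consider it at all.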

Chapter 6

Conclusion and Outlook

6.1 Conclusion

In this thesis, we discussed adapting Named Entity Disambiguation effectively for Arabic text. AIDArabic was the first attempt to enable NED on Arabic. Nevertheless, it exhibited low recall due to the sparsity of structured Arabic resources and the complexity of Arabic-specific features. We introduced AIDArabic+ to enhance NED on Arabic input by utilizing a rich data schema and a customized Arabic pre-processing component.

To overcome the sparsity of structured Arabic resources, we introduced EDRAK as a back-end schema for AIDArabic+. EDRAK is an entity-centric Arabic resource that contains around 2.4M entities with their potential Arabic names, contextual keyphrases and semantic types. Data in EDRAK has been extracted from the Arabic Wikipedia and other available resources such as GW2C and name dictionaries. In addition, we enriched EDRAK with automatically generated Arabic data based on the English Wikipedia.

To achieve accurate data generation, we developed type-aware named-entity translation, utilizing the fully fledged SMT framework Cdec and a parallel corpus of entity names. Furthermore, we developed a person-name transliteration module to generate all possible variants of person names. The generated data was manually assessed by a group of native Arabic speakers. We made EDRAK publicly available as a standalone resource to help advance research on the Arabic language.

Due to the morphological nature of Arabic, we integrated an Arabic pre-processing module into the AIDArabic+ architecture to correctly tokenize and normalize Arabic input. The customized Arabic pre-processing allowed better recall and precision for name and context matching. We used the Stanford Arabic Segmenter to perform the required morphological analysis and tokenization.
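To illustrate why normalization helps name and context matching, here is a small sketch of common Arabic orthographic normalization steps: removing diacritics and tatweel, and unifying alef and ya variants. The choice of steps is an assumption made for illustration; this is not the Stanford Arabic Segmenter and it performs no morphological segmentation.

```python
import re

# Short-vowel and related marks (fathatan .. sukun) plus superscript alef.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"
# Alef with madda / hamza above / hamza below -> bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text: str) -> str:
    """Map common spelling variants of an Arabic string to one canonical form."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALEF_VARIANTS.sub("\u0627", text)
    # Alef maqsura -> ya, a frequent spelling variation at word end.
    text = text.replace("\u0649", "\u064A")
    return text
```

Applying the same function to both dictionary names and input text means that a vocalized or elongated spelling in a document can still match the plain form stored in the dictionary.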

Finally, to evaluate the effect of the proposed enhancements to AIDArabic, we utilized a parallel word-aligned English-Arabic corpus (LDC2014T05) to create an automatically annotated Arabic corpus. AIDA, a state-of-the-art NED system, was used to generate annotations on the English side. These annotations were then projected onto the Arabic side.
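The projection step can be sketched as follows: each English mention is a token span, and its Arabic counterpart is taken to be the smallest span covering all Arabic tokens aligned to it. The data layout (pairs of token indices, half-open spans) and the function names are assumptions for illustration; the actual corpus format and any filtering applied during projection may differ.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)


def project_mention(
    en_span: Span, alignment: List[Tuple[int, int]]
) -> Optional[Span]:
    """Return the smallest Arabic span covering all tokens aligned to en_span.

    alignment holds (english_token_index, arabic_token_index) pairs.
    Returns None when no token of the mention is aligned.
    """
    start, end = en_span
    targets = [ar for en, ar in alignment if start <= en < end]
    if not targets:
        return None
    return (min(targets), max(targets) + 1)


def project_annotations(
    annotations: Dict[Span, str], alignment: List[Tuple[int, int]]
) -> Dict[Span, str]:
    """Carry entity labels from English spans over to Arabic spans."""
    projected: Dict[Span, str] = {}
    for en_span, entity in annotations.items():
        ar_span = project_mention(en_span, alignment)
        if ar_span is not None:
            projected[ar_span] = entity
    return projected
```

Mentions whose tokens have no alignment are dropped rather than guessed, so the projected corpus may contain fewer annotations than the English side.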

AIDArabic+ was able to resolve 94% of the mentions in the news-wire corpus to non-null entities, compared to 87% for the original AIDArabic. This expansion in coverage was achieved with 73% mention precision, which is 4% higher than the precision of AIDArabic and considerably higher than that of Babelfy. For the web articles, non-null mappings increased from 85.6% to 93.9% at a mention precision of 68% (an 8% increase over AIDArabic). This indicates that our approach captures more information about non-prominent entities.

6.2 Outlook

There is still room for enhancing Named Entity Disambiguation on Arabic. The data schema can be further enriched. Anchor texts have not been translated, in order to preserve the accuracy of the contextual keyphrases. Developing a translation module with proper training data for anchor texts could enrich the keyphrase dictionary and hence achieve better precision.

AIDArabic+ used entities extracted from the English and Arabic parts of YAGO3; no other language was included. Evidence from Wikipedia editions in languages other than English can be harnessed to enrich the EDRAK entity repository and dictionaries. For example, more entities can be captured from the German Wikipedia. Also, other languages written in Arabic script, such as Persian and Urdu, can be processed to provide a large number of name entries, especially for entities of type PERSON.

NED can be used for different applications. One of the applications built on AIDA NED is STICS [24]. STICS offers a web interface to search and explore news articles using canonical entities instead of plain text search. AIDArabic+ can be adapted as an NED engine for STICS to support Arabic articles. This will introduce many use cases and challenges.

Bibliography

[1] Nasreen AbdulJaleel and Leah S. Larkey. Statistical transliteration for english-arabic cross

language information retrieval. In Proceedings of the Twelfth International Conference on

Information and Knowledge Management, CIKM ’03, pages 139–146, New York, NY, USA,

2003. ACM.

[2] Mohamed Al-Badrashiny, Ramy Eskander, Nizar Habash, and Owen Rambow. Automatic

transliteration of romanized dialectal arabic. In Proceedings of the Eighteenth Conference on

Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June

26-27, 2014, pages 30–38, 2014.

[3] Yaser Al-Onaizan and Kevin Knight. Machine transliteration of names in arabic text. In

Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages,

SEMITIC ’02, pages 1–13, Stroudsburg, PA, USA, 2002. Association for Computational

Linguistics.

[4] Yaser Al-Onaizan and Kevin Knight. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 400–408, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

[5] Yaser Al-Onaizan and Kevin Knight. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 400–408, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.


[6] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In 6th Intl Semantic Web Conference, Busan, Korea, pages 11–15. Springer, 2007.

[7] Mahmoud Azab, Houda Bouamor, Behrang Mohit, and Kemal Oflazer. Dudley north visits north london: Learning when to transliterate to arabic. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 439–444, Atlanta, Georgia, June 2013. Association for Computational Linguistics.


[8] Ibrahim Badr, Rabih Zbib, and James Glass. Syntactic phrase reordering for english-to-arabic statistical machine translation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09, pages 86–93, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[9] Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright,

Stephanie Strassel, Nizar Habash, Ramy Eskander, and Owen Rambow. Transliteration

of arabizi into arabic orthography: Developing a parallel annotated arabizi-arabic script

sms/chat corpus. ANLP 2014, page 93, 2014.

[10] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A

collaboratively created graph database for structuring human knowledge. In Proceedings of

the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08,

pages 1247–1250, New York, NY, USA, 2008. ACM.

[11] David Chiang. Hope and fear for discriminative training of statistical translation models. J.

Mach. Learn. Res., 13(1):1159–1187, April 2012.

[12] Kareem Darwish. Named entity recognition using cross-lingual resources: Arabic as an

example. In ACL (1), pages 1558–1567. The Association for Computer Linguistics, 2013.

[13] Chris Irwin Davis. Tajik-farsi persian transliteration using statistical machine translation. In

Proceedings of the Eighth International Conference on Language Resources and Evaluation

(LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 3988–3995, 2012.

[14] Chris Dyer, Victor Chahuneau, and Noah A Smith. A simple, fast, and effective reparameterization of ibm model 2. In NAACL/HLT 2013, pages 644–648, 2013.

[15] Chris Dyer, Adam Lopez, Juri Ganitkevitch, Johnathan Weese, Ferhan Ture, Phil Blunsom,

Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. cdec: A decoder, alignment, and

learning framework for finite-state and context-free translation models. In Proceedings of

the Association for Computational Linguistics (ACL), 2010.

[16] Ramy Eskander, Mohamed Al-Badrashiny, Nizar Habash, and Owen Rambow. Foreign

words and the automatic processing of arabic social media text written in roman script.

EMNLP 2014, page 1, 2014.

[17] Paolo Ferragina and Ugo Scaiella. Tagme: On-the-fly annotation of short text fragments (by

wikipedia entities). In Proceedings of the 19th ACM International Conference on Information

and Knowledge Management, CIKM ’10, pages 1625–1628, New York, NY, USA, 2010. ACM.

[18] Spence Green, Daniel Cer, and Christopher D. Manning. Phrasal: A toolkit for new directions in statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.

[19] Nizar Habash, Owen Rambow, and Ryan Roth. Mada+ tokan: A toolkit for arabic

tokenization, diacritization, morphological disambiguation, pos tagging, stemming and

lemmatization. In Proceedings of the 2nd International Conference on Arabic Language

Resources and Tools (MEDAR), Cairo, Egypt, pages 102–109, 2009.

[20] Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran.

Evaluating entity linking with wikipedia. Artif. Intell., 194:130–150, January 2013.

[21] Ondrej Halek, Rudolf Rosa, Ales Tamchyna, and Ondrej Bojar. Named entities from

wikipedia for machine translation. In ITAT, pages 23–30. Citeseer, 2011.

[22] Ondrej Halek, Rudolf Rosa, Ales Tamchyna, and Ondrej Bojar. Named entities from

wikipedia for machine translation. In ITAT, pages 23–30. Citeseer, 2011.

[23] Kenneth Heafield. KenLM: faster and smaller language model queries. In Proceedings of the

EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh,

Scotland, United Kingdom, July 2011.

[24] Johannes Hoffart, Dragan Milchevski, and Gerhard Weikum. Stics: Searching with strings,

things, and cats. In Proceedings of the 37th International ACM SIGIR Conference on

Research & Development in Information Retrieval, SIGIR ’14, pages 1247–1248, New

York, NY, USA, 2014. ACM.

[25] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum.

KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. In Proceedings of the

21st ACM International Conference on Information and Knowledge Management, CIKM

2012, Hawaii, USA, pages 545–554, 2012.

[26] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. Yago2: A

spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell., 194:28–61,

January 2013.

[27] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Furstenau, Manfred Pinkal,

Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation

of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural

Language Processing, EMNLP ’11, pages 782–792, Stroudsburg, PA, USA, 2011. Association

for Computational Linguistics.

[28] Fei Huang, Stephan Vogel, and Alex Waibel. Improving named entity translation combining

phonetic and semantic similarities. In HLT-NAACL, volume 2004, pages 281–288, 2004.

[29] Heng Ji, HT Dang, J Nothman, and B Hachey. Overview of tac-kbp2014 entity discovery

and linking tasks. In Proc. Text Analysis Conference (TAC2014), 2014.

[30] Sarvnaz Karimi, Falk Scholer, and Andrew Turpin. Machine transliteration survey. ACM

Comput. Surv., 43(3):17:1–17:46, April 2011.

[31] Philipp Koehn. Statistical machine translation. Cambridge University Press, 2010.

[32] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico,

Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer,

Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for

statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on

Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA,

USA, 2007. Association for Computational Linguistics.

[33] Young-Suk Lee. Confusion network for arabic name disambiguation and transliteration

in statistical machine translation. In COLING 2014, 25th International Conference on

Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29,

2014, Dublin, Ireland, pages 433–443, 2014.

[34] Young-Suk Lee. Confusion network for arabic name disambiguation and transliteration

in statistical machine translation. In COLING 2014, 25th International Conference on

Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29,

2014, Dublin, Ireland, pages 433–443, 2014.

[35] Xuansong Li et al. GALE Arabic-English word alignment training part 1 – newswire and web. LDC2014T05, 2014.

[36] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. Yago3: A knowledge base

from multilingual wikipedias. 2015.

[37] James Mayfield, Dawn Lawrie, Paul McNamee, and Douglas W. Oard. Building a cross-language entity linking collection in twenty-one languages. In Pamela Forner, Julio Gonzalo, Jaana Kekäläinen, Mounia Lalmas, and Maarten de Rijke, editors, Multilingual and Multimodal Information Access Evaluation: Second International Conference of the Cross-Language Evaluation Forum, volume 6941 of Lecture Notes in Computer Science, pages 3–13. Springer, 2011.

[38] Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W Oard, and David S Doermann.

Cross-language entity linking. In IJCNLP, pages 255–263, 2011.

[39] Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W. Oard, and David S. Doermann.

Cross-language entity linking. In Fifth International Joint Conference on Natural Language

Processing, IJCNLP 2011, Chiang Mai, Thailand, November 8-13, 2011, pages 255–263,

2011.

[40] Pablo N. Mendes, Max Jakob, Andres Garcia-Silva, and Christian Bizer. Dbpedia spotlight:

Shedding light on the web of documents. In Proceedings of the 7th International Conference

on Semantic Systems (I-Semantics), 2011.

[41] David Milne and Ian H. Witten. Learning to link with wikipedia. In Proceedings of the 17th

ACM Conference on Information and Knowledge Management, CIKM ’08, pages 509–518,

New York, NY, USA, 2008. ACM.

[42] Will Monroe, Spence Green, and Christopher D. Manning. Word segmentation of informal

arabic with domain adaptation. In Proceedings of the 52nd Annual Meeting of the Association

for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume

2: Short Papers, pages 206–211, 2014.

[43] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity Linking meets Word Sense

Disambiguation: a Unified Approach. Transactions of the Association for Computational

Linguistics (TACL), 2:231–244, 2014.

[44] Preslav Nakov and Jorg Tiedemann. Combining word-level and character-level models for

machine translation between closely-related languages. In Proceedings of the 50th Annual

Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL

’12, pages 301–305, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[45] Roberto Navigli and Simone Paolo Ponzetto. Babelnet: The automatic construction,

evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell.,

193:217–250, December 2012.

[46] Roberto Navigli and Simone Paolo Ponzetto. Babelnet: The automatic construction,

evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell.,

193:217–250, December 2012.

[47] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment

models. Computational Linguistics, 29(1):19–51, 2003.

[48] Daniel Ortiz-Martınez and Francisco Casacuberta. The new thot toolkit for fully automatic

and interactive statistical machine translation. In Proc. of the European Association for

Computational Linguistics (EACL): System Demonstrations, pages 45–48, Gothenburg,

Sweden, April 2014.

[49] Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander,

Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M Roth. Madamira: A fast,

comprehensive tool for morphological analysis and disambiguation of arabic. Proceedings of

the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, 2014.

[50] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and global algorithms for

disambiguation to wikipedia. In Proceedings of the 49th Annual Meeting of the Association

for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages

1375–1384, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[51] Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. #microposts2015 – 5th workshop on

’making sense of microposts’: Big things come in small packages. In Proceedings of the 24th

International Conference on World Wide Web Companion, WWW ’15 Companion, pages

1551–1552, Republic and Canton of Geneva, Switzerland, 2015. International World Wide

Web Conferences Steering Committee.

[52] Khaled Shaalan. A survey of arabic named entity recognition and classification. Computa-

tional Linguistics, 40(2):469–510, 2014.

[53] Nakatani Shuyo. Language detection library for java, 2010.

[54] Valentin I. Spitkovsky and Angel X. Chang. A cross-lingual dictionary for english wikipedia concepts. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).

[55] Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva, and Erik van der Goot.

Jrc-names: A freely available, highly multilingual named entity resource. In Proceedings of

the International Conference Recent Advances in Natural Language Processing 2011, pages

104–110, Hissar, Bulgaria, September 2011. RANLP 2011 Organising Committee.

[56] Tao Tao, Su-Youn Yoon, Andrew Fister, Richard Sproat, and ChengXiang Zhai. Unsupervised

named entity transliteration using temporal and phonetic correlation. In Proceedings of the

2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pages

250–257, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

[57] Jean Tavernier, Rosa Cowan, and Michelle Vanni. Holy moses! leveraging existing tools and

resources for entity translation. In Proceedings of the International Conference on Language

Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco, 2008.

[58] Tomasz Tylenda, Mauro Sozio, and Gerhard Weikum. Einstein: physicist or vegetarian?

summarizing semantic type graphs for knowledge discovery. In Proceedings of the 20th

International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 -

April 1, 2011 (Companion Volume), pages 273–276, 2011.

[59] Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Michael Roder, Daniel Gerber, Sandro Athaide

Coelho, Soren Auer, and Andreas Both. AGDISTIS - agnostic disambiguation of named

entities using linked open data. In ECAI 2014 - 21st European Conference on Artificial

Intelligence, 18-22 August 2014, Prague, Czech Republic - Including Prestigious Applications

of Intelligent Systems (PAIS 2014), pages 1113–1114, 2014.

[60] Ricardo Usbeck, Michael Roder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brummer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, Rene Speck, Raphael Troncy, Jorg Waitelonis, and Lars Wesemann. GERBIL - General entity annotator benchmarking framework. In WWW 2015, 24th International World Wide Web Conference, May 18-22, 2015, Florence, Italy, 2015.

[61] Marieke van Erp, Giuseppe Rizzo, and Raphael Troncy. Learning with the web: Spotting

named entities on the intersection of NERD and machine learning. In Proceedings of the

Concept Extraction Challenge at the Workshop on ’Making Sense of Microposts’, Rio de

Janeiro, Brazil, May 13, 2013, pages 27–30, 2013.

[62] Gerhard Weikum, Johannes Hoffart, Ndapandula Nakashole, Marc Spaniol, Fabian M

Suchanek, and Mohamed Amir Yosef. Big data methods for computational linguistics. IEEE

Data Eng. Bull., 35(3):46–64, 2012.

[63] Jonathan Wright, Kira Griffitt, Joe Ellis, Stephanie Strassel, and Brendan Callahan. Annotation trees: Ldc’s customizable, extensible, scalable, annotation infrastructure. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 479–485, 2012.

[64] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. HYENA: Hierarchical Type Classification for Entity Names. In Proc. of the 24th Intl. Conference on Computational Linguistics (Coling 2012), December 8-15, Mumbai, India, pages 1361–1370, 2012.

[65] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pages 133–138, 2013.

[66] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, and Gerhard Weikum.

AIDA: an online tool for accurate disambiguation of named entities in text and tables.

PVLDB, 4(12):1450–1453, 2011.

[67] Mohamed Amir Yosef, Marc Spaniol, and Gerhard Weikum. AIDArabic: A named-entity disambiguation framework for Arabic text. In The EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP 2014), pages 187–195, Doha, Qatar, 2014. ACL.

Appendix A

Manual Assessment Interface

The following figures show the manual assessment web interface.

Figure A.1: Manual Assessment: Welcome page with instructions and steps video

Figure A.2: Manual Assessment: Data evaluation page. Each English name has at most three translations, plus “None” and “Don’t know” choices.
