Applied Linguistics Sprach- und Literaturwissenschaften Fachbereich 10

A brief introduction to the GeM annotation schema for complex document layout

John Bateman, Renate Henschel, Judy Delin

talk given by: Guowen Yang

Taipei, September 2002

Overview of Talk

Orientation:

– describing the approach to annotation of multimodal documents developed in the GeM project

• What is the GeM project?

– goals, methods, requirements

• Summary of annotation problems raised

• Annotation solutions adopted

The GeM project (‘Genre and Multimodality’)

– supported by the British Economic and Social Research Council (ESRC)

– Cooperation:

• University of Stirling

• University of Bremen

• Enterprise Information Design Unit

– Goal: to put the description of multimodal page-based documents on a sound empirical footing

The problem of data

– there is now much theorizing about how multimodal documents work

– but the empirical basis of this theorizing is often less than strong

Basic GeM hypotheses

– Documents belonging to different kinds of ‘genres’ will exhibit different kinds of multimodal patterns, just as text types exhibit different lexicogrammatical patterns

– It should be possible to map out these patterns for different genres

– There should be a regular relationship between genre-type and the patterns found

Requirement

– An annotated corpus needed to be constructed, containing the extra information that we know/expect to be most useful in establishing descriptions of multimodal documents

– The extra information is then to serve as the basis for generalizations about genre

The problem of data selection and description

– what kinds of documents are we talking about?

– what kinds of annotation do we need?

– what kinds of documents are we talking about?

Any page-based medium which combines information from a variety of modalities in order to get its message across

Initial genres selected for the GeM corpus

– field guides (birds)

– instruction manuals (telephones)

– print newspapers

– electronic web-based versions of newspapers

Field guides

Instruction manuals

Print newspapers

Web-based newspapers

Motivations for selections

– all contain combinations of graphical, textual, and photographic material

– all use the layout of these elements in complex ways

– for all the documents taken we were able to obtain feedback and discussion from their designers

Some Relations to Natural Language Processing

– it is our belief that we can approach the design and function of these documents using established linguistic techniques

– the ‘unit of analysis’ is scaled up from the sentence or the text to the page (at least)

– given a formal specification of the motivation and realization of such documents, we can consider their automatic generation

– what kinds of annotation do we need?

The GeM annotation layers

• Content structure

• Rhetorical structure

• Layout structure

• Navigation structure

• Linguistic structure

[Diagram labels beside the list: genre, form]

Practical information required

– the GeM model also takes seriously the notion that the concrete, practical conditions of production (technology, material, time available, etc.) all contribute substantially to the properties of a genre

• Canvas constraints

• Production constraints

• Consumption constraints

Requirements

– From the GeM perspective, a page-based multimodal document requires analysis at (at least) these levels, taking into account the sources of constraint identified.

– Only then do we have enough information to consider:

• motivation of design

• critique of design and communicative effectiveness

• repurposing

• automatic generation

Pointers

– The assumptions made, and the particular layers of analysis adopted, are motivated and introduced at length in:

• Delin/Bateman/Allen: Information Design Journal

• Delin/Bateman: Document Design

– Details on the website http://purl.org/net/gem

Overview of Talk

Orientation:

– describing the approach to annotation of multimodal documents developed in the GeM project

• What is the GeM project?

– goals, methods, requirements

• Summary of annotation problems raised

• Annotation solutions adopted

Summary of annotation problems raised

– form of annotation to select

– criteria for recognising units

– multiple non-isomorphic intersecting hierarchies

– non-linear information

– complex query requirements

Annotation solutions adopted

– form of annotation to select

TEI: Text Encoding Initiative

CES: Corpus Encoding Standard

XCES: XML version of CES

GeM annotation scheme

Annotation solutions adopted

– criteria for recognising units

• basic vocabulary of the page: images, signs, sentences, numbers, ...

• layout units: hierarchy determined visually and by considering the degree to which elements ‘belong together’

• rhetorical structure: traditional analysis according to Mann & Thompson’s Rhetorical Structure Theory (RST)

• navigation units: elements pointing elsewhere in the document

Annotation solutions adopted

– multiple non-isomorphic intersecting hierarchies

• stand-off annotation...

XML stand-off annotation for encoding the GeM layers

• a single ‘base’ element annotated file

• several ‘stand-off’ layers of annotation

• a Document Type Definition (DTD) for each layer of annotation

• each annotation layer corresponds to a GeM analysis layer
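One practical consequence of keeping the layers in separate files is that every cross-reference in a stand-off layer has to resolve to an element in the base file. The sketch below checks this with Python's standard library. The `<base>` and `<layout>` wrapper elements, the `lay-broken` unit, and the function name `dangling_xrefs` are assumptions made for illustration; the element and attribute names `unit`, `layout-unit`, `id` and `xref` follow the examples in this talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature files; the wrapper elements are assumptions.
BASE_XML = """
<base>
  <unit id="u-21.10">Plunges spectacularly for fish.</unit>
  <unit id="u-21.11">Sexes similar.</unit>
</base>
"""

LAYOUT_XML = """
<layout>
  <layout-unit id="lay-flegg-text" xref="u-21.10 u-21.11"/>
  <layout-unit id="lay-broken" xref="u-21.99"/>
</layout>
"""

def dangling_xrefs(base_xml, layer_xml):
    """Map each stand-off element id to the xref targets missing from the base file."""
    base_ids = {unit.get("id") for unit in ET.fromstring(base_xml).iter("unit")}
    problems = {}
    for elem in ET.fromstring(layer_xml).iter():
        # xref holds a whitespace-separated list of base unit ids
        missing = [t for t in elem.get("xref", "").split() if t not in base_ids]
        if missing:
            problems[elem.get("id")] = missing
    return problems

print(dangling_xrefs(BASE_XML, LAYOUT_XML))
```

A check of this kind can be run per layer, since each layer refers to the base file independently of the others.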

GeM layers: the base file

<unit id="u-21.5">---------------</unit>
<unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
<unit id="u-21.7"> Huge (90cm) unmistakable seabird. </unit>
<unit id="u-21.8"> Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. </unit>
<unit id="u-21.9"> In summer, yellow head of adult inconspicuous. </unit>
<unit id="u-21.10"> Plunges spectacularly for fish. </unit>
<unit id="u-21.11"> Sexes similar. </unit>

Basic ‘vocabulary’ of the page, segmented and numbered.

Actual ordering and positioning on the page irrelevant at this stage.

Predominantly ‘flat’ structure.
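Because the base file is a flat list of numbered units, it can be read into a simple lookup table. The following is a minimal sketch using Python's standard library; the `<base>` wrapper element and the function name `load_base_units` are assumptions, and treating any unit with a `src` attribute as an image is an inference from the gannet example rather than a rule stated in the talk.

```python
import xml.etree.ElementTree as ET

# The <base> wrapper element is an assumption; the slide shows only the units.
BASE_XML = """
<base>
  <unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
  <unit id="u-21.7"> Huge (90cm) unmistakable seabird. </unit>
  <unit id="u-21.11"> Sexes similar. </unit>
</base>
"""

def load_base_units(xml_text):
    """Return {unit id: (kind, value)}, keeping image units and text units apart."""
    units = {}
    for unit in ET.fromstring(xml_text).iter("unit"):
        if unit.get("src") is not None:
            # Assumed convention: a src attribute marks an image unit
            units[unit.get("id")] = ("image", unit.get("src"))
        else:
            units[unit.get("id")] = ("text", (unit.text or "").strip())
    return units

base_units = load_base_units(BASE_XML)
for unit_id, (kind, value) in base_units.items():
    print(unit_id, kind, value)
```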

Distribution of information across layers

[Figure labels: base units; Layout; Semantic Content; RST segments; navigational elements; layout units]

Example: Derivation of layout structure

• Working visually from the page, decompose the objects on the page in terms of their visual unity

• Transform the page decomposition into a hierarchical structure

• Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.

Complete layout structure

• provides a place for assigning specific information about the layout units

• contents given by collections of the base units of the page

GeM layers: layout units (1)

Layout unit content is defined by cross-references (xrefs) to base units

Content here is not formally used and may be omitted

<layout-unit id="lay-flegg-text" xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11">
Huge (90cm) unmistakable seabird. Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. In summer, yellow head of adult inconspicuous. Plunges spectacularly for fish. Sexes similar.
</layout-unit>
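Since the text inside a layout unit is only a convenience copy, the authoritative content can be rebuilt by following the `xref` list back to the base file. This sketch shows one way to do that; the `<base>` wrapper and the function name `resolve_content` are assumptions, and joining the units with single spaces is an illustrative choice.

```python
import xml.etree.ElementTree as ET

# Base units copied from the gannet example; the <base> wrapper is an assumption.
BASE_XML = """
<base>
  <unit id="u-21.7"> Huge (90cm) unmistakable seabird. </unit>
  <unit id="u-21.8"> Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. </unit>
  <unit id="u-21.9"> In summer, yellow head of adult inconspicuous. </unit>
  <unit id="u-21.10"> Plunges spectacularly for fish. </unit>
  <unit id="u-21.11"> Sexes similar. </unit>
</base>
"""

# The layout unit is given without inline content, which the slide says may be omitted.
LAYOUT_UNIT_XML = (
    '<layout-unit id="lay-flegg-text" '
    'xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11"/>'
)

def resolve_content(layout_unit, base_root):
    """Rebuild a layout unit's text by following its xrefs into the base file."""
    text_by_id = {u.get("id"): (u.text or "").strip() for u in base_root.iter("unit")}
    return " ".join(text_by_id[ref] for ref in layout_unit.get("xref").split())

resolved = resolve_content(ET.fromstring(LAYOUT_UNIT_XML), ET.fromstring(BASE_XML))
print(resolved)
```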

GeM layers: layout units (2)

Layout units contain typographical details common to the unit and its children

Layout units again identified via cross-references

Typographical information modelled on CSS and XSL-FO

<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20"
      font-family="sans-serif" font-size="10" font-style="normal"
      font-weight="bold" case="mixed" justification="right" color="black"/>
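A single `<text>` element states the typography for several layout units at once, so a reader of the annotation typically wants it expanded into one style record per unit. The sketch below does this for the element shown on the slide; the function name `typography_by_layout_unit` is hypothetical, and inheritance of these properties by child units is not modelled here.

```python
import xml.etree.ElementTree as ET

# The <text> element from the slide, reflowed over several lines.
TEXT_XML = """
<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20"
      font-family="sans-serif" font-size="10" font-style="normal"
      font-weight="bold" case="mixed" justification="right" color="black"/>
"""

def typography_by_layout_unit(text_elem):
    """Expand one typography element into {layout unit id: style dict}."""
    # Every attribute except the cross-reference list is a style property
    style = {name: value for name, value in text_elem.attrib.items() if name != "xref"}
    return {ref: dict(style) for ref in text_elem.get("xref", "").split()}

styles = typography_by_layout_unit(ET.fromstring(TEXT_XML))
print(styles["lay-21.14"]["font-weight"])
```

Each layout unit receives its own copy of the style dictionary, so later per-unit overrides would not leak into sibling units.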

GeM layers: layout units (3)

<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>

Layout structure is recursive

page-21
  header-21
  body-21
    lay-21.2
    lay-21.3
  page-no-21
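The recursive structure lends itself to a recursive traversal. As a minimal sketch, the function below (its name `outline` is hypothetical) walks the layout tree from the slide and produces an indented outline, labelling chunks by their `id` and leaves by their `xref`.

```python
import xml.etree.ElementTree as ET

# Layout structure copied from the slide.
LAYOUT_XML = """
<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>
"""

def outline(node, depth=0):
    """Return the layout tree as indented lines, one per node, in document order."""
    # Roots and chunks carry an id; leaves point at a layout unit via xref
    label = node.get("id") or node.get("xref")
    lines = ["  " * depth + label]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(LAYOUT_XML))
print("\n".join(tree_lines))
```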

Annotation solutions adopted

– non-linear information

• positioning of layout units within a page is specified two-dimensionally with respect to a generalized page model

• the page model decomposes the page area into a hierarchy of grids

• specifying the grid for a page is part of the annotation task.

Example: Derivation of layout structure

• Working visually from the page, decompose the objects on the page in terms of their visual unity

• Transform the page decomposition into a hierarchical structure

• Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.

• Inspect the page for any local or global grid structure

• Relate layout units to grid positions

Complete layout structure + page model

• each sub-tree can additionally be assigned to a position in a hierarchically ordered page grid

GeM layers: area model

Layout units are related to identified elements of a hierarchical grid specified in the area model

[Figure: page frame, 14cm by 16cm, with dimension labels 10%, 85% and 5%]

<area-root id="page-frame" cols="1" rows="3" hspacing="100" vspacing="10 85 5" height="16cm" width="14cm">
  <sub-area id="body-frame" location="row-2" cols="2" rows="1" hspacing="50 50" vspacing="100"/>
</area-root>

<layout-root id="page-21">
  <layout-leaf xref="header-21" location="row-1" area-ref="page-frame"/>
  ...
</layout-root>
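To make the grid concrete, the sketch below turns the area model into physical measurements. It rests on several assumptions that the slide does not state outright: `vspacing` and `hspacing` are read as percentages of the enclosing height and width, a `location` of the form `row-N` is taken to be 1-based, and the sub-area is assumed to fill its whole row. The helper names `parse_cm` and `split_length` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Area model copied from the slide.
AREA_XML = """
<area-root id="page-frame" cols="1" rows="3" hspacing="100" vspacing="10 85 5"
           height="16cm" width="14cm">
  <sub-area id="body-frame" location="row-2" cols="2" rows="1"
            hspacing="50 50" vspacing="100"/>
</area-root>
"""

def parse_cm(value):
    """Convert a length such as '16cm' to a float (only 'cm' is handled here)."""
    if not value.endswith("cm"):
        raise ValueError(f"unsupported length: {value!r}")
    return float(value[:-2])

def split_length(total, spacing):
    """Split a length into consecutive (offset, size) spans from percentages."""
    spans, offset = [], 0.0
    for pct in (float(p) for p in spacing.split()):
        size = total * pct / 100
        spans.append((offset, size))
        offset += size
    return spans

root = ET.fromstring(AREA_XML)
rows = split_length(parse_cm(root.get("height")), root.get("vspacing"))

# Assumption: "row-2" names the second row of the parent grid (1-based)
body = root.find("sub-area")
row_index = int(body.get("location").split("-")[1]) - 1
body_top, body_height = rows[row_index]

# Assumption: the sub-area spans the full page width, split by its own hspacing
columns = split_length(parse_cm(root.get("width")), body.get("hspacing"))

for left, width in columns:
    print(f"column at x={left:.1f}cm, y={body_top:.1f}cm: "
          f"{width:.1f}cm x {body_height:.1f}cm")
```

Under these assumptions the body frame is the tall middle band of the page, divided into two equal columns.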

Annotation solutions adopted

– complex query requirements

• XPath queries using standard tools
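As an illustration of the kind of query involved, the sketch below runs two path expressions over a small sample with Python's standard library, whose `findall` implements only a subset of XPath; a full XPath engine would be needed for more complex queries. Placing base units and layout structure under one `<page>` element is an assumption made to keep the example self-contained, since the GeM layers live in separate files.

```python
import xml.etree.ElementTree as ET

# Hypothetical combined sample; the <page> wrapper is an assumption.
CORPUS_XML = """
<page>
  <unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
  <unit id="u-21.7">Huge (90cm) unmistakable seabird.</unit>
  <layout-root id="page-21">
    <layout-leaf xref="header-21"/>
    <layout-chunk id="body-21">
      <layout-leaf xref="lay-21.2"/>
      <layout-leaf xref="lay-21.3"/>
    </layout-chunk>
  </layout-root>
</page>
"""

root = ET.fromstring(CORPUS_XML)

# All base units that carry a src attribute (the image units in this sample)
image_ids = [unit.get("id") for unit in root.findall(".//unit[@src]")]

# All leaves directly inside the chunk whose id is body-21
body_leaves = [
    leaf.get("xref")
    for leaf in root.findall(".//layout-chunk[@id='body-21']/layout-leaf")
]

print(image_ids)
print(body_leaves)
```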

Conclusions

• The annotation scheme allows detailed annotation of complex page-based documents

• Regularities can be sought using complex XPath queries

• The system is open-ended and extensible without any redefinition of existing resources

Ongoing Work

• Further collection and ongoing annotation of corpus

– http://purl.org/net/gem

• Use of results for criticism of document design and for exploring the relation between layout and rhetorical structure

– Delin/Bateman: Document Design, 2002

• Use of XPath queries within sequences of Extensible Stylesheet Language Transformations (XSLT) for automatic document generation

– Henschel/Bateman/Delin: KONVENS 2002, Saarbrücken

Future Work

• Extension of annotation schemes

– Current violations of the grid area model are handled by relative offsets; a more flexible approach is needed

– non-rectilinear grids for more complex design

– consideration of dynamic elements, animation, etc.

• Extension of genres considered

– advertisements

– scientific documents

• Extension of languages considered

Thank you!