21
INVESTIGATING TYPE-TOKEN REGRESSION AND ITS POTENTIAL FOR AUTOMATED TEXT DISCRIMINAT1C)N Pascua1 Cantos Gómez Departaniento de Filología Inglesa Universidad de Murcia ARSI'KACT ihe rttot1i.atio17 of'iltl~ pre.~e~~t puper ;.S hhraetl 017 th~, i17rl1iiior7 tliat th~. sole IISL, of'rlirlo ori Il,,\-/cal den.sih r~~lir/ii,iz io tcvr surrrples of'i.urioirs Itrr7gr1agi~. oirthors. lirigiristii. ilorrrcrir7~. etc., rrriseht o poteritiul iirclicator,for ~rrrronlateil tci;t rlrscririiirtirlio~i. 111 ortler lo look f0r o r~~lt~rhl~~ cir7d i.crlid I~XIL~LII i/~~ll.Yih ir7rlev. ii,~~ sholl rei.ieii, rrr?<l rlarif~. thc ir~crthertrair~~irl r~~>lirtiort.ship ~CIII,LVI~ hpes (iror~IfOr17rs) or7dioke17~ (iv»rtis/ h\. (I~SCI~SSII~~~ cr~7dco17sir11~~ti~i~~ ~ICIL>~/IIIIIC i.e,qrl,sr 1017 r~~ocll~ls ihcrt rtright hldp to tliljcr~~r7titrtr tc.vt hpq fr~iiir etrch oth~~r. .-Ir/tlitior~olli~. ii~~~.sh~rll Irse n111ltii~irriotr siotr.stic.al rilodi~ls ~i~lirster crr7tl%Ti.s iit7il CII~crir71ir1trr~i,/irrictior1 ar70l~lci.s/ 10 C~I~IP/L>IIICI~/ /he r17athenruticd Ie.~'sii.irl iler7sih. rr~gri~ssiori 17rodc~l (7'l"l'-/orrrlir/tr/. K1:Y WORDS: ~.orplis Irrig~ri.stics. !ip~~-tokcr7 regres.tioi1, le~l hpolo~.. ~rirtoriratetlte\-t clas.~ificirtior7. KI:SLlMC.N La iirotii~erci(irr tI~~Iprese~it~~ articlrlo ~I(IL.~ tic> 10 1~71rriciór7 C/L' 911e la solii ~rtili-ocióri di, 10 densidudll;~ico 11ij nr~ratrcrs tevtr~oles ~L>~CJIC>C~~I~~L>S tr [lifL~r.~>r7t~~.s idiorrras, IIIIIOI.L~,S, dorrrir7r0~ lirigiiísti~os, elc., P I I C ~ ~ .sL~~/?o~~~~cI«IIIIL>~~~~ i.álido ptrru ~lircrirrrrr7ur textos (12 forrrr« trr~torrrii/~c-t~. C'OII ('1 firi de CIIL.OJ~~.~~ 1111 ~I~LIICLJ de der7.sidad Il;.~rca i~cíli&?~,/iahlc. he~t~o.s i.~~i.islrtIo J. ~~I~rrific~~lo lo relo~.r(ir7 r~,atcrrráiictr eritre tipos (fornia<) J. tokern (polahros), puro ~.ori.stririr i77ocI~lo.s tle r~egresi(iii U ~ C L ~ I ~ ~ ~ ~ O S ~/IJI> r70.s pi~r111itirr7 (/i<lir7g1rir lipox d;> testo.~. Por t~fi(iiIidi~ri~. h~~rllos Itecho IISO dl, rr1ot1~lo.v ~~.~ad;sticos rr1irItii~~1riu~7ti.s (a17illrsi.s tle coriplorrreriidor J. ariálisrs discrirrririar7tc/ c,or7 el firi r.orrr/~letrrerrtcir.j. optir1ri:ur el rr7od1,lo rrrot~,irrútico di, regrrsión perro 10 rler7siikr~l ldxi~.o (lo /Omi~ir/u TI.T/. PA1,AHRAS CLAVE: lir7ji1risiictr tlel L.orplrs. r-qre.viciri rle tipos (forrrra.s/ toker7.s (piilcrhrirs/. ti~>ologiir de re.vto.s. cltrs~rcrr~~i~jri alrtorrrático c/i, te.~tos. C'1rnt1ernos de Filolopíir Ir7gIi~vir. 9( 1 ). 2000. pp. 7 1-9 1

INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

INVESTIGATING TYPE-TOKEN REGRESSION AND ITS POTENTIAL FOR AUTOMATED TEXT DISCRIMINAT1C)N

Pascua1 Cantos Gómez Departaniento de Filología Inglesa

Universidad de Murcia

ARSI'KACT

ihe rttot1i.atio17 of'iltl~ pre.~e~~t puper ;.S hhraetl 017 t h ~ , i17rl1iiior7 tliat th~. sole IISL, of'rlirlo ori Il,,\-/cal den.sih r~~lir/ii,iz io tcvr surrrples of'i.urioirs Itrr7gr1agi~. oirthors. lirigiristii. ilorrrcrir7~. etc., rrriseht o poteritiul iirclicator,for ~rrrronlateil tci;t rlrscririiirtirlio~i. 111 ortler lo look f0r o r ~ ~ l t ~ r h l ~ ~ cir7d i.crlid I ~ X I L ~ L I I i/~~ll.Yih ir7rlev. i i , ~ ~ sholl rei.ieii, rrr?<l rlarif~. thc ir~crthertrair~~irl r~~>lirtiort.ship ~ C I I I , L V I ~ hpes (iror~IfOr17rs) or7dioke17~ (iv»rtis/ h\. ( I ~ S C I ~ S S I I ~ ~ ~ cr~7dco17sir11~~ti~i~~ ~ I C I L > ~ / I I I I I C i.e,qrl,sr 1017 r~~ocll~ls ihcrt rtright hldp to tliljcr~~r7titrtr tc.vt h p q fr~ i i i r etrch o th~~r . .-Ir/tlitior~olli~. i i~~~.sh~rll Irse n111ltii~irriotr siotr.stic.al rilodi~ls ~i~lirster crr7tl%Ti.s iit7il CII~crir71ir1trr~i,/irrictior1 ar70l~lci.s/ 10 C ~ I ~ I P / L > I I I C I ~ / /he r17athenruticd Ie.~'sii.irl iler7sih. rr~gri~ssiori 17rodc~l (7'l"l'-/orrrlir/tr/.

K1:Y WORDS: ~.orplis Irrig~ri.stics. !ip~~-tokcr7 regres.tioi1, l e ~ l h p o l o ~ . . ~rirtoriratetlte\-t clas.~ificirtior7.

KI:SLlMC.N

La iirotii~erci(irr t I ~ ~ I p r e s e ~ i t ~ ~ articlrlo ~ I ( I L . ~ tic> 10 1~71rriciór7 C/L' 911e la solii ~rtili-ocióri di, 10 densidudll;~ico 11ij nr~ratrcrs tevtr~oles ~ L > ~ C J I C > C ~ ~ I ~ ~ L > S tr [lifL~r.~>r7t~~.s idiorrras, IIIIIOI.L~,S, dorrrir7r0~ lirigiiísti~os, elc., P I I C ~ ~ . s L ~ ~ / ? o ~ ~ ~ ~ c I « I I I I L > ~ ~ ~ ~ i.álido ptrru ~lircrirrrrr7ur textos (12 forrrr« trr~torrrii/~c-t~. C'OII ('1 firi de C I I L . O J ~ ~ . ~ ~ 1111 ~ I ~ L I I C L J de der7.sidad Il;.~rca i~cíli&?~,/iahlc. he~t~o.s i.~~i.islrtIo J. ~ ~ I ~ r r i f i c ~ ~ l o lo relo~.r(ir7 r~,atcrrráiictr eritre tipos (fornia<) J. tokern (polahros), puro ~.ori.stririr i77ocI~lo.s tle r~egresi(iii U ~ C L ~ I ~ ~ ~ ~ O S ~ / I J I > r70.s pi~r111itirr7 (/i<lir7g1rir lipox d;> testo.~. Por t~fi(iiIidi~ri~. h~~rllos Itecho I I S O dl, rr1ot1~lo.v ~~.~ad;st icos rr1irItii~~1riu~7ti.s (a17illrsi.s tle coriplorrreriidor J. ariálisrs discrirrririar7tc/ c,or7 el firi r.orrr/~letrrerrtcir.j. optir1ri:ur el rr7od1,lo rrrot~,irrútico di, regrrsión perro 10 rler7siikr~l ldxi~.o (lo /Omi~ir/u TI.T/.

PA1,AHRAS CLAVE: lir7ji1risiictr tlel L.orplrs. r-qre.viciri rle tipos (forrrra.s/ toker7.s (piilcrhrirs/. ti~>ologiir de re.vto.s. cltrs~rcrr~~i~jri alrtorrrático c/i, te.~tos.

C'1rnt1ernos de Filolopíir Ir7gIi~vir. 9( 1 ). 2000. pp. 7 1-9 1

Page 2: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Pascual Cantos Góinrz

l . INTRODUCTION

Simple extracts given froni frequenc) lists oiiIy shon the entries for individual \\ords. Most

frequency software also produces usefiil totals and sometimes offers a range of statistics based

on then~. The most comnlon totals calculated for word frequency lists are usually reikrred to as

total tokens aild total types and it is important to understand the distinction between them.

In this context. a tolieil is an individual occurrence oi'any word form. The paragraph:

"Linglli.s/s nirlj. \i,onder i i ~ h j ~ ~ h e j ' necd .r/clíis/ic.s. The donzintrri~ ~ h e o i . e / i ~ u l , f i ~ ~ ~ ~ ~ i e ~ i ~ o r k in /kc,ficld I ~ L I I of gcno.cl/i~v grclnznzri~~. h~1.s LIS i1.s prjn1~1q' U ' L I / ~ I - . Y O I I ~ C ~ ,i~~dgt>nze~i/s r1ho11/ /he ii~ell-

, f0r1izctlncs.s 0f'sen/er7ce.s. The.ve ,jirdgerne17/s ~l.sirrll(, c.oilic fi .ci l i i

li~ig~:l,i.rí.\ íhcnl.vcl\~ev, urc eiíher.-or u'ccisior7s. rlricl rrl~lle /o /he lungur~gc rrhiliíj' of'un ideul I ~ L I I ~ V C .spe~lker j ~ i ( I honiogenco~~.~ .speccli co111niur7iíj'".

contains altogether 56 words or tokens. but these represent only 48 different word forn~s or types.

The fiequencl- list shows the iiurnber OS tokens found for each type. In this case. the following

5 types have more than one token:

the 4 of ?

in 3 judgernents 3 linguists 2

Retween them. these types account for 1-3 of the tokens. The otlier 43 types occur only

once and make up the overall total os56 tokens.

The distribution oi'tokens between the types in a text can provide a useful measure of the

degree of lexical \-ariety within it. and may even provide n starting-point for examining lexical

differences between difkrent types of text. S~J-les. authors. etc. Several statistics can be calculated

from the information contained in the list. The sirnplest is the ratio ol'tohens and types. in other

words. the mean frequency of each different uord form. In the case of the paragraph used above.

this is 56 143 = 1.3. This indes ( 1.3) indicates thnt each word Sorm or type occurs on aL2erage 1.3

times. Similarly. the reverse can be illustrated. that is. once the amouilt of types and tokens

relative OS a text sample are known. we can calculate its lexical diversit~ or lexical density by

dividing the total nunlber oi'types by the total number oi'tokens: 4.3 156 = 0.76. If \ve eventually

nlultiply this quotient b! 100. then we get the mean percentage of dillkrent t!.pes per one hundred

lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type-

token ratio. are not ver). signilicant and reliable. This is obviousl! caiised by the sniallness of the

test sample (jiist 56 tokens or words). 1,onger texts. such as foiir 01. Joseph Conrand's novels

Cuu(1c.r.110~ Filologicr Itlglrc.[r. 9( I ). 2000. pp. 71-9 1

Page 3: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

In\.estigating 'l'ype-1-oken Kegression and i t s Poteritial for Auioinated 'Text Disci.iiiiination 77

(Siggo, of Ilie il~'~~r-c.i.s.si~.v '. Lor-d.lir?l. Hetnl (?f'Dtn.k17e.s.s and T11e Secrel Age~ll). result in a token-

type ratio of 15.33 and the type-token ratio of6.56%. These figures or ratios are affected by the

overall number oftokens and types in the four novels (271.056 tokens and 17.795 types).

1 Iowever. the reliability ot'the token-type and type-token ratio as quantitative indicators

of'lesical diversity or lesical density are constrained because of their dependence o11 text size - while test leiigth (tokens) is theoreticalIy unliniited. the number oi'different words in use (types)

in a language is finite (Holn-ies 1994: 92). That is. while any linguistic corpus increases IinearIy

in tokens in a completely regular or stable shape. its increase iii types -though close to that of the

tokens at the beginning- stai-ts declining tlie more the corpus prows. as it contributes íewer new

types. The cumulative tokens are distributed linearly. wllile the ciimulative types are distribiited

curvilineariy (Biber 1993: 350: see Fig. 1) .

Consequeiitly. to overcome this reliabilih problem of the tokeii-type and type-token

ratio aiid compare tests or corpora with respect to their lexical density. the texts to be compared must be based o n sanlples of the same size. disregarding the total length of the text or corpora.

This ensures that the cornparisons based oii the token-type andior token-type indices beconie somehon useful and relevant (Biber 1988: 338-0).

In what folloa-s. we shall try to overcon~e this apparent reliability problem ofthe token- type relationship not b> nieans ofequalising the text samples to the same nuniber of tokens biit

by nieans of investigating and determining the hyperbolic fiinction o f s p e s relative to different text samples (linguistic domains. authors. etc.). We are confident that the type-token functioii

is a positive indicator Ior discriminating text samples. in the belief that the non-linear g r o ~ tli oí' types is idios>.ncratic and to some extent ~inique. depending heavily on the topics.

authorship. etc.. oi'tlie various text samples.

C'i~rrderrios cle b ilolo,yio Ir~lcso. 9( 1 ). 2000. pp. 7 1 -9 1

Page 4: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

74 Pasc~ial Cantos Cióinez

Ir. TYPE-TOKEN RELATIONSE-IIP

Our aim is to look tor a stable text independent index that determines tlle Qpe-token relatioilship. The problem. as already outlined. is that tokens increase linearly and types do so in a curvilinear way (Fig. 1).

Regarding the increasing rate of word forms (types). Heaps (1978) reported that the tollowing expression is true tbr a general English text ofup to at least 20.000 xvords. wl-iere B

(types) is related to the total number of tokens Ay by an equation relative to the waj. the text

length increases: D = k>\' hence. log D = h log A- 4 log k

and where k and h are constants that depend on the particular text sample. He emphasized tlle

linear relation between log D and log~Vas taking common logarithms oí'both sides o f B = k6'.

respectivelj . The purpose of his research was to create and manage index tiles et'iicientlj. for

document retrieval. This explains why he experimented on a collection of title words oE

documents rather than on general English text (or corpora). Nevertheless. he did not give anj explanation about how- the equation was derived.

Note that Heaps just insisted that the expression above is true for general English text

oí' '.iip to at least 20.000 words" rather than for texts of any size. This implies that tlle dependent constants or the expression itself might change as the corpus size greatlj- grow-s. In

other words. even if we were to tind a function that fits the given data (corpus). tllere is no certainty that the function would always hold.

A positive contribution to this issue can be found in Sánchez and Cantos (1997). These authors offer a detailed explanation on the type and lemma growhs based on the observations

oí' the ('CíIjfBRE C'orp11.r (Corpus oí' Contemporary Spanish). They concluded that tokens

represent a linear function (,, = '11) and types a kind ol'hyperbolic function (v = c i h ) . The

calculation of the slope LI is straightforward. We just need a snlall sample and by means of any

available concordance progranl get the overall tokens and types. For instante. if we assume .u

to be the tokens a n d j ~ to be the types. we get. by simply instantiating the values obtained fronl.

saj. a 250.000 token sample with 26.812 types. the tbllowing:

26.812=~11hj0,000 ~i=j;. 624

Now. h e have to make cure that this constant value LI is indeed reliable in calculating the

nunlber of types Srom a given number of tokens. To check this. Sánchez and Cantos defined and applied the / jpe- /oke~~,for~~i~iI~i (TITTfurn~lil~i. hereafter):

lp=j;. 624di

C'1íií~leri70s de Filologio 11iglc.su. 9( 1 ). 7000. pp. 7 1-9 1

Page 5: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

lnvestigating Type-Token Regression and its Potential for Automated l'ext Discrimination 75

where y stands for types and x for tokens. and compared the results with those obtained from

real evidence. the C'UMBRE C'orpu.~. Trrble 1 below shows eight samples chosen (1.000.000.

2.000.000. 3.000.000. 4.000.000. 5.000.000. 6.000.000. 7.000.000 and 8.000.000 tokens). The estimation results obtained by means of our TYTTformula are quite close to the real

ones: corpus-based. The differences between the real data and the estimations range from +4.761% to -1.353%. which translated into total figures goes from +3611 to -1778 types.

based)

Tab. 1 Corpus-based dalo vs dalo based on esirmuiion (CUMBRE Corpt~s)

It is noteworihy that the estimations are just based on a single L/-irafue obtained from a

subcorpus. that is only 250.000 tokens. This pives an idea of the reliability and validity of the formula. The tr-valzie. though based on just 250.000 tokens. showed a great deal of accuracy

in the projection of various multi-million token samples. This cr-i~alzle is the sort of parameter

that tells the function y = h the initial slope the curve is to have from 250.000 tokens on.

The T-fornzula has undergone thorough testing and severa1 more trials were

undertaken. taking various samples from specific sublanguages. namely. press and general

Iiction. The tests were carried out by means of four 250.000 word samples from newspaper and

general Iiction language. The a-vtrlzies (press: 56.17; and general Iiction: 5 1.45) were obtained

calculating the mean of al1 a-values of the various samples (press: 56.12 for 250.000.56.48 for

500.000.56.34 for 750.000 and 55.74 for 1.000.000 tokens; general Iiction: 50.77 for 250.000. 51.45 for 500.000.52.29 for 750,000 and 51.32 for 1.000.000 tokens). The results for the real

corpus data and the projections are given in Tables 2 and 3 below. The striking similarities

between the real data and the estimated ones conIirm once more that the formula is indeed reliable for calculating the types from a piven number of tokens. and shows that tokens and types are functionally dependent on each other. This dependency can be mathematically modeled even before compilinp any corpus.

Cziarlerno~ rle Filología Inglesu. 9( 1 ). 2000. pp. 7 1-9 1

Page 6: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

7 6 Pascual Cantos Góiner

Toh. 2 1 esting the T1'T-For.riir1la witli press saiiiples

7'rlh. 3 I'esting the TI'T-F>rnii~l<r with peneral liction saiiiples

Similar test were perlornled not just for Spanish but also l»r English:

Page 7: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Inves~igating ' 1 ype-Tokeii Rrgression aiid iis Potriitial foi- Automated Teut Discriiriination 77

The evidence o f the experimental results allows us to state that li.equencies o f different types are not only distributed 'curvilinearly. (Biber 1993: 350). hui are distributed in a predictable way. that is. they are subject to mathematical nlodelling. W e can still go fiirther and say that i f the relationship between types and tokens holds then we might be able to construct regression nlodels (for a detailed discussion on the adecluacy ol' regression inodels !br

typellemma predicition see Yang. Cantos and Song,fOi~/hcon~it~g).

111. TYPE-TOKEN REGRESSION

In order to construct a regression model. boih the information which is going to be ~ised to n~ake the prediction and ihe ii~lbrn~ation which is to be predicted must be obtained from a Corpus sample. The relationship between the two pieces ol'inl'ormation is then inodelled with

a linear transformation. Then in the iuture. only the first information is necessar);. and the regressinn model is used to transform this information into the predicted. In other words. it is

necessary to have iniormation on both variables (types and tokens) before the model can be

constructed. A notional scheme is now necessary to describe t l~e procedure:

.Y is the variable iised to predict. and is sometimes called the independent variable. In

our case. it would be the ainount ol'1oke17.s.

1' is the observed value oi'the predicted variable. and is sometimes called the dependent

variable. I t would be the total ~jpes.

J . ' is ttie predicted value o f tlie depeildent variable. It would be tlie predicted nuinber

o f og7e.v.

Tlie goal in regression models is to create a model where the predicted J.' ai~d the

obserwd values o f the variable to be predicted are as similar as possible. The more similar

the values. the better the model. A visual representation ol'the relationship between the s and J' variables produces a

('liu~1crrio.s tic Filolo~qirr 1nple.rrr. 9( 1 ). 1000. pp. 7 1 -<) I

Page 8: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

78 Pascual Caiitos Góinez

regression liiie or linear relationship between s and J. taking nornlally tlie íbrm of a straight

line. In general. an] algebraic relation of the ibrm

) ! = a I p-

uill have a graph which is a straight line. The quantity of P i s called the slope or gradient o!'

the line and a i s often refkrred toas the intercept or intercept o11 they-axis. The values of aand

p remain fixed. irrespectively of the values 01's and J..

If we observe. however. the type-tokeri slopes obtained. we realise that the relatioiiship between s aiid!, is not linear but curvilinear. For example. see tlie type grouZh for English (Fig.

2).

5 . . . , I . 2 , 2<.> ' 3 5 . ' L ... iir

Tokens

Fig.2 Retrl 1.5 c2c.loiiaieil hpe-grovi.fh

What we need to do here is to linearize the type-token relationship b ~ . means oí'

transfomiing the data. This matheniatical transibrmation allows tlie dala to Iit better to simple repression niodels. Figz~r-e 2 shows that the relationsllip between Lhe Luo variables x (tokens)

and J ' (types) is clearly not a straight line. It is similar in shapr to curves which can be

expressed bq an equation of the hrm: 1- = .-Lr,.h

where il and B are constants or parameters. Nou instead of) ' consider its logarithn~. log 1.:

log 1. = log log 1. = log .4 -i h log .\.

C'i~trtlc~rrior de F~loldgíu in~lectr. 9( 1 ). 7000. pp. 7 1-9 I

Page 9: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

I i ivesr i~~t i i ig Type- l oheii Kegressioii aiid its Potential f«r Autoiiiatrd Text Discriniiiiation

tlie equation can be written:

TI7 = r r t hZ

u-hich is exactl!. the form of the simple linear regression model. Figiil-e 3 shows a grapli of II'

(/og 13 against Z(lox,\3 for English and Spanish type growths and indicates a niuch niore linear

relationship than u-as apparent in the previous figure (Fig. 2). A linear regressioii could then

be safely fitted to the logarithni ofthe original scores.

The evidence ol'the experimental results allows us to state tliat frecluencies of'different

types are not oiil~. distrihuted 'curvilinearly' (Biber 1993: 230). but are distributed in a predictahle way. that is. thej are s~~b~jec t to matl~eniatical niodelling.

The anal>iic technique h r predicting types applied by Sánchez and Cantos (1997) is

siniple and straightforward aiid tlie resulting Ibrrnula T).PE,Y=K JTOKEA:Y

is easy to use. flexible and can be applied quickly to any corpora or language samples. The practicality of this Iorniula relics on its simplicity which -and this is iniportant- goes liand in

hand witli its effectiveiiess and transparency. In particular. the Tl-T-For-lniilrr d ~ i e to its thorough

(~'iiar/c~r.nr).s í ie Filolo,qír~ fitfifesir. 9( 1 ). 700U. pp. 7 1-9 1

Page 10: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

80 Pascual Caiiios Cióniez

testing on various text saniples of various sizes. seeins very reliable with a niore than acceptable error niargin o1'i5%. and this speaks eloquently of its validity.

The most positive contributions ofbthe TIT-Foi.inirlo can be sunirnarised in the following

points:

It is a stable indicator oi'lexical diversity and lesical density. It overcomes the reliability flau of both the token-type ratio and type-token one as it

is not constrained or dependent on text lengtli. It can be used as a predictive tool to account for thr total amount ofword forms (types)

and iemmas any hypothetical Corpus might contain (see Sanchez and Cantos 1998).

5 "

Tokens

A revealing issue is that the application ol'the TIT-Formlllrr on dif'ferent text samples

yields. giving idiosyncratic. unique and distinctive slopes. The contrastive graph above (Fig.

4 ) clearly reveals that. lbr example. Conrad's lexical density is superior to Dogle's and

Shakespeare's. And this is further evidenced by their correspondent linear regression

transformation niodels (Fig. 5 ) .

This exridence suggests that the TIT-Foi.inul(r might also be valid for text. author and

language classifications. among others. In u-hat fc>llows we shall experiment on this issue using the C'b'hlBRE C'orpirs (Corpus oi'conteniporary Spanish).

C'z~adcriioc de I. 110Iofii~1 I I I~ / : ICSU. 9(1). 2000. pp. 7 1-9 1

Page 11: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Investigatiiig T)pe-Token Kegressioii aiid its Poteiiiial for Autoinated Text Discriiiiinatioii

IV. COMPARING TYPE-TOKEN REGRESSIONS

In this experin~ent. we (a) extracted (from the C'LWBRE C'orp1r.r) 1 1 different text samples tiom

textbooks and manuals tor secondary education and university leve1 relative to various sub.jects

or linpuisiic domains. (b) obtained their total anlounts of tokens and types. and (c) calculated their K-ilrrlzic,s (constant value: see TI"T-Fo~.nii<lo). The results are illustrated below in T~rhle

5.

Sample

C'11iitfen1os de Filolog;~ íi7gl~.srr. 9( 1 ). 2000. pp. 71 -9 1

Natural Scicriccs

lJhilohoph!

Tokens

4 1650

20385

Types íí-vnlirr

5982

3344

?O..< l

23.41

Page 12: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Pascual Cantos Góiiiez

The mean K-~~iIlle for the 1 1 saniple is 27.29 and its standard de\.iation 9.43. Comparing

these figures with the individual K-~~i1iie.c froni tlie table above (T~ib. j) reveals a great deal 01'

variahility or dispersion anlong the various text saniples. The sample onp1iysic.c compared with

the sociol lo^^' one indicates huge differences in lexical deiisity. not to say the relation between

I I I L I / ~ ~ I I I L ~ / ~ C . Y and circhi~cciine. Ilowever. geogrcg?hj> and hislor)' seem to have a ve5 similar

lexical density. The outstanding lexical density ofcirchilec1111.e can be explained on tlie basis

that it might contain man! proper names (artists. architects. places. etc.) and specitic terms.

wliereas the very lou density of m~i/h~~n~cr/ics niight prohablj, rely on its high proportion of

figures and forniulaic expressions in substitution of word forins. The histogram (Fig. 6)

displays graphicallq the various text types nrdered according to their lexical densities (K-

1~111ie.s). Interesting. here is the Sact how the lexical density scale moves smoothlj- froni pure

science sub-jects (mu/heniu/ic.s, cony?iiring. cheniislr)~. etc.) to more arts and humanistic content

texts. Additionally. neighbourhood on the histogram niight suggest s~ib~ject relatedness: the

more dissimilar the lexical density iiidices (K-I'LIIII~S) the less the subjects relate to each other.

The K- i~~ l i r c .~ suggest that discrimination between cheniisrr:)~ (18.46) and socio/o,g~

(42.03) texts niight indeed be possible as both figures diverge significantlj.. Ilnuever. a sole

K-i:~rlire based distinction betueen L ~ ¡ I C ~ I ~ . S I I ~ ~ (1 8.46) andphysic,~ ( 19.27) seems less reliable.

dile to its closeness. Intuiti\,ely. i t seenis as if a really fine grained classiiication is not viable.

To carry on exploring the extent and potential ofour mathematical regression model.

m-e proceeded in constructing a purely statistical niodel. We started esperimenting with a

descriptive. non-inferential statitistical technique: cluster anal>-sis.

To put it succinctly. cluster analysis classifies ii se1 ofobservations into two or more

nlutually exclusive groups based on the combination ni' interval variables. The purpose 01' cluster anal>.sis is to discover a system oi'organizing ohservations into groups. where nienibers

Page 13: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Iiivestipatiiig Type-Token Kegression aiid its Potential Sor Autoinaied ' 1 ex1 Discrimination 8 3

o!' the group share coninion properties. Cluster analysis classilies iinknoun groups while

discrimiiiant iiinction anal>.sis classifies knonn groups. A coninion approach to doing a cluster analyis is to iirst create a table «r niatrix of relative similarities or differences between al1

ob-iecct and second to use this inibrmation to combine objects into groups. Tlie table oi'relative similarities is called a proxiniity or dissiniilarity matrix. Toble 6 displays the dissiniilarit~

niatrix (note that both proxiinity matrices are symmetrical. Syminetrical means that rou and

column entries can be interchanged or that the nuiiibers are thr same on each half of the niatrix defined by a diagonal running t'rom top left to bottoni right). The distance measure usrd is the

.,y ilc11,cd Ei~e l idc~>(~~i dis~~111ce.

120»king at the rnatrix we lind that the least dissiniilarity or closest siniilarit>~ of al1 is

0.18. between the ki.\./o,-, text sample and the grogi.~l]?hj. onr. U'e could say that these seeni to

iorni the pair that is most alike. Phj1.sic.s and che~iii.slq~ have a very low dissimilarity indes

(0.66) and could be grouped. too. Since hi.s/o~:). is related to geogrc~]?lij, we could s a - that these

forma cluster. On the opposite scale. u e find the hugest difference between 111nlheni~1ric.v and

nrchireclil/.e ( O 16.27).

After the distances between tlie text types have been fouiid. tlie next step in the cluster anal>-sis procedui-e is to di\.ide the text types into groups based on the distances. The results of

tlie applicatioii of the clustering technique are best described ~ising a dendograni or bina? tree. Tlie ob.jecct are represented as nodes in the deiidograiii and the branches illustrate when the

cluster nietliud joins subgroups coiitainiiig that ob.jeect. The length of the branch indicates the distance between the subgroups when theq are joined.

C'~/cr~Icr.nos 'le Filolo~i'i Iiigles~r. 9( 1 ). 7000. pp. 7 1-9 1

Page 14: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

,- " L -4

I3 tE . l

:-e 2 H j s t llat

P t-, i 1 r r c ~ , F r , 3 Comy "13 t j. I ! r : : .

The interpretation of'the dendogram is fairly straightforward. For example. Geo'Hi.s ':Y(r/ Iorm a group. C'hen7/Ph~,s,'('oi?y,~~Zfu1h fornl another group and dr.ch,',Soc is called a "runt"

because they do not enter any group until near the end of the procedure. A dendogran) that

clearly differentiates groups ofobjects will have small distances in [he Sar branches ofthe tree

and large differences in the near branches. The dendogram above illustrales 1 cluster or

solution at distance 25. 2 clusters at distance 10. 3 at 4. 4 al 3. 5 at 1 and 11 al 0. This results

into 6 possible solutions or groupings (see Ttrh. 7).

Cluster analysis methods always produce a grouping. The grouping produced by the cluster analysis may or may ilol prove uselul for cIassiS.int objects To validate these cluster

arialysis outputs we shall use them in conjunction with discriminant function analysis o11 the

resulting groups (solutions) to discover the linear structure of either the measures used in the

cluster analysis andlor different measures.

C'itrriiernos tic, Filologíit 1tigles~r. 9( 1 ). 7000. pp. 7 1-9 1

Page 15: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Iiivestigating l'ype-Token Kegression and its Potenrial for Autoinated l'exi Discriniinaiioii 85

Obvioiisl). the best solution is 1 (the posibility of discriminating al1 11 text types).

whereas 6 is cleaily the worst one (unable to differenciate any text type).

Cluster analysis is a positive exploratory tool for cliistering possible groupinp solutions

and fbr constructing at a later stage a group menibership predictive model by neans o r the

discrimiiiaiit Iunction analysis. This later multivariate technique is based on a linear

combination of the interval variables ( K - ~ ! ~ t l z ~ r . s ) . lt begins with a set of observations where both

group n~enibership and the values of the interval variables are known. The end result of the

procedure is a rnodel that allows prediction of'membership when only the inteval variables are known. A second purpose ofdiscriminant function analysis is an iinderstanding ol'the data set.

as a careful examination oi'the prediction n~ode l that results Iiom the procedurecan pive insight

into Ilie relationship between group membership and the variables used to predict group

nienibership.

Solution

1

'7 -

>

4

5

6

C'i~crd~,~.iios LIL, Filologitr In(gle~c~. 9( 1 ). 7000. pp. 7 1-9 1

Rescaled Distance

0

1

?

4

1 O

2 5

Clusters

I I : Geo lf!\ ;1/[7/ '1 le</ f'11iI c ' / l L ~ / l l

I>hi~.s COlll/~ ,\ l(lt11 .A l~cll Soc

5 : Gco Hrs /\'U/

,\Iedl'hil C ' I I ~ I ~ I ; I > I J , I , S . C ' O I I I ~

I\ 117th -1rclr SOC

4: G L J O H I . ~ ,Ya/ n I L Y I Pllll C/l~,lll ~P/ll,.\ c-'0111/1 A il7!/1 .A/.c,li SOL.

-) . C;(~O~HIY Ccrt,A/~>d I'lril C%L,III. f'h:~ , ( ~ I I I / I . \ l~ t l~ .-lrch Soc

?. -. G.L>o t1i.y M U I Aled/Phil C'I~L~III,PIIJ..\ Conlp '11lutl7

.4rc7/1 SOL,

I : C;eo,lfr.~' 'Vcrtdkd I'liil C'~CIII .P~J. .F COII I I I Muth .Are11 iSo~

Page 16: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

86 Pascual Cantos Gónirz

In order to construct a nlodel ~ising discriminant function analysis. we added 1 1 more

test saniples. one for each text type. as the data available was insufficiient. Next. iising the

esploratory cluster analysis data. we constructed the first model. taking solutioii 1. that is. a

model sub,ject to discriminatel 1 diIIerent text types (namely. Geogr~phj-. Hi.s/or~~. X11ilrc11

,G.iolces. j1fedicirle. Pliiloso~phj~. C'henlis~g: P~j..sic.v. C 'o~npi~~er ,Tc~ience. h f ~ / / h e n z ~ / / i c ~ ,

: I ~ L . ~ ~ I ~ L . I L I I . ~ and ,Soc~ioiog,). The case ii~iniber. act~ial gioiip. group assignrnents (Higl?e.v/ G'/,oup

and 2"" Highe.v/ Groip) and discriminailt scores are given below ( T ~ ~ h l e S; note that wrong

group assignment in Highc.v/ Gr01y7 is marked with "**").

Tirh. S U ~ s c r ~ i i r i ~ r i u t t ~ , f i ~ ~ ~ c ~ i o ~ t o~iirli:si.s tor !ol~iliorr I ( 1 1 clii.sl~~,.;'groirl~x/

The discriminant niodel Sor 1 1 text types re\,ealed a siiccess rate (correct gioup

assignment) of 8 1 .8 1 n/o (it failed in correctly assigning cases 10. 1 -3. 1 3 and 16. which were.

however. correctly classified in the second choice -2"" Hix/7e.c1 (;i.orpj.

The next discriminant niodel based aii solutioii 3 (5 text types) resulted into a \.erj

promising 95.39ó success rate. It just lailed in classi&iiig correctlj case 19 (hrr l test) ~ ~ h i c h

was groiiped to the ~CIed'Phil cluster.

C'iíirdertios rlc Frlologiii Irt,qle.~cr. 9( 1 ). 3000. pp. 7 1-9 1

Page 17: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Investigatiiig I'ype-Tokeri Kegression and its Poteiitial Sor Autoinated Text 1)iscrimiiiation 8 7

The nest sol~ition (3 with 4 clustersigroups) differs fiom solution 2 in that it groups

h1f11h within the C'heni P.,..vC70ny) cluster. without solvinp tlie wrong groiip assignirnt 01'

solution I (case 19). This is only solved within solution 4 (where h¿r/ is grouped within

il.letiT'hil. resiilting iiito j~ist 3 clusters: (1) G'~~o~Hi.s~.~Vt~/,!iZfeu','PIIil. (2) C'hem 'P~~ . \ , 'C70 t i i~~ ,~ \ t r /h

and ( 3 ) .4rc.h :Soc.). with a success rate OS 100%. However. this soliition has a serious Ila\v: its

niinimal accurac?; and discriminaton power (Tcrh. 10 display a suniniary oi'all 6 solutions).

Iuh. 1 O Siiriiiiicrr~~ o1 IIIC i.urioi~s soliitiuns, c,ilrster dii.isiotis ~rt7ti ussociutcei T I I C L ~ L ' S . ~ r u / ~ ~

Solution

I

2

3

4

5

6

C~rot/o.no.\ O'L, L.rtlolupiir /ng/c..~cr. 9( 1 ). 7000. pp. 7 1-9 1

Clusters

I I

5

4

3

?

1

Success Kate

81.8I0,o

95.S0Ú

95.55,"

1 00? 0

1009;o

l 0O0,'o

Page 18: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

8 8 Pascual Cantos Goi i i e~

The previous analyses are very revealing and it is now up to the reader to choose or decide which is the best solution. depending on hislher researcli goals (recall that discriniinant

hnction analysis is not i~z/erel~/itrl). Nevertheless. it is our opinion that the hest nlodel is

solution 2. because oi'its reasonable discrinlation power (it is able to discrinlinate 5 different text types: (1 ) Geo, Hi,~/,!h~~tr/. ( 2 ) ,14ed;Plzil. ( 3 ) ('hen7/Pl7j~.s;'('o177p, (4) Mtr/h and (5) III.L.~,'SOC)

and its accuracy (95.5%).

Another positive contribution ofdiscrimiiiant function analysis is that once the groups (interval variables) are known we can construct a model tliat allows prediction of membership.

This is done by means of the resulting discriminant function coefficients She coetlicients for solution 7 are:

To illustrate its prediction power. take. for exainple. a text with a K-vcrlirc = 14.01

TEATTIT'E = ( ' O I I . \ / ~ I I ~ I + Ik-tALCTE * 14.01)

K VALUE (Constant)

We just need to niaximize the five coefiicients:

.41.ch/,Yoc = -336.Yl4 -1 (15.734 * 14.01) = -1 16.48

Geo,'Hi.s/ 'ÍI~'tr1 = -1X6.06Y + (1 l . 67 * 14.01) = -22.5 7

.2_lc11h = -40.603 + (5.365 * 1401) = 34.56

,Wetl"Phi/ = -121, Y09 + (Y.424 * 14.01) = 10.12

('hen7!'Phj:s,'('omi? = -66.267 + (6.909 * 14.01) = 30.52

This results in that a hypothetical text with a K-;-i,tr/uc = 14.01 is niost likely to he classified in

first choice as heiiig a nztr/hen7tr/ics text. as A,lcr/h is tlie highest resulting coefficieiit (34.56);

and in second choice. it would be classiíied as ('hemlPhj:s/('onip (30.52). Siniilarly. the least

likely group nienibership would be Al.chNYoc (-1 16.48).

TEXTYPE

V. DISCUSSION AND CONCLUSIONS

Are h/Soc

15.734 -336.914

Fronl the evidence above. me are confident tliat the K-:-i.trli~e is indeed a stable and robust lexical

density indicator conipared to the ~ype-/oken ~.tr/io and lc~ loke17-/jpe ~.a/io. This constant value

C'lltrdertios di. Fi/ologícl Inglcco. 9( 1 ). 2000. pp. 7 1-9 1

ChemlPhys/

Com p 6,909

-66267

GeoíHist/Nat

1 1.670 - 186069

Math

5.365 -40603

MedPhil

9.434 -121909

Page 19: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Invesiipaiin 1 )pe-Tohen Kegression arid its Potential fnr Autoinated Text Discriini~iatiori 89

seems not just a reliable lexical densi0 indicator b ~ i t also a decisive index Ior typeilemnia

prediction in a11 .Y tokeii corpus. Distinct to tlie ope-iokeri'iokeri-opc r.~iiio. the li-i.~llue is texi

length independent aiid stajjs unaltered throughout. This reliability and validity resul~s in a

iiseful lexical densit). indicator. It is precisely its robustness that has niotivated the preseiit studj-. on the assuniption that

dillerent test types/sarnples relative to distinct linguistic doniains are likel!, to exliibit unique

i(-i-~ilire.v. The experimental data. as well as previous experinients. have revealed that different

languages. authors or linguistic domains. etc.. dit'fer from each other. among niany other things. in their lexical densit?. that is. in the relation of distinct word torms (types) (o tlie text/corpus

word size. This enables us. for instante. to distinguish (a) languages: general Spanish has a K-

I ~ L I I Z I ~ 01' 54.29. whereas general English 41.43; (b ) text types: Spanish iiction: 50.77 and

Spanish press: 56.12: ( c j authors: Conrad: 35.75. Shakespeare: 37.96 and Doj-le: 26.78: and

(4) linguistic domains: archicture (Spanisli): 44.22. chemistry (Spanish): 18.46. computing (Spanish): 17.09. geography (Spanish): 33.32. history (Spanish): 32.9. mathematics (Spanisli):

13.95. medicine (Spanisli): 36.26. natural sciences (Spanish): 29.31. pliilosophy (Spanish): 23.42. phjrsics (Spanish): 19.27 and sociolog'- (Spanish): 42.03. Clearlq-. distinct K-i-~lllie.v

indicate different texl types. autliors. linguistic doniains. etc.

If we conceiitrate on the examined linguistic domains. me can appreciate a huge

uriation belween ~irchiieciiire (44.22) and ni~iiheni~i~ics ( 1 3.95). for example. This sugpests that

discrirniating these two dornains u-ould not be too difficult. However. distinguisliing bet~veen

googi'~iphj, (33.32) and hisioy' (32.9) seenis nearly impossible.

Interesting in this cense are Figiire 6. the cluster analysis and the discriminant function

analysis. Figii~.c 6 represcnts \,isuall>. the K-19~rlue ordered linguistic doniains. where we can

appreciate a logical and smooth text type transition. that goes froni piire science (m~iihenzaiits)

to clear humanity conteiits (.sociolo~,?:~~rchiieciiire). This stratification is based on a single

lexical density feature: the K-I:LIIIIC>. Complenientary. tlie cluster analysis offers an exploratorj

grouped hierarcliical structiire of the text types. highlighting the m-jor flau of the K-i~ilue:

i n c a p a c i ~ of distinguishing between closely nearby K-vu1iie.s. as little dissimilar lexical

densities are grouped into single clusters. Clearly. the li-i-~~lzle Iails to distinguisli betw-een (a)

g~~~gi .c i j~h j ' . hi.si(>q~ and I~LIILI~( I I .scie~ice.s; ( b) medicirie a n d p h i l o . s o l ~ ~ ~ : (e) cheniisig,. p h j ~ i c s aiid

cor~i l~i~i i r~g: aiid (d) a socio lo^^ and ~irchiieciilre. Houever. the final modelling oí'the data b>-

nieans of the discriminant Iunction analysis reveals that the A--i~cilile is valid and reliable to

s~iccessf~illy differentiate (a) geogr~7p/i~: hi .vioq~ln~ii i~r~i l SC~CIICC.Y. (b) nie~lici~ic~ 17hil0soj7hj~. (e)

c~hen~i.vir~:pli~~.ciccs~ conipiriiiig. (d) .sociolo~.'ur-chifec~i~re and (e) ni~riheniciiics from each other.

Thougli a potential text discriminator using Ií-i:ullle does not. in principie. produce a

v e n specific classitication. it does no1 invalidate the use of lexical dens i t~ for text

diflkrentiation. The resultiiig text classilicatioil froni the experiment is far l iom being erroneous or exaggeratedl). generic. On the contrarj-. it discrimates clearly distintive text type cl~isters:

C'r~uiiertio~ tle Filol~ifiiu li7sqllc..si~. 9( I ). 2000. pp. 7 1-9 1

Page 20: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

90 Pascual Cantos Cioiiie/

( a ) n ~ ~ i / h e m ~ i ~ i c . s . (b ) chenzi.s1i:~~~j~h~~,ric~s~~oi~11~zi1ii~g. (e) i ~ ~ e ~ i i ~ i i i e í ~ ~ l 7 i l o . s o 1 ~ I i ~ ~ . ( d )

c h e i ~ ~ i . ~ / i : ~ ~ ~ / ~ h ~ ~ . c i c . c coi7lpzi/it7g and (5) sociolol=):~urchi/ec~iire. with an accuracy rate of 95.5%.

111 suni. u e are coniident oí'the ~iseiiilness of the lexical densi t - for autoniated text classification. il'a reliable aiid valid lexical densit?- index such as the K - ~ ~ c ~ l i i c is used. Tlie

con.iunction «f the K-vdzie with niultivariate statitistical tecliniques (cluster analysis and

discriniinant i'unction analysis) has resulted into very posihve and promising data niodels.

where the potential preciseness o f t h e text typifation has been much more specilic than one inight espect at lirst sight. It needs to be i-ecalled that neither linguistic knowledge. linguistic paradipnis nor linguistic feature data were used. just a single index specifying the relationsliip

belween words (tokens) and word forms (types) relative to each text sample.

Page 21: INVESTIGATING TYPE-TOKEN REGRESSION AND ITS ......lvords of the text (7646 in our example). Roth indices obtained here. ille token-t!pe and the type- token ratio. are not ver). signilicant

Inves~igatin: '1 ype-Token Regressioii and iis Porential for A~itoinated Test Discriiriiiiation 9 1

REFERENCES

Hiber. D. ( 1988) lin.irr/iot~ r~cr.oss ,Sj~eccli trnd Fti-i~itlg. Cambridge: Can~bridge l!iiiversity Press.

Riber. D. ( 1 993 ) "Representativeiiess iii Corpus Design". L ~ / ~ ~ . L I I > ' ríiicl Litlgili.s/ic C'onlpz~~ing. 8(4): 243-357.

Meaps. H. S. ( 1978) It1fOrn7ririotl Re/r.iei~iI: C'onl~~z1/~1/iot7uI utid Thcot~e~icol A.sprc/.r. New York: Acadeniic Press.

Holnies. D. 1. ( 1994) "Authorship Attributiori". ('on~pzl~rl:, rrt?d /hc Hi~n~rrtri/ie\. 38: 87- 106.

Sánchez. A. and P. Cantos (1997) "Predictabili~ ol' Word Fornis (Types) and Lemmas in Lingiiistic Corpora. A Case Study Based on the Anal>,sis of lhe CUMBRE Corpiis: An 8-Million-Word Corpus OS Contemporary Spanish". J t~ /e~ . t~o / io~i r~ l .Jollrnrrl qf ('o,pz~.v I,irlgz~i.s/ic.s. 3í3): 259-380,

Sánchez. A. and P. Cantos (1998) "E1 ritmo incremental de palabras nuevas en los repertorios de textos. Estudio experimental y comparativo basado en dos Corpus lingüísticns equivalentes de cuatro niillones de palabras. de las lenguas inglesa y española y en cinco autores d e ambas lenguas". .4TL.4i\?TILS. XIX (1): 305-333.

Yang. D.-H.. P. Cantos and M. Song (for~l~cotni t~g) "An Algorithm Sor Predicting tlie Relationship betweei-i Lemmas and Corpus Si/e". ETRI (Electrnnics and Teleconin~unications Research Institute) .Joz~rt~~r/.