Transcript
Page 1: asakawa/2019cnps_handson/...

Relationship between word2vec representations of Japanese Wikipedia and lexical properties

Tadahisa Kondo (1), Shin Asakawa (2) (1: Kogakuin University, 2: Tokyo Woman's Christian University)

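Introduction

In recent years, a family of machine learning methods known as word embedding models or vector space models (Mikolov et al., 2013) has attracted attention because these methods have been shown to be capable of representing meaning. Word embedding models originate in the intermediate representations devised in natural language processing to convert one-hot vectors, in which each word is represented with as many dimensions as there are words in the vocabulary, into dense vectors (Elman, 1990). Because operations such as addition and subtraction can be formally defined between vectors in the embedding space, these vectors can be regarded as a kind of representation of meaning.

Example: vec(king) - vec(man) + vec(woman) = vec(queen) (see figure), which can be interpreted as corresponding to the human capacity for analogy.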
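The analogy arithmetic in the example can be sketched in a few lines of Python. This is only an illustration: the three-dimensional vectors are invented toy values, not vectors from the trained model, and the names `toy_vectors`, `cosine`, and `analogy` are hypothetical helpers.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real word2vec vectors are learned from text and have many more dimensions.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three query words from the candidates follows common practice for analogy queries; without it, the nearest vector is often one of the inputs.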

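Method

Corpus: Japanese Wikipedia (http://dumps.wikimedia.org/jawiki/latest)
- Downloaded on 12/Sep/2016.
- Segmented into morphemes (words) with MeCab.

Corpus size: the MeCab output contained 1,265,697,011 morphemes. After excluding words that occurred fewer than 5 times, there were 1,140,357 distinct words (vocabularies) and 5,774,139 tokens in total.

Training with word2vec: the word2vec vectors had 128 dimensions. A window width of 10, negative sampling of 10, and skipgram were used.

Analysis targets: the 17,842 words for which words appearing in Wikipedia could be matched with words recorded in the NTT database (hereafter NTT-DB).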
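As a rough sketch, the training settings listed in the Method could be written with the gensim library as follows. Using gensim is an assumption (the poster does not say which word2vec implementation was used), and `W2V_PARAMS` and `train_word2vec` are hypothetical names introduced here.

```python
# Hyperparameters from the Method, expressed with gensim (>= 4.0) argument names.
# ASSUMPTION: gensim is used here only for illustration; the original
# implementation is not named in the source.
W2V_PARAMS = {
    "vector_size": 128,  # dimensionality of the word vectors
    "window": 10,        # context window width
    "negative": 10,      # number of negative samples
    "sg": 1,             # 1 = skip-gram (0 would select CBOW)
    "min_count": 5,      # ignore words occurring fewer than 5 times
}

def train_word2vec(tokenized_sentences):
    """Train a model on a list of token lists (for example, MeCab output).

    The argument should be a re-iterable sequence such as a list, because
    training makes more than one pass over the corpus.
    """
    from gensim.models import Word2Vec  # third-party; imported lazily
    return Word2Vec(sentences=tokenized_sentences, **W2V_PARAMS)
```

A call such as `train_word2vec(sentences).wv["東京"]` would then return the 128-dimensional vector of a word, provided that word survives the `min_count` filter.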

Fig. 1. Correlation plots of each property value against its predicted value

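Future plans

We plan to analyze the model further and to examine what role the learned vectors play. By raising prediction accuracy, we will also attempt to automatically generate measures that are otherwise obtained by subjective rating, which shift as new words appear and usage changes.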

a) word frequency in NTT-DB (x: frequency count; y: predicted frequency count)

b) written word imageability in NTT-DB (x: V-imageability; y: predicted V-imageability)

c) written word familiarity in NTT-DB (x: V-familiarity; y: predicted V-familiarity)

References

Amano, S., & Kondo, T. (1999, 2000). Lexical properties of Japanese, Vol. 1: Word familiarity; Vol. 7: Word frequency. Sanseido.

Sakuma, N., Ijuin, M., Fushimi, T., Tatsumi, I., Tanaka, M., Amano, S., & Kondo, T. (2008). Lexical properties of Japanese, Vol. 8: Word imageability. Sanseido.

Mikolov, T., Yih, W.-t., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Atlanta, GA, USA.

[Figure: vector offsets between word pairs, after Mikolov et al. (2013). Left panel: WOMAN/MAN, AUNT/UNCLE, QUEEN/KING. Right panel: KING/KINGS, QUEEN/QUEENS.]

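Purpose

Lexical properties recorded in the NTT database series, such as word familiarity and word frequency (Amano & Kondo, 1999, 2000) and word imageability (Sakuma et al., 2008), are property values that reflect various characteristics of Japanese. Word familiarity is a subjective rating given by 357 participants and expresses how familiar people find each word. Word frequency was compiled by counting the occurrences of each word in a corpus of 12 years of newspaper text. Word imageability, like word familiarity, is a subjective rating and expresses how easily the meaning of each word can be imaged.

Models built with word2vec (e.g., Mikolov et al., 2013), on the other hand, are thought to be able to represent word co-occurrence, context, and a semantic space by learning from a vast amount of text. The purpose of this study was to analyze the relationship between the values in the lexical property database and the vector space representations of words learned by word2vec. That is, with the aim of clarifying what the word2vec vector space learns and what it represents, we compared the information in the NTT database, which was obtained from psychological experiments, with a word embedding model of Japanese Wikipedia, and tried to elucidate how the information represented in the two data sets is related.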

Fig. 1'. Correlation plots of each property value against its predicted value

a') word frequency in Wiki (x: frequency count, logFreqWiki; y: predicted frequency count)

b') spoken word imageability in NTT-DB (x: A-imageability; y: predicted A-imageability)

c') spoken word familiarity in NTT-DB (x: A-familiarity; y: predicted A-familiarity)

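Demo site

The trained word2vec data (vector data) used in this presentation, and sample word2vec code in python (jupyter notebook): https://github.com/ShinAsakawa/2017jpa/
Note that the data there are from Japanese Wikipedia as of July 2017. Two dimensionalities are provided, 200 and 300. The algorithms are skipgram and CBOW.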

Results (multiple regression analysis)

Regression models predicting each lexical property value from the word2vec parameters (128 dimensions) of the 17,842 words.

Lexical property                          adjusted-R2
NTT-DB word frequency (log)               0.5495
NTT-DB word imageability (written)        0.4232
NTT-DB word familiarity (written)         0.3741
Word frequency in Wikipedia (log)         0.6539
NTT-DB word imageability (spoken)         0.4242
NTT-DB word familiarity (spoken)          0.3189
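For reference, the adjusted R2 reported for these models penalizes the ordinary R2 for the number of predictors. The sketch below shows the standard formula and inverts it for one reported value. It assumes n = 17,842 observations and p = 128 predictors (one per vector dimension); the function names are hypothetical, and the exact number of cases per model is not stated in the source.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

def r2_from_adjusted(adj_r2, n_samples, n_predictors):
    """Invert the formula above to recover the unadjusted R^2."""
    dof = n_samples - n_predictors - 1
    return 1.0 - (1.0 - adj_r2) * dof / (n_samples - 1)

# ASSUMPTION: every model used all 17,842 matched words and 128 predictors.
N_WORDS = 17842
N_DIMS = 128

# Unadjusted R^2 implied by the adjusted value reported for Wikipedia frequency.
raw_r2_wiki = r2_from_adjusted(0.6539, N_WORDS, N_DIMS)
print(round(raw_r2_wiki, 4))
```

With this many words relative to 128 predictors, the adjustment is small, so the adjusted and unadjusted values differ only in the third decimal place.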

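Discussion

Predictions of lexical properties from the vector values of each word learned by word2vec performed well for frequency. For familiarity and imageability, which are subjective ratings, prediction was worse than for frequency. Comparing the two, imageability was predicted slightly more accurately. Because imageability is a property that indicates how easily a meaning can be imaged, this result is thought to be related to the word2vec parameter space being associated with meaning.

For frequency, prediction was better for frequency in Wiki, the corpus used for training, which reflects the difference in character from NTT-DB frequency, which is frequency of occurrence in newspapers. In addition, prediction performance was lower for spoken word familiarity than for written word familiarity. This is thought to be because properties such as the difficulty of the written form (orthography) itself are involved. For imageability, by contrast, the difference between presentation media was small, presumably because the ease of imaging a meaning does not differ greatly between the written and the spoken word.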

APPENDIX A. Relationship between countries and capitals in a two-dimensional plot of vector values learned with word2vec (after Mikolov et al., 2013)

[Figure: two-dimensional scatter plot, both axes ranging from -2 to 2, with points labeled by country and capital names, including 中国 (China), 日本 (Japan), and 東京 (Tokyo).]

APPENDIX B. Correlations among the frequency, familiarity, and imageability values recorded in NTT-DB (from Amano & Kondo, 1999, 2000; Sakuma et al., 2008)
