1
ryËWikipediağword2vecÀěË\běğÞ ýÓ¿# 1 ŖR 2 Ř 1 SJàCJŖ 2 {FGCJř üĠĒħĞ ÓYś/Ë>ħÒĦŏńŒ (word embedding) ĥĕĠ ʼnĶŅŒ¬ÝŏńŒ (vector space) ě9 ġĮĭÔğJ¶h(Mikolov et al., 2013)Ċe8İÀēĭ4»bĊĄĭďěĊ¨ ĐĮę£ĐĮęąĭĀ /Ë>ħÒĦŏńŒěĠś½ÇË&Ğ Ĉąę5āğ/ËĠ³Ë\nÜİ!ěēĭ  ÀĚĄĭŔŕŋłŅʼnĶŅŒ(onehot vector) İMÀʼnĶŅŒ (dense) ĞBmēĭĕħĞ· ĐĮĕÝÀĞzēĭ(Elman, 1990)Ā /Ë>ħÒĦŏńŒĚÀĐĮĕ¬Ýğ ʼnĶŅŒÝĚĠ+®ś®ĝĜğ®Ċ][ ¢ĞĠKµĚċĭ ďěĉīe8ğÀğ ěĦĝđĆĭĀ ř vec()Avec()+vec(F)=vec(F)(;0) ĠÝğçl»ěºđĆĭĀ üq ĸŗŇĽş ryËWiki (hDp://dumps.wikimedia.org/jawiki/latest) Ś 12/sep/2016 ĞŀĴŕœŗņ Ś MeCabĞĪĘę]f±Ř/ËřĞ() ĸŗŇĽĺijľş MeCabğ'*1,265,697,011]f±(()±)Ā ':nĊ5:xğËİáąĕĝĬnĚ 1,140,357 (vocabularies), ŅŗĿŒĚ5,774,139 ŅŗĶŕĀ ŚWord2vecĞĪĬJ¶ Word2vecğ!Ġ128ĀĴIJŕņĴW 10, Íĺŕňőŕķ 10, skipgram İąĕĀ Æ|OÌş wikipediaĞ'ēĭ/ËěNTTńŗĿʼnŗĽ (ßNTTADBě9ģ)ĞI=ēĭ/ËěğŌłŁ ŕķĊěĮĕ17,842ËĀ Fig. 1 5bělKğ¤Þ; ü_ğK _ĐīĞŏńŒğÆ|ēĭİÕħśJ¶Đ ĮĕʼnĶŅŒĊĜğĪĆĝ^)İ}ĕđęąĭ ğĉİkēĭKĚĄĭĀ lK°ZİjĎ ĭďěĚśpËĊ'đĕĬvB.ēĭÅ ÉKĞĪĭiݽ-gēĭďěİÊĦĭĀ a) word frequency in NTT-DB b) written word imageability in NTT-DB c) written word familiarity in NTT-DB frequency count predicted frequency count V-imageablity predicted V-imageability V-familiarity predicted V-familiarity ü0·o DÙguŖÓ¿# (1999, 2000). ryËğË \bŝU/ËÄMZś7U/ËæZś ¥?Ŝ ÝQGŖãà¦âŖÃÎEŖÑT~Ŗ ŖDÙguŖÓ¿# (2008). ry ËğË\b8U/Ëabś¥?Ŝ Mikolov, T., Yih, W. tau, & Zweig, G. (2013). LinguisYc regulariYes in conYnuous space word representaYons. In Proceedings of the 2013 Conference of the North American Chapter of the AssociaYon for ComputaYonal LinguisYcs: Human Language Technologies (NAACL), Atlanta, WA, USA. WOMAN AUNT UNCLE MAN QUEEN KING QUEENS KINGS QUEEN KING ü£¢ NTT ńŗĿʼnŗĽĻőŗľĞ3ÛĐĮęąĭ/ ËÄMZĈĪĢ/ËæZ(DÙěÓ¿, 1999, 2000)ś/Ëab(Ýī,2008)ĝĜğË\ bĠśryËğāĝbİ2tđĕb ĚĄĭĀ/ËÄMZĠ357ğÁé¸ĞĪĭ Å¢ÉKĚĄĬśğ5/ËĞOēĭĂĝ ĒĦăğ©ZİÀđęąĭĀ/ËæZĠp¹12 Y(İĸŗŇĽěđę/Ëğ':nİnćĭ ďěĞĪĘęgĐĮĕĨğĚĄĭĀĥĕ/Ë abĠś/ËÄMZě6ĞÅ¢ÉK ĚĄĬś5/ËĞOēĭe8ğijŎŗļđĩē ĐİÀđęąĭĀqśword2vecŘe.g., Mikolov, et al., 2013)İąę¯ĐĮĭŏńŒĠś¼ CĝÚğŃĵĽŅİJ¶ēĭďěĚś/Ëğ$ ÏĩoÿĔđęśe8¢¬ÝİÀ4»ĚĄ ĭě·ćīĮęąĭĀy§«ĠśË\bńŗ ĿʼnŗĽğěword2vecĞĪĘęJ¶ĐĮĕ/ ËğʼnĶŅŒ¬ÝÀěğÞİÆ|ēĭďě İ£¢ěđĕĀēĝįėśword2vecğʼnĶŅŒ ¬ÝĠİJ¶đśİÀđęąĭğĉİs īĉĞēĭďěİe;đśaLéĉī`īĮ ĕNTTńŗĿʼnŗĽğd@ěryËĴIJĵŊ ńIJığ/Ë>ħÒĦŏńŒěİÐēĭďě ĚśńŗĿĚÀĐĮĕd@ğÞÔğÈ İÊĦĕĀ Fig. 1' 5bělKğ¤Þ; 2 3 4 5 6 3 4 5 6 7 8 A-familiarity predicted A-familiarity c') spoken word familiarity in NTT-DB a') word frequency in Wiki 1 2 3 4 5 1 2 3 4 5 6 alldata$logFreqwiki predicted frequency count frequency count 2 3 4 5 6 7 3 4 5 6 7 A-imageablity predicted A-imageability b') spoken word imageability in NTT-DB üdemo ĺijŅ y¡ÀĚąĕ word2vec J¶ĦńŗĿ Ř vector data řśĈĪĢ python (jupyter notebook) ĞĪĭ word2vec ğĺŕňŒĸŗņ : hDps://github.com/ShinAsakawa/2017jpa/ ĕĖđńŗĿĠ 2017 Y7 wğryËĴIJĵŊ ńIJıĀ!n 200, 300, ğªĀıŒĹőľ ōĠ skipgram, ĈĪĢ CBOWĀ ü²}ŘØ:V(|ř 17,842 Ëğword2vec ŇŐŎŗĿ(128 řĞĪĭ 5Ë\bğ:VŏńŒ adjustedAR 2 NTTADB/ËæZŘOnř 0.5495 NTTADB/ËabŘoHř 0.4232 NTTADB/ËÄMZŘoHř 0.3741 Wikipedia%ğ/Ë'æZŘOnř 0.6539 NTTADB/ËabŘåAř 0.4242 NTTADB/ËÄMZŘåAř 0.3189 ü·N ûøùôíúõóĞĪĬJ¶đĕ5/ËğʼnĶŅŒĞ ĪĭËbğlKĠśæZĞOđęêg´Ě ĄĭĀqśÅ¢ĝÉKĚĄĭÄMZ1 ĢabĞOđęĠæZğlKĪĬ,ĭĀ ¸İÐēĭěÿ¾XĝĊīabğqĊ° ZĊêąĀabĊe8ğijŎŗļđĩēĐİ ¨ēbĚĄĭďěĉīśûøùôíúõóğŇŐŎŗ Ŀ¬ÝĊe8ěÞÔēĭďěěÞđęąĭě ·ćīĮĭĀ ĥĕśæZğlKĈąęĠśJ¶Ğąĕòö÷ö ĞĈčĭæZğqĊêg´ĚĄĬśðññìïîĊ p¹Ğ'ēĭæZĚĄĭěąĆbğÖą ĊĮęąĭĀĥĕśoH/ËÄMZĞĤ ęåA/ËÄMZğlKg´ĊcąĀďĮĠ oHŘHřĔğĨğäđĐĝĜğbĊÞ đęąĭĕħě·ćīĮĭĀqśabĠ" *ŎńIJıĞĪĭÖąĠPĐČśbğoH /ËěåA/ËĞĪĭe8ğijŎŗļđĩēĐ ĞCċĝÖąĊĝąĕħě·ćīĮĭĀ APPENDIX A word2vecĚJ¶đĕvectorğ Ş!ĞĪĭ<ěè×ğÞ ŘMikolov et al.,2013 ĞĪĭř -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 ୰ᅜ ᪥ᮏ ி ᮾி ࢡࢫ APPENDIX B NTTADB3ÛğæZśÄMZśabÝğ¤Þ (DÙśÓ¿ś1999; 2000ŠÝī, 2008ĪĬ)

b Þasakawa/2019cnps_handson/...ryËWikipedia word2vecÀ Ë\ b Þ .. ýÓ¿# 1 V R 2þ X1SJàCJ V2{ FGCJ Y ü ' . ÓY [/Ë> 'Ò & O D Rë(word.embedding) ! . - Ô J h (Mikolov.et.al.,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: b Þasakawa/2019cnps_handson/...ryËWikipedia word2vecÀ Ë\ b Þ .. ýÓ¿# 1 V R 2þ X1SJàCJ V2{ FGCJ Y ü ' . ÓY [/Ë> 'Ò & O D Rë(word.embedding) ! . - Ô J h (Mikolov.et.al.,

ryËWikipediağword2vecÀ�ěË\�běğÞ�..

ýÓ¿#1Ŗ�R��2þŘ1SJàCJŖ2{�FGCJř�

üĠĒħĞ�.ÓYś/Ë>ħÒĦŏńŒë(word.embedding)ĥĕĠëʼnĶŅŒ¬ÝŏńŒë(vector.space).ě9ġĮĭ�Ôğ��J¶h�(Mikolov.et.al.,2013)Ċe8İÀ�ēĭ4»bĊĄĭďěĊ¨ĐĮę�£ĐĮęąĭĀ.þ/Ë>ħÒĦŏńŒěĠś½�ÇË&�ĞĈąę5āğ/ËĠ³Ë\nÜİ�!ěēĭ À�ĚĄĭŔŕŋłŅʼnĶŅŒ(onehot.vector).İMÀ�ʼnĶŅŒë(dense).ĞBmēĭĕħĞ·�ĐĮĕ�ÝÀ�Ğ�zēĭ(Elman,.1990)Ā.þ/Ë>ħÒĦŏńŒĚÀ�ĐĮĕ¬Ý�ğʼnĶŅŒÝĚĠ+®ś�®ĝĜğ�®Ċ][¢ĞĠKµĚċĭëďěĉīe8ğÀ�ğ�ÂěĦĝđĆĭĀ..�ř.vec(�)Avec(�)+vec(F)=vec(F�).(�;0�)Ġ�Ýğçl�»ě�ºđĆĭĀ.........

üq�!!ĸŗŇĽşþryËWiki..(hDp://dumps.wikimedia.org/jawiki/latest).Śþ12/sep/2016.ĞŀĴŕœŗņ.ŚþMeCabĞĪĘę]f±Ř/ËřĞ()..ĸŗŇĽĺijľş.MeCabğ'*1,265,697,011]f±(()±�)Ā'�:nĊ5:x�ğËİáąĕ�ĝĬnĚ1,140,357.(vocabularies),.ŅŗĿŒĚ5,774,139ŅŗĶŕĀ�.ŚWord2vecĞĪĬJ¶.þþWord2vecğ�!Ġ128ĀĴIJŕņĴW.10,........Í �ĺŕňőŕķ.10,.skipgram.İ�ąĕĀ..Æ|OÌş.wikipediaĞ'�ēĭ/ËěNTTńŗĿʼnŗĽ(�ßNTTADBě9ģ)ĞI=ēĭ/ËěğŌłŁŕķĊěĮĕ17,842ËĀ�

Fig..1..5�b�ělK�ğ¤Þ;�

ü�_ğ�K�.�_ĐīĞŏńŒğÆ|ēĭİÕħśJ¶ĐĮĕʼnĶŅŒĊĜğĪĆĝ^)İ}ĕđęąĭğĉİk�ēĭ�KĚĄĭĀë lK°ZİjĎĭďěĚśpËĊ'�đĕĬv�B.ēĭÅÉKĞĪĭi�ݽ-�gēĭďěİÊĦĭĀ!

a) word frequency in NTT-DB�

b) written word imageability in NTT-DB�

c) written word familiarity in NTT-DB �

frequency count�pred

icte

d fr

eque

ncy

coun

t�V-imageablity�pr

edic

ted

V-i

mag

eabi

lity�

V-familiarity�

pred

icte

d V

-fam

ilia

rity�

ü0·o�.�DÙguŖÓ¿#.(1999,.2000)..ryËğË\�b­ŝU/ËÄMZś­7U/ËæZś�¥?Ŝ.

�ÝQGŖ�ãà¦âŖ�ÃÎEŖÑT~Ŗ����ŖDÙguŖÓ¿#.(2008)..ryËğË\�b­8U/Ëa bś�¥?Ŝ�

Mikolov,.T.,.Yih,.W..tau,.&.Zweig,.G..(2013)..LinguisYc.regulariYes.in.conYnuous.space.word.representaYons..In.Proceedings.of.the.2013.Conference.of.the.North.American.Chapter.of.the.AssociaYon.for.ComputaYonal.LinguisYcs:.Human.Language.Technologies.(NAACL),.Atlanta,.WA,.USA..

�.�

WOMAN�AUNT�

UNCLE�MAN�

QUEEN�

KING�

QUEENS�

KINGS�

QUEEN�

KING�

ü£¢!.NTT.ńŗĿʼnŗĽĻőŗľĞ3ÛĐĮęąĭ/ËÄMZĈĪĢ/ËæZ(DÙěÓ¿,.1999,.2000)ś/Ëa b(�Ýī,2008)ĝĜğË\�bĠśryËğ�āĝ�bİ2tđĕ�b�ĚĄĭĀ/ËÄMZĠ357ğÁé¸ĞĪĭÅ¢ÉK�ĚĄĬś�ğ5/ËĞOēĭĂĝĒĦăğ©ZİÀđęąĭĀ/ËæZĠp¹12Y(İĸŗŇĽěđę/Ëğ'�:nİnćĭďěĞĪĘę�gĐĮĕĨğĚĄĭĀĥĕ/Ëa bĠś/ËÄMZě6�ĞÅ¢ÉK�ĚĄĬś5/ËĞOēĭe8ğijŎŗļđĩēĐİÀđęąĭĀ�qśword2vecŘe.g.,.Mikolov,.et.al.,.2013)İ�ąę�¯ĐĮĭŏńŒĠś¼CĝÚğŃĵĽŅİJ¶ēĭďěĚś/Ëğ$Ïĩ�oÿĔđęśe8¢¬ÝİÀ�4»ĚĄĭě·ćīĮęąĭĀy§«ĠśË\�bńŗĿʼnŗĽğ�ěword2vecĞĪĘęJ¶ĐĮĕ/ËğʼnĶŅŒ¬ÝÀ�ěğÞ�İÆ|ēĭďěİ£¢ěđĕĀēĝįėśword2vecğʼnĶŅŒ¬ÝĠ�İJ¶đś�İÀ�đęąĭğĉİsīĉĞēĭďěİe;đśa�Léĉī`īĮĕNTTńŗĿʼnŗĽğd@ěryËĴIJĵŊńIJığ/Ë>ħÒĦŏńŒěİ�ÐēĭďěĚś�ńŗĿĚÀ�ĐĮĕd@ğÞÔğ�ÈİÊĦĕĀ�

Fig..1'..5�b�ělK�ğ¤Þ;�

2 3 4 5 6

34

56

78

afam

alldata$afam

alld

ata$

‘fitte

d af

am‘

A-familiarity�

pred

icte

d A

-fam

ilia

rity�

c') spoken word familiarity in NTT-DB�

a') word frequency in Wiki�

1 2 3 4 5

12

34

56

logFreqWiki

alldata$logFreqwiki

alld

ata$

‘fitte

d lo

gFre

qwik

i‘pr

edic

ted

freq

uenc

y co

unt�

frequency count�

2 3 4 5 6 7

34

56

7

ImgA

alldata$ImgA

alld

ata$

‘fitte

d Im

gA‘

A-imageablity�pred

icte

d A

-im

agea

bili

ty�b') spoken word imageability in NTT-DB�

üdemo!ĺijŅ�.y¡ÀĚ�ąĕword2vecJ¶�ĦńŗĿŘvector. datařśĈĪĢpython. ( jupyter.notebook)ĞĪĭword2vecğĺŕňŒĸŗņ:.hDps://github.com/ShinAsakawa/2017jpa/�ĕĖđńŗĿĠ. 2017Y7wğryËĴIJĵŊńIJıĀ�!n. 200,. 300,.ğ�ªĀıŒĹőľōĠëskipgram,ëĈĪĢëCBOWĀ.

ü²}ŘØ:V(|ř!.17,842Ëğword2vecŇŐŎŗĿ(128�řĞĪĭ5Ë\�b�ğ:VŏńŒ..adjustedAR2.NTTADB/ËæZŘOnřþëëëëëëëëëëëëëëëëëë.0.5495.NTTADB/Ëa bŘoHřëëëëëëëëëëëëëëëëë.0.4232.NTTADB/ËÄMZŘoHřþëëëëëëëë..........0.3741..Wikipedia%ğ/Ë'�æZŘOnřþ...0.6539.NTTADB/Ëa bŘåAř........................0.4242.NTTADB/ËÄMZŘåAřþëëëëëëëë..........0.3189.

ü·N..ûøùôíúõóĞĪĬJ¶đĕ5/ËğʼnĶŅŒ�ĞĪĭË�bğlKĠśæZĞOđęêg´ĚĄĭĀ�qśÅ¢ĝÉK�ĚĄĭÄMZ1Ģa bĞOđęĠæZğlKĪĬ,ĭĀ�¸İ�Ðēĭěÿ¾XĝĊīa bğqĊ°ZĊêąĀa bĊe8ğijŎŗļđĩēĐݨē�bĚĄĭďěĉīśûøùôíúõóğŇŐŎŗĿ¬ÝĊe8ěÞÔēĭďěěÞ�đęąĭě·ćīĮĭĀ�ĥĕśæZğlKĈąęĠśJ¶Ğ�ąĕòö÷öĞĈčĭæZğqĊêg´ĚĄĬśðññìïîĊp¹Ğ'�ēĭæZĚĄĭěąĆ�bğÖąĊ�ĮęąĭĀĥĕśoH/ËÄMZĞ�ĤęåA/ËÄMZğlKg´ĊcąĀďĮĠoHŘ�HřĔğĨğäđĐĝĜğ�bĊÞ�đęąĭĕħě·ćīĮĭĀ�qśa bĠ"*ŎńIJıĞĪĭÖąĠPĐČś��bğoH/ËěåA/ËĞĪĭe8ğijŎŗļđĩēĐĞCċĝÖąĊĝąĕħě·ćīĮĭĀ!

APPENDIX.A.word2vecĚJ¶đĕvector�ğ.Ş�!ĞĪĭ<ěè×ğÞ�.ŘMikolov.et.al.,2013þĞĪĭř�

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

୰ᅜ

᪥ᮏ

ࢶࢻ

ࢫࢩࢠ

ࢥࢺ

ி

ᮾி

ࢡࢫ

ࢻࢻ

APPENDIX.BþNTTADB3ÛğæZśÄMZśa bÝğ¤Þ.(DÙśÓ¿ś1999;.2000Š�Ýī,.2008ĪĬ)�.