American National Corpus

URL: http://www.anc.org/
Type: text corpus and website
Language: American English
Service entry: 1990

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. The ANC currently covers a range of genres, including emerging ones such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

The ANC is available from the Linguistic Data Consortium. A fifteen-million-word subset of the corpus, called the Open American National Corpus (OANC), is freely available from the ANC website with no restrictions on its use.

The corpus and its annotations are provided according to the specifications of the Linguistic Annotation Framework of ISO/TC 37 SC4. Using a freely distributed transduction tool (ANC2Go), the corpus and a user-chosen selection of annotations can be delivered in multiple formats, including CoNLL IOB format, an XML format conformant to the XML Corpus Encoding Standard (XCES) (usable with the British National Corpus's XAIRA search engine), a UIMA-compliant format, and formats suitable for input to a wide variety of software. Plugins for importing the annotations into the General Architecture for Text Engineering (GATE) are also available.
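As an illustration of what working with an IOB-style export can look like, the following is a minimal Python sketch. It assumes a common three-column layout (token, part-of-speech tag, chunk tag) with blank lines between sentences; the exact columns produced by ANC2Go are not specified here, so the layout, the sample text, and the function names `read_iob` and `chunks` are all hypothetical.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (token, POS tag, IOB chunk tag)

# Invented sample in an assumed token / POS / chunk column layout.
SAMPLE = """\
The DT B-NP
American NNP I-NP
National NNP I-NP
Corpus NNP I-NP
is VBZ B-VP
freely RB B-ADVP
available JJ B-ADJP
. . O
"""


def read_iob(text: str) -> List[List[Row]]:
    """Parse whitespace-separated columns; blank lines separate sentences."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk = line.split()
        current.append((token, pos, chunk))
    if current:
        sentences.append(current)
    return sentences


def chunks(sentence: List[Row]) -> List[Tuple[str, List[str]]]:
    """Group IOB chunk tags into (label, tokens) spans."""
    spans: List[Tuple[str, List[str]]] = []
    label, words = None, []
    for token, _pos, tag in sentence:
        if tag == "O":
            # Outside any chunk: close the open span, if there is one.
            if words:
                spans.append((label, words))
            label, words = None, []
        elif tag.startswith("B-") or label != tag[2:]:
            # A B- tag (or an I- tag with a new label) starts a new span.
            if words:
                spans.append((label, words))
            label, words = tag[2:], [token]
        else:
            # An I- tag continuing the current span.
            words.append(token)
    if words:
        spans.append((label, words))
    return spans


if __name__ == "__main__":
    for sent in read_iob(SAMPLE):
        print(chunks(sent))
```

Grouping the tags into spans recovers the shallow-parse chunks (noun phrases, verb phrases, and so on) from the per-token encoding.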

The ANC differs from other corpora of English in that it is richly annotated, with several different part-of-speech annotations (Penn tags, CLAWS5 and CLAWS7 tags), shallow parse annotations, and annotations for several types of named entities. Additional annotations are added to all or parts of the corpus as they become available, often through contributions from other projects. Unlike online searchable corpora, which because of copyright restrictions allow access only to individual sentences, the entire ANC is available, enabling research that involves, for example, the development of statistical language models and full-text linguistic annotation.
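To make concrete why access to full texts matters for statistical language modelling, here is a minimal Python sketch of a maximum-likelihood bigram model. It is not a method taken from the ANC project; the toy sentences and the names `train_bigram` and `bigram_prob` are invented for illustration, and a real model would be trained on complete corpus texts and use smoothing.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(sentences: List[List[str]]) -> Dict[str, Counter]:
    """Count how often each token follows each preceding token."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tokens in sentences:
        # Pad with sentence-boundary markers so edges are modelled too.
        padded = ["<s>"] + [t.lower() for t in tokens] + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts: Dict[str, Counter], prev: str, cur: str) -> float:
    """Unsmoothed maximum-likelihood estimate of P(cur | prev)."""
    following = counts.get(prev)
    if not following:
        return 0.0
    return following[cur] / sum(following.values())


# Toy data standing in for tokenized corpus sentences (hypothetical).
TOY = [
    ["the", "corpus", "is", "open"],
    ["the", "corpus", "is", "annotated"],
    ["the", "data", "is", "open"],
]

if __name__ == "__main__":
    model = train_bigram(TOY)
    print(bigram_prob(model, "the", "corpus"))
    print(bigram_prob(model, "is", "open"))
```

Because the estimates depend on counts of adjacent words across whole texts, they cannot be derived from a search interface that returns only isolated sentences.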

ANC annotations are produced automatically and are not validated. A 500,000-word subset called the Manually Annotated Sub-Corpus (MASC) is annotated for approximately 20 different kinds of linguistic annotation, all of which have been hand-validated or manually produced. These include Penn Treebank syntactic annotation, WordNet sense annotation, and FrameNet semantic frame annotation, among others. Like the OANC, MASC is freely available for any use and can be downloaded from the ANC site or from the Linguistic Data Consortium. It is also distributed in part-of-speech tagged form with the Natural Language Toolkit.

The ANC and its sub-corpora differ from similar corpora primarily in the range of linguistic annotations provided and in the inclusion of modern genres that do not appear in resources such as the British National Corpus. In addition, because the initial target use of the corpora was the development of statistical language models, the full data and all annotations are available, in contrast to the Corpus of Contemporary American English (COCA), which is available only through web-based search.

Continued growth of the OANC and MASC depends on contributions of data and annotations from the computational linguistics and corpus linguistics communities.

See also

  • British National Corpus
  • Oxford English Corpus
  • Corpus of Contemporary American English (COCA)

References

  • Ide, N. (2008). The American National Corpus: Then, Now, and Tomorrow. In Michael Haugh, Kate Burridge, Jean Mulder and Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, Cascadilla Proceedings Project, Somerville, MA.
  • Ide, N., Suderman, K. (2004). The American National Corpus First Release. Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), Lisbon, 1681-84.
  • Ide, N., Baker, C., Fellbaum, C., Passonneau, R. (2010). The Manually Annotated Sub-Corpus: A Community Resource For and By the People. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

External links