Introduction to the American National Corpus (ANC)
Author: admin  Source: original to this site  Updated: 2011-11-16
■ANC = The American National Corpus
The American National Corpus (ANC) is currently the largest corpus documenting contemporary American English usage. It comprises written texts of all kinds and transcripts of spoken material produced from 1990 onward. Two releases of the ANC have been published: the first contains 10 million words of spoken and written American English, and the second contains 22 million words.

■The First Release of the ANC
The First Release of the ANC is a beta version. It contains over 10,000,000 words of written and spoken American English, annotated for lemma and part of speech. It is available for research and education for a nominal licensing fee from the Linguistic Data Consortium. Commercial users can obtain the corpus and gain rights to use it in commercial products by joining the ANC Consortium.

The texts included in the first 10 million words of the ANC are those that were received first, so the corpus is not balanced. There has been no hand-validation of the XML tagging or the part-of-speech annotation tags. Headers are minimal, although they contain fairly complete information concerning domain, subdomain, subject, audience, and medium. Check the list of known bugs and caveats for a description of the limitations we are currently aware of.

One of the aims of releasing this first 10 million words is to get feedback from the community about its structure and annotation, so that modifications can be made, if necessary, for the final release of the full 100 million words. We therefore invite comments and bug reports from the community of ANC users. Please contact anc@cs.vassar.edu.

■The Second Release of the ANC
The Second Release of the American National Corpus contains over 22,000,000 words of written and spoken American English, annotated for lemma, part of speech, noun chunks, and verb chunks. Part-of-speech tags using the Penn tagset are included for all data in the Second Release, and many documents are also PoS-tagged using the Biber tagset. The ANC Second Release is available for research and education for a nominal licensing fee from the Linguistic Data Consortium. Commercial users can obtain the corpus and gain rights to use it in commercial products by joining the ANC Consortium. Please consult the LDC Catalog entry for the ANC Second Release.

The First and Second Releases of the ANC include the materials that have been acquired to date, and therefore the current release of the ANC is not balanced. There has been no hand-validation of the XML tagging or the annotation. Headers are typically minimal, although most contain complete information concerning domain, subdomain, subject, audience, and medium. Check the list of known bugs and caveats for a description of the limitations we are currently aware of.

One of the aims of the Second Release is to get feedback from the community about its structure and annotation, so that modifications can be made, if necessary, for the final release of the full 100 million words. We therefore invite comments and bug reports from the community of ANC users. Please contact anc@cs.vassar.edu.

■ANC address: more corpus addresses:
Entered by: admin  Editor: admin