Resource Overview
EVBCorpus is an English-Vietnamese bilingual corpus.
The EVBCorpus contains over 10 million words from 15 bilingual books, 100 parallel English-Vietnamese / Vietnamese-English texts, 250 parallel law and ordinance texts, and 1,000 news articles. The composition, annotation, encoding, and availability of the corpus are meant to facilitate the development of language technology and studies in bilingual terminology extraction, primarily for the English-Vietnamese-English language pair.
English-Vietnamese Bilingual Corpus (EVBCorpus)
Building the EVBCorpus involved four main steps: (1) collecting data and aligning bitext at the paragraph level; (2) aligning bitext at the sentence level; (3) linguistic analysis and tagging; and (4) annotating and correcting the corpus with toolkits. As a result, the EVBCorpus was aligned at the sentence level, and the part of the corpus containing 1,000 news articles was aligned semi-automatically at the word level.
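To make the two alignment levels concrete, here is a minimal Python sketch of how a sentence-aligned pair with optional word-level links might be represented. The corpus's actual file format is not described here, so everything in the sketch is an assumption for illustration: the `SentencePair` class, the `parse_links` and `aligned_words` helpers, the whitespace tokenization, and the `i-j` link notation (a convention commonly used by word-alignment tools) are hypothetical and not taken from the EVBCorpus distribution.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SentencePair:
    """One sentence-aligned unit (hypothetical structure, not the corpus format)."""
    english: List[str]
    vietnamese: List[str]
    # (english_index, vietnamese_index) pairs; stays empty for pairs
    # that are aligned only at the sentence level.
    links: List[Tuple[int, int]] = field(default_factory=list)


def parse_links(text: str) -> List[Tuple[int, int]]:
    """Parse whitespace-separated 'i-j' index pairs, e.g. '0-0 1-1'."""
    pairs = []
    for item in text.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def aligned_words(pair: SentencePair) -> List[Tuple[str, str]]:
    """Return the (English token, Vietnamese token) pairs named by the links."""
    return [(pair.english[i], pair.vietnamese[j]) for i, j in pair.links]


# Illustrative example sentence, not drawn from the corpus.
pair = SentencePair(
    english="I love books".split(),
    vietnamese="Tôi yêu sách".split(),
    links=parse_links("0-0 1-1 2-2"),
)

for en, vi in aligned_words(pair):
    print(en, "<->", vi)
```

A pair from the sentence-aligned portion would simply leave `links` empty, while a pair from the word-aligned news portion would carry explicit index links.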
If you ar