Building a 50M Corpus of Tajik Language

Publication details

Authors	DOVUDOV Gulshan; POMIKÁLEK Jan; SUCHOMEL Vít; ŠMERK Pavel
Year of publication	2011
Type	Article in Proceedings
Conference	Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit	Faculty of Informatics
Web	https://nlp.fi.muni.cz/raslan/2011/paper07.pdf
Field	Linguistics
Keywords	language corpora; corpus; corpus building; Tajik
Description	The paper presents by far the largest available computer corpus of the Tajik language, comprising more than 50 million words. Two different approaches were used to obtain the texts for the corpus. The paper describes both, discusses their advantages and disadvantages, and gives some statistics for the two resulting partial corpora. It then characterizes the merged corpus and finally discusses possible future improvements.
Related projects:	Centrum komputační lingvistiky
