csTenTen17, a Recent Czech Web Corpus

Publication information

Authors	SUCHOMEL Vít
Year of publication	2018
Type	Article in proceedings
Conference	Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2018
MU Faculty / Unit	Faculty of Informatics
www	https://nlp.fi.muni.cz/raslan/2018/paper10-Suchomel.pdf
Keywords	Czech corpus; web corpus; text processing
Description	This article introduces csTenTen17, a very large Czech text corpus for language research, compiled from texts downloaded in 2015, 2016 and 2017. The corpus consists of 10.5 billion words, double the size of its predecessor from 2012. A brief comparison with other recent Czech corpora follows.
Related projects:	LINDAT-Clarin project - Building and operation of a Czech node of the pan-European infrastructure for research
