Vocabulary Size of Czech Native Speakers: A Statistical Approach

Publication information

Authors	BLAHUŠ Marek, JAKUBÍČEK Miloš, KOVÁŘ Vojtěch, KOVAŘÍK František
Year of publication	2025
Type	Article in proceedings
MU Faculty / Unit	Faculty of Informatics
Citation
www	Electronic lexicography in the 21st century (eLex 2025): Intelligent lexicography. Proceedings of the eLex 2025 conference
Keywords	vocabulary size; native speaker; manual annotation; semi-automatic dictionary drafting; Dictionary Express
Attached files	eLex2025-10-Blahus_etal.pdf
Description	This paper explores the theory of measuring vocabulary size, including the various methods that can be used and the parameters that have to be set. We have examined experiments carried out on English and Dutch. Goulden et al. (1990) claim that the average native speaker knows about 17,000 English base words (non-derived words). Keuleers et al. (2015) and Brysbaert et al. (2016) claim that the average native speaker with secondary education knows about 42,000 headwords (lemmas). We have conducted an experiment on Czech similar to that of Keuleers and Brysbaert, using 100,000 letter sequences drawn from the word lists of large web corpora as input. We assume that the vocabulary size of Czech native speakers (and of native speakers of any language) could be larger, exceeding 57,000 Czech headwords, if participants were given more input (150,000 sequences or more) or if the specialized terminology of their fields of interest were counted.
