Publication details

From Word of the Year to Word of the Week: Daily-updated Monitor Corpora for 25 Languages

Authors

HERMAN Ondřej, JAKUBÍČEK Miloš, KRAUS Jan, SUCHOMEL Vít

Year of publication 2025
Type Article in proceedings
MU Faculty or unit

Faculty of Informatics

Citation
Electronic lexicography in the 21st century (eLex 2025): Intelligent lexicography. Proceedings of the eLex 2025 conference
Keywords monitor corpus; web corpus; trend analysis; neologism detection; word sense shift analysis
Description This paper introduces a long-term, privately funded programme to collect time-stamped monitor Trend Corpora in a wide range of languages, designed to study linguistic trends and language change over time. Accessible via the Sketch Engine platform, the corpora range in size from 3 million tokens (Irish) to 100 billion tokens (English). Twenty-five languages are covered, including Arabic, English, French, German, Italian, Polish, Portuguese, and Spanish, with ten more to be added soon. Corpus texts come from global websites providing RSS/Atom feeds, mostly news, covering content from as early as 2014. New articles (up to 180,000 on weekdays) are collected daily, and updates are published twice a week. Processing includes text cleaning, de-duplication, and linguistic annotation. The project builds on the JSI Newsfeed Corpus (Krek et al., 2017), but since 2021 for English and 2023 for other languages it has expanded independently in scope and data sources. Trend Corpora in Sketch Engine support diachronic analysis across multiple time frames and integrate with features like concordance search and Word Sketch. The paper also presents feed activity statistics and showcases examples of functionality offered by Trend Corpora, such as neologism detection, word sense shift analysis, and timeline-based analysis of trending words and phrases.
