Publication information
From Word of the Year to Word of the Week: Daily-updated Monitor Corpora for 25 Languages
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in proceedings |
| MU Faculty or unit | |
| Citation | |
| www | Electronic lexicography in the 21st century (eLex 2025): Intelligent lexicography. Proceedings of the eLex 2025 conference |
| Keywords | monitor corpus; web corpus; trend analysis; neologism detection; word sense shift analysis |
| Attached files | |
| Description | This paper introduces a long-term, privately funded programme to collect time-stamped monitor Trend Corpora in a wide range of languages, designed to study linguistic trends and language change over time. Accessible via the Sketch Engine platform, the corpora range in size from 3 million tokens (Irish) to 100 billion tokens (English). Twenty-five languages are covered, including Arabic, English, French, German, Italian, Polish, Portuguese, and Spanish, with ten more to be added soon. Corpus texts come from global websites providing RSS/Atom feeds, mostly news, covering content from as early as 2014. New articles (up to 180,000 on weekdays) are collected daily, and updates are published twice a week. Processing includes text cleaning, de-duplication, and linguistic annotation. The project builds on the JSI Newsfeed Corpus (Krek et al., 2017), but since 2021 for English and 2023 for other languages it has expanded independently in scope and data sources. Trend Corpora in Sketch Engine support diachronic analysis across multiple time frames and integrate with features such as concordance search and Word Sketch. The paper also presents feed activity statistics and showcases examples of functionality offered by Trend Corpora, such as neologism detection, word sense shift analysis, and timeline-based analysis of trending words and phrases. |