Publication information

Lemmatization of Czech and Croatian Noun Clusters for Terminology Extraction.

Authors

BLAHUŠ Marek, PETREKOVÁ Katarína

Year of publication 2025
Type Article in proceedings
Conference Recent Advances in Slavonic Natural Language Processing, RASLAN 2025
MU Faculty or unit

Faculty of Informatics

Citation
Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2025.
Keywords noun clusters; lemmatization; terminology extraction
Description During terminology extraction, terms discovered in corpora are presented in their canonical form. Lemmatization of multi-word terms consisting of noun clusters can be ambiguous due to the lack of information on their internal structure. In this paper, we show that grammatical case alone is often not sufficient for the construction of canonical forms of noun clusters. We focus on two-noun clusters in the genitive, which are the most frequent type with ambiguous parsing. Based on corpus research, we design rules that make use of multiple morphological categories to improve the lemmatization of noun clusters found in Czech and Croatian corpora. In addition to case, we also take note of gender, animacy, and whether the noun is a proper noun. The improvements lead to more accurate and more unified forms of the terms produced during terminology extraction for these two languages in Sketch Engine.