Publication information

Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels

Authors

NOVOTNÁ Tereza, HARAŠTA Jakub

Year of publication 2025
Type Article in proceedings
Conference JURIX 2025 Proceedings (Frontiers in Artificial Intelligence and Applications, volume 416: Legal Knowledge and Information Systems)
MU Faculty or unit

Faculty of Law

DOI https://doi.org/10.3233/FAIA251605
Keywords legal information retrieval; case law; embeddings; evaluation; noisy labels; Czech Constitutional Court
Description Retrieving relevant case law remains a time-consuming task. We compare two embedding models for Czech Constitutional Court decisions: (i) a large general-purpose OpenAI embedder and (ii) a domain-specific BERT trained from scratch on ~34,000 decisions. We introduce a noise-aware evaluation using IDF-weighted keyword overlap as graded relevance, dual thresholds (0.20, 0.28), paired-bootstrap significance, and nDCG diagnostics. Despite conservative absolute nDCG due to noisy institutional labels, the OpenAI embedder consistently and significantly outperforms the domain BERT across all ranks and thresholds. Our framework enables robust evaluation under imperfect gold standards typical of legacy judicial databases.
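
The evaluation named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not give the exact formulas, so the sketch assumes a smoothed IDF, an IDF-weighted Jaccard overlap as the graded gain, a threshold that zeroes out gains below the cutoff (applied once at 0.20 and once at 0.28), standard log2-discounted nDCG@k, and a one-sided paired bootstrap over per-query nDCG differences. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def idf_weights(doc_keywords):
    """Smoothed IDF per keyword: ln((1 + N) / (1 + df)) + 1 (assumed form)."""
    n = len(doc_keywords)
    df = Counter(kw for kws in doc_keywords.values() for kw in set(kws))
    return {kw: math.log((1 + n) / (1 + c)) + 1.0 for kw, c in df.items()}


def weighted_overlap(a, b, idf):
    """IDF-weighted Jaccard overlap in [0, 1] between two keyword lists.

    Assumption: every keyword appears in `idf` (built from the same corpus).
    """
    a, b = set(a), set(b)
    union = sum(idf[k] for k in a | b)
    return sum(idf[k] for k in a & b) / union if union else 0.0


def thresholded_gain(overlap, threshold):
    """Keep the graded overlap as gain only if it reaches the threshold."""
    return overlap if overlap >= threshold else 0.0


def dcg(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_gains, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(all_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0


def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Paired bootstrap over per-query score differences (A minus B).

    Returns the observed mean difference and the fraction of resamples in
    which A does not beat B, used here as a one-sided p-value estimate.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n <= 0:
            not_better += 1
    return observed, not_better / n_resamples


if __name__ == "__main__":
    # Toy keyword labels, invented purely for illustration.
    labels = {
        "q": ["fair trial", "costs"],
        "d1": ["fair trial", "costs"],
        "d2": ["fair trial", "property"],
        "d3": ["property"],
    }
    idf = idf_weights(labels)
    ranking = ["d2", "d1", "d3"]  # hypothetical output of one embedder
    for threshold in (0.20, 0.28):
        gains = [
            thresholded_gain(weighted_overlap(labels["q"], labels[d], idf), threshold)
            for d in ranking
        ]
        print(threshold, ndcg_at_k(gains, gains, k=3))
```

In practice the per-query nDCG@k values of the two embedders would be collected into two aligned lists and passed to `paired_bootstrap`, once per rank cutoff and per threshold.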
