Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels

Publication information

Authors	NOVOTNÁ Tereza, HARAŠTA Jakub
Year of publication	2025
Type	Article in proceedings
Conference	JURIX 2025 Proceedings (Frontiers in Artificial Intelligence and Applications, volume 416: Legal Knowledge and Information Systems)
MU faculty / unit	Faculty of Law
DOI	https://doi.org/10.3233/FAIA251605
Keywords	legal information retrieval; case law; embeddings; evaluation; noisy labels; Czech Constitutional Court
Description	Retrieving relevant case law remains a time-consuming task. We compare two embedding models for Czech Constitutional Court decisions: (i) a large general-purpose OpenAI embedder and (ii) a domain-specific BERT trained from scratch on ~34,000 decisions. We introduce a noise-aware evaluation using IDF-weighted keyword overlap as graded relevance, dual thresholds (0.20, 0.28), paired-bootstrap significance, and nDCG diagnostics. Despite conservative absolute nDCG due to noisy institutional labels, the OpenAI embedder consistently and significantly outperforms the domain BERT across all ranks and thresholds. Our framework enables robust evaluation under imperfect gold standards typical of legacy judicial databases.
Related projects:	Forensic Support for Building Trust in Smart Software Ecosystems; CEDMO 2.0 NPO
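
The evaluation outlined in the description (IDF-weighted keyword overlap as graded relevance, a relevance threshold, nDCG, and a paired bootstrap) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The smoothed IDF formula, the weighted-Jaccard form of the overlap, the rule that overlaps below the threshold count as zero gain, and all function names are assumptions made for illustration; the record itself states only the threshold values 0.20 and 0.28.

```python
import math
import random
from collections import Counter


def idf_weights(keyword_sets):
    """Smoothed IDF per keyword over a collection of keyword sets (assumed formula)."""
    n = len(keyword_sets)
    df = Counter(kw for kws in keyword_sets for kw in set(kws))
    return {kw: math.log((1 + n) / (1 + c)) + 1.0 for kw, c in df.items()}


def weighted_overlap(a, b, idf):
    """IDF-weighted Jaccard overlap of two keyword sets (assumed form of the overlap)."""
    a, b = set(a), set(b)
    union = sum(idf.get(k, 0.0) for k in a | b)
    if union == 0:
        return 0.0
    return sum(idf.get(k, 0.0) for k in a & b) / union


def thresholded_gain(overlap, threshold):
    """Keep the overlap as a graded gain only if it reaches the threshold (assumption)."""
    return overlap if overlap >= threshold else 0.0


def dcg(gains):
    """Discounted cumulative gain with the usual log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_gains, all_gains, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(all_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of query resamples in which system A is not better than system B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            not_better += 1
    return not_better / n_resamples
```

In this sketch, each query would be scored once per threshold (0.20 and 0.28), and the per-query nDCG values of the two embedders would be passed to `paired_bootstrap`; a small returned fraction suggests the difference is unlikely to be an artifact of query sampling.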