Publication information
Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in proceedings |
| Conference | JURIX 2025 Proceedings (Frontiers in Artificial Intelligence and Applications, volume 416: Legal Knowledge and Information Systems) |
| MU Faculty or unit | |
| Citation | |
| Web | Full text of the result |
| DOI | https://doi.org/10.3233/FAIA251605 |
| Keywords | legal information retrieval; case law; embeddings; evaluation; noisy labels; Czech Constitutional Court |
| Description | Retrieving relevant case law remains a time-consuming task. We compare two embedding models for Czech Constitutional Court decisions: (i) a large general-purpose OpenAI embedder and (ii) a domain-specific BERT trained from scratch on ~34,000 decisions. We introduce a noise-aware evaluation using IDF-weighted keyword overlap as graded relevance, dual thresholds (0.20, 0.28), paired-bootstrap significance, and nDCG diagnostics. Despite conservative absolute nDCG due to noisy institutional labels, the OpenAI embedder consistently and significantly outperforms the domain BERT across all ranks and thresholds. Our framework enables robust evaluation under imperfect gold standards typical of legacy judicial databases. |
| Related projects | |
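
The evaluation steps named in the description (IDF-weighted keyword overlap as graded relevance, a relevance threshold, nDCG, and a paired bootstrap) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The smoothed IDF formula, the weighted-Jaccard form of the overlap, the rule that gains below the threshold are set to zero, and all function names are assumptions made for illustration only; the record itself gives just the threshold values 0.20 and 0.28.

```python
import math
import random
from collections import Counter


def idf_weights(keyword_sets):
    """Smoothed IDF per keyword over a collection of keyword sets (assumed formula)."""
    n = len(keyword_sets)
    df = Counter(k for ks in keyword_sets for k in set(ks))
    return {k: math.log((n + 1) / (c + 1)) + 1.0 for k, c in df.items()}


def weighted_overlap(a, b, idf):
    """IDF-weighted Jaccard overlap of two keyword sets, in [0, 1] (assumed form)."""
    a, b = set(a), set(b)
    union = sum(idf.get(k, 0.0) for k in a | b)
    if union == 0.0:
        return 0.0
    return sum(idf.get(k, 0.0) for k in a & b) / union


def graded_relevance(overlap, threshold):
    """Keep the overlap as the gain only if it reaches the threshold (assumed rule)."""
    return overlap if overlap >= threshold else 0.0


def ndcg_at_k(gains_ranked, all_gains, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(gains_ranked) / ideal if ideal > 0 else 0.0


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """One-sided paired bootstrap over per-query scores.

    Returns the share of resamples in which system A's mean score
    is not higher than system B's.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            not_better += 1
    return not_better / n_resamples
```

In such a setup, each query decision's retrieved list would be scored with `ndcg_at_k` once per threshold, and the per-query scores of the two embedding models would then be passed to `paired_bootstrap` to check whether the difference is stable under resampling.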