Publication details
Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in Proceedings |
| Conference | Recent Advances in Slavonic Natural Language Processing, RASLAN 2025 |
| MU Faculty or unit | |
| Citation | |
| web | Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2025. |
| Keywords | Electronic health records; EHR; corpus; dataset; redundancy; near-duplicate; deduplication; Czech. |
| Description | Electronic health records (EHRs) contain extensive repetition arising from templated structures, copy-paste practices, and recurrent clinical phrasing. While such redundancy facilitates documentation consistency, it also affects the efficiency of data processing and downstream natural language processing applications. This study investigates the internal textual redundancy of a Czech dataset of narrative parts of oncology health records using a fast near-duplicate detection method and a subsequent clustering analysis. We quantify the degree and distribution of repeated content across documents, visualize the resulting clusters to identify patterns, and experiment with creating cluster-aware pruned datasets for more efficient language model training. For comparison, we report baseline redundancy measures on a Czech literary corpus, illustrating the contrast between natural and clinical text. In addition to providing insight into how redundancy shapes the linguistic and informational landscape of Czech EHRs, we discuss our findings in the context of state-of-the-art clinical LLMs for English, making a case not only for continued development of redundancy-mitigating approaches, but also for the use of synthetic health record data. |