Publication details

Measuring Redundancy in Czech Electronic Health Records: Near-Duplicate Detection and Cluster Analysis

Authors

ANETTA Krištof, HORÁK Aleš

Year of publication 2025
Type Article in Proceedings
Conference Recent Advances in Slavonic Natural Language Processing, RASLAN 2025
MU Faculty or unit

Faculty of Informatics

Citation
web Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2025.
Keywords Electronic health records; EHR; corpus; dataset; redundancy; near-duplicate; deduplication; Czech.
Description Electronic health records (EHRs) contain extensive repetition arising from templated structures, copy-paste practices, and recurrent clinical phrasing. While such redundancy facilitates documentation consistency, it also affects the efficiency of data processing and downstream natural language processing applications. This study investigates the internal textual redundancy of a Czech dataset of narrative parts of oncology health records using a fast near-duplicate detection method and a subsequent clustering analysis. We quantify the degree and distribution of repeated content across documents, visualize the resulting clusters to identify patterns, and experiment with creating cluster-aware pruned datasets for more efficient language model training. For comparison, we report baseline redundancy measures on a Czech literary corpus, illustrating the contrast between natural and clinical text. In addition to providing insight into how redundancy shapes the linguistic and informational landscape of Czech EHRs, we discuss our findings in the context of state-of-the-art clinical LLMs for English, making a case not only for continued development of redundancy-mitigating approaches, but also for the use of synthetic health record data.