
Publication information
Verbatim Memorisation and Large Language Models and EU Copyright Law
Authors | |
---|---|
Year of publication | 2025 |
Type | Article in a scholarly journal |
Journal / Source | IRI§ |
Citation | |
Keywords | Data Memorisation, AI, Copyright |
Description | Empirical studies suggest that – although they do not technically store the raw training dataset – language models, as statistical models assigning a probability to a sequence of words, may allow the extraction of hundreds of verbatim text sequences from their training data. If language models are trained on publicly available data, such data memorisation might therefore lead to infringement of copyright and database rights. The recently adopted pair of exceptions to copyright and database protection for the purposes of so-called "text and data mining", introduced by the CDSM Directive, could prove pivotal in justifying the use of publicly available data to train artificial intelligence. However, the applicability of the TDM exceptions is limited both as to their purpose, the generation of new information, and as to the scope of permitted acts, which covers solely the reproduction or extraction of protected content. Although language model providers adopt additional measures to prevent data memorisation and the dissemination of verbatim snippets – such as de-duplication or output filters – these measures might not be bulletproof, especially in light of jailbreaking, which may manipulate AI models into bypassing them. The question remains: is there a meaningful solution that prevents copyright infringement without hindering the training of language models on publicly available data? |