Introduction
The rapid advancement of artificial intelligence systems — particularly generative models — has given rise to one of the most complex areas of contemporary technology law: the relationship between AI and copyright.
In recent years, this issue has moved to the forefront of legal, technological, and economic debate. On the one hand, AI unlocks new possibilities in content creation, data analysis, and the automation of creative processes. On the other, it raises fundamental questions about the permissible use of protected works and the allocation of responsibility for content generated by AI systems.
Public discourse often reduces the copyright dimension of AI to a single question: whether AI models can be trained on copyrighted material. In reality, this is only one element of a much broader legal landscape.
A proper legal analysis must take into account the entire lifecycle of AI systems — from the sourcing of training data, through model training and content generation, to the downstream use of AI outputs in business operations.
Within the European Union, answers to these questions emerge from several overlapping regulatory frameworks. Copyright law remains central, in particular the DSM Directive (Directive (EU) 2019/790), which introduced specific exceptions for text and data mining. At the same time, regulatory focus is increasingly shifting toward AI governance. The AI Act introduces new compliance obligations, especially for providers of general-purpose AI models.
From the perspective of organisations developing or deploying AI technologies, this relationship should be approached in a systemic way. It is not a standalone legal issue, but part of a broader technology governance framework — encompassing the management of training data, the assessment of legal risks associated with AI-generated content, and the design of effective compliance processes.
1. The Role of Copyright Across the AI System Lifecycle
The relationship between artificial intelligence and copyright manifests itself at multiple stages of an AI system’s lifecycle. Legal issues may arise both at the stage of sourcing and using training data, and later — when AI models generate content and such outputs are deployed in business activities.
At the early stages of AI projects, the way training data is sourced is of critical importance. Generative models require vast datasets, which may include both public domain materials and content protected by copyright. Questions about the lawfulness of using such data arise already at this point.
Further issues relate to the model training process itself. Depending on the system architecture, training may involve analysing large datasets, copying them into system memory, and transforming them into mathematical representations that enable the model to identify statistical patterns. From a legal perspective, this raises the question of whether the operations carried out during training — particularly the copying and processing of protected works — constitute “use” within the meaning of copyright law, and therefore fall within the scope of the exclusive rights of authors.
Additional complexities emerge in relation to content generated by AI systems. This includes the question of whether such outputs can qualify for copyright protection, as well as whether their use may infringe the rights of third parties.
For these reasons, there is a growing recognition of the need for a systemic approach to managing copyright in AI projects, often referred to as copyright governance for AI. This approach involves assessing copyright-related risks across the entire technology lifecycle.
2. Legal Foundations of the Relationship Between AI and Copyright
Copyright as the Starting Point
The primary point of reference remains the traditional framework of European Union copyright law. According to the established case law of the Court of Justice of the European Union, copyright protection is granted to works that constitute the author’s own intellectual creation, as confirmed, inter alia, in Infopaq (C-5/08) and Painer (C-145/10).
This means that copyright protects the expression of ideas, rather than ideas, facts, or information themselves. This distinction is particularly important in the context of AI systems, which learn by analysing vast datasets containing both protected elements and non-protected information.
Exclusive Rights of Authors
Copyright grants authors a range of exclusive rights, most notably the right of reproduction and the right of distribution. In the context of AI training, the right of reproduction is particularly significant, as it also covers the digital copying of content in data processing operations. For this reason, the use of protected works in the training of models raises the question of the legal basis for such processing under copyright law.
At the same time, copyright law provides for various exceptions and limitations to these exclusive rights. Their purpose is to strike a balance between protecting the interests of authors and enabling innovation, scientific research, and the free flow of information within the information society. In practice, this means that in certain circumstances, the use of protected works may be permissible even without the consent of the rights holder, provided that the conditions set out in the law are met.
In the context of the development of digital technologies, particular importance is attached to exceptions that allow for the automated processing of large datasets. Processes such as text analysis, identification of statistical patterns, and data mining form the foundation of many AI-driven technologies. For this reason, the EU legislator has introduced specific rules on text and data mining, which define the conditions under which protected content may be lawfully used in data analysis processes.
3. Text and Data Mining in EU Law
One of the most important regulatory instruments supporting the development of data-driven technologies, including artificial intelligence systems, is Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market (the Digital Single Market Directive, or DSM Directive). This framework introduced specific exceptions for text and data mining (TDM) into EU law, enabling the automated analysis of large datasets in research and technological processes.
Text and data mining (TDM) refers to the automated processing of large volumes of data, including text and other digital content, for the purpose of identifying patterns, trends, or statistical relationships. Such operations underpin many modern data analysis methods, including techniques used in the training of AI models.
The TDM exceptions introduced by the DSM Directive operate as limitations to the exclusive rights of authors, in particular the right of reproduction. Their objective is to enable automated analysis of large datasets while preserving the core mechanisms of copyright protection.
The DSM Directive provides for two main TDM exceptions. The first, set out in Article 3, applies to research activities carried out by research organisations and cultural heritage institutions. It allows for the reproduction of works and other protected subject matter for the purposes of TDM in scientific research.
The second exception, provided for in Article 4, has a broader scope. It permits reproductions and extractions of content for TDM purposes by other entities, including commercial actors, provided that access to the materials being analysed is lawful.
A key feature of this framework is the opt-out mechanism set out in Article 4(3) of the DSM Directive. It allows rights holders to reserve the use of their works for TDM purposes. Such a reservation must be expressed in an appropriate manner; for content made publicly available online, the Directive points to machine-readable means signalling that automated processing of the content is not permitted.
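The Directive does not prescribe a single technical format for such reservations, but in practice they are often expressed through machine-readable signals, robots.txt rules being one common example. The following sketch, using Python's standard-library robots.txt parser, shows the kind of automated check a data-collection pipeline might run before relying on the Article 4 exception; the crawler name and publisher file are hypothetical, and a real pipeline would also look for other reservation mechanisms (such as dedicated TDM-reservation headers or site terms).

```python
from urllib.robotparser import RobotFileParser

def tdm_allowed(robots_txt: str, url: str, crawler_name: str) -> bool:
    """Return True if robots.txt does not disallow `crawler_name` for `url`.

    Illustrative only: robots.txt is just one machine-readable signal a
    rights holder may use; absence of a rule here does not by itself
    establish that no reservation has been made.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler_name, url)

# Hypothetical publisher file reserving TDM against one named crawler.
robots = """\
User-agent: ExampleTDMBot
Disallow: /
"""
print(tdm_allowed(robots, "https://example.com/article", "ExampleTDMBot"))  # False
print(tdm_allowed(robots, "https://example.com/article", "OtherBot"))       # True
```

The design point is that the check runs before any content is fetched for analysis, so the reservation is respected at the earliest stage of the pipeline rather than remediated afterwards.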
The TDM regime is of particular importance in the context of AI development. In practice, many AI models — including generative models — are trained on very large datasets comprising text, images, and other digital materials. The identification of statistical patterns within such datasets functionally corresponds to TDM processes.
For this reason, the TDM exceptions introduced by the DSM Directive are often seen as a key component of the legal framework enabling the development of AI technologies in Europe. At the same time, the opt-out mechanism and the requirement of lawful access are designed to maintain a balance between the interests of entities developing data-driven technologies and the protection of the rights of authors and other rights holders.
TDM rules form an essential part of the legal framework governing data analysis in the European Union. In recent years, these rules have been complemented by provisions on AI system governance introduced under the AI Act, which establish additional obligations related to the governance of data used in AI systems.
4. The AI Act and New Compliance Obligations
The AI Act (Regulation (EU) 2024/1689) represents the first comprehensive legal framework governing the development and use of artificial intelligence systems in the European Union. While it does not alter the substantive rules of copyright protection, it introduces provisions that are highly relevant for the governance of data used in AI systems.
Particular importance is attached to the rules on general-purpose AI models (GPAI). The AI Act requires providers of such models to implement appropriate measures to ensure compliance with EU copyright law. In particular, it imposes obligations to adopt and maintain a copyright policy and to publish certain information regarding the data used in the training of models.
The objective of these provisions is to enhance transparency in the use of training data and to mitigate risks associated with potential copyright infringements. In practice, this entails the need to implement procedures that enable the identification of data sources, the assessment of conditions governing their use, and the consideration of any reservations expressed by rights holders.
In this context, the governance of training data becomes a critical component of AI projects. Organisations developing or deploying AI models should implement procedures that allow for the assessment of the lawfulness of data sourcing and use, including the analysis of the potential applicability of text and data mining exceptions under the DSM Directive.
As a result, the AI Act reinforces the importance of training data governance as a key element of legal risk management in AI-driven projects. Although it does not directly modify the scope of copyright protection, it significantly elevates the role of compliance processes designed to ensure that AI model training aligns with applicable copyright laws.
5. Training Data Governance
Training data governance is one of the key components of a systemic approach to ensuring that AI projects comply with copyright law. In practice, most legal risks associated with the use of protected content arise at the stage of data sourcing, preparation, and processing.
Training data governance encompasses not only technical aspects related to data quality and representativeness, but also the assessment of the legal basis for data use. In particular, organisations should be able to determine whether the data used in model training has been obtained lawfully and whether its use falls within the scope of permitted use or applicable exceptions under copyright law, including those relating to text and data mining.
From a copyright perspective, identifying data sources and the conditions governing their use is of central importance. This applies both to data obtained from publicly accessible online resources and to data made available under licence agreements or through collaboration with business partners. A lack of transparency in this area may lead to difficulties in assessing the risk of copyright infringement and in demonstrating compliance with applicable regulations.
An important element of training data governance is also the consideration of mechanisms such as the opt-out provided for under the DSM Directive. In practice, this requires assessing whether rights holders have reserved the use of their content for text and data mining purposes, and whether the organisation has appropriate mechanisms in place to respect such reservations.
From the perspective of the AI Act, training data governance gains additional significance as part of the broader compliance framework. Obligations relating to general-purpose AI models (GPAI), including the requirement to implement a copyright policy and transparency obligations, necessitate the formalisation of processes related to data sourcing and use.
In practice, this means that organisations should implement procedures enabling the documentation of data sources, the assessment of their legal status, and the monitoring of how such data is used in model training processes. This approach not only reduces the risk of copyright infringement but also enhances an organisation’s ability to demonstrate compliance in the event of audits or legal disputes.
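The documentation procedures described above can be supported by a simple structured record kept for each data source. The sketch below is purely illustrative: the field names, the example values, and the simplified eligibility heuristic are assumptions of this illustration, not a legal test.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataSourceRecord:
    """Minimal provenance record for one training-data source (illustrative)."""
    source_url: str
    acquired_on: date
    legal_basis: str     # e.g. "licence", "Art. 4 DSM TDM exception", "public domain"
    lawful_access: bool  # the access condition required by Art. 4 DSM Directive
    opt_out_checked: bool  # was a rights-holder reservation looked for?
    opt_out_found: bool    # was a reservation actually present?
    notes: str = ""

    def candidate_for_tdm_exception(self) -> bool:
        """Simplified heuristic, not legal advice: a source is a candidate for
        the commercial TDM exception only if access was lawful, a reservation
        was checked for, and none was found."""
        return self.lawful_access and self.opt_out_checked and not self.opt_out_found

record = DataSourceRecord(
    source_url="https://example.com/corpus",
    acquired_on=date(2024, 5, 1),
    legal_basis="Art. 4 DSM TDM exception",
    lawful_access=True,
    opt_out_checked=True,
    opt_out_found=False,
)
print(record.candidate_for_tdm_exception())  # True
```

A record of this kind serves the evidentiary purpose noted above: in an audit or dispute, the organisation can show, per source, when the data was obtained, on what legal basis, and whether reservations were considered.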
As a result, training data governance is becoming a central element of legal risk management in AI projects. Its importance extends beyond mere compliance with copyright law and forms part of a broader framework for the responsible development and use of artificial intelligence technologies.
6. Training AI Models on Copyright-Protected Content
One of the most complex issues at the intersection of artificial intelligence and copyright law concerns the use of protected content in the training of AI models. This issue goes beyond the question of access to data and extends to the nature of the operations performed on that data and their legal qualification under copyright law.
The training of AI models involves analysing very large datasets in order to identify statistical patterns and relationships between their elements. Depending on the system architecture and the machine learning methods applied, these operations may include the temporary reproduction of content, its transformation into mathematical representations, and the repeated processing of the same data across multiple training iterations.
From a copyright perspective, the key question is whether such operations constitute “use” of works within the meaning of the rules governing the exclusive rights of authors. In particular, this relates to the right of reproduction, which under EU law covers temporary as well as permanent copying; only transient or incidental acts of reproduction that form an integral part of a technological process and have no independent economic significance benefit from the narrow mandatory exception in Article 5(1) of the InfoSoc Directive (2001/29/EC).
In this context, it is important to distinguish between the analysis of content and its exploitation in the traditional sense. As a rule, training AI models does not involve the dissemination of works or their direct communication to the public, but rather their use as a source of statistical information. This does not, however, eliminate the risk that operations performed on the data — especially copying within system memory — may be qualified as an interference with exclusive rights.
An additional layer of complexity arises from the fact that the training process results in models that do not store content in its original form, but rather in the form of parameters and mathematical representations. This raises the question of whether, and to what extent, such representations can be linked to the works used during training. This issue is particularly relevant in the context of generative models, which may reproduce certain styles, structures, or — in extreme cases — fragments of content resembling source materials.
Assessing the permissibility of training AI models on copyright-protected content therefore requires consideration of several interrelated factors. First, it is necessary to determine whether the data used in training is protected by copyright. Second, the nature of the operations performed on that data must be assessed, particularly in light of the right of reproduction. Third, it is essential to consider whether any legal exceptions apply, including those relating to text and data mining.
In practice, this means that the legality of AI model training cannot be assessed in the abstract, but must take into account the specific circumstances of a given project, including the source of the data, the manner in which it was obtained, the methods of processing applied, and the intended use of the model. This type of assessment forms an important part of a broader governance framework, enabling organisations to identify and mitigate legal risks associated with the use of artificial intelligence.
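The three-step assessment above can be sketched as a simple triage. Each input is a legal conclusion that in reality requires its own case-specific analysis; the function merely orders the questions and is not a substitute for that analysis.

```python
def triage_training_use(is_protected: bool,
                        involves_reproduction: bool,
                        exception_applies: bool) -> str:
    """Order the three questions from the assessment above (simplified triage,
    not legal advice): protection -> restricted act -> exception."""
    if not is_protected:
        return "outside copyright: no exclusive rights engaged"
    if not involves_reproduction:
        return "protected, but no restricted act identified"
    if exception_applies:
        return "restricted act covered by an exception (e.g. TDM)"
    return "restricted act without a legal basis: licence needed"

print(triage_training_use(True, True, False))
# restricted act without a legal basis: licence needed
```

Ordering the questions this way mirrors the text: the exception analysis (step three) only matters once protection and a restricted act (steps one and two) have been established.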
As a result, the issue of training AI models on copyright-protected content remains one of the key areas of tension between technological development and the protection of authors’ rights. Its resolution requires not only the interpretation of existing legal frameworks, but also consideration of rapidly evolving market practices and emerging case law.
7. Generative AI and New Challenges for Copyright Law
The development of generative artificial intelligence is significantly reshaping the functioning of copyright law, shifting the focus of analysis from the data used in model training to the content generated by these systems. While earlier stages of the AI lifecycle centre on the permissibility of using existing works, generative AI brings to the forefront the question of the legal status of AI-generated outputs.
One of the fundamental issues is whether content generated by AI systems can qualify as “works” within the meaning of copyright law. Under the established approach of EU law, copyright protection is granted only to results that constitute the author’s own intellectual creation, which implies the existence of a creative element and a link to human authorship. As a consequence, content generated fully autonomously by AI systems will, as a rule, not meet the criteria for copyright protection.
This does not mean, however, that the use of generative AI falls outside the scope of copyright law. In practice, the key issue lies in the relationship between generated content and the materials used in training the models. In particular, the question arises whether generated outputs may infringe copyright by reproducing or too closely imitating protected works.
This risk is especially relevant where generative models are capable of reproducing distinctive elements of style, structure, or composition, and in extreme cases, generating content that closely resembles specific source materials. The assessment of such situations requires determining whether the generated output constitutes a derivative work, an adaptation of a protected work, or an unauthorised reproduction.
From a copyright perspective, it is also important to distinguish between inspiration and infringement. Copyright law does not protect ideas, styles, or concepts as such, but rather their specific expression. As a result, not every similarity between generated content and an existing work will amount to infringement. However, the boundary between permissible inspiration and unlawful use is inherently context-dependent and may give rise to disputes in practice.
An additional challenge concerns the attribution of liability for potential infringements. In the context of generative AI, a complex ecosystem of actors is involved, including model providers, deployers, and end users generating content. Determining the scope of responsibility of each of these actors requires consideration of both copyright law principles and regulatory frameworks governing AI systems, including compliance obligations under the AI Act.
The development of generative AI is also giving rise to new business models based on content generation, further complicating legal assessment. In particular, questions emerge regarding the commercial use of generated materials, the rules governing their further dissemination, and potential claims by copyright holders.
As a result, generative AI represents one of the most dynamic areas in the evolution of copyright law. It requires not only the application of existing legal frameworks, but also their interpretation in light of emerging technologies, as well as recognition of the growing importance of governance and legal risk management in AI-driven projects.
8. Legal Risks and Recommendations for Organisations (AI + Copyright)
The development of artificial intelligence systems, particularly generative models, is giving rise to new categories of legal risks related to copyright. These risks emerge at different stages of the AI lifecycle and require a systemic approach that combines both legal and organisational perspectives.
From an organisational standpoint, it is essential to properly identify key risk areas and implement appropriate governance mechanisms to mitigate and manage them effectively.
Key Legal Risks
One of the primary risks concerns the use of training data in a manner that infringes copyright. This is particularly relevant where data is sourced from publicly available materials without adequate assessment of its legal status or without taking into account mechanisms such as the opt-out provided for under the DSM Directive. A lack of control over data sources may make it difficult to demonstrate the lawfulness of their use.
A second important risk area is the potential generation of content that infringes the copyright of third parties. In practice, this may include both unintended reproduction of fragments of protected works and the generation of content that is substantially similar to existing materials. This risk is especially significant in commercial contexts, where generated content is used in products or services.
Another challenge relates to the lack of transparency in processes associated with model training and data use. In the event of a legal dispute, organisations may be required to demonstrate how data was sourced and used. Insufficient documentation in this area can significantly hinder the ability to defend against claims.
The complexity of the supply chain involved in the development and deployment of AI systems is also a relevant factor. Responsibility for potential infringements may be distributed among model providers, technology integrators, and end users. The absence of a clear allocation of roles and responsibilities increases both legal and operational risks.
Recommendations for Organisations
In response to these risks, organisations should implement a coherent approach to managing copyright in AI projects as part of a broader AI governance framework.
First, it is essential to establish robust training data governance procedures, including the identification of data sources, the assessment of their legal status, and the documentation of the legal basis for their use. These procedures should also take into account the applicability of text and data mining exceptions and opt-out mechanisms.
Equally important is the introduction of risk assessment mechanisms for AI-generated content. In practice, this may involve testing models for potential reproduction of protected content, as well as implementing review processes before such content is used in business activities.
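One minimal form such testing can take is a verbatim-overlap screen: comparing word n-grams of generated output against a reference corpus of protected material. The sketch below is an assumption-laden illustration; the n-gram length, the threshold an organisation would apply, and the reference text are all hypothetical, and a low score says nothing about subtler, non-verbatim similarity.

```python
def ngram_set(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word n-grams in `text` (lower-cased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(generated: str, reference: str, n: int = 8) -> float:
    """Share of the generated text's n-grams that also occur verbatim in the
    reference corpus. A crude screen: high values flag possible reproduction
    for human review; they do not by themselves establish infringement."""
    gen = ngram_set(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngram_set(reference, n)) / len(gen)

reference = "to be or not to be that is the question whether tis nobler in the mind"
copied = "not to be that is the question whether tis nobler"
print(overlap_ratio(copied, reference, n=4))  # 1.0
```

In a review process, outputs scoring above a chosen threshold would be routed to human assessment rather than blocked automatically, which matches the human-review step described above.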
Organisations should also ensure that contractual arrangements with AI technology providers and business partners are properly structured. Agreements should clearly define liability for copyright infringements, the rules governing data use, and obligations related to regulatory compliance.
Another key element is ensuring an appropriate level of transparency and documentation. Implementing mechanisms to document data sources, their use, and model training parameters enhances an organisation’s ability to demonstrate compliance with applicable laws.
Finally, legal risk management in the AI domain should be treated as an ongoing process. The rapid pace of technological development and the evolving regulatory landscape require continuous updates to procedures, active monitoring of legal changes, and ongoing adaptation of organisational practices.
What to do / What to avoid (AI + Copyright)
| Recommended Actions | What to Avoid |
| --- | --- |
| Identification and documentation of training data sources | Using data without knowledge of its origin |
| Assessment of the legal status of data and conditions of use | Assuming that publicly available data is always permissible to use |
| Taking into account text and data mining exceptions and the opt-out mechanism | Ignoring reservations expressed by copyright holders |
| Implementing copyright compliance policies | Treating compliance as a purely formal exercise |
| Testing models for the generation of content similar to protected works | Assuming that generative models cannot infringe copyright |
| Verifying AI-generated content before commercial use | Automatically publishing AI-generated content without review |
| Ensuring transparency and documentation of model training processes | Lack of documentation of data sources and their use |
| Regulating liability in contracts with AI technology providers | Leaving liability issues unaddressed |
| Regularly updating procedures in response to legal developments | Treating governance as a one-off activity |
Conclusion
The relationship between artificial intelligence and copyright in the European Union is inherently multi-dimensional, spanning the entire lifecycle of AI systems — from the sourcing of training data, through model training, to the generation and use of content. Analysing this area requires consideration of both traditional principles of copyright law and newer regulatory frameworks, including rules on text and data mining and obligations introduced under the AI Act.
In practice, this necessitates adopting a systemic approach in which copyright considerations form an integral part of broader AI governance frameworks. In this context, particular importance should be placed on training data governance, the assessment of risks associated with AI-generated content, and the implementation of procedures that enable organisations to demonstrate regulatory compliance.
The rapid development of generative AI technologies means that the boundaries between permissible data use and copyright infringement remain unclear in many areas. As a result, organisations should not only apply existing legal rules, but also actively manage legal risks and continuously adapt their practices to an evolving regulatory landscape.