What is the Content Training Summary Template?
The European Commission recently released an explanatory notice and template to help providers of general-purpose AI (GPAI) models summarise the content used to train their models. The template supports GPAI providers in meeting their obligation under Article 53 of the EU AI Act to make publicly available a summary of the content used to train each GPAI model.
Crucially, it also represents another step towards building trust in AI by increasing transparency, in line with the objectives of the regulation.
While the summary of information about a GPAI model provided using the Template is publicly available, the Commission has accounted for the need to protect trade secrets and confidential business information. As such, the explanatory notice clarifies that the summary should be ‘generally comprehensive in its scope instead of technically detailed to facilitate parties with legitimate interests, including copyright holders, to exercise and enforce their rights under Union law.’
Section One: General Information
The first section of the template includes general information about the GPAI provider and model, including provider contact information, versioned GPAI model name, model dependencies, and the date on which the model was placed on the Union market. Providers must detail the modalities present in the training data so far as they are identifiable, including:
- Text
- Image
- Audio
- Video
- Other
Providers must indicate the size of the training data by selecting, for each modality, the range that matches its estimated total data size. They also need to describe the types of content for each selected modality, for example:
- Fiction text
- Non-fiction text
- Scientific text
- Photography
- Visual artworks
- Infographics
- Social media images
- Musical compositions
- Audiobooks
- Private audio communication
- Music videos
- Films
- TV programmes
- Video games
- Social media videos
Finally, providers must share the latest date of data acquisition or collection for model training and any additional information about the collection of training data.
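As a rough illustration, the Section One fields described above could be captured internally in a small record before they are copied into the template. This is only a sketch: the class and field names below are hypothetical and are not taken from the Commission's template, and the size range is stored as free text because the template's actual ranges are not reproduced here.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    OTHER = "other"


@dataclass
class ModalityEntry:
    modality: Modality
    size_range: str           # the range selected in the template, copied verbatim
    content_types: list[str]  # e.g. ["Fiction text", "Scientific text"]


@dataclass
class GeneralInformation:
    # Hypothetical field names; they mirror the items listed in Section One.
    provider_name: str
    provider_contact: str
    model_name_and_version: str
    model_dependencies: list[str]
    date_placed_on_union_market: date
    latest_data_acquisition: date
    modalities: list[ModalityEntry] = field(default_factory=list)
    additional_comments: str = ""

    def missing_fields(self) -> list[str]:
        """Return the names of items that are still empty."""
        missing = []
        for name in ("provider_name", "provider_contact", "model_name_and_version"):
            if not getattr(self, name).strip():
                missing.append(name)
        if not self.modalities:
            missing.append("modalities")
        for entry in self.modalities:
            if not entry.content_types:
                missing.append(f"content_types[{entry.modality.value}]")
        return missing
```

Calling `missing_fields()` on a draft record gives a quick list of what still needs to be gathered before the section can be completed.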
Section Two: Data Sources
The second, and largest, section of the template requires providers to detail specific sources of data used to train the GPAI model. Organisations should specify the modality or modalities of the content covered by the datasets concerned in each section, then answer specific questions for each type of data source.
This section defines a “dataset” as a single, pre-packaged collection of data; data that has been filtered or pre-processed from the same pre-packaged collection should not be treated as a new dataset to be disclosed separately. If a dataset falls into more than one category, providers should select the most relevant category.
GPAI providers must provide details about the datasets used to train the model, grouped by type of source:
- Publicly available datasets
  - Datasets compiled by a third party and made publicly available for free, readily downloadable as a whole or in predefined chunks.
- Private non-publicly available datasets obtained from third parties
  - Datasets commercially licensed by rightsholders or their representatives.
  - Private datasets obtained from other third parties.
- Data crawled and scraped from online sources
  - Data crawled, scraped or otherwise compiled from online sources, excluding publicly available datasets already covered.
- User data
  - User data collected across all of the provider's services and products, not including data licensed by users under commercial transactional agreements or customer data used to fine-tune models for specific purposes.
- Synthetic AI-generated data
  - Data generated by another AI model for the purpose of training the model, such as AI feedback used in reinforcement learning, not including the use of AI models to clean or enrich data.
- Other sources of data
  - Data that does not fall under any of the previous categories, e.g. data collected from offline sources, self-digitised media, or datasets labelled by humans commissioned by the provider.
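To prepare for this section, a provider might keep an internal inventory that applies the two rules described above: each dataset gets a single, most relevant category, and filtered or pre-processed variants of the same pre-packaged collection are not listed twice. The sketch below is illustrative only; `SourceCategory`, `DatasetRecord` and `datasets_to_disclose` are hypothetical names, not part of the template.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class SourceCategory(Enum):
    PUBLIC_DATASET = "Publicly available datasets"
    LICENSED_PRIVATE = "Datasets commercially licensed by rightsholders"
    OTHER_PRIVATE = "Private datasets obtained from other third parties"
    CRAWLED = "Data crawled and scraped from online sources"
    USER_DATA = "User data"
    SYNTHETIC = "Synthetic AI-generated data"
    OTHER = "Other sources of data"


@dataclass(frozen=True)
class DatasetRecord:
    name: str                 # internal name of this copy or variant
    base_collection: str      # the pre-packaged collection it derives from
    category: SourceCategory  # the single most relevant category
    modalities: tuple[str, ...]


def datasets_to_disclose(records):
    """Group records by category, listing each base collection only once."""
    grouped = defaultdict(set)
    for record in records:
        # Variants share a base_collection, so the set collapses them.
        grouped[record.category].add(record.base_collection)
    return {category: sorted(names) for category, names in grouped.items()}
```

Keying on `base_collection` rather than `name` is the design choice that keeps a filtered copy from appearing as a separate dataset; it assumes the provider records the originating collection for every variant it creates.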
Section Three: Data Processing Aspects
The third section of the template focuses on the measures the provider has implemented to identify and comply with any reservations of rights under the text and data mining (TDM) exception or limitation set out in Article 4 of the Directive on Copyright in the Digital Single Market. These measures should also align with the provider’s copyright policy, as required by Article 53 of the EU AI Act.
Providers should describe the measures they have implemented before model training to respect reservations of rights under the TDM exception or limitation, including:
- Measures implemented before and during data collection
- Opt-out protocols and solutions honoured by the provider
- Opt-out protocols and solutions honoured by third parties from which datasets have been obtained
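As one possible illustration of a measure applied during data collection, the sketch below checks a site's robots.txt before a page is collected. It assumes robots.txt is among the opt-out protocols the provider has chosen to honour; the template text summarised here does not name specific protocols. The user agent `ExampleTrainingBot`, the sample file and the function `may_collect` are hypothetical. It uses Python's standard `urllib.robotparser` module and parses the file from a string, so no network access is needed.

```python
from urllib.robotparser import RobotFileParser


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt does not disallow user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: the site opts out of one named crawler entirely
# and keeps /private/ off limits for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if not may_collect(ROBOTS_TXT, "ExampleTrainingBot", "https://example.com/articles/1"):
    print("Skipping URL: the site has opted out for this crawler")
```

A real pipeline would also need to record which protocols were checked and when, since that is the kind of information this section of the template asks providers to describe.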
GPAI providers must provide a general description of the measures they have taken to avoid or remove illegal content under Union law from the training data. However, they aren’t required to disclose specific details about their internal business practices or trade secrets.
Finally, the template provides an optional section where providers can share any other relevant information about data processing measures taken before or after the training of the model.
Next Steps
For GPAI providers, it’s vital to review existing GPAI model documentation and processes. In preparation for using the template, organisations should ensure clear internal visibility on dataset sources, dataset modalities, sizes and content types, and existing data processing measures.
Implementing best practices, such as those outlined in the AI management system standard ISO/IEC 42001 for building an ethical AI management system (AIMS), can also help to increase transparency, reduce AI risk, ensure clear documentation and build trust in an organisation and its AI models.