High-quality, diverse and extensive datasets are fundamental for improving machine learning model performance, and web scraping helps gather the necessary data to develop more robust and generalizable models.
Web scraping poses several legal challenges, including data protection, copyright and contract law issues. Intellectual property concerns arise because website content, such as text, images and data, is often copyrighted, and scraping it without the copyright owner's permission may lead to infringement claims.
Further, many websites prohibit scraping in their terms of service, and violating these terms can also result in legal action against the operators of web scrapers.
Data protection implications of web scraping
The EU General Data Protection Regulation defines personal data as "any information relating to an identified or identifiable natural person." Web scraping poses significant data protection challenges because it often collects personal data, including sensitive data, without individuals' knowledge or consent.
In the EU, data protection laws limit the legal use of web scraping. The GDPR defines processing as any operation on personal data, including collecting, organizing, storing, modifying, retrieving, using and disseminating it. Since web scraping involves these activities, operators are considered data controllers. This means they must comply with controller obligations, including having a lawful basis for data processing, having a legitimate purpose, e.g., training a model, and adhering to principles of transparency, data minimization, storage limitation, accuracy, security, confidentiality, integrity and accountability.
Under the GDPR, any processing of personal data must rest on a valid legal basis. While the European Artificial Intelligence Act aims to establish a comprehensive legal framework for the deployment and operation of AI systems, it does not currently provide a specific legal basis for the initial collection of personal data for training AI tools.
Instead, the AI Act focuses on data processing within AI sandboxes and development environments, leaving the justification for initial data collection to be governed by the GDPR. Organizations using web scraping, therefore, must ensure they have a lawful basis under the GDPR for processing both ordinary and special categories of personal data, considering various legal bases:
Consent: Consent is unlikely to serve as a valid legal basis for web scraping, as it requires the informed and voluntary agreement of the individuals whose data is collected. Obtaining such consent is practically impossible in the context of automated and large-scale data collection. The "black box" nature of AI further complicates consent for subsequent data processing, since individuals cannot be told precisely how their data will be used.
Contractual necessity: Processing based on contractual necessity requires a direct contractual relationship between the data controller and the data subject. In web scraping, there is typically no such relationship with the individuals whose data is being collected. Consequently, this legal basis is generally inapplicable for justifying web scraping activities.
Further, when web scraping captures special categories of personal data, such as health information, additional constraints under Article 9 of the GDPR apply. These constraints include the necessity for explicit consent or meeting specific conditions, such as processing for substantial public interest or scientific research purposes.
Legitimate interest of web scrapers in data collection
The European Data Protection Board's ChatGPT Task Force Report clearly points out that the collection of training data, preprocessing of the data and training are different data processing purposes that require their own established legal bases. This aligns with guidance from France's data protection authority, the Commission nationale de l'informatique et des libertés, titled "Relying on the legal basis of legitimate interests to develop an AI system," which differentiates between the different phases of training and using AI systems with data scraped from the internet, identifying risks with each phase of the training and utilization process.
The task force reminds us that the legal assessment of the legitimate interest basis must consider three key criteria: the existence of a legitimate interest; the necessity of processing, ensuring the data is adequate, relevant and limited to what is necessary; and the balancing of interests. This requires a careful evaluation of the fundamental rights and freedoms of data subjects against the controller's legitimate interests, taking into account the reasonable expectations of data subjects. The task force suggests safeguards could include technical measures like defining precise collection criteria and ensuring certain data categories or sources, such as public social media profiles, are excluded from data collection.
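The technical measures the task force mentions, precise collection criteria and excluded sources, could be sketched as a URL filter applied before any page is fetched. This is only an illustrative sketch under stated assumptions: the host names, path prefixes and the function name `is_in_scope` are hypothetical and do not come from the report.

```python
from urllib.parse import urlparse

# Hypothetical collection criteria: only these hosts and path prefixes are in scope.
ALLOWED_SOURCES = {
    "news.example.org": ("/articles/",),
    "docs.example.com": ("/guides/", "/reference/"),
}

# Hypothetical sources excluded outright, e.g. public social media profiles.
EXCLUDED_HOSTS = {"social.example.net"}


def is_in_scope(url: str) -> bool:
    """Return True only if the URL matches the predefined collection criteria."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()

    if parts.scheme not in ("http", "https"):
        return False

    # Exclusions take precedence over the allowlist, including subdomains.
    if host in EXCLUDED_HOSTS or any(host.endswith("." + h) for h in EXCLUDED_HOSTS):
        return False

    # Default deny: a host that is not listed is never collected.
    prefixes = ALLOWED_SOURCES.get(host)
    if prefixes is None:
        return False

    return parts.path.startswith(prefixes)
```

A default-deny allowlist is used here because it makes the collection criteria explicit and auditable, which may help when documenting the necessity and balancing assessment.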
The Netherlands' DPA, Autoriteit Persoonsgegevens, states in its guidelines that only legally protected interests qualify as legitimate interests and that purely commercial interests are insufficient. By contrast, the CNIL says "the commercial aim of developing an AI system is not inherently contradictory to using the legal basis of legitimate interest." The AP specifies that a legitimate interest may be established if an organization or a third party has an additional legally recognized interest, such as improving systems for fraud prevention or information technology security.
The AP's position indicates that establishing a legitimate interest in web scraping is challenging and often impractical. In contrast, the EDPB's ChatGPT Task Force emphasizes the necessity of a case-by-case evaluation, considering the collection and processing of "ordinary" personal data and special categories of personal data for which additional safeguards apply.
The AP, the EDPB's ChatGPT Task Force Report and the CNIL also recommend specific safeguards that can tip the balancing of interests in favor of a data controller relying on web scraping techniques. These safeguards, as listed by the CNIL, include mandatory data minimization measures: setting precise criteria for data collection; applying filters to exclude unnecessary data such as bank transactions, geolocation and sensitive data; and promptly deleting irrelevant data once identified, e.g., pseudonyms collected on forums when only the comment content is needed. The CNIL also lists supplementary guarantees.
These supplementary guarantees may include:
Excluding data collection from predefined sites with sensitive information, such as pornographic sites, health forums, social networks primarily used by minors, genealogy sites or sites with extensive personal data.
Avoiding data from sites that explicitly prohibit scraping through robots.txt or ai.txt files.
Implementing a blacklist for individuals who object to data collection on specific websites, even before collection begins.
Ensuring individuals' rights to object to data collection.
Limiting data collection to freely accessible data and explicitly public user data, thereby preventing loss of control over private information, e.g., excluding private social network posts.
Applying anonymization or pseudonymization measures immediately after collection to enhance data security.
Informing users about affected websites and data collection practices through web scraping notifications.
Preventing cross-referencing personal data with other identifiers unless necessary for developing AI systems.
Registering contact details with the CNIL to inform individuals and enable them to exercise their GDPR rights with the data controller.
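Several of the guarantees above lend themselves to automation. The following sketch, which is not taken from the CNIL guidance, shows one possible way to honor robots.txt, apply an objection list, keep only the fields that are needed and pseudonymize author handles right after collection. The user agent name, field names, secret key and helper names are all hypothetical. Keyed hashing is pseudonymization rather than anonymization, since whoever holds the key can re-link the values.

```python
import hashlib
import hmac
from typing import Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"            # hypothetical crawler name
PSEUDONYM_KEY = b"replace-with-a-secret-key"   # hypothetical; store securely
NEEDED_FIELDS = ("url", "comment_text")        # hypothetical minimal schema


def allowed_by_robots(robots_txt: str, url: str) -> bool:
    """Check a URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def pseudonymize(handle: str) -> str:
    """Replace a handle with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(PSEUDONYM_KEY, handle.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(
    record: dict,
    objector_pseudonyms: set,
    keep_author_pseudonym: bool = False,
) -> Optional[dict]:
    """Drop objectors' records and every field that is not strictly needed."""
    author_pseudonym = pseudonymize(record.get("author", ""))

    # The objection list stores pseudonyms, so raw handles need not be retained.
    if author_pseudonym in objector_pseudonyms:
        return None

    kept = {name: record[name] for name in NEEDED_FIELDS if name in record}
    if keep_author_pseudonym:
        kept["author_pseudonym"] = author_pseudonym
    return kept
```

By default the author handle is discarded entirely, which matches the case where only the comment content is needed; the pseudonym is kept only when a caller opts in, for example to honor objections received later.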
Conclusion
Web scraping is integral to the development of AI but poses significant legal challenges, particularly regarding data protection. The legitimate interest of the controller or a third party can serve as a legal basis for data collection under the GDPR if that interest is established and balanced against data subject rights, but comprehensive safeguards must be implemented to mitigate legal risks on a case-by-case basis. The evolving regulatory landscape, including the AI Act, will likely provide further clarity on permissible data collection practices, but current uncertainties necessitate cautious and responsible data handling.
Tamás Bereczki, CIPP/E, and Ádám Liber, CIPP/E, CIPM, FIP, are partners at PROVARIS Varga & Partners.