Reddit sues Perplexity, accusing the AI lab of using scraped content for training

Reddit’s Legal Battle Against Perplexity Exposes AI Industry’s Content Scraping Crisis

The Data Scraping Lawsuit That Could Reshape AI Development

Reddit has launched a significant legal offensive against AI startup Perplexity and three data scraping companies, alleging what court documents describe as an “industrial-scale” scheme to illegally harvest Reddit’s user-generated content. The lawsuit, filed in the Southern District of New York, represents a critical moment in the ongoing tension between content platforms and AI companies hungry for training data.

The social media platform accuses Perplexity of collaborating with scraping firms Oxylabs UAB, AWMProxy, and SerpApi to systematically bypass Reddit’s data protection measures. According to the complaint, these entities worked in concert to steal copyrighted user conversations and discussions to fuel Perplexity’s AI-powered “answer engine.”

Bypassing Defenses: The Scraping Methodology

Reddit’s legal filing reveals sophisticated methods allegedly employed to evade detection. The scraping companies reportedly circumvented Reddit’s anti-scraping measures through backdoor techniques, including extracting Reddit content directly from Google search results pages. This approach allowed them to access data while appearing as regular search traffic rather than automated scraping bots.

Perhaps most damning is Reddit’s description of a carefully laid trap. The company created a unique “test post” accessible only to Google’s search crawler and unavailable elsewhere online. Within hours, content from this deliberately hidden post appeared in Perplexity’s search results, providing what Reddit claims is undeniable evidence of improper data collection.
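The trap described in the complaint resembles what security practitioners often call a canary or honeytoken: a unique marker is planted in one controlled place, and any later appearance elsewhere points to that place as the source. The sketch below is a generic illustration of the idea, not Reddit's actual method; the function names and the sample outputs are hypothetical.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Create a marker string that is vanishingly unlikely to occur by chance."""
    return f"{prefix}-{secrets.token_hex(8)}"


def find_leaks(canary: str, outputs: dict[str, str]) -> list[str]:
    """Return the names of sources whose text contains the canary."""
    return [name for name, text in outputs.items() if canary in text]


if __name__ == "__main__":
    marker = make_canary()
    # Hypothetical text collected later from two third-party services.
    collected = {
        "service_a": f"Users report that {marker} works well.",
        "service_b": "No related content found.",
    }
    print(find_leaks(marker, collected))  # lists only the service that echoed the marker
```

Because the marker is published through a single channel, a match narrows down how the content was obtained, which is the inference Reddit says it drew from its test post.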

Perplexity’s Technology Under Microscope

The lawsuit offers a rare public dissection of an AI company’s technical architecture. Reddit characterizes Perplexity’s core technology as “nothing groundbreaking,” describing it as built on “retrieval-augmented generation” (RAG) where scraped data is processed by another company’s large language model.

“The business model is effectively to take Reddit’s content from Google search results, feed them into a third party’s LLM, and call it a new product,” the complaint states. This assessment challenges Perplexity’s $20 billion valuation and raises questions about the fundamental value proposition of some AI startups.
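For readers unfamiliar with the term, retrieval-augmented generation generally means fetching passages relevant to a question and handing them to a language model as context. The sketch below shows that general pattern only and makes no claim about how Perplexity's system is built. The sample documents, the word-overlap scoring, and the `generate` placeholder (standing in for a call to a third-party model) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical corpus standing in for retrieved web content.
DOCUMENTS = [
    "Cast iron pans should be seasoned with a thin layer of oil.",
    "Sourdough starter needs regular feeding with flour and water.",
    "Seasoning a cast iron pan builds a non-stick layer over time.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    return [word.strip(".,?!").lower() for word in text.split()]


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents ranked by word overlap with the query."""
    query_counts = Counter(tokenize(query))
    scored = []
    for doc in documents:
        doc_counts = Counter(tokenize(doc))
        score = sum(min(query_counts[w], doc_counts[w]) for w in query_counts)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query: str, passages: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Placeholder for a language model call; returns a stub string."""
    return f"[model output for a prompt of {len(prompt)} characters]"


def answer(query: str) -> str:
    """Retrieve, assemble the prompt, then generate."""
    return generate(build_prompt(query, retrieve(query, DOCUMENTS)))


if __name__ == "__main__":
    print(answer("How do I season a cast iron pan?"))
```

Production systems typically replace the word-overlap scoring with a search index or embedding similarity, but the retrieve, assemble, generate sequence is the same.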

A Pattern of Controversial Data Practices

This isn’t the first time Perplexity has faced allegations of questionable data collection practices. In August, Cloudflare publicly accused the AI company of ignoring robots.txt files and using stealth crawlers to evade Web Application Firewall rules. According to Cloudflare’s report, Perplexity employed undeclared crawlers after customers blocked its known crawling agents (PerplexityBot and Perplexity-User).

Reddit’s complaint notes that despite sending a cease-and-desist letter to Perplexity in May 2024 and receiving promises to respect Reddit’s robots.txt file, the volume of citations from Reddit on Perplexity’s platform actually “increased forty-fold” following the warning.
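A robots.txt file is advisory: it names crawlers by their declared user agent, and compliance depends on the crawler both reading the file and identifying itself honestly. The sketch below uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler might check such rules. The robots.txt content and the example URL are hypothetical and are not Reddit's actual file; only the two agent names come from Cloudflare's report as cited above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the two declared Perplexity agents.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Report whether the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/r/some-thread"  # placeholder URL
    for agent in ("PerplexityBot", "Perplexity-User", "SomeGenericBrowser"):
        print(agent, is_allowed(agent, url))
```

A request sent under an undeclared or generic user agent falls through to the wildcard rule and is allowed, which is why the stealth crawling Cloudflare alleges would not be stopped by robots.txt alone.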

Broader Implications for AI Industry

The lawsuit represents the latest escalation in Reddit’s campaign to control how its data is used by AI companies. In June, Reddit filed a similar lawsuit against Anthropic, accusing the Claude creator of “two-faced” behavior—publicly advocating for responsible AI while privately scraping data against Reddit’s terms of service.

These legal actions highlight the growing conflict between:

Content platforms seeking to monetize and control their user-generated data
AI companies requiring massive datasets for training and operation
Legal frameworks struggling to keep pace with technological development
User rights and the ownership of contributed content

Legal Remedies and Industry Impact

Reddit is seeking court orders to permanently stop the defendants from scraping its data and requesting damages for the harm caused. The company specifically demands the “disgorgement of any ill-gotten gains” earned from unauthorized use of its content—a potentially significant financial penalty given Perplexity’s reported valuation.

The outcome of this case could establish important precedents for how AI companies access and use online content. With major players like OpenAI and Google entering formal agreements with content platforms while smaller startups allegedly resort to scraping, the legal landscape for AI training data appears to be at a critical juncture.

As the AI industry continues its rapid expansion, this lawsuit underscores the urgent need for clear guidelines around data sourcing, copyright compliance, and the ethical development of artificial intelligence technologies.


