Reddit Accuses Perplexity of Data Theft in Elaborate Digital Trap


The Digital “Marked Bill” Trap

In what reads like a digital detective story, Reddit has accused artificial intelligence startup Perplexity of systematically bypassing its data protection measures to scrape content for its AI models. According to a lawsuit filed Wednesday in Manhattan federal court, the social media giant set up an elaborate trap that allegedly caught the $20 billion AI company red-handed.

The confrontation began when Reddit noticed something peculiar. Although Perplexity had agreed to follow Reddit’s instructions and block its systems from scraping content from the site, the AI company was citing Reddit in its AI-generated answers more frequently than ever before. The increase was so dramatic that industry observers speculated the two companies might have secretly struck a content-licensing deal.

But Reddit’s lawsuit tells a different story. “In truth, there is no license between Perplexity and Reddit,” the legal filing states, alleging instead that the increased citations resulted from “a scheme by Perplexity to obtain Reddit’s data through the circumvention of the technological measures protecting Reddit data.”

How the Trap Worked

Reddit’s technical team engineered what they described in legal documents as a digital equivalent of a “marked bill” – creating a test post that could only be crawled by Google’s search engine. This setup was crucial because Google has a legitimate content-licensing deal with Reddit, while Perplexity does not.

The technical approach was sophisticated. According to the lawsuit, the only way Perplexity could access this specially marked content would be by bypassing Reddit’s guardrails and scraping Google’s search engine results pages (SERPs). Essentially, Reddit created content that should have been invisible to Perplexity’s systems under normal circumstances.

What happened next surprised even the Reddit team. “Within hours, queries to Perplexity’s ‘answer engine’ produced the contents of that test post,” the lawsuit reveals. The speed of the response suggested an automated system actively scraping and incorporating the marked content into Perplexity’s knowledge base.
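The “marked bill” idea can be illustrated with a small sketch. The filing as described here does not say how the test post was marked, so the unique random token, the function names `make_canary` and `contains_canary`, and the sample texts below are all assumptions made for illustration, not Reddit’s actual method.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Return a random marker string that is very unlikely to appear anywhere else."""
    return f"{prefix}-{secrets.token_hex(8)}"


def contains_canary(answer_text: str, canary: str) -> bool:
    """Check whether a piece of text reproduces the marker (case-insensitive)."""
    return canary.lower() in answer_text.lower()


# Hypothetical test post: the marker is embedded in content exposed to one crawler only.
canary = make_canary()
test_post = f"Test post for crawl monitoring. Marker: {canary}"

# Hypothetical answers returned later by some third-party service.
leaked_answer = f"According to a Reddit post, the marker is {canary.upper()}."
clean_answer = "I could not find any Reddit post on that topic."

print(contains_canary(leaked_answer, canary))
print(contains_canary(clean_answer, canary))
```

Because the marker is random and published in only one place, its appearance in a third party’s output points back to that single source, much as a marked bill does.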

Broader Pattern of Behavior

This isn’t the first time Perplexity has faced allegations of bypassing web standards. Reddit’s legal filing references similar concerns raised by internet infrastructure company Cloudflare. In an August blog post, Cloudflare described setting up web pages with code that explicitly instructed Perplexity not to crawl those sites’ content – only to find that Perplexity’s crawlers visited those websites anyway.
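Instructions of this kind are commonly expressed in a site’s `robots.txt` file. The sketch below assumes that mechanism and uses Python’s standard `urllib.robotparser` to show how a compliant crawler would read such rules. The rule set, the example URL, and the helper `can_crawl` are hypothetical and are not taken from Cloudflare’s or Reddit’s actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one licensed crawler is allowed, everything else is refused.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/r/test/comments/marked-post"
for agent in ("Googlebot", "PerplexityBot", "SomeOtherBot"):
    print(agent, can_crawl(agent, url))
```

Note that `robots.txt` is advisory: it tells a crawler what the site owner permits but does nothing to stop a crawler that ignores it, which is why disputes like this one turn on whether the instructions were honored.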

Cloudflare CEO Matthew Prince didn’t mince words when characterizing the behavior, comparing Perplexity to “North Korean hackers” in a social media post. “Some supposedly ‘reputable’ AI companies act more like North Korean hackers,” Prince wrote. “Time to name, shame, and hard block them.” Reddit cited this characterization in its own lawsuit, suggesting a pattern of behavior across the industry.

The legal action names not just Perplexity but also three data-scraping companies as co-defendants: Oxylabs UAB, AWM Proxy, and SerpApi. Reddit alleges these companies may have taken its posts without permission and sold them to Perplexity. One defendant, AWM Proxy, is identified in the lawsuit as a former Russian botnet, raising additional security concerns.

Industry Implications and Responses

This case arrives at a critical moment for the AI industry, as companies race to gather training data while content creators increasingly push back against unauthorized scraping. The tension between AI development and content ownership represents one of the defining legal battles of the current technological era.

Perplexity’s response to the allegations has been multifaceted. Spokesperson Jesse Dwyer told Business Insider that the company “will not tolerate threats against openness and the public interest.” Meanwhile, in a Reddit post following the lawsuit, the company stated that it “does not train AI models on content,” though this appears to contradict the core allegations in Reddit’s legal filing.

The data-scraping companies named in the lawsuit are mounting their defense. A representative for SerpApi said the company plans to “vigorously defend ourselves in court,” while Oxylabs’ chief governance and strategy officer Denas Grybauskas expressed being “shocked and disappointed” by the allegations. “Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection,” Grybauskas stated.

As the legal battle unfolds, the case could establish important precedents for how AI companies source their training data and what constitutes acceptable scraping practices. With artificial intelligence development accelerating rapidly, the rules of engagement between content platforms and AI developers are being written in real time through cases like this one.
