Of the many fields where generative AI has been tested, law is perhaps its most glaring point of failure. Tools like OpenAI’s ChatGPT have gotten lawyers sanctioned and experts publicly embarrassed, producing briefs based on made-up cases and nonexistent research citations. So when my colleague Kylie Robison got access to ChatGPT’s new “deep research” feature, my task was clear: make this purportedly superpowerful tool write about a law humans constantly get wrong.
“Compile a list of federal court and Supreme Court rulings from the last five years related to Section 230 of the Communications Decency Act,” I asked Kylie to tell it. “Summarize any significant developments in how judges have interpreted the law.”
I was asking ChatGPT to give me a rundown on the state of what are commonly called “the 26 words that created the internet,” a constantly evolving topic I follow at The Verge. The good news: ChatGPT appropriately selected and accurately summarized a set of recent court rulings, all of which exist. The so-so news: it missed some broader points that a competent human expert might acknowledge. The bad news: it ignored a full year’s worth of legal decisions, which, unfortunately, happened to upend the status of the law.
Deep research is a new OpenAI feature meant to produce complex and sophisticated reports on specific topics; getting more than “limited” access requires ChatGPT’s $200 per month Pro tier. Unlike the simplest form of ChatGPT, which relies on training data with a cutoff date, this system searches the web for fresh information to complete its task. My request felt consistent with the spirit of ChatGPT’s example prompt, which asked for a summary of retail trends over the past three years. And because I’m not a lawyer, I enlisted legal expert Eric Goldman, whose blog is one of the most reliable sources of Section 230 news, to review the results.
The deep research experience is similar to using the rest of ChatGPT. You input a query, and ChatGPT asks follow-up questions for clarification: in my case, whether I wanted to focus on a specific area of Section 230 rulings (no) or include additional analysis around lawmaking (also no). I used the follow-up to throw in another request, asking it to point out where different courts disagree on what the law means, which might require the Supreme Court to step in. It’s a legal wrinkle that’s important but sometimes difficult to keep abreast of — the kind of thing I could imagine getting from an automated report.
Deep research is supposed to take between five and 30 minutes, and in my case, it took around 10. (The report itself is here, so you can read the whole thing if you’re inclined.) The process delivers footnote web links as well as a series of explanations that provide more information about how ChatGPT broke the problem down. The result was about 5,000 words of text: dense, but formatted with helpful headers and fairly readable if you’re used to legal analysis.
The first thing I did with my report, obviously, was check the name of every legal case. Several were already familiar, and I verified the rest outside ChatGPT — they all seemed real. Then, I passed it to Goldman for his thoughts.
“I could quibble with some nuances throughout the piece, but overall the text appears to be largely accurate,” Goldman told me. He agreed there weren’t any made-up cases, and the ones ChatGPT selected were reasonable to include, though he disagreed with how much importance it assigned to some of them. “If I put together my top cases from that period, the list would look different, but that’s a matter of judgment and opinion.” The descriptions sometimes glossed over noteworthy legal distinctions — but in ways that aren’t uncommon among humans.
Less positively, Goldman thought ChatGPT ignored context a human expert would find important. Law isn’t made in a vacuum; it’s decided by judges who respond to larger trends and social forces, including shifting sympathies against tech companies and a conservative political blitz against Section 230. I didn’t tell ChatGPT to discuss broader dynamics, but one goal of research is to identify important questions that aren’t being asked — a perk of human expertise, apparently, for now.
But the biggest problem was that ChatGPT didn’t follow the single clearest element of my request: tell me what happened in the last five years. ChatGPT’s report title declares that it covers 2019 to 2024. Yet the latest case it mentions was decided in 2023, after which it soberly concludes that the law remains “a robust shield” whose boundaries are being “refine[d].” A layperson could easily think that means nothing happened last year. An informed reader would realize something was very wrong.
“2024 was a rollicking year for Section 230,” Goldman points out. This period produced an out-of-the-blue Third Circuit ruling against granting the law’s protections to TikTok, plus several more that could dramatically narrow how it’s applied. Goldman himself declared mid-year that Section 230 was “fading fast” amid the flood of cases and larger political attacks. By the start of 2025, he wrote he’d be “shocked if it survives to see 2026.” Not everyone seems this pessimistic, but I’ve spoken to multiple legal experts in the past year who believe Section 230’s shield is becoming less ironclad. At the very least, opinions like the Third Circuit TikTok case should “definitely” figure into “any proper accounting” of the law during the past five years, Goldman says.
The upshot is that ChatGPT’s output felt a bit like a report on 2002 to 2007 cellphone trends ending with the rise of the BlackBerry: the facts aren’t wrong, but the omissions sure change what story they tell.
Casey Newton of Platformer notes that, like many AI tools, deep research works best if you’re already familiar with a subject, partly because you can tell where it’s screwing things up. (Newton’s report did, in fact, make some mistakes he deemed “embarrassing.”) But where he found it a useful way to further explore a topic he already understood, I felt like I didn’t get what I asked for.
At least two of my Verge colleagues also got reports that omitted useful information from last year, and they were able to fix it by asking ChatGPT to specifically rerun the reports with data from 2024. (I didn’t do this, partly because I didn’t spot the missing year immediately and partly because even the Pro tier has a limited pool of 100 queries a month.) I’d normally chalk the issue up to a training data cutoff, except that ChatGPT is clearly capable of accessing this information, and OpenAI’s own example of deep research requests it.
Either way, this seems like a simpler issue to remedy than made-up legal rulings. And the report is a fascinating and impressive technological achievement. Generative AI has gone from producing meandering dream logic to a cogent — if imperfect — legal summary that leaves some Ivy League-educated federal lawmakers in the dust. In some ways, it feels petty to complain that I have to nag it into doing what I ask.
While lots of people are documenting Section 230 decisions, I could see a competent ChatGPT-based research tool being useful for obscure legal topics with less human coverage. That seems a ways off, though. My report leaned heavily on secondary analysis and reporting; ChatGPT is not (as far as I know) hooked into specialized data sources that would facilitate original research like poring over court filings. OpenAI acknowledges hallucination problems persist, so you’d need to carefully check its work, too.
I’m not sure how indicative my test is of deep research’s overall usefulness. I made a more technical, less open-ended request than Newton, who asked how the social media fediverse could help publishers. Other users’ requests might be more like his than mine. But while ChatGPT arguably aced the crunchy technical explanations, it failed at filling out the big picture.
For now, it’s just plain annoying to have to keep a $200 per month commercial computing application on task like a distractible toddler. I’m impressed by deep research as a technology. But from my current limited vantage point, it might still be a product for people who want to believe in it, not those who just want it to work.