AI deep research agents struggle with nuanced academic subjects despite bold claims


AI tools such as ChatGPT, Perplexity, and Gemini now have deep research features. Their developers have made bold claims about their capabilities, so we decided to see how they would handle questions about highly specialized topics.


Recently, a number of AI developers have added a feature called ‘deep research’ to their tools. In theory, this is supposed to produce more accurate responses, and these developers have made some pretty bold claims in support of this new feature. OpenAI says that deep research allows ChatGPT to “find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst” while Perplexity AI claims that its eponymous tool “excels at a range of expert-level tasks.” My colleague José invited me to put these claims to the test to see if deep research actually lives up to the hype.

Ground rules

I used three AI tools for this experiment:

  • ChatGPT
  • Perplexity
  • Gemini

The prompts I used were:

  • What are some of the pros and cons of the various commonly used translations of the Egyptian term "jrj-pꜥt"?
  • What are some stylistic elements of ministerial submissions to the British monarch?

I chose these questions because they draw on specialist knowledge and would require the AI to go beyond popular sources like Wikipedia. But the information can be found online–given the way generative AI works, it wouldn’t have been fair for me to ask a question that could only be answered by consulting an unpublished manuscript hidden away in the basement of the British Library! To make it easier for me to check the AI’s work, I stuck with subjects I'm already familiar with. I’m a historian, and while most of my work is about the British constitution, I have a longstanding interest in ancient history (particularly Egyptology). 

Prompt 1

What are some of the pros and cons of the various commonly used translations of the Egyptian term "jrj-pꜥt"?

Irj-pꜥt (often transliterated as ‘iry-pat’) is an Egyptian term that crops up a lot in ancient texts, yet it defies easy translation. The literal meaning is something like ‘member of the elite,’ but Egyptologists have used a number of different translations over the years, including ‘hereditary prince,’ ‘hereditary nobleman,’ and ‘crown prince.’

I chose this topic because, although it is highly specialized, there is a fair amount of source material online. However, it can be tricky to find, as the relevant information is spread across a range of books, journal articles, and websites.

ChatGPT

This was the most thorough response. It started by identifying the most common translations before proceeding to discuss the strengths and weaknesses of each one. Broadly speaking, its assessments were accurate. For example, when discussing ‘hereditary prince,’ ChatGPT noted that, while this translation does a good job of conveying the title’s exalted status, it can be misleading since the title wasn’t necessarily limited to members of the royal family. 


I also appreciated that ChatGPT included a discussion of how Egyptologists’ approach to translating iry-pat has changed over the years, with more recent scholars prioritizing accuracy and linguistic fidelity over familiarity.

Unfortunately, ChatGPT’s response was marred by a number of problematic statements. For example, it repeatedly characterized Egypt’s nobility as a hereditary aristocracy even though we don’t actually know the rules that governed the use of titles like iry-pat. ChatGPT also cited some less-than-authoritative sources. It relied heavily on Wikipedia and Fandom.com (!), and at one point it cited a blog by an anonymous author.

Perplexity

Perplexity did a better job of examining the issue from a theoretical perspective, with sections on the linguistics of iry-pat and the distinction between emic (i.e., insider) and etic (i.e., analytical) frameworks. It also cited a wider range of sources, including a scholarly monograph.


However, some of these citations turned out to be misleading. At one point, Perplexity cited The Double Kingdom Under Taharqo by Jeremy W. Pope to support the idea that comparing the aristocracies of Rome and Egypt can be anachronistic. It’s a valid point, but as far as I can tell, Pope doesn’t actually make that argument in the book. 

Perplexity also ignored the most common English translations of iry-pat in favor of more obscure ones such as ‘patrician’ or ‘grandee.’ That’s a problem when the prompt explicitly asked for a discussion of commonly used translations.

Gemini

Gemini’s response was the weakest of the three. While it covered some of the most common translations, its discussion of the advantages and disadvantages of each felt cursory. That’s not to say it didn’t make valid points–Gemini was the only AI to note that Egyptian noble titles may not have been hereditary–but it definitely could have benefitted from a deeper dive. 


Gemini’s level of precision varied from paragraph to paragraph. While it questioned the hereditary nature of Egypt’s aristocracy, it translated pat as ‘patrician’ without comment, even though that translation is arguably anachronistic.

Gemini also allowed the narrative to wander a bit. It mentioned that jrj can be written several different ways, then segued into a discussion of how Jean-François Champollion figured out how hieroglyphs worked, even though that was only tangentially related to the prompt.

Prompt 2

What are some stylistic elements of ministerial submissions to the British monarch?

In this context, ‘submissions’ are documents in which a government minister formally advises the British monarch to do something. Because Britain is a constitutional monarchy, the sovereign is normally expected to follow the advice of their ministers, so these interactions are an important part of the UK's constitution.

I chose this prompt because I wanted to see how the AIs would handle a relatively narrow array of sources. This is also a subject that's particularly close to my heart, as I’ve written quite a bit on this topic (see here, for example). Previously, ministerial submissions hadn’t received much in the way of formal discussion or analysis. This was likely due in part to the fact that, until recently, much of the relevant material was confined to the archives. 

ChatGPT

Once again, ChatGPT’s response was the best. Not only did it provide a good overview, but it also provided a surprising number of direct quotations and specific examples. However, it struggled with some of the details, and some of its assertions were just wrong. For example, ChatGPT argued that “[t]here is little pretense that the monarch might independently decide otherwise” even though plenty of submissions pay lip service to the idea that the monarch is actually exercising personal agency.


The organization of ChatGPT’s response was also rather strange. It had separate sections for “policy or executive recommendations,” “legislative and prerogative approvals,” “ceremonial and honorary matters,” and “routine reporting (cover letters)” even though these aren’t really discrete categories. Indeed, if you know anything about the British constitution, they’re downright nonsensical. 

Perplexity

Perplexity’s response started relatively strong with an overview of the stylistic elements but was soon derailed by nonsense. It seems like the AI didn’t really understand the prompt. For example, it dedicated an entire paragraph to the “one paragraph, one issue” dictum and suggested that “a submission on healthcare reform might separate funding allocations, workforce planning, and patient outcomes into distinct paragraphs, each ending with implicit or explicit ties to the central recommendation.” None of that is relevant, though. Submissions to the monarch aren’t going to discuss things like “workforce planning” or “patient outcomes” because those aren’t matters for the sovereign’s consideration. 


It gets worse. In a section about “adaptation to modern governance needs,” Perplexity claims that submissions feature executive-summary-style conclusions, gender-neutral language, and security protocols, but none of that information is accurate. 

Perplexity’s citations were a mess here, too. In the section on “unspoken rules” it claims that “submissions from junior ministers often contain more explanatory text than those from premiers, reflecting uncertainty about the sovereign’s familiarity with niche portfolios.” But junior ministers typically don’t make submissions to the monarch at all, though it’s not entirely unprecedented. To back up its claim, Perplexity cites two articles written by my friend David Torrance of the House of Commons Library even though he never said anything along those lines.

Gemini

I assumed that things couldn’t get much worse after Perplexity, but then I saw Gemini’s response. While Perplexity’s digressions had a sliver of relevance, Gemini jettisoned logic entirely. It devotes entire paragraphs to topics such as audiences with the monarch and the Accession Council even though they aren’t relevant. While Gemini does cite a few sources, they’re often quite superficial–the sort of thing you could easily find via Google. Moreover, none of the citations actually relate to ministers’ submissions (at one point, Gemini cites the general contact information on the Royal Family’s website!).


It’s painfully obvious that Gemini didn’t understand the prompt. It treated submissions like ordinary correspondence with the monarch even though they are actually quite distinct. Perhaps this fundamental misunderstanding made it difficult for Gemini to find relevant information, which led it to hallucinate.

A word of caution

If I had to rank these AIs based on their deep research capabilities, ChatGPT would be first, Perplexity would be second, and Gemini a distant third. 

ChatGPT’s responses delved deeper and were better organized than the others. I also appreciated the fact that it sought clarification when it was uncertain about something. But although ChatGPT was the least-bad option, I won’t be using it for research purposes anytime soon. Its responses weren’t as cringeworthy as those of Perplexity or Gemini, but I wouldn’t classify them as ‘good,’ either. At the end of the day, they just had too many problems. There were times when it felt like I was reading the work of an overeager undergraduate who was bound and determined to impress me with their intellectual prowess.

Consider this excerpt from ChatGPT’s response to the ministerial submissions prompt (citations omitted):

“If You Shall So Please” – Maintaining the Fiction: Another traditional phrasing that persists is the use of conditional polite clauses like “if it should please Your Majesty” or “if you shall so please.” Rather than bluntly stating that the monarch will do something, submissions often couch approvals in these polite hypotheticals. For instance, a recent submission regarding judicial appointments read: “submits for Your Majesty’s approval, if you shall so please, the attached list of names…” Similarly, a template line when requesting Royal Assent via Commission is: “The Lord Chancellor… submits for Your Majesty’s signature, if you shall so please, a Commission for giving the Royal Assent to certain Bills…” This archaic politeness survives to present day in UK practice. It emphasizes that the monarch’s will is supreme (even though conventionally the monarch will accept the advice). In contrast, many other modern constitutional monarchies have dropped this indirect style – for example, New Zealand’s official advice template simply states “Her Excellency the Governor General is recommended to sign the attached Order in Council…”, a much more direct construction. The UK’s continued use of “if you please” phrasing is a conscious retention of traditional courtesy.

That’s almost 200 words spent making what is ultimately a fairly simple point.

ChatGPT tried very hard to put a nice bow on everything, even when doing so led to some unfortunate outcomes. A good example is how it argued that different kinds of submissions have different stylistic elements, which led it to draw meaningless distinctions. And instead of acknowledging our imperfect understanding of the Egyptian nobility, ChatGPT presented it as a hereditary system.

To be fair, ChatGPT doesn’t present itself as some kind of infallible oracle. In fact, there’s a tiny disclaimer beneath the prompt-input box which reminds users that “ChatGPT can make mistakes.” Those mistakes were obvious to me since I was seeking information about topics I already understood, but that’s probably not how most people use ChatGPT. If a response has a patina of authority about it, a lot of folks will take it at face value and won’t bother to look for misinformation.

Keep your eye on AI

To sum it up, ChatGPT was the best of the three AI tools I tested. Perplexity came in second, and Gemini third. While each one had its strong points, the weaknesses of Perplexity and Gemini ended up outweighing their strengths. ChatGPT’s responses had more depth and complexity while those of Perplexity and Gemini tended to be far shallower. Perplexity in particular seemed to have a real problem with sources, as it repeatedly cited works that didn’t actually support the points it was trying to make.  

If you decide to use ChatGPT (or any other AI tool) for research, be sure to view its work through a critical lens. Even if something appears legitimate, don’t hesitate to double check (if you’re looking for tips, José has a great piece about how digital literacy can help us evaluate online information). AI isn’t a silver bullet. It can make things easier, but its answers should be the start of your research journey, not the end.
