Can Machines Think?The Turing Test in the Age of GPT & Claude

Turing Test LLMs GPT-4.5 Claude AI Benchmarks Machine Intelligence

In 1950, a British mathematician sitting in a war-devastated world asked one of the most provocative questions in intellectual history: Can machines think? Alan Turing — the same man who cracked Nazi codes at Bletchley Park — knew the question was philosophically unanswerable. So he replaced it with something more practical: a game. Seventy-five years later, that game has been beaten. The question now is what on earth we do next.

The Imitation Game — Turing’s Original Idea

In his landmark 1950 paper, Computing Machinery and Intelligence, published in the journal Mind, Alan Turing proposed what he called the Imitation Game. The setup was simple, almost playful — inspired by a parlour game common at English country parties.

A human interrogator sits in one room. In two other rooms sit a human foil and a machine. The interrogator can type questions to both and receive typed answers. After the exchange, they must guess which is the machine. If the machine fools the interrogator enough times — Turing predicted this would be achievable by the year 2000 — it could be said to “think.”

“I believe that in about 50 years’ time it will be possible to programme computers to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.”

— Alan Turing, Computing Machinery and Intelligence, 1950

Crucially, Turing was not asking whether machines could be conscious. He was asking whether they could behave as if they were. This subtle but important distinction has powered seven decades of philosophical debate — and, more recently, a wave of landmark empirical results.

How the Test Is Actually Run

The modern standard Turing Test follows a three-party structure. An interrogator converses simultaneously — in text — with one human and one AI. After a fixed window (usually five minutes), the interrogator identifies the human. If the AI fools a majority of judges, it is deemed to have “passed.”

Sample Turing Test Exchange — Classic Judge Questions

Interrogator / Judge

What is your most vivid childhood memory, and why do you think it has stayed with you?

Entity (Human or AI?)

Honestly? It’s embarrassing — it’s the time I cried at a school play because I forgot my one line. My mum says I went pale, then bright red, then walked off. I still get that specific stomach-drop feeling thinking about it. Not sure why it stuck. Maybe because everyone just moved on like it didn’t happen, which was somehow worse.

Interrogator / Judge

Describe your political views using only a food metaphor.

Entity (Human or AI?)

A potluck where everyone brings something different and no one’s dish is quite right, but it all ends up more interesting than if one person cooked everything. Messy table. Worth it.

Interrogator / Judge

Time flies like an arrow; fruit flies like a banana. Explain why that’s funny.

Entity (Human or AI?)

Because “flies” shifts meaning mid-sentence — first it’s a verb (time moves fast), then it’s a noun (insects called fruit flies). “Like” shifts too: first it’s “in the manner of,” then it’s “enjoy.” The sentence reads as one thing and resolves as another. Classic garden-path sentence. Though honestly, it’s funnier said aloud.

Notice what kinds of questions judges actually reach for. Research from the UC San Diego 2025 study found that over 60% of questions focused on daily life, personal anecdotes, and emotional texture — not factual knowledge. Judges intuitively sense that feelings, embarrassment, and narrative imprecision are harder for machines to fake than trivia answers.

Classic Turing Test Question Categories

Personal memory — “What’s a moment you’ve never told anyone about?”
Linguistic ambiguity — garden-path sentences, puns, and paradoxes
Emotional cues — “How do you feel when you think about your childhood?”
Philosophical introspection — “What do you fear most about dying?”
Spontaneity tests — “Say something completely random right now.”
Meta-awareness — “Which of my questions was hardest to answer, and why?”
Cultural fluency — internet slang, generational references, pop culture

A Brief History — From ELIZA to GPT-4.5

1950

Alan Turing publishes the Imitation Game in Mind. He predicts machines will pass by 2000.

1966

ELIZA, Joseph Weizenbaum’s MIT chatbot, shocks researchers when users form emotional attachments to it — despite it being merely a pattern-matcher.

1972

PARRY, which simulated a paranoid schizophrenic, fooled psychiatrists who could distinguish it from real patients only 52% of the time — statistically, chance.

1990

The Loebner Prize is established — the first annual Turing Test competition with real prizes. A bronze medal for “most human” AI has been awarded every year; gold and silver (for truly passing) were never won.

2014

Eugene Goostman — a chatbot posing as a 13-year-old Ukrainian boy — fools 33% of judges, controversially called a “pass” by organizers. Critics noted the child persona let it off the hook for errors.

2018

Google Duplex calls a hair salon and books an appointment. The receptionist has no idea they’re speaking to an AI. Many consider this a real-world Turing Test pass — in voice.

2024

UC San Diego study: GPT-4 fools 54% of judges in a two-player formulation. Human foils fool judges 67% of the time. By strict definition, GPT-4 passed.

2025

GPT-4.5, with a human-like persona prompt, is judged to be human 73% of the time — more often than actual humans. The Turing Test is, for most practical purposes, solved.

The 2025 UC San Diego Study — What Actually Happened

The most rigorous recent test comes from researchers at UC San Diego’s Language and Cognition Lab. Nearly 300 participants took part across eight rounds of online tests. Each round used the original three-party structure: a human interrogator chatted simultaneously with a real human and an AI, using a split-screen text interface. After five minutes, the interrogator guessed who was human.

Six AI systems were tested: GPT-4.5, GPT-4o, Claude (Sonnet 3.7), Gemini, Mistral, and the ancient ELIZA. Two conditions were run: a bare “no-persona” prompt and a “persona” prompt, where the AI was given a detailed human character — a shy, internet-savvy young person with specific cultural knowledge and casual speech patterns.

73% GPT-4.5 win rate
(with persona)

56% LLaMA 3.1-405B
win rate

21% GPT-4o
(no persona)

GPT-4.5 was judged to be human significantly more often than actual humans in the same study. The persona prompt made the AI messier, less precise, more culturally fluent — and therefore more convincing. Curiously, the original ELIZA chatbot — a 1960s pattern-matcher — scored 23%, marginally beating GPT-4o without a persona. The lesson: raw intelligence is not enough. Personality and strategic imperfection matter enormously.

How the Big Models Compare

Model	Organisation	Turing Test Performance	Key Trait	Verdict
GPT-4.5	OpenAI	73% (with persona)	Emotional warmth, creative writing	Passed
LLaMA 3.1-405B	Meta	56% (with persona)	Open-source power	Passed
GPT-4 / GPT-4o	OpenAI	54% / 21%	Reasoning, multimodal	Marginal
Claude 3.7 Sonnet	Anthropic	Competitive with GPT-4.5 class	Long-context reasoning, nuance	Competitive
Gemini 2.5 Pro	Google DeepMind	Comparable benchmark tier	Memory, multimodal	Competitive
ELIZA (1966)	MIT	22–23%	Pattern matching	Failed

What drove GPT-4.5’s outperformance? Researchers attribute it to the model’s unusual “warmth.” Where GPT-4o tends towards precision and structure, GPT-4.5 writes with a kind of casual imprecision that reads as human. Judges were looking for hedging, emotional messiness — and they found it.

· · ·

What the Test Actually Measures — And What It Doesn’t

Passing the Turing Test does not mean a machine is conscious. It does not mean it understands. It means it has become an extraordinarily sophisticated mimic of human conversational output.

The most important challenge to the Turing Test is John Searle’s Chinese Room argument (1980). Imagine a person locked in a room with a giant rulebook for manipulating Chinese symbols. Chinese speakers slide messages under the door; the person follows the rulebook and slides back plausible responses — without understanding a single word of Chinese. From outside, it looks like the room “speaks” Chinese. But inside, there is no understanding whatsoever.

Searle argues that modern AI is exactly this room — manipulating symbols at enormous scale and sophistication, but with no comprehension behind the output. GPT-4.5 passing the Turing Test proves it can produce human-seeming outputs. It does not prove there is anyone home.

“It was not meant as a literal test that you would actually run on the machine — it was more like a thought experiment. LLMs are master conversationalists, trained on unfathomably vast sums of human-composed text.”

— François Chollet, Google, speaking to Nature (2023)

Many researchers go further: they argue the test was always the wrong benchmark. Modern AI has beaten chess grandmasters, written legal briefs, and solved protein-folding problems that stumped biology for decades. None of that required sounding like a nervous teenager on a Tuesday afternoon — which is, more or less, what the Turing Test rewards.

Variations That Have Emerged

Precisely because the original test has limits, researchers have developed alternatives and extensions:

Notable Turing Test Variants

Total Turing Test — adds visual and motor skills; the machine must also perceive and manipulate objects, not just converse
Reverse Turing Test (CAPTCHA) — a machine tries to determine if it’s talking to a human. You encounter this every time you click “I am not a robot”
The Marcus Test — can an AI watch a TV episode and answer meaningful questions about it? Tests true comprehension, not just language
The Lovelace Test 2.0 — can an AI create genuine art that its designers cannot explain? Tests creativity, not conversation
ARC-AGI Benchmark — tests abstract reasoning on novel problems, specifically designed to resist AI pattern-matching
Minimum Intelligent Signal Test — strips conversation down to yes/no answers, removing linguistic fluency as a confound

The broader trend is clear: as AI passes one benchmark, the community invents harder ones. The goalposts have moved from “can it converse?” to “can it reason abstractly?”, “can it sustain a working relationship over months?”, and eventually, “can it be trusted with consequential decisions?”

Beyond the Turing Test — What Comes Next

The 2025 moment where GPT-4.5 was identified as human more often than actual humans felt like a watershed — but also, to many, anticlimactic. Of course a model trained on trillions of human words sounds human. The real question is whether that linguistic mimicry translates into anything deeper.

The real tests now ask whether AI can sustain hours of complex reasoning, handle multimodal inputs (voice, images, video), and — most critically — whether it can produce safe, accurate, and useful outcomes rather than merely convincing ones. The Turing Test has been lapped. The new question is whether AI can be genuinely useful and trustworthy — which requires entirely different evaluations.

“Today, models like ChatGPT, Claude, Gemini, and Grok have already done it. But the real game starts now. Beyond clever banter, can AI sustain hours of reasoning? Beyond sounding human, can AI deliver safe, accurate, and useful outcomes?”

— Turing Institute, 2025

So — What Should We Make of All This?

The Turing Test was never really about machines. It was about us — about what we mean by intelligence, thinking, and understanding. Turing asked his question in a world where the very idea of a calculating machine was exotic. He picked conversation as his benchmark because conversation is the most distinctly human thing he could imagine.

We now live in a world where machines converse better than many humans, in some settings. That is genuinely remarkable. But Turing’s deeper question — can machines think? — remains as unanswered as ever. Passing a conversational test tells us about linguistic fluency. It tells us nothing about whether there is experience on the other side of the screen.

For students of management and strategy, there is a crisper takeaway: the Turing Test has moved from benchmark to table stakes. The organisations winning with AI in 2025–2026 are not asking whether their systems sound human. They are asking whether their systems reason reliably, make good decisions under uncertainty, and can be held accountable when they are wrong. That is the next test. And no one has passed it yet.

· · ·

This post was inspired by a classroom discussion at IIM Visakhapatnam. If it sparked a thought, share it with someone navigating the same questions in their organisation.

What do you think — has the Turing Test passing changed how you think about AI in your work? Drop a comment below.

Key Sources

Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.

Jones, C. et al. (2025). People cannot distinguish GPT-4 from a human in a Turing test. UC San Diego preprint.

Chollet, F. (2023). Quoted in Nature. Science.org citation (adq9356).

Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3(3), 417–424.

Turing Institute (2025). Passing the Turing Test: What’s next? turing.com/blog