The Turing Test, or Imitation Game, is a proxy test for “true” or “general” artificial intelligence. It was developed prior to modern computers, and with the understanding that we can’t tell by looking at circuits whether a computer is intelligent.
The test essentially consists of having a computer pretend to be a human through text-based communication and seeing if a real human can be reliably fooled. This was an important first step into analyzing questions of machine intelligence, but it obviously has some flaws.
Most obviously, modern AI comfortably passes the Turing Test while clearly lacking any underlying cognition or understanding. (See the Chinese Room argument regarding the lack of understanding.)
The test’s weakness
It was known from the start that this outcome was possible. The Turing Test is a proxy test, meaning it tests for something other than what we want to measure. We want to measure intelligence, but we don’t know how.
The trick is that we do know how to measure certain signs of intelligence, like solving math problems and conversing intelligently. By the time of Alan Turing and early computers, it was more than apparent that math problems could be solved completely mechanically (that is, literally by a machine), requiring no intelligence once the program was set up.
So Turing logically picked a task which seemed difficult to perform mechanically while still being objectively testable, namely human-like conversation.
Conversing mechanically
We can imagine a simple chatbot program that just uses if-then statements to give stock responses to common phrases. This type of “AI” was used in some early text-interface computer games.
For example, if the user types “Hello” then respond “Hi!”, and so on. If the user’s input isn’t recognized, just respond “I didn’t understand that.”
At its most basic, it’s just a list of keywords mapped to premade responses. However, a system like this can easily be expanded to consider various pieces of contextual information, like whether a question has been asked multiple times. Again, some video games have dialogue systems like this.
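As a concrete illustration, here is a minimal sketch of such a keyword-to-response chatbot (the keywords and replies are made up for the example):

```python
# A toy keyword-matching chatbot: each recognized phrase maps to a
# premade response; anything else gets the fallback reply.
responses = {
    "hello": "Hi!",
    "how are you": "I'm doing well, thanks.",
    "bye": "Goodbye!",
}

def reply(user_input: str) -> str:
    text = user_input.lower()
    for keyword, canned in responses.items():
        if keyword in text:
            return canned
    return "I didn't understand that."
```

A dialogue system like this can be extended with state (say, tracking whether a question has already been asked), but it only ever covers the cases someone thought to program in.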
The problem
It’s nearly impossible for such a program to pass the Turing Test. As soon as you mention something unusual, it won’t know how to respond. It’s infeasible to code in sensible responses to every topic a person might think of, maintain enough contextual information, and so on. No one even knows the extent of contextual information that is relevant to a given question.
But it’s not actually impossible for it to pass, and that turns out to be a logical problem for the Turing Test. We can imagine that, despite a program being very incomplete in what it can talk about, it just happens to be pre-programmed with the perfect responses to whatever the human doing the test happens to say.
If we test it again and again, it becomes more and more unlikely for the program to succeed by coincidence, but it’s still not impossible. Repeatedly passing does likely require the number of pre-programmed responses to be very large, but strictly speaking it only needs as many as are actually used in the test.
In other words, it was always known that it was technically possible for a computer program to pass the Turing Test “illegitimately.”
A different approach is needed
Turing naturally thought that unintelligent mechanical conversation was exceedingly unlikely to pass reliably: the amount of information required, and the complexity of processing it, is simply too great. What he could not have predicted is that the mechanical solution would be completely different from the if-then approach described above.
This is why large language models (LLMs) were developed. They operate mechanically, like a program built from simple if-then statements, but without any human needing to understand how the information is being processed. The amount of information is huge, and the processing is incomprehensibly complex, but it works because no one needs to know how it works.
That is to say, no one needs to know how or why individual choices are made. LLMs are neural nets, which I’ve explained in another post. In short, a neural net converts input data into numbers, does math on those numbers according to a bunch of parameters, then spits out a number that can be interpreted as data. In the case of LLMs, the data is a string of text.
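To make that “numbers in, math on parameters, number out” idea concrete, here is a toy example: a single artificial neuron, vastly simpler than a real LLM (the input values and parameters below are invented for illustration):

```python
import math

def neuron(inputs, weights, bias):
    # weighted sum of the inputs plus a bias, squashed into (0, 1)
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-s))

encoded = [0.5, -1.2, 3.0]   # input data converted to numbers
params = [0.1, 0.4, -0.2]    # parameters adjusted during training
out = neuron(encoded, params, bias=0.05)  # a number interpretable as data
```

An LLM chains together enormous layers of operations like this, with the final numbers decoded back into text.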
Teaching to the test
A neural net is trained through various processes that adjust its parameters, iteratively making the output better. In the case of LLMs, the basic problem that the neural net has to solve is predicting the next token (a token could be something like a word or part of a word). An LLM can be trained by, for example, letting it read through a text word-by-word while trying to guess the next word each time. It can get an overall score based on how well it did, then adjust its parameters, try again, and see if the new score is better or worse.
This is a massive simplification of how modern, advanced LLMs work, but that is the basic idea.
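As an illustration of that guess-score-adjust loop, here is a toy next-word predictor “trained” by random tweaks. (Real LLMs adjust billions of parameters by gradient descent, not random search; the corpus and numbers here are invented.)

```python
import random

corpus = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(corpus))

def guess(weights, word):
    # predict the next word as the one with the highest parameter value
    return max(vocab, key=lambda w: weights[(word, w)])

def score(weights):
    # fraction of words guessed correctly from the previous word
    pairs = list(zip(corpus, corpus[1:]))
    return sum(guess(weights, a) == b for a, b in pairs) / len(pairs)

random.seed(0)
weights = {(a, b): random.random() for a in vocab for b in vocab}

best = score(weights)
for _ in range(2000):
    pair = random.choice(list(weights))          # pick one parameter
    old = weights[pair]
    weights[pair] += random.uniform(-0.5, 0.5)   # tweak it
    new = score(weights)
    if new >= best:
        best = new             # keep the tweak if the score didn't worsen
    else:
        weights[pair] = old    # otherwise revert it
```

Random search like this would never scale to real language; the point is only the shape of the loop: guess the next word, get a score, adjust the parameters, keep what works.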
No one needs to know what context is relevant because the neural net embeds contextual relevance mathematically. Rather than using an understanding of language to produce language, an LLM produces language by reproducing mathematical patterns.
Alan Turing created his Imitation Game, which pits a human’s intuitive grasp of language patterns against a machine’s ability to imitate those patterns. AI engineers, in turn, created a solution that does exactly that and nothing else: a convincing imitation.
Passing the test
AI chatbots that work by exploiting the regularity of human language have long been thought possible, but they only became feasible recently thanks to key technological developments.
Neural nets solved the problem of trying to build a language-parsing system without relying on an understanding of language. However, neural nets have been around much longer than LLMs. It also took hardware advancements to make modern AI possible.
There was still the problem of the sheer quantity of information involved and the computational power required to process it. Modern graphics cards (GPUs) have especially made LLMs feasible. Memory, storage space, and CPU power are also important.
This is another aspect that Turing likely didn’t predict. The rapid improvement in computing power we’re now used to (as described by Moore’s Law) didn’t take off until well after Turing’s death. Even though Turing considered hypothetical computers with unlimited time and memory for solving problems, he couldn’t have anticipated what physical hardware would become today.
In short, the Turing Test was a reasonable idea at the time but turns out in hindsight to be poorly designed.
What now?
The fundamental problem with Turing Test-like approaches is that human behavior is sufficiently patterned (in many circumstances) to be mechanically imitated. In the case of LLMs, they are created with the express purpose of imitating human communication, and as a result that’s all they do.
Other forms of generative AI reiterate this problem. The things humans produce, like drawings and music, strongly follow certain patterns, with the exception of genuinely new innovations. It is possible that all human behavior is sufficiently patterned to be mechanically imitated, though many people find that implausible.
What prevents these AI from being “truly” intelligent is that they produce these things in the “wrong way.” They rely solely on the patterns present in the output and lack any reflection, planning, or understanding.
If “true” AI is possible, I don’t think we’ll be able to identify it by its resemblance to humans. I think it will have to be a different kind of intelligence, potentially more different from us than octopuses are (their intelligence is distributed rather than centralized).
We might never come up with a satisfactory test for what we mean by intelligence.
Photo of George Moore and H. M. Joseph using the SEAC computer (National Institute of Standards and Technology)
