Limits of Turing Machines and Explaining Transformers via Dynamic Programming
This article is the first section of “The hard argument against LLMs being AGI” essay series.
When people argue that LLMs are not AGIs, we often get blamed that we are just anthropomorphising AGI in order to exclude any modern LLM or machine learning method from reaching that. As if we are doing the goal post move, while in fact the other side is just skipping science and using their own frame of reference. For this reason I decided to start the Hard Argument against LLMs Being AGI essay series by explicitly explaining how LLMs operate from computational perspective by using the computer science foundations. According to Stanford philosophy encyclopedia computer science produces theoretical claims on what is impossible. The basic idea is that if any empirical system is able to demonstrate that the theories have been invalidated, and then we need a better theory. So instead of starting from “how humans think” (because that is vague and still evolving scientific discipline) I start here from how “mathematical optimization” works with Turing Machines (to which current machine learning paradigms belong to).
I assume that most of my readers here are somewhat into computer science and already know the significance of David Hilbert’s Entscheidungsproblem (Decision Problem), how Gödel’s Incompleteness relates to that, how Church-Turing thesis follows and how halting problem exists for all algorithms and how that relates back to NP-Complete problems. You do not have to understand the details to follow my next idea, but it is sufficient that you are aware that no algorithm on a transistor computer is computationally efficient if it breaks some computational boundary conditions for algorithms. Computer science is the scientific phenomenological description of when algorithms start failing to finish reliably. The consequence of such a hard limit is that reliably running programs are bound to this domain of complexity.
Nice feature of both natural language and computational algorithms is that they can be composed from smaller sub-problems if we know the rules of composition. Many times computational problem solving starts from a natural language instruction set (pseudo code), which we then refine to something more computationally efficient. When we need to solve something hard on a computer, we must program the algorithm in an efficient manner. In other words, we must skip all unnecessary steps of the Turing Machine. Very often methods like Dynamic Programming will provide such solutions. The trick is rather simple but in reality also the most efficient thing you can do in many situations.
In Dynamic Programming you split the problem to sub processes and encapsulate them to functions. Then you will build a look-up table, where you store the tuple (grouping) of input parameters as index and the return values (output) as the value every time you perform the function. In Dynamic Programming, we add first lines to the function that checks if we already know the answer and replies with the correct return value, so that we do not have to rerun the computationally heavy algorithm. Instead we run a single line returning the value we already know.
Because computational algorithms are deterministic it is guaranteed that if we run the same function with the same parameters we get the same output. Some of you might wonder, what about object oriented methods? When we consider computational algorithms the theory of course predates object oriented programming and similar stateful methods of modeling. However, we can consider the properties encapsulated by the object as parameters and just do the same. No problem here. This is kind of what compilers do to the objects of object-oriented programs anyway.
Explaining Transformers Through Dynamic Programming
If I would give you a brief history of LLMs, it woud go something like this:
- Delta Learning Rule is discovered as a way of enabling learning for the Neural Networks
- Perceptrons do not produce interesting results because their error is measured between 0 and 1; we are efficiently limited to the linear regression domain (besides of course the obvious problems with size of datasets and computing power back in the days).
- Sigmoid activation function enables continuous valued errors; we are limited to the non-linear regression domain.
- Feed Forward Networks enable hidden-layers, which enable us to approximate pretty much any function with Neural Networks.
- Backpropagation Algorithm enables us to efficiently apply Delta Learning Rule to each hidden-layer.
- Recurrent Neural Networks and Long Short-Term Memory networks allow us to use backpropagation of error in recursive manner. These enable us to learn using really long n-grams (at least in theory).
- Transformers introduce Dynamic Programming like attention mechanism and thus we are able to use Fuzzy Lookup Table-like system, that memorizes useful skip-grams instead of n-grams.
If you want a more detailed introduction to early neural networks, you might enjoy reading this paper. If you want to read a really well written article by professor Mark Riedl explaining Transformers and ChatGPT, you might enjoy reading this article.
So the basic idea here is that Transformers are same for RNN-LSTM networks as Dynamic Programming is for Recursive functions. What this means in practice is that Transformers are the final frontier with neural networks with Turing Machines, because they utilize the most sophisticated mathematical optimization method known to be available.
While in Dynamic Programming we store Key-Value pairs, where Key contains all input parameters of a function, in Transformer networks we store Query-Key memory to the network, which produces proper weights to discover Value from the encoded embeddings of the network. So when we are predicting the next tokens, we use the “input” as Query and then the tokens in the prediction will accommodate Keys and Values. While in real Transformer networks the position of tokens can be any, I will here simplify the idea to forward only for the example: if in some specific place after Query Xn, the Key Xn+m exists, then Value probably exists in Xn+m+o position. Anyway, you get the idea. The actual mathematics goes like this:
Now you know how Transformers work and how Dynamic Programming works. LLMs have random seeds to produce some variety to the outputs. The important takeaway of this section for the next sections is that this Dynamic Programming tuple of input parameters is a kind of fingerprint of an algorithm and after Transformer model is pretrained, it becomes frozen and behaves just like any other computer program with deterministic input output pairs. In other words, Few-Shot Learning doesn’t do any “in-between” learning, but only uses LLM to autocomplete text for you.