Understanding LLMs and Few-Shot Learning

Ahti Ahde
6 min read · Apr 18, 2023


Image source: my own original.

This article is the second section of “The hard argument against LLMs being AGI” essay series.

In the previous section we went through some basic ideas of how computer science works from the perspective of the philosophy of science. We then elaborated on the limits of computational systems and Turing Machines, and introduced one of the most efficient mathematical optimization methods, Dynamic Programming, which is often used on hard-to-compute real-life problems to get the most out of transistor-based computer systems.

Many real-life software systems are not computationally demanding, which means most software engineers never use these methods, because they are harder for humans to read and write. However, ignoring computational demands can lead to software that works only occasionally and hangs for certain parameters. Most programmers have written such bugs within their first five years on the job without understanding what was going on, then shipped a fix that made another problem emerge: the original case was solved, but another use case was left out of testing. When we move closer to the hardware and reliability is a hard requirement, for example when building Internet protocols or device drivers, we often need these more demanding algorithms. Most of the hard problems are hidden inside reusable libraries and similar abstractions, so that programmers can focus on less demanding work.

One key aspect of Dynamic Programming is to exploit the fact that all computational algorithms are completely deterministic with respect to their input parameters. The relationship between the input parameters and the output value of a function is called the algorithmic fingerprint. In this section we will dive deeper into LLMs and Few-Shot Learning: how they work from the perspective of computational theory, and how they too are completely deterministic (after pre-training has produced a frozen model that is run just like any other program) and use algorithmic fingerprints.
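To make this concrete, here is a minimal Python sketch of memoization, the core dynamic programming trick: because the function is deterministic, its input parameters can serve directly as a cache key, so each “fingerprint” is only ever computed once.

```python
from functools import lru_cache

# A deterministic function: the same input ("algorithmic fingerprint")
# always produces the same output, so results can be cached and reused.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Naive recursion is exponential; memoization (a core dynamic
    # programming trick) makes it linear by reusing earlier results.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(200))  # instant, because every sub-result is computed exactly once
```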

When we run an LLM to predict a sequence of words, it works just like any other computer program or algorithm: the same algorithmic fingerprint will deterministically produce the same output. In some setups we might have a pseudorandom number seed to provide some variance to the output, but just as above, that pseudorandom number is simply part of the fingerprint’s input parameters. In other words, the input of an LLM is just a sequence of word tokens, some other input parameters and possibly a pseudorandom number, and from these it deterministically produces an output sequence of word tokens. Nothing inside the LLM changes during prediction. Such changes are imagined by the user.
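A minimal sketch of what this means (the function names are hypothetical, not any real library’s API): a frozen model plus its decoding parameters and seed form a pure function from input tokens to output tokens.

```python
import random

def generate(frozen_model, prompt_tokens, max_new_tokens=50, seed=None):
    """Conceptual sketch: a frozen LLM at inference time is a pure function.

    `frozen_model` stands for a fixed mapping from a token sequence to
    next-token probabilities; nothing inside it is updated here."""
    rng = random.Random(seed)            # the pseudorandom seed is just another input
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = frozen_model(tokens)     # same weights + same tokens -> same distribution
        if seed is None:
            next_token = max(probs, key=probs.get)                          # greedy decoding
        else:
            next_token = rng.choices(list(probs), list(probs.values()))[0]  # seeded sampling
        tokens.append(next_token)
    return tokens

# The same fingerprint (prompt, parameters, seed) always yields the same output:
# generate(model, prompt, seed=42) == generate(model, prompt, seed=42)
```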

Machine Learning and Dynamic Programming are both based on the same mathematical principle, the “blessing of dimensionality”. The curse of dimensionality means that when we analyze high-dimensional data, we run into computational problems we wouldn’t have in lower dimensions. Many computational problems, and especially machine learning problems, are related to the mathematical concept of fractal dimensionality. Put simply, it means that when one quantity grows exponentially, another grows only linearly, or when one grows linearly, another grows sublinearly. The “blessing of dimensionality” means that there exists a way to avoid the “curse of dimensionality”, because some form of fractal dimensionality exists for the problem space.

Zipf’s Law is an example of Fractal Dimensionality. The red line portrays the fractal dimensionality. Original image.

Most of you have probably used compression algorithms like Zip. They build on the same intuition as Zipf’s Law: natural language has a fractal dimensionality, better known as the straight line you get in a log-log plot of word frequency against word rank in a text corpus. All networks also have fractal dimensionality. Another example is dimensionality reduction: the optimal target dimension can be predicted from the fractal dimensionality of the dataset. While Dynamic Programming and Zip algorithms produce exact, reversible solutions, pre-trained machine learning models tend to be lossy with their information. One metaphor for what happens in the pre-training of a neural network is that the data is used to build a computer program which, when it predicts, works as a computational algorithm that has memorized some paths through the training data in the sense of lossy compression; those paths can then be re-accessed by the algorithmic fingerprint that invokes that specific neural path. Of course, the parameter space here is continuous, not discrete.
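As an illustration, here is a small sketch that estimates the Zipf exponent of a plain-text corpus (the file name is a placeholder): the straight-line fit in log-log space is exactly the fractal dimensionality shown in the figure above.

```python
import numpy as np
from collections import Counter

# Any plain-text corpus will do; "corpus.txt" is just a placeholder.
words = open("corpus.txt", encoding="utf-8").read().lower().split()
freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)

# Zipf's Law: frequency ~ rank^(-s), i.e. a straight line in log-log space.
# The slope of that line is the fractal dimensionality discussed above.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"estimated Zipf exponent: {-slope:.2f}")  # typically close to 1 for natural language
```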

In Few-Shot Learning there is an important issue that philosophers of science have had to deal with in quantum mechanics. In quantum mechanics the measuring apparatus impacts the quantum system, which is called the observer effect. This kind of situation often feels counterintuitive to human beings, who are used to a clear subject-object distinction (at least in the Western tradition). Many people who prompt LLMs treat the situation as a discussion, where you say something to someone, that someone thinks and says something back, and so on. However, while humans react to the most recent utterance of the LLM, the LLM always reacts to the entire prompt history (as explained in the “Language Models are Few-Shot Learners” article). In this respect humans and the computational processes of LLMs are fundamentally different in how they operate: the human has already processed the past and moves forward from the previous utterance, while the LLM calculates its most recent answer from the whole algorithmic fingerprint of every prompt in the history of the session’s dialogue.
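A minimal sketch of this difference (the model call is a stand-in, not a real API): every turn rebuilds and re-sends the whole dialogue, so the model’s apparent “memory” is just the prompt history included in its fingerprint.

```python
def chat_turn(frozen_llm, history, user_message):
    """Each turn, the model re-reads the *entire* dialogue, not just the last line.

    `frozen_llm` is a placeholder for any stateless completion function; it keeps
    no memory between calls, so all "memory" lives in the prompt we rebuild here."""
    history = history + [("user", user_message)]
    prompt = "\n".join(f"{role}: {text}" for role, text in history)  # the full fingerprint
    reply = frozen_llm(prompt)       # recalculated from scratch on every single turn
    return history + [("assistant", reply)]

def fake_llm(prompt: str) -> str:
    return "(model output)"          # stand-in for a real, frozen model

history = []
history = chat_turn(fake_llm, history, "Hello")
history = chat_turn(fake_llm, history, "What did I just say?")
# The model can "remember" only because the first message is literally
# re-sent inside the second prompt; nothing inside the model has changed.
```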

In other words, what the LLM does not do is maintain a symbolic state built from a semantic understanding of what was previously said; instead it recalculates syntactically coherent output from the whole syntactic coherence of the prompt and the next number from the pseudorandom seed (using the configuration parameters of the session too). People often mistake this for “human-like cognitive capabilities”, and that is called Prompt Leading. LLMs do not build semantic models or a world model. The eventual semantic coherence is a product of human interaction, and the illusion of cognition is caused by the Observer Effect when the user doesn’t understand how a Few-Shot Learning model does “error correction”. It doesn’t learn; it recalculates with more information from the observer.

The user of the prompt influences the performance of the LLM. Similar problems arise throughout science, especially in the humanities. Computer programs, on the other hand, tend to be deterministic regardless of the user. Original image.

However, there is one important thing to notice at this point. The important point is not that semantic coherence is a product of human interaction; the same is true for humans too! So how do the LLM and an AGI (or a human) differ here? The LLM remains a static program during the whole process, while an AGI or a human would change their internal configuration toward the semantic coherence of the situation. The LLM does no work towards semantic coherence (in the sense of energy or effort); instead it is just a tool used by a human being. In a sense humans and AGI are similar to LLMs in that we all “recalculate everything”. We can see the life-span of a human being as a single prompt, for example; interacting through prompts is an existential endeavor after all. But humans and AGI do change, while the pre-trained LLM does not. In a sense the LLM could be seen as meta-cognitive, similar to human culture, but culture is not as static as an LLM either. The LLM is in its own category here, because it is not adaptive after it has been trained. A constantly retrained model, on the other hand, would lose coherence with the prompt history, so that argument does not work either.
