this post was submitted on 08 Jun 2025
        
      
      826 points (95.4% liked)
      Technology
    76365 readers
  
      
      1555 users here now
      This is a most excellent place for technology news and articles.
Our Rules
- Follow the lemmy.world rules.
- Only tech related news or articles.
- Be excellent to each other!
- Mod approved content bots can post up to 10 articles per day.
- Threads asking for personal tech support may be deleted.
- Politics threads may be removed.
- No memes allowed as posts, OK to post as comments.
- Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
- Check for duplicates before posting, duplicates may be removed
- Accounts 7 days and younger will have their posts automatically removed.
Approved Bots
        founded 2 years ago
      
      MODERATORS
      
    you are viewing a single comment's thread
view the rest of the comments
    view the rest of the comments
the probabilities are also fixed after training. You seem to be conflating running the llm with different input to the model somehow adapting. The new context goes into the same fixed model. And yes, it can be reduced to fixed transition logic, you just need to have all possible token combinations in the table. This is obviously intractable due to space issues, so we came up with a lossy compression scheme for it. The table itself is learned once, then it's fixed. The training goes into generating a huge markov chain. Just because the table is learned from data, doesn't change what it actually is.
This argument collapses the entire distinction between parametric modeling and symbolic lookup. Yes, the weights are fixed after training, but the key point is that an LLM does not store or retrieve a state transition table. It learns to approximate the probability of the next token given a sequence through function approximation, not by memorizing discrete transitions. What appears to be a "table" is actually a deep, distributed representation compressed into continuous weight matrices. It is not indexing state transitions, it is computing probabilities from patterns in the input space.
A true Markov chain defines transition probabilities over explicit states. An LLM embeds tokens into high-dimensional vectors, then transforms them repeatedly using self-attention and feedforward layers that can capture subtle syntactic, semantic, and structural features. These features interact in nonlinear ways that go far beyond what any finite transition table could express. You cannot meaningfully represent an LLM’s behavior as a finite Markov model, even in principle, because its representations are not enumerable states but regions of a continuous latent space.
Saying “you just need all token combinations in a table” ignores the fact that the model generalizes to combinations never seen during training. That is the core of its power. It doesn’t look up learned transitions-it constructs responses by interpolating through an embedding space guided by attention and weight structure. No Markov chain does this. A lossy compressor of a transition table still implies a symbolic map; a neural network is a differentiable function trained to fit a distribution, not to encode it explicitly.
yes, the matrix and several levels are the "decompression". At the end you get one probability distribution, deterministically. And the state is the whole context, not just the previous token. Yes, if we were to build the table manually with only available data, lots of cells would just be 0. That's why the compression is lossy. There would actually be nothing stopping anyone from filling those 0 cells out, it's just infeasible. you could still put states you never actually saw, but are theoretically possible in the table. And there's nothing stopping someone from putting thought into it and filling them out.
Also you seem obsessed by the word table. A table is just one type of function mapping a fixed input to a fixed output. If you replaced it with a function that gives the same outputs for all inputs, then it's functionally equivalent. It being a table or some code in a function is just an implementation detail.
As a thought exercise imagine setting temperature to 0, passing all the combinations of tokens of input, and record the output for every single one of them. put them all in a "table" (assuming you have practically infinite space) and you have a markov chain that is 100% functionally equivalent to the neural network with all its layers and complexity. But it does it without the neural network, and gives 100% identical results every single time in O(1). Because we don't have infinite time and space, we had to come up with a mapping function to replace the table. And because we have no idea how to make a good approximation of such a huge function, we use machine learning to come up with a suitable function for us, given tons of data. You can introduce some randomness in the sampling of that, and you now have nonzero temperature again.
Ex. A table containing the digits of pi, in order, could be transparently replaced with a spigot algorithm that calculates the nth digit on-demand. Output would be exactly the same
This is an elegant metaphor, but it fails to capture the essential difference between symbolic enumeration and neural computation. Representing an LLM as a decompression function that reconstructs a giant transition table assumes that the model is approximating a complete, enumerable mapping of inputs to outputs. That’s not what is happening. LLMs are not trained to reproduce every possible sequence. They are trained to generalize over an effectively infinite space of token combinations, including many never seen during training.
Your thought experiment—recording the output for every possible input at temperature 0—would indeed give you a deterministic function that could be stored. But this imagined table is not a Markov chain. It is a cached output of a deep contextual function, not a probabilistic state machine. A Markov model, by definition, uses transition probabilities based on fixed state history and lacks internal computation. An LLM generates the distribution through recursive transformation of continuous embeddings with positional and attention-based conditioning. That is not equivalent to symbolically defining state transitions, even if you could record the output for every input.
The analogy to a spigot algorithm for pi misses the point. That algorithm computes digits of a predefined number. An LLM doesn't compute a predetermined output. It computes a probability distribution conditioned on a context it was never explicitly trained on, using representations learned across many dimensions. The model encodes distributed knowledge and compositional patterns. A Markov table does not. Even a giant table with manually filled hypothetical entries lacks the inductive bias, generalization, and emergent capabilities that arise from the structure of a trained network.
Equivalence in output does not imply equivalence in function. Replacing a rich model with an exhaustively recorded output set may yield the same result, but it loses what makes the model powerful: the reasoning behavior from structure, not just output recall. The function is not a shortcut to a table. It is the intelligence.
"lacks internal computation" is not part of the definition of markov chains. Only that the output depends only on the current state (the whole context, not just the last token) and no previous history, just like llms do. They do not consider tokens that slid out of the current context, because they are not part of the state anymore.
And it wouldn't be a cache unless you decide to start invalidating entries, which you could just, not do.. it would be a table with token-alphabet-size^context length size, with each entry being a vector of size token_alphabet_size. Because that would be too big to realistically store, we do not precompute the whole thing, and just approximate what each table entry should be using a neural network.
The pi example was just to show that how you implement a function (any function) does not matter, as long as the inputs and outputs are the same. Or to put it another way if you give me an index, then you wouldn't know whether I got the result by doing some computations or using a precomputed table.
Likewise, if you give me a sequence of tokens and I give you a probability distribution, you can't tell whether I used A NN or just consulted a precomputed table. The point is that given the same input, the table will always give the same result, and crucially, so will an llm. A table is just one type of implementation for an arbitrary function.
There is also no requirement for the state transiiltion function (a table is a special type of function) to be understandable by humans. Just because it's big enough to be beyond human comprehension, doesn't change its nature.
You're correct that the formal definition of a Markov process does not exclude internal computation, and that it only requires the next state to depend solely on the current state. But what defines a classical Markov chain in practice is not just the formal dependency structure but how the transition function is structured and used. A traditional Markov chain has a discrete and enumerable state space with explicit, often simple transition probabilities between those states. LLMs do not operate this way.
The claim that an LLM is "just" a large compressed Markov chain assumes that its function is equivalent to a giant mapping of input sequences to output distributions. But this interpretation fails to account for the fundamental difference in how those distributions are generated. An LLM is not indexing a symbolic structure. It is computing results using recursive transformations across learned embeddings, where those embeddings reflect complex relationships between tokens, concepts, and tasks. That is not reducible to discrete symbolic transitions without losing the model’s generalization capabilities. You could record outputs for every sequence, but the moment you present a sequence that wasn't explicitly in that set, the Markov table breaks. The LLM does not.
Yes, you can say a table is just one implementation of a function, and from a purely mathematical perspective, any function can be implemented as a table given enough space. But the LLM’s function is general-purpose. It extrapolates. A precomputed table cannot do this unless those extrapolations are already baked in, in which case you are no longer talking about a classical Markov system. You are describing a model that encodes relationships far beyond discrete transitions.
The pi analogy applies to deterministic functions with fixed outputs, not to learned probabilistic functions that approximate conditional distributions over language. If you give an LLM a new input, it will return a meaningful distribution even if it has never seen anything like it. That behavior depends on internal structure, not retrieval. Just because a function is deterministic at temperature 0 does not mean it is a transition table. The fact that the same input yields the same output is true for any deterministic function. That does not collapse the distinction between generalization and enumeration.
So while yes, you can implement any deterministic function as a lookup table, the nature of LLMs lies in how they model relationships and extrapolate from partial information. That ability is not captured by any classical Markov model, no matter how large.
yes you can enumerate all inputs, because thoy are not continuous. You just raise the finite number of different tokens to the finite context size and that's exactly the size of the table you would need. finite*finite=finite. You are describing training, i.e how the function is geerated. Yes correlations are found there and encoded in a couple of matrices. Those matrices are what are used in the llm and none of what you said applies. Inference is purely a markov chain by definition.
You can say that the whole system is deterministic and finite, so you could record every input-output pair. But you could do that for any program. That doesn't make every deterministic function a Markov process. It just means it is representable in a finite way. The question is not whether the function can be stored. The question is whether its behavior matches the structure and assumptions of a Markov model. In the case of LLMs, it does not.
Inference does not become a Markov chain simply because it returns a distribution based on current input. It becomes a sequence of deep functional computations where attention mechanisms simulate hierarchical, relational, and positional understanding of language. That does not align with the definition or behavior of a Markov model, even if both map a state to a probability distribution. The structure of the computation, not just the input-output determinism, is what matters.
no, not any computer program is a markov chain. only those that depend only on the current state and ignore prior history. Which fits llms perfectly.
Those sophisticated methods you talk about are just a couple of matrix multiplications. Those matrices are what's learned. Anything sophisticated happens during training. Inference is so not sophisticated. sjusm mulmiplying some matrices together and taking the rightmost column of the result. That's it.
Yes, LLM inference consists of deterministic matrix multiplications applied to the current context. But that simplicity in operations does not make it equivalent to a Markov chain. The definition of a Markov process requires that the next output depends only on the current state. You’re assuming that the LLM’s “state” is its current context window. But in an LLM, this “state” is not discrete. It is a structured, deeply encoded set of vectors shaped by non-linear transformations across layers. The state is not just the visible tokens—it is the full set of learned representations computed from them.
A Markov chain transitions between discrete, enumerable states with fixed transition probabilities. LLMs instead apply a learned function over a high-dimensional, continuous input space, producing outputs by computing context-sensitive interactions. These interactions allow generalization and compositionality, not just selection among known paths.
The fact that inference uses fixed weights does not mean it reduces to a transition table. The output is computed by composing multiple learned projections, attention mechanisms, and feedforward layers that operate in ways no Markov chain ever has. You can’t describe an attention head with a transition matrix. You can’t reduce positional encoding or attention-weighted context mixing into state transitions. These are structured transformations, not symbolic transitions.
You can describe any deterministic process as a function, but not all deterministic functions are Markovian. What makes a process Markov is not just forgetting prior history. It is having a fixed, memoryless probabilistic structure where transitions depend only on a defined discrete state. LLMs don’t transition between states in this sense. They recompute probability distributions from scratch each step, based on context-rich, continuous-valued encodings. That is not a Markov process. It’s a stateless function approximator conditioned on a window, built to generalize across unseen input patterns.
the fact that it is a fixed function, that only depends on the context AND there are a finite number of discrete inputs possible does make it equivalent to a huge, finite table. You really don't want this to be true. And again, you are describing training. Once training finishes anything you said does not apply anymore and you are left with fixed, unchanging matrices, which in turn means that it is a mathematical function of the context (by the mathematical definition of "function". stateless, and deterministic) which also has the property that the set of all possible inputs is finite. So the set of possible outputs is also finite and strictly smaller or equal to the size of the set of possible inputs. This makes the actual function that the tokens are passed through CAN be precomputed in full (in theory) making it equivalent to a conventional state transition table.
This is true whether you'd like it to or not. The training process builds a markov chain.
You’re absolutely right that inference in an LLM is a fixed, deterministic function after training, and that the input space is finite due to the discrete token vocabulary and finite context length. So yes, in theory, you could precompute every possible input-output mapping and store them in a giant table. That much is mathematically valid. But where your argument breaks down is in claiming that this makes an LLM equivalent to a conventional Markov chain in function or behavior.
A Markov chain is not simply defined as “a function from finite context to next-token distribution.” It is defined by a specific type of process where the next state depends on the current state via fixed transition probabilities between discrete states. The model operates over symbolic states with no internal computation. LLMs, even during inference, compute outputs via multi-layered continuous transformations, with attention mixing, learned positional embeddings, and non-linear activations. These mechanisms mean that while the function is fixed, its structure does not resemble a state machine—it resembles a hierarchical pattern recognizer and function approximator.
Your claim is essentially that “any deterministic function over a finite input space is equivalent to a table.” This is true in a computational sense but misleading in a representational and behavioral sense. If I gave you a function that maps 4096-bit inputs to 50257-dimensional probability vectors and said, “This is equivalent to a transition table,” you could technically agree, but the structure and generative capacity of that function is not Markovian. That function may simulate reasoning, abstraction, and composition. A Markov chain never does.
You are collapsing implementation equivalence (yes, the function could be stored in a table) with model equivalence (no, it does not behave like a Markov chain). The fact that you could freeze the output behavior into a lookup structure doesn’t change that the lookup structure is derived from a fundamentally different class of computation.
The training process doesn’t “build a Markov chain.” It builds a function that estimates conditional token probabilities via optimization over a non-Markov architecture. The inference process then applies that function. That makes it a stateless function, yes—but not a Markov chain. Determinism plus finiteness does not imply Markovian behavior.
you wouldn't be "freezing" anything. Each possible combination of input tokens maps to one output probability distribution. Those values are fixed and they are what they are whether you compute them or not, or when, or how many times.
Now you can either precompute the whole table (theory), or somehow compute each cell value every time you need it (practice). In either case, the resulting function (table lookup vs matrix multiplications) takes in only the context, and produces a probability distribution. And the mapping they generate is the same for all possible inputs. So they are the same function. A function can be implemented in multiple ways, but the implementation is not the function itself. The only difference between the two in this case is the implementation, or more specifically, whether you precompute a table or not. But the function itself is the same.
You are somehow saying that your choice of implementation for that function will somehow change the function. Which means that according to you, if you do precompute (or possibly cache, full precomputation is just an infinite cache size) individual mappings it somehow magically makes some magic happen that gains some deep insight. It does not. We have already established that it is the same function.