I've been meaning to try to write a bit more, and unfortunately I can't put this on a blog post attached to my name if I wish to be employable in tech in the US, so I figured I'd write a bit of an effortpost about the state of LLMs in the world. I've been learning a lot about LLMs lately (don't ask me why the things that become hyperfocuses for me become hyperfocuses) and I figured that some people here might be interested in learning more.
I was inspired to post by this article posted to /c/news, and all I have to say about this is JDPON Don back at it with another banger. In all seriousness, I think this is very good for the state of Chinese AI, which is already very good.
For those not following recent LLM updates (very understandable), the TL;DR is that a lot of new open-source models coming out of China are really good, and pushing the state of the art. Generally, they're still less good than the best closed-source models from the US (Claude in particular is the best currently), but they're much much much cheaper and honestly getting quite good. Plus, they seem to be giving US-based AI companies a good scare, which is always fun.
For reference, the best models from US firms are Claude (by Anthropic), Gemini (by Google), and OpenAI's models, though it seems like GPT-5 was a bit of a disappointment. My bet's on Anthropic among the closed source labs - they seem to be killing it, and they have some very interesting research on understanding LLMs. This is a very cool paper from them that tries to explain how LLMs work internally, using a quite novel approach that I think could make their behavior a lot more explainable.
[Side note: I think it's quite scary that leading AI research firms making leading AI models generally don't know how they work or how to reason about what they're doing, especially given that they can tell when they're being evaluated and notably suppress the "scheming" part of them when they think they're being tested on scheming.]
Anyways, back to China. One of the most significant LLMs to come out of China in the last while was DeepSeek-R1, which was able to match or outperform OpenAI's state of the art model o1 (the best model at the time) on most benchmarks. R1 completely changed the metagame - it singlehandedly shifted the dominant type of LLM from dense models to Mixture-of-Experts, and scared OpenAI into dropping its prices for o1. And DeepSeek did this during a huge GPU shortage in China caused by the export controls. And they did it while spending only $5.5M USD, compared to the estimated $100M to train GPT-4 (which is less powerful than o1). This is absolutely bonkers, and there's a reason this caused the stock market in the US to dip for a bit.
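For anyone wondering what Mixture-of-Experts actually means: a dense model runs every parameter for every token, while an MoE model has a small router that picks a few "experts" per token and only runs those. Here's a toy sketch of that routing idea. It is not DeepSeek's actual architecture; the experts, the router scores, and the `moe_layer` name are all made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw router scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, router_scores, top_k=2):
    """Run only the top_k highest-scoring experts and blend their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen

# Four toy "experts"; a real expert is a feed-forward network, not a lambda.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
# Hypothetical router scores for one token; a real router computes these from x.
scores = [0.1, 2.0, -1.0, 2.0]

output, chosen = moe_layer(3.0, experts, scores, top_k=2)
print(chosen, output)  # only 2 of the 4 experts did any work for this token
```

The upshot is that the model can have a huge total parameter count while only paying the compute cost of the experts it picked for each token.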
Now, R1 is not quite as good as the closed source models, despite the benchmarks. In particular, its English flows less well and it struggles with some types of queries. But it's crazy that a company came out of nowhere, trained a new type of model for roughly 1/20 the cost of OpenAI training a worse model, released it for free, and completely changed the meta. And it also reasons, which isn't new, but it is a particularly good reasoner, and I think they got a lot of things right with how it works.
Anyways, R1 is old news now. There are a billion new open source models coming out of China now. Some notable companies include Alibaba (Qwen), Moonshot AI (Kimi), and Z.ai (formerly Zhipu AI; GLM). People say that Qwen3 Coder and Qwen3 235B A22B (both Thinking and Instruct) are very good - for my use cases (mostly programming), I much prefer GLM 4.5. I was impressed with Qwen for questions about code, but I found it to be less good at actually writing it, for the most part. YMMV, though; I think this is a somewhat unpopular opinion. But anyways, it seems like each week a new top open source model appears from China. They are leading the open source efforts, far and away. And even if they aren't quite as good as Claude, Claude Sonnet 4 costs $15/million tokens of output, whereas Qwen3 Coder is free up to 2000 requests per day from Alibaba, and costs $0.80/million tokens of output after that, which is crazy cheap.
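To put that price gap in perspective, here's some back-of-the-envelope math using the two output prices quoted above. The 50 million tokens/month figure is a made-up workload, and this ignores input token pricing and free tiers entirely.

```python
def output_cost_usd(output_tokens, usd_per_million):
    """Cost of generating output_tokens at a given price per million tokens."""
    return output_tokens / 1_000_000 * usd_per_million

# Output prices (USD per million tokens) as quoted in the post.
PRICES = {"Claude Sonnet 4": 15.00, "Qwen3 Coder": 0.80}

monthly_tokens = 50_000_000  # hypothetical heavy-usage month

for name, price in PRICES.items():
    print(f"{name}: ${output_cost_usd(monthly_tokens, price):,.2f}")

ratio = PRICES["Claude Sonnet 4"] / PRICES["Qwen3 Coder"]
print(f"Price ratio: {ratio:.2f}x")
```

So at those list prices, the same amount of output costs almost 19 times more from Claude.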
Another notable thing about Chinese open source models is that they are generally much easier to jailbreak than Western models, except for older, less powerful open source models like Meta's Llama and Mistral's models, which are also very easy. So you can get them to write all the erotic bomb making content you'd want (I'm happy to provide tips on jailbreaking if anyone would like).
Also, it seems that in the current market, companies in general are tripping over each other to give you free access to open source LLMs as each tries to become the place to get LLM access from, which means it's a really good time to be mooching access to these guys. Alibaba will give lots and lots of Qwen3 Coder credits, OpenRouter will give you 2000 free requests a day for eternity to a lot of good models if you at any point put $10 into their system, Chutes will give you 200 free requests/day for basically any open source model for a one time payment of $5, etc. Even Google will give you free access to their top tier model (though a pretty small amount per day) via Gemini CLI.
Anyways, my main point is that China is doing all of this despite a huge GPU shortage in the country. So if JDPON Don really wants to give them more access to Nvidia chips, it must be because he wants to boost their LLM market even further.
Thanks for coming to my Theodore lecture.
I wonder how much of a driving force it is for LLM research that capitalists are trying to automate away software engineers, which seems to be one of the biggest if not the biggest advertised use cases of each LLM.
Well, I think that's part of it. I think that companies are probably trying to do that, but I think most software jobs won't be replaced by Claude for a long while. I don't know what people who work for OpenAI/Anthropic/Google think about this - maybe they think it's coming, maybe not. IME, LLMs are good at writing code when instructed by a skilled engineer, but on their own, not very good. And the code they write is not always very maintainable. As someone who is currently unemployed but normally works in software, I've been using LLMs for brainstorming software design decisions for personal projects, and I find that they are good at "talking through" these kinds of things but less good at actually implementing them in a way that makes sense.
Some more reasons why LLMs are getting better at programming tasks:
There's a ton of training data. This is separately frustrating to me because all of the open source code that people wrote for the greater good or whatever is now just getting hoover'd up into these closed source models (tbh I care less when Qwen or DeepSeek does this because they release the weights). In open source, there's a license called the GPL that says that any derivative work has to be open sourced and released under the same terms of the GPL, and I think this is a Really Good way to make copyright law work for the public interest. Of course, the Silicon Valley mindset of move fast and break things (especially the law, until you're big enough that the law doesn't matter anymore - see Uber, Airbnb, etc) doesn't give a fuck about this, and now LLMs are already too entrenched for anything to happen about this.
A lot of the time, there is a concrete right or wrong in programming. It's a lot harder to evaluate how well your model works as a creative writer, but for code you can run the code and see if it does the thing you wanted it to. (Obviously there are other factors like code style and stuff but at least you have the baseline. It's also easier for an LLM to grade whether output code is good style than it is to assess if a story is good.) Most LLMs nowadays are trained significantly on "synthetic data" which means data generated by other LLMs, and doing this at scale means you don't have a human in the loop reading over the training data and grading essays or anything. It's computers all the way down.
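The "run the code and see if it does the thing" idea can be sketched in a few lines. This is a toy grader, not how any particular lab does it: the `grade` function and the test cases are invented for illustration, and a real pipeline would run untrusted model output in a sandbox rather than calling `exec` directly.

```python
def grade(candidate_source, func_name, test_cases):
    """Return the fraction of test cases the candidate passes (0.0 to 1.0)."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # unsafe outside a sandbox
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that doesn't even load gets no credit
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this case counts as a failure
    return passed / len(test_cases)

# Pretend these strings are three different model outputs for "write add(a, b)".
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

for label, source in [("good", good), ("bad", bad), ("broken", broken)]:
    print(label, grade(source, "add", tests))
```

No human had to read any of the three outputs to score them, which is exactly why this scales in a way that grading essays doesn't.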
Reasoning is important, and programming is a very concrete task for training chain-of-thought style responses, like the ones DeepSeek-R1 produces. That's also why they are trained on a lot of math, e.g. the kind of competition problems that show up in AIME benchmarks.
I think that GPT-3 got a lot of gains from training on programming, in a way that generalized to other tasks. Somehow having all of that structured data in its training data made it better at tasks across the board, and that was one of the breakthroughs that led to the original ChatGPT. I think the companies' reasoning is that because it's easy to train on code, and training on code seems to make models better at other things too, it's the easiest path to more intelligent LLMs.
Programmers are early adopters of technology and potentially willing to pay large sums of money. The top-tier plan for Claude Code costs $200/month, which is crazy. And people buy it. Because when you make $200k/year, if a tool helps you do your job 30% better, it's worth that kind of money. I think that this is a unique phenomenon - other high paying professions like law or medicine wouldn't adopt this kind of technology as urgently. Hopefully most people in these professions realize that LLMs hallucinate too much to be useful in the general case. They can be good at reading papers or documents and summarizing them, but that is a much easier task than programming, so even if they were widely used for that narrow purpose it wouldn't justify a $200/month subscription. Like if ChatGPT can do it for free, or you buy their $20 tier and it can do that fine, who would pay $200?
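For the curious, here's the napkin math behind the $200/month argument, using the numbers from above. The big assumption is that "30% better at your job" is worth 30% of your salary to somebody, which is obviously hand-wavy; the function name is just for illustration.

```python
def yearly_tool_math(salary_usd, productivity_gain, monthly_price_usd):
    """Compare a subscription's yearly cost to the value of a productivity gain."""
    yearly_cost = monthly_price_usd * 12
    # Assumption: a gain of X% is worth X% of salary.
    yearly_value = salary_usd * productivity_gain
    return yearly_cost, yearly_value, yearly_value / yearly_cost

cost, value, payoff = yearly_tool_math(200_000, 0.30, 200)
print(f"cost ${cost:,.0f}/yr, value ${value:,.0f}/yr, payoff {payoff:.1f}x")

# Productivity gain needed just to break even on the subscription.
break_even_gain = cost / 200_000
print(f"break-even gain: {break_even_gain:.1%}")
```

Under that assumption the tool only needs to make you a bit over 1% more productive to pay for itself, which is why the price tag doesn't scare off people on that kind of salary.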