this post was submitted on 22 Jan 2025

R1 utilizes a training method called direct reinforcement learning, a form of unsupervised learning that forgoes the need for labelled data or explicit solutions. Instead, the model explores various approaches, generating multiple candidate answers that are grouped and evaluated using a reward score. This score acts as a fitness function, allowing the model to learn and adjust its strategies over time. R1 progressively improves its problem-solving abilities by reinforcing successful approaches, much as humans learn to solve problems through trial and error.
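The grouping-and-scoring idea can be sketched in a few lines. This is a toy illustration, not R1's actual algorithm (which is the more involved GRPO method); the `reward` and `group_advantages` names are mine. Each sampled answer is scored relative to its group's mean reward, so above-average answers get reinforced and below-average ones discouraged:

```python
import statistics

def reward(answer: str, target: str) -> float:
    # Toy rule-based reward: 1.0 for a correct final answer, else 0.0.
    # (R1 reportedly used rule-based rewards such as answer correctness.)
    return 1.0 if answer.strip() == target else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    # Score each answer relative to its group: (r - mean) / std.
    # Answers above the group average are reinforced, the rest discouraged.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled answers to the prompt "2 + 2 = ?"
samples = ["4", "5", "4", "3"]
rewards = [reward(s, "4") for s in samples]
advantages = group_advantages(rewards)  # correct answers score above zero
```

In a real training loop these advantage scores would weight the policy-gradient update; here they just show how a group of attempts yields a relative learning signal without any labelled reasoning traces.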

top 3 comments
[–] ksynwa@lemmygrad.ml 5 points 6 months ago (2 children)

Tangential, but I understand how these reasoning-based systems are supposed to work. I saw some sample output from R1, and it looks like it generates a thought process for answering the prompt before actually answering it. I can see how that would make the answer more likely to be logical, but the thinking part was 2 to 3 times longer than the actual answer.

I am assuming OpenAI, Anthropic, etc. are doing something similar. As a concept I see nothing wrong with it, but since these services charge per token, wouldn't this process balloon the context size? It would make querying them much more expensive.
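A rough back-of-the-envelope sketch of the concern (the price and token counts below are made up for illustration, not any provider's actual rates): if the hidden reasoning is 3x the length of the visible answer, billed output roughly quadruples.

```python
PRICE_PER_1K_OUTPUT = 0.01  # hypothetical rate, dollars per 1000 output tokens

def query_cost(answer_tokens: int, reasoning_ratio: float) -> float:
    # Total output cost when the 'thinking' section is reasoning_ratio
    # times as long as the visible answer (both billed as output tokens).
    total_tokens = answer_tokens * (1 + reasoning_ratio)
    return total_tokens / 1000 * PRICE_PER_1K_OUTPUT

plain = query_cost(500, 0.0)           # 500 billed tokens
with_reasoning = query_cost(500, 3.0)  # 2000 billed tokens, 4x the cost
```

The same multiplier applies at any price point, which is why cheap per-token pricing matters so much more once models emit long reasoning traces.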

[–] KrasnaiaZvezda@lemmygrad.ml 5 points 6 months ago

It is more expensive, although DeepSeek's model is considerably cheaper than the others, which makes this much less of a factor. Additionally, these "reasoning models" aren't necessarily better for every task, so for many things a normal, cheaper model might still be preferred.

[–] yogthos@lemmygrad.ml 3 points 6 months ago

The key advantage here is that you can see how the model arrives at a solution, which is essential for being able to guarantee correctness. The core problem with LLMs is that they can't explain how they landed on a particular answer. When the steps are explicitly spelled out, you can review them and ask the model to fix a specific one. Another possibility going forward could be directed learning, the same way we teach humans: if the model shows its steps, you could point out the mistake in a specific step and have it learn to avoid that mistake in the future.
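One way to picture that "fix a specific step" loop (purely illustrative; the function name and prompt format are my assumptions, not any real API): replay the model's own numbered steps, flag the one a reviewer found wrong, and re-prompt from there.

```python
def build_correction_prompt(question: str, steps: list[str],
                            bad_index: int, correction: str) -> str:
    # Assemble a follow-up prompt that replays the model's own reasoning
    # and asks it to redo the flagged step (1-based index for readability).
    lines = [f"Question: {question}", "Your previous reasoning:"]
    for i, step in enumerate(steps, 1):
        lines.append(f"{i}. {step}")
    lines.append(f"Step {bad_index} is wrong: {correction}")
    lines.append("Redo the reasoning from that step and give a corrected answer.")
    return "\n".join(lines)

prompt = build_correction_prompt(
    "What is 15% of 80?",
    ["15% as a fraction is 15/100", "80 * 15/100 = 14"],
    2,
    "80 * 15/100 is 12, not 14",
)
```

Today this only steers the next response; the directed-learning idea in the comment would go further and fold such corrections back into training.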