Deep dive into LLM reasoning models

In Progress

January 2026

I wanted to better understand how to improve the reasoning of an LLM and how reasoning models differ from traditional foundation models, so I decided to work through the book "Build a Reasoning Model (From Scratch)" by Sebastian Raschka.

Finetuning LLaMA 3.2 model using QLoRA to predict product price based on description. Data preparation and model evaluation against business objectives.

Reasoning Models, LLM, Reinforcement Learning, Distillation, Model Evaluation, Inference Scaling


Key concepts

There are two ways to improve reasoning: increasing training compute, or increasing inference compute (inference scaling, also known as inference-time scaling).

Inference scaling methods

CoT: Chain-of-thought (CoT) prompting asks the model to explain its reasoning step by step, which lengthens the response. Not every model's reasoning benefits from CoT. In some use cases it leads to "overthinking", where the model generates erroneous explanations and misleads itself. CoT does not give the model new knowledge; it helps the model use the knowledge it already has.
Self-consistency: Parallel sampling where the model generates multiple responses and the most frequent one is selected. The technique comes from a Google Research paper (https://arxiv.org/abs/2203.11171). It is a simple form of majority voting: we use temperature scaling and top-p filtering to generate diverse answers, then select the most frequent one.
Self-refinement: Iterative refinement where the model reviews and improves its own reasoning and answers across multiple steps.
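
Self-consistency only works if the sampled responses differ, which is where temperature scaling and top-p filtering come in. Below is a minimal sketch of both steps on a plain list of logits, using only the Python standard library. The function names (softmax_with_temperature, top_p_filter, sample_token) and the default values are my own illustrative choices, not the book's code or any library's API.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass  # renormalize the surviving tokens
    return filtered


def sample_token(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token index after temperature scaling and top-p filtering."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling sample_token repeatedly on the same logits returns different token indices from run to run, which is what gives self-consistency a set of distinct candidate answers to vote over.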

Inference-time scaling via self-consistency

Self-consistency lets the model produce several answers in parallel. We then pick the final answer by taking a simple majority vote over these candidates. It often yields large gains in answer accuracy, which is why it has become a common choice in LLM applications where accuracy is a higher priority than latency. Examples: DeepSeekMath-V2 and Google's Gemini 3 Deep Think mode.

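One downside of self-consistency is that majority voting requires short final answers that can be compared directly (for example a number or a multiple-choice letter), so it is a poor fit for free-form text.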
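
The voting step can be sketched in a few lines. This is a simplified illustration, not the book's implementation: generate is a hypothetical callable that takes a prompt and returns one sampled response, extract_final_answer is an assumed helper that treats the last number in a response as its answer, and the samples are drawn one after another rather than in parallel.

```python
import re
from collections import Counter


def extract_final_answer(response):
    """Return the last number in the response as a string, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None


def self_consistency(generate, prompt, num_samples=5):
    """Sample several responses and return the most frequent extracted answer."""
    answers = []
    for _ in range(num_samples):
        answer = extract_final_answer(generate(prompt))
        if answer is not None:  # skip responses with no comparable answer
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

The extraction step is what makes the downside above visible: the vote is taken over short normalized strings, so two responses only count as agreeing if they reduce to the same extracted answer.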

Inference-time scaling via self-refinement

Self-refinement focuses on iteratively refining a single answer to correct potential mistakes.
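
As a rough sketch of the idea, the loop below alternates between a critique and a revision. It is one possible draft, critique, revise scheme under my own assumptions and not necessarily how the book implements it: generate is a hypothetical callable that maps a prompt to a response, and the prompt wording, the 'NO ISSUES' stop phrase, and max_rounds are illustrative choices.

```python
def self_refine(generate, question, max_rounds=3):
    """Draft an answer, then repeatedly critique and revise it."""
    answer = generate(f"Question: {question}\nAnswer step by step.")
    for _ in range(max_rounds):
        critique = generate(
            f"Question: {question}\nProposed answer: {answer}\n"
            "List any mistakes in the proposed answer. "
            "Reply with exactly 'NO ISSUES' if it is correct."
        )
        if critique.strip().upper() == "NO ISSUES":
            break  # the model found nothing to fix, so stop early
        answer = generate(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {critique}\nWrite an improved answer."
        )
    return answer
```

Unlike self-consistency, the calls here are sequential, since each round depends on the previous answer. In this sketch the cost is between 2 and 1 + 2 * max_rounds model calls.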