Efficient Medical Reasoning with DeepSeek-R1
Jan 1, 2025
Recent evidence shows that reasoning models that produce a chain of thought (CoT) perform better on multi-step reasoning tasks. We investigate how far a distilled 8-billion-parameter reasoning model, DeepSeek-R1-Distill-Llama-8B, can be pushed toward expert-level medical problem solving.
Our pipeline applies supervised fine-tuning (SFT) with QLoRA on 20k Medical-o1-Reasoning questions, each paired with a CoT trace and an answer. Extensive SFT ablations reveal that reasoning models are highly prone to overfitting: reducing the learning rate to 1e-6, training for only a single epoch, and lowering weight decay to 1e-2 are all necessary to preserve CoT quality while improving task accuracy.
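The ablation-derived settings can be collected into a training configuration. A minimal sketch, assuming Hugging Face-style field names (`TrainingArguments` / `LoraConfig` conventions); only the learning rate, epoch count, and weight decay come from the text, and every other value is an illustrative assumption, not the authors' exact script:

```python
# Hyperparameters from the SFT ablations described in the text.
sft_config = {
    "learning_rate": 1e-6,    # reduced to avoid overfitting the CoT style
    "num_train_epochs": 1,    # a single pass; more epochs degrade CoT quality
    "weight_decay": 1e-2,     # lowered relative to common defaults
}

# QLoRA adapter settings (illustrative assumptions beyond 4-bit loading,
# which is what makes this QLoRA rather than plain LoRA).
qlora_config = {
    "load_in_4bit": True,     # 4-bit quantized frozen base weights
    "r": 16,                  # LoRA rank (assumed)
    "lora_alpha": 32,         # LoRA scaling factor (assumed)
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

print(sft_config["learning_rate"], sft_config["num_train_epochs"])
```

In practice these dictionaries would be unpacked into the trainer and adapter constructors of whichever fine-tuning stack is in use.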
On the MedQA, MedMCQA, and PubMedQA benchmarks, the best SFT model closes roughly half the gap between the distilled base model and HuatuoGPT-o1-8B (which uses both SFT and RL), reaching accuracies of 0.498–0.563. It also shows strong gains in pass@k as k increases, indicating good search ability but weaker answer ranking. These results provide concrete hyper-parameter guidelines for fine-tuning distilled reasoning models in low-resource settings.
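The pass@k trend can be measured with the standard unbiased estimator, pass@k = 1 − C(n−c, k)/C(n, k), where n is the number of sampled generations and c the number of correct ones. A short sketch (the sample counts in the example are hypothetical, not the paper's numbers):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c of the n
    are correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a fully wrong draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 16 samples, 6 correct. pass@k rises with k,
# mirroring the search-vs-ranking gap noted above.
print(pass_at_k(16, 6, 1))  # → 0.375
print(pass_at_k(16, 6, 4))
```

A model whose pass@k grows steeply with k is generating correct answers somewhere in its samples (good search) but failing to surface them at k = 1 (weak ranking), which is exactly the pattern reported for the SFT model.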