MedQA: Fine-Tuning Clinical AI on AMD ROCm Without CUDA

By Artūras Malašauskas, May 08, 2026

A new HuggingFace project demonstrates full medical AI fine-tuning on AMD Instinct MI300X hardware using ROCm, proving CUDA is no longer mandatory for clinical model development.

Medical question answering carries stakes that most AI applications never approach. A model confidently selecting the wrong answer on a clinical multiple-choice question isn't merely incorrect—it's potentially dangerous. Yet most open-source medical AI work assumes you have an NVIDIA GPU. CUDA is the default. Everything else is an afterthought. A new project challenges that assumption directly.

The MedQA project is a LoRA fine-tuned clinical question-answering model built entirely on AMD hardware using ROCm. It takes a multiple-choice medical question and returns both the correct answer letter and a clinical explanation of the reasoning. The entire training pipeline—from data loading to adapter export—runs on an AMD Instinct MI300X without a single CUDA dependency. The model is available on the HuggingFace blog with full documentation and code.

Why AMD ROCm? The AMD Instinct MI300X is a remarkable piece of hardware: 192 GB of HBM3 memory in a single device. For LLM fine-tuning, VRAM is often the binding constraint: it dictates batch size, sequence length, and whether you need to quantize at all. With 192 GB available, the team trained Qwen3-1.7B with LoRA in full fp16, with no 4-bit or 8-bit quantization workarounds. In practice, that headroom means nothing has to be offloaded or swapped to make the run fit. No waiting on memory transfers. No watching progress bars crawl.
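To put the memory headroom in perspective, here is a rough back-of-the-envelope sketch. It counts model weights only and ignores activations, gradients, and optimizer state, so treat the numbers as a lower bound. The 1.7e9 parameter count is the nominal figure from the model name, not a measured value.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter count implied by the name "Qwen3-1.7B" (an approximation).
NUM_PARAMS = 1.7e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Even in fp16 the weights occupy only a few gigabytes, which is why quantization is optional rather than required on a 192 GB device.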

More importantly, the goal was to prove that the HuggingFace ecosystem—Transformers, PEFT, TRL, Accelerate—works seamlessly on ROCm. It does. The same training code that runs on CUDA runs on ROCm with three environment variables set. That's it. No code changes. No custom kernels. No CUDA compatibility shims. The barrier to entry drops from "need an NVIDIA GPU" to "need the right environment variables" (which is still a barrier, but a much smaller one).
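One reason the same training code can run unchanged is that ROCm builds of PyTorch expose AMD GPUs through the familiar `torch.cuda` API. The sketch below is a small sanity check for which backend is active. The helper name `describe_backend` is hypothetical and not part of the project, and it does not set or assume any particular environment variables.

```python
def describe_backend() -> str:
    """Report which accelerator backend the installed PyTorch build can use."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "torch not installed"

    # On ROCm builds, AMD GPUs are reported through the torch.cuda namespace.
    if not torch.cuda.is_available():
        return "cpu"

    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    return "rocm" if getattr(torch.version, "hip", None) else "cuda"


print(describe_backend())
```

Running this before a training job is a cheap way to confirm that the environment is configured as expected.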

The dataset powering this work is MedMCQA, a large-scale multiple-choice question dataset derived from Indian medical entrance exams. Each example contains a clinical question, four answer options, the correct answer index, and an optional free-text explanation field. For this project the team used 2,000 training samples—a deliberately small slice to demonstrate that meaningful fine-tuning is achievable quickly. Training took approximately 5 minutes on the MI300X.

The base model is Qwen/Qwen3-1.7B, Alibaba's latest small-scale language model. At 1.7 billion parameters it's compact enough to fine-tune cheaply but capable enough to produce coherent clinical reasoning. It loads cleanly with HuggingFace Transformers, including with trust_remote_code=True set. The prompt format uses a consistent template for every training example and inference call, which matters more than most developers realize when instruction fine-tuning: a model tuned on one layout tends to degrade when prompted with another.
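A consistent template can be as simple as one function shared by training and inference. The sketch below is illustrative only: the template wording is an assumption, not the project's actual prompt, and the field names (`question`, `opa` to `opd`, `cop`, `exp`) are assumed to follow the MedMCQA dataset layout. The sample record is made up for the example.

```python
LETTERS = "ABCD"


def build_prompt(record: dict) -> str:
    """Render a question and its four options in one fixed layout."""
    options = [record["opa"], record["opb"], record["opc"], record["opd"]]
    lines = [f"Question: {record['question']}"]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_target(record: dict) -> str:
    """Render the expected completion: answer letter plus optional explanation."""
    letter = LETTERS[record["cop"]]  # cop is assumed to be a 0-based option index
    explanation = (record.get("exp") or "").strip()
    if explanation:
        return f" {letter}\nExplanation: {explanation}"
    return f" {letter}"


# Made-up record in the assumed MedMCQA field layout.
sample = {
    "question": "Deficiency of which vitamin causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin C",
    "opc": "Vitamin D",
    "opd": "Vitamin K",
    "cop": 1,
    "exp": "Vitamin C is required for collagen synthesis.",
}

print(build_prompt(sample) + build_target(sample))
```

Calling `build_prompt` alone at inference time guarantees the model sees exactly the layout it was tuned on.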

Training uses LoRA via the PEFT library. Rather than fine-tuning all 1.7 billion parameters, LoRA injects small trainable rank-decomposition matrices into the attention layers, leaving the base weights frozen. Only ~2.2 million parameters are trained, roughly 0.13% of the model's 1.7 billion. This keeps memory usage low and training fast. The configuration targets q_proj and v_proj modules with a rank of 8 and alpha of 16.
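The sketch below shows why LoRA trains so few parameters, and how the stated settings might be expressed with PEFT. The rank, alpha, and target modules come from the article. The 2048 x 2048 layer shape is a hypothetical example rather than the model's actual dimensions, and `task_type="CAUSAL_LM"` is an assumption about the setup.

```python
RANK = 8
ALPHA = 16
TARGET_MODULES = ["q_proj", "v_proj"]

# LoRA scales the low-rank update by alpha / rank.
LORA_SCALING = ALPHA / RANK


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    The update is B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def build_lora_config():
    """Build a PEFT LoraConfig (requires the peft package to be installed)."""
    from peft import LoraConfig

    return LoraConfig(
        r=RANK,
        lora_alpha=ALPHA,
        target_modules=TARGET_MODULES,
        task_type="CAUSAL_LM",  # assumption: causal language modeling setup
    )


# Hypothetical 2048 x 2048 projection, for illustration only.
full = 2048 * 2048
added = lora_added_params(2048, 2048, RANK)
print(f"full layer: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
```

Because the added count grows with `rank * (d_in + d_out)` instead of `d_in * d_out`, the adapter stays tiny relative to the frozen layer it modifies.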

A few technical details worth noting: the team uses standard fp16 rather than bfloat16. In early experiments with bfloat16 they encountered NaN loss; switching to fp16 resolved it entirely. Gradient checkpointing is enabled. It is not strictly necessary on the MI300X given the 192 GB of VRAM, but it trades extra compute for lower activation memory, so the same configuration can be reproduced on smaller GPUs. The learning rate schedule uses cosine decay with warmup, which provides smoother convergence than a flat schedule for short training runs.
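For readers unfamiliar with the schedule, here is a minimal sketch of cosine decay with linear warmup. The peak learning rate, step count, and warmup length are placeholder values chosen for illustration. The article does not state the project's actual hyperparameters.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Placeholder hyperparameters (assumptions, not the project's settings).
TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 100, 10, 2e-4

for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at_step(step, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR):.2e}")
```

The warmup phase avoids large updates while optimizer statistics are still unstable, and the cosine tail lowers the rate gradually, which matters when the whole run lasts only a few minutes.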

After training, the outputs directory contains the LoRA adapter weights—a few MB of files rather than a full multi-GB model checkpoint. At inference time you load the base model, attach the LoRA adapter, and optionally merge the weights. Generation uses greedy decoding with a repetition penalty to prevent the model from looping. The physical experience of running this locally is notably different from cloud-based alternatives. No API latency. No request queue. Just your hardware, your data, your control.
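The inference steps described above can be sketched roughly as follows, using the Transformers and PEFT APIs. The adapter path, the repetition penalty value, and the token limit are assumptions, and the helper names are hypothetical. The project's own script may differ.

```python
# Greedy decoding with a repetition penalty; the numeric values are assumed.
GENERATION_KWARGS = {
    "do_sample": False,         # greedy: always pick the most likely token
    "repetition_penalty": 1.1,  # values above 1.0 discourage repeated tokens
    "max_new_tokens": 128,
}


def load_merged_model(base_id: str, adapter_dir: str):
    """Load the base model, attach the LoRA adapter, and merge the weights."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_dir)
    # Merging is optional; it folds the adapter into the base weights.
    model = model.merge_and_unload()
    return tokenizer, model.eval()


def answer(tokenizer, model, prompt: str) -> str:
    """Generate a completion and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call might look like `load_merged_model("Qwen/Qwen3-1.7B", "outputs")`, where `"outputs"` stands in for wherever the adapter was saved. Device placement is left to the caller.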

This project arrives as AMD continues expanding its AI ecosystem. At AMD AI DevDay 2026 in San Francisco, the company showcased high-signal keynotes, technical deep dives, and hands-on workshops with a clear message: AMD is pushing toward an open, full-stack AI compute ecosystem for developers. AMD's official documentation details expanded ROCm software stack support with Day 0 model support, Triton performance CI, and nightly Hugging Face integration.

The industry context matters here. Medical AI is moving toward domain-specific automation where reasoning and native integration aim to improve efficacy and safety. Recent product announcements from companies like Ambience Healthcare, Corti, and Ensemble show AI evolving across healthcare use cases. But most of these tools run on proprietary infrastructure. The MedQA project demonstrates that open-source alternatives can compete on technical merit without requiring CUDA-locked hardware.

Whether this actually changes the landscape remains to be seen. NVIDIA's CUDA ecosystem has decades of momentum. ROCm support varies across frameworks. Not every model runs equally well on AMD hardware. The MedQA project proves it's possible, but possibility isn't the same as practicality at scale. Developers will still need to test their specific workloads. Healthcare organizations will still need to validate clinical accuracy. And somewhere in the middle, someone will still need to figure out why their bfloat16 training is producing NaN loss.

The real question isn't whether AMD can run medical AI models. It's whether hospitals and research institutions will actually deploy them, and whether anyone will pay for it.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, and no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
