AI Anatomy Segmentation Evaluated Without Ground Truth Data

By Artūras Malašauskas, May 13, 2026

Researchers developed a consensus-based framework to compare medical AI segmentation models using NLST chest CT scans, revealing variable performance across anatomical structures.

Medical imaging researchers face a persistent bottleneck: how to compare AI segmentation tools objectively when expert-verified ground truth annotations don't exist. A recent study published in the Journal of Medical Imaging proposes a practical workaround: evaluating models by their agreement with each other rather than by accuracy against a gold standard.

The investigation centers on chest CT images from the National Lung Screening Trial (NLST), a widely used public dataset containing thousands of imaging volumes. Despite its scale, NLST lacks comprehensive organ and bone segmentations. Manual annotation of such intricate structures requires highly skilled radiologists and is extremely time-consuming, which makes complete ground truth labeling impractical for large-scale studies.

Researchers selected six prominent open-source segmentation models for comparison: TotalSegmentator (two versions), Auto3DSeg, MOOSE, MultiTalent, and CADS. Each model uses different terminology, boundary definitions, and anatomical inclusion criteria. Without standardization, direct comparison becomes nearly impossible.

The team harmonized all outputs by converting them into an interoperable DICOM segmentation standard. They unified labels using the SNOMED-CT vocabulary, a widely accepted medical ontology, and assigned uniform color codes and identifiers to anatomical regions. This harmonization enabled side-by-side visualization of segmentations from different models on the same scan.
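As a rough illustration of what label harmonization involves, the sketch below maps each model's native label names onto one shared vocabulary with a common code and color. It is not the study's pipeline: the model names, native label names, colors, and `PLACEHOLDER-…` codes are all hypothetical stand-ins (they are not real SNOMED-CT identifiers), and the `harmonize` helper is invented for this example.

```python
from typing import Dict, Tuple

# Hypothetical shared vocabulary: structure key -> (placeholder code, RGB color).
# The codes are placeholders, NOT real SNOMED-CT identifiers.
SHARED_VOCAB: Dict[str, Tuple[str, Tuple[int, int, int]]] = {
    "heart": ("PLACEHOLDER-0001", (220, 60, 60)),
    "left_lung_upper_lobe": ("PLACEHOLDER-0002", (90, 160, 220)),
}

# Hypothetical native label names used by two made-up models.
MODEL_LABEL_MAP: Dict[str, Dict[str, str]] = {
    "model_a": {"Heart": "heart", "lung_upper_lobe_left": "left_lung_upper_lobe"},
    "model_b": {"cardiac": "heart", "LUL": "left_lung_upper_lobe"},
}


def harmonize(model: str, labels: Dict[int, str]) -> Dict[int, dict]:
    """Map a model's {label_id: native_name} onto the shared vocabulary.

    Labels with no entry in the mapping are skipped rather than guessed.
    """
    mapping = MODEL_LABEL_MAP[model]
    harmonized = {}
    for label_id, native_name in labels.items():
        key = mapping.get(native_name)
        if key is None:
            continue  # structure not covered by the shared vocabulary
        code, rgb = SHARED_VOCAB[key]
        harmonized[label_id] = {"structure": key, "code": code, "rgb": rgb}
    return harmonized


# Two models with different names and label ids end up on the same code and color.
a = harmonize("model_a", {1: "Heart", 2: "unmapped_structure"})
b = harmonize("model_b", {7: "cardiac"})
```

Once every model's output carries the same code and color for a given structure, overlaying them on one scan becomes a matter of lookup rather than manual matching.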

To make the results accessible, the study used two open-source platforms: OHIF Viewer, a browser-based tool, and 3D Slicer, a desktop application. Custom plugins display multiple segmentations simultaneously in three-dimensional and orthogonal two-dimensional views. Researchers can interactively explore where models agree and disagree for individual organs, a workflow that previously required manual overlay work in multiple applications.

The analysis focused on 18 chest CT scans from different NLST participants. After filtering out partially imaged or inconsistently detected anatomical structures, the study concentrated on 24 key regions, including lung lobes, the heart, ribs, thoracic vertebrae, and the sternum. For each structure, the authors defined a "consensus" segmentation as the set of voxels labeled by every model that recognizes that structure.
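The consensus definition above amounts to a voxel-wise intersection of the models' masks. The sketch below shows that idea on toy data, assuming each model's output for one structure is a boolean NumPy array on the same grid. The Dice overlap used to score each model against the consensus is an assumption made for illustration; the article does not say which overlap metric the study reports.

```python
import numpy as np


def consensus_mask(masks):
    """Voxels labeled by every model: the intersection of all boolean masks."""
    stacked = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stacked.all(axis=0)


def dice(a, b):
    """Dice overlap between two boolean masks (1.0 if both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = int(a.sum()) + int(b.sum())
    if denom == 0:
        return 1.0
    return 2.0 * int(np.logical_and(a, b).sum()) / denom


# Toy 4x4x4 volume with three hypothetical model outputs for one structure.
m1 = np.zeros((4, 4, 4), dtype=bool)
m2 = np.zeros((4, 4, 4), dtype=bool)
m3 = np.zeros((4, 4, 4), dtype=bool)
m1[1:3, 1:3, 1:3] = True   # 8 voxels
m2[1:3, 1:3, 1:4] = True   # 12 voxels, extends one voxel along the last axis
m3[0:3, 1:3, 1:3] = True   # 12 voxels, extends one voxel along the first axis

consensus = consensus_mask([m1, m2, m3])
scores = [dice(m, consensus) for m in (m1, m2, m3)]
```

Because the consensus is an intersection, it can only shrink as models are added, so a single model with a narrow definition of a structure pulls every other model's score down.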

Results varied across structures. Lung segmentation showed strong agreement, with high overlap and nearly indistinguishable boundaries across all models. This consistency suggests that lung segmentation is a mature technology, likely because of abundant training data and well-defined anatomical landmarks.

Heart segmentations initially showed only moderate concordance, primarily because one outlier model adopted a narrower definition of the heart. Excluding this model markedly improved agreement among the rest. Bone structures proved harder. Four of the six models made frequent errors in rib and thoracic vertebra labels, including merging adjacent bones and misidentifying vertebral levels.
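One simple way to spot an outlier like the narrow heart definition is a leave-one-out check: recompute the consensus without each model in turn and see whose removal makes it grow the most. The sketch below is an illustrative assumption, not the procedure the study describes, and the model names and one-dimensional toy masks are hypothetical.

```python
import numpy as np


def leave_one_out_consensus_volume(masks_by_model):
    """Return {model: voxel count of the consensus formed WITHOUT that model}.

    Expects at least two models with boolean masks of identical shape.
    """
    names = list(masks_by_model)
    volumes = {}
    for left_out in names:
        rest = [
            np.asarray(masks_by_model[name], dtype=bool)
            for name in names
            if name != left_out
        ]
        volumes[left_out] = int(np.stack(rest).all(axis=0).sum())
    return volumes


# Toy 1-D "volumes": two broadly similar masks and one much narrower one.
masks = {
    "model_a": np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=bool),
    "model_b": np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=bool),
    "narrow":  np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0], dtype=bool),
}

volumes = leave_one_out_consensus_volume(masks)
# The model whose removal yields the largest consensus is the likeliest outlier.
suspect = max(volumes, key=volumes.get)
```

A check like this only flags a model as different from the others. Whether the narrower definition is wrong or simply follows another anatomical convention still takes visual review.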

Two models trained on distinct datasets produced notably more consistent and anatomically comprehensive segmentations. These differences were hidden in aggregate statistics but emerged clearly in side-by-side visual review, which underscores the need to combine quantitative and qualitative evaluation.

The investigation underscores a crucial insight: even highly cited AI segmentation models can harbor systematic weaknesses, particularly when trained on overlapping or limited data. It also validates a novel pathway for meaningful model assessment without the prohibitive cost of manual ground truth annotation.

For developers, this framework offers a transferable methodology applicable to any three-dimensional imaging dataset. The team released a publicly accessible interactive website to disseminate findings, inviting the broader research community to examine detailed concordance metrics and underlying imaging data themselves.

Whether this consensus-based approach becomes the new standard for medical AI evaluation remains uncertain. The real test comes when these models encounter edge cases outside their training distributions, where agreement among flawed models might simply confirm shared biases rather than anatomical truth.

The framework works well until you need to know which model is actually right, not just which ones agree.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference; no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
