Acquiring accurate real-world monocular depth data in surgery is often infeasible. Widely used synthetic datasets provide accurate ground-truth labels, but they do not reflect the variability of real-world surgery. To address this limitation, we leverage high-fidelity synthetic depth data and transfer this understanding to diverse surgical datasets. To this end, we introduce a novel, efficient teacher-student architecture named 'PatchSurg'. PatchSurg exploits the structural details in synthetic datasets and transfers them to real-world cases, mitigating the domain gap between synthetic and real-world data with a detail and scale disentangling technique. Furthermore, we utilise a pose prediction network that processes temporally adjacent frames to enhance temporal consistency in depth estimation. In terms of root mean squared error, PatchSurg achieves a 20.5% improvement over recent approaches on the synthetic SimuScope dataset and substantial gains over existing state-of-the-art methods on real surgical datasets: 9% on EndoNeRF, 21.4% on SCARED, and 26.4% on SERV-CT, each relative to the most accurate competing method.
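The reported gains are stated in terms of root mean squared error (RMSE). The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of how such a figure could be computed. It assumes the improvement is the relative reduction in RMSE against a baseline and that pixels with non-positive ground-truth depth are treated as invalid and skipped. The function names `rmse` and `relative_improvement` and the toy depth values are hypothetical and not taken from the paper.

```python
import math


def rmse(pred, gt):
    """Root mean squared error over paired depth values.

    Assumption: ground-truth values <= 0 mark invalid pixels and are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depth values")
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def relative_improvement(baseline_rmse, new_rmse):
    """Percentage reduction in RMSE relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Toy depth values (illustrative only); the third ground-truth entry is invalid.
gt = [10.0, 20.0, 0.0, 40.0]
baseline_pred = [12.0, 18.0, 5.0, 44.0]
new_pred = [11.0, 19.0, 5.0, 42.0]

baseline_err = rmse(baseline_pred, gt)
new_err = rmse(new_pred, gt)
print(f"baseline RMSE: {baseline_err:.3f}")
print(f"new RMSE:      {new_err:.3f}")
print(f"improvement:   {relative_improvement(baseline_err, new_err):.1f}%")
```

Under this definition, a 26.4% improvement would mean the new method's RMSE is 73.6% of the baseline's RMSE on the same evaluation set.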
10.1109/ISBI61048.2026.11516012
Conference paper
2026-01-01T00:00:00+00:00
2026-April