arXiv:2511.20853

MODEST

Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi, Vinayak A. Belludi, Li-Yun Wang

Training and evaluation of state-of-the-art computer vision algorithms for reliable shallow depth-of-field (DoF) rendering and defocus deblurring remain constrained by a persistent lack of large-scale, full-frame, high-fidelity, real-image datasets. Shallow depth of field and defocus blur depend intimately on the camera's optical configuration, namely focal length and aperture, so models must be rigorously evaluated as these parameters change systematically. Further, modern applications such as augmented and virtual reality (AR/VR), smartphones, and industrial robots deploy stereo or multi-camera systems. We present MODEST, the first high-resolution (5472×3648 px, 20 MP), multi-optics depth-of-field stereo DSLR dataset that methodically varies focal length and aperture across a series of complex, real-world scenes, capturing the optical realism and complexity of professional camera systems. With 20,000 images across 50 distinct optical configurations (focal lengths from 28 to 70 mm, apertures from f/22 to f/2.8) at 20 stereo viewpoints for each of 10 scenes, this high-resolution, full-range optics coverage enables controlled analysis of geometric and optical effects in shallow depth-of-field rendering and defocus deblurring algorithms. Each scene is intentionally curated to include challenging visual elements, including reflective surfaces, transparent glass walls, fine-grained details, point lights, and multi-scale depth illusions. In addition, we provide image sets for monocular intrinsic calibration and stereo extrinsic calibration at each focal configuration to support ever-evolving classical and learning-based calibration methods. We evaluate several state-of-the-art algorithms for depth-of-field rendering and defocus deblurring across focal configurations and demonstrate failure cases and limitations. This work attempts to bridge the gap between synthetic, low-resolution training data and inference generalization on high-resolution, real camera optics.
We release the dataset, calibration files, and evaluation code to support reproducible benchmarking and further research on real-world optical generalization.

Dataset Comparisons

Data Comparison by Use Case

Dataset comparison categorized by use case: shallow depth-of-field rendering with defocus deblurring, and deblurring.

Use Case: Shallow DoF rendering and defocus deblurring

| Dataset | Capture Setup | Real / Synthetic | Scenes | Images | Resolution | Focal Var. | Aperture Var. | Depth Range | Calibration Set |
|---|---|---|---|---|---|---|---|---|---|
| CUHK ('14) | Natural internet images + binary blur masks | Real | N/A | 1,000 | 352 × 352 | N/A | N/A | N/A | |
| RTF ('16) | Mono RGB (Lytro light-field camera) | Real | 22 | 44 | 360 × 360 | | | N/A | |
| DPDD ('19) | Mono RGB | Real | 500 | 2,000 | 1680 × 1120 | | | 0.3m - 10m | |
| LFDOF ('21) | Mono RGB (light-field) + LiDAR | Real | N/A | 23,978 | 375 × 540 | | | 0.3m - 10m | |
| RealDOF ('21) | Stereo RGB (beam-splitter setup) | Real | 50 | 100 | 1536 × 2320 | | | N/A | |
| BLB ('22) | Synthetic Blender renderings | Synthetic | 10 | 1,000 | 1920 × 1080 | | | 0.5m - 10m | |
| VABD ('24) | Mono RGB | Real | 535 | 2,940 | 1536 × 1024 | | | N/A | |
| RealBokeh ('25) | Mono RGB (Canon EOS R6 II) | Real | 300 | 23,000 | 6000 × 4000 | | | N/A | |
| MODEST (Ours) | Stereo RGB (Canon 6D) | Real | 9 | 20,000 | 5472 × 3648 | ✓ | ✓ | 0.5m - 10m | ✓ |

Use Case: Deblurring

| Dataset | Capture Setup | Real / Synthetic | Scenes | Images | Resolution | Focal Var. | Aperture Var. | Depth Range | Calibration Set |
|---|---|---|---|---|---|---|---|---|---|
| DeepLens ('18) | Dual-lens camera + synthetic thin-lens modeling | Real + Synthetic | N/A | 502,462 | 3024 × 4032 | | | N/A | |
| SYNDOF ('19) | Mono RGB (thin-lens model) | Synthetic | N/A | 8,231 | 960-2000 × 436-2000 | | | 0.5m - 80m | |
| BSD-B ('20) | Blurred via uniform disk kernels | Synthetic | 500 | 40,000 | 320-480 × 320-480 | N/A | N/A | N/A | |
| iDFD ('23) | Dual-sensor rig + Kinect depth | Real | N/A | 1,528 | 1050 × 1050 | | | 0m - 10m | |
| QPDD ('25) | Mono RGB (50MP quad-pixel sensor) | Real | 300 | 9,870 | 1080-4096 × 1280-3072 | | | 0.2m - infinity | |
| MODEST (Ours) | Stereo RGB (Canon 6D) | Real | 9 | 20,000 | 5472 × 3648 | ✓ | ✓ | 0.5m - 10m | ✓ |

MODEST vs Recent Bokeh Datasets

| Metric | MODEST (ours) | RealBokeh | EBB! | Aperture | BEDT |
|---|---|---|---|---|---|
| # Samples | 20,000 | 23,000 | 4,694 | 2,942 | 20,000 |
| Apertures | f/22.0 - f/2.8 | f/20.0 - f/2.0 | f/1.8 | f/8.0, f/2.0 | f/2.0, f/1.8 |
| Focal length | 28mm - 70mm | 28mm - 70mm | 85mm | - | - |
| Mono / stereo | Stereo | Mono | Mono | Mono | Mono |
| Systematic focal length capture | ✓ | | | - | - |
| Systematic aperture capture | ✓ | | | | |
| Multi-capture | ✓ | | | | |
| Intrinsic calibration images | ✓ | | | | |
| Extrinsic calibration images | ✓ | | | | |

Dataset Showcase

Stereo camera assembly — two Canon EOS 6D DSLRs on a precision rig

Stereo Camera Assembly

Two Canon EOS 6D full-frame DSLRs mounted on a precision stereo rig capture synchronized 20 MP image pairs. A zoom-lens system enables systematic variation across 10 focal lengths (28–70 mm) and 5 aperture stops (f/2.8–f/22), reproducing the full optics space of professional DSLR photography.
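Given the fixed 5472 × 3648 px resolution and the Canon 6D's full-frame (36 × 24 mm) sensor, the intrinsics expected at each focal length can be approximated with a simple pinhole model. The sketch below is illustrative only (ideal thin lens, principal point at the image center); the dataset's calibration sets provide the actual measured intrinsics.

```python
# Sketch: approximate pinhole intrinsics per focal length, assuming an ideal
# thin lens, a full-frame 36 x 24 mm sensor (Canon EOS 6D), and the dataset's
# 5472 x 3648 px resolution. Calibrated intrinsics will differ slightly.

SENSOR_W_MM, SENSOR_H_MM = 36.0, 24.0
W_PX, H_PX = 5472, 3648

def expected_intrinsics(focal_mm: float):
    """Return an approximate 3x3 camera matrix K as nested lists."""
    fx = focal_mm * W_PX / SENSOR_W_MM   # focal length in pixels (x)
    fy = focal_mm * H_PX / SENSOR_H_MM   # focal length in pixels (y)
    cx, cy = W_PX / 2.0, H_PX / 2.0      # principal point at image center
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]

K28 = expected_intrinsics(28.0)   # fx = 28 * 5472 / 36 = 4256 px
K70 = expected_intrinsics(70.0)   # fx = 70 * 5472 / 36 = 10640 px
```

Because the pixel pitch is identical in x and y (36/5472 = 24/3648 mm), fx and fy coincide under this model.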

ChArUco calibration board used for intrinsics and extrinsics

ChArUco Calibration Pattern

A ChArUco board enables sub-pixel-accurate intrinsic and extrinsic calibration. Besides a global calibration set, each scene and each focal length has a dedicated calibration set — supporting both classical (OpenCV) and learning-based calibration methods for stereo geometry recovery across every optical configuration.

MODEST Dataset Overview

MODEST dataset overview visualization

For 10 scenes with varying scene complexity, lighting, and background, images are captured with two identical camera assemblies at 10 focal lengths (28–70 mm) and 5 apertures (f/2.8–f/22.0) — 2,000 images per scene, 20,000 total. Challenging elements include multi-scale optical illusions, reflective surfaces, transparent glass doors, sharp lighting changes, and ambiguous background depths. Dedicated calibration sets per scene and focal length support classical and learning-based intrinsic and extrinsic calibration methods.
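The per-scene count follows directly from the capture grid; a minimal sanity check of the arithmetic:

```python
# Sketch: image counts implied by the capture grid described above.
focal_lengths = [28, 32, 36, 40, 45, 50, 55, 60, 65, 70]  # mm
apertures = [2.8, 5.0, 9.0, 16.0, 22.0]                   # f-numbers
viewpoints = 20                                           # stereo viewpoints per configuration
cameras = 2                                               # left + right camera assemblies

per_scene = len(focal_lengths) * len(apertures) * viewpoints * cameras
total = per_scene * 10                                    # 10 scenes
# per_scene == 2000, total == 20000
```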

Scene Diversity & Challenging Elements

Collage of challenging scene elements

Scenes are intentionally curated to stress-test vision algorithms: reflective surfaces that mirror the environment, semi-transparent glass doors, intricate fine-grained textures, and high-contrast point lights.

Multi-Scale Depth Illusions

Multi-scale depth illusions showing algorithm failure cases

Scenes include objects at multiple scales that generate ambiguous, competing depth cues — foreground elements that appear far, and distant backgrounds that visually anchor to the near field. These configurations expose systematic failure modes in learned depth estimation and stereo matching pipelines, revealing where model priors break down on real, high-resolution optical data.

Depth of Field Effects

Depth of field effect progression across apertures

Systematic aperture variation from f/2.8 to f/22 produces a wide continuum of depth-of-field effects. Wide-open apertures yield rich foreground/background bokeh; stopped-down apertures maintain near-full scene sharpness — enabling rigorous, controlled evaluation of shallow DoF rendering and defocus deblurring models under identical scene and illumination conditions.
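The thin-lens model makes the aperture dependence explicit: the circle-of-confusion (CoC) diameter on the sensor scales inversely with the f-number. The sketch below uses hypothetical focus and object distances, not values measured from the dataset.

```python
# Sketch of the thin-lens blur model underlying these effects: CoC diameter on
# the sensor for an object at distance obj_mm when a lens of focal length f_mm
# and f-number n is focused at distance focus_mm. Distances are illustrative.

def coc_mm(f_mm: float, n: float, focus_mm: float, obj_mm: float) -> float:
    """Thin-lens circle-of-confusion diameter in mm."""
    return (f_mm * f_mm / n) * abs(obj_mm - focus_mm) / (obj_mm * (focus_mm - f_mm))

# 70 mm lens focused at 2 m, background object at 5 m:
wide = coc_mm(70.0, 2.8, 2000.0, 5000.0)      # f/2.8: large blur circle
stopped = coc_mm(70.0, 22.0, 2000.0, 5000.0)  # f/22: ~8x smaller blur circle
```

Stopping down from f/2.8 to f/22 shrinks the blur circle by the ratio of the f-numbers (22/2.8 ≈ 7.9), which is exactly the continuum of blur strengths the systematic aperture sweep captures.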

Multi-Viewpoint Coverage

5 of 20 viewpoints captured per focal configuration per scene

20 stereo viewpoints are captured per focal configuration per scene. The 5 shown above illustrate spatial diversity, spanning left, middle, and right camera positions with varied subject framing, and providing strong geometric constraints for stereo matching, multi-view consistency analysis, and novel-view synthesis.

Aperture Variation per Focal Length

5 apertures at 60mm focal length

At each of the 10 focal lengths, all 5 aperture stops are captured, from wide open (f/2.8) to fully stopped down (f/22). Shown here at 60 mm, the transition from extreme subject-isolation bokeh to near-infinite depth of field illustrates the rich optical parameter space systematically covered by MODEST.

Experiments

We benchmark state-of-the-art methods on two tasks, shallow depth-of-field rendering and defocus deblurring, across five representative focal lengths (28, 36, 45, 60, 70 mm). Metrics are PSNR ↑, SSIM ↑, LPIPS ↓, and mean-opinion-rank (MOR ↓).
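For reference, a minimal sketch of the PSNR metric used in the tables, assuming 8-bit images (peak value 255); SSIM and LPIPS require their own reference implementations and are omitted here.

```python
# Minimal PSNR sketch for 8-bit images (peak value 255); higher is better.
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak * peak / mse)

a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 10, dtype=np.uint8)  # uniform error of 10 -> MSE = 100
# psnr(a, b) == 10 * log10(255**2 / 100) ≈ 28.13 dB
```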

Depth of Field Rendering

Quantitative evaluation across five focal lengths and overall average, including mean-opinion-rank (MOR) and inference time per image on an A100-40 GB GPU.

DoF Rendering Methods

| Method | PSNR ↑ (fl28 / fl36 / fl45 / fl60 / fl70) | SSIM ↑ (fl28 / fl36 / fl45 / fl60 / fl70) | LPIPS ↓ (fl28 / fl36 / fl45 / fl60 / fl70) | Avg PSNR ↑ | Avg SSIM ↑ | Avg LPIPS ↓ | MOR ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| BokehDiff ('25) | 27.53 / 27.45 / 21.72 / 21.79 / 21.37 | 0.92 / 0.93 / 0.88 / 0.88 / 0.85 | 0.17 / 0.16 / 0.19 / 0.23 / 0.26 | 23.62 | 0.89 | 0.21 | 2.96 | 23.1+2e−6 |
| Dr.Bokeh ('24) | 26.40 / 26.55 / 21.38 / 21.68 / 21.22 | 0.91 / 0.92 / 0.87 / 0.89 / 0.86 | 0.22 / 0.19 / 0.24 / 0.23 / 0.25 | 23.14 | 0.89 | 0.23 | 3.06 | 120.4 |
| BokehMe ('22) | 25.75 / 25.15 / 20.67 / 20.98 / 21.02 | 0.89 / 0.89 / 0.85 / 0.87 / 0.86 | 0.23 / 0.23 / 0.25 / 0.25 / 0.25 | 22.46 | 0.87 | 0.24 | 2.42 | 17.7+12.2 |
| Bokehlicious-2.8 ('25) | 23.45 / 22.91 / 19.44 / 19.74 / 19.99 | 0.78 / 0.79 / 0.75 / 0.79 / 0.80 | 0.46 / 0.45 / 0.46 / 0.43 / 0.37 | 20.92 | 0.78 | 0.43 | 3.81 | 17.8 |
| Bokehlicious-4.5* ('25) | 24.79 / 24.43 / 20.25 / 20.56 / 20.68 | 0.82 / 0.84 / 0.80 / 0.83 / 0.84 | 0.39 / 0.37 / 0.38 / 0.35 / 0.31 | 21.91 | 0.82 | 0.36 | 2.74 | 17.8 |

* Bokehlicious-4.5 trained with aperture ratio 4.5×. MOR = mean-opinion-rank (lower is better). Time measured per image on A100-40 GB.

Qualitative comparison of DoF rendering methods

Fig. Five shallow depth-of-field rendering methods evaluated on three MODEST scenes at focal length 70 mm. Input is the sharp f/22.0 capture; target output is the f/2.8 bokeh image.

Defocus Deblurring

Quantitative evaluation across five focal lengths and overall average. Two motion-deblurring methods are stress-tested on the defocus deblurring task to highlight the domain gap.

| Method | PSNR ↑ (fl28 / fl36 / fl45 / fl60 / fl70) | SSIM ↑ (fl28 / fl36 / fl45 / fl60 / fl70) | LPIPS ↓ (fl28 / fl36 / fl45 / fl60 / fl70) | Avg PSNR ↑ | Avg SSIM ↑ | Avg LPIPS ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| Defocus Deblurring Methods | | | | | | | |
| Restormer ('22) | 27.19 / 28.51 / 26.99 / 26.42 / 25.62 | 0.91 / 0.91 / 0.89 / 0.90 / 0.87 | 0.03 / 0.03 / 0.05 / 0.06 / 0.08 | 26.95 | 0.90 | 0.05 | 6.36 |
| NRKNet ('24) | 27.50 / 28.97 / 27.30 / 26.41 / 25.53 | 0.91 / 0.91 / 0.89 / 0.89 / 0.87 | 0.04 / 0.03 / 0.06 / 0.08 / 0.11 | 26.98 | 0.89 | 0.07 | 0.35 |
| ViTDeblur ('25) | 26.83 / 27.94 / 27.00 / 26.54 / 25.79 | 0.91 / 0.91 / 0.89 / 0.90 / 0.87 | 0.04 / 0.03 / 0.05 / 0.06 / 0.08 | 26.73 | 0.90 | 0.05 | 12.91 |
| Motion Deblurring Methods (stress-tested on defocus) | | | | | | | |
| EVSSM ('25) | 22.64 / 23.61 / 18.76 / 17.03 / 16.55 | 0.81 / 0.85 / 0.76 / 0.72 / 0.70 | 0.10 / 0.08 / 0.13 / 0.19 / 0.24 | 19.34 | 0.76 | 0.16 | |
| FFTformer ('23) | 23.44 / 24.78 / 19.58 / 17.44 / 16.97 | 0.82 / 0.85 / 0.77 / 0.73 / 0.71 | 0.12 / 0.11 / 0.17 / 0.23 / 0.28 | 20.03 | 0.77 | 0.19 | |

Qualitative comparison of defocus deblurring methods

Fig. Qualitative comparison of three defocus deblurring models across three MODEST scenes at focal length 70 mm. Input images are captured at wide aperture (f/2.8); ground truth corresponds to all-in-focus captures at f/22.0.

Download Dataset

Dataset Structure

MODEST/
├── Global_calibration_set/
│   ├── EOS6D_A_Left/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   ├── EOS6D_B_Right/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   └── stereocal_rectified_calibration_<focal_length>/
│
├── Scene<id>/
│   ├── EOS6D_A_<Left|Right>/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   │           ├── F<aperture>/
│   │           └── rectified/
│   │
│   └── EOS6D_B_<Left|Right>/
│       └── fl_<focal_length>/
│           ├── calibration/
│           │   └── rectified/
│           └── inference/
│               ├── F<aperture>/
│               └── rectified/
│
└── ...

Notes

  • <focal_length> ∈ {28mm, 32mm, 36mm, 40mm, 45mm, 50mm, 55mm, 60mm, 65mm, 70mm}
  • <aperture> ∈ {F2.8, F5.0, F9.0, F16.0, F22.0}
  • Scene<id> denotes one of the scenes, each captured under identical optical configurations
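A small sketch of how a capture path in this layout can be parsed into its scene, camera, focal length, and aperture components. The image filename (`IMG_0001.CR2`) is hypothetical; only the directory pattern comes from the structure above.

```python
# Sketch: parsing one inference-image path from the MODEST directory layout.
# The filename "IMG_0001.CR2" is a hypothetical placeholder.
from pathlib import Path

def parse_capture(path: str) -> dict:
    """Extract scene, camera, focal length, and aperture from an inference path."""
    parts = Path(path).parts
    return {
        "scene": parts[0],                # e.g. "Scene3"
        "camera": parts[1],               # e.g. "EOS6D_A_Left"
        "focal_mm": int(parts[2][3:-2]),  # "fl_50mm" -> 50
        "aperture": float(parts[4][1:]),  # "F2.8"    -> 2.8
        "file": parts[-1],
    }

rec = parse_capture("Scene3/EOS6D_A_Left/fl_50mm/inference/F2.8/IMG_0001.CR2")
# rec["focal_mm"] == 50, rec["aperture"] == 2.8
```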

Dataset Statistics

  • Total Images: 18,000
  • Resolution: 5472 × 3648 px
  • Scenes: 9 indoor environments
  • Focal Lengths: 10 (28mm - 70mm)
  • Apertures: 5 (f/2.8 - f/22)
  • Camera: Canon 6D DSLR
  • Depth Range: 0.5m - 10m