Multi-Optics Depth-of-Field Stereo Dataset
Training and evaluation of state-of-the-art computer vision algorithms for reliable shallow depth of field (DoF) rendering and defocus deblurring remain constrained by a persistent lack of large-scale, full-frame, high-fidelity, real-image datasets. The optical effects of shallow depth of field and defocus blur depend closely on the camera's optical configuration, set by focal length and aperture, so models must be evaluated rigorously as these parameters change systematically. Further, modern applications such as augmented and virtual reality (AR, VR), smartphones, and industrial robots deploy stereo or multi-camera systems. We present MODEST, the first high-resolution (5472×3648 px, 20 MP), multi-optics depth of field stereo DSLR dataset that methodically varies focal length and aperture for a series of complex, real-world scenes, capturing the optical realism and complexity of professional camera systems. With 20,000 images across 50 distinct optical configurations (focal lengths from 28 mm to 70 mm and apertures from f/22 to f/2.8) and 20 stereo viewpoints for each of 10 scenes, this high-resolution, full-range optics coverage enables controlled analysis of geometric and optical effects for shallow depth of field rendering and defocus deblurring algorithms. Each scene is intentionally curated to include challenging visual elements, including reflective surfaces, transparent glass walls, fine-grained details, point lights, and multi-scale depth illusions. In addition, we provide image sets for monocular intrinsics calibration and stereo extrinsics calibration for each focal configuration to support both classical and learning-based calibration methods. We evaluate several state-of-the-art algorithms for depth of field rendering and defocus deblurring across focal configurations and demonstrate failure cases and limitations. This work attempts to bridge the gap between training on synthetic, low-resolution data and inference on high-resolution, real camera optics.
We release the dataset, calibration files, and evaluation code to support reproducible benchmarking and further research on real-world optical generalization.
Dataset comparison, grouped by use case: shallow depth of field rendering and defocus deblurring, and deblurring.
| Dataset | Capture Setup | Real / Synthetic | Scenes | Images | Resolution | Focal Var. | Aperture Var. | Depth Range | Calibration Set |
|---|---|---|---|---|---|---|---|---|---|
| Use Case: Shallow DoF rendering and defocus deblurring | | | | | | | | | |
| CUHK ('14) | Natural internet images + binary blur masks | Real | N/A | 1,000 | 352 x 352 | N/A | N/A | N/A | ✗ |
| RTF ('16) | Mono RGB (Lytro light-field camera) | Real | 22 | 44 | 360 x 360 | ✗ | ✓ | N/A | ✗ |
| DPDD ('19) | Mono RGB | Real | 500 | 2,000 | 1680 x 1120 | ✗ | ✗ | 0.3m - 10m | ✗ |
| LFDOF ('21) | Mono RGB (light-field) + LiDAR | Real | N/A | 23,978 | 375 x 540 | ✗ | ✗ | 0.3m - 10m | ✗ |
| RealDOF ('21) | Stereo RGB (beam-splitter setup) | Real | 50 | 100 | 1536 x 2320 | ✗ | ✓ | N/A | ✗ |
| BLB ('22) | Synthetic Blender renderings | Synthetic | 10 | 1,000 | 1920 x 1080 | ✗ | ✓ | 0.5m - 10m | ✗ |
| VABD ('24) | Mono RGB | Real | 535 | 2,940 | 1536 x 1024 | ✗ | ✓ | N/A | ✗ |
| RealBokeh ('25) | Mono RGB (Canon EOS R6 II) | Real | 300 | 23,000 | 6000 x 4000 | ✓ | ✓ | N/A | ✗ |
| MODEST (Ours) | Stereo RGB (Canon 6D) | Real | 10 | 20,000 | 5472 x 3648 | ✓ | ✓ | 0.5m - 10m | ✓ |
| Use Case: Deblurring | | | | | | | | | |
| DeepLens ('18) | Dual-lens camera + synthetic thin-lens modeling | Real + Synthetic | N/A | 502,462 | 3024 x 4032 | ✗ | ✓ | N/A | ✗ |
| SYNDOF ('19) | Mono RGB (thin-lens model) | Synthetic | N/A | 8,231 | 960-2000 x 436-2000 | ✓ | ✓ | 0.5m - 80m | ✗ |
| BSD-B ('20) | Blurred via uniform disk kernels | Synthetic | 500 | 40,000 | 320-480 x 320-480 | N/A | ✓ | N/A | N/A |
| iDFD ('23) | Dual-sensor rig + Kinect depth | Real | N/A | 1,528 | 1050 x 1050 | ✗ | ✓ | 0m - 10m | ✗ |
| QPDD ('25) | Mono RGB (50MP quad-pixel sensor) | Real | 300 | 9,870 | 1080-4096 x 1280-3072 | ✗ | ✓ | 0.2m - infinity | ✗ |
| MODEST (Ours) | Stereo RGB (Canon 6D) | Real | 10 | 20,000 | 5472 x 3648 | ✓ | ✓ | 0.5m - 10m | ✓ |
| Metric | MODEST (ours) | RealBokeh | EBB! | Aperture | BEDT |
|---|---|---|---|---|---|
| # Samples | 20,000 | 23,000 | 4,694 | 2,942 | 20,000 |
| Apertures | f/22.0 - f/2.8 | f/20.0 - f/2.0 | f/1.8 | f/8.0, f/2.0 | f/2.0, f/1.8 |
| Focal length | 28mm - 70mm | 28mm - 70mm | 85mm | - | - |
| Mono / stereo | Stereo | Mono | Mono | Mono | Mono |
| Systematic focal length capture | ✓ | ✗ | ✗ | - | - |
| Systematic aperture capture | ✓ | ✗ | ✗ | ✗ | ✗ |
| Multi-capture | ✓ | ✗ | ✗ | ✗ | ✗ |
| Intrinsic calibration images | ✓ | ✗ | ✗ | ✗ | ✗ |
| Extrinsic calibration images | ✓ | ✗ | ✗ | ✗ | ✗ |
Two Canon EOS 6D full-frame DSLRs mounted on a precision stereo rig capture synchronized 20 MP image pairs. A zoom-lens system enables systematic variation across 10 focal lengths (28–70 mm) and 5 aperture stops (f/2.8–f/22), reproducing the full optics space of professional DSLR photography.
A ChArUco board enables sub-pixel-accurate intrinsic and extrinsic calibration. Besides a global calibration set, each scene and each focal length has a dedicated calibration set — supporting both classical (OpenCV) and learning-based calibration methods for stereo geometry recovery across every optical configuration.
For 10 scenes with varying scene complexity, lighting, and background, images are captured with two identical camera assemblies at 10 focal lengths (28–70 mm) and 5 apertures (f/2.8–f/22.0) — 2,000 images per scene, 20,000 total. Challenging elements include multi-scale optical illusions, reflective surfaces, transparent glass doors, sharp lighting changes, and ambiguous background depths. Dedicated calibration sets per scene and focal length support classical and learning-based intrinsic and extrinsic calibration methods.
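The image counts quoted above can be reproduced with a few lines of Python. This is only a sanity-check sketch: it assumes that every stereo viewpoint contributes one left and one right image at each optical configuration, and the constant names are illustrative rather than part of any released tooling.

```python
from itertools import product

# Values taken from the dataset description; the names are illustrative.
FOCAL_LENGTHS_MM = [28, 32, 36, 40, 45, 50, 55, 60, 65, 70]
F_NUMBERS = [2.8, 5.0, 9.0, 16.0, 22.0]
NUM_SCENES = 10
VIEWPOINTS_PER_CONFIG = 20
CAMERAS = ("left", "right")

# One optical configuration = one (focal length, aperture) combination.
optical_configs = list(product(FOCAL_LENGTHS_MM, F_NUMBERS))

# Assumption: each viewpoint yields one image per camera per configuration.
images_per_scene = len(optical_configs) * VIEWPOINTS_PER_CONFIG * len(CAMERAS)
total_images = images_per_scene * NUM_SCENES

print(len(optical_configs), images_per_scene, total_images)
```

Under that assumption the script reports 50 configurations, 2,000 images per scene, and 20,000 images overall, matching the figures stated in the text.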
Scenes are intentionally curated to stress-test vision algorithms: reflective surfaces that mirror the environment, semi-transparent glass doors, intricate fine-grained textures, and high-contrast point lights.
Scenes include objects at multiple scales that generate ambiguous, competing depth cues — foreground elements that appear far, and distant backgrounds that visually anchor to the near field. These configurations expose systematic failure modes in learned depth estimation and stereo matching pipelines, revealing where model priors break down on real, high-resolution optical data.
Systematic aperture variation from f/2.8 to f/22 produces a wide continuum of depth-of-field effects. Wide-open apertures yield rich foreground/background bokeh; stopped-down apertures maintain near-full scene sharpness — enabling rigorous, controlled evaluation of shallow DoF rendering and defocus deblurring models under identical scene and illumination conditions.
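To get a rough feel for how strongly the aperture sweep changes the in-focus range, the standard thin-lens depth-of-field approximation can be evaluated for the dataset's f-numbers. The sketch below is not part of the MODEST tooling. It assumes a circle of confusion of 0.03 mm (a common full-frame convention) and a hypothetical subject distance of 2 m, and the function name `depth_of_field_mm` is made up for illustration.

```python
import math


def depth_of_field_mm(focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Return (near_limit, far_limit) in mm under the thin-lens approximation.

    coc_mm is the circle of confusion; 0.03 mm is an assumed full-frame value.
    """
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = subject_mm * (hyperfocal - focal_mm) / (hyperfocal + subject_mm - 2 * focal_mm)
    if subject_mm >= hyperfocal:
        # Focusing at or beyond the hyperfocal distance keeps everything
        # out to infinity acceptably sharp.
        far = math.inf
    else:
        far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return near, far


# Sweep the five aperture stops at 70 mm for a subject 2 m away (assumed).
for f_number in (2.8, 5.0, 9.0, 16.0, 22.0):
    near, far = depth_of_field_mm(70, f_number, subject_mm=2000)
    print(f"70 mm, f/{f_number}: sharp from {near:.0f} mm to {far:.0f} mm")
```

With these assumed values, the sharp zone at f/2.8 is a thin slab around the subject, while at f/22 it is several times deeper. At 28 mm and f/22 the far limit becomes infinite, which is consistent with the near-full scene sharpness described above.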
20 stereo viewpoints are captured per focal configuration per scene. The 5 shown above illustrate spatial diversity, spanning a range of left, middle, and right camera positions with varied subject framing and providing strong geometric constraints for stereo matching, multi-view consistency analysis, and novel-view synthesis.
At each of the 10 focal lengths, all 5 aperture stops are captured, from wide open (f/2.8) to fully stopped down (f/22). Shown here at 60 mm, the transition from extreme subject-isolation bokeh to near-infinite depth of field illustrates the rich optical parameter space systematically covered by MODEST.
We benchmark state-of-the-art methods on two tasks — shallow depth of field rendering and defocus deblurring — across five representative focal lengths (28, 36, 45, 60, 70 mm). Metrics are PSNR ↑, SSIM ↑, LPIPS ↓, and mean-opinion-rank (MOR ↓). Best values per column are highlighted in green.
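Of the metrics listed, PSNR is simple enough to sketch directly. The snippet below is a minimal pure-Python illustration of the standard definition. It is not the released evaluation code, whose exact protocol (color space, data range, cropping) is not specified here. The function name `psnr` and the toy 2×2 images are illustrative assumptions. SSIM and LPIPS are normally computed with dedicated libraries and are not reproduced.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equally sized images given as nested lists.

    max_value is the assumed peak pixel value (255 for 8-bit images).
    """
    ref_flat = [p for row in reference for p in row]
    est_flat = [p for row in estimate for p in row]
    if len(ref_flat) != len(est_flat):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(ref_flat, est_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by exactly 1, so the MSE is 1.
target = [[100, 110], [120, 130]]
output = [[101, 109], [121, 129]]
print(round(psnr(target, output), 2))
```

For this toy pair the mean squared error is 1, so the result is 10·log10(255²), about 48.13 dB.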
Quantitative evaluation across five focal lengths and overall average, including mean-opinion-rank (MOR) and inference time per image on an A100-40 GB GPU.
| Method | PSNR↑ fl28 | PSNR↑ fl36 | PSNR↑ fl45 | PSNR↑ fl60 | PSNR↑ fl70 | SSIM↑ fl28 | SSIM↑ fl36 | SSIM↑ fl45 | SSIM↑ fl60 | SSIM↑ fl70 | LPIPS↓ fl28 | LPIPS↓ fl36 | LPIPS↓ fl45 | LPIPS↓ fl60 | LPIPS↓ fl70 | Avg. PSNR↑ | Avg. SSIM↑ | Avg. LPIPS↓ | MOR↓ | Time (s)↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DoF Rendering Methods | | | | | | | | | | | | | | | | | | | | |
| BokehDiff ('25) | 27.53 | 27.45 | 21.72 | 21.79 | 21.37 | 0.92 | 0.93 | 0.88 | 0.88 | 0.85 | 0.17 | 0.16 | 0.19 | 0.23 | 0.26 | 23.62 | 0.89 | 0.21 | 2.96 | 23.1+2e−6 |
| Dr.Bokeh ('24) | 26.40 | 26.55 | 21.38 | 21.68 | 21.22 | 0.91 | 0.92 | 0.87 | 0.89 | 0.86 | 0.22 | 0.19 | 0.24 | 0.23 | 0.25 | 23.14 | 0.89 | 0.23 | 3.06 | 120.4 |
| BokehMe ('22) | 25.75 | 25.15 | 20.67 | 20.98 | 21.02 | 0.89 | 0.89 | 0.85 | 0.87 | 0.86 | 0.23 | 0.23 | 0.25 | 0.25 | 0.25 | 22.46 | 0.87 | 0.24 | 2.42 | 17.7+12.2 |
| Bokehlicious-2.8 ('25) | 23.45 | 22.91 | 19.44 | 19.74 | 19.99 | 0.78 | 0.79 | 0.75 | 0.79 | 0.80 | 0.46 | 0.45 | 0.46 | 0.43 | 0.37 | 20.92 | 0.78 | 0.43 | 3.81 | 17.8 |
| Bokehlicious-4.5* ('25) | 24.79 | 24.43 | 20.25 | 20.56 | 20.68 | 0.82 | 0.84 | 0.80 | 0.83 | 0.84 | 0.39 | 0.37 | 0.38 | 0.35 | 0.31 | 21.91 | 0.82 | 0.36 | 2.74 | 17.8 |
* Bokehlicious-4.5 trained with aperture ratio 4.5×. MOR = mean-opinion-rank (lower is better). Time measured per image on A100-40 GB.
Fig. Five shallow depth-of-field rendering methods evaluated on three MODEST scenes at focal length 70 mm. Input is the sharp f/22.0 capture; target output is the f/2.8 bokeh image.
Quantitative evaluation across five focal lengths and overall average. Two motion-deblurring methods are stress-tested on the defocus deblurring task to highlight the domain gap.
| Method | PSNR↑ fl28 | PSNR↑ fl36 | PSNR↑ fl45 | PSNR↑ fl60 | PSNR↑ fl70 | SSIM↑ fl28 | SSIM↑ fl36 | SSIM↑ fl45 | SSIM↑ fl60 | SSIM↑ fl70 | LPIPS↓ fl28 | LPIPS↓ fl36 | LPIPS↓ fl45 | LPIPS↓ fl60 | LPIPS↓ fl70 | Avg. PSNR↑ | Avg. SSIM↑ | Avg. LPIPS↓ | Time (s)↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Defocus Deblurring Methods | | | | | | | | | | | | | | | | | | | |
| Restormer ('22) | 27.19 | 28.51 | 26.99 | 26.42 | 25.62 | 0.91 | 0.91 | 0.89 | 0.90 | 0.87 | 0.03 | 0.03 | 0.05 | 0.06 | 0.08 | 26.95 | 0.90 | 0.05 | 6.36 |
| NRKNet ('24) | 27.50 | 28.97 | 27.30 | 26.41 | 25.53 | 0.91 | 0.91 | 0.89 | 0.89 | 0.87 | 0.04 | 0.03 | 0.06 | 0.08 | 0.11 | 26.98 | 0.89 | 0.07 | 0.35 |
| ViTDeblur ('25) | 26.83 | 27.94 | 27.00 | 26.54 | 25.79 | 0.91 | 0.91 | 0.89 | 0.90 | 0.87 | 0.04 | 0.03 | 0.05 | 0.06 | 0.08 | 26.73 | 0.90 | 0.05 | 12.91 |
| Motion Deblurring Methods (stress-tested on defocus) | | | | | | | | | | | | | | | | | | | |
| EVSSM ('25) | 22.64 | 23.61 | 18.76 | 17.03 | 16.55 | 0.81 | 0.85 | 0.76 | 0.72 | 0.70 | 0.10 | 0.08 | 0.13 | 0.19 | 0.24 | 19.34 | 0.76 | 0.16 | — |
| FFTformer ('23) | 23.44 | 24.78 | 19.58 | 17.44 | 16.97 | 0.82 | 0.85 | 0.77 | 0.73 | 0.71 | 0.12 | 0.11 | 0.17 | 0.23 | 0.28 | 20.03 | 0.77 | 0.19 | — |
Fig. Qualitative comparison of three defocus deblurring models across three MODEST scenes at focal length 70 mm. Input images are captured at wide aperture (f/2.8); ground truth corresponds to all-in-focus captures at f/22.0.
MODEST/
├── Global_calibration_set/
│   ├── EOS6D_A_Left/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   ├── EOS6D_B_Right/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   └── stereocal_rectified_calibration_<focal_length>/
│
├── Scene<id>/
│   ├── EOS6D_A_<Left|Right>/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   │           ├── F<aperture>/
│   │           └── rectified/
│   │
│   └── EOS6D_B_<Left|Right>/
│       └── fl_<focal_length>/
│           ├── calibration/
│           │   └── rectified/
│           └── inference/
│               ├── F<aperture>/
│               └── rectified/
│
└── ...

<focal_length> ∈ {28mm, 32mm, 36mm, 40mm, 45mm, 50mm, 55mm, 60mm, 65mm, 70mm}

<aperture> ∈ {F2.8, F5.0, F9.0, F16.0, F22.0}

Scene<id> spans multiple scenes captured under identical optical configurations
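The layout above lends itself to programmatic path construction. The following sketch shows one way to build the aperture folders for a defocus deblurring pair (f/2.8 input, f/22.0 reference, as in the benchmark). The helper names are hypothetical. The sketch assumes scene folders are named like `Scene1` and aperture folders like `F2.8`, which the listing leaves ambiguous. The camera letter and side must be supplied by the caller because they are given as `<Left|Right>` in the tree.

```python
from pathlib import PurePosixPath

# Values copied from the listing above.
FOCAL_LENGTHS = ["28mm", "32mm", "36mm", "40mm", "45mm",
                 "50mm", "55mm", "60mm", "65mm", "70mm"]
APERTURES = ["F2.8", "F5.0", "F9.0", "F16.0", "F22.0"]


def inference_dir(root, scene_id, camera, side, focal_length, aperture):
    """Build the path of one aperture folder (hypothetical helper).

    Assumes scene folders look like "Scene1" and aperture folders like "F2.8".
    """
    if focal_length not in FOCAL_LENGTHS:
        raise ValueError(f"unknown focal length: {focal_length}")
    if aperture not in APERTURES:
        raise ValueError(f"unknown aperture: {aperture}")
    return (PurePosixPath(root) / f"Scene{scene_id}" / f"EOS6D_{camera}_{side}"
            / f"fl_{focal_length}" / "inference" / aperture)


def deblurring_pair(root, scene_id, camera, side, focal_length):
    """Return (blurred input dir, all-in-focus reference dir)."""
    blurred = inference_dir(root, scene_id, camera, side, focal_length, "F2.8")
    sharp = inference_dir(root, scene_id, camera, side, focal_length, "F22.0")
    return blurred, sharp


blurred_dir, sharp_dir = deblurring_pair("MODEST", 1, "A", "Left", "70mm")
print(blurred_dir)
print(sharp_dir)
```

For shallow DoF rendering the roles are swapped: the f/22.0 folder holds the sharp input and the f/2.8 folder the bokeh target.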