arXiv:2511.20853

MODEST

Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi, Vinayak A. Belludi, Li-Yun Wang

Training and evaluation of state-of-the-art computer vision algorithms for reliable shallow depth of field (DoF) rendering and defocus deblurring remain constrained by a persistent lack of large-scale, full-frame, high-fidelity, real-image datasets. The optical effects of shallow depth of field and defocus blur depend closely on the camera's optical configuration, set by focal length and aperture, so models must be evaluated rigorously as these parameters change systematically. Further, modern applications such as augmented and virtual reality (AR, VR), smartphones, and industrial robots deploy stereo or multi-camera systems. We present MODEST, the first high-resolution (5472×3648 px, 20 MP), multi-optics depth of field stereo DSLR dataset that methodically varies focal length and aperture for a series of complex, real-world scenes, capturing the optical realism and complexity of professional camera systems. With 20,000 images across 50 distinct optical configurations, focal lengths ranging from 28 to 70 mm, apertures ranging from f/22 to f/2.8, and 20 stereo viewpoints for each of 10 scenes, this high-resolution, full-range optics coverage enables controlled analysis of geometric and optical effects for shallow depth of field rendering and defocus deblurring algorithms. Each scene is intentionally curated to include challenging visual elements, including reflective surfaces, transparent glass walls, fine-grained details, point lights, and multi-scale depth illusions. In addition, we provide image sets for monocular intrinsics calibration and stereo extrinsics calibration for each focal configuration to support both classical and learning-based calibration methods. We evaluate several state-of-the-art algorithms for depth of field rendering and defocus deblurring across focal configurations and demonstrate failure cases and limitations. This work attempts to bridge the gap between training on synthetic, low-resolution data and generalizing at inference to high-resolution images from real camera optics.
We release the dataset, calibration files, and evaluation code to support reproducible benchmarking and further research on real-world optical generalization.


Dataset Comparisons

Data Comparison by Use Case

Datasets compared by use case: shallow depth of field rendering with defocus deblurring, and general deblurring.

Dataset	Capture Setup	Real / Synthetic	Scenes	Images	Resolution	Focal Var.	Aperture Var.	Depth Range	Calibration Set
Use Case: Shallow DoF rendering and defocus deblurring
CUHK ('14)	Natural internet images + binary blur masks	Real	N/A	1,000	352 x 352	N/A	N/A	N/A	✗
RTF ('16)	Mono RGB (Lytro light-field camera)	Real	22	44	360 x 360	✗	✓	N/A	✗
DPDD ('19)	Mono RGB	Real	500	2,000	1680 x 1120	✗	✗	0.3m - 10m	✗
LFDOF ('21)	Mono RGB (light-field) + LiDAR	Real	N/A	23,978	375 x 540	✗	✗	0.3m - 10m	✗
RealDOF ('21)	Stereo RGB (beam-splitter setup)	Real	50	100	1536 x 2320	✗	✓	N/A	✗
BLB ('22)	Synthetic Blender renderings	Synthetic	10	1,000	1920 x 1080	✗	✓	0.5m - 10m	✗
VABD ('24)	Mono RGB	Real	535	2,940	1536 x 1024	✗	✓	N/A	✗
RealBokeh ('25)	Mono RGB (Canon EOS R6 II)	Real	300	23,000	6000 x 4000	✓	✓	N/A	✗
MODEST (Ours)	Stereo RGB (Canon 6D)	Real	9	20,000	5472 x 3648	✓	✓	0.5m - 10m	✓
Use Case: Deblurring
DeepLens ('18)	Dual-lens camera + synthetic thin-lens modeling	Real + Synthetic	N/A	502,462	3024 x 4032	✗	✓	N/A	✗
SYNDOF ('19)	Mono RGB (thin-lens model)	Synthetic	N/A	8,231	960-2000 x 436-2000	✓	✓	0.5m - 80m	✗
BSD-B ('20)	Blurred via uniform disk kernels	Synthetic	500	40,000	320-480 x 320-480	N/A	✓	N/A	N/A
iDFD ('23)	Dual-sensor rig + Kinect depth	Real	N/A	1,528	1050 x 1050	✗	✓	0m - 10m	✗
QPDD ('25)	Mono RGB (50MP quad-pixel sensor)	Real	300	9,870	1080-4096 x 1280-3072	✗	✓	0.2m - infinity	✗
MODEST (Ours)	Stereo RGB (Canon 6D)	Real	9	20,000	5472 x 3648	✓	✓	0.5m - 10m	✓

MODEST vs Recent Bokeh Datasets

Metric	MODEST (ours)	RealBokeh	EBB!	Aperture	BEDT
# Samples	20,000	23,000	4,694	2,942	20,000
Apertures	f/22.0 - f/2.8	f/20.0 - f/2.0	f/1.8	f/8.0, f/2.0	f/2.0, f/1.8
Focal length	28mm - 70mm	28mm - 70mm	85mm	-	-
Mono / stereo	Stereo	Mono	Mono	Mono	Mono
Systematic focal length capture	✓	✗	✗	-	-
Systematic aperture capture	✓	✗	✗	✗	✗
Multi-capture	✓	✗	✗	✗	✗
Intrinsic calibration images	✓	✗	✗	✗	✗
Extrinsic calibration images	✓	✗	✗	✗	✗

Dataset Showcase

Stereo Camera Assembly

Two Canon EOS 6D full-frame DSLRs mounted on a precision stereo rig capture synchronized 20 MP image pairs. A zoom-lens system enables systematic variation across 10 focal lengths (28–70 mm) and 5 aperture stops (f/2.8–f/22), reproducing the full optics space of professional DSLR photography.
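For readers who want a starting point for intrinsic calibration, the nominal focal length in millimetres can be converted into pixels under a pinhole model. The sketch below is illustrative only: the 36 mm sensor width is an assumed nominal full-frame value that the text does not state, and the function name is hypothetical. The calibrated value from the provided calibration sets should take precedence.

```python
# Assumption: nominal full-frame sensor width of 36 mm (not stated in the text).
SENSOR_WIDTH_MM = 36.0
# Image width taken from the stated resolution of 5472 x 3648 px.
IMAGE_WIDTH_PX = 5472


def focal_length_px(focal_mm, sensor_width_mm=SENSOR_WIDTH_MM,
                    image_width_px=IMAGE_WIDTH_PX):
    """Convert a nominal focal length in mm to pixels (pinhole model, square pixels)."""
    if focal_mm <= 0 or sensor_width_mm <= 0 or image_width_px <= 0:
        raise ValueError("all arguments must be positive")
    return focal_mm * image_width_px / sensor_width_mm


if __name__ == "__main__":
    # Print rough initial guesses at the two ends of the zoom range.
    for f_mm in (28, 70):
        print(f_mm, "mm ->", round(focal_length_px(f_mm)), "px")
```

Such a value is only a rough initial guess, since zoom lenses deviate from their nominal focal length and focus distance changes the effective value.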

ChArUco calibration board used for intrinsics and extrinsics

ChArUco Calibration Pattern

A ChArUco board enables sub-pixel-accurate intrinsic and extrinsic calibration. Besides a global calibration set, each scene and each focal length has a dedicated calibration set — supporting both classical (OpenCV) and learning-based calibration methods for stereo geometry recovery across every optical configuration.

MODEST Dataset Overview

For 10 scenes with varying scene complexity, lighting, and background, images are captured with two identical camera assemblies at 10 focal lengths (28–70 mm) and 5 apertures (f/2.8–f/22.0) — 2,000 images per scene, 20,000 total. Challenging elements include multi-scale optical illusions, reflective surfaces, transparent glass doors, sharp lighting changes, and ambiguous background depths. Dedicated calibration sets per scene and focal length support classical and learning-based intrinsic and extrinsic calibration methods.
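The per-scene image count follows from the capture grid. The short sketch below reproduces the arithmetic, assuming that each stereo viewpoint contributes exactly one image per camera at every focal length and aperture; the constant and function names are illustrative, and the focal length and aperture values are the ones listed in the dataset notes.

```python
# Capture grid as described on this page.
FOCAL_LENGTHS_MM = (28, 32, 36, 40, 45, 50, 55, 60, 65, 70)
APERTURES = (2.8, 5.0, 9.0, 16.0, 22.0)
STEREO_VIEWPOINTS = 20
CAMERAS_PER_VIEWPOINT = 2  # assumption: one image from each camera per viewpoint


def optical_configurations():
    """Number of distinct (focal length, aperture) combinations."""
    return len(FOCAL_LENGTHS_MM) * len(APERTURES)


def images_per_scene():
    """Images captured for one scene across the full optical grid."""
    return optical_configurations() * STEREO_VIEWPOINTS * CAMERAS_PER_VIEWPOINT


def total_images(num_scenes):
    """Images across a given number of scenes."""
    return images_per_scene() * num_scenes


if __name__ == "__main__":
    print(optical_configurations(), images_per_scene(), total_images(10))
```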

Scene Diversity & Challenging Elements

Scenes are intentionally curated to stress-test vision algorithms: reflective surfaces that mirror the environment, semi-transparent glass doors, intricate fine-grained textures, and high-contrast point lights.

Multi-Scale Depth Illusions

Scenes include objects at multiple scales that generate ambiguous, competing depth cues — foreground elements that appear far, and distant backgrounds that visually anchor to the near field. These configurations expose systematic failure modes in learned depth estimation and stereo matching pipelines, revealing where model priors break down on real, high-resolution optical data.

Depth of Field Effects

Depth of field effect progression across apertures

Systematic aperture variation from f/2.8 to f/22 produces a wide continuum of depth-of-field effects. Wide-open apertures yield rich foreground/background bokeh; stopped-down apertures maintain near-full scene sharpness — enabling rigorous, controlled evaluation of shallow DoF rendering and defocus deblurring models under identical scene and illumination conditions.
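To make the aperture dependence concrete, the sketch below evaluates the standard thin-lens blur-circle formula, c = f² |d − s| / (N d (s − f)), where f is the focal length, N the f-number, s the focus distance, and d the object distance. This is an idealized model rather than a description of the actual lenses used, and the focus and object distances in the example are hypothetical values chosen inside the stated 0.5 m to 10 m depth range.

```python
def blur_circle_mm(focal_mm, f_number, focus_dist_mm, object_dist_mm):
    """Thin-lens blur-circle diameter on the sensor, in mm.

    Uses c = f^2 * |d - s| / (N * d * (s - f)); assumes an ideal thin lens
    focused at distance s, with the object at distance d.
    """
    if focus_dist_mm <= focal_mm:
        raise ValueError("focus distance must exceed the focal length")
    if object_dist_mm <= 0 or f_number <= 0:
        raise ValueError("object distance and f-number must be positive")
    aperture_diameter = focal_mm / f_number
    magnification = focal_mm / (focus_dist_mm - focal_mm)
    return (aperture_diameter * magnification
            * abs(object_dist_mm - focus_dist_mm) / object_dist_mm)


if __name__ == "__main__":
    # Hypothetical setup: 60 mm lens focused at 1.5 m, background at 5 m.
    for n in (2.8, 5.0, 9.0, 16.0, 22.0):
        c = blur_circle_mm(60.0, n, 1500.0, 5000.0)
        print(f"f/{n}: blur circle = {c:.3f} mm")
```

In this model the blur-circle diameter scales inversely with the f-number, so moving from f/22 to f/2.8 enlarges the blur by a factor of 22/2.8, roughly 7.9, with everything else held fixed.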

Multi-Viewpoint Coverage

5 of 20 viewpoints captured per focal configuration per scene

20 stereo viewpoints are captured per focal configuration per scene. The 5 shown here illustrate spatial diversity, spanning left, middle, and right camera positions with varied subject framing, and provide strong geometric constraints for stereo matching, multi-view consistency analysis, and novel-view synthesis.

Aperture Variation per Focal Length

At each of the 10 focal lengths, all 5 aperture stops are captured, from wide open (f/2.8) to fully stopped down (f/22). Shown here at 60 mm, the transition from extreme subject-isolation bokeh to near-infinite depth of field illustrates the rich optical parameter space systematically covered by MODEST.

Experiments

We benchmark state-of-the-art methods on two tasks, shallow depth of field rendering and defocus deblurring, across five representative focal lengths (28, 36, 45, 60, 70 mm). Metrics are PSNR ↑, SSIM ↑, LPIPS ↓, and mean-opinion-rank (MOR ↓).
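As a reminder of what the PSNR metric measures, the sketch below computes it from the mean squared error between a reference and an estimate. It is a generic textbook definition over flat sequences of pixel values, assuming an 8-bit peak of 255; the exact evaluation protocol behind the reported numbers (color space, cropping, bit depth) is not specified here and may differ.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [10, 20, 30, 40]
    est = [11, 21, 31, 41]  # every pixel off by one, so MSE = 1
    print(round(psnr(ref, est), 2), "dB")
```

Higher values indicate a closer match, which is why PSNR is marked with ↑ in the tables.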

Depth of Field Rendering

Quantitative evaluation across five focal lengths and overall average, including mean-opinion-rank (MOR) and inference time per image on an A100-40 GB GPU.

Method	PSNR↑ fl28	PSNR↑ fl36	PSNR↑ fl45	PSNR↑ fl60	PSNR↑ fl70	SSIM↑ fl28	SSIM↑ fl36	SSIM↑ fl45	SSIM↑ fl60	SSIM↑ fl70	LPIPS↓ fl28	LPIPS↓ fl36	LPIPS↓ fl45	LPIPS↓ fl60	LPIPS↓ fl70	Avg. PSNR↑	Avg. SSIM↑	Avg. LPIPS↓	MOR↓	Time (s)↓
DoF Rendering Methods
BokehDiff ('25)	27.53	27.45	21.72	21.79	21.37	0.92	0.93	0.88	0.88	0.85	0.17	0.16	0.19	0.23	0.26	23.62	0.89	0.21	2.96	23.1^+2e−6
Dr.Bokeh ('24)	26.40	26.55	21.38	21.68	21.22	0.91	0.92	0.87	0.89	0.86	0.22	0.19	0.24	0.23	0.25	23.14	0.89	0.23	3.06	120.4
BokehMe ('22)	25.75	25.15	20.67	20.98	21.02	0.89	0.89	0.85	0.87	0.86	0.23	0.23	0.25	0.25	0.25	22.46	0.87	0.24	2.42	17.7+12.2
Bokehlicious-2.8 ('25)	23.45	22.91	19.44	19.74	19.99	0.78	0.79	0.75	0.79	0.80	0.46	0.45	0.46	0.43	0.37	20.92	0.78	0.43	3.81	17.8
Bokehlicious-4.5* ('25)	24.79	24.43	20.25	20.56	20.68	0.82	0.84	0.80	0.83	0.84	0.39	0.37	0.38	0.35	0.31	21.91	0.82	0.36	2.74	17.8

* Bokehlicious-4.5 trained with aperture ratio 4.5×. MOR = mean-opinion-rank (lower is better). Time measured per image on A100-40 GB.

Qualitative comparison of DoF rendering methods

Fig. Five shallow depth-of-field rendering methods evaluated on three MODEST scenes at focal length 70 mm. Input is the sharp f/22.0 capture; target output is the f/2.8 bokeh image.

Defocus Deblurring

Quantitative evaluation across five focal lengths and overall average. Two motion-deblurring methods are stress-tested on the defocus deblurring task to highlight the domain gap.

Method	PSNR↑ fl28	PSNR↑ fl36	PSNR↑ fl45	PSNR↑ fl60	PSNR↑ fl70	SSIM↑ fl28	SSIM↑ fl36	SSIM↑ fl45	SSIM↑ fl60	SSIM↑ fl70	LPIPS↓ fl28	LPIPS↓ fl36	LPIPS↓ fl45	LPIPS↓ fl60	LPIPS↓ fl70	Avg. PSNR↑	Avg. SSIM↑	Avg. LPIPS↓	Time (s)↓
Defocus Deblurring Methods
Restormer ('22)	27.19	28.51	26.99	26.42	25.62	0.91	0.91	0.89	0.90	0.87	0.03	0.03	0.05	0.06	0.08	26.95	0.90	0.05	6.36
NRKNet ('24)	27.50	28.97	27.30	26.41	25.53	0.91	0.91	0.89	0.89	0.87	0.04	0.03	0.06	0.08	0.11	26.98	0.89	0.07	0.35
ViTDeblur ('25)	26.83	27.94	27.00	26.54	25.79	0.91	0.91	0.89	0.90	0.87	0.04	0.03	0.05	0.06	0.08	26.73	0.90	0.05	12.91
Motion Deblurring Methods (stress-tested on defocus)
EVSSM ('25)	22.64	23.61	18.76	17.03	16.55	0.81	0.85	0.76	0.72	0.70	0.10	0.08	0.13	0.19	0.24	19.34	0.76	0.16	—
FFTformer ('23)	23.44	24.78	19.58	17.44	16.97	0.82	0.85	0.77	0.73	0.71	0.12	0.11	0.17	0.23	0.28	20.03	0.77	0.19	—

Qualitative comparison of defocus deblurring methods

Fig. Qualitative comparison of three defocus deblurring models across three MODEST scenes at focal length 70 mm. Input images are captured at wide aperture (f/2.8); ground truth corresponds to all-in-focus captures at f/22.0.

Download Dataset

Dataset Structure

MODEST/
├── Global_calibration_set/
│   ├── EOS6D_A_Left/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   ├── EOS6D_B_Right/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   └── stereocal_rectified_calibration_<focal_length>/
│
├── Scene<id>/
│   ├── EOS6D_A_<Left|Right>/
│   │   └── fl_<focal_length>/
│   │       ├── calibration/
│   │       │   └── rectified/
│   │       └── inference/
│   │           ├── F<aperture>/
│   │           └── rectified/
│   │
│   └── EOS6D_B_<Left|Right>/
│       └── fl_<focal_length>/
│           ├── calibration/
│           │   └── rectified/
│           └── inference/
│               ├── F<aperture>/
│               └── rectified/
│
└── ...

Notes

<focal_length> ∈ {28mm, 32mm, 36mm, 40mm, 45mm, 50mm, 55mm, 60mm, 65mm, 70mm}
<aperture> ∈ {2.8, 5.0, 9.0, 16.0, 22.0}, so aperture folders are named F2.8, F5.0, F9.0, F16.0, F22.0
Scene<id> spans multiple scenes captured under identical optical configurations
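The layout above can be navigated with a few path helpers. The sketch below is a hedged illustration, not an official loader: it assumes scene folders are named like Scene1, that the camera folders are EOS6D_A_Left and EOS6D_B_Right (as in the global calibration set), and that aperture folders are named F2.8 through F22.0. The function names are hypothetical, and the names should be adjusted to whatever the downloaded archive actually contains.

```python
from pathlib import Path

FOCAL_LENGTHS = ("28mm", "32mm", "36mm", "40mm", "45mm",
                 "50mm", "55mm", "60mm", "65mm", "70mm")
APERTURE_DIRS = ("F2.8", "F5.0", "F9.0", "F16.0", "F22.0")
# Assumption: camera A is the left view and camera B is the right view.
CAMERAS = ("EOS6D_A_Left", "EOS6D_B_Right")


def aperture_dir(root, scene_id, camera, focal_length, aperture):
    """Directory expected to hold inference images for one optical configuration."""
    if camera not in CAMERAS:
        raise ValueError(f"unknown camera folder: {camera}")
    if focal_length not in FOCAL_LENGTHS:
        raise ValueError(f"unknown focal length: {focal_length}")
    if aperture not in APERTURE_DIRS:
        raise ValueError(f"unknown aperture folder: {aperture}")
    return (Path(root) / f"Scene{scene_id}" / camera
            / f"fl_{focal_length}" / "inference" / aperture)


def deblurring_pair_dirs(root, scene_id, camera, focal_length):
    """(blurred input, sharp target) directories: f/2.8 input, f/22.0 ground truth."""
    return (aperture_dir(root, scene_id, camera, focal_length, "F2.8"),
            aperture_dir(root, scene_id, camera, focal_length, "F22.0"))


if __name__ == "__main__":
    blurred, sharp = deblurring_pair_dirs("MODEST", 1, "EOS6D_A_Left", "70mm")
    print(blurred.as_posix())
    print(sharp.as_posix())
```

For DoF rendering the same pair can be used in the opposite direction, with the f/22.0 folder as input and the f/2.8 folder as target.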

Dataset Statistics

Total Images: 18,000
Resolution: 5472 × 3648 px
Scenes: 9 indoor environments
Focal Lengths: 10 (28mm - 70mm)
Apertures: 5 (f/2.8 - f/22)
Camera: Canon 6D DSLR
Depth Range: 0.5m - 10m
