DarkQA:
Benchmarking Vision-Language Models
for Visual-Primitive Question Answering
in Low-Light Indoor Scenes

KAIST, POSTECH
DarkQA Teaser
We present DarkQA, a new benchmark that evaluates VLM robustness on visual primitives under low-light conditions in embodied scenarios. DarkQA assesses VLM performance under two distinct conditions: clean, well-lit inputs (L0) and a multi-level ladder of physics-based low-light images (L1–L5). Furthermore, the benchmark examines the effect of applying Low-Light Image Enhancement (LLIE) models as a pre-processing step.

Abstract

Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked.

To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks.

The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods.

Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. By evaluating a wide range of state-of-the-art VLMs and LLIE models on DarkQA, we systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.

Dataset Construction and QA Pair Generation

DarkQA is designed to evaluate VLMs' recognition of core perceptual primitives from a single image-question pair under controlled low-light conditions. We synthesize low-light images from the HM3D dataset and deterministically generate Question-Answer (QA) pairs.

Low-Light Image Synthesis for Benchmark Inputs

Low-light image synthesis pipeline
To generate controlled low-light inputs for our benchmark, we adopt an ISP-inspired unprocessing and noise formulation from prior work. Crucially, we produce paired variants of each original image to disentangle failure sources in VLM-based embodied question answering (EQA): (a) a physics-based branch (top) that unprocesses sRGB to Bayer RAW, injects four noise components in RAW, and then applies EV drop and gamma compression; and (b) a noise-free branch (bottom) that applies the same EV drop in linear RGB without noise injection. This paired design enables separate evaluation of performance degradation due to illumination reduction vs. sensor noise.

We design a physics-based low-light synthesis pipeline. Specifically, across five degradation levels (L1–L5, in increasing severity), we synthesize two paired low-light variants per original image: (i) a noise-free EV-drop variant and (ii) a physics-motivated variant with level-dependent sensor noise injection in the RAW domain, as illustrated above. This design disentangles the respective impacts of illumination degradation and sensor noise on the perceptual performance of VLMs.
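The paired-variant idea can be sketched in a few lines. The snippet below is a minimal, simplified illustration only: it approximates the RAW-domain model with a heteroscedastic Gaussian (shot + read) noise approximation in linear RGB, and the per-level EV drops and noise parameters are assumed values, not the benchmark's actual configuration.

```python
import numpy as np

GAMMA = 2.2  # simple display-gamma approximation of the sRGB curve
# Assumed (illustrative) per-level settings; not the paper's actual values.
EV_DROP = {1: -2.0, 2: -4.0, 3: -6.0, 4: -8.0, 5: -10.0}        # stops of exposure drop
SHOT_GAIN = {1: 1e-4, 2: 2e-4, 3: 4e-4, 4: 8e-4, 5: 1.6e-3}     # signal-dependent variance gain
READ_STD = {1: 5e-4, 2: 1e-3, 3: 2e-3, 4: 4e-3, 5: 8e-3}        # signal-independent noise std

def synthesize_pair(srgb, level, rng=None):
    """Return (noise_free, noisy) low-light variants of an sRGB image in [0, 1]."""
    rng = rng or np.random.default_rng(0)
    linear = np.clip(srgb, 0.0, 1.0) ** GAMMA           # "unprocess": invert display gamma
    dimmed = linear * 2.0 ** EV_DROP[level]             # physics-based EV (exposure) drop

    # Branch (i): noise-free variant, re-rendered with gamma compression only.
    noise_free = dimmed ** (1.0 / GAMMA)

    # Branch (ii): level-dependent sensor noise, modeled here as a Gaussian
    # whose variance combines shot noise (signal-dependent) and read noise.
    var = SHOT_GAIN[level] * dimmed + READ_STD[level] ** 2
    noisy_linear = np.clip(dimmed + rng.normal(0.0, np.sqrt(var)), 0.0, 1.0)
    noisy = noisy_linear ** (1.0 / GAMMA)
    return noise_free, noisy
```

Both branches share the same EV drop, so any accuracy gap between them isolates the effect of sensor noise from that of reduced illumination.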

Below are synthesized low-light image examples across degradation levels L0–L5:

Low-light image synthesis examples
Synthesized low-light image examples across degradation levels L0–L5. The top row shows EV drop only, while the bottom row shows EV drop combined with noise injection. The lower-right insets show 1/4-image crops with pixel intensities amplified for visibility; the numbers (×10, ×20, ×50) indicate the amplification factor.

Dataset Construction

We build the evaluation dataset upon a representative subset of 52 scenes from the HM3D-Sem dataset. For each scene, we record a human-demonstrated navigation trajectory that systematically explores the environment to maximize spatial coverage. To generate the ground-truth QA pairs, we uniformly subsample the trajectory, selecting keyframes at a fixed time interval (e.g., one frame every 2 s), and render their geometric and semantic modalities (e.g., RGB, depth, segmentation). We then use a deterministic procedure to automatically generate QA pairs from the pre-computed per-keyframe statistics. For detailed procedures, please refer to Section 3-B of our paper.
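The keyframe selection and deterministic QA generation described above can be sketched as follows. The names (`Keyframe`, `closest_object_qa`), the per-keyframe statistic (mean object depth), and the question template are illustrative assumptions, not the released codebase.

```python
from dataclasses import dataclass

@dataclass
class Keyframe:
    timestamp: float                 # seconds along the recorded trajectory
    object_depths: dict[str, float]  # pre-computed statistic: object class -> mean depth (m)

def select_keyframes(frames, interval=2.0):
    """Uniformly subsample a trajectory: keep one frame per `interval` seconds."""
    keyframes, next_t = [], 0.0
    for f in frames:  # frames assumed sorted by timestamp
        if f.timestamp >= next_t:
            keyframes.append(f)
            next_t = f.timestamp + interval
    return keyframes

def closest_object_qa(kf):
    """Deterministically build a closest-object question from keyframe statistics."""
    answer = min(kf.object_depths, key=kf.object_depths.get)
    return {"question": "Which object is closest to the camera?", "answer": answer}
```

Because the answer is computed directly from rendered ground-truth geometry, every QA pair is verifiable without any human annotation.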

Question Family and Dataset Statistics

Question family of our DarkQA
Five DarkQA question categories with examples. DarkQA comprises questions on room-type recognition, room-affordance checking, object recognition, object attributes, and closest-object recognition.
Dataset statistics
Dataset statistics, including semantic-class coverage, room-category distribution, and question-category distribution.

Experiments

Evaluation summary on our DarkQA
Degradation level indicates the severity of low-light corruption: L0 corresponds to the original (well-lit) input, and higher levels (L1 → L5) denote progressively darker (lower-illumination) inputs. We evaluate a range of open-source VLMs (LLaVA, InternVL, and Qwen-VL series, 7B–32B). The shaded regions in (a) and (b) denote the minimum–maximum accuracy across models at each degradation level. (a) Impact of noise injection. (b) Impact of LLIE pre-processing. (c) Model-wise comparison. (d) Image samples enhanced with an LLIE model. We include GPT-4 as a blind-LLM baseline (evaluated without vision; gray dashed line) and GPT-4o [16] as an upper-bound reference (black line).

Question-wise evaluation result on our DarkQA
We plot VLM accuracy across question types under increasing low-light degradation, where darker lines indicate more severe degradation and the gray dashed line denotes the GPT-4 blind-LLM baseline. We observe significant drops in "Room Type Recognition" and "Object Attribute – Color," where VLM performance falls below the blind-LLM baseline.

Full evaluation result table on our DarkQA

BibTeX

@article{park2025darkqa,
  author  = {Park, Yohan and Ha, Hyunwoo and Jo, Wonjun and Oh, Tae-Hyun},
  title   = {DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes},
  journal = {arXiv preprint arXiv:2512.24985},
  year    = {2025},
}