StreetNVS: Effective Multi-sensor Conditioning
for Street-view Novel-view Synthesis

Zhengfei Kuang¹ Adam Sun¹ Liyuan Zhu¹ Tong Wu¹ Shengqu Cai¹ Jonathan Tremblay² Iro Armeni¹ Ehsan Adeli¹ Lior Yariv¹ Gordon Wetzstein¹

¹ Stanford University ² NVIDIA

Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers a strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We also develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10–100× denser point clouds. We further demonstrate the ability to synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation.
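To make the notion of a sparse LiDAR reprojection concrete, here is a minimal sketch of projecting 3D points into a target camera to obtain a sparse depth map. It is an illustration under stated assumptions, not the StreetNVS implementation: the function name `project_points`, the pinhole model without distortion, and the camera convention (x right, y down, z forward) are all hypothetical choices made for this example.

```python
import math


def project_points(points_world, world_to_cam, fx, fy, cx, cy, width, height):
    """Project 3D points into a pinhole camera.

    Assumptions (not from the paper): `world_to_cam` is a 3x4 matrix [R | t]
    given as three rows of four floats, and the camera looks along +z.
    Returns a dict {(u, v): depth} that keeps the nearest depth per pixel.
    """
    depth = {}
    for x, y, z in points_world:
        # Rigid transform from world coordinates into the camera frame.
        xc, yc, zc = (
            row[0] * x + row[1] * y + row[2] * z + row[3] for row in world_to_cam
        )
        if zc <= 0:
            continue  # the point lies behind the camera
        # Perspective division followed by the intrinsics.
        u = math.floor(fx * xc / zc + cx)
        v = math.floor(fy * yc / zc + cy)
        if 0 <= u < width and 0 <= v < height:
            # Keep the closest point when several land on the same pixel.
            if (u, v) not in depth or zc < depth[(u, v)]:
                depth[(u, v)] = zc
    return depth
```

Pixels that receive no point stay absent from the result, which is why such a map is accurate where it is defined but incomplete overall.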


§1 · Unseen novel view synthesis

Synthesizing trajectories far from the vehicle path

Pick a scene below to render five camera trajectories that depart from the original ego-trajectory: Spiral (orbital sweep around the source pose), Lane Shift (sideways translation), Rotation (yaw turn), Pull-Back (dolly retreat along the forward axis), and Elevation (BEV-like rise). We additionally demonstrate two extreme trajectories, a wider lane shift and a stronger top-down view, that go well beyond the training distribution. Quality drops slightly under these extreme out-of-distribution viewpoints, but our method still maintains the overall geometric consistency of the scene. Drag the slider on any clip to wipe between the synthesized RGB output (left) and its underlying LiDAR conditioning (right).
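The five trajectory types can be illustrated with a short sketch that generates per-frame pose offsets relative to the source pose. Everything here is an assumption made for illustration: the function name `make_trajectory`, the linear ramp over time, the spiral parameterization, and the ego-frame axes (x forward, y left, z up) are hypothetical and are not taken from the StreetNVS code.

```python
import math


def make_trajectory(kind, num_frames, magnitude):
    """Return per-frame (dx, dy, dz, yaw) offsets relative to the source pose.

    Assumed ego frame: x forward, y left, z up. `magnitude` is in meters for
    translations and in radians for "rotation".
    """
    poses = []
    for i in range(num_frames):
        # Normalized time in [0, 1]; the offset ramps up linearly.
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        if kind == "lane_shift":
            poses.append((0.0, magnitude * t, 0.0, 0.0))
        elif kind == "elevation":
            poses.append((0.0, 0.0, magnitude * t, 0.0))
        elif kind == "pull_back":
            poses.append((-magnitude * t, 0.0, 0.0, 0.0))
        elif kind == "rotation":
            poses.append((0.0, 0.0, 0.0, magnitude * t))
        elif kind == "spiral":
            # One revolution in the lateral-vertical plane with growing radius.
            angle = 2.0 * math.pi * t
            radius = magnitude * t
            poses.append((0.0, radius * math.cos(angle), radius * math.sin(angle), 0.0))
        else:
            raise ValueError(f"unknown trajectory kind: {kind}")
    return poses
```

For example, `make_trajectory("lane_shift", 3, 2.0)` yields lateral offsets of 0, 1, and 2 meters. Each offset would then be composed with the recorded ego pose to obtain the target camera pose for that frame.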

[Interactive viewer: scene selector; panel: Input Video]

§2 · Comparison with baselines

Side-by-side against state-of-the-art

Pick a scene and a baseline below to compare against StreetNVS at the indicated LiDAR-sparsity ratio. The strongest LiDAR-aware comparison is StreetCrafter*.

[Interactive viewer: scene and baseline selectors; panels: LiDAR Input, FreeVS, StreetNVS (Ours), Ground Truth]

§3 · Demo on variable LiDAR sparsity

How much LiDAR is enough?

Pick a scene and sweep the density slider between 0.1% and 100% of the original LiDAR points. Even at 0.1% density, StreetNVS keeps structure and identity intact while StreetCrafter* breaks down; the appearance sharpens gracefully as more LiDAR anchor points are restored.
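One simple way to produce such density levels is to subsample the point cloud at a fixed keep ratio. The sketch below assumes uniform random subsampling, which may differ from how the sparsified inputs in the demo were actually generated; the function name `subsample_points` and the `seed` parameter are hypothetical.

```python
import random


def subsample_points(points, keep_ratio, seed=0):
    """Keep a uniformly random fraction of LiDAR points (assumed strategy).

    `keep_ratio` is a fraction in (0, 1], e.g. 0.001 for 0.1% density.
    The original point order is preserved among the kept points.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not points:
        return []
    rng = random.Random(seed)  # seeded so the subset is reproducible
    # Always keep at least one point so the conditioning is never empty.
    k = max(1, round(len(points) * keep_ratio))
    indices = sorted(rng.sample(range(len(points)), k))
    return [points[i] for i in indices]
```

With a sweep of 10,000 points, a keep ratio of 0.001 leaves 10 points, which gives a sense of how little metric geometry remains at the lowest slider setting.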

[Interactive viewer: scene selector and LiDAR density slider (ticks at 0.001, 0.01, 0.1, 1); panels: LiDAR Input, StreetCrafter*, StreetNVS, Ground Truth]

§4 · Ablation study

Each conditioning signal carries weight

Pick a scene and compare our full model against three reduced variants: w/ Camera Only, w/ Projection Only, and w/o Reference. Each variant degrades a different facet of consistency.

[Interactive viewer: scene selector; panels: LiDAR Input, Ground Truth, StreetNVS Full, Ours w/ Camera Only, Ours w/ Projection Only, Ours w/o Reference]

§5 · Novel-view comparison vs. baseline

More coherent, more geometry-accurate

Under extreme trajectory shifts, the LiDAR-aware baseline (StreetCrafter*) drifts: scene structure warps, surfaces tear, and identity wanders from frame to frame. StreetNVS's multi-sensor conditioning keeps the scene rigid: the same buildings, lanes, and vehicles appear in the right places throughout the clip, yielding sharper appearance and noticeably better cross-frame coherence. Pick any clip below to see the source ego-camera input next to both methods' renderings.

[Interactive viewer: clip selector; panels: Input (ego-camera), StreetCrafter*, StreetNVS (Ours)]