StreetNVS: Effective Multi-sensor Conditioning
for Street-view Novel-view Synthesis

StreetNVS conditions a controllable video generation backbone on LiDAR-grounded 3D anchors and relative camera poses, synthesizing novel views from a vehicle's trajectory while preserving identity and structure under extreme extrapolation — including bird's-eye, lane-shifted, and rotated viewpoints.
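The LiDAR-grounded conditioning described above can be pictured as 3D points projected into the target camera. The page does not spell out how StreetNVS builds its conditioning, so the following is only a minimal sketch of a standard pinhole projection. It assumes an OpenCV-style camera frame (x right, y down, z forward); the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative and not taken from the project.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def world_to_camera(p: Vec3, R: Sequence[Sequence[float]], t: Vec3) -> Vec3:
    """Map a world point into the camera frame: p_cam = R^T (p - t).

    R is the 3x3 camera-to-world rotation and t the camera position in
    world coordinates (both are assumed inputs for this sketch).
    """
    d = (p[0] - t[0], p[1] - t[1], p[2] - t[2])
    # Row i of R^T is column i of R.
    return (
        R[0][0] * d[0] + R[1][0] * d[1] + R[2][0] * d[2],
        R[0][1] * d[0] + R[1][1] * d[1] + R[2][1] * d[2],
        R[0][2] * d[0] + R[1][2] * d[1] + R[2][2] * d[2],
    )


def project_points(
    points: Sequence[Vec3],
    R: Sequence[Sequence[float]],
    t: Vec3,
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Tuple[float, float, float]]:
    """Return (u, v, depth) for every point that lands inside the image."""
    pixels = []
    for p in points:
        x, y, z = world_to_camera(p, R, t)
        if z <= 1e-6:  # behind (or on) the camera plane
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            pixels.append((u, v, z))
    return pixels
```

Moving the camera only changes `R` and `t`, so the same point set can be re-projected for every novel viewpoint; points behind the camera or outside the frame are simply dropped.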

§1 · Unseen novel view synthesis

Synthesizing trajectories far from the vehicle path

Pick a scene below to render five camera trajectories that depart from the original ego-trajectory: Spiral (orbital sweep around the source pose), Lane Shift (sideways translation), Rotation (yaw turn), Pull-Back (dolly retreat along the forward axis), and Elevation (BEV-like rise). We additionally demonstrate two extreme trajectories, a wider lane shift and a stronger top-down view, that go well beyond the training distribution. Quality degrades somewhat under these extreme out-of-distribution viewpoints, but our method still maintains the overall geometric consistency of the scene. Drag the slider on any clip to wipe between the synthesized RGB output (left) and its underlying LiDAR conditioning (right).
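As a rough illustration of the five trajectory types, each one can be written as a sequence of offsets from the source pose. This is a hypothetical parameterization, not the one used to render the clips: the vehicle frame (x forward, y left, z up), the `PoseOffset` type, and every default magnitude below are assumptions chosen for the sketch.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PoseOffset:
    """Offset from the source ego pose (x forward, y left, z up; assumed frame)."""
    dx: float = 0.0   # metres along the forward axis
    dy: float = 0.0   # metres to the left
    dz: float = 0.0   # metres upward
    yaw: float = 0.0  # radians, counter-clockwise about the up axis


def _ramp(n: int) -> List[float]:
    """n evenly spaced progress values from 0 to 1."""
    return [i / (n - 1) if n > 1 else 0.0 for i in range(n)]


def lane_shift(n: int, max_shift_m: float = 3.0) -> List[PoseOffset]:
    """Sideways translation that grows linearly over the clip."""
    return [PoseOffset(dy=s * max_shift_m) for s in _ramp(n)]


def rotation(n: int, max_yaw_rad: float = math.radians(30.0)) -> List[PoseOffset]:
    """Yaw turn in place."""
    return [PoseOffset(yaw=s * max_yaw_rad) for s in _ramp(n)]


def pull_back(n: int, distance_m: float = 5.0) -> List[PoseOffset]:
    """Dolly retreat: move backwards along the forward axis."""
    return [PoseOffset(dx=-s * distance_m) for s in _ramp(n)]


def elevation(n: int, height_m: float = 10.0) -> List[PoseOffset]:
    """Rise straight up towards a BEV-like viewpoint."""
    return [PoseOffset(dz=s * height_m) for s in _ramp(n)]


def spiral(n: int, radius_m: float = 2.0, turns: float = 1.0) -> List[PoseOffset]:
    """Circle of growing radius in the lateral/vertical plane."""
    out = []
    for s in _ramp(n):
        angle = 2.0 * math.pi * turns * s
        out.append(PoseOffset(dy=s * radius_m * math.cos(angle),
                              dz=s * radius_m * math.sin(angle)))
    return out


TRAJECTORIES: Dict[str, Callable[[int], List[PoseOffset]]] = {
    "Spiral": spiral,
    "Lane Shift": lane_shift,
    "Rotation": rotation,
    "Pull-Back": pull_back,
    "Elevation": elevation,
}
```

Every trajectory starts at the zero offset, so the first frame coincides with the source view; the extreme variants would correspond to larger values of `max_shift_m` or `height_m`.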

Panel: Input Video

§2 · Comparison with baselines

Side-by-side against state-of-the-art

Pick a scene and a baseline below to compare against StreetNVS at the indicated LiDAR-sparsity ratio. The strongest LiDAR-aware baseline is StreetCrafter*.

Panels: LiDAR Input, FreeVS, StreetNVS (Ours), Ground Truth

§3 · Demo on variable LiDAR sparsity

How much LiDAR is enough?

Pick a scene and sweep the density slider between 0.1% and 100% of the original LiDAR points. Even at 0.1% density, StreetNVS keeps structure and identity intact while StreetCrafter* breaks down; appearance sharpens progressively as more LiDAR anchors are restored.
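To make the density ratio concrete, here is a minimal sketch of thinning a point cloud to a given fraction. The page does not say how the sparse inputs were produced, so uniform random sampling is an assumption, and `subsample_points` is a hypothetical helper rather than part of the project.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_points(points: Sequence[T], density: float, seed: int = 0) -> List[T]:
    """Keep roughly `density` of the points, chosen uniformly at random.

    `density` is a fraction in (0, 1], e.g. 0.001 for 0.1 % of the cloud.
    At least one point is kept for a non-empty cloud, and the original
    ordering of the surviving points is preserved.
    """
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    if not points:
        return []
    keep = max(1, round(len(points) * density))
    rng = random.Random(seed)  # fixed seed makes the thinning reproducible
    kept_indices = sorted(rng.sample(range(len(points)), keep))
    return [points[i] for i in kept_indices]
```

For a cloud of 10,000 points, the slider stops 0.001, 0.01, 0.1 and 1 would keep 10, 100, 1,000 and 10,000 points respectively under this scheme.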

Panels: LiDAR Input, StreetCrafter*, StreetNVS, Ground Truth
Slider: LiDAR density, with stops at 0.001, 0.01, 0.1, and 1

§4 · Ablation study

Each conditioning signal carries weight

Pick a scene and compare our full model against three reduced variants: w/ Camera Only, w/ Projection Only, and w/o Reference. Each variant degrades a different facet of consistency.

Panels: LiDAR Input, Ground Truth, StreetNVS Full, Ours w/ Camera Only, Ours w/ Projection Only, Ours w/o Reference

§5 · Novel-view comparison vs. baseline

More coherent, more geometry-accurate

Under extreme trajectory shifts the LiDAR-aware baseline (StreetCrafter*) drifts: scene structure warps, surfaces tear, and identity wanders frame to frame. StreetNVS's multi-sensor conditioning keeps the scene rigid — the same buildings, lanes, and vehicles appear in the right places throughout the clip — yielding sharper appearance and noticeably better cross-frame coherence. Pick any clip below to see the source ego-camera input next to both methods' renderings.

Panels: Input (ego-camera), StreetCrafter*, StreetNVS (Ours)