StreetNVS conditions a controllable video generation backbone on LiDAR-grounded 3D anchors and relative camera poses, synthesizing novel views from a vehicle's trajectory while preserving identity and structure under extreme extrapolation — including bird's-eye, lane-shifted, and rotated viewpoints.
Pick a scene below to render five camera trajectories that depart from the original ego-trajectory: Spiral (orbital sweep around the source pose), Lane Shift (sideways translation), Rotation (yaw turn), Pull-Back (dolly retreat along the forward axis), and Elevation (BEV-like rise). We additionally demonstrate two extreme trajectories, a wider lane shift and a stronger top-down view, that go well beyond the training distribution. Quality drops somewhat under these extreme out-of-distribution viewpoints, but our method still maintains the overall geometric consistency of the scene. Drag the slider on any clip to wipe between the synthesized RGB output (left) and its underlying LiDAR conditioning (right).
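To make the five trajectory types concrete, here is a minimal sketch of how per-frame camera offsets relative to the source pose could be parameterized. It is an illustration only, not the StreetNVS implementation: the function name `trajectory_offsets`, the axis convention (x forward, y left, z up), the linear ramp over frames, and the `magnitude` parameter are all assumptions, and the spiral is approximated as a widening circle in the lateral/vertical plane.

```python
import math


def trajectory_offsets(kind, num_frames=5, magnitude=1.0):
    """Return per-frame (dx, dy, dz, yaw) offsets relative to the source pose.

    Hypothetical convention: x forward, y left, z up (metres), yaw in radians.
    `magnitude` is the offset reached at the last frame (metres or radians).
    """
    offsets = []
    for i in range(num_frames):
        # Progress along the clip, from 0.0 (source pose) to 1.0 (full offset).
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        if kind == "spiral":
            # Widening circle in the lateral/vertical plane (assumed shape).
            angle = 2.0 * math.pi * t
            radius = magnitude * t
            offsets.append((0.0, radius * math.cos(angle), radius * math.sin(angle), 0.0))
        elif kind == "lane_shift":
            # Sideways translation.
            offsets.append((0.0, magnitude * t, 0.0, 0.0))
        elif kind == "rotation":
            # Yaw turn in place.
            offsets.append((0.0, 0.0, 0.0, magnitude * t))
        elif kind == "pull_back":
            # Retreat along the forward axis.
            offsets.append((-magnitude * t, 0.0, 0.0, 0.0))
        elif kind == "elevation":
            # Rise toward a BEV-like viewpoint.
            offsets.append((0.0, 0.0, magnitude * t, 0.0))
        else:
            raise ValueError(f"unknown trajectory kind: {kind!r}")
    return offsets
```

Under this sketch, the two extreme trajectories would simply correspond to calling `trajectory_offsets("lane_shift", ...)` or `trajectory_offsets("elevation", ...)` with a larger `magnitude`; the actual magnitudes used on this page are not stated here.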
Pick a scene and a baseline below to compare against StreetNVS at the indicated LiDAR-sparsity ratio. The strongest LiDAR-aware baseline is StreetCrafter*.
Pick a scene and sweep the density slider between 0.1% and 100% of the original LiDAR points. Even at 0.1% density, StreetNVS keeps structure and identity intact while StreetCrafter* breaks down, and appearance sharpens gracefully as more anchors are restored.
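For a sense of what a density ratio means in practice, the sketch below thins a point cloud to a given fraction of its original size. It assumes uniform random subsampling with a fixed seed, which may differ from how the sparsified inputs on this page were actually produced; the function name `subsample_points` and its parameters are hypothetical.

```python
import random


def subsample_points(points, ratio, seed=0):
    """Keep roughly `ratio` of the LiDAR points, chosen uniformly at random.

    `points` is a list of (x, y, z) tuples; `ratio` lies in (0, 1].
    Uniform sampling is an assumption made for illustration.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not points:
        return []
    # Always keep at least one point so the conditioning is never empty.
    keep = max(1, round(len(points) * ratio))
    rng = random.Random(seed)
    # Sort the sampled indices so the original point order is preserved.
    indices = sorted(rng.sample(range(len(points)), keep))
    return [points[i] for i in indices]
```

For example, a sweep of 100,000 points at `ratio=0.001` (0.1% density) leaves about 100 anchors, which gives a feel for how little geometry the model has to work with at the sparse end of the slider.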
Pick a scene and compare our full model against three reduced variants: w/ Camera Only, w/ Projection Only, and w/o Reference. Each variant degrades a different facet of consistency.
Under extreme trajectory shifts the LiDAR-aware baseline (StreetCrafter*) drifts: scene structure warps, surfaces tear, and identity wanders frame to frame. StreetNVS's multi-sensor conditioning keeps the scene rigid — the same buildings, lanes, and vehicles appear in the right places throughout the clip — yielding sharper appearance and noticeably better cross-frame coherence. Pick any clip below to see the source ego-camera input next to both methods' renderings.