Adaptive memory refinement and perception enhancement for exo-to-ego video generation

Li Jianhui, Hu Weipeng, Wang Xingyue, Hoe Jiun Tian, Hu Ping, Jiang Xudong, Tan Yap-Peng

Abstract

The task of synthesizing cross-view videos from an exocentric (third-person) to an egocentric (first-person) perspective, referred to as the E2VG problem, remains highly challenging because of the significant viewpoint differences and limited spatial overlap between the two perspectives. Current approaches often fail to capture the temporal dynamics essential for target-view synthesis and make insufficient use of source-view perceptual features. In this paper, we present a video-based framework, Adaptive Memory Refinement and Perception Enhancement (ARPE), to address the problem. To capture long-horizon dependencies beyond redundant short-term dynamics, we propose a Distant Temporal Dependencies (DTD) module that extracts egocentric-relevant semantics from temporally distant exocentric frames. By leveraging a sliding window, DTD aligns long-range temporal patterns across views and refines exocentric features through egocentric-memory guidance. To enhance the focus of the model on informative content, we propose a Saliency-guided Relevance Weighting (SRW) module that adaptively highlights semantically relevant frames and spatial regions. Specifically, SRW assigns inter-frame attention to distant frames based on their relevance to the target-view reconstruction, and further applies intra-frame weighting to emphasize salient areas within each selected frame. These weights are guided by the similarity between the temporal dynamics of the two views, ensuring spatial-temporal consistency. Recognizing the need for semantic consistency across views, we propose the DINOv2 Perception Enhancement (DPE) module. It leverages DINOv2 features to capture view-invariant object-scene cues, thereby improving cross-view feature coherence. Extensive experiments show that our approach outperforms existing state-of-the-art methods in both quantitative metrics and qualitative assessments.
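The two-level weighting described for SRW (inter-frame attention over distant frames, then intra-frame weighting over spatial regions) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `relevance_weighting`, the use of cosine similarity against a single egocentric query vector, and the softmax normalization are all assumptions chosen only to make the idea concrete.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relevance_weighting(exo_feats, ego_query):
    """Hypothetical two-level relevance weighting (illustration only).

    exo_feats: (T, N, D) features for T exocentric frames, N spatial tokens each.
    ego_query: (D,) vector standing in for an egocentric memory feature.
    Returns inter-frame weights (T,), intra-frame weights (T, N), fused feature (D,).
    """
    eps = 1e-8
    # Cosine similarity between every spatial token and the egocentric query.
    exo_n = exo_feats / (np.linalg.norm(exo_feats, axis=-1, keepdims=True) + eps)
    q_n = ego_query / (np.linalg.norm(ego_query) + eps)
    sim = exo_n @ q_n                                  # (T, N)

    # Intra-frame weights: emphasize the most relevant regions within each frame.
    intra = softmax(sim, axis=1)                       # (T, N)

    # Inter-frame weights: score each frame by its mean token similarity (assumed).
    inter = softmax(sim.mean(axis=1), axis=0)          # (T,)

    # Pool tokens per frame, then pool frames, using the two sets of weights.
    per_frame = (intra[..., None] * exo_feats).sum(axis=1)   # (T, D)
    fused = (inter[:, None] * per_frame).sum(axis=0)         # (D,)
    return inter, intra, fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inter, intra, fused = relevance_weighting(
        rng.normal(size=(4, 6, 8)), rng.normal(size=8)
    )
    print(inter.shape, intra.shape, fused.shape)
```

In this sketch both weight sets are derived from the same similarity map, so a frame whose regions resemble the egocentric query receives more weight both across frames and within the frame. The published method may compute these weights quite differently.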

Publisher: Neurocomputing

Article number: 131917

ISSN (Electronic): 1872-8286

ISSN (Print): 0925-2312

Keywords

  • Adaptive refinement
  • Cross-view video generation
  • Exocentric to egocentric synthesis
  • Perception enhancement

ASJC Scopus subject areas

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence

Publication year

2026
