Synthesizing cross-view videos from an exocentric (third-person) to an egocentric (first-person) perspective, referred to as the E2VG problem, remains highly challenging because of the large viewpoint differences and limited spatial overlap between the two perspectives. Current approaches often fail to capture the temporal dynamics essential for target-view synthesis and underuse source-view perceptual features. In this paper, we present a video-based framework, Adaptive Memory Refinement and Perception Enhancement (ARPE), to address the problem. To capture long-horizon dependencies beyond redundant short-term dynamics, we propose a Distant Temporal Dependencies (DTD) module that extracts egocentric-relevant semantics from temporally distant exocentric frames. Using a sliding window, DTD aligns long-range temporal patterns across views and refines exocentric features through egocentric-memory guidance. To focus the model on informative content, we propose a Saliency-guided Relevance Weighting (SRW) module that adaptively highlights semantically relevant frames and spatial regions. Specifically, SRW assigns inter-frame attention to distant frames according to their relevance to target-view reconstruction, and then applies intra-frame weighting to emphasize salient areas within each selected frame. These weights are guided by the similarity between the temporal dynamics of the two views, which promotes spatiotemporal consistency. To maintain semantic consistency across views, we propose the DINOv2 Perception Enhancement (DPE) module, which leverages DINOv2 features to capture view-invariant object-scene cues and thereby improves cross-view feature coherence. Extensive experiments show that our approach outperforms existing state-of-the-art methods in both quantitative metrics and qualitative assessments.
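To make the two-level weighting described for SRW concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how the weights are computed, so cosine similarity, the temperature-scaled softmax, the mean-pooled frame descriptor, and all names (`relevance_weighting`, `exo_frames`, `ego_query`) are assumptions introduced here. Each exocentric frame is represented as a list of region feature vectors, and a single hypothetical query vector stands in for the egocentric temporal dynamics.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def relevance_weighting(exo_frames, ego_query, temperature=0.1):
    """Assumed two-level weighting: frames first, then regions within each frame.

    exo_frames: list of frames, each a list of region feature vectors.
    ego_query:  hypothetical vector summarizing the egocentric dynamics.
    Returns (frame_weights, region_weights, fused_feature).
    """
    # Inter-frame weights: similarity of each frame's mean descriptor to the query.
    frame_scores = [cosine(mean_vector(f), ego_query) for f in exo_frames]
    frame_w = softmax(frame_scores, temperature)

    # Intra-frame weights: similarity of each region to the query, per frame.
    region_w = [
        softmax([cosine(r, ego_query) for r in f], temperature)
        for f in exo_frames
    ]

    # Fuse: weighted sum over regions, then over frames.
    fused = [0.0] * len(ego_query)
    for frame, fw, rws in zip(exo_frames, frame_w, region_w):
        for region, rw in zip(frame, rws):
            for d, value in enumerate(region):
                fused[d] += fw * rw * value
    return frame_w, region_w, fused


# Toy input: frame 0 is aligned with the query, frame 1 is mostly orthogonal.
exo_frames = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
ego_query = [1.0, 0.0]
frame_w, region_w, fused = relevance_weighting(exo_frames, ego_query)
```

In this toy setting the frame aligned with the query receives the larger inter-frame weight, and within each frame the region closer to the query is emphasized. A learned model would presumably replace the fixed cosine scores with trained projections, which the abstract does not detail.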