<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Motion-Driven Temporal Analysis for Deepfake Detection: A Hybrid Framework Combining Frame Differencing and LSTM-Based Sequence Modeling for Facial Video Sequences</title>
  <journal>Journal of Multimedia Processing and Technologies</journal>
  <author>Yao-Liang Chung</author>
  <volume>17</volume>
  <issue>2</issue>
  <year>2026</year>
  <doi>https://doi.org/10.6025/jmpt/2026/17/2/39-58</doi>
  <url>https://www.dline.info/jmpt/fulltext/v17n2/jmptv17n2_1.pdf</url>
  <abstract>The rapid advancement of generative artificial intelligence has significantly increased the realism and
accessibility of deepfake content, posing critical challenges to digital media authenticity and security. This
study proposes a hybrid deepfake detection framework that integrates motion-driven analysis with temporal
sequence modeling to improve detection robustness. The approach combines frame-differencing techniques
to capture pixel-level motion inconsistencies with Long Short-Term Memory (LSTM) networks to model
temporal dependencies across facial video sequences.
Experiments were conducted on preprocessed facial datasets derived from FaceForensics++ and
DeepFakeDetection, ensuring standardized inputs through face extraction, alignment, and quality filtering.
Motion intensity was computed from frame-to-frame differences, enabling the effective identification of
temporal anomalies via statistical thresholding. Qualitative analysis using temporal montage visualization
demonstrated strong motion coherence in real sequences and detectable inconsistencies in manipulated
content.
The LSTM-based temporal modeling further enhanced the system by capturing motion trends, activity
transitions, and sequence-level patterns. Results indicate that the proposed framework effectively identifies
high-motion anomalies and improves interpretability through temporal smoothing. However, evaluation
using pseudo ground truth revealed limitations in distinguishing low- and moderate-activity classes,
highlighting the constraints of motion-only features.
Overall, the study demonstrates that integrating spatial motion cues with temporal learning significantly
enhances deepfake detection performance. Future work will focus on incorporating spatio-temporal deep
learning models and multimodal features to further improve classification accuracy and robustness.</abstract>
</record>
