<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Motion-Driven Temporal Analysis for Deepfake Detection: A Hybrid Framework Combining Frame Differencing and LSTM-Based Sequence Modeling for Facial Video Sequences</title>
  <journal>Journal of Multimedia Processing and Technologies</journal>
  <author>Yao-Liang Chung</author>
  <volume>17</volume>
  <issue>2</issue>
  <year>2026</year>
  <doi>https://doi.org/10.6025/jmpt/2026/17/2/39-58</doi>
  <url>https://www.dline.info/jmpt/fulltext/v17n2/jmptv17n2_1.pdf</url>
  <abstract>The rapid advancement of generative artificial intelligence has significantly increased the realism and
accessibility of deepfake content, posing critical challenges to digital media authenticity and security. This
study proposes a hybrid deepfake detection framework that integrates motion-driven analysis with temporal
sequence modeling to improve detection robustness. The approach combines frame-differencing techniques
to capture pixel-level motion inconsistencies with Long Short-Term Memory (LSTM) networks to model
temporal dependencies across facial video sequences.
Experiments were conducted on preprocessed facial datasets derived from FaceForensics++ and
DeepFakeDetection, ensuring standardized inputs through face extraction, alignment, and quality filtering.
Motion intensity was computed from frame-to-frame differences, enabling the effective identification of
temporal anomalies via statistical thresholding. Qualitative analysis using temporal montage visualization
demonstrated strong motion coherence in real sequences and detectable inconsistencies in manipulated
content.
The LSTM-based temporal modeling further enhanced the system by capturing motion trends, activity
transitions, and sequence-level patterns. Results indicate that the proposed framework effectively identifies
high-motion anomalies and improves interpretability through temporal smoothing. However, evaluation
using pseudo ground truth revealed limitations in distinguishing low- and moderate-activity classes,
highlighting the constraints of motion-only features.
Overall, the study demonstrates that integrating spatial motion cues with temporal learning significantly
enhances deepfake detection performance. Future work will focus on incorporating spatio-temporal deep
learning models and multimodal features to further improve classification accuracy and robustness.</abstract>
</record>
