Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection

Extended version of FSFM (CVPR 2025)
Gaojian Wang1,2, Feng Lin1,2, Tong Wu1,2, Zhisheng Yan3, Kui Ren1,2
1State Key Laboratory of Blockchain and Data Security, Zhejiang University 2Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
3George Mason University

In brief, this extended version develops our framework into a full-stack, versatile solution for face security vision tasks.

Key differences and new contributions of this extension compared to the FSFM (CVPR25) conference version:

  • Scalability: We scale the model from a single ViT-B in the previous FSFM to a family of FS-VFMs {ViT-S, ViT-B, ViT-L}, highlighting the effectiveness of our method across model capacities and demonstrating consistent generalization gains on all face security tasks.
  • Novel FS-Adapter: We introduce FS-Adapter, a lightweight plug-and-play bottleneck module featuring novel real-anchor contrastive learning, which enables efficient transfer of our FS-VFMs to downstream face security tasks while retaining superior generalization.
  • Improved Generalization and Efficiency: Simple fine-tuning of the FS-VFM ViT sets a new generalization baseline for deepfake detection, diffusion-generated face forensics, and face anti-spoofing, while the FS-Adapter enables efficient adaptation to downstream face security tasks.
  • Broader and Deeper Evaluation: We benchmark FS-VFM against a wider range of vision foundation models spanning different pre-training domains, paradigms, and scales; we include the recent Celeb-DF++ dataset; and we update the SOTA task-specialized methods used for comparison.
  • Deeper Analysis: We provide new insights into our method, including model and data scaling, ablations, and quantitative and qualitative analyses.
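
The contribution list above describes the FS-Adapter only at a high level (a lightweight bottleneck module trained with a real-anchor contrastive objective while the ViT backbone stays frozen). The minimal PyTorch sketch below illustrates the general shape of such a design under common assumptions; `FSAdapter`, `real_anchor_contrastive`, the bottleneck width, and the loss formulation are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FSAdapter(nn.Module):
    """Illustrative bottleneck adapter: down-project, GELU, up-project, residual.

    The up-projection is zero-initialized so the adapter starts as an
    identity map and only gradually perturbs the frozen backbone features.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.gelu(self.down(x)))


def real_anchor_contrastive(feats, labels, anchor, tau: float = 0.07):
    """Toy 'real-anchor' objective (an assumption, not the paper's loss):
    cosine similarity to a learnable 'real' anchor is treated as a logit,
    pulling real faces (label 1) toward the anchor and pushing fakes away.
    """
    feats = F.normalize(feats, dim=-1)
    anchor = F.normalize(anchor, dim=-1)
    logits = feats @ anchor / tau          # (B,) similarity logits
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```

In the efficient-tuning setting described in the table captions, only the adapter, the classification head, and (here) the anchor would receive gradients, e.g. by setting `p.requires_grad = False` for every backbone parameter before building the optimizer.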

Cross-Dataset Deepfake Detection

Comparison with existing VFMs

Cross-dataset Deepfake Detection Table

Cross-dataset evaluation of simply fine-tuned VFMs on deepfake detection. All models are fine-tuned on FF++ (c23) and tested on unseen datasets under identical settings. FS-Adapter ET (Efficient Tuning) updates only the FS-Adapter and head, freezing the ViT backbone. Left: frame-level; right: video-level. Best and second-best results are highlighted.

Comparison with task-specific methods

Cross-dataset Deepfake Detection Table

Cross-dataset evaluation on deepfake detection. For a fair comparison, results of SOTA task-specialized methods are cited from their original papers, and results on CDF++ are from its benchmark. Avg.ΔOurs denotes the average AUC improvement of FS-VFM (Ours) over each method across its tested sets. Left: frame-level; right: video-level. Best and second-best results are highlighted.

Cross-Domain Face Anti-Spoofing

Comparison with existing VFMs

Cross-domain Face Anti-Spoofing Table

Cross-domain evaluation of simply fine-tuned VFMs on face anti-spoofing. All models are fine-tuned under identical settings. FS-Adapter ET (Efficient Tuning) updates only the FS-Adapter and head, freezing the ViT backbone. Best and second-best results are highlighted.

Comparison with task-specific methods

Cross-domain Face Anti-Spoofing Table

Cross-domain evaluation on face anti-spoofing. Results of SOTA task-specialized methods are cited from their original papers. Best and second-best results are highlighted.

Unseen Diffusion-Generated Face Forensics

Cross-dataset DiFF Table

Cross-dataset evaluation on the DiFF benchmark. All models are fine-tuned only on the FF++_DeepFakes/c23 subset. Best and second-best results are highlighted.

BibTeX

@article{wang2024fsfm,
  title={FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning},
  author={Wang, Gaojian and Lin, Feng and Wu, Tong and Liu, Zhenguang and Ba, Zhongjie and Ren, Kui},
  journal={arXiv preprint arXiv:2412.12032},
  year={2024}
}

@misc{wang2025scalablefacesecurityvision,
  title={Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection},
  author={Wang, Gaojian and Lin, Feng and Wu, Tong and Yan, Zhisheng and Ren, Kui},
  year={2025},
  eprint={2510.10663},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.10663}
}