Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection

Extended version of FSFM (CVPR 2025)
Gaojian Wang1,2, Feng Lin1,2, Tong Wu1,2, Zhisheng Yan3, Kui Ren1,2
1State Key Laboratory of Blockchain and Data Security, Zhejiang University 2Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
3George Mason University

In brief, this extended version develops our framework into a full-stack, versatile solution for face security vision tasks.

Key differences and new contributions of this extension compared to the FSFM (CVPR25) conference version:

  • Scalability: We scale the model from a single ViT-B in the previous FSFM to a family of FS-VFMs {ViT-S, ViT-B, ViT-L}, highlighting the effectiveness of our method across model capacities and demonstrating consistent generalization gains on all face security tasks.
  • Novel FS-Adapter: We introduce FS-Adapter, a lightweight plug-and-play bottleneck module featuring novel real-anchor contrastive learning, which enables efficient transfer of our FS-VFMs to downstream face security tasks while retaining superior generalization.
  • Improved Generalization and Efficiency: Simple fine-tuning of the FS-VFM ViT sets a new generalization baseline for deepfake detection, diffusion-generated face forensics, and face anti-spoofing, while the FS-Adapter enables efficient adaptation to downstream face security tasks.
  • Broader and Deeper Evaluation: We benchmark FS-VFM against a wider range of vision foundation models spanning different pre-training domains, paradigms, and scales; we include the recent Celeb-DF++ dataset; and we update the SOTA task-specialized methods used for comparison.
  • Deeper Analysis: We provide new insights into our method, including model and data scaling, ablations, and quantitative and qualitative analyses.
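
The contribution list above describes the FS-Adapter only at a high level (a lightweight bottleneck module trained with a real-anchor contrastive objective while the ViT backbone stays frozen). The minimal PyTorch sketch below illustrates the general shape of such a design under common assumptions; `FSAdapter`, `real_anchor_contrastive`, the bottleneck width, and the loss formulation are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FSAdapter(nn.Module):
    """Illustrative bottleneck adapter: down-project, GELU, up-project, residual.

    The up-projection is zero-initialized so the adapter starts as an
    identity map and only gradually perturbs the frozen backbone features.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.gelu(self.down(x)))


def real_anchor_contrastive(feats, labels, anchor, tau: float = 0.07):
    """Toy 'real-anchor' objective (an assumption, not the paper's loss):
    cosine similarity to a learnable 'real' anchor is treated as a logit,
    pulling real faces (label 1) toward the anchor and pushing fakes away.
    """
    feats = F.normalize(feats, dim=-1)
    anchor = F.normalize(anchor, dim=-1)
    logits = feats @ anchor / tau          # (B,) similarity logits
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```

In the efficient-tuning setting described in the table captions, only the adapter, the classification head, and (here) the anchor would receive gradients, e.g. by setting `p.requires_grad = False` for every backbone parameter before building the optimizer.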

Cross-Dataset Deepfake Detection

Comparison with existing VFMs

Cross-dataset Deepfake Detection Table

Cross-dataset evaluation of simply fine-tuned VFMs on deepfake detection. All models are fine-tuned on FF++ (c23) and tested on unseen datasets under identical settings. FS-Adapter ET (Efficient Tuning) updates only the FS-Adapter and head, freezing the ViT backbone. Left: frame-level; right: video-level. Best and second-best results are highlighted.

Comparison with task-specific methods

Cross-dataset Deepfake Detection Table

Cross-dataset evaluation on deepfake detection. For a fair comparison, results of SOTA task-specialized methods are cited from their original papers, and results on CDF++ are from its benchmark. Avg.ΔOurs denotes the average AUC improvement of FS-VFM (Ours) over each method across its tested sets. Left: frame-level; right: video-level. Best and second-best results are highlighted.

Cross-Domain Face Anti-Spoofing

Comparison with existing VFMs

Cross-domain Face Anti-Spoofing Table

Cross-domain evaluation of simply fine-tuned VFMs on face anti-spoofing. All models are fine-tuned under identical settings. FS-Adapter ET (Efficient Tuning) updates only the FS-Adapter and head, freezing the ViT backbone. Best and second-best results are highlighted.

Comparison with task-specific methods

Cross-domain Face Anti-Spoofing Table

Cross-domain evaluation on face anti-spoofing. Results of SOTA task-specialized methods are cited from their original papers. Best and second-best results are highlighted.

Unseen Diffusion-Generated Face Forensics

Cross-dataset DiFF Table

Cross-dataset evaluation on the DiFF benchmark. All models are fine-tuned only on the FF++_DeepFakes/c23 subset. Best and second-best results are highlighted.

BibTeX

@article{wang2024fsfm,
  title={FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning},
  author={Wang, Gaojian and Lin, Feng and Wu, Tong and Liu, Zhenguang and Ba, Zhongjie and Ren, Kui},
  journal={arXiv preprint arXiv:2412.12032},
  year={2024}
}

@misc{wang2025scalablefacesecurityvision,
  title={Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection},
  author={Wang, Gaojian and Lin, Feng and Wu, Tong and Yan, Zhisheng and Ren, Kui},
  year={2025},
  eprint={2510.10663},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.10663}
}