FSFM: A Generalizable Face Security Foundation Model
via Self-Supervised Facial Representation Learning

1State Key Laboratory of Blockchain and Data Security, Zhejiang University 2Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

TL;DR: A self-supervised pre-training framework to learn a transferable facial representation that boosts various face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection.

Abstract

This work asks: given abundant, unlabeled real faces, how can we learn a robust and transferable facial representation that improves the generalization of various face security tasks? We make the first attempt and propose FSFM, a self-supervised pretraining framework that learns fundamental representations of real face images by leveraging the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region Consistency and challenging inter-region Coherency. Furthermore, we devise an ID network that naturally couples with MIM to establish underlying local-to-global Correspondence via tailored self-distillation. These three learning objectives, collectively termed 3C, enable the model to encode both the local features and the global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision Foundation Model for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection.

Overview

The self-supervised pretraining framework for learning fundamental representations of real faces (3C). Guided by the CRFR-P masking strategy, the masked image modeling (MIM) captures intra-region Consistency with \( \mathcal{L}_\mathit{rec}^\mathit{m} \) and enforces inter-region Coherency via \( \mathcal{L}_\mathit{rec}^\mathit{fr} \), while the instance discrimination (ID) collaborates to promote local-to-global Correspondence through \( \mathcal{L}_\mathit{sim} \). After pretraining, the online encoder \( E_\mathit{o} \) (a vanilla ViT) is applied to boost downstream face security tasks.
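The two reconstruction terms above can be sketched as per-patch MSE, as in a standard masked autoencoder: \( \mathcal{L}_\mathit{rec}^\mathit{m} \) averages over all masked patches, while \( \mathcal{L}_\mathit{rec}^\mathit{fr} \) re-weights the fully masked facial region. This is a minimal NumPy illustration of the math, not the authors' implementation; the function name and per-patch MSE target are assumptions.

```python
import numpy as np

def rec_losses(pred: np.ndarray, target: np.ndarray,
               m: np.ndarray, m_fr: np.ndarray) -> tuple[float, float]:
    """Sketch of the two reconstruction terms (assumed per-patch MSE).

    pred, target: (N, D) arrays of predicted / ground-truth patch pixels.
    m:    (N,) boolean mask of all masked patches.
    m_fr: (N,) boolean mask of the fully masked facial region (m_fr is a
          subset of m).
    Returns (L_rec^m, L_rec^fr).
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # MSE per patch
    l_rec_m = float(per_patch[m].mean())       # intra-region Consistency
    l_rec_fr = float(per_patch[m_fr].mean())   # inter-region Coherency
    return l_rec_m, l_rec_fr
```

In actual training the total objective would combine these with \( \mathcal{L}_\mathit{sim} \) under some weighting; the weights are not specified here.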

Given an input image \( I \), the CRFR-P strategy generates a facial region mask \( M_\mathit{fr} \) and an image mask \( M \). The MIM network, a masked autoencoder, reconstructs the masked face \( I_\mathit{m} \) from the visible patches \( x_\mathit{v} \) (masked by \( M \)), with an emphasis on the fully masked region \( I_\mathit{m}^\mathit{fr} \) (specified by \( M_\mathit{fr} \)). The ID network maximizes the similarity between the representations of the masked online view and the unchanged target view of the same sample, projected onto a disentangled space structured by Siamese representation decoders.


Method

An Empirical Study of Facial Masking Strategies in MIM

🤗 Masking Strategies Demo

We delve into the impact of different facial masking strategies on MIM pretraining, including quantitative and qualitative analyses of attention differences. The proposed CRFR-P masking effectively directs attention to critical facial regions with appropriate range and diversity for both intra-region consistency and inter-region coherency, enabling the pretrained facial model to avoid trivial solutions (shortcuts) and capture the intrinsic properties of real faces.

The ID Network for Local-to-Global Self-Distillation

We integrate different ID paradigms with MIM into FSFM and suggest (c) local-to-global correspondence via elaborate self-distillation: the CRFR-P masked online view introduces spatial variance, the full (unmasked) target view retains complete semantics, and Siamese representation decoders form a disentangled space. In light of this, FSFM structures the encoded space with semantically complete facial representations, which endows the encoder with strong facial discriminability.


Results

Generalization performance on downstream face security tasks: cross-dataset deepfake detection (top), cross-domain face anti-spoofing (bottom left), and unseen diffusion facial forgery detection (bottom right).

Visualizations of Reconstruction (left) and CAM (right).

BibTeX

insert here
}