This work asks: given abundant unlabeled real faces, how can we learn a robust and transferable facial representation that improves the generalization of diverse face security tasks? We make the first attempt and propose FSFM, a self-supervised pretraining framework that learns fundamental representations of real face images by leveraging the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region Consistency and challenging inter-region Coherency. Furthermore, we devise an ID network that naturally couples with MIM to establish an underlying local-to-global Correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower the encoding of both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision Foundation Model for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and detection of unseen diffusion-based facial forgeries.
The self-supervised pretraining framework for learning fundamental representations of real faces (3C). Guided by the CRFR-P masking strategy, the masked image modeling (MIM) network captures intra-region Consistency with \( \mathcal{L}_\mathit{rec}^\mathit{m} \) and enforces inter-region Coherency via \( \mathcal{L}_\mathit{rec}^\mathit{fr} \), while the instance discrimination (ID) network collaborates to promote local-to-global Correspondence through \( \mathcal{L}_\mathit{sim} \). After pretraining, the online encoder \( E_\mathit{o} \) (a vanilla ViT) is applied to boost downstream face security tasks.
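The three objectives above are naturally combined as a weighted sum. Below is a minimal NumPy sketch of that combination, not the paper's implementation: the function names, the weights, and the negative-cosine form of the similarity term are illustrative assumptions.

```python
import numpy as np

def masked_recon_loss(pred, target, mask):
    """Per-patch MSE averaged only over masked patches (mask == 1)."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (N,) MSE per patch
    return float((per_patch * mask).sum() / mask.sum())

def sim_loss(z_online, z_target):
    """Negative cosine similarity between online and target representations."""
    z_o = z_online / np.linalg.norm(z_online)
    z_t = z_target / np.linalg.norm(z_target)
    return -float(z_o @ z_t)

def total_3c_loss(pred, target, m, m_fr, z_o, z_t, w_fr=1.0, w_sim=1.0):
    """Hypothetical weighted sum of the three 3C objectives:
    L = L_rec^m + w_fr * L_rec^fr + w_sim * L_sim."""
    l_rec_m = masked_recon_loss(pred, target, m)      # intra-region Consistency
    l_rec_fr = masked_recon_loss(pred, target, m_fr)  # inter-region Coherency
    l_sim = sim_loss(z_o, z_t)                        # local-to-global Correspondence
    return l_rec_m + w_fr * l_rec_fr + w_sim * l_sim
```

With perfect reconstruction and perfectly aligned views, the total reduces to the minimum of the similarity term (\(-1\) for negative cosine similarity).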
Given an input image \( I \), the CRFR-P strategy generates a facial region mask \( M_\mathit{fr} \) and an image mask \( M \). The MIM network, a masked autoencoder, reconstructs the masked face \( I_\mathit{m} \) from the visible patches \( x_\mathit{v} \) (determined by \( M \)), with an emphasis on the fully masked facial region \( I_\mathit{m}^\mathit{fr} \) (specified by \( M_\mathit{fr} \)). The ID network maximizes the representation similarity between the masked online view and the unchanged target view of the same sample, via projection into a disentangled space structured by Siamese representation decoders.
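The mask generation described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the paper's CRFR-P implementation: it assumes a precomputed per-patch facial-region labeling `region_ids`, fully masks one randomly chosen region, and then masks patches outside it until a target ratio is reached.

```python
import numpy as np

def crfrp_mask(region_ids, mask_ratio=0.75, rng=None):
    """Sketch of a CRFR-P-style mask (1 = masked).
    region_ids: (N,) array assigning each patch a facial-region label.
    Returns (M, M_fr): the image mask and the facial-region mask."""
    rng = rng or np.random.default_rng()
    n = len(region_ids)
    target = int(round(mask_ratio * n))

    # 1) Pick one facial region and mask it completely (yields M_fr).
    region = rng.choice(np.unique(region_ids))
    m_fr = (region_ids == region).astype(int)

    # 2) Randomly mask patches outside that region up to the target ratio.
    m = m_fr.copy()
    outside = np.flatnonzero(m == 0)
    extra = max(target - int(m.sum()), 0)
    if extra:
        picks = rng.choice(outside, size=min(extra, len(outside)), replace=False)
        m[picks] = 1
    return m, m_fr
```

Because \( M_\mathit{fr} \subseteq M \), the fully masked region gives the emphasized reconstruction target, while the remaining masked patches spread the prediction task across the rest of the face.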
We delve into the impact of different facial masking strategies on MIM pretraining, including quantitative and qualitative analyses of attention differences. The proposed CRFR-P masking effectively directs attention to critical facial regions with an appropriate range and diversity for both intra-region consistency and inter-region coherency, enabling the pretrained facial model to avoid trivial solutions (shortcuts) and capture the intrinsic properties of real faces.
We integrate different ID paradigms with MIM into FSFM and adopt (c) local-to-global correspondence via elaborate self-distillation: the CRFR-P masked online view introduces spatial variance, the full (unmasked) target view retains complete semantics, and Siamese representation decoders form a disentangled space. In this way, FSFM structures the encoded space with semantically complete facial representations, endowing the encoder with strong facial discriminability.
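In self-distillation schemes of this kind, the target branch is typically an exponential moving average (EMA) of the online branch and receives no gradients. A minimal NumPy sketch of that update, with the momentum value and function name as illustrative assumptions:

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.996):
    """One EMA step of the target branch toward the online branch.
    Only the online branch is trained by gradients; the target branch
    is updated as t <- momentum * t + (1 - momentum) * o."""
    return [momentum * t + (1 - momentum) * o
            for t, o in zip(target_params, online_params)]
```

The high momentum makes the target a slowly moving, stable teacher, which is what lets the masked online view be aligned against a semantically complete reference without representational collapse.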
}