Inversion-Guided Defense: Detecting Model Stealing Attacks by Output Inverting

Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Information Forensics and Security, 2024, 19, pp. 4130-4145
Issue Date:
2024-01-01
File:
1718503.pdf (Published version, Adobe PDF, 1.17 MB)
Abstract:
Model stealing attacks create unauthorized copies of machine learning models that replicate the functionality of the original model. Such attacks raise significant concerns about the intellectual property of machine learning models. However, current defense mechanisms against these attacks tend to exhibit drawbacks, notably in terms of utility and robustness. For example, watermarking-based defenses require victim models to be retrained to embed watermarks, which can degrade performance on the main task. Other defenses, especially fingerprinting-based methods, often rely on specific samples such as adversarial examples to verify ownership of the target model. These approaches can be less robust against adaptive attacks, such as model stealing combined with adversarial training, and it remains unclear whether normal examples, as opposed to adversarial ones, can effectively reflect the characteristics of stolen models. To tackle these challenges, we propose a novel method that leverages a neural network as a decoder to invert the suspicious model's outputs. Inspired by model inversion attacks, we argue that this decoding process reveals hidden patterns inherent in the original outputs of the suspicious model. From these decoding results, we compute specific metrics to determine the legitimacy of suspicious models. We validate the efficacy of our defense against diverse model stealing attacks, specifically for classification tasks based on deep neural networks.
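To make the described workflow concrete, the sketch below illustrates one possible reading of the approach: a small decoder network is trained to invert the victim classifier's output probabilities back into input-like reconstructions, and the resulting reconstruction error when decoding a suspect model's outputs serves as a detection metric. This is an assumption-based illustration, not the paper's implementation; the names `OutputDecoder`, `train_decoder`, `inversion_score`, `victim`, `suspect`, and `loader`, as well as the architecture and the MSE-based metric, are hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical decoder: maps a classifier's K-dim output probabilities back to a
# flattened input reconstruction (e.g., 28*28 for MNIST-like data). Layer sizes
# are illustrative only.
class OutputDecoder(nn.Module):
    def __init__(self, num_classes: int = 10, input_dim: int = 28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid(),
        )

    def forward(self, probs: torch.Tensor) -> torch.Tensor:
        return self.net(probs)


def train_decoder(victim: nn.Module, decoder: OutputDecoder,
                  loader, epochs: int = 10, lr: float = 1e-3) -> None:
    """Fit the decoder to invert the victim model's outputs on normal samples."""
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    victim.eval()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                probs = F.softmax(victim(x), dim=1)  # victim outputs only
            recon = decoder(probs)
            loss = F.mse_loss(recon, x.flatten(start_dim=1))
            opt.zero_grad()
            loss.backward()
            opt.step()


@torch.no_grad()
def inversion_score(suspect: nn.Module, decoder: OutputDecoder, loader) -> float:
    """Mean reconstruction error when decoding the suspect model's outputs.

    A suspect whose outputs carry the same hidden patterns as the victim's
    tends to decode well (low error); an independently trained model typically
    does not. The decision threshold would be calibrated on known-independent
    models (an assumption of this sketch).
    """
    suspect.eval()
    total, n = 0.0, 0
    for x, _ in loader:
        probs = F.softmax(suspect(x), dim=1)
        recon = decoder(probs)
        total += F.mse_loss(recon, x.flatten(start_dim=1), reduction="sum").item()
        n += x.numel()
    return total / n
```

In this reading, the decoder is trained once by the defender using only the victim model's outputs on normal (non-adversarial) probe samples, so no retraining of the victim model is required; verification then amounts to comparing the suspect's inversion score against a calibrated threshold.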