Collaborative Research: SATC: CORE: Small: Towards Label Enrichment and Refinement to Harden Learning-based Security Defenses


Sponsoring Agency
National Science Foundation


From data breaches to ransomware infections, increasingly sophisticated attacks demand more advanced defense mechanisms. Machine learning models, which can identify hidden patterns that cannot be expressed by rules or signatures, have become an attractive solution. Unfortunately, most learning-based systems are trained under a “closed-world” assumption, expecting the testing data distribution to roughly match that of the training data. When a model is deployed in the “open world”, dynamic changes in the behavior of both organic benign players and malicious attackers can easily shift the testing distribution, leading to concept drift and serious model failures. Addressing concept drift requires labeling a large number of new samples for model retraining, a process that is extremely expensive. Unlike labeling images or text (which can be effectively crowdsourced), labeling malware, for example, requires years of security training and practice. The high demand for expertise and experience makes it difficult to scale up such labeling efforts.
In this proposal, we ask one critical question: assuming we will never have representative labels, what can we do to significantly improve the adaptability and resilience of learning-based defenses with extremely limited labeling capacity? While the problem looks challenging, recent progress in self-supervised learning has shown great promise for performing complex learning tasks with limited labels. Self-supervision designs pretext learning tasks that make better use of unlabeled data by obtaining supervision from the data itself. While most existing efforts focus on computer vision and natural language processing tasks, we believe some of the fundamental ideas can significantly benefit the security community in addressing the concept drift problem. Our preliminary analysis has returned promising results so far.
We propose to combine self-supervision with domain-specific insights from malware detection to build new solutions that combat concept drift.

Research Area