Familial Clustering For Weakly-labeled Android Malware Using Hybrid Representation Learning

Zhang, Y; Sui, Y; Pan, S; Zheng, Z; Ning, B; Tsang, I; Zhou, W

Publication Type: Journal Article
Citation: IEEE Transactions on Information Forensics and Security, 2019
Issue Date: 2019-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Y	en_US
dc.contributor.author	Sui, Y https://orcid.org/0000-0002-9510-6574	en_US
dc.contributor.author	Pan, S https://orcid.org/0000-0003-0794-527X	en_US
dc.contributor.author	Zheng, Z	en_US
dc.contributor.author	Ning, B	en_US
dc.contributor.author	Tsang, I https://orcid.org/0000-0001-8095-4637	en_US
dc.contributor.author	Zhou, W	en_US
dc.date.available	2021-01-02T18:04:25Z
dc.date.issued	2019-01-01	en_US
dc.identifier.citation	IEEE Transactions on Information Forensics and Security, 2019	en_US
dc.identifier.issn	1556-6013	en_US
dc.identifier.uri	http://hdl.handle.net/10453/136931
dc.relation	http://purl.org/au-research/grants/arc/DE170101081
dc.relation.ispartof	IEEE Transactions on Information Forensics and Security	en_US
dc.relation.isbasedon	10.1109/TIFS.2019.2947861	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.subject.classification	Strategic, Defence & Security Studies	en_US
dc.title	Familial Clustering For Weakly-labeled Android Malware Using Hybrid Representation Learning	en_US
dc.type	Journal Article
utslib.for	08 Information and Computing Sciences	en_US
utslib.for	09 Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access	*
pubs.publication-status	Published	en_US

Abstract:

Labeling malware or malware clustering is important for identifying new security threats, triaging and building reference datasets. The state-of-the-art Android malware clustering approaches rely heavily on the raw labels from commercial AntiVirus (AV) vendors, which causes misclustering for a substantial number of weakly-labeled malware due to the inconsistent, incomplete and overly generic labels reported by these closed-source AV engines, whose capabilities vary greatly and whose internal mechanisms are opaque (i.e., intermediate detection results are unavailable for clustering). The raw labels are thus often used as the only important source of information for clustering. To address the limitations of the existing approaches, this paper presents ANDRE, a new ANDroid Hybrid REpresentation Learning approach to clustering weakly-labeled Android malware by preserving heterogeneous information from multiple sources (including the results of static code analysis, the meta-information of an app, and the raw labels of the AV vendors) to jointly learn a hybrid representation for accurate clustering. The learned representation is then fed into our outlier-aware clustering to partition the weakly-labeled malware into known and unknown families. The malware whose malicious behaviours are close to those of the existing families on the network are further classified using a three-layer Deep Neural Network (DNN). The unknown malware are clustered using a standard density-based clustering algorithm. We have evaluated our approach using 5,416 ground-truth malware from Drebin and 9,000 malware from VIRUSSHARE (uploaded between Mar. 2017 and Feb. 2018), including 3,324 weakly-labeled malware. The evaluation shows that ANDRE effectively clusters weakly-labeled malware which cannot be clustered by the state-of-the-art approaches, while achieving comparable accuracy with those approaches for clustering ground-truth samples.
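
The two-stage flow described in the abstract (split samples into known and unknown families, then cluster the unknown ones by density) can be illustrated with a small Python sketch. This is not the authors' implementation: the embeddings are assumed to be already learned, a nearest-centroid rule with a hypothetical `radius` threshold stands in for both the outlier-aware partitioning and the three-layer DNN, and the density-based step is a textbook DBSCAN. All function names, family names and parameter values below are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def partition_known_unknown(embeddings, family_centroids, radius):
    """Assign a sample to its nearest family if within `radius`; otherwise mark it unknown.

    Stand-in (assumption) for the paper's outlier-aware partitioning and DNN classifier.
    """
    known, unknown = {}, []
    for idx, vec in enumerate(embeddings):
        family, dist = min(
            ((name, euclidean(vec, c)) for name, c in family_centroids.items()),
            key=lambda pair: pair[1],
        )
        if dist <= radius:
            known[idx] = family
        else:
            unknown.append(idx)
    return known, unknown


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN. Returns one label per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # A point counts as its own neighbour, as in the usual definition.
        return [j for j in range(len(points)) if euclidean(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


# Toy 2-D "embeddings" (made-up numbers, not data from the paper).
embeddings = [
    (0.0, 0.1), (0.1, 0.0),              # near hypothetical FamilyA
    (5.0, 5.0), (5.1, 4.9),              # near hypothetical FamilyB
    (9.0, 9.0), (9.1, 9.0), (9.0, 9.1),  # a dense group far from both families
    (20.0, -3.0),                        # an isolated sample
]
family_centroids = {"FamilyA": (0.0, 0.0), "FamilyB": (5.0, 5.0)}

known, unknown = partition_known_unknown(embeddings, family_centroids, radius=0.5)
unknown_labels = dbscan([embeddings[i] for i in unknown], eps=0.3, min_pts=2)
```

In this toy run the first four samples are assigned to the two hypothetical families, the three tightly grouped samples form one new cluster, and the isolated sample is left as noise. In practice the threshold, `eps` and `min_pts` would need tuning on real embeddings.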

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/136931