Unearthing and Exploiting Latent Semantics behind DNS Domains for Deep Network Traffic Analysis

Abstract	Machine learning has been applied to a broad range of network analysis tasks, including device classification, device type identification, and abnormal behavior detection. However, existing solutions often rely on manually engineered features that are tedious to build and fragile. They also face growing challenges as the fraction of encrypted traffic increases. This paper proposes a novel approach that relies on the latent semantics behind DNS names to discover endpoints' properties. First, we introduce the concept of DNS embeddings: dense numeric vector representations of DNS names that capture the semantic relationships among them. Second, we present a novel algorithm, dns2vec, to create DNS embeddings from DNS traffic. We evaluate it on actual network traffic and show that dns2vec can unearth the semantics behind DNS names, e.g., revealing the close similarity between newyorker.com and nytimes.com, sharelatex.com and overleaf.com, or sinovision.net and asiancc.net. Finally, we demonstrate that these DNS embeddings can significantly improve the performance of network traffic analysis tasks. We implement a multilayer perceptron that takes DNS embeddings as input to identify IoT devices, and show that the error rate is reduced by an order of magnitude compared to a traditional Naive Bayes classifier.
Authors	Franck Le (IBM US), Mudhakar Srivatsa (IBM US), Dinesh Verma (IBM US)
Date	Aug-2019
Venue	International Joint Conference on Artificial Intelligence 2019
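
The abstract describes DNS embeddings as numeric vectors in which semantically related names lie close together. The sketch below shows one common way such closeness could be measured, using cosine similarity. It is not the dns2vec algorithm, which the abstract does not specify. The vectors are toy values invented for illustration, and the names `embeddings`, `cosine_similarity`, and `most_similar` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. They are NOT
# values produced by dns2vec or reported in the paper.
embeddings = {
    "nytimes.com": [0.9, 0.1, 0.0],
    "newyorker.com": [0.8, 0.2, 0.1],
    "overleaf.com": [0.1, 0.9, 0.2],
    "sharelatex.com": [0.0, 0.8, 0.3],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name, table):
    """Return the other name in `table` whose vector is closest to `name`."""
    query = table[name]
    candidates = (other for other in table if other != name)
    return max(candidates, key=lambda other: cosine_similarity(query, table[other]))


if __name__ == "__main__":
    for domain in embeddings:
        print(domain, "->", most_similar(domain, embeddings))
```

With real embeddings, the same nearest-neighbor lookup would be one way to surface pairs like those cited in the abstract. Here the pairing is only a consequence of how the toy vectors were chosen.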