You Are What You Broadcast: Identification of Mobile and IoT Devices from (Public) WiFi

Lingjing Yu, Institute of Information Engineering, Chinese Academy of Sciences; School of Cybersecurity, University of the Chinese Academy of Sciences; Bo Luo,

The University of Kansas; Jun Ma, Tsinghua University; Zhaoyu Zhou and Qingyun Liu, Institute of Information Engineering, Chinese Academy of Sciences



This paper is included in the Proceedings of the 29th USENIX Security Symposium.

August 12-14, 2020

978-1-939133-17-5

Open access to the Proceedings of the 29th USENIX Security Symposium is sponsored by USENIX.

Lingjing Yu, Bo Luo*, Jun Ma, Zhaoyu Zhou, Qingyun Liu

National Engineering Lab for Information Security Technologies, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

* EECS/ITTC, The University of Kansas, Lawrence, KS, USA

Tsinghua University, Beijing, China; Pi2star Technology, Beijing, China

yulingjing@iie., bluo@ku.edu, majun_ee@tsinghua., {zhouzhaoyu,liuqingyun}@iie.

Abstract

With the rapid growth of mobile devices and WiFi hotspots, security risks arise. In practice, it is critical for administrators of corporate and public wireless networks to identify the type and/or model of devices connected to the network, in order to set access/firewall rules, to check for known vulnerabilities, or to configure IDS accordingly. Mobile devices are not obligated to report their detailed identities when they join a (public) wireless network, while adversaries could easily forge device attributes. In the literature, efforts have been made to utilize features from network traffic for device identification. In this paper, we present OWL, a novel device identification mechanism for both network administrators and normal users. We first extract network traffic features from passively received broadcast and multicast (BC/MC) packets. Embedding representations are learned to model features into six independent and complementary views. We then present a new multi-view wide and deep learning (MvWDL) framework that is optimized on both generalization performance and label-view interaction performance. Meanwhile, a malicious device detection mechanism is designed to assess the inconsistencies across views in the multi-view classifier to identify anomalies. Finally, we demonstrate OWL's performance through experiments, case studies, and qualitative analysis.

1 Introduction

Over the past decade, we have observed steady growth in the number and types of portable devices. WiFi and cellular networks remain the two major options for mobile devices to connect to the Internet. Although cellular networks have improved in speed and coverage and have become cheaper in recent years, WiFi still has the edge in lower cost, better support from devices, and fewer capacity limits. Cisco predicts that the role and coverage of WiFi will continue to expand, and WiFi traffic will account for 50% of total IP traffic by 2022. Meanwhile, the number of public WiFi hotspots will grow 4-fold globally, from 124 million (2017) to 549 million (2022) in a five-year span [11]. With the significant growth of public WiFi support and usage, security and privacy concerns naturally arise.

L. Yu, Z. Zhou, and Q. Liu were supported in part by the Youth Innovation Promotion Association of the Chinese Academy of Sciences, and the Key Technical Talents Project of CAS (Y8YY041101); B. Luo was supported in part by NSF-1565570, NSA Science of Security (SoS) Initiative, and the Ripple University Blockchain Research Initiative.

The administrators of corporate and public WiFi services are concerned with malicious devices connecting to their networks, which may potentially harm the platform or other users in the network, e.g., [4, 45]. The security challenges are primarily caused by the diversity of devices, potential access to critical/core services, lack of proper security management by their owners, and limited auditing capability. On the other hand, users of public WiFi also express concerns about the security of their devices, data, and personal information. However, they do not always exercise proper privacy protection while connecting to unknown networks [5, 9, 30].

For system administrators, whenever a new mobile device connects to the network, it is critical to identify its manufacturer, type, and model, so that proper security precautions can be taken, e.g., configuring firewall rules accordingly, verifying that known vulnerabilities are patched, or informing the IDS. In practice, identifying the type of mobile/IoT devices is of particular interest, since devices of the same or similar types are often managed under similar access control and firewall policies. For instance, when employees connect smart tea kettles or coffee makers to the network, the corporate security policy may place them in the same group that is barred from accessing any internal resource, while smartphones are expected to be governed by completely different policies. Meanwhile, the manufacturer¹ attribute also provides important information for device management. Products from the same manufacturer tend to share the design and implementation of hardware and software components. As a result, they often have similar vulnerabilities and are patched simultaneously. For example, the firmware vulnerability reported in CVE-2006-6292 affects Apple's Mac mini, MacBook, and MacBook Pro products.

Meanwhile, regular users also need to discover potentially harmful devices, such as hidden cameras or a virtual machine with a spoofed identity [10, 60], when they connect to WiFi hotspots. While active reconnaissance poses the risk of being detected and denied, users have the option of passive reconnaissance, where they receive and examine broadcast/multicast (BC/MC) messages to identify other devices in the same network and look for potential threats.

¹ In the rest of the paper, we use manufacturer and make interchangeably.

Efficient and accurate identification of mobile devices is challenging, especially when the features are limited and often incomplete. There is no standard protocol to actively query devices for their identities. Even if there were one, devices would not have to provide faithful answers. Existing research on IoT device identification utilizes a small set of network features and has only been tested on approximately 20 to 50 devices in controlled environments, e.g., [43, 64]. With a relatively small feature space, scalability becomes a concern. That is, detection accuracy may drop dramatically with the increasing quantity and diversity of devices in real-world applications.

In this paper, we attempt to answer three questions: (1) When a mobile/IoT device connects to a wireless network, what protocol(s) would broadcast information that may be received by other devices connected to the same WiFi? (2) What information or features contained in the broadcast messages are unique to a device, and how could system administrators or normal users make use of such information to accurately identify the important attributes: manufacturer, type, and model, of the devices? And (3) How can we utilize subtle hints caught during device identification to discover malicious devices?

To answer these questions, we present OWL: overhearing on WiFi for device identification. The key idea is to utilize the unique features in network packets that are introduced by the subtle differences in the implementations of network modules on mobile/IoT devices. OWL examines and utilizes all the features that could be passively collected from broadcast and multicast protocols such as DHCP, DHCPv6, SSDP, mDNS, LLMNR, BROWSER, NBNS, IGMP, etc. Distinct features extracted from related protocols naturally form a view. Multi-view learning is then employed to utilize views constructed from all available protocols for device classification. With fingerprints collected from more than 30,000 mobile/IoT devices, we demonstrate outstanding performance of the proposed mechanism.

Moreover, malicious devices may attempt to forge their identities and hide their presence to avoid being identified or tracked. For instance, in our dataset, we found a virtual machine running on a laptop that claimed to be an open WiFi hotspot. We argue that it is difficult for adversarial devices to completely forge the complex set of features from the entire stack of essential network protocols. We observed that fabricated or forged devices often behave inconsistently in different views, e.g., the fake WiFi hotspot demonstrated features of a real WiFi access point on some views, while showing features of its host laptop on other views. Therefore, we further attempt to discover malicious devices by examining the inconsistency across views in the multi-view classifier.

The technical contributions of this paper are: (1) We propose a multi-view wide and deep learning model to identify mobile/IoT devices using features from BC/MC packets collected through passive reconnaissance over WiFi; (2) Through large-scale experiments, we demonstrate the performance of the proposed mechanism in identifying the manufacturer, type, and model of mobile/IoT devices; and (3) OWL is also able to effectively detect forged or fabricated devices by identifying the abnormal inconsistencies across views.
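The intuition behind cross-view inconsistency can be illustrated with a deliberately simplified sketch. It is not the MvWDL-based detector proposed in this paper: it assumes each view has already produced its own label, and it merely measures how many views disagree with the majority. The view names, the labels, and the 0.3 threshold are hypothetical values chosen for illustration.

```python
from collections import Counter


def view_disagreement(view_labels):
    """Return (majority_label, fraction of views that disagree with it).

    view_labels maps a view name to the label predicted from that view,
    or None if the device produced no traffic for that view.
    """
    votes = Counter(label for label in view_labels.values() if label is not None)
    if not votes:
        return None, 0.0
    majority, count = votes.most_common(1)[0]
    total = sum(votes.values())
    return majority, 1.0 - count / total


def looks_inconsistent(view_labels, threshold=0.3):
    """Flag a device whose views disagree more than the (assumed) threshold."""
    _, disagreement = view_disagreement(view_labels)
    return disagreement >= threshold


# Hypothetical per-view predictions for a fake hotspot hosted on a laptop.
fake_hotspot = {
    "dhcp": "access-point",
    "ssdp": "access-point",
    "mdns": "laptop",
    "llmnr": "laptop",
}
# Hypothetical per-view predictions for a genuine phone.
genuine_phone = {"dhcp": "phone", "mdns": "phone", "igmp": "phone"}
```

A device such as `fake_hotspot`, which resembles an access point in some views and a laptop in others, is flagged, while `genuine_phone` is not.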

The rest of the paper is organized as follows: we define the problem in Section 2, and explain the data collection processes in Section 3. We present the OWL algorithm, followed by implementation and experiments in Sections 4 and 5. We present case studies of abnormal devices in Section 6. We discuss other important issues and review the literature in Sections 8 and 9, and finally conclude the paper.

2 Problem Statement and the Threat Model

In this section, we formally present the objectives of OWL, followed by an adversary model of abnormal devices.

Device Identification. The primary goal of OWL is to identify devices on a WiFi network through packets they broadcast/multicast (BC/MC). Formally, device identification is a classification problem: given a set of labeled samples $\{(D_i, l_i)\}$, find a classifier $c : D \rightarrow L$, which assigns a label $l_x = c(D_x)$ to a new sample $D_x$. In OWL, $D_i$ is a device represented by features extracted from BC/MC packets. Devices are identified at three granularity levels: {manufacturer}, e.g., "amazon"; {manufacturer-type}, e.g., "amazon-kindle"; {manufacturer-type-model}, e.g., "amazon-kindle-v2.0". Last, we design OWL to only rely on unencrypted passive traffic that could be sniffed without any special privilege.

Abnormal Device Detection. It is beneficial to the administrators/users if OWL could tell whether a device appears abnormal, besides labeling it. Therefore, another objective of OWL is to identify devices whose BC/MC traffic appears to deviate from known benign patterns. Such an abnormal sample could be a previously unknown device, or a fabricated/forged device. Formally, a function $d : D \rightarrow \{\text{"benign"}, \text{"malicious"}\}$ is designed to assign a label $d(D_i)$ to each new device $D_i$. Initially, $d$ is only trained with benign samples. When new malicious samples are confirmed, they are used to re-train $d$ to improve the detection accuracy for future samples of this type.

Assumptions and Adversary Model. We assume that OWL could connect to the to-be-measured WiFi network, i.e., the network is open or the WiFi security key is known. This is true for network administrators who measure their own networks. This is also true for users who attempt to detect suspicious devices when they connect to public WiFi. We also assume that the network infrastructure we connect to is benign, so that it faithfully forwards/routes packets as defined by the protocols, and OWL is able to collect those packets. Finally, we start with a clean model in the first task, where adversaries are not considered. Hence, we assume that the overwhelming majority of the devices in the training dataset are benign.
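The three granularity levels can be made concrete with a small sketch that derives the coarser labels from a full {manufacturer-type-model} label. The helper name `granularity_labels` is hypothetical, and the sketch assumes that the manufacturer and type fields contain no hyphens, which holds for the example labels above but is not stated as a general rule in this paper.

```python
def granularity_labels(full_label):
    """Split a 'manufacturer-type-model' label into the three label levels.

    Assumption: manufacturer and type contain no hyphens, so only the first
    two hyphens are separators; the model string may contain anything.
    """
    parts = full_label.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected manufacturer-type-model, got {full_label!r}")
    manufacturer, device_type, _model = parts
    return {
        "manufacturer": manufacturer,
        "manufacturer-type": f"{manufacturer}-{device_type}",
        "manufacturer-type-model": full_label,
    }


levels = granularity_labels("amazon-kindle-v2.0")
```

Here `levels["manufacturer"]` is "amazon" and `levels["manufacturer-type"]` is "amazon-kindle", matching the examples given for each level.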

In the task of abnormal device detection, we employ a simple threat model as follows: the adversaries attempt to connect (unauthorized) devices to (public) wireless networks. An abnormal/malicious device could be: (1) a device that does not forge its identity (i.e., an unaltered, genuine device) but is forbidden in the network, such as a hidden camera; or (2) a device that attempts to hide its true identity. The latter includes fabricated or altered devices that connect to the network with malicious purposes, such as fake access points or DHCP servers, and spoofed IoT device identities [60, 65]. It also includes devices that are counterfeit or forged at manufacturing, such as the fake Apple TVs we discovered (please see Section 6). This threat model only applies to the second task of the OWL approach.

3 Data Collection and Feature Extraction

3.1 Data Collection and Initial Analysis

Data was collected through a fully passive approach from three types of WiFi networks: (1) Open (unencrypted) public networks at coffee shops, restaurants, retail stores, some airports, etc. We directly connected to the hotspots without providing any credentials. (2) Open public WiFi with captive portals at airports, hotels, corporate guest networks, etc. We connected to these networks but did not provide information on landing pages. Hence, we were usually blocked from accessing the Internet, but we were able to sniff BC/MC packets. (3) Secure WiFi networks, including organization networks, home WiFi, and some public WiFi. We only collected data from networks that we were granted access to, such as university networks and retail stores that give passwords to customers. We connected the sniffing laptop to the networks, and employed Wireshark or tcpdump to download all BC/MC messages. The process was completely passive and non-intrusive. We did not turn on promiscuous or monitor mode. We did not actively send any message or make any spoofing attempt. The packets were all in plaintext and were also accessible to any other user on the same network.
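As an illustration of what qualifies as a BC/MC packet at the link layer, the sketch below classifies a destination MAC address. It relies on the standard Ethernet convention that the all-ones address is broadcast and that a set least-significant bit in the first octet marks a group (multicast) address. The function name is hypothetical, and this is not a description of how the capture tools were configured in this study.

```python
def classify_destination(mac):
    """Classify a destination MAC address as broadcast, multicast, or unicast."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected six octets, got {mac!r}")
    if octets == b"\xff" * 6:
        return "broadcast"
    # The least-significant bit of the first octet is the group bit.
    if octets[0] & 0x01:
        return "multicast"
    return "unicast"


examples = {
    "ff:ff:ff:ff:ff:ff": classify_destination("ff:ff:ff:ff:ff:ff"),
    "01:00:5e:00:00:fb": classify_destination("01:00:5e:00:00:fb"),
    "80:e6:50:19:54:4e": classify_destination("80:e6:50:19:54:4e"),
}
```

The second address is the Ethernet group address that carries IPv4 mDNS traffic (224.0.0.251), so it is classified as multicast; the third is an ordinary unicast address.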

With the help of our collaborators, we collected wireless network traffic from seven countries: US, Portugal, Sweden, Norway, Japan, Korea, and China. From January 2019 to July 2019, we collected data from 176 WiFi networks, among which 12 networks disabled BC/MC. Each data collection session lasted approximately 20 to 30 minutes. The WiFi networks we sniffed were very diverse in terms of ownership, including university, airport, hotel, restaurant, retail store, and volunteers' household WiFi. In total, we collected BC/MC packets from 31,850 distinct devices, which were identified by MAC addresses. Figure 1 (a) shows the distribution of WiFi networks (that allowed BC/MC) and devices. The number of devices per network is higher in Korea and China, mostly due to higher population density. In particular, we collected data from an airport in Korea and a student dorm in China, which contributed large volumes of devices. We statistically analyzed the collected data and found the following:

1. In total, we have identified 275 distinct protocols in the data. Note that we treat UDP packets to different ports as distinct protocols. Figure 1 (b) shows the distribution of the top 10 most frequently used protocols, led by ARP, ICMPv6 and mDNS.

2. 69.5% of devices sent BC/MC packets using more than 2 protocols and 46.1% of devices sent BC/MC packets using more than 3 protocols. Intuitively, the more protocols devices use for broadcasting, the more information they leak. 51.9% of the devices sent mDNS packets, which may convey semi-identifiable attributes of the devices. Application layer protocols like DHCP, SSDP and LLMNR are also widely used.

3. Protocol popularity appears to be consistent across countries, with a few exceptions. For instance, mDNS is the most frequently used BC/MC protocol in the US, Japan, and Sweden, but is ranked lower in the other countries. This is explained by the fact that these three countries have a higher density of Apple devices², which intensively use mDNS to discover services in the network. Meanwhile, Dropbox LAN Sync Discovery (DLSD) is not found in China, because DLSD is a proprietary protocol of Dropbox, which is blocked in China.

4. Some protocols are only used by one type of device. For instance, the KINK protocol is only found in packets sent from Samsung TVs. This observation has two implications: (1) proprietary protocols are good identifiers of hardware/software manufacturers; (2) when a proprietary protocol appears in the traffic generated by a third-party device (identified from other network traffic features), such a device should be further investigated, since it could be a spoofed device.

5. In the initial analysis, we employ Apriori [56] to statistically examine the patterns of BC/MC protocols used in each type of device, and show some examples in Table 1. For each device, the protocols are ranked by the frequency of captured packets. We can observe that each device family may have its distinct frequency pattern of protocols. Different products from the same manufacturer may show the same/similar pattern of protocols, e.g., several DLink devices demonstrate identical patterns of protocols. Most likely, such devices share the same hardware and software in their WiFi component.

The initial analysis suggests the possibility of using features extracted from BC/MC packets to identify the make, type, and model of the devices. The complexity of the patterns also implies that it could be very challenging for adversaries to perfectly spoof the network features of other devices.
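The per-device statistics in this subsection can be sketched in a few lines. The sketch below is a minimal illustration rather than the analysis code used in this study: it assumes packets have already been reduced to hypothetical `(device, protocol, udp_port)` tuples, names UDP packets by destination port so that different ports count as distinct protocols, and ranks each device's protocols by packet count. It does not implement Apriori.

```python
from collections import Counter, defaultdict


def protocol_name(protocol, udp_port=None):
    """Name a protocol; UDP packets to different ports are distinct protocols."""
    if protocol == "UDP" and udp_port is not None:
        return f"UDP_{udp_port}"
    return protocol


def frequency_patterns(packets):
    """Map each device to its protocols, ordered by descending packet count."""
    counts = defaultdict(Counter)
    for device, protocol, udp_port in packets:
        counts[device][protocol_name(protocol, udp_port)] += 1
    return {
        device: [name for name, _ in counter.most_common()]
        for device, counter in counts.items()
    }


def share_using_more_than(patterns, k):
    """Fraction of devices that sent BC/MC packets using more than k protocols."""
    if not patterns:
        return 0.0
    return sum(len(p) > k for p in patterns.values()) / len(patterns)


# Hypothetical toy capture: (device, protocol, UDP destination port or None).
packets = (
    [("tv", "ARP", None)] * 3
    + [("tv", "UDP", 15600)] * 2
    + [("tv", "UDP", 8001)]
    + [("speaker", "ARP", None)] * 2
    + [("speaker", "mDNS", None)]
)
patterns = frequency_patterns(packets)
```

On this toy input, `patterns["tv"]` is `["ARP", "UDP_15600", "UDP_8001"]` and `share_using_more_than(patterns, 2)` is 0.5, since only one of the two devices uses more than two protocols.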

3.2 Ethical Considerations

We collected data through a completely passive approach. We did not turn on promiscuous mode. That means we were the legitimate and intended receivers of the BC/MC packets. These packets were also received by all other computers in the same subnet. We did not eavesdrop on any unicast packet. We did not attempt to send anything (e.g., ARP requests). To the best of our knowledge, the data collection process did not violate any network's Terms and Conditions that were presented to the users. None of the T&Cs mentioned BC/MC traffic or network monitoring. Some forbid activities that may impact the security or usability of the network, while we did not impact the network. Some information in our data set may be considered somewhat sensitive. We discuss it here:

1. MAC. MAC addresses are unique identifiers of devices (not users). Recent research showed that users are vulnerable to MAC tracking attacks [14]. Such privacy risk does not apply to our data: (1) we only briefly collected data from sites that are very sparsely scattered globally. The probability of re-encountering the same MAC is extremely low. (2) We only retained the top six hexadecimal digits of MAC addresses. They cannot be used as unique identifiers of devices.

2. Device Name. Some devices (e.g., iOS devices) allow users to configure device names, and adopt them in several protocols, such as mDNS and DHCP³. Users may name the device with their own name (e.g., Alice's iPhone). We observed individuals' names in approximately 7% of the devices. The majority of them were first names, and many were fake names. In data pre-processing, we removed all names and analogues.

Besides MACs and (some) names, we do not have any identifier or personal information in the data. We did not collect any opinion, behavioral information, sensor data, demographic attribute, or other sensitive information. It is extremely difficult, if not impossible, to associate the collected data with offline identities. We did not make any attempt to discover personal information or to track any user. The data collection and analysis process did not introduce any risk to any user. The information we collected was technical data that was received by a large audience (anyone in the subnet).

² According to OS market share by country reported by . os-market-share/

³ Although Android allows users to set device names, the user-defined names are only used as hotspot and Bluetooth names, while DeviceName in mDNS and DHCP are manufacturer-defined strings that cannot be changed.

Figure 1: Statistics of collected data: (a) distribution of sniffed WiFi networks and devices in 7 countries; (b) the top 10 most frequently used BC/MC protocols in the dataset; and (c) the distribution of number of protocols used in devices.

Table 1: Examples of broadcast/multicast protocol frequency patterns of mobile/IoT devices.

device-type              protocol frequency pattern
apple-phone              ARP,mDNS,DHCP,ICMPv6,LLC,IGMP
apple-smartspeaker       ARP,ICMPv6,mDNS
dlink-siren              ARP,mDNS,DHCP,ICMPv6,IGMP
dlink-watersensor        ARP,mDNS,DHCP,ICMPv6,IGMP
edimax-camera            ARP,mDNS,DHCP,SSDP,IGMP
hikvision-camera         ADWIN_CONFIG,SSDP,IGMP
lg-tv                    ARP,mDNS,ICMPv6,SSDP,IGMP
microsoft-gameconsole    mDNS,LLMNR,ICMPv6,DHCPv6,IPv6,SSDP,IGMP
samsung-tv               ARP,UDP_15600,UDP_8001,IGMP
xiaomi-humidifier        ARP,mDNS

We discussed the project and data collection process with the IRB of the National Engineering Lab for Info. Sec. Tech. at CAS. They determined that our project was not human subject research, and it did not need a full IRB review. The Human Research Protection Program at the University of Kansas reviewed our written memo and agreed with the decision.
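The MAC truncation described in this section, which keeps only the top six hexadecimal digits, can be sketched as follows. The function name `mac_prefix` and the accepted input formats (colon- or hyphen-separated) are assumptions made for illustration, not details taken from the pre-processing pipeline of this study.

```python
HEX_DIGITS = set("0123456789abcdef")


def mac_prefix(mac):
    """Keep only the top six hexadecimal digits of a MAC address.

    The remaining six digits, which distinguish individual devices,
    are discarded.
    """
    digits = mac.replace(":", "").replace("-", "").lower()
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in (0, 2, 4))


prefix = mac_prefix("80:E6:50:19:54:4E")
```

Here `prefix` is "80:e6:50"; every device that shares those leading digits maps to the same value, so the retained string can no longer single out one device.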

3.3 Identifiers and Feature Extraction

We extracted three categories of features from the sniffed BC/MC packets: (1) The identifiers are (almost) unique to each make/type/model of the devices, i.e., they can be employed to uniquely identify devices when they are available. (2) The main features are robust discriminators that can be combined to collectively provide enough information to distinguish devices. (3) The auxiliary features are collected through actively querying devices. We only use them in evaluation.

1. Identifiers. Examples of protocols/fields that may carry device identification attributes are listed in Table 2, roughly ordered by their popularity and robustness (i.e., how unlikely they are to be altered). MAC prefix is available on every device and it could be utilized to infer the manufacturer of a device [40]. We retain the top six hexadecimal digits of MAC addresses in the MAC prefix feature, e.g., string "80:e6:50" is extracted from MAC "80:e6:50:19:54:4e". However, MAC prefix may only indicate the manufacturer of the WiFi module on some devices, not the device manufacturer. Next, Host Name in DHCP, answer names in mDNS response messages are
