
AESOP: Automatic Policy Learning for Predicting and Mitigating Network Service Impairments

Supratim Deb, Zihui Ge, Sastry Isukapalli, Sarat Puthenpura, Shobha Venkataraman, He Yan, Jennifer Yates

AT&T Labs, Bedminster, NJ {supratim,gezihui,sastry,sarat,shvenk,yanhe,jyates}@research.att.com

ABSTRACT

Efficient management and control of modern and next-generation networks is of paramount importance as networks must maintain highly reliable service quality while supporting rapid growth in traffic demand and new application services. Rapid mitigation of network service degradations is a key factor in delivering high service quality. Automation is vital to achieving rapid mitigation of issues, particularly at the network edge where the scale and diversity are greatest. This automation involves the rapid detection, localization and (where possible) repair of service-impacting faults and performance impairments. However, the most significant challenge here is knowing what events to detect, how to correlate events to localize an issue and what mitigation actions should be performed in response to the identified issues. These are defined as policies to systems such as ECOMP [1].

In this paper, we present AESOP, a data-driven intelligent system to facilitate automatic learning of policies and rules for triggering remedial actions in networks. AESOP combines best operational practices (domain knowledge) with a variety of measurement data to learn and validate operational policies to mitigate service issues in networks. AESOP's design addresses the following key challenges: (i) learning from high-dimensional noisy data, (ii) capturing multiple fault models, (iii) modeling the high service-cost of false positives, and (iv) accounting for the evolving network infrastructure. We present the design of our system and show results from our ongoing experiments to demonstrate the effectiveness of our policy learning framework.

CCS CONCEPTS

• Computing methodologies → Rule learning; Supervised learning by classification; • Networks → Network management;

KEYWORDS

Policy Learning; Network Management; Supervised Learning

1 INTRODUCTION

Today's networks serve hundreds of petabytes of traffic every day. As consumer demand for high bandwidth services such as video


Figure 1: Policy learning based automation loop for network service mitigation

continues to explode, network traffic volumes are expected to continue to grow at a rapid pace. Supporting such continued rapid growth in network traffic volumes without a corresponding increase in network operations staff requires increasingly efficient management and control of increasingly complex networks.

Fault and performance management constitutes a significant portion of the day-to-day operations activities in large networks or distributed systems [8]. As the name suggests, fault and performance management deals with the fault or performance degradation conditions occurring in networks. The operational steps include timely detection of the fault or performance degradation conditions in the system, localizing the faulty or underperforming network elements, and devising and deploying mitigation measures in the system, e.g., by performing a reboot on a network device, or switching traffic to a secondary path.

Traditional network management has relied on considerable domain expertise and operational knowledge accumulated by operations personnel through years of experience. To improve customer service experience and reduce operational cost, network service providers are investing heavily in advanced automation technologies to facilitate real-time response to fault and performance issues. For example, in 2016, AT&T unveiled ECOMP [1], a cloud-based scalable software platform for policy-driven automation of network management functions. These automation systems are governed by a set of operations policies which, in the case of fault and performance management, define the trigger conditions and the procedures to automatically respond to service impairments. Traditionally, the policies or rules triggering the responses to service impairments are defined through domain expertise. However, this knowledge is typically incomplete and hard to capture, as it is distributed across a large number of operations personnel. Instead, we propose using a machine-learning based approach to identifying these policies. The input to our learning based system comes from vast numbers of logs capturing the actions that network operators took in the network along with high-dimensional measurement

time-series data capturing the conditions that existed in the network as these actions were performed. Our goal in this work is to design a system that automatically learns and validates policies for the automated detection, localization and repair of fault and performance impairments in networks. We term this problem policy learning. The complete automation cycle enabled by our policy learning approach is illustrated in Figure 1. For the remainder of this paper, we use cellular networks as a use case to illustrate our design and for evaluation purposes. However, our framework and core design principles of policy-driven automation are applicable more generally to other types of networks and distributed systems comprised of physical and/or virtual network functions (network elements).

In this work, we focus on policies for mitigating network service impairments. Such policies specify a remedial action, such as rebooting a cellular base station, resetting an antenna element, etc., taken in response to identified impairments.

The typically vast network scale results in the same underlying network impairments appearing repeatedly, albeit in different locations and at various times of day. Impairments caused by a common software bug or underlying root cause would be repaired via a common remedial action. Given the complexity of the software and network conditions, such responses to given network conditions are often determined by operators through experience, with this knowledge (inconsistently) distributed across possibly large numbers of operators. If there is a sufficient number of these impairment instances, we hypothesize that we can automatically learn the policies that define the conditions under which different remedial actions should be invoked. With these learned policies, the remedial actions could be fully automated1.

1.1 Challenges

In practice, however, learning operational policies for network management and control is not a pure data mining exercise. A system that learns policies to mitigate network service issues needs a modular and flexible architecture that fuses operational knowledge into the learning pipeline to address the following challenges:

(1) Taking incorrect remedial actions: A huge challenge for policy learning comes from the typically high service cost of an incorrect action. A remedial action that is merely unnecessary will still likely impact customer experience during the mitigation period; an incorrect action may result in additional complications. Thus the false positive rate has to be very low so that the resulting service downtime is acceptable. Our goal is therefore to learn policies that can automate as many correct remedial actions as possible while keeping false positives extremely low.

(2) Multiple faults: A second challenge comes from multiple underlying types of impairments (fault or performance degradation) that may require similar remedial actions. The available remedial action log is not annotated with the underlying fault or service degradation that was the reason the action was taken. Thus, we would like to learn a collection of policies

1We focus only on the learning aspects in this work. The system for automating the remedial action based on a learned policy (e.g., a system for automatically rebooting a base station) is out of the scope of this paper.

where each policy implicitly captures an underlying type of network impairment and its associated remedial action. (3) Evolving heterogeneous infrastructure: The network infrastructure is constantly in a state of flux, with software upgrades, configuration changes, new technologies, new types of devices and new applications appearing on an ongoing basis. Any policy for a remedial action is typically specific to an operational environment (e.g., a combination of software version, configuration, services activated, protocols and devices supported, etc.) under which it applies. Thus, we would like to learn policies for different operational environments and diverse implicit impairment conditions.

In addition, the policies learned should assist the experts in finding permanent fixes for the faults. Domain experts typically do this through root-cause analysis. The learned policies can aid experts significantly by identifying relevant measurement counters for further investigation. Further, tracking the operational environment can also help experts with root cause analysis, by detecting broad trends in automated remedial actions. For example, if a number of automatically detected service issues relate to network elements with a specific software version and a specific type of service degradation, the vendor could be contacted to identify a permanent fix.

1.2 Contributions

In this paper, we describe AESOP (Automatic Extraction of Signatures for Operation Policies), a system for automatically learning policies for mitigating network service impairments. Our key insight is the following: while raw measurement data and logs do not reveal the underlying faults, we can design higher-level statistically-derived symptoms that can be used as pivots to learn policies for different operational environments. To the best of our knowledge, we present the first platform for automated policy learning in networks. In particular, our main contributions are as follows:

Policy learning framework: We present a novel data-driven framework and architecture for automatically learning policies for mitigating network service impairments. Our system is modular and flexible enough to be configured for different use-cases.

Modeling and algorithms: We develop a data-driven system model that reflects key principles of network operations, models service degradation in a systematic manner, and learns operations policies for different operational environments and different types of service degradation.

Experimental results: We present results from our ongoing experiments to illustrate the effectiveness of our system in the context of cellular networks. We show that the policies learned can correctly predict a sizable percentage (up to around 50%) of remedial actions from a target set of predictable instances.

2 DATA DRIVEN POLICY LEARNING

Our goal is to design a system that can learn policies for mitigating network service issues. Our core design principles and framework apply to all kinds of networks, including SDN and NFV based networks, wireline and cellular networks. However, we will illustrate our design with the aid of a cellular LTE network setting. To this end, we first introduce some terminology for cellular LTE networks.

In LTE networks, the main network element responsible for wireless transmission and reception is known as the eNodeB (the so-called "cell tower"). Each eNodeB has multiple transmitters; a mobile device is associated with any one of the eNodeB's transmitters. A wide range of measurements are reported by the eNodeB; these measurements are collected at the eNodeB and sent to data centers periodically.

2.1 Network Measurement Data

The different data types used to present our system design and evaluation are as follows:

(1) Measurement counters: These are raw performance measurement counters reported by eNodeBs. These counters are reported at fixed measurement intervals (typically 15 min), thereby creating time-series data. Example counters include the number of connection attempts in a measurement window, the number of call failures during hand-off to a neighboring cell, the number of bytes downloaded, etc. Each eNodeB reports more than a thousand measurement counters.

(2) Service related Key Performance Indicators: For each eNodeB, service related Key Performance Indicators (service KPIs) are measurements that show the quality of service delivered to end-users. In general, there are other KPIs that are not directly related to service; however, for the purpose of this paper we are mainly interested in service KPIs and we simply refer to them as KPIs in the rest of the paper. KPIs are typically based on a formula that takes as input the raw measurement counters described above. Some example service KPIs include downlink throughput, call accessibility percentage (i.e., a measure of the fraction of attempted calls that were successful), call retainability (i.e., a measure of the fraction of calls that were not dropped), etc. The important point to note is that frequently used KPIs are much fewer in number, typically no more than a few tens of time series. For ease of exposition, we will assume that higher values of service KPIs imply better service quality; this holds for the example service KPIs given here.

(3) Remedial action logs: These are specific network logs that record details (timestamp, steps, etc.) of remedial actions taken at different eNodeBs by expert operations engineers. When an eNodeB experiences degraded performance for an extended period of time, a ticket is created to alert a network engineer, who takes a suitable remedial action, e.g., a software reset or hardware reboot of the eNodeB. These actions are logged using network management software.

(4) Network configuration: The network configuration data captures how each element is configured, including the software version and the parameters that specify the network element's functionality. Since these configurations change over time, the data also contains the date associated with each configuration snapshot. We will refer to the software version and configuration template combination as the operational environment.

Remark: We present our system and evaluation in the context of the preceding types of data. However, our system architecture and policy learning framework is general; it can incorporate other types of data including alarms, syslogs, service related tickets, etc.

Use case specific meta-data: So far we have described the different types of data that are typically leveraged extensively for operating a network. However, there are additional properties of the data that depend on the specifics of a given network and service. These properties can be captured as meta-data, and the design of AESOP accounts for this meta-data; its specification can be configured by a domain expert on a per use case basis. Specifically, our system accounts for two types of use case specific meta-data, namely primary KPIs and measurement categories.

These are as follows in the context of a cellular 4G LTE network:

• Primary KPIs: These are the most important KPIs that accurately reflect the service quality of an eNodeB. Put simply, if the primary KPIs are not degraded, the eNodeB is considered to be in good health. Domain experts can hand-pick the list of primary KPIs (typically around ten or so) based on expert knowledge of critical service metrics. In a later section, we show how the choice of primary KPIs is used to define the notion of a symptom associated with a potentially degraded eNodeB.

• Measurement categories: Although there are over a thousand performance measurement counters, many of them present different performance aspects of a common underlying physical phenomenon or a common networking protocol. Thus, measurement counters can be categorized into correlated groups within which many counters will often provide redundant information. The categorization can be obtained in a number of ways: labeling by a domain expert, common keywords in counter names, the originating protocol or phenomenon of the counters, etc. For policy learning, categorization has two benefits: (i) dimensionality reduction through a two-stage process (see Section 3) that isolates important features in each category and pools them together, and (ii) identification of higher-level components responsible for the network impairment, thus enabling easier root-cause analysis.

Scale of data: The scale of the data depends on the network and is typically very large. In the cellular network studied here, there are around 900 million distinct measurements per day for a large US metropolitan area (e.g., New York City).

2.2 Problem Formulation

Our goal in this work is to learn policies for network problem mitigation by combining data-driven intelligence with domain knowledge. To this end, we formalize the notion of a policy in the context of our work. Before we do that, we introduce some key notation. We present our problem in terms of the specific application of rebooting eNodeBs. However, the framework is general and can be applied to other areas of a network, different mitigation actions and different network data. For the use case analyzed here, we are given the following historical data:

(1) Network measurement counters $x_c(e, t)$ as a function of eNodeB $e$ and time $t$, along with measurement KPIs $x_{kpi}(e, t)$, where $(x_c, x_{kpi}) \in \mathbb{R}^m$.

(2) Expert-aided network logs $a(e, t)$ which represent actions that were taken on eNodeB $e$ at time $t$, where $a(\cdot) \in A$ and $A$ is a finite set of remedial actions. To keep the notation generic, whenever there is no remedial action taken on an eNodeB (during normal working conditions), we consider this as a special action termed "no-remedial-action required."

Policies for a remedial action: In simple terms, a policy for a remedial action is a mapping from network measurement counters/KPIs to a possible remedial action on a given network element (an eNodeB in our specific use case). However, the applicability of each policy to a network element (eNodeB) depends on (i) the operating environment (software version and configuration) of the eNodeB, and (ii) the symptoms or degradation condition (made precise in the next section) of the eNodeB, if any. More formally, each policy is characterized by the following:

• Policy domain: This defines the specific operating environment of an eNodeB for which a policy is applicable and the specific symptoms or degradation conditions of the eNodeB. We provide a more formal definition of a symptom in a later section, but intuitively each symptom specifies the set of primary KPIs that were severely degraded, thus requiring a possible remedial action.

• Policy function: This function is a mapping from $(x_c, x_{kpi})$ to $a \in A$, the set of actions. Note that, WLOG, the set of actions also includes "no-remedial-action required."

Problem Statement: The policy learning problem can be stated as follows: learn all possible policies (policy domain and function) based on historical network counters/KPIs $x_c(e, t)$ and $x_{kpi}(e, t)$ along with action logs $a(e, t)$. The learning objective is to ensure that each policy function maps to an accurate remedial action that can fix the service degradation while the false positive rate of taking a remedial action (excluding "no-remedial-action required") is less than $\epsilon$, a small number. This is because an incorrect action on an eNodeB can lead to a small but non-negligible loss of service while the eNodeB is under repair.

We will show in the next sub-section how this can be cast as a classification problem after suitable data-processing steps.

2.3 AESOP Overview

As described earlier, using machine learning for this problem requires addressing several challenges: high-dimensional noisy data, multiple fault models, low false positives, rapidly-changing infrastructure and more. To address these challenges, we take the following data-processing steps to construct training examples:

(1) Computing base-line summaries: This step computes spatio-temporal statistical summaries by computing statistics (mean, median, quantiles) for each measurement counter/KPI and for each eNodeB at different times of the day. This step is used to subsequently convert raw measurements into normalized anomaly scores.

(2) Data normalization via anomaly scoring: The main functionality of this step is to normalize raw measurement values by converting raw data into anomaly scores.

(3) Symptom modeling: This step assigns a symptom, if any, to every measurement data instance. Since the action logs are not annotated with the underlying fault, the symptoms allow us to group measurement data corresponding to different remedial actions in a systematic manner. The symptoms are assigned based on combinations of degraded primary KPIs. Subsequently, this step constructs the training examples by suitable data cleansing, downsampling, and grouping of training examples; each group will have a common set of policies.

Figure 2: Key system modules of the policy learning framework

Figure 3: The behavior of a key KPI (downlink throughput) from two different cellular eNodeBs on a cellular operator's network. The exact values on the y-axis are not shown as the data is proprietary.

(4) Feature extraction and classification: This step has two machine learning functionalities. First, this module performs dimensionality reduction of the data by combining expert-aided feature categorization with standard feature extraction techniques. Second, it learns suitable classifiers so that the desirable properties in our problem statement in Section 2.2 are satisfied.

These steps learn the policies periodically (say, once a week). In addition, there is also a separate module that validates the policies learned. Figure 2 shows the interaction of the different system modules carrying out the above steps.

3 AESOP FRAMEWORK

We now describe the models and algorithms used in the AESOP framework for processing the network data and learning policies.

3.1 Computing Statistical Baselines

The statistical baseline for each counter and KPI must account for the underlying spatio-temporal pattern. Cellular network data is known to exhibit time-of-day and day-of-week patterns for many key metrics. As an illustration, Figure 3 depicts the downlink throughput for two different eNodeBs over a one week period, illustrating the clearly time-varying nature of the throughput. We therefore compute statistical baselines using the following steps:

(1) Binning: For each data sample from a measurement counter/KPI, we associate a spatio-temporal bin depending on three parameters: the hour of day, day of week, and eNodeB of the data. The binning is done over a time-window that is typically a few months, to strike a balance between capturing slow changes in network behavior and having sufficient samples in each bin for computing reliable statistical estimates. Also, prior to the binning process, missing data is given a special label.

(2) Computing statistics: Since we do not want to include anomalous data (due to faults or planned operational activities) in our statistical baselines, we first perform data cleansing as follows: exclude missing data and filter out data that corresponds to a time-window (typically 24 hours) within which the eNodeB underwent some remedial action or was in maintenance mode. Once the data cleansing is done, the ensemble of remaining data in each bin is used to compute different statistics: mean, standard deviation, and quantiles (10, 25, 50, 75, 90).
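To make the binning and statistics steps concrete, the following is a minimal sketch (not AESOP's production code), assuming cleansed data arrives as a pandas DataFrame with hypothetical columns 'enodeb' and 'ts' (a datetime) plus one numeric column per counter/KPI:

```python
import pandas as pd

def compute_baselines(df, metric_cols):
    """Per-bin statistics for each counter/KPI (Section 3.1).
    df: cleansed DataFrame (missing data and 24h action/maintenance
    windows already removed) with columns ['enodeb', 'ts'] + metric_cols."""
    df = df.dropna(subset=metric_cols).copy()   # exclude missing data
    df['hod'] = df['ts'].dt.hour                # hour-of-day part of the bin key
    df['dow'] = df['ts'].dt.dayofweek           # day-of-week part of the bin key
    g = df.groupby(['enodeb', 'dow', 'hod'])[metric_cols]
    return {
        'mean': g.mean(),
        'std': g.std(),
        # quantiles reused later for IQR-based anomaly scoring
        'quantiles': g.quantile([0.10, 0.25, 0.50, 0.75, 0.90]),
    }
```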

3.2 Data Normalization via Anomaly Scoring

Since the data can have very different ranges across measurement counters/KPIs, originating eNodeBs and measurement times, we normalize the measurements to put all data on the same scale. Many machine learning algorithms rely on the different dimensions being on the same scale. Normalization also helps us in feature extraction by allowing us to compare the importance weights of different measurements in a meaningful manner. Since our goal is to learn policies for remedial actions that fix service problems, as a design choice, we perform this normalization via anomaly scoring. To motivate this for cellular networks, consider the following example: since cellular download speed depends on network load and eNodeB configuration, a download speed of 2 Mbps could be perfectly normal at 9 AM in a busy business area but highly abnormal at 4 PM on weekends in a residential area. To learn data-driven policies, we need a consistent anomaly score that reflects these two scenarios.

To compute the anomaly score, we make use of IQR (interquartile range) based robust statistics combined with notions of outliers and extreme outliers in data. For each measurement we first retrieve the statistics of the associated spatio-temporal bin as defined in Section 3.1. Suppose, for a given measurement $x$, that the 25th and 75th percentiles for the corresponding bin are $Q_1$ and $Q_3$ respectively, and let $IQR = Q_3 - Q_1$. We wish to find a mapping such that: (i) all anomaly scores lie in the range $[0, 1]$, and (ii) extreme outliers, defined as $3 \times IQR$ above $Q_3$ or $3 \times IQR$ below $Q_1$, are mapped to the right and left end-points of the range, 1 and 0 respectively. One such mapping is:

$$f_{as}(x) = \begin{cases} \max\left(0,\; 0.5\left(1 - \frac{Q_1 - x}{3\,IQR}\right)\right) & x < Q_1 \\ 0.5 & x \in [Q_1, Q_3] \\ \min\left(1,\; 0.5\left(1 + \frac{x - Q_3}{3\,IQR}\right)\right) & x > Q_3 \end{cases} \tag{1}$$

Indeed, $f_{as}(Q_1 - 3\,IQR) = 0$ and $f_{as}(Q_3 + 3\,IQR) = 1$.
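In code, Eq. (1) is a direct case analysis; the guard for a zero IQR below is our own addition for degenerate bins:

```python
def anomaly_score(x, q1, q3):
    """Map a raw measurement to [0, 1] per Eq. (1): 0.5 is nominal; values
    at or beyond 3*IQR outside [Q1, Q3] saturate at 0 (low) or 1 (high)."""
    iqr = q3 - q1
    if iqr == 0:        # degenerate bin (constant counter): treat as nominal
        return 0.5
    if x < q1:
        return max(0.0, 0.5 * (1 - (q1 - x) / (3 * iqr)))
    if x > q3:
        return min(1.0, 0.5 * (1 + (x - q3) / (3 * iqr)))
    return 0.5
```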

Our choice above is robust to skew in the data statistics (unlike a z-score based statistic) and is also computationally light (unlike sophisticated measures based on time-series anomaly detection). This is an important requirement, given the dimension of the data (over a thousand per eNodeB per measurement instance).

Capturing time-series history: We also augment the features by computing time-averaged statistics of each normalized measurement. Specifically, we compute rolling statistics (median and mean) of the normalized anomaly scores over a fixed time window, set as 2 hours.

In summary, this module performs the following functions:

(1) For each dimension in the measurement vector $(x_c(e, t), x_{kpi}(e, t))$, the module computes the anomaly scores given by (1). Denote the vector of anomaly scores by $(z_c(e, t), z_{kpi}(e, t))$.

(2) For each dimension in $(z_c(e, t), z_{kpi}(e, t))$, the module computes the windowed running median and mean values over a time window $[t - W, t]$, where typically $W = 2$ hrs, which corresponds to the past eight measurements since measurements are sent once every 15 min. The choice of $W$ reflects the fact that we wish to learn policies that can rapidly detect service issues. $W$ is a configurable parameter that should be set based on expert operational insights.
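As an illustrative sketch of step (2), assuming the anomaly scores for one eNodeB are held in a pandas DataFrame indexed by 15-minute timestamps (the column suffixes here are hypothetical):

```python
import pandas as pd

def add_rolling_history(scores, window=8):
    """Augment anomaly scores with running median/mean over [t - W, t];
    window=8 samples corresponds to W = 2 hrs at 15-min reporting."""
    med = scores.rolling(window, min_periods=1).median().add_suffix('_rmed')
    mean = scores.rolling(window, min_periods=1).mean().add_suffix('_rmean')
    return pd.concat([scores, med, mean], axis=1)
```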

Missing data: As discussed in Section 3.1, missing measurements could be indicative of a problem or could instead be the result of a data collection issue. Thus we assign a special symbol for missing data. We also tested encoding strategies like one-hot encoding. However, our experience was that this considerably increases the dimensionality (already in the thousands) without providing appreciable gains in the effectiveness of our models.

3.3 Symptom Modeling

A remedial action on a network element is typically taken in response to an underlying fault that causes some service problem. However, the action log, which we use to capture the dates, times and locations of given operator actions (e.g., eNodeB reboots), does not capture any information regarding why an action was taken by a network operator. To overcome this challenge, AESOP's design makes use of the following insight: each underlying root-cause manifests as a degraded condition on some subset of the most important KPIs. We refer to these KPIs as the primary KPIs; they are specified as inputs to AESOP (meta-data provided by domain experts). Thus, a symptom is simply some combination of degraded primary KPIs. To assist in a formal definition of a symptom, we first define the degradation score of a KPI.

Definition 1 (KPI degradation score). Given the windowed running-median anomaly score $\hat{z}(t)$ of a measurement KPI, the degradation score of the KPI is defined by $\mathrm{Score}_d(\hat{z}(t)) = \sum_{k=0}^{K} \mathbf{1}(\nu - \hat{z}(t) \geq L_k)$, where $\nu = 0.5$ is the nominal anomaly score and the $L_k$ are discretization levels (e.g., $L_0 = 0.1$, $L_1 = 0.2$, $L_2 = 0.3$).

Definition 2 (Measurement symptom). Given the time-windowed running median anomaly scores of the primary KPI-vector $\hat{z}_{kpi}(t) = (\hat{z}_1(t), \hat{z}_2(t), \ldots)$, the underlying symptom of the measurement is characterized by two properties: (i) a symptom score given by the maximum degradation score $\mathrm{MaxScore}_d(\hat{z}_{kpi}) = \max_i \mathrm{Score}_d(\hat{z}_i)$, and (ii) symptom KPIs given by the set of KPIs attaining the maximum degradation score, i.e., the set $\{i : \mathrm{Score}_d(\hat{z}_i) = \mathrm{MaxScore}_d(\hat{z}_{kpi})\}$.

Intuitively, the notion of a symptom captures the directions of the KPI vector that are most abnormal (in a probabilistic order-of-magnitude sense). When all KPIs are within $Q_1$ and $Q_3$, the symptom has an associated degradation score of zero.
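A literal transcription of Definitions 1 and 2, using the example discretization levels from the text:

```python
NU = 0.5                      # nominal anomaly score (Definition 1)
LEVELS = [0.1, 0.2, 0.3]      # discretization levels L0, L1, L2 from the text

def degradation_score(z_hat):
    """Score_d: number of discretization levels by which the running-median
    anomaly score z_hat falls below the nominal value (Definition 1)."""
    return sum(1 for L in LEVELS if NU - z_hat >= L)

def symptom(z_kpi):
    """Symptom of a primary-KPI vector (Definition 2): the maximum
    degradation score and the indices of the KPIs attaining it."""
    scores = [degradation_score(z) for z in z_kpi]
    s_max = max(scores)
    return s_max, [i for i, s in enumerate(scores) if s == s_max]
```

For instance, symptom([0.5, 0.18, 0.5]) returns (3, [1]): the second primary KPI sits three levels below nominal while the others are nominal, so it alone defines the symptom.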

We now describe how we construct the training examples. The training examples consist of two sets of records: measurements that were followed by a remedial action, and measurements during normal network behavior. We perform the following steps:

(1) Symptom computation: For each measurement, we compute the associated symptom based on the preceding definition.

(2) Extracting successful remedial actions: Though remedial actions are taken to mitigate service impairments, sometimes a given remedial action may not fix the service problem, in which case an operations engineer may try an alternative remedial action. Since we wish to learn patterns for remedial actions that fixed the service problem, it is imperative to identify the remedial actions (from the set of all remedial actions in the remedial action log) that successfully mitigated service impairments. To extract the successful remedial actions, for each remedial action we compute an aggregate2 degradation score across KPIs in fixed-length time windows before and after the action. Any remedial action that did not result in an improvement based on this aggregate score is removed from the training data set (a sketch of this step follows the list below).

(3) Downsampling: When there is a service problem at an eNodeB, it manifests in terms of different symptoms at different times (prior to the remedial action), because degradations of different KPIs have some degree of correlation. This raises an important question: which instances and corresponding symptoms should be represented in the training data for a particular remedial action? Since our goal is to learn policies for mitigating service impairment, a natural choice is to use those instances when the service impairment was most severe. In our framework, we construct a single training example per remedial action as follows: we pick the measurement corresponding to the worst aggregate degradation score in the $T_h = 12$ hr period leading up to the remedial action. For constructing training data corresponding to measurements of normal behavior, we randomly sample data from normal behavior such that there is at most one measurement from an eNodeB on a single day; this ensures statistical independence of the training examples.

(4) Grouping for identifying policy domains: Finally, we group the training examples from each remedial action based on the symptom and additional data related to the operational environment of an eNodeB. The key idea is to learn policies for each group individually, as service problems in each group are likely to have a common underlying pattern. Note that a group reflects the policy domain corresponding to the policies learned for the group.

Symptom pooling: Ideally we wish to learn policies for each underlying symptom. This is a challenge for symptoms without sufficient training data to learn patterns. For example, suppose that for an operational environment there are (i) 100 remedial actions corresponding to a symptom with KPI-1 and KPI-2 both simultaneously worst hit with a degradation score of 3, and (ii) 15 remedial actions corresponding to symptoms where KPI-1, KPI-2 and KPI-3 are worst hit with a degradation score of 3. In this case, we pool the latter training examples with the former. In general, whenever the size of the training data set is less than a threshold for some symptom $s_1$, we pool it with another symptom $s_2$ such that $s_1$ and $s_2$ have common degraded KPIs; this process can be iterated a few times if required.

2Aggregate degradation score is the normed (L2 norm) score across individual KPI degradation scores.
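The before/after comparison in step (2) might look like the following sketch; the 6-hour window and the score_lookup helper are hypothetical stand-ins for the fixed-length windows and data access described above:

```python
import numpy as np
import pandas as pd

def aggregate_score(kpi_scores):
    """L2 norm across individual KPI degradation scores (footnote 2)."""
    return float(np.linalg.norm(kpi_scores))

def keep_successful_actions(actions, score_lookup, window=pd.Timedelta('6h')):
    """Retain only actions whose post-action aggregate degradation improved.
    actions: list of dicts with keys 'enodeb' and 'ts' (a pd.Timestamp);
    score_lookup(enodeb, start, end) -> iterable of aggregate scores."""
    kept = []
    for act in actions:
        before = list(score_lookup(act['enodeb'], act['ts'] - window, act['ts']))
        after = list(score_lookup(act['enodeb'], act['ts'], act['ts'] + window))
        if np.mean(after) < np.mean(before):   # degradation decreased: success
            kept.append(act)
    return kept
```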

3.4 Feature Extraction and Classi cation

We describe our models assuming a single remedial action; the extension to multiple remedial actions is analogous to extending 2-class classification to multiclass classification. Without loss of generality, we consider data corresponding to a single group of training examples for the remedial action, since we wish to learn the policies for each group separately. Recall that each group (policy domain) is characterized by a symptom, software version and configuration, as outlined in Section 3.3. In this section, we say measurements to mean "measurements normalized via anomaly scores."

After constructing the training examples in each group using the data-processing steps discussed in the previous section, let $z_i$, $i = 1, 2, \ldots, m$, be the set of measurements, where $i \in [1, m_1]$ corresponds to measurements prior to the remedial action (class-R) and $i \in [m_1 + 1, m]$ corresponds to measurements during normal network behavior (class-N). Let $y_i$ be the binary variable denoting the two classes, i.e., $y_i = 1$ for $i \in [1, m_1]$ and $y_i = 0$ for $i \in [m_1 + 1, m]$. We will also use the notation $D^+ = \{z_i : i \in [1, m_1]\}$, $D^- = \{z_i : i \in [m_1 + 1, m]\}$, and $D = D^+ \cup D^-$. Our goal is to compute a mapping $y = h(z)$ that maximizes the true positives subject to low false positives. Denoting $\Pr_{D^+}(h(z)) = \Pr(h(z) \mid z \in D^+)$ and $\Pr_{D^-}(h(z)) = \Pr(h(z) \mid z \in D^-)$, we wish to solve the following problem:

$$\max_{h : D \to \{0,1\}} \Pr_{D^+}(h(z) = 1) \quad \text{s.t.} \quad \Pr_{D^-}(h(z) = 1) \leq \epsilon$$

In the log-domain, the above problem is equivalent to

$$\max_{h : D \to \{0,1\}} \log \Pr_{D^+}(h(z) = 1) \quad \text{s.t.} \quad \log \Pr_{D^-}(h(z) = 0) \geq K_\epsilon, \tag{2}$$

where $K_\epsilon = \log(1 - \epsilon)$. Though this is a constrained optimization problem, it can be cast into an unconstrained problem using Lagrangian duality principles [4]. Under suitable convexity assumptions [4], (2) is equivalent to solving the following unconstrained problem:

$$\min_{\lambda \geq 0} \; \max_{h : D \to \{0,1\}} L(h(\cdot); \lambda)$$

$$L(h(\cdot); \lambda) = \log \Pr_{D^+}(h(z) = 1) + \lambda \left( \log \Pr_{D^-}(h(z) = 0) - K_\epsilon \right), \tag{3}$$

where $L(\cdot)$ is the Lagrangian of the original constrained problem. Note that for a given value of $\lambda$, the expression in (3) is exactly the log-likelihood of the training data with the class-N examples weighted by $\lambda$. Thus, we have a standard classification problem where the optimal classifier is computed by maximizing the log-likelihood of the training data; there is also an additional hyper-parameter $\lambda$ which needs to be optimized.
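For a fixed $\lambda$, maximizing (3) is a weighted maximum-likelihood problem, so an off-the-shelf logistic regression with class weights suffices; a sketch using scikit-learn (the elastic-net penalty anticipates Section 3.4.2):

```python
from sklearn.linear_model import LogisticRegression

def fit_policy_classifier(Z, y, lam, l1_ratio=0.5, C=1.0):
    """Maximize (3) for a fixed lambda: class-N (y=0) examples are
    weighted by lam; elastic-net regularization per Section 3.4.2."""
    clf = LogisticRegression(penalty='elasticnet', solver='saga',
                             l1_ratio=l1_ratio, C=C, max_iter=5000,
                             class_weight={1: 1.0, 0: lam})
    return clf.fit(Z, y)   # y=1: class-R, y=0: class-N
```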

We now describe our feature extraction and learning steps.

3.4.1 Feature Extraction. The measurement data has around 1100 counters and around 2200 dimensions once we augment the feature space with rolling time-window based statistics (median or mean). Thus, it is very important to select the right features to learn policies that are effective. Rather than applying a feature extraction methodology to the entire pool of features, we can make use of the expert-aided measurement categories, as each category has features that relate to a common underlying phenomenon or protocol. For example, in cellular networks, a very important KPI is the fraction of calls not dropped when mobile users move across cell boundaries. When a call is dropped, it is captured by multiple measurement counters corresponding to different protocols and events. Expert knowledge provides a measurement category for all such measurements as meta-data. Thus, for symptoms related to this KPI, we can focus on this set of features and perform feature extraction much more efficiently.

To this end, our feature extraction step identifies the top features in each measurement category and then pools them together. This ensures that the most important features from each related group of features are used in learning. See Section 3.5 on how this approach lends itself to interpretable policies.

For extracting features from a category of measurements, we proceed as follows. First, we project each training example $z_i$ into a lower-dimensional space consisting of the measurements belonging to this category. Next, we use stability selection [15], which repeatedly uses a random subset of examples and a random subset of features to perform randomized logistic regression with L1 regularization; the repetitions are aggregated to produce an importance weight for each feature. Once the feature weights are computed, we drop all features with weight below a threshold relative to the most important feature in the measurement category.
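A sketch of the stability-selection step; the subsampling fractions and regularization strength are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_selection(Z_cat, y, n_rounds=100, sample_frac=0.75,
                        feat_frac=0.75, C=0.1, seed=0):
    """Importance weights for one measurement category (cf. [15]): repeat
    L1 logistic regression on random subsets of examples and features, and
    count how often each feature receives a nonzero coefficient.
    Z_cat: (n, d) numpy array restricted to this category; y: 0/1 labels."""
    rng = np.random.default_rng(seed)
    n, d = Z_cat.shape
    counts = np.zeros(d)
    for _ in range(n_rounds):
        rows = rng.choice(n, int(sample_frac * n), replace=False)
        cols = rng.choice(d, int(feat_frac * d), replace=False)
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        clf.fit(Z_cat[np.ix_(rows, cols)], y[rows])
        counts[cols] += (np.abs(clf.coef_[0]) > 1e-8)  # selected this round?
    return counts / n_rounds   # selection frequency = importance weight
```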

3.4.2 Classification and regularization. The classification objective is to compute whether a given measurement instance can be classified as normal behavior or as a service issue fixable through the remedial action. We use a logistic-regression based classifier for this purpose, as follows. According to (3), this boils down to maximizing

$$\mathrm{Obj}(w \mid \lambda, \beta) = \log \Pr_{D^+}(h(z) = 1) + \lambda \log \Pr_{D^-}(h(z) = 0) - \Omega(w),$$

where $y = h(z)$ is modeled by a sigmoid logistic function as

$$\Pr(h(z) = 1) = \frac{1}{1 + \exp(-w_0 - w^T z)}. \tag{4}$$

There are two hyper-parameters in the classification step: $\lambda$, which is the weight of class-N in (3), and a cut-off factor $\beta$ for feature selection, which decides the relative importance of the most and the least important features from each measurement category. The term $\Omega(w)$ is a regularization term for which we choose the elastic net3. The elastic net provides robustness by striking a balance between sparsity of features and capturing groups of correlated features.

Tuning the classifier for low false positives: Once the model is trained by computing the optimal $w$, we need a cutoff for class probabilities. We choose this to satisfy the false positive constraint. Denoting $p(h(z_i))$ as the value of $\Pr(h(z_i) = 1)$ for the computed optimal $w$, this can be done as follows: (i) based on the trained model, sort the values of $p(h(z_i))$ for all $i$ belonging to class-N, and (ii) use the $(1 - \epsilon)$th quantile of the sorted values of $p(h(z_i))$ from class-N as the cutoff.

3Elastic net is a linear combination of L1 and L2 regularization.
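The quantile-based cutoff of the preceding paragraph is a one-liner; here $\epsilon$ defaults to the false-positive budget reported in Table 1 ($0.005\% = 5 \times 10^{-5}$):

```python
import numpy as np

def tune_threshold(clf, Z_neg, eps=5e-5):
    """Set the class-probability cutoff to the (1 - eps) quantile of
    p(h(z)) over class-N training examples (Section 3.4.2)."""
    p_neg = clf.predict_proba(Z_neg)[:, 1]   # P(class-R) on class-N examples
    return float(np.quantile(p_neg, 1.0 - eps))
```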

Hyper-parameter search: The previous steps are for a given choice of hyper-parameters. A suitable value of the hyper-parameter combination $(\lambda, \beta)$ is chosen through a grid-search over reasonable values. Each choice of the tuple is evaluated through K-fold cross-validation with $K = 10$.

3.4.3 Boosting. Training data corresponding to a given symptom within a group could have multiple underlying patterns leading to the same remedial action. Thus, we would like to compute multiple policies within each group. A natural way to express this is as a disjunction (i.e., logical OR) of multiple policies within each group. Indeed, even in manual operations, upon dissecting the data, the next step is to check whether one of a few simplistic conditions is violated. We use a boosting-like [7] approach to extend our classification based policy learning to learn a disjunction of policies within each group. There is a rich literature on different variants of boosting that could be adapted to our setting. Suppose, for a training example group, we wish to find the disjunction of policies $H_k(z) = \vee_{i=1}^{k} h_i(z)$, where $h_i$ is the $i$th rule. Letting $TPR_k$ and $FPR_k$ be the true positive rate and false positive rate respectively after learning the $k$th rule, we can see that

$$TPR_k = \Pr_{D^+}(H_{k-1}(z) \vee h_k(z) = 1) = TPR_{k-1} + \Pr_{D^+}(H_{k-1}(z) = 0)\, \Pr_{D^+}(h_k(z) = 1 \mid H_{k-1}(z) = 0).$$

Also, we can write $FPR_k \leq FPR_{k-1} + \Pr_{D^-}(h_k(z) = 1)$. If we cap the number of boosting iterations at, say, $N = 3$, then $FPR_k \leq \epsilon$ is ensured if $\Pr_{D^-}(h_k(z) = 1) \leq \epsilon/N$. Since $\Pr_{D^+}(h_k(z) = 1 \mid H_{k-1}(z) = 0)$ can be rewritten as $\Pr_{D^+ \cap I_{k-1}}(h_k(z) = 1)$, where $I_{k-1}$ is the set of class-R examples incorrectly classified up to iteration $(k-1)$, in iteration $k$ we must solve

$$\max_{h_k} \Pr_{D^+ \cap I_{k-1}}(h_k(z) = 1) \quad \text{s.t.} \quad \Pr_{D^-}(h_k(z) = 1) \leq \epsilon',$$

where $\epsilon' = \epsilon/N$ and $N$ is the maximum number of boosting iterations. This is similar to our original problem except that, before iteration $k$, we must remove all class-R training examples correctly classified up to iteration $(k-1)$.
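The resulting boosting loop reduces to repeated calls to the base learner on the still-uncovered class-R examples with a per-rule FPR budget of $\epsilon/N$; a schematic sketch, where fit_rule stands for any solver of the constrained problem above:

```python
def learn_disjunction(Z_pos, Z_neg, fit_rule, eps, n_max=3):
    """Boosting-like loop of Section 3.4.3: learn up to n_max rules whose
    disjunction covers class-R, each with FPR budget eps/n_max.
    fit_rule(Z_pos, Z_neg, budget) -> predicate taking one example."""
    rules, uncovered = [], list(Z_pos)
    for _ in range(n_max):
        if not uncovered:              # every class-R example already covered
            break
        h = fit_rule(uncovered, Z_neg, eps / n_max)
        rules.append(h)
        # keep only class-R examples still misclassified by the disjunction
        uncovered = [z for z in uncovered if not any(r(z) for r in rules)]
    return rules
```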

3.5 Interpretability

The structure of a policy can guide an expert towards finding a permanent fix for the problem as follows. First, the expert obtains the symptom associated with the policy. Next, by considering the degraded KPI(s) associated with this symptom, the domain expert gets a very good idea about the measurement categories that could be impacted, in order of criticality (the order depends on the degraded KPI). The feature(s) with the highest weight (in the classification rule) within a category provide a glimpse into the faults captured by the policy, which can be investigated using raw data for different eNodeBs (from the training set) before the remedial action. This way, an expert obtains the key affected measurements, in order of criticality, for a given fault, which can then be used to fix the issue in consultation with the vendor.

4 EVALUATION

Implementation: We have implemented the system on a Linux based high performance computing cluster with 1 TB RAM on each machine. The implementation is Python-based and uses process-based parallel programming constructs to speed up the training steps. We are also working on an Apache Spark based implementation to evaluate possible gains in computing speed.

We now evaluate AESOP using data collected from an operational cellular network in a major metropolitan region on the US East Coast. The area contained around 1800 eNodeBs. Over the course of 3 months of training data, we encountered five different operational environments (software version and configuration).

4.1 Experiment Setup

Measurement data for training: The system learns the policies based on 3 months of history, and the policies are updated once a month. The data consisted of around 1100 dimensions (measurement counters and KPIs) collected every 15 minutes. The time-series data is augmented using a rolling median (as described in Section 3.2) of each dimension. We use two data sets for our results: Section 4.2 is based on policies learned in Jan 2017 and Section 4.3 is based on evaluation of policies learned in April 2017.

Remedial action logs: We show results for a frequently used remedial action taken in the network, specifically rebooting an eNodeB. In other words, we trained and evaluated our system for learning policies that can recommend when an eNodeB should be rebooted to fix a service issue. For the purpose of training, we filtered out all incidents during maintenance hours. Also filtered out were two additional types of reboots: (i) those due to software upgrades, and (ii) those not preceded by any KPI degradation, as many reboots are automated and have nothing to do with mitigating service issues. We only focus on cases where the service impact (measured through the aggregate KPI degradation score) is above a minimum threshold, as the success of our system depends on effectively detecting service issues with appreciable KPI degradation.

Testing: Once the policies are learned, evaluation/testing is done in a subsequent (non-overlapping) period. Testing is done periodically based on all measurement data collected over the period (we do it once a day, but it can be performed once every few hours as well). The testing steps for a measurement instance corresponding to an eNodeB at a given time are as follows:

(1) Normalize the data using the data normalization steps described in Section 3.2.

(2) Associate a policy group based on the software version, configuration template and computed symptoms.

(3) For each policy learned for this group, compute the classification probability $\hat{p}$ of class-R given by (4). We output this measurement instance as class-R if $\hat{p}$ is more than a threshold (obtained through 10-fold cross-validation in the learning phase). In standard approaches with ROCs from K-fold cross-validation [6], one compares $\hat{p}$ to the threshold calculated from the mean-ROC plot; however, due to the sensitivity to false positives in our system, we also associate a confidence with the classification based on the fraction of thresholds (there are 10 thresholds, one per cross-validation fold) exceeded by $\hat{p}$.

(4) For each remedial action from the action logs, in the time leading up to the action, find the prediction with maximum confidence.
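Step (3)'s confidence computation, as a small sketch (fold_thresholds holds the 10 per-fold cutoffs from cross-validation):

```python
import numpy as np

def classify_with_confidence(p_hat, fold_thresholds):
    """Flag class-R when p_hat clears the mean per-fold threshold, and
    report the fraction of fold thresholds exceeded as the confidence."""
    t = np.asarray(fold_thresholds, dtype=float)
    return bool(p_hat > t.mean()), float(np.mean(p_hat > t))
```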

Table 1: Number of features used and average TPR ($FPR < 0.005\%$) for the best choice of hyper-parameters based on 10-fold cross-validations. We use tptKPI, xaKPI, cdKPI to denote KPIs related to throughput, connection access, and call drops respectively (rather than the exact KPI names). The operational environment tags are anonymized for confidentiality.

Operational Environment (SW Vrsn./Conf.) | Symptom KPIs | Iteration | No. Features | TPR
ST-1 | xaKPI-2 | 0 | 26 | 34%
ST-1 | tptKPI-6 | 0 | 60 | 30%
ST-1 | tptKPI-6 | 1 | 74 | 20%
ST-3 | cdKPI-1 | 0 | 54 | 80%
ST-5 | tptKPI-5 and tptKPI-6 | 0 | 62 | 35%
ST-5 | xaKPI-2 | 0 | 22 | 36%
ST-5 | tptKPI-6 | 0 | 24 | 53%
ST-5 | tptKPI-6 | 1 | 35 | 23%
ST-5 | tptKPI-6 | 2 | 65 | 15%
ST-5 | cdKPI-1 | 0 | 84 | 29%

Figure 4: ROC curves for a few of the cases in Table 1

In the following, we first present in Section 4.2 a validation of the individual policies learned by AESOP, and then in Section 4.3 we present results demonstrating the practical usefulness of the pool of policies learned.

4.2 Policy Validation

We first present cross-validation results to understand the effectiveness of our key steps and model. We have three goals: first, we wish to validate that our model outputs policies with good true positive statistics in spite of very tight constraints on the false positive rate, which is important for the practical usefulness of the policies; second, we wish to validate the potential gains from boosting (Section 3.4.3); third, we wish to validate whether the best choice of model hyper-parameters provides a significant reduction in the number of counters, which is important for a domain expert to make sense of the extracted features. All results are based on 10-fold cross-validation with the best choice of hyper-parameters obtained through a grid search.
