Predicting Utility Site Disruptions
Our business question: a utility company wants to predict site disruptions in order to minimize the amount of time its service is down. My boss has asked me to generate a report every morning that preemptively flags likely disruptions, so that she can send service technicians to their first locations of the day.
Each service station has a fault severity of 0 (no issues), 1 (minor issues), or 2 (major issues). The other data given to us are fairly abstract in nature, but we know each observation is a single instance of a service check on the station.
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
After importing the libraries, we read in the data, which is given to us as five tables: four feature tables plus the training labels. After taking a look at each one, we clean them up enough to combine them.
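The post doesn't show the loading step; here is a minimal sketch, assuming the five tables live in CSV files with these (hypothetical) names:
#Read in the raw tables (file names here are assumptions -- adjust to match your data)
event_tb = pd.read_csv('event_type.csv')
log_tb = pd.read_csv('log_feature.csv')
resource_tb = pd.read_csv('resource_type.csv')
severity_tb = pd.read_csv('severity_type.csv')
train_tb = pd.read_csv('train.csv')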
df_list = [event_tb, log_tb, resource_tb, severity_tb, train_tb]
for item in df_list:
    print(item.head())
id event_type
0 6597 event_type 11
1 8011 event_type 15
2 2597 event_type 15
3 5022 event_type 15
4 5022 event_type 11
id log_feature volume
0 6597 feature 68 6
1 8011 feature 68 7
2 2597 feature 68 1
3 5022 feature 172 2
4 5022 feature 56 1
id resource_type
0 6597 resource_type 8
1 8011 resource_type 8
2 2597 resource_type 8
3 5022 resource_type 8
4 6852 resource_type 8
id severity_type
0 6597 severity_type 2
1 8011 severity_type 2
2 2597 severity_type 2
3 5022 severity_type 1
4 6852 severity_type 1
id location fault_severity
0 14121 location 118 1
1 9320 location 91 0
2 14394 location 152 1
3 8218 location 931 1
4 14804 location 120 0
#Merge the dataframes together
df = reduce((lambda df1, df2: pd.merge(df1, df2, on='id')), df_list)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 61839 entries, 0 to 61838
Data columns (total 8 columns):
id 61839 non-null int64
event_type 61839 non-null int32
log_feature 61839 non-null int32
volume 61839 non-null int64
resource_type 61839 non-null int32
severity_type 61839 non-null int32
location 61839 non-null int32
fault_severity 61839 non-null int64
dtypes: int32(5), int64(3)
memory usage: 3.1 MB
After taking a look at each variable using value counts, we see that each feature is fairly abstract, making it difficult to tell at face value what each feature physically corresponds to. This means we will not be able to semantically group any features together.
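A quick way to run that check, looping value_counts over the columns shown in df.info() above:
#Peek at the distribution of each categorical feature
for col in ['event_type', 'log_feature', 'resource_type', 'severity_type', 'location']:
    print(df[col].value_counts().head(), '\n')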
Since log_feature and volume came in the same file, we’ll look at how they relate to one another:
Taking the log of volume gives us a better scale, so we will be using the log transform of the ‘volume’ feature.
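A sketch of that transform; np.log1p is used here so zero volumes stay defined (the original may have used a plain np.log), and the log_volume column name is just for illustration:
#Compress the long right tail of volume with a log transform
df['log_volume'] = np.log1p(df['volume'])
df['log_volume'].hist(bins=50)
plt.xlabel('log(1 + volume)')
plt.show()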
Let’s continue to look into the relationships surrounding log_feature by looking at a boxplot of how the volumes vary depending on our target: fault severity.
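That boxplot could be drawn along these lines, reusing the log_volume column assumed above:
#Distribution of log volume within each fault_severity class
sns.boxplot(x='fault_severity', y='log_volume', data=df)
plt.show()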
Let's take a look at some more graphs to try to get a sense of how the variables are related to one another.
This graph shows the frequency of each fault severity per resource type and severity type.
We can see that resource type 5 only ever has fault severity 2, i.e. major issues. We can also see that resource types 2, 3, and 9 have a higher chance of fault severity 0.
We can see some relationships within each event type or resource type, but no real overarching trends. This is why we have models to pick up on the more subtle relationships between variables.
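For reference, a frequency plot like the one described above can be built with a seaborn countplot; the exact chart in the post may have been produced differently:
#Count of each fault_severity class within each resource type
sns.countplot(x='resource_type', hue='fault_severity', data=df)
plt.xticks(rotation=45)
plt.show()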
Let's prepare our data for the model by getting dummies on our object columns. After that, we will group our data by ID number, since we want to look at the fault severity at each ID.
data = data.groupby('id', sort = False).sum()
First, we convert resource_type, severity_type, event_type, and log_feature to string values and use pd.get_dummies on them, since we don't want our model to infer direction from the encoded values (resource_type 2 is not higher than resource_type 1).
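A minimal sketch of that encoding step (it would run before the groupby above); the column list comes from the text, but building data from the merged df this way is an assumption:
#Cast the categorical columns to strings, then one-hot encode them
cat_cols = ['resource_type', 'severity_type', 'event_type', 'log_feature']
df[cat_cols] = df[cat_cols].astype(str)
data = pd.get_dummies(df, columns=cat_cols)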
We then separate the feature and target variables, and randomly split the rows into training and test sets using sklearn's train_test_split. We then fit a gradient boosting classifier to the training data and use it to predict.
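A sketch of how that split and the classifier could be set up; the variable names match the code below, but the test size and random_state are assumptions rather than the values actually used, and exactly how fault_severity is recovered per ID after the aggregation isn't shown in the post:
#Separate features and target, then split into train and test sets
x = data.drop('fault_severity', axis=1)
y = data['fault_severity']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
gbc = GradientBoostingClassifier(random_state=42)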
#The Gradient Boosting Classifier fits our data the best
gbc_model = gbc.fit(x_train, y_train)
#Note: this accuracy is measured on the training set
gbc_accuracy = accuracy_score(y_train, gbc.predict(x_train))
print(gbc_accuracy)
#setting up prediction probabilities for each row
y_pred_proba = gbc.predict_proba(x_test)
y_pred = gbc.predict(x_test)
After predicting the probabilities, we want to create an output that is easily readable by our supervisor so that she can send the service techs to the correct locations.
#Make a DataFrame of the predictions
result = pd.DataFrame({
    "id": x_test.index,
    "Predicted fault_severity": y_pred,
    "prediction_probability_0": y_pred_proba[:, 0],
    "prediction_probability_1": y_pred_proba[:, 1],
    "prediction_probability_2": y_pred_proba[:, 2]},
    columns=['id', 'Predicted fault_severity', 'prediction_probability_0',
             'prediction_probability_1', 'prediction_probability_2'])
result.head(10)
Next, we sort the values by most likely to have a severe fault.
result.sort_values(['prediction_probability_2'], ascending=False, inplace=True)
result.head(20)
After we get this table, all we need to do is write it to a csv and send it to our boss so the service techs can do their thing!
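Writing it out is one line; the file name here is just an example:
#Save the ranked predictions for the morning report
result.to_csv('fault_severity_predictions.csv', index=False)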
Our predictions with the basic model are decent, but as you can see in the confusion matrix below, there is only a 58% chance that we will predict a fault severity of 2 correctly.
After some hyperparameter tuning, we can increase this to a 63% chance that a 2 is predicted correctly, with a nearly 80% chance of flagging at least a minor issue at the site. Given that the assignment is to guide a technician's first call of the day, 63% is a considerable improvement over chance for very low risk. A sketch of how such a search could be run follows.
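The post doesn't list which parameters were tuned; this GridSearchCV sketch is illustrative only, and the grid values are assumptions:
from sklearn.model_selection import GridSearchCV

#Illustrative grid -- the actual tuned values are not shown in the post
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=3, scoring='f1_macro')
grid.fit(x_train, y_train)
print(grid.best_params_)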
Basic GBC Model:
Parameter Tuned GBC:
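For reference, per-class rates like the ones quoted above can be read off a row-normalized confusion matrix; a minimal sketch using sklearn (the figures in the post were made with yellowbrick, so this is an alternative, not the original code):
from sklearn.metrics import confusion_matrix

#Rows are true classes, columns are predictions; row-normalizing gives per-class recall
cm = confusion_matrix(y_test, gbc.predict(x_test), normalize='true')
print(cm.round(2))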
Graphs were made using matplotlib, seaborn, and yellowbrick.
A link to the Jupyter notebook may be added here soon, or it can be sent upon request.