Building a Predictive Maintenance Model

Hello and welcome to another machine learning project! Today, our task is to build a predictive maintenance model for a transportation company. They have a telemetry device attached to the ECU of each vehicle and collect readings whenever a vehicle completes a trip OR breaks down in the field. Instead of checking every vehicle on a cyclical basis (once a month), we will build a model that can read an observation and recommend whether a vehicle will need maintenance. This will save the company money in two ways: reducing the number of breakdowns in the field, and increasing the time between cyclical vehicle check-ups.

Step 1: Exploratory Data Analysis

As always, the first step is to see what kind of data we’re working with.

import pandas as pd

data = pd.read_csv("telemetry_failures.csv")

data.head()
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute8 attribute9
0 2015-01-01 S1F01085 0 215630672 56 0 52 6 407438 0 0 7
1 2015-01-01 S1F0166B 0 61370680 0 3 0 6 403174 0 0 0
2 2015-01-01 S1F01E6Y 0 173295968 0 0 0 12 237394 0 0 0
3 2015-01-01 S1F01JE0 0 79694024 0 0 0 6 410186 0 0 0
4 2015-01-01 S1F01R2B 0 135970480 0 0 0 15 313173 0 0 3

We’ll also take a look at our target variable, ‘failure’:

data['failure'].value_counts()
0    124388
1       106
Name: failure, dtype: int64

As we can see, the classes are extremely imbalanced. In previous projects, we’ve had imbalanced classes and tried to work around it. In this project, I will address the issue directly using the Python library imbalanced-learn (imblearn).

After exploring each column in detail using value counts and a pandas profiling report, I discovered that each attribute is fairly abstract, so we will have to use the shape of each attribute's distribution to decide how to treat it.

  • Based on the date column, we can tell that this data is taken from 11 months of observations
  • Attributes 2, 3, 4, 7, 9 all seem to be codes since the majority of their contents are 0s.
  • Attributes 1 and 6 have a fairly even spread, so we will treat those as actual numerical features.
  • Attribute 7 is identical to attribute 8. Since we don’t want essentially twice the effect of that variable, we will drop attribute 8.
data.drop('attribute8', inplace=True, axis=1)
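For reference, the duplication is easy to verify; the check below is a minimal sketch and assumes it runs ahead of the drop above.

#Sanity check: attribute7 and attribute8 should match in every row
print((data['attribute7'] == data['attribute8']).all())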

Visual EDA

Now comes the fun part of EDA: looking at graphs to try to tease out trends or interesting aspects that may help with feature engineering later on. These are also useful for reporting purposes.

Time Series and Seasonality Differences

The first thing that we’ll look at is the rate of device failures across the duration of the study.

Graph1

So we can see that there are more TOTAL failures in January 2015, but the overall RATE of failure (failures / number of shipments) had a larger spike at the end of October 2015, at around a 3% failure rate. As we can see below, this means there are some seasonal differences in the number of shipments.

Graph2
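For reference, the daily failure rate behind these plots can be computed with a quick resample; this is a minimal sketch (the variable names are illustrative) and assumes the date column is parsed as datetime.

#Failures and shipments per day, and the resulting failure rate
daily = data.set_index(pd.to_datetime(data['date'])).resample('D')['failure'].agg(['sum', 'count'])
daily['failure_rate'] = daily['sum'] / daily['count']

#Days with the highest failure rate should line up with the late-October spike above
print(daily.sort_values('failure_rate', ascending=False).head())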

General Distribution Comparisons

Since our target is not normally distributed, I wanted to take a closer look at how close it is to a Poisson distribution, which models the number of events you expect in a fixed time frame when the events occur independently of one another.

Graph3

We can see it follows the same general shape as a Poisson distribution, though of course the fit is not perfect.
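A rough way to make that comparison is to line up the observed daily failure counts against a Poisson distribution with the same mean; a minimal sketch using scipy (the variable names here are illustrative):

from scipy import stats

#Observed number of failures per day
daily_failures = data.groupby('date')['failure'].sum()

#Poisson pmf with the same mean as the observed daily counts
lam = daily_failures.mean()
observed = daily_failures.value_counts(normalize=True).sort_index()
expected = pd.Series(stats.poisson.pmf(observed.index, mu=lam), index=observed.index)

print(pd.DataFrame({'observed': observed, 'poisson_pmf': expected}))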

Groupby Graphs

From here I wanted to take a closer look at how failed device features differed from working device features.

failed_devices = vis_data[vis_data['failure'] == 1]
working_devices = vis_data[vis_data['failure'] == 0]

After taking a sum of the categorical codes that we specified earlier within each group, we can see whether certain codes have ‘comorbidity’: the presence of one attribute meaning a higher chance of other attributes also being present. If there were no comorbidity, we would expect the presence of each attribute to differ by exactly one ‘total code’.
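The totals behind the next graph come from summing those code-style attributes within each group; a minimal sketch (the exact column list is my reading of the EDA above):

#Code-style attributes identified earlier (attribute8 was already dropped)
code_cols = ['attribute2', 'attribute3', 'attribute4', 'attribute7', 'attribute9']

#Total of each code among failed vs. working devices
code_totals = pd.DataFrame({'failed': failed_devices[code_cols].sum(),
                            'working': working_devices[code_cols].sum()})
print(code_totals)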

Graph4

After looking at this, I wanted to see if there was an effect of seasonality as well. The graph below is the same as above with an extra variable (season) also shown.

Graph5

This graph is useful for showing the seasonality of each attribute in failed devices. For example, we see that attributes 2, 3, and 9 rarely occur in the fall. This could simply be due to the low sample size of fall failures.

OK, we have a pretty good sense of the data now, so we can start really drilling down and building a DataFrame to model.

Groupbys and Feature Engineering

The first thing that we can do is add a variable for seasonality.

#Adding the season variable
data['date'] = pd.to_datetime(data['date'])
data['season'] = data['date'].apply(lambda dt: (dt.month%12 + 3)//3)
data['season'] = data['season'].astype(str)
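The month arithmetic maps December–February to 1, March–May to 2, June–August to 3, and September–November to 4 (winter through fall); a quick check:

#Sanity check of the month -> season mapping
for month in range(1, 13):
    print(month, (month % 12 + 3) // 3)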

The next thing that we’ll do is identify the devices that went out after failure. Of the 106 devices that failed, there were 5 that went on another trip after their failure date.

I accomplished this by grouping the devices by failure/non-failure, tagging the max date each device went out, and then re-combining and comparing those DataFrames. (A sketch of this logic follows the table below.)

attribute1_fail attribute2_fail attribute3_fail attribute4_fail attribute5_fail attribute6_fail attribute7_fail attribute9_fail failure_date
device
S1F023H2 64499464 0 0 1 19 514661 16 3 2015-01-19
S1F03YZM 110199904 240 0 0 8 294852 0 0 2015-08-03
S1F09DZQ 77351504 2304 0 3 7 418563 0 2 2015-07-18
S1F0CTDN 184069720 528 0 4 9 387871 32 3 2015-01-07
S1F0DSTY 97170872 2576 0 60 12 462175 0 0 2015-02-14
S1F0F4EB 243261216 0 0 0 10 255731 0 3 2015-05-07
S1F0GG8X 54292264 64736 0 160 11 192179 0 2 2015-01-18
S1F0GJW3 83874704 2144 0 401 9 229593 16 0 2015-03-17
S1F0GKFX 121900592 0 0 18 65 246719 0 0 2015-04-27
S1F0GKL6 160459104 0 0 2 90 249366 0 0 2015-05-13
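A minimal sketch of that comparison is below; the intermediate variable names are illustrative rather than the exact ones from the notebook.

#Last recorded failure date per failed device
failure_dates = data[data['failure'] == 1].groupby('device')['date'].max()

#Last date each device went out on any trip
last_trip_dates = data.groupby('device')['date'].max()

#Failed devices whose last trip came after their recorded failure date
went_out_after_failure = last_trip_dates.loc[failure_dates.index] > failure_dates
print(went_out_after_failure[went_out_after_failure].index.tolist())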

Step 2: Building the Model

Creating the Model DataFrame

#Aggregate the raw observations down to one row per device
device_group = data.groupby('device').agg({'failure': 'sum', 'attribute1' : 'mean', 'attribute2': ['min','max'],
                                           'attribute3':['min','max'], 'attribute4': ['min','max'], 'attribute5': 'mean',
                                           'attribute6': 'mean', 'attribute7': ['min','max'], 'attribute9':['min','max', 'count']})

#Flatten the MultiIndex columns; the attribute9 count doubles as the number of recorded trips ('days working') per device
device_columns = ['failure', 'attribute1mean','attribute2min','attribute2max', 'attribute3min','attribute3max',
                  'attribute4min','attribute4max', 'attribute5mean', 'attribute6mean', 'attribute7min','attribute7max',
                  'attribute9min','attribute9max', 'days working']

device_group = pd.DataFrame(device_group, index = device_group.index)
device_group.columns = device_columns

#Creating the sent_after_failure column
sent_after_failure = ['S1F0GPFZ', 'S1F136J0', 'W1F0KCP2', 'W1F0M35B', 'W1F11ZG9']

device_group['sent_after_failure'] = 0
device_group.loc[sent_after_failure, 'sent_after_failure'] = 1

From here, I separate the features and target columns, and create a user-defined function to finish the pre-processing (dummies on all the object columns).
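That helper isn't reproduced here, but its core looks something like the sketch below; this is a minimal version and assumes device_group carries any object-typed columns (such as a per-device season) that need encoding.

#Separate the target from the features
target = device_group['failure']
features = device_group.drop('failure', axis=1)

#One-hot encode any remaining object (string) columns
object_cols = features.select_dtypes(include='object').columns
features = pd.get_dummies(features, columns=list(object_cols))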

Modeling and Pipelines

In many projects, the target variable will be imbalanced by the very nature of the problem. I would hope there is a much higher number of devices that did NOT fail than devices that did fail. Or, for an example from a previous project: more people will repay a loan than default on one.

I will go over one technique to solve this problem of class imbalance. I will also show a bit of imblearn’s pipeline (a drop-in for sklearn’s Pipeline that also accepts re-samplers like SMOTE), which is useful for comparing different models, hyperparameters, and balancing techniques.

#imports needed
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from yellowbrick.classifier import ClassificationReport, ConfusionMatrix, ROCAUC

#Instantiate SMOTE, Random Forest Classifier, and Gradient Boosting Classifier

sm = SMOTE(sampling_strategy= 'auto', random_state=42)
sm2 = SMOTE(sampling_strategy= 1 , random_state=42)


r = RandomForestClassifier()
gbc = GradientBoostingClassifier()
gbc2 = GradientBoostingClassifier(learning_rate= 1.5, n_estimators= 40, max_depth= 12, random_state=42)
l = LogisticRegression()


rf_pipeline = make_pipeline(sm, r)

rf_pipeline2 = make_pipeline(sm2, r)

gbc_pipeline = make_pipeline(sm, gbc)

gbc_pipeline2 = make_pipeline(sm2, gbc)

gbc_pipeline3 = make_pipeline(sm2, gbc2)


classes = ["Working Device", "Failed Device"] #Needed for the visualizers

Synthetic Minority Over-sampling Technique (SMOTE) is a re-sampling technique that synthetically generates data points that are similar to the minority class (failed devices). The ratio of samples in the minority vs. majority class can be manipulated using the sampling_strategy argument.
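To make the effect of SMOTE concrete, here is a small before/after check run directly on the feature table; this is a minimal sketch for illustration only, since inside the pipelines the re-sampling is applied automatically to the training data during fit.

from collections import Counter

#Class counts before and after over-sampling the failed-device class to a 1:1 ratio
print('before:', Counter(target))
X_res, y_res = sm2.fit_resample(features, target)
print('after: ', Counter(y_res))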

In addition to that, I wanted to compare the difference between Random Forest and Gradient Boosting models, since they performed the best in my user-defined function that quickly compares models (a rough sketch of which is shown below).
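That comparison helper isn't shown in full, but it boils down to something like this hypothetical sketch (the function name and metric choice are mine, and it uses the train/test split created in the next step):

#Hypothetical quick comparison: fit each candidate model and report recall on failed devices
def compare_models(models, X_train, X_test, y_train, y_test):
    for name, model in models.items():
        model.fit(X_train, y_train)
        rec = recall_score(y_test, model.predict(X_test))
        print(f"{name}: recall on failed devices = {rec:.2f}")

compare_models({'random forest': r, 'gradient boosting': gbc},
               X_train, X_test, y_train, y_test)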

I created five pipelines and compared metrics between them; an example of the best pipeline (gbc_pipeline2) is shown below.

#Train/Test/Split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = .3,
stratify=target, random_state = 42)

# Pipeline here
gbc_pipeline2.fit(X_train, y_train)

#Sklearn Classification report
y_pred = gbc_pipeline2.predict(X_test)

print(classification_report(y_test, y_pred))

Then we run each pipeline through Yellowbrick’s classification report, confusion matrix, and ROCAUC functionalities.

#YellowBrick Classification report

visualizer = ClassificationReport(gbc_pipeline2, classes=classes)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)

Graph6

Since our objective is to predict which devices will fail on the next trip, the metric we want to focus on here is the Recall for failed devices.

#Confusion Matrix
cm = ConfusionMatrix(gbc_pipeline2, classes=[0,1])
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

Graph7

In the confusion matrix, we can see that we correctly classify 22 of the 32 failed devices in the test set. This is useful for getting a sense of how our algorithm would classify devices given a realistic spread of the data.

#Yellowbrick ROC/AUC

visualizer = ROCAUC(gbc_pipeline2, classes=classes)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)

Graph8

The ROC curve shows how well the algorithm can distinguish between a failed device and a functioning device. With an AUC of roughly 0.87, the model will rank a randomly chosen failed device above a randomly chosen working device about 87% of the time.
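The same number can be pulled out directly with sklearn, using the pipeline's predicted probability of failure; a minimal sketch:

from sklearn.metrics import roc_auc_score

#AUC from the predicted probability of the failed-device class
failure_probs = gbc_pipeline2.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, failure_probs))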

Conclusion

# Final test with the whole dataset:

#Confusion Matrix
cm = ConfusionMatrix(gbc_pipeline2, classes=[0,1])

# Fit trains the passed pipeline on the training data, then we score it against every observation
cm.fit(X_train, y_train)
cm.score(features, target)

Graph9

Our goal was to create a machine learning model to predict whether a vehicle will fail on its next trip based on data from a telemetry device. Using this model, the company can now catch 94 out of every 106 devices that fail. This will increase the company's rate of successfully completed trips, increasing their revenue in multiple ways (saving towing costs, better customer relations).

Thank you for taking the time to read through this project, and if you have any questions or concerns please don't hesitate to reach out via email!