Customer Segmentation

Hello again and welcome to another one of my projects! This time, the boss has given us a dataset with previous transaction data and is asking for a few different things:

  • Separate data into 3 different tables (Customers/Orders/Products) and write to a SQL database for storage.
  • Create a Recency-Frequency-Monetary table for reporting.
  • Segment the customers into groups using a basic KMeans clustering algorithm.

These clusters could be used for anything from pushing different advertisements to different customers, to inviting back customers who haven't bought from our site in a while with a deal, to capturing general demographic information for later use. We will be evaluating our model using the silhouette score, Calinski-Harabasz score, and distortion score. Let's get to it!

Step 1: Primary Analysis and Cleaning

After importing our libraries, we’ll read in the data from the single csv file that we’ve been given.
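For reference, the imports assumed throughout the snippets below look roughly like this (a representative list; the exact imports in the original notebook may differ slightly):

#Libraries used in the snippets that follow
import numpy as np
import pandas as pd
import time
import sqlite3

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score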

#import data:
df = pd.read_csv("customers.csv")

df.info()
List of features

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24958 entries, 0 to 24957
Data columns (total 74 columns):
Customers.id                     24950 non-null object
Customers.fname                  8572 non-null object
Customers.lname                  8572 non-null object
Customers.create_date            8572 non-null object
Customers.mailing                8572 non-null object
Customers.last_modified          8572 non-null object
Orders.id                        8572 non-null object
Orders.fname                     8572 non-null object
Orders.lname                     8572 non-null object
Orders.order_number              8572 non-null object
Orders.currency                  8572 non-null object
Orders.subtotal                  8572 non-null float64
Orders.shipping                  8572 non-null float64
Orders.total                     8572 non-null object
Orders.shipping_carrier          8571 non-null object
Orders.shipping_method           8564 non-null object
Orders.tracking                  8564 non-null object
Orders.payment_status            8564 non-null float64
Orders.payment_date              8564 non-null float64
Orders.payment_type              8564 non-null object
Orders.payment_amount            8564 non-null float64
Orders.payment_id                8564 non-null object
Orders.payment_code              8564 non-null object
Orders.status                    8564 non-null float64
Orders.placed_date               8564 non-null float64
Orders.updated_date              8564 non-null float64
Orders.shipped_date              8564 non-null float64
Order_Items.id                   8564 non-null float64
Order_Items.product_id           8564 non-null float64
Order_Items.product_name         8564 non-null object
Order_Items.qty                  8564 non-null float64
Order_Items.price                8564 non-null float64
Order_Items.cost                 8564 non-null float64
Products.id                      8564 non-null float64
Products.template                8564 non-null object
Products.vendor                  8564 non-null float64
Products.import_id               8564 non-null float64
Products.name                    8564 non-null object
Products.display_name            8564 non-null object
Products.list_price              8564 non-null float64
Products.price                   8564 non-null float64
Products.cost                    8564 non-null float64
Products.flags                   8564 non-null float64
Products.last_modified           8564 non-null float64
Products.taxable                 8564 non-null float64
Products.shopping_gtin           8564 non-null float64
Products.shopping_brand          8564 non-null object
Products.shopping_mpn            8564 non-null object
Products.shopping_flags          8564 non-null float64
Products.amazon_asin             8564 non-null object
Products.amazon_item_type        8564 non-null object
Products.google_shopping_id      8564 non-null object
Products.google_shopping_type    8564 non-null object
Products.google_shopping_cat     8564 non-null object
Products.shopping_type           8564 non-null object
Products.pricegrabber_cat        8564 non-null object
Products.thefind_cat             8564 non-null object
Products.quickbooks_id           8564 non-null object
Products.qb_edit_sequence        8564 non-null float64
Products.short_description      8564 non-null object
Products.long_description        8557 non-null object
Products.seo_title               4186 non-null object
Products.seo_url                 4186 non-null object
Products.unit                    4186 non-null object
Products.packaging               4186 non-null object
Products.multiple                4186 non-null object
Products.upc                     4186 non-null float64
Products.hcpcs                   4186 non-null object
Products.case_qty                4186 non-null float64
Products.import_flags            4186 non-null float64
Products.shipping_length         4186 non-null float64
Products.shipping_width          4186 non-null float64
Products.shipping_height         4186 non-null float64
Products.family_id               4186 non-null object
dtypes: float64(32), object(42)
memory usage: 14.1+ MB

As we can see, there are quite a few different kinds of data in this table. Since our first task is to separate the data into three tables and write them to a SQL database, we'll see if there's an ID column we can use to split everything apart easily.

#Since we're going to separate the tables, check to see if we can do so easily
#by separating on ID, no such luck
df[['Customers.id', 'Orders.id', 'Products.id']].head()
Customers.id Orders.id Products.id
0 797 3758 2310.0
1 3 23 177.0
2 3 9531 1.0
3 4 29 983.0
4 5 30 991.0

No such luck, let’s keep exploring the data and see what else we can find out about the different tables that we’ll be creating.

pd.DataFrame([{'customers': len(df['Customers.id'].value_counts()),
               'products': len(df['Products.id'].value_counts()),    
               'orders': len(df['Orders.id'].value_counts()),  
              }], columns = ['products', 'orders', 'customers'], index = ['quantity'])
products orders customers
quantity 1710 3568 3058

So there are roughly 3,058 unique customer IDs (a handful of which turn out to be junk), who made 3,568 unique orders covering 1,710 unique products. That will be useful information later on. While exploring, I also found a bunch of nonsense rows; let's take those out.

df['Customers.id'].replace('\t<li>Commode liners can be used with most commode buckets.</li>', np.NaN, inplace = True)
df['Customers.id'].replace('\t<li>Liners include an absorbent pad which solidifies the waste and makes clean up easy and hygienic.</li>',
                                  np.NaN, inplace = True)


#thresh=8 keeps only rows with at least 8 non-null values; with ~74 columns,
#this drops the junk rows that are almost entirely empty
df.dropna(thresh= 8, inplace = True)

df.reset_index(inplace=True, drop=True)

As a final step in exploration, let's take a quick look at some of the transactions for each customer.

#Lets take a quick look at our customers and the order prices
Customers_groupdf = df.groupby(['Customers.id', 'Customers.fname', 'Customers.lname'])['Order_Items.price',
                                                                                       'Order_Items.cost', 'Orders.total'].sum()

Customers_groupdf.head(20)
Customers groupby table
Order_Items.price Order_Items.cost
Customers.id Customers.fname Customers.lname
3.0 John Smith 73.78 54.37
4.0 James Anderson 19.56 12.62
5.0 Abraham Pollak 95.14 66.33
7.0 peggy thompson 39.19 27.99
8.0 Randy Pruss 59.75 45.96
10.0 Tommy Smith 34.00 34.00
11.0 Mark Tremble 34.00 34.00
12.0 Emely Cooke 10.76 2.82
13.0 george mcmillin 118.68 85.22
14.0 adrian Cavitt 339.99 339.99
15.0 Sharon Mueller 18.94 11.32
21.0 Corey Edmondson 34.00 34.00
22.0 Robert Miller 6.84 1.71
23.0 Mekala Whitaker 141.40 100.40
24.0 Richard Ariano 29.38 20.26
25.0 marc gorzynski 35.00 35.00
26.0 Richard L. Shaak 35.00 35.00
27.0 Kenneth Schmude 125.16 31.29
30.0 Jesse Spalding 35.00 35.00
31.0 Alan Safir 56.78 43.68

Step 2: Separate into 3 DataFrames

In this step we’ll separate each chunk of our original table, clean them up and at the end, write it all into a SQL server.

#Customers DataFrame
customers = df.loc[: , :'Customers.last_modified']
customers.columns = [col.split('.')[1] for col in customers.columns]

#Orders DataFrame, we need to specify columns because they start with both:
#Orders.colname and Order_Items.colname which sometimes overlap
orders = df.loc[: , 'Orders.id':'Order_Items.cost']
orders_cols = ['id', 'fname', 'lname', 'order_number', 'currency', 'subtotal', 'shipping', 'total',  'shipping_carrier',
               'shipping_method', 'tracking', 'payment_status', 'payment_date', 'payment_type', 'payment_amount', 'payment_id',
               'payment_code', 'status', 'placed_date', 'updated_date', 'shipped_date', 'Items.id', 'Items.product_id',
               'Items.product_name', 'Items.qty', 'Items.price', 'Items.cost']
orders.columns = orders_cols

#Products DataFrame
products = df.loc[: , 'Products.id':]
products.columns = [col.split('.')[1] for col in products.columns]
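As a quick sanity check (not part of the original notebook), we can confirm why the orders slice needed an explicit column list: the Orders and Order_Items prefixes share the suffix 'id', so stripping the prefixes blindly would create duplicate column names.

#Hypothetical check: which suffixes collide between Orders.* and Order_Items.*?
orders_suffixes = {c.split('.')[1] for c in df.columns if c.startswith('Orders.')}
items_suffixes = {c.split('.')[1] for c in df.columns if c.startswith('Order_Items.')}
print(orders_suffixes & items_suffixes)   #at minimum, 'id' shows up in both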

Customers Table

Slicing each entity out of the larger table leaves a bunch of duplicate rows, so the first step for each table will be dropping the duplicates that share an ID.

#As you can see from this example, there are many duplicates
customers[customers['id'] == 3371.0]
Customer duplicates
id fname lname create_date mailing last_modified
3821 3371 Terry Rich 1461637787 1.0 1461637787
3822 3371 Terry Rich 1461637787 1.0 1461637787
3823 3371 Terry Rich 1461637787 1.0 1461637787
3824 3371 Terry Rich 1461637787 1.0 1461637787
3825 3371 Terry Rich 1461637787 1.0 1461637787
3826 3371 Terry Rich 1461637787 1.0 1461637787
3827 3371 Terry Rich 1461637787 1.0 1461637787
3828 3371 Terry Rich 1461637787 1.0 1461637787
3829 3371 Terry Rich 1461637787 1.0 1461637787
3830 3371 Terry Rich 1461637787 1.0 1461637787
3831 3371 Terry Rich 1461637787 1.0 1461637787
3832 3371 Terry Rich 1461637787 1.0 1461637787
3833 3371 Terry Rich 1461637787 1.0 1461637787
3834 3371 Terry Rich 1461637787 1.0 1461637787
3835 3371 Terry Rich 1461637787 1.0 1461637787
3836 3371 Terry Rich 1461637787 1.0 1461637787
3837 3371 Terry Rich 1461637787 1.0 1461637787
3838 3371 Terry Rich 1461637787 1.0 1461637787
3839 3371 Terry Rich 1461637787 1.0 1461637787
3840 3371 Terry Rich 1461637787 1.0 1461637787
3841 3371 Terry Rich 1461637787 1.0 1461637787
3842 3371 Terry Rich 1461637787 1.0 1461637787
3843 3371 Terry Rich 1461637787 1.0 1461637787
3844 3371 Terry Rich 1461637787 1.0 1461637787
3845 3371 Terry Rich 1461637787 1.0 1461637787
3846 3371 Terry Rich 1461637787 1.0 1461637787
3847 3371 Terry Rich 1461637787 1.0 1461637787
3848 3371 Terry Rich 1461637787 1.0 1461637787
3849 3371 Terry Rich 1461637787 1.0 1461637787
3850 3371 Terry Rich 1461637787 1.0 1461637787
... ... ... ... ... ... ...
8162 3371 Terry Rich 1461637787 1.0 1461637787
8163 3371 Terry Rich 1461637787 1.0 1461637787
8164 3371 Terry Rich 1461637787 1.0 1461637787
8165 3371 Terry Rich 1461637787 1.0 1461637787
8166 3371 Terry Rich 1461637787 1.0 1461637787
8167 3371 Terry Rich 1461637787 1.0 1461637787
8168 3371 Terry Rich 1461637787 1.0 1461637787
8169 3371 Terry Rich 1461637787 1.0 1461637787
8170 3371 Terry Rich 1461637787 1.0 1461637787
8171 3371 Terry Rich 1461637787 1.0 1461637787
8172 3371 Terry Rich 1461637787 1.0 1461637787
8173 3371 Terry Rich 1461637787 1.0 1461637787
8174 3371 Terry Rich 1461637787 1.0 1461637787
8175 3371 Terry Rich 1461637787 1.0 1461637787
8176 3371 Terry Rich 1461637787 1.0 1461637787
8177 3371 Terry Rich 1461637787 1.0 1461637787
8178 3371 Terry Rich 1461637787 1.0 1461637787
8179 3371 Terry Rich 1461637787 1.0 1461637787
8180 3371 Terry Rich 1461637787 1.0 1461637787
8181 3371 Terry Rich 1461637787 1.0 1461637787
8182 3371 Terry Rich 1461637787 1.0 1461637787
8183 3371 Terry Rich 1461637787 1.0 1461637787
8184 3371 Terry Rich 1461637787 1.0 1461637787
8185 3371 Terry Rich 1461637787 1.0 1461637787
8186 3371 Terry Rich 1461637787 1.0 1461637787
8187 3371 Terry Rich 1461637787 1.0 1461637787
8188 3371 Terry Rich 1461637787 1.0 1461637787
8189 3371 Terry Rich 1461637787 1.0 1461637787
8190 3371 Terry Rich 1461637787 1.0 1461637787
8191 3371 Terry Rich 1461637787 1.0 1461637787

4371 rows × 6 columns

customers.drop_duplicates(subset='id', inplace = True)

#After doing this, there is one more ID that is null, we will drop this as well
customers.drop(index = [801], inplace = True)

Our next step is converting the columns to the correct data types so that we can work with the data in Python later if we need to. This matters less for writing to the SQL database, since we can specify the formatting there later on.

#Converting some columns to correct data type
customers['create_date'] = customers['create_date'].astype(int)
customers['mailing'] = customers['mailing'].astype(float)
customers['last_modified'] = customers['last_modified'].astype(int)
customers['create_date'] = customers['create_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
customers['last_modified'] = customers['last_modified'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))

Let's take one more look at the info to make sure we're good to go.

customers.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3054 entries, 0 to 8571
Data columns (total 6 columns):
id               3054 non-null object
fname            3054 non-null object
lname            3054 non-null object
create_date      3054 non-null object
mailing          3054 non-null float64
last_modified    3054 non-null object
dtypes: float64(1), object(5)
memory usage: 167.0+ KB

Orders Table

As before, we will first drop the duplicates. Then we drop the remaining rows with null values, about 3 rows in this table.

orders.drop_duplicates(subset='id', inplace = True)
orders.dropna(inplace = True)

Then we do the feature transformations similar to the previous table.

orders['total'] = orders['total'].astype(float)
orders['updated_date'] = orders['updated_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
orders['shipped_date'] = orders['shipped_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))

Next, we’ll create a column ‘profit’ that will be useful for segmentation later. This is also useful to get a quick snapshot of how much our company is making from each transaction.

orders['profit'] = (orders['Items.price'] - orders['Items.cost']) / orders['Items.qty']
orders['profit'].describe()
count    3565.000000
mean       15.067519
std        22.067572
min       -29.510000
25%         6.010000
50%        10.620000
75%        18.000000
max       549.000000
Name: profit, dtype: float64

Already the benefit of creating this column is apparent: there are some orders we are actually LOSING money on! We can filter for the transactions where profit is less than 0 for further analysis later on. We won't be covering that in this specific project, though; we have to stay focused on the task at hand!

#we can take a look at which orders we're losing money on and why
orders[orders['profit'] < 0]
Orders we lose money on
id fname lname order_number currency subtotal shipping total shipping_carrier shipping_method tracking payment_status payment_date payment_type payment_amount payment_id payment_code status placed_date updated_date shipped_date Items.id Items.product_id Items.product_name Items.qty Items.price Items.cost profit
812 15142 Marcia Olsen 15142 USD 29.97 0.00 29.97 fedex 11|Ground 7.76035E+11 3.0 1.459800e+09 authorize.net 29.97 8133230477 03161Z 1.0 1.459800e+09 2016-04-04 15:39:40 2016-04-04 15:39:40 17707.0 1454.0 Sterile Bordered Gauze 3.0 9.99 10.050000 -0.020000
1815 5867 PJ Nassi 5867 USD 32.16 9.95 42.11 fedex 11|Ground 6.49865E+13 3.0 1.441042e+09 authorize.net 42.11 7483749716 263228 1.0 1.441042e+09 2015-09-02 20:32:37 2015-08-31 19:35:17 7614.0 2110.0 MoliCare Disposable Super Plus Briefs, Large/X... 3.0 10.72 19.330000 -2.870000
2057 7327 Leigh Anne Duncan 7327 USD 99.90 0.00 99.90 fedex 11|Ground 7.74823E+11 3.0 1.445780e+09 paypal 99.90 1W093636Y5791203L 02708Z 1.0 1.445780e+09 2015-11-02 19:32:49 2015-10-27 07:01:00 9368.0 1454.0 Medline Sterile Bordered Gauze 10.0 9.99 10.050000 -0.006000
2184 8629 Michael Zayats 8629 USD 19.98 5.54 17.98 fedex 11|Ground 9.4055E+21 3.0 1.448205e+09 authorize.net 17.98 7727592306 513 1.0 1.448060e+09 2015-11-26 10:48:07 2015-11-23 09:09:28 10719.0 1454.0 Sterile Bordered Gauze 2.0 9.99 10.050000 -0.030000
2615 11582 Teresa Joslin 11582 USD 65.99 6.93 65.99 fedex 11|Ground 7.75441E+11 3.0 1.452899e+09 authorize.net 65.99 7897798685 01591R 1.0 1.452899e+09 2016-02-08 11:29:29 2016-01-18 12:21:13 13844.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 65.99 88.500000 -22.510000
3071 13228 Barbara Hadley 13228 USD 176.97 23.67 176.97 fedex 11|Ground 7.75731E+11 3.0 1.456368e+09 authorize.net 176.97 8013360365 243464 1.0 1.456368e+09 2016-03-02 14:15:17 2016-02-25 07:21:39 15648.0 1841.0 Emesis Bags, Blue, 36.000 OZ 3.0 58.99 88.500000 -9.836667
3086 13267 Marcy Seaman 13267 USD 119.97 0.00 119.97 fedex 11|Ground 6.63949E+11 3.0 1.456436e+09 authorize.net 119.97 8016026233 245275 1.0 1.456436e+09 2016-03-02 14:16:22 2016-03-01 07:11:54 15691.0 17051.0 Bottom Buddy toilet tissue aid 3.0 39.99 40.500000 -0.170000
3095 13302 Melody Hollowell 13302 USD 58.99 6.71 58.99 fedex 11|Ground 7.75743E+11 3.0 1.456506e+09 authorize.net 58.99 8018282377 10043 1.0 1.456506e+09 2016-03-02 14:17:21 2016-02-26 09:04:14 15731.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
3134 13529 Martha Keler 13529 USD 69.99 6.71 69.99 fedex 11|Ground 7.75777E+11 3.0 1.456858e+09 authorize.net 69.99 8031260682 6135 1.0 1.456858e+09 2016-03-02 14:23:42 2016-03-02 08:49:41 15970.0 1453.0 Sterile Bordered Gauze 1.0 69.99 78.140000 -8.150000
3180 16691 Staci Meredith 16691 USD 14.85 0.00 14.85 fedex 11|Ground 6.87398E+14 3.0 1.462487e+09 authorize.net 14.85 8382059188 10280 5.0 1.462487e+09 2016-05-06 06:26:55 2015-11-05 07:18:20 19385.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000
3206 13832 Debra Jordan 13832 USD 58.99 7.16 58.99 fedex 11|Ground 7.75813E+11 3.0 1.457365e+09 authorize.net 58.99 8048836159 45498 1.0 1.457365e+09 2016-03-07 11:23:36 2016-03-07 11:23:36 16293.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
3226 13866 CORI WALTER 13866 USD 58.99 7.85 53.99 fedex 11|Ground 7.75822E+11 3.0 1.457401e+09 authorize.net 53.99 8051104160 1260 1.0 1.457401e+09 2016-03-08 09:26:25 2016-03-08 09:26:25 16328.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
3374 14351 Ann Orlando 14351 USD 27.99 0.00 27.99 fedex 11|Ground 4.83833E+14 3.0 1.458577e+09 authorize.net 27.99 8090515241 56410 1.0 1.458577e+09 2016-03-21 15:17:30 2016-03-21 14:13:14 16855.0 782.0 Bed Assist Bar with Storage Pocket 1.0 27.99 29.160000 -1.170000
3398 14548 Bob Wiright 14548 USD 34.99 9.95 53.42 fedex 11|Ground 6.63949E+11 3.0 1.458757e+09 authorize.net 53.42 8097412082 09877C 1.0 1.458757e+09 2016-03-29 18:33:03 2016-03-29 18:33:03 17069.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
3405 14597 Alouis Colgan 14597 USD 34.99 0.00 34.99 fedex 11|Ground 6.63949E+11 3.0 1.458804e+09 authorize.net 34.99 8099072722 161248 1.0 1.458804e+09 2016-03-31 07:05:07 2016-03-31 07:05:07 17119.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
3409 14619 lois moore 14619 USD 34.99 0.00 34.99 fedex 11|Ground 6.63949E+11 3.0 1.458838e+09 authorize.net 34.99 8100164574 24024 1.0 1.458838e+09 2016-03-29 18:38:57 2016-03-29 18:38:57 17141.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
3420 14710 Sherman Langer 14710 USD 58.99 0.00 58.99 fedex 11|Ground 7.7597E+11 3.0 1.459004e+09 authorize.net 58.99 8105679785 01196D 1.0 1.459004e+09 2016-03-28 06:49:18 2016-03-28 06:49:18 17234.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
3487 14953 Michael Ball 14953 USD 58.99 0.00 58.99 fedex 11|Ground 6.13E+19 3.0 1.459381e+09 authorize.net 58.99 8117784458 322162 1.0 1.459381e+09 2016-03-30 18:49:32 2016-03-30 18:49:32 17496.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
3541 15144 Janice Boyle 15144 USD 27.99 0.00 27.99 fedex 11|Ground 6.87398E+14 3.0 1.459802e+09 authorize.net 27.99 8133377121 08674Z 5.0 1.459802e+09 2016-04-04 15:36:45 2015-11-05 07:18:20 17709.0 782.0 Bed Assist Bar with Storage Pocket 1.0 27.99 29.160000 -1.170000
3542 15145 Janice Boyle 15145 USD 27.99 0.00 25.19 fedex 11|Ground 6.87398E+14 3.0 1.459803e+09 authorize.net 25.19 8133402033 08307Z 4.0 1.459803e+09 2016-04-04 14:09:07 2015-11-05 07:18:20 17710.0 782.0 Bed Assist Bar with Storage Pocket 1.0 27.99 29.160000 -1.170000
3583 15260 Wesley Chalker Jr 15260 USD 19.98 0.00 19.98 fedex 11|Ground 4.83833E+14 3.0 1.460044e+09 authorize.net 19.98 8142281323 7994 1.0 1.460044e+09 2016-04-11 13:23:14 2016-04-11 13:23:14 17832.0 1454.0 Sterile Bordered Gauze 2.0 9.99 10.050000 -0.030000
3584 16364 Wesley Chalker Jr 16364 USD 19.98 0.00 19.98 fedex 11|Ground 9.4055E+21 3.0 1.462125e+09 authorize.net 19.98 8366965020 1079 1.0 1.462125e+09 2016-05-02 11:44:04 2016-05-02 11:44:04 19040.0 1454.0 Sterile Bordered Gauze 2.0 9.99 10.050000 -0.030000
3615 15376 CLARENCE POLLARD 15376 USD 55.98 0.00 50.98 fedex 11|Ground 1.49935E+13 3.0 1.460245e+09 paypal 50.98 7WL48849TP228631L 02708Z 1.0 1.460245e+09 2016-04-12 08:56:07 2016-04-12 08:56:07 17962.0 782.0 Bed Assist Bar with Storage Pocket 2.0 27.99 29.160000 -0.585000
3618 15393 Barbara James 15393 USD 58.99 0.00 58.99 fedex 11|Ground 7.76076E+11 3.0 1.460294e+09 authorize.net 58.99 8303422727 82115 1.0 1.460294e+09 2016-04-13 11:04:41 2016-04-11 12:01:51 17981.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
3619 15394 WARREN SEDRAN 15394 USD 27.99 0.00 25.19 fedex 11|Ground 4.83833E+14 3.0 1.460297e+09 authorize.net 25.19 8303477615 600479 1.0 1.460297e+09 2016-04-12 08:58:35 2016-04-12 08:58:35 17982.0 782.0 Bed Assist Bar with Storage Pocket 1.0 27.99 29.160000 -1.170000
3719 15780 Lenny Shenall 15780 USD 58.99 9.95 64.08 fedex 11|Ground 7.76132E+11 3.0 1.461005e+09 authorize.net 64.08 8328252812 21864 1.0 1.461005e+09 2016-04-18 11:57:28 2016-04-18 11:57:28 18396.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
3731 15844 Dolores Gosnell 15844 USD 27.99 9.95 27.99 fedex 11|Ground 9.53516E+14 3.0 1.461098e+09 authorize.net 27.99 8332005134 01990R 1.0 1.461098e+09 2016-05-04 11:36:46 2016-05-04 11:36:46 18462.0 782.0 Medline Bed Assist Bar 1.0 27.99 29.160000 -1.170000
3752 15899 Lee Gerstenhaber 15899 USD 34.99 0.00 34.99 fedex 11|Ground 9.4055E+21 3.0 1.461190e+09 paypal 34.99 2HB74261TB057390A 02708Z 1.0 1.461190e+09 2016-04-21 10:51:00 2016-04-21 10:51:00 18519.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
3766 15943 Jan Kripzer 15943 USD 34.99 9.95 34.99 fedex 11|Ground 9.4055E+19 3.0 1.461274e+09 authorize.net 34.99 8338712178 06826D 1.0 1.461274e+09 2016-04-21 14:27:09 2016-04-21 14:27:09 18566.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
3774 15970 Christopher Zanini 15970 USD 27.99 0.00 25.19 fedex 11|Ground 1.02633E+14 3.0 1.461604e+09 authorize.net 25.19 8341795724 275057 1.0 1.461356e+09 2016-04-25 19:21:18 2016-04-25 18:19:10 18593.0 782.0 Medline Bed Assist Bar 1.0 27.99 29.160000 -1.170000
3779 15992 Gail Reel 15992 USD 34.99 0.00 34.99 fedex 11|Ground 9.4055E+21 3.0 1.461605e+09 authorize.net 34.99 8343113413 57116 1.0 1.461407e+09 2016-04-25 10:25:23 2016-04-25 10:25:23 18616.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
3789 16005 ANNE MARIE DODERO 16005 USD 27.99 0.00 27.99 fedex 11|Ground 7.96468E+14 3.0 1.461604e+09 authorize.net 27.99 8344004179 01862A 1.0 1.461441e+09 2016-04-25 18:51:06 2016-04-25 17:47:06 18637.0 782.0 Medline Bed Assist Bar 1.0 27.99 29.160000 -1.170000
3813 16068 betty montesi 16068 USD 58.99 0.00 58.99 fedex 11|Ground 7.76208E+11 3.0 1.461782e+09 authorize.net 58.99 8347372581 09654D 1.0 1.461593e+09 2016-04-27 11:29:54 2016-04-27 11:29:54 18704.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
8237 16204 Jill OGorman 16204 USD 34.99 0.00 34.99 fedex 11|Ground 9.4055E+21 3.0 1.461881e+09 authorize.net 34.99 8356224503 825606 1.0 1.461810e+09 2016-05-02 11:29:47 2016-05-02 11:29:47 18863.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
8274 16335 Casandra McMorries 16335 USD 58.99 0.00 58.99 fedex 11|Ground 7.76248E+11 3.0 1.462133e+09 authorize.net 58.99 8364716076 19481 1.0 1.462054e+09 2016-05-03 10:46:39 2016-05-03 10:46:39 19010.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
8307 16414 Douglas Monsoor 16414 USD 104.97 9.95 104.97 fedex 11|Ground 7.76253E+11 3.0 1.462217e+09 authorize.net 104.97 8370811515 01110C 1.0 1.462217e+09 2016-05-03 11:58:54 2016-05-03 11:58:54 19092.0 17051.0 Bottom Buddy toilet tissue aid 3.0 34.99 40.500000 -1.836667
8353 16541 Lester Curnow 16541 USD 14.85 0.00 14.85 fedex 11|Ground 9.36129E+21 3.0 1.462373e+09 authorize.net 14.85 8376930603 286988 1.0 1.462373e+09 2016-05-05 09:21:11 2016-05-05 09:21:11 19227.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000
8359 16558 Tom Beatty 16558 USD 27.99 0.00 27.99 fedex 11|Ground 6.87398E+14 3.0 1.462391e+09 authorize.net 27.99 8378075808 05297C 4.0 1.462391e+09 2016-05-04 12:37:07 2015-11-05 07:18:20 19245.0 782.0 Medline Bed Assist Bar 1.0 27.99 28.160000 -0.170000
8365 16597 Patricia Kotsenas 16597 USD 64.80 9.95 74.75 fedex 11|Ground 2.27196E+13 3.0 1.462409e+09 authorize.net 74.75 8378981908 02630B 1.0 1.462409e+09 2016-05-05 11:11:21 2016-05-05 10:07:32 19285.0 1025.0 Steel Bariatric Commode 1.0 64.80 92.000000 -27.200000
8384 16693 Rebecca Griffiths 16693 USD 58.99 0.00 58.99 fedex 11|Ground 575-642-8976 3.0 1.462488e+09 authorize.net 58.99 8382102271 586726 1.0 1.462488e+09 2016-05-06 06:31:57 2016-05-06 06:31:57 19387.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
8392 16744 June Rudner 16744 USD 27.99 0.00 27.99 fedex 11|Ground 4.8383E+14 3.0 1.462540e+09 authorize.net 27.99 8383400273 00615Z 1.0 1.462540e+09 2016-05-06 14:47:12 2016-05-06 13:43:12 19439.0 782.0 Medline Bed Assist Bar 1.0 27.99 28.160000 -0.170000
8394 16746 FRANK SOPINSKI 16746 USD 27.99 0.00 27.99 fedex 11|Ground 6.40791E+13 3.0 1.462543e+09 authorize.net 27.99 8383552971 06375C 1.0 1.462543e+09 2016-05-06 16:47:02 2016-05-06 15:40:44 19441.0 782.0 Medline Bed Assist Bar 1.0 27.99 28.160000 -0.170000
8395 16747 michael shotts 16747 USD 74.99 9.95 77.44 fedex 11|Ground 8.3565E+14 3.0 1.462544e+09 authorize.net 77.44 8383594730 18242 1.0 1.462544e+09 2016-05-06 13:17:19 2016-05-06 12:14:14 19442.0 426.0 Padded Transfer Benches 1.0 74.99 77.000000 -2.010000
8396 16749 Holly Scotchel 16749 USD 58.99 0.00 58.99 fedex 11|Ground 6.87398E+14 3.0 1.462545e+09 authorize.net 58.99 8383648902 01683C 5.0 1.462545e+09 2016-05-06 07:39:00 2015-11-05 07:18:20 19444.0 1841.0 Emesis Bags, Blue, 36.000 OZ 1.0 58.99 88.500000 -29.510000
8402 16753 Harold Roberts 16753 USD 27.99 0.00 27.99 fedex 11|Ground 8.3565E+14 3.0 1.462546e+09 authorize.net 27.99 8383741325 5242 1.0 1.462546e+09 2016-05-06 15:18:01 2016-05-06 14:13:22 19452.0 782.0 Medline Bed Assist Bar 1.0 27.99 28.160000 -0.170000
8428 16962 Arthur Royer 16962 USD 44.00 9.95 53.95 manual 0|Standard Shipping 6.87398E+14 3.0 1.462804e+09 authorize.net 24.95 8390592894 909003 5.0 1.462804e+09 2016-05-09 12:56:09 2016-05-09 12:55:41 19674.0 1842.0 BUCKET, REPLACEMENT FOR MDS89668XW 1.0 44.00 51.112718 -7.112718
8434 16981 Rebecca Schnepf 16981 USD 34.99 0.00 34.99 fedex 11|Ground 6.87398E+14 3.0 1.462823e+09 authorize.net 34.99 8391980366 43453 0.0 1.462823e+09 2016-05-10 14:08:06 2015-11-05 07:18:20 19696.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
8435 16987 Patricia Sleeper 16987 USD 14.85 0.00 14.85 fedex 11|Ground 6.87398E+14 3.0 1.462824e+09 authorize.net 14.85 8392053758 609572 5.0 1.462824e+09 2016-05-09 13:39:33 2015-11-05 07:18:20 19703.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000
8448 17060 Nick Dziurkowski 17060 USD 14.85 0.00 14.85 fedex 11|Ground 6.87398E+14 3.0 1.462907e+09 authorize.net 14.85 8395155339 13223 5.0 1.462907e+09 2016-05-13 07:51:23 2015-11-05 07:18:20 19782.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000
8456 17071 Marvin Rahman 17071 USD 14.85 0.00 14.85 fedex 11|Ground 6.87398E+14 3.0 1.462911e+09 authorize.net 14.85 8395421530 157796 5.0 1.462911e+09 2016-05-13 07:54:32 2015-11-05 07:18:20 19798.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000
8457 17072 Bethel Jones 17072 USD 14.85 9.95 14.85 fedex 11|Ground 6.87398E+14 3.0 1.462912e+09 authorize.net 14.85 8395506459 01038R 5.0 1.462912e+09 2016-05-13 07:58:05 2015-11-05 07:18:20 19799.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000
8470 17115 Barbara Hawkins 17115 USD 34.99 0.00 31.49 fedex 11|Ground 9.4055E+21 3.0 1.462941e+09 authorize.net 31.49 8396533738 10541 1.0 1.462941e+09 2016-05-11 06:59:29 2016-05-11 06:59:29 19846.0 17051.0 Bottom Buddy toilet tissue aid 1.0 34.99 40.500000 -5.510000
8504 17206 Gail Yarbrough 17206 USD 14.85 0.00 14.85 fedex 11|Ground 6.87398E+14 3.0 1.463064e+09 authorize.net 14.85 8400367859 601427 5.0 1.463064e+09 2016-05-13 08:00:03 2015-11-05 07:18:20 19948.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000
8522 17271 Julia Gearhart 17271 USD 29.70 0.00 29.70 fedex 11|Ground 6.87398E+14 3.0 1.463160e+09 authorize.net 29.70 8404223306 289728 0.0 1.463160e+09 2015-08-06 06:40:10 2015-11-05 07:18:20 20020.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 2.0 14.85 15.500000 -0.325000
8567 17421 Nora Fontana 17421 USD 14.85 0.00 16.06 fedex 11|Ground 6.87398E+14 3.0 1.463409e+09 paypal 16.06 16706988SV3261147 02708Z 0.0 1.463409e+09 2015-08-06 06:40:10 2015-11-05 07:18:20 20186.0 1842.0 Medline Emesis/Barf Bags, throw up bags, Blue ... 1.0 14.85 15.500000 -0.650000

We can drill down and look at things like which shipping methods we use most. Maybe we can cut a deal with FedEx, or whoever we're doing the most business with. We can also see how much free shipping is happening, since that may affect our bottom line.

orders['shipping_method'].value_counts()

11|Ground                    2549
0|Standard Shipping           697
-1|Free Shipping              220
0|Free Shipping                86
NATIONAL DELIVERY               2
HOWARD’S EXPRESS, INC.          2
WILSON TRUCKING CORP            2
NATIONAL DELIVERY SYSTEMS       1
MEDTRANS                        1
SOUTHWESTERN MOTOR TRNAS        1
ROADWAY EXPRESS                 1
FEDERAL EXPRESS CORP.           1
INTERNATIONAL FEDEX             1
LAND AIR OF NEW ENGLAND         1
Name: shipping_method, dtype: int64
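As a rough follow-up (a quick sketch, assuming a shipping charge of 0 means we absorbed the cost), we could estimate how often we're shipping for free:

#Hypothetical check on how many orders went out with no shipping charge
free = orders[orders['shipping'] == 0]
print(len(free), 'of', len(orders), 'orders had no shipping charge')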

Now that this table is all cleaned up, we can move on to the final table.

Products Table

As before we’ll drop the duplicates, the null values (4 rows this time) and then do the feature transformations.

products.drop_duplicates(subset= 'id', inplace = True)
products.dropna(inplace = True)

#Feature Transformations
products['last_modified'] = products['last_modified'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
products['import_id'] = products['import_id'].astype(str)
products['vendor'] = products['vendor'].astype(str)

At this point, we could use NLP on the product names and descriptions to analyze our products, see which ones are most popular, and group them into the types of products that earn us the most profit. Again, this is beyond the scope of this specific project, which is to create a basic customer segmentation and an RFM table (eyes on the prize).

Writing to SQL Database

In this basic example, I will be using sqlite3, although I could use SQLAlchemy or one of the million other packages that interface between Python and SQL.

#Create Connection
conn = sqlite3.connect('SQL Server Filepath\\.db')

#Customers table
customers.to_sql('customers', conn, if_exists='replace', index=True)

#Orders table
orders.to_sql('orders', conn, if_exists='replace', index=True)

#Products table
products.to_sql('products', conn, if_exists='replace', index=True)

conn.close()
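As a quick sanity check (not part of the original write-up, and assuming the same placeholder file path), we could read a few rows back out of the new database:

#Hypothetical verification that the tables were written
conn = sqlite3.connect('SQL Server Filepath\\.db')
print(pd.read_sql('SELECT id, fname, lname FROM customers LIMIT 5', conn))
conn.close()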

Groupby Analysis

With our first task behind us, we can start to see the finish line in the distance…. A bit far in the distance, but we can see it, and the fun bit is still ahead! Since we didn’t really do a deep dive into the products table, we’ll mostly be looking at our customer and order information.

#Make a FullName variable to merge on easily
#People can share the same first or last name, but rarely share both
customers['FullName'] = customers['fname'] + customers['lname']
orders['FullName'] = orders['fname'] + orders['lname']

#Create the new DataFrame
df2 = pd.merge(customers, orders, how= 'outer', on='FullName')

#Drop the redundant columns
df2.drop(['fname_x', 'lname_x', 'fname_y', 'lname_y'], axis = 1, inplace = True)

Let's take another snapshot of our customers' buying habits, similar to before, but now we have the profit column as well.

#Lets look at a quick snapshot of customer buying habits
df2_items = df2.groupby('FullName')['Items.qty', 'Items.price', 'Items.cost', 'profit'].sum()
df2_items.sample(20)
Customer buying habits
Items.qty Items.price Items.cost profit
FullName
DavidCislo 1.0 26.09 17.99 8.100
StevenAlexander 1.0 39.35 28.11 11.240
DarrylMiller 1.0 69.17 53.21 15.960
olgapasko 2.0 9.99 6.36 1.815
GarveyStenersen 1.0 160.04 128.03 32.010
CherylMorgan 1.0 60.92 46.50 14.420
sueminier 2.0 16.23 6.49 4.870
RobertPurnick 1.0 114.99 90.99 24.000
KimFreudenberger 1.0 25.00 17.24 7.760
HarlanJuster 3.0 47.99 44.00 1.330
donnahoey 1.0 151.94 121.55 30.390
DanielFoster 1.0 57.19 43.99 13.200
AntlanticSurgicenter 1.0 85.67 66.93 18.740
VendelCsaszar 1.0 35.49 35.49 0.000
MarkForte 2.0 36.82 21.04 15.780
Mary Allbright 1.0 116.99 103.00 13.990
SchivonneBishop 1.0 46.37 33.60 12.770
BarryOxford 1.0 34.27 24.48 9.790
TanyaGlover 2.0 12.78 7.30 2.740
DavidKim 1.0 58.36 44.55 13.810

And we can take a look at the shipping costs in case that’s something we want to analyze later on.

df2_order = df2.groupby('FullName')['subtotal', 'shipping', 'total'].sum()
df2_order.sample(20)
Shipping Costs
subtotal shipping total
FullName
SteveGonzalez 19.45 9.95 29.40
HelenBlair 32.63 9.95 42.58
adamhenninger 4.88 9.95 15.26
MarkDowling 121.36 9.95 119.17
SusanHixson 13.99 0.00 13.99
RonHutchison 37.99 6.73 37.99
AleksandraBaeva 270.66 0.00 270.66
DEBBIETRENT 171.91 9.95 164.67
douglasperry 112.87 7.17 101.58
renee Pallotta 24.40 9.95 35.30
PurleyNewson 20.77 9.95 30.72
Jannett Stewart 68.89 0.00 68.89
ChristaDedebant 747.23 29.85 777.08
DianaPaulsen 63.70 9.95 73.65
T. JohnThomas 23.27 9.95 33.22
Donald Ellis 57.19 0.00 56.12
PatriciaSleeper 14.85 0.00 14.85
RHONDABRYAN 13.13 9.95 21.77
JoseAlva 66.94 9.95 76.89
HimanshuMisra 24.99 0.00 24.99

Now, we’ll create the ‘recency’ aspect of our Recency-Frequency-Monetary table for later.

df3 = df2.copy()
df3.dropna(inplace=True)
df3['placed_date'] = df3['placed_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))

#Newest orders
df_order_new = df3.groupby('id_x')['placed_date'].max()

#Oldest orders
df_order_old = df3.groupby('id_x')['placed_date'].min()

df_order_old = pd.DataFrame(df_order_old)
df_order_new = pd.DataFrame(df_order_new)
df_order = pd.DataFrame()

df_order[['first_order', 'recent_order']] = pd.merge(df_order_old, df_order_new, left_on = 'id_x', right_on= 'id_x')
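If we wanted explicit recency, frequency, and monetary numbers rather than the raw first/last order dates, a minimal sketch might look like the one below. This isn't from the original notebook; it assumes 'id_y' is the order id picked up from the merge and that 'total' holds the order total.

#Hypothetical RFM rollup: recency in days from the newest order in the data,
#frequency as the number of distinct orders, monetary as total order value
snapshot = pd.to_datetime(df3['placed_date']).max()
rfm = df3.groupby('id_x').agg(
    recency_days=('placed_date', lambda d: (snapshot - pd.to_datetime(d).max()).days),
    frequency=('id_y', 'nunique'),
    monetary=('total', 'sum'),
)
rfm.head()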

Now that we’ve done a good bit of feature engineering and analysis, it’s time to do the actual modeling and segmentation.

Modeling

The first thing that we’ll use is PCA, which helps to reduce the amount of features so that we can cluster. PCA can make it difficult to determine WHY the customers are segmented the way they are, so we will also be segmenting based on profit and recency, then comparing the 2 methods.

PCA

#PCA only works on numerical datatypes, so we'll select those out
numerical_features = df2.select_dtypes(exclude='object').columns

pca_df = df2[numerical_features]

pca_df.dropna(inplace=True)

#Create the Principal Components
pca_2 = PCA(n_components=2).fit(pca_df)

#Most of the variance in our features can be explained with 1 PC
#We will include 2 for ease of visualization
print(pca_2.explained_variance_ratio_)

[Output: explained variance ratio of the two principal components]

Now, we’ll create a DataFrame of the principal components and add it to our df2 for later use.

#Create a PC DataFrame
principalcomponents = pca_2.fit_transform(pca_df)

principal_df = pd.DataFrame(data = principalcomponents, columns = ['principal component 1', 'principal component 2'])

df2.dropna(inplace = True)
#Make sure they're the same shape so they line up when we concatenate
principal_df.shape, df2.shape

((3569, 2), (3569, 31))

df2 = pd.concat([df2, principal_df], axis = 1)

Clustering with KMeans

Since KMeans works best on low-dimensional data, we could cluster on the principal components. As stated before, though, PCA can make it difficult to know WHY the clusters come out the way they do, so we will also create a table to cluster on profit and recency of the previous purchase. Other variables we could cluster on include shipping type, which vendor the customer used, and the types of products bought (if we used NLP to group product types).

Clustering based on Profit

df_cluster = df2.groupby('id_x')['profit', 'placed_date'].agg({'profit': 'sum','placed_date': 'max'})
df_cluster.head()

df_cluster.dropna(inplace = True)

X = pd.DataFrame(df_cluster)

#Start off with a large number of clusters to analyze the optimization
K = KMeans(n_clusters = 9).fit(X)

With our cluster fit, it's time to determine the optimal number of groups for our customers. We will use yellowbrick's clustering visualizers to help with this. Silhouette scores range from -1 to 1, and scores closer to 1 are better.
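The plots below were generated with yellowbrick visualizers along these lines (a sketch; the exact settings used for the figures may differ):

#Sketch of the yellowbrick visualizers behind the plots below
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

#Elbow plot; the metric can be 'distortion', 'silhouette', or 'calinski_harabasz'
elbow = KElbowVisualizer(KMeans(), k=(2, 10), metric='distortion')
elbow.fit(X)
elbow.show()

#Silhouette diagram for a specific choice of k
sil = SilhouetteVisualizer(KMeans(n_clusters=4))
sil.fit(X)
sil.show()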

[Cluster evaluation plots: silhouette score, Calinski-Harabasz score, and distortion score across different numbers of clusters]

According to the graphs, 4 or 5 clusters would be best. We’ll choose 4 for now.

K = KMeans(n_clusters = 4).fit(X)
silhouette_score(X, K.labels_)

0.6309803241979693

#Now we'll add the cluster labels to our final DataFrame that we'll use for reporting
df_cluster['cluster'] = K.labels_
df_cluster.head()
profit placed_date cluster
id_x
3.0 17.0625 1.449604e+09 0
4.0 6.9400 1.386780e+09 2
5.0 29.6800 1.441905e+09 3
7.0 11.2000 1.388156e+09 2
8.0 13.7900 1.389303e+09 2

Clustering based on PCA

We will follow the same process to determine cluster labels using PCA.

#Start off with a large number of clusters to analyze the optimization
K = KMeans(n_clusters = 7).fit(principal_df)

[Cluster evaluation plots for the PCA-based model: silhouette score, Calinski-Harabasz score, and distortion score across different numbers of clusters]

According to the graphs, 3 clusters is best for the PCA model.

K = KMeans(n_clusters = 3).fit(principal_df)
silhouette_score(principal_df, K.labels_)

0.6402889003646225

We can see that the silhouette score for the PCA model is SLIGHTLY better than the previous model, but the PCA clusters may not be able to give us actionable information.

RFM Table

Our final task: create the RFM table to report to the boss. We will include the cluster labels as well, so they know who to reach out to with certain promotions.

df_cluster = pd.merge(df_cluster, df_order, left_on= 'id_x', right_on= 'id_x')
df_cluster.head(30)
RFM Table
profit placed_date cluster first_order recent_order
id_x
3.0 17.062500 1.449604e+09 0 2013-12-03 09:07:35 2015-12-08 11:40:52
4.0 6.940000 1.386780e+09 2 2013-12-11 08:44:23 2013-12-11 08:44:23
5.0 29.680000 1.441905e+09 3 2013-12-12 07:19:59 2015-09-10 10:07:59
7.0 11.200000 1.388156e+09 2 2013-12-27 06:52:27 2013-12-27 06:52:27
8.0 13.790000 1.389303e+09 2 2014-01-09 13:33:36 2014-01-09 13:33:36
10.0 0.000000 1.390510e+09 2 2014-01-23 12:38:36 2014-01-23 12:38:36
11.0 0.000000 1.390573e+09 2 2014-01-24 06:19:15 2014-01-24 06:19:15
12.0 3.970000 1.390613e+09 2 2014-01-24 17:30:19 2014-01-24 17:30:19
13.0 17.260000 1.424206e+09 1 2014-01-27 12:54:01 2015-02-17 12:46:29
14.0 0.000000 1.391012e+09 2 2014-01-29 08:13:37 2014-01-29 08:13:37
15.0 7.620000 1.391363e+09 2 2014-02-02 09:43:24 2014-02-02 09:43:24
21.0 0.000000 1.391619e+09 2 2014-02-05 08:45:32 2014-02-05 08:45:32
22.0 2.565000 1.391815e+09 2 2014-02-07 15:17:52 2014-02-07 15:17:52
23.0 19.850000 1.420601e+09 1 2014-02-08 14:01:19 2015-01-06 19:20:00
24.0 9.120000 1.392048e+09 2 2014-02-10 08:07:54 2014-02-10 08:07:54
25.0 0.000000 1.392049e+09 2 2014-02-10 08:10:47 2014-02-10 08:10:47
26.0 0.000000 1.392054e+09 2 2014-02-10 09:38:57 2014-02-10 09:38:57
27.0 93.870000 1.404227e+09 2 2014-02-11 07:46:40 2014-07-01 07:57:15
30.0 0.000000 1.392312e+09 2 2014-02-13 09:17:08 2014-02-13 09:17:08
31.0 4.366667 1.392393e+09 2 2014-02-14 07:43:48 2014-02-14 07:43:48
32.0 10.310000 1.392394e+09 2 2014-02-14 08:08:30 2014-02-14 08:08:30
33.0 3.660000 1.392485e+09 2 2014-02-15 09:24:01 2014-02-15 09:24:01
35.0 6.110000 1.393012e+09 2 2014-02-21 11:49:50 2014-02-21 11:49:50
37.0 21.910000 1.393253e+09 2 2014-02-24 06:40:55 2014-02-24 06:40:55
39.0 6.740000 1.393370e+09 2 2014-02-25 15:13:24 2014-02-25 15:13:24
40.0 8.925000 1.393524e+09 2 2014-02-27 10:01:14 2014-02-27 10:01:14
41.0 20.400000 1.393537e+09 2 2014-02-27 13:40:19 2014-02-27 13:40:19
42.0 0.720000 1.393681e+09 2 2014-03-01 05:37:14 2014-03-01 05:37:14
43.0 11.200000 1.393712e+09 2 2014-03-01 14:12:25 2014-03-01 14:12:25
44.0 9.310000 1.393862e+09 2 2014-03-03 08:01:37 2014-03-03 08:01:37

Here we have a table, grouped by customer ID, that shows the profit we have earned from each customer, their first and most recent order dates, and their cluster number. The clusters can be analyzed further to determine which type of product or offer should be pushed to each group.
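As a starting point for that analysis (a sketch, not part of the original write-up), we could profile each cluster directly from this table:

#Hypothetical cluster profile: size, average profit, and most recent order per cluster
df_cluster.groupby('cluster').agg(
    customers=('profit', 'size'),
    avg_profit=('profit', 'mean'),
    most_recent_order=('recent_order', 'max'),
)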

That’s all for now! As you can tell, there is a lot more that can be done with this data, but due to time and objective constraints I have limited this project to what is done here. If the boss wants us to dive a bit deeper using NLP on our product descriptions, or analyze the clusters more/differently that can be done in further iterations of the project.

Thank you for taking the time to read through this project. If you have any questions, please feel free to reach out to me via email!