Customer Segmentation with KMeans
Hello again and welcome to another one of my projects! This time, the boss has given us a dataset with previous transaction data and is asking for a few different things:
- Separate data into 3 different tables (Customers/Orders/Products) and write to a SQL database for storage.
- Create a Recency-Frequency-Monetary table for reporting.
- Segment the customers into groups using a basic KMeans clustering algorithm.
These clusters could drive anything from pushing different advertisements to certain customers, to inviting customers back with a deal if they haven’t bought from our site in a while, to capturing general demographic information for later use. We will be evaluating our model using silhouette score, Calinski-Harabasz score, and distortion score. Let’s get to it!
Step 1: Primary Analysis and Cleaning
After importing our libraries, we’ll read in the data from the single CSV file that we’ve been given.
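The import block itself didn’t survive into this writeup, so for reference, here is roughly the set of imports the rest of the code assumes (a sketch; your environment may differ):
#Libraries assumed throughout this walkthrough
import time
import sqlite3
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score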
#import data:
df = pd.read_csv("customers.csv")
df.info()
List of features
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24958 entries, 0 to 24957
Data columns (total 74 columns):
Customers.id                     24950 non-null object
Customers.fname                  8572 non-null object
Customers.lname                  8572 non-null object
Customers.create_date            8572 non-null object
Customers.mailing                8572 non-null object
Customers.last_modified          8572 non-null object
Orders.id                        8572 non-null object
Orders.fname                     8572 non-null object
Orders.lname                     8572 non-null object
Orders.order_number              8572 non-null object
Orders.currency                  8572 non-null object
Orders.subtotal                  8572 non-null float64
Orders.shipping                  8572 non-null float64
Orders.total                     8572 non-null object
Orders.shipping_carrier          8571 non-null object
Orders.shipping_method           8564 non-null object
Orders.tracking                  8564 non-null object
Orders.payment_status            8564 non-null float64
Orders.payment_date              8564 non-null float64
Orders.payment_type              8564 non-null object
Orders.payment_amount            8564 non-null float64
Orders.payment_id                8564 non-null object
Orders.payment_code              8564 non-null object
Orders.status                    8564 non-null float64
Orders.placed_date               8564 non-null float64
Orders.updated_date              8564 non-null float64
Orders.shipped_date              8564 non-null float64
Order_Items.id                   8564 non-null float64
Order_Items.product_id           8564 non-null float64
Order_Items.product_name         8564 non-null object
Order_Items.qty                  8564 non-null float64
Order_Items.price                8564 non-null float64
Order_Items.cost                 8564 non-null float64
Products.id                      8564 non-null float64
Products.template                8564 non-null object
Products.vendor                  8564 non-null float64
Products.import_id               8564 non-null float64
Products.name                    8564 non-null object
Products.display_name            8564 non-null object
Products.list_price              8564 non-null float64
Products.price                   8564 non-null float64
Products.cost                    8564 non-null float64
Products.flags                   8564 non-null float64
Products.last_modified           8564 non-null float64
Products.taxable                 8564 non-null float64
Products.shopping_gtin           8564 non-null float64
Products.shopping_brand          8564 non-null object
Products.shopping_mpn            8564 non-null object
Products.shopping_flags          8564 non-null float64
Products.amazon_asin             8564 non-null object
Products.amazon_item_type        8564 non-null object
Products.google_shopping_id      8564 non-null object
Products.google_shopping_type    8564 non-null object
Products.google_shopping_cat     8564 non-null object
Products.shopping_type           8564 non-null object
Products.pricegrabber_cat        8564 non-null object
Products.thefind_cat             8564 non-null object
Products.quickbooks_id           8564 non-null object
Products.qb_edit_sequence        8564 non-null float64
Products.short_description       8564 non-null object
Products.long_description        8557 non-null object
Products.seo_title               4186 non-null object
Products.seo_url                 4186 non-null object
Products.unit                    4186 non-null object
Products.packaging               4186 non-null object
Products.multiple                4186 non-null object
Products.upc                     4186 non-null float64
Products.hcpcs                   4186 non-null object
Products.case_qty                4186 non-null float64
Products.import_flags            4186 non-null float64
Products.shipping_length         4186 non-null float64
Products.shipping_width          4186 non-null float64
Products.shipping_height         4186 non-null float64
Products.family_id               4186 non-null object
dtypes: float64(32), object(42)
memory usage: 14.1+ MB

As we can see, there are quite a few different kinds of data in this table. Since our first task is to separate our data into categories and write them to a SQL server, we’ll see if there’s an easy ID number we can use to split everything off.
#Since we're going to separate the tables, check to see if we can do so easily
#by separating on ID, no such luck
df[['Customers.id', 'Orders.id', 'Products.id']].head()
 | Customers.id | Orders.id | Products.id |
---|---|---|---|
0 | 797 | 3758 | 2310.0 |
1 | 3 | 23 | 177.0 |
2 | 3 | 9531 | 1.0 |
3 | 4 | 29 | 983.0 |
4 | 5 | 30 | 991.0 |
No such luck, let’s keep exploring the data and see what else we can find out about the different tables that we’ll be creating.
pd.DataFrame([{'customers': len(df['Customers.id'].value_counts()),
'products': len(df['Products.id'].value_counts()),
'orders': len(df['Orders.id'].value_counts()),
}], columns = ['products', 'orders', 'customers'], index = ['quantity'])
 | products | orders | customers |
---|---|---|---|
quantity | 1710 | 3568 | 3058 |
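As an aside, DataFrame.nunique would get the same counts with less ceremony; a one-line sketch:
#One-liner equivalent of the table above
df[['Customers.id', 'Orders.id', 'Products.id']].nunique()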
So there are 3,058 unique customers, who placed 3,568 unique orders across 1,710 unique products. That will be useful information later on. While exploring, I also found a bunch of nonsense rows where product-description HTML leaked into the ID column. Let’s take those out.
#Some product-description HTML leaked into the Customers.id column; null those values out
df['Customers.id'].replace('\t<li>Commode liners can be used with most commode buckets.</li>', np.NaN, inplace = True)
df['Customers.id'].replace('\t<li>Liners include an absorbent pad which solidifies the waste and makes clean up easy and hygienic.</li>',
np.NaN, inplace = True)
#With 74 columns, drop any row that has fewer than 8 non-null values
#(i.e. rows missing roughly 90% of their data)
df.dropna(thresh = 8, inplace = True)
df.reset_index(inplace = True, drop = True)
As a final step in exploration, let’s take a quick look at some of the transactions for each customer.
#Lets take a quick look at our customers and the order prices
Customers_groupdf = df.groupby(['Customers.id', 'Customers.fname', 'Customers.lname'])[['Order_Items.price',
                              'Order_Items.cost', 'Orders.total']].sum()
Customers_groupdf.head(20)
Customers groupby table
Customers.id | Customers.fname | Customers.lname | Order_Items.price | Order_Items.cost |
---|---|---|---|---|
3.0 | John | Smith | 73.78 | 54.37 |
4.0 | James | Anderson | 19.56 | 12.62 |
5.0 | Abraham | Pollak | 95.14 | 66.33 |
7.0 | peggy | thompson | 39.19 | 27.99 |
8.0 | Randy | Pruss | 59.75 | 45.96 |
10.0 | Tommy | Smith | 34.00 | 34.00 |
11.0 | Mark | Tremble | 34.00 | 34.00 |
12.0 | Emely | Cooke | 10.76 | 2.82 |
13.0 | george | mcmillin | 118.68 | 85.22 |
14.0 | adrian | Cavitt | 339.99 | 339.99 |
15.0 | Sharon | Mueller | 18.94 | 11.32 |
21.0 | Corey | Edmondson | 34.00 | 34.00 |
22.0 | Robert | Miller | 6.84 | 1.71 |
23.0 | Mekala | Whitaker | 141.40 | 100.40 |
24.0 | Richard | Ariano | 29.38 | 20.26 |
25.0 | marc | gorzynski | 35.00 | 35.00 |
26.0 | Richard L. | Shaak | 35.00 | 35.00 |
27.0 | Kenneth | Schmude | 125.16 | 31.29 |
30.0 | Jesse | Spalding | 35.00 | 35.00 |
31.0 | Alan | Safir | 56.78 | 43.68 |
Step 2: Separate into 3 DataFrames
In this step we’ll split the original table into its three component DataFrames, clean each one up, and at the end write them all to a SQL database.
#Customers DataFrame
customers = df.loc[: , :'Customers.last_modified']
customers.columns = [col.split('.')[1] for col in customers.columns]
#Orders DataFrame, we need to specify columns because they start with both:
#Orders.colname and Order_Items.colname which sometimes overlap
orders = df.loc[: , 'Orders.id':'Order_Items.cost']
orders_cols = ['id', 'fname', 'lname', 'order_number', 'currency', 'subtotal', 'shipping', 'total', 'shipping_carrier',
'shipping_method', 'tracking', 'payment_status', 'payment_date', 'payment_type', 'payment_amount', 'payment_id',
'payment_code', 'status', 'placed_date', 'updated_date', 'shipped_date', 'Items.id', 'Items.product_id',
'Items.product_name', 'Items.qty', 'Items.price', 'Items.cost']
orders.columns = orders_cols
#Products DataFrame
products = df.loc[: , 'Products.id':]
products.columns = [col.split('.')[1] for col in products.columns]
Customers Table
Separating each entity out of the larger table creates a bunch of duplicate rows, so the first step for each table will be dropping the duplicates that share an ID.
#As you can see from this example, there are many duplicates
customers[customers['id'] == 3371.0]
Customer duplicates
 | id | fname | lname | create_date | mailing | last_modified |
---|---|---|---|---|---|---|
3821 | 3371 | Terry | Rich | 1461637787 | 1.0 | 1461637787 |
3822 | 3371 | Terry | Rich | 1461637787 | 1.0 | 1461637787 |
3823 | 3371 | Terry | Rich | 1461637787 | 1.0 | 1461637787 |
... | ... | ... | ... | ... | ... | ... |
8190 | 3371 | Terry | Rich | 1461637787 | 1.0 | 1461637787 |
8191 | 3371 | Terry | Rich | 1461637787 | 1.0 | 1461637787 |
4371 rows × 6 columns
customers.drop_duplicates(subset='id', inplace = True)
#After doing this, there is one more ID that is null, we will drop this as well
customers.drop(index = [801], inplace = True)
Our next step will be converting the columns to the correct data types so that we can work with the data in Python later if we need to. This doesn’t matter so much for writing to the SQL database, because we can specify formatting there later on.
#Converting some columns to correct data type
customers['create_date'] = customers['create_date'].astype(int)
customers['mailing'] = customers['mailing'].astype(float)
customers['last_modified'] = customers['last_modified'].astype(int)
customers['create_date'] = customers['create_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
customers['last_modified'] = customers['last_modified'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
Let’s just take one more look at the info to make sure we’re good to go.
customers.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3054 entries, 0 to 8571
Data columns (total 6 columns):
id 3054 non-null object
fname 3054 non-null object
lname 3054 non-null object
create_date 3054 non-null object
mailing 3054 non-null float64
last_modified 3054 non-null object
dtypes: float64(1), object(5)
memory usage: 167.0+ KB
Orders Table
As before, we will first drop the duplicates, then drop the few remaining rows with null values (about 3 rows in this table).
orders.drop_duplicates(subset='id', inplace = True)
orders.dropna(inplace = True)
Then we do the feature transformations similar to the previous table.
orders['total'] = orders['total'].astype(float)
orders['updated_date'] = orders['updated_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
orders['shipped_date'] = orders['shipped_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
Next, we’ll create a column ‘profit’ that will be useful for segmentation later. This is also useful to get a quick snapshot of how much our company is making from each transaction.
orders['profit'] = (orders['Items.price'] - orders['Items.cost']) / orders['Items.qty']
orders['profit'].describe()
count 3565.000000
mean 15.067519
std 22.067572
min -29.510000
25% 6.010000
50% 10.620000
75% 18.000000
max 549.000000
Name: profit, dtype: float64
Already the benefit of creating this column is apparent: there are some orders we are actually LOSING money on! We can filter for the transactions where profit is less than 0 for further analysis later on. We won’t be covering that in this specific project, though; we have to stay focused on the task at hand!
#we can take a look at which orders we're losing money on and why
orders[orders['profit'] < 0]
Orders we lose money on
 | id | fname | lname | order_number | currency | subtotal | shipping | total | shipping_carrier | shipping_method | tracking | payment_status | payment_date | payment_type | payment_amount | payment_id | payment_code | status | placed_date | updated_date | shipped_date | Items.id | Items.product_id | Items.product_name | Items.qty | Items.price | Items.cost | profit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
812 | 15142 | Marcia | Olsen | 15142 | USD | 29.97 | 0.00 | 29.97 | fedex | 11|Ground | 7.76035E+11 | 3.0 | 1.459800e+09 | authorize.net | 29.97 | 8133230477 | 03161Z | 1.0 | 1.459800e+09 | 2016-04-04 15:39:40 | 2016-04-04 15:39:40 | 17707.0 | 1454.0 | Sterile Bordered Gauze | 3.0 | 9.99 | 10.050000 | -0.020000 |
1815 | 5867 | PJ | Nassi | 5867 | USD | 32.16 | 9.95 | 42.11 | fedex | 11|Ground | 6.49865E+13 | 3.0 | 1.441042e+09 | authorize.net | 42.11 | 7483749716 | 263228 | 1.0 | 1.441042e+09 | 2015-09-02 20:32:37 | 2015-08-31 19:35:17 | 7614.0 | 2110.0 | MoliCare Disposable Super Plus Briefs, Large/X... | 3.0 | 10.72 | 19.330000 | -2.870000 |
2057 | 7327 | Leigh Anne | Duncan | 7327 | USD | 99.90 | 0.00 | 99.90 | fedex | 11|Ground | 7.74823E+11 | 3.0 | 1.445780e+09 | paypal | 99.90 | 1W093636Y5791203L | 02708Z | 1.0 | 1.445780e+09 | 2015-11-02 19:32:49 | 2015-10-27 07:01:00 | 9368.0 | 1454.0 | Medline Sterile Bordered Gauze | 10.0 | 9.99 | 10.050000 | -0.006000 |
2184 | 8629 | Michael | Zayats | 8629 | USD | 19.98 | 5.54 | 17.98 | fedex | 11|Ground | 9.4055E+21 | 3.0 | 1.448205e+09 | authorize.net | 17.98 | 7727592306 | 513 | 1.0 | 1.448060e+09 | 2015-11-26 10:48:07 | 2015-11-23 09:09:28 | 10719.0 | 1454.0 | Sterile Bordered Gauze | 2.0 | 9.99 | 10.050000 | -0.030000 |
2615 | 11582 | Teresa | Joslin | 11582 | USD | 65.99 | 6.93 | 65.99 | fedex | 11|Ground | 7.75441E+11 | 3.0 | 1.452899e+09 | authorize.net | 65.99 | 7897798685 | 01591R | 1.0 | 1.452899e+09 | 2016-02-08 11:29:29 | 2016-01-18 12:21:13 | 13844.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 65.99 | 88.500000 | -22.510000 |
3071 | 13228 | Barbara | Hadley | 13228 | USD | 176.97 | 23.67 | 176.97 | fedex | 11|Ground | 7.75731E+11 | 3.0 | 1.456368e+09 | authorize.net | 176.97 | 8013360365 | 243464 | 1.0 | 1.456368e+09 | 2016-03-02 14:15:17 | 2016-02-25 07:21:39 | 15648.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 3.0 | 58.99 | 88.500000 | -9.836667 |
3086 | 13267 | Marcy | Seaman | 13267 | USD | 119.97 | 0.00 | 119.97 | fedex | 11|Ground | 6.63949E+11 | 3.0 | 1.456436e+09 | authorize.net | 119.97 | 8016026233 | 245275 | 1.0 | 1.456436e+09 | 2016-03-02 14:16:22 | 2016-03-01 07:11:54 | 15691.0 | 17051.0 | Bottom Buddy toilet tissue aid | 3.0 | 39.99 | 40.500000 | -0.170000 |
3095 | 13302 | Melody | Hollowell | 13302 | USD | 58.99 | 6.71 | 58.99 | fedex | 11|Ground | 7.75743E+11 | 3.0 | 1.456506e+09 | authorize.net | 58.99 | 8018282377 | 10043 | 1.0 | 1.456506e+09 | 2016-03-02 14:17:21 | 2016-02-26 09:04:14 | 15731.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
3134 | 13529 | Martha | Keler | 13529 | USD | 69.99 | 6.71 | 69.99 | fedex | 11|Ground | 7.75777E+11 | 3.0 | 1.456858e+09 | authorize.net | 69.99 | 8031260682 | 6135 | 1.0 | 1.456858e+09 | 2016-03-02 14:23:42 | 2016-03-02 08:49:41 | 15970.0 | 1453.0 | Sterile Bordered Gauze | 1.0 | 69.99 | 78.140000 | -8.150000 |
3180 | 16691 | Staci | Meredith | 16691 | USD | 14.85 | 0.00 | 14.85 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462487e+09 | authorize.net | 14.85 | 8382059188 | 10280 | 5.0 | 1.462487e+09 | 2016-05-06 06:26:55 | 2015-11-05 07:18:20 | 19385.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
3206 | 13832 | Debra | Jordan | 13832 | USD | 58.99 | 7.16 | 58.99 | fedex | 11|Ground | 7.75813E+11 | 3.0 | 1.457365e+09 | authorize.net | 58.99 | 8048836159 | 45498 | 1.0 | 1.457365e+09 | 2016-03-07 11:23:36 | 2016-03-07 11:23:36 | 16293.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
3226 | 13866 | CORI | WALTER | 13866 | USD | 58.99 | 7.85 | 53.99 | fedex | 11|Ground | 7.75822E+11 | 3.0 | 1.457401e+09 | authorize.net | 53.99 | 8051104160 | 1260 | 1.0 | 1.457401e+09 | 2016-03-08 09:26:25 | 2016-03-08 09:26:25 | 16328.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
3374 | 14351 | Ann | Orlando | 14351 | USD | 27.99 | 0.00 | 27.99 | fedex | 11|Ground | 4.83833E+14 | 3.0 | 1.458577e+09 | authorize.net | 27.99 | 8090515241 | 56410 | 1.0 | 1.458577e+09 | 2016-03-21 15:17:30 | 2016-03-21 14:13:14 | 16855.0 | 782.0 | Bed Assist Bar with Storage Pocket | 1.0 | 27.99 | 29.160000 | -1.170000 |
3398 | 14548 | Bob | Wiright | 14548 | USD | 34.99 | 9.95 | 53.42 | fedex | 11|Ground | 6.63949E+11 | 3.0 | 1.458757e+09 | authorize.net | 53.42 | 8097412082 | 09877C | 1.0 | 1.458757e+09 | 2016-03-29 18:33:03 | 2016-03-29 18:33:03 | 17069.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
3405 | 14597 | Alouis | Colgan | 14597 | USD | 34.99 | 0.00 | 34.99 | fedex | 11|Ground | 6.63949E+11 | 3.0 | 1.458804e+09 | authorize.net | 34.99 | 8099072722 | 161248 | 1.0 | 1.458804e+09 | 2016-03-31 07:05:07 | 2016-03-31 07:05:07 | 17119.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
3409 | 14619 | lois | moore | 14619 | USD | 34.99 | 0.00 | 34.99 | fedex | 11|Ground | 6.63949E+11 | 3.0 | 1.458838e+09 | authorize.net | 34.99 | 8100164574 | 24024 | 1.0 | 1.458838e+09 | 2016-03-29 18:38:57 | 2016-03-29 18:38:57 | 17141.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
3420 | 14710 | Sherman | Langer | 14710 | USD | 58.99 | 0.00 | 58.99 | fedex | 11|Ground | 7.7597E+11 | 3.0 | 1.459004e+09 | authorize.net | 58.99 | 8105679785 | 01196D | 1.0 | 1.459004e+09 | 2016-03-28 06:49:18 | 2016-03-28 06:49:18 | 17234.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
3487 | 14953 | Michael | Ball | 14953 | USD | 58.99 | 0.00 | 58.99 | fedex | 11|Ground | 6.13E+19 | 3.0 | 1.459381e+09 | authorize.net | 58.99 | 8117784458 | 322162 | 1.0 | 1.459381e+09 | 2016-03-30 18:49:32 | 2016-03-30 18:49:32 | 17496.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
3541 | 15144 | Janice | Boyle | 15144 | USD | 27.99 | 0.00 | 27.99 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.459802e+09 | authorize.net | 27.99 | 8133377121 | 08674Z | 5.0 | 1.459802e+09 | 2016-04-04 15:36:45 | 2015-11-05 07:18:20 | 17709.0 | 782.0 | Bed Assist Bar with Storage Pocket | 1.0 | 27.99 | 29.160000 | -1.170000 |
3542 | 15145 | Janice | Boyle | 15145 | USD | 27.99 | 0.00 | 25.19 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.459803e+09 | authorize.net | 25.19 | 8133402033 | 08307Z | 4.0 | 1.459803e+09 | 2016-04-04 14:09:07 | 2015-11-05 07:18:20 | 17710.0 | 782.0 | Bed Assist Bar with Storage Pocket | 1.0 | 27.99 | 29.160000 | -1.170000 |
3583 | 15260 | Wesley | Chalker Jr | 15260 | USD | 19.98 | 0.00 | 19.98 | fedex | 11|Ground | 4.83833E+14 | 3.0 | 1.460044e+09 | authorize.net | 19.98 | 8142281323 | 7994 | 1.0 | 1.460044e+09 | 2016-04-11 13:23:14 | 2016-04-11 13:23:14 | 17832.0 | 1454.0 | Sterile Bordered Gauze | 2.0 | 9.99 | 10.050000 | -0.030000 |
3584 | 16364 | Wesley | Chalker Jr | 16364 | USD | 19.98 | 0.00 | 19.98 | fedex | 11|Ground | 9.4055E+21 | 3.0 | 1.462125e+09 | authorize.net | 19.98 | 8366965020 | 1079 | 1.0 | 1.462125e+09 | 2016-05-02 11:44:04 | 2016-05-02 11:44:04 | 19040.0 | 1454.0 | Sterile Bordered Gauze | 2.0 | 9.99 | 10.050000 | -0.030000 |
3615 | 15376 | CLARENCE | POLLARD | 15376 | USD | 55.98 | 0.00 | 50.98 | fedex | 11|Ground | 1.49935E+13 | 3.0 | 1.460245e+09 | paypal | 50.98 | 7WL48849TP228631L | 02708Z | 1.0 | 1.460245e+09 | 2016-04-12 08:56:07 | 2016-04-12 08:56:07 | 17962.0 | 782.0 | Bed Assist Bar with Storage Pocket | 2.0 | 27.99 | 29.160000 | -0.585000 |
3618 | 15393 | Barbara | James | 15393 | USD | 58.99 | 0.00 | 58.99 | fedex | 11|Ground | 7.76076E+11 | 3.0 | 1.460294e+09 | authorize.net | 58.99 | 8303422727 | 82115 | 1.0 | 1.460294e+09 | 2016-04-13 11:04:41 | 2016-04-11 12:01:51 | 17981.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
3619 | 15394 | WARREN | SEDRAN | 15394 | USD | 27.99 | 0.00 | 25.19 | fedex | 11|Ground | 4.83833E+14 | 3.0 | 1.460297e+09 | authorize.net | 25.19 | 8303477615 | 600479 | 1.0 | 1.460297e+09 | 2016-04-12 08:58:35 | 2016-04-12 08:58:35 | 17982.0 | 782.0 | Bed Assist Bar with Storage Pocket | 1.0 | 27.99 | 29.160000 | -1.170000 |
3719 | 15780 | Lenny | Shenall | 15780 | USD | 58.99 | 9.95 | 64.08 | fedex | 11|Ground | 7.76132E+11 | 3.0 | 1.461005e+09 | authorize.net | 64.08 | 8328252812 | 21864 | 1.0 | 1.461005e+09 | 2016-04-18 11:57:28 | 2016-04-18 11:57:28 | 18396.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
3731 | 15844 | Dolores | Gosnell | 15844 | USD | 27.99 | 9.95 | 27.99 | fedex | 11|Ground | 9.53516E+14 | 3.0 | 1.461098e+09 | authorize.net | 27.99 | 8332005134 | 01990R | 1.0 | 1.461098e+09 | 2016-05-04 11:36:46 | 2016-05-04 11:36:46 | 18462.0 | 782.0 | Medline Bed Assist Bar | 1.0 | 27.99 | 29.160000 | -1.170000 |
3752 | 15899 | Lee | Gerstenhaber | 15899 | USD | 34.99 | 0.00 | 34.99 | fedex | 11|Ground | 9.4055E+21 | 3.0 | 1.461190e+09 | paypal | 34.99 | 2HB74261TB057390A | 02708Z | 1.0 | 1.461190e+09 | 2016-04-21 10:51:00 | 2016-04-21 10:51:00 | 18519.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
3766 | 15943 | Jan | Kripzer | 15943 | USD | 34.99 | 9.95 | 34.99 | fedex | 11|Ground | 9.4055E+19 | 3.0 | 1.461274e+09 | authorize.net | 34.99 | 8338712178 | 06826D | 1.0 | 1.461274e+09 | 2016-04-21 14:27:09 | 2016-04-21 14:27:09 | 18566.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
3774 | 15970 | Christopher | Zanini | 15970 | USD | 27.99 | 0.00 | 25.19 | fedex | 11|Ground | 1.02633E+14 | 3.0 | 1.461604e+09 | authorize.net | 25.19 | 8341795724 | 275057 | 1.0 | 1.461356e+09 | 2016-04-25 19:21:18 | 2016-04-25 18:19:10 | 18593.0 | 782.0 | Medline Bed Assist Bar | 1.0 | 27.99 | 29.160000 | -1.170000 |
3779 | 15992 | Gail | Reel | 15992 | USD | 34.99 | 0.00 | 34.99 | fedex | 11|Ground | 9.4055E+21 | 3.0 | 1.461605e+09 | authorize.net | 34.99 | 8343113413 | 57116 | 1.0 | 1.461407e+09 | 2016-04-25 10:25:23 | 2016-04-25 10:25:23 | 18616.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
3789 | 16005 | ANNE MARIE | DODERO | 16005 | USD | 27.99 | 0.00 | 27.99 | fedex | 11|Ground | 7.96468E+14 | 3.0 | 1.461604e+09 | authorize.net | 27.99 | 8344004179 | 01862A | 1.0 | 1.461441e+09 | 2016-04-25 18:51:06 | 2016-04-25 17:47:06 | 18637.0 | 782.0 | Medline Bed Assist Bar | 1.0 | 27.99 | 29.160000 | -1.170000 |
3813 | 16068 | betty | montesi | 16068 | USD | 58.99 | 0.00 | 58.99 | fedex | 11|Ground | 7.76208E+11 | 3.0 | 1.461782e+09 | authorize.net | 58.99 | 8347372581 | 09654D | 1.0 | 1.461593e+09 | 2016-04-27 11:29:54 | 2016-04-27 11:29:54 | 18704.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
8237 | 16204 | Jill | OGorman | 16204 | USD | 34.99 | 0.00 | 34.99 | fedex | 11|Ground | 9.4055E+21 | 3.0 | 1.461881e+09 | authorize.net | 34.99 | 8356224503 | 825606 | 1.0 | 1.461810e+09 | 2016-05-02 11:29:47 | 2016-05-02 11:29:47 | 18863.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
8274 | 16335 | Casandra | McMorries | 16335 | USD | 58.99 | 0.00 | 58.99 | fedex | 11|Ground | 7.76248E+11 | 3.0 | 1.462133e+09 | authorize.net | 58.99 | 8364716076 | 19481 | 1.0 | 1.462054e+09 | 2016-05-03 10:46:39 | 2016-05-03 10:46:39 | 19010.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
8307 | 16414 | Douglas | Monsoor | 16414 | USD | 104.97 | 9.95 | 104.97 | fedex | 11|Ground | 7.76253E+11 | 3.0 | 1.462217e+09 | authorize.net | 104.97 | 8370811515 | 01110C | 1.0 | 1.462217e+09 | 2016-05-03 11:58:54 | 2016-05-03 11:58:54 | 19092.0 | 17051.0 | Bottom Buddy toilet tissue aid | 3.0 | 34.99 | 40.500000 | -1.836667 |
8353 | 16541 | Lester | Curnow | 16541 | USD | 14.85 | 0.00 | 14.85 | fedex | 11|Ground | 9.36129E+21 | 3.0 | 1.462373e+09 | authorize.net | 14.85 | 8376930603 | 286988 | 1.0 | 1.462373e+09 | 2016-05-05 09:21:11 | 2016-05-05 09:21:11 | 19227.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
8359 | 16558 | Tom | Beatty | 16558 | USD | 27.99 | 0.00 | 27.99 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462391e+09 | authorize.net | 27.99 | 8378075808 | 05297C | 4.0 | 1.462391e+09 | 2016-05-04 12:37:07 | 2015-11-05 07:18:20 | 19245.0 | 782.0 | Medline Bed Assist Bar | 1.0 | 27.99 | 28.160000 | -0.170000 |
8365 | 16597 | Patricia | Kotsenas | 16597 | USD | 64.80 | 9.95 | 74.75 | fedex | 11|Ground | 2.27196E+13 | 3.0 | 1.462409e+09 | authorize.net | 74.75 | 8378981908 | 02630B | 1.0 | 1.462409e+09 | 2016-05-05 11:11:21 | 2016-05-05 10:07:32 | 19285.0 | 1025.0 | Steel Bariatric Commode | 1.0 | 64.80 | 92.000000 | -27.200000 |
8384 | 16693 | Rebecca | Griffiths | 16693 | USD | 58.99 | 0.00 | 58.99 | fedex | 11|Ground | 575-642-8976 | 3.0 | 1.462488e+09 | authorize.net | 58.99 | 8382102271 | 586726 | 1.0 | 1.462488e+09 | 2016-05-06 06:31:57 | 2016-05-06 06:31:57 | 19387.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
8392 | 16744 | June | Rudner | 16744 | USD | 27.99 | 0.00 | 27.99 | fedex | 11|Ground | 4.8383E+14 | 3.0 | 1.462540e+09 | authorize.net | 27.99 | 8383400273 | 00615Z | 1.0 | 1.462540e+09 | 2016-05-06 14:47:12 | 2016-05-06 13:43:12 | 19439.0 | 782.0 | Medline Bed Assist Bar | 1.0 | 27.99 | 28.160000 | -0.170000 |
8394 | 16746 | FRANK | SOPINSKI | 16746 | USD | 27.99 | 0.00 | 27.99 | fedex | 11|Ground | 6.40791E+13 | 3.0 | 1.462543e+09 | authorize.net | 27.99 | 8383552971 | 06375C | 1.0 | 1.462543e+09 | 2016-05-06 16:47:02 | 2016-05-06 15:40:44 | 19441.0 | 782.0 | Medline Bed Assist Bar | 1.0 | 27.99 | 28.160000 | -0.170000 |
8395 | 16747 | michael | shotts | 16747 | USD | 74.99 | 9.95 | 77.44 | fedex | 11|Ground | 8.3565E+14 | 3.0 | 1.462544e+09 | authorize.net | 77.44 | 8383594730 | 18242 | 1.0 | 1.462544e+09 | 2016-05-06 13:17:19 | 2016-05-06 12:14:14 | 19442.0 | 426.0 | Padded Transfer Benches | 1.0 | 74.99 | 77.000000 | -2.010000 |
8396 | 16749 | Holly | Scotchel | 16749 | USD | 58.99 | 0.00 | 58.99 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462545e+09 | authorize.net | 58.99 | 8383648902 | 01683C | 5.0 | 1.462545e+09 | 2016-05-06 07:39:00 | 2015-11-05 07:18:20 | 19444.0 | 1841.0 | Emesis Bags, Blue, 36.000 OZ | 1.0 | 58.99 | 88.500000 | -29.510000 |
8402 | 16753 | Harold | Roberts | 16753 | USD | 27.99 | 0.00 | 27.99 | fedex | 11|Ground | 8.3565E+14 | 3.0 | 1.462546e+09 | authorize.net | 27.99 | 8383741325 | 5242 | 1.0 | 1.462546e+09 | 2016-05-06 15:18:01 | 2016-05-06 14:13:22 | 19452.0 | 782.0 | Medline Bed Assist Bar | 1.0 | 27.99 | 28.160000 | -0.170000 |
8428 | 16962 | Arthur | Royer | 16962 | USD | 44.00 | 9.95 | 53.95 | manual | 0|Standard Shipping | 6.87398E+14 | 3.0 | 1.462804e+09 | authorize.net | 24.95 | 8390592894 | 909003 | 5.0 | 1.462804e+09 | 2016-05-09 12:56:09 | 2016-05-09 12:55:41 | 19674.0 | 1842.0 | BUCKET, REPLACEMENT FOR MDS89668XW | 1.0 | 44.00 | 51.112718 | -7.112718 |
8434 | 16981 | Rebecca | Schnepf | 16981 | USD | 34.99 | 0.00 | 34.99 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462823e+09 | authorize.net | 34.99 | 8391980366 | 43453 | 0.0 | 1.462823e+09 | 2016-05-10 14:08:06 | 2015-11-05 07:18:20 | 19696.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
8435 | 16987 | Patricia | Sleeper | 16987 | USD | 14.85 | 0.00 | 14.85 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462824e+09 | authorize.net | 14.85 | 8392053758 | 609572 | 5.0 | 1.462824e+09 | 2016-05-09 13:39:33 | 2015-11-05 07:18:20 | 19703.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
8448 | 17060 | Nick | Dziurkowski | 17060 | USD | 14.85 | 0.00 | 14.85 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462907e+09 | authorize.net | 14.85 | 8395155339 | 13223 | 5.0 | 1.462907e+09 | 2016-05-13 07:51:23 | 2015-11-05 07:18:20 | 19782.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
8456 | 17071 | Marvin | Rahman | 17071 | USD | 14.85 | 0.00 | 14.85 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462911e+09 | authorize.net | 14.85 | 8395421530 | 157796 | 5.0 | 1.462911e+09 | 2016-05-13 07:54:32 | 2015-11-05 07:18:20 | 19798.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
8457 | 17072 | Bethel | Jones | 17072 | USD | 14.85 | 9.95 | 14.85 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.462912e+09 | authorize.net | 14.85 | 8395506459 | 01038R | 5.0 | 1.462912e+09 | 2016-05-13 07:58:05 | 2015-11-05 07:18:20 | 19799.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
8470 | 17115 | Barbara | Hawkins | 17115 | USD | 34.99 | 0.00 | 31.49 | fedex | 11|Ground | 9.4055E+21 | 3.0 | 1.462941e+09 | authorize.net | 31.49 | 8396533738 | 10541 | 1.0 | 1.462941e+09 | 2016-05-11 06:59:29 | 2016-05-11 06:59:29 | 19846.0 | 17051.0 | Bottom Buddy toilet tissue aid | 1.0 | 34.99 | 40.500000 | -5.510000 |
8504 | 17206 | Gail | Yarbrough | 17206 | USD | 14.85 | 0.00 | 14.85 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.463064e+09 | authorize.net | 14.85 | 8400367859 | 601427 | 5.0 | 1.463064e+09 | 2016-05-13 08:00:03 | 2015-11-05 07:18:20 | 19948.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
8522 | 17271 | Julia | Gearhart | 17271 | USD | 29.70 | 0.00 | 29.70 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.463160e+09 | authorize.net | 29.70 | 8404223306 | 289728 | 0.0 | 1.463160e+09 | 2015-08-06 06:40:10 | 2015-11-05 07:18:20 | 20020.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 2.0 | 14.85 | 15.500000 | -0.325000 |
8567 | 17421 | Nora | Fontana | 17421 | USD | 14.85 | 0.00 | 16.06 | fedex | 11|Ground | 6.87398E+14 | 3.0 | 1.463409e+09 | paypal | 16.06 | 16706988SV3261147 | 02708Z | 0.0 | 1.463409e+09 | 2015-08-06 06:40:10 | 2015-11-05 07:18:20 | 20186.0 | 1842.0 | Medline Emesis/Barf Bags, throw up bags, Blue ... | 1.0 | 14.85 | 15.500000 | -0.650000 |
We can drill down and look at things like which shipping methods we use most. Maybe we can cut a deal with FedEx or whichever carrier we’re doing the most business with. We can also see how much free shipping is happening, since that may affect our bottom line.
orders['shipping_method'].value_counts()
11|Ground                    2549
0|Standard Shipping           697
-1|Free Shipping              220
0|Free Shipping                86
NATIONAL DELIVERY               2
HOWARD’S EXPRESS, INC.          2
WILSON TRUCKING CORP            2
NATIONAL DELIVERY SYSTEMS       1
MEDTRANS                        1
SOUTHWESTERN MOTOR TRNAS        1
ROADWAY EXPRESS                 1
FEDERAL EXPRESS CORP.           1
INTERNATIONAL FEDEX             1
LAND AIR OF NEW ENGLAND         1
Name: shipping_method, dtype: int64
Now that this table is all cleaned up, we can move on to the final table.
Products Table
As before, we’ll drop the duplicates and the null values (4 rows this time), then do the feature transformations.
products.drop_duplicates(subset= 'id', inplace = True)
products.dropna(inplace = True)
#Feature Transformations
products['last_modified'] = products['last_modified'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
products['import_id'] = products['import_id'].astype(str)
products['vendor'] = products['vendor'].astype(str)
At this point, we can use NLP to analyze our products, see which ones are most popular, and group them into the types of products that give us the most profit. Again, this is beyond the scope of this specific project, which is to create a basic customer segmentation and a RFM table (eyes on the prize).
Writing to SQL Database
In this basic example, I will be using sqlite3, although I could use SQLAlchemy or one of the other million packages that interface between Python and SQL.
#Create Connection
conn = sqlite3.connect('SQL Server Filepath\\.db')
#Customers table
customers.to_sql('customers', conn, if_exists='replace', index=True)
#Orders table
orders.to_sql('orders', conn, if_exists='replace', index=True)
#Products table
products.to_sql('products', conn, if_exists='replace', index=True)
conn.close()
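As a quick sanity check, we could read one of the tables straight back out; a minimal sketch, reusing the same placeholder path from above:
#Re-open the connection and pull a few rows back to confirm the write succeeded
conn = sqlite3.connect('SQL Server Filepath\\.db')
check = pd.read_sql('SELECT * FROM customers LIMIT 5', conn)
conn.close()
print(check)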
Groupby Analysis
With our first task behind us, we can start to see the finish line in the distance... a bit far in the distance, but we can see it, and the fun bit is still ahead! Since we didn’t really do a deep dive into the products table, we’ll mostly be looking at our customer and order information.
#Make a FullName variable to merge on easily
#People can share the same first or last name, but rarely share both
customers['FullName'] = customers['fname'] + customers['lname']
orders['FullName'] = orders['fname'] + orders['lname']
#Create the new DataFrame
df2 = pd.merge(customers, orders, how= 'outer', on='FullName')
#Drop the redundant columns
df2.drop(['fname_x', 'lname_x', 'fname_y', 'lname_y'], axis = 1, inplace = True)
Let’s take another snapshot of our customer buying habits, similar to before, but now we have the profit column as well.
#Let's look at a quick snapshot of customer buying habits
df2_items = df2.groupby('FullName')[['Items.qty', 'Items.price', 'Items.cost', 'profit']].sum()
df2_items.sample(20)
Customer buying habits
FullName | Items.qty | Items.price | Items.cost | profit |
---|---|---|---|---|
DavidCislo | 1.0 | 26.09 | 17.99 | 8.100 |
StevenAlexander | 1.0 | 39.35 | 28.11 | 11.240 |
DarrylMiller | 1.0 | 69.17 | 53.21 | 15.960 |
olgapasko | 2.0 | 9.99 | 6.36 | 1.815 |
GarveyStenersen | 1.0 | 160.04 | 128.03 | 32.010 |
CherylMorgan | 1.0 | 60.92 | 46.50 | 14.420 |
sueminier | 2.0 | 16.23 | 6.49 | 4.870 |
RobertPurnick | 1.0 | 114.99 | 90.99 | 24.000 |
KimFreudenberger | 1.0 | 25.00 | 17.24 | 7.760 |
HarlanJuster | 3.0 | 47.99 | 44.00 | 1.330 |
donnahoey | 1.0 | 151.94 | 121.55 | 30.390 |
DanielFoster | 1.0 | 57.19 | 43.99 | 13.200 |
AntlanticSurgicenter | 1.0 | 85.67 | 66.93 | 18.740 |
VendelCsaszar | 1.0 | 35.49 | 35.49 | 0.000 |
MarkForte | 2.0 | 36.82 | 21.04 | 15.780 |
Mary Allbright | 1.0 | 116.99 | 103.00 | 13.990 |
SchivonneBishop | 1.0 | 46.37 | 33.60 | 12.770 |
BarryOxford | 1.0 | 34.27 | 24.48 | 9.790 |
TanyaGlover | 2.0 | 12.78 | 7.30 | 2.740 |
DavidKim | 1.0 | 58.36 | 44.55 | 13.810 |
And we can take a look at the shipping costs in case that’s something we want to analyze later on.
df2_order = df2.groupby('FullName')[['subtotal', 'shipping', 'total']].sum()
df2_order.sample(20)
Shipping Costs
FullName | subtotal | shipping | total |
---|---|---|---|
SteveGonzalez | 19.45 | 9.95 | 29.40 |
HelenBlair | 32.63 | 9.95 | 42.58 |
adamhenninger | 4.88 | 9.95 | 15.26 |
MarkDowling | 121.36 | 9.95 | 119.17 |
SusanHixson | 13.99 | 0.00 | 13.99 |
RonHutchison | 37.99 | 6.73 | 37.99 |
AleksandraBaeva | 270.66 | 0.00 | 270.66 |
DEBBIETRENT | 171.91 | 9.95 | 164.67 |
douglasperry | 112.87 | 7.17 | 101.58 |
renee Pallotta | 24.40 | 9.95 | 35.30 |
PurleyNewson | 20.77 | 9.95 | 30.72 |
Jannett Stewart | 68.89 | 0.00 | 68.89 |
ChristaDedebant | 747.23 | 29.85 | 777.08 |
DianaPaulsen | 63.70 | 9.95 | 73.65 |
T. JohnThomas | 23.27 | 9.95 | 33.22 |
Donald Ellis | 57.19 | 0.00 | 56.12 |
PatriciaSleeper | 14.85 | 0.00 | 14.85 |
RHONDABRYAN | 13.13 | 9.95 | 21.77 |
JoseAlva | 66.94 | 9.95 | 76.89 |
HimanshuMisra | 24.99 | 0.00 | 24.99 |
Now, we’ll create the ‘recency’ aspect of our Recency-Frequency-Monetary table for later.
df3 = df2.copy()
df3.dropna(inplace=True)
df3['placed_date'] = df3['placed_date'].transform(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
#Newest orders
df_order_new = df3.groupby('id_x')['placed_date'].max()
#Oldest orders
df_order_old = df3.groupby('id_x')['placed_date'].min()
df_order_old = pd.DataFrame(df_order_old)
df_order_new = pd.DataFrame(df_order_new)
df_order = pd.DataFrame()
df_order[['first_order', 'recent_order']] = pd.merge(df_order_old, df_order_new, left_on = 'id_x', right_on= 'id_x')
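As an aside, newer pandas (0.25+) can build the same first/recent order table in a single step with named aggregation; a sketch:
#One-step equivalent of the min/max groupbys above
df_order = df3.groupby('id_x')['placed_date'].agg(
    first_order = 'min',   #oldest order per customer
    recent_order = 'max',  #newest order per customer
)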
Now that we’ve done a good bit of feature engineering and analysis, it’s time to do the actual modeling and segmentation.
Modeling
The first thing we’ll use is PCA, which reduces the number of features so that we can cluster. PCA can make it difficult to determine WHY the customers are segmented the way they are, so we will also segment based on profit and recency, then compare the two methods.
PCA
#PCA only works on numerical datatypes, so we'll select those out
numerical_features = df2.select_dtypes(exclude='object').columns
pca_df = df2[numerical_features].dropna()  #dropping nulls on the copy avoids a chained-assignment warning
#Create the Principal Components
pca_2 = PCA(n_components=2).fit(pca_df)
#Most of the variance in our features can be explained with 1 PC
#We will include 2 for ease of visualization
print(pca_2.explained_variance_ratio_)
Now, we’ll create a DataFrame of the principal components and add it to our df2 for later use.
#Create a PC DataFrame
principalcomponents = pca_2.fit_transform(pca_df)
principal_df = pd.DataFrame(data = principalcomponents, columns = ['principal component 1', 'principal component 2'])
df2.dropna(inplace = True)
#Make sure they're the same shape so they merge properly
principal_df.shape, df2.shape
((3569, 2), (3569, 31))
df2 = pd.concat([df2, principal_df], axis = 1)
Clustering with KMeans
Since clustering works best on low-dimensional data, we could cluster directly on the principal components. As stated before, though, PCA can make it difficult to know WHY the clusters come out the way they do, so we will also create a table to cluster on profit and recency of the previous purchase. Other variables we could cluster on include shipping type, which vendor the customer used, and the types of products (if we used NLP to group product types).
Clustering based on Profit
df_cluster = df2.groupby('id_x')[['profit', 'placed_date']].agg({'profit': 'sum', 'placed_date': 'max'})
df_cluster.head()
df_cluster.dropna(inplace = True)
X = pd.DataFrame(df_cluster)
#Start off with a large number of clusters to analyze the optimization
K = KMeans(n_clusters = 9).fit(X)
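One caveat worth flagging before we tune anything: KMeans is distance-based, and our two features live on very different scales (profit in dollars, placed_date in epoch seconds), so the date will dominate the distance calculations. We’ll stick with the raw values below to stay close to the original analysis, but a standardized run would look something like this sketch:
from sklearn.preprocessing import StandardScaler
#Scale each feature to zero mean / unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)
K_scaled = KMeans(n_clusters = 9).fit(X_scaled)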
With our clusterer fit, it’s time to determine the optimal number of groups for our customers. We will use yellowbrick’s clustering visualizers to help out with this. Silhouette scores range from -1 to 1, with values closer to 1 being better.
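The elbow/score graphs themselves didn’t survive into this writeup, but here is a sketch of how they could be generated with yellowbrick, covering the three metrics we said we’d evaluate with (assumes yellowbrick is installed):
from yellowbrick.cluster import KElbowVisualizer
#Plot score vs. number of clusters for each of our three evaluation metrics
for metric in ['distortion', 'silhouette', 'calinski_harabasz']:
    visualizer = KElbowVisualizer(KMeans(), k = (2, 10), metric = metric)
    visualizer.fit(X)    #fit a KMeans model for each k and score it
    visualizer.show()    #render the curve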
According to the graphs, 4 or 5 clusters would be best. We’ll choose 4 for now.
K = KMeans(n_clusters = 4).fit(X)
silhouette_score(X, K.labels_)
0.6309803241979693
#Now we'll add the cluster labels to our final DataFrame that we'll use for reporting
df_cluster['cluster'] = K.labels_
df_cluster.head()
id_x | profit | placed_date | cluster |
---|---|---|---|
3.0 | 17.0625 | 1.449604e+09 | 0 |
4.0 | 6.9400 | 1.386780e+09 | 2 |
5.0 | 29.6800 | 1.441905e+09 | 3 |
7.0 | 11.2000 | 1.388156e+09 | 2 |
8.0 | 13.7900 | 1.389303e+09 | 2 |
Clustering based on PCA
We will follow the same process to determine cluster labels using PCA.
#Start off with a large number of clusters to analyze the optimization
K = KMeans(n_clusters = 7).fit(principal_df)
According to the graphs, 3 clusters is best for the PCA model.
K = KMeans(n_clusters = 3).fit(principal_df)
silhouette_score(principal_df, K.labels_)
0.6402889003646225
We can see that the silhouette score for the PCA model is SLIGHTLY better than the previous model, but the PCA clusters may not be able to give us actionable information.
RFM Table
Our final task: create the RFM table to report to the boss. We will include the clusters as well, so they know who to reach out to with certain promotions.
df_cluster = pd.merge(df_cluster, df_order, left_on= 'id_x', right_on= 'id_x')
df_cluster.head(30)
RFM Table
id_x | profit | placed_date | cluster | first_order | recent_order |
---|---|---|---|---|---|
3.0 | 17.062500 | 1.449604e+09 | 0 | 2013-12-03 09:07:35 | 2015-12-08 11:40:52 |
4.0 | 6.940000 | 1.386780e+09 | 2 | 2013-12-11 08:44:23 | 2013-12-11 08:44:23 |
5.0 | 29.680000 | 1.441905e+09 | 3 | 2013-12-12 07:19:59 | 2015-09-10 10:07:59 |
7.0 | 11.200000 | 1.388156e+09 | 2 | 2013-12-27 06:52:27 | 2013-12-27 06:52:27 |
8.0 | 13.790000 | 1.389303e+09 | 2 | 2014-01-09 13:33:36 | 2014-01-09 13:33:36 |
10.0 | 0.000000 | 1.390510e+09 | 2 | 2014-01-23 12:38:36 | 2014-01-23 12:38:36 |
11.0 | 0.000000 | 1.390573e+09 | 2 | 2014-01-24 06:19:15 | 2014-01-24 06:19:15 |
12.0 | 3.970000 | 1.390613e+09 | 2 | 2014-01-24 17:30:19 | 2014-01-24 17:30:19 |
13.0 | 17.260000 | 1.424206e+09 | 1 | 2014-01-27 12:54:01 | 2015-02-17 12:46:29 |
14.0 | 0.000000 | 1.391012e+09 | 2 | 2014-01-29 08:13:37 | 2014-01-29 08:13:37 |
15.0 | 7.620000 | 1.391363e+09 | 2 | 2014-02-02 09:43:24 | 2014-02-02 09:43:24 |
21.0 | 0.000000 | 1.391619e+09 | 2 | 2014-02-05 08:45:32 | 2014-02-05 08:45:32 |
22.0 | 2.565000 | 1.391815e+09 | 2 | 2014-02-07 15:17:52 | 2014-02-07 15:17:52 |
23.0 | 19.850000 | 1.420601e+09 | 1 | 2014-02-08 14:01:19 | 2015-01-06 19:20:00 |
24.0 | 9.120000 | 1.392048e+09 | 2 | 2014-02-10 08:07:54 | 2014-02-10 08:07:54 |
25.0 | 0.000000 | 1.392049e+09 | 2 | 2014-02-10 08:10:47 | 2014-02-10 08:10:47 |
26.0 | 0.000000 | 1.392054e+09 | 2 | 2014-02-10 09:38:57 | 2014-02-10 09:38:57 |
27.0 | 93.870000 | 1.404227e+09 | 2 | 2014-02-11 07:46:40 | 2014-07-01 07:57:15 |
30.0 | 0.000000 | 1.392312e+09 | 2 | 2014-02-13 09:17:08 | 2014-02-13 09:17:08 |
31.0 | 4.366667 | 1.392393e+09 | 2 | 2014-02-14 07:43:48 | 2014-02-14 07:43:48 |
32.0 | 10.310000 | 1.392394e+09 | 2 | 2014-02-14 08:08:30 | 2014-02-14 08:08:30 |
33.0 | 3.660000 | 1.392485e+09 | 2 | 2014-02-15 09:24:01 | 2014-02-15 09:24:01 |
35.0 | 6.110000 | 1.393012e+09 | 2 | 2014-02-21 11:49:50 | 2014-02-21 11:49:50 |
37.0 | 21.910000 | 1.393253e+09 | 2 | 2014-02-24 06:40:55 | 2014-02-24 06:40:55 |
39.0 | 6.740000 | 1.393370e+09 | 2 | 2014-02-25 15:13:24 | 2014-02-25 15:13:24 |
40.0 | 8.925000 | 1.393524e+09 | 2 | 2014-02-27 10:01:14 | 2014-02-27 10:01:14 |
41.0 | 20.400000 | 1.393537e+09 | 2 | 2014-02-27 13:40:19 | 2014-02-27 13:40:19 |
42.0 | 0.720000 | 1.393681e+09 | 2 | 2014-03-01 05:37:14 | 2014-03-01 05:37:14 |
43.0 | 11.200000 | 1.393712e+09 | 2 | 2014-03-01 14:12:25 | 2014-03-01 14:12:25 |
44.0 | 9.310000 | 1.393862e+09 | 2 | 2014-03-03 08:01:37 | 2014-03-03 08:01:37 |
Here we have a table grouped by customer ID that shows the profit we’ve earned from each customer, the dates of their first and most recent orders, and their cluster number. The clusters can be analyzed further to determine which type of product/offer should be pushed to each group.
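One gap worth acknowledging before wrapping up: this table covers the recency and monetary pieces of RFM, but not frequency. A sketch of how an order count per customer could be bolted on (assuming df2 still holds one row per order, with the order ID in the merged id_y column and df_cluster still indexed by id_x):
#Count distinct orders per customer as the 'frequency' component
frequency = df2.groupby('id_x')['id_y'].nunique().rename('frequency')
df_cluster = df_cluster.join(frequency)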
That’s all for now! As you can tell, there is a lot more that could be done with this data, but due to time and objective constraints I’ve limited the project to what’s done here. If the boss wants us to dive deeper using NLP on our product descriptions, or analyze the clusters more (or differently), that can be done in further iterations of the project.
Thank you for taking the time to read through this project. If you have any questions, please feel free to reach out to me via email!