Post

Cohort Analysis

Cohort Analysis
FeatureTypeDescription
InvoiceNoobject송장번호
StockCodeobject각 상품(아이템)별로 부여된 5자리 정수번호
Descriptionobject상품명
Quantityint64각 상품의 수량
InvoiceDatedatetime64거래가 발생한 날짜 및 시간 (2010-12-01 ~ 2011-12-09)
UnitPricefloat64단위당 상품 가격(영국 파운드 기준)
CustomerIDfloat64각 고객에게 부여된 5자리 정수번호
Countrycategory해당 거래 고객이 거주하는 나라 이름
  • 분석 내용
    • Cohort 분석을 통해 고객 유지율(Customer Retenntion)을 분석하고 시각화
    • Cohort Analysis : 시간 흐름에 따라 고객 집단(Cohort)의 행동 변화나 잔존율(Retention)을 추적・분석하는 기법
    • 고객이 언제 유입되었는지 기준으로 그룹화하여 시간이 지남에 따라 얼마나 오랬동안 활성화 상태를 유지하는지 파악하는 것을 목표
# Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

I. Load Data

# Load Data
df = pd.read_excel("./Data/Cohort_1/Online Retail.xlsx")
df.info()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

II. Data Preprocessing

II-I. 고객 ID (CustomerID)가 NaN인 행 제거

df.dropna(subset = ['CustomerID'], axis = 0, inplace = True)
df
 InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850.0United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850.0United Kingdom
54190458158722613PACK OF 20 SPACEBOY NAPKINS122011-12-09 12:50:000.8512680.0France
54190558158722899CHILDREN’S APRON DOLLY GIRL62011-12-09 12:50:002.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011-12-09 12:50:004.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011-12-09 12:50:004.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011-12-09 12:50:004.9512680.0France

406829 rows × 8 columns

  • result
    • 541909 rows에서 CutomerID가 비어있는 행이 제거되고 406829 rows만 남음

II-II. 구매한 날짜(월) 추출

  • 기준이 월이기 때문에 day는 1로 통일한다.
import datetime as dt

# function for Date(year, month, day)
def get_date(x):
    return dt.datetime(x.year, x.month, 1)

df['InvoiceMonth'] = df['InvoiceDate'].apply(get_date)
df
 InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryInvoiceMonth
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom2010-12-01
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850.0United Kingdom2010-12-01
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850.0United Kingdom2010-12-01
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850.0United Kingdom2010-12-01
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850.0United Kingdom2010-12-01
54190458158722613PACK OF 20 SPACEBOY NAPKINS122011-12-09 12:50:000.8512680.0France2011-12-01
54190558158722899CHILDREN’S APRON DOLLY GIRL62011-12-09 12:50:002.1012680.0France2011-12-01
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011-12-09 12:50:004.1512680.0France2011-12-01
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011-12-09 12:50:004.1512680.0France2011-12-01
54190858158722138BAKING SET 9 PIECE RETROSPOT32011-12-09 12:50:004.9512680.0France2011-12-01

406829 rows × 9 columns

II-III. 고객 별 처음 구매한 날짜 추출

  • Why? 처음 구매한 월을 추출하여 얼마나 고객이 오랫동안 유지되는지 알아보기 위함
  • 고객 ID (CustomerID)별 처음 구매한 날짜를 추가
df['Cohort First Month'] = df.groupby('CustomerID')['InvoiceMonth'].transform('min')
df
 InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryInvoiceMonthCohort First Month
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom2010-12-012010-12-01
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850.0United Kingdom2010-12-012010-12-01
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850.0United Kingdom2010-12-012010-12-01
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850.0United Kingdom2010-12-012010-12-01
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850.0United Kingdom2010-12-012010-12-01
54190458158722613PACK OF 20 SPACEBOY NAPKINS122011-12-09 12:50:000.8512680.0France2011-12-012011-08-01
54190558158722899CHILDREN’S APRON DOLLY GIRL62011-12-09 12:50:002.1012680.0France2011-12-012011-08-01
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011-12-09 12:50:004.1512680.0France2011-12-012011-08-01
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011-12-09 12:50:004.1512680.0France2011-12-012011-08-01
54190858158722138BAKING SET 9 PIECE RETROSPOT32011-12-09 12:50:004.9512680.0France2011-12-012011-08-01

406829 rows × 10 columns

II-IV. 고객이 유지된 날짜 변수 생성

  • 월 기준으로 생성
  • InoviceDate와 Cohort First Month의 차이는 고객이 유지된 날짜이다.
# Function 날짜의 year, month, day 추출 
def get_date_elements(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

# 적용
Invoice_year, Invoice_month, _ = get_date_elements(df, 'InvoiceMonth')
Cohort_year, Cohort_month, _ = get_date_elements(df, 'Cohort First Month')
# 사람들이 처음 구매 후 활성화된 기간
year_diff = Invoice_year - Cohort_year
month_diff = Invoice_month - Cohort_month
df['Cohort Retention Period'] = year_diff * 12 + month_diff + 1
df.tail()
 InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryInvoiceMonthCohort First MonthCohort Retention Period
54190458158722613PACK OF 20 SPACEBOY NAPKINS122011-12-09 12:50:000.8512680.0France2011-12-012011-08-015
54190558158722899CHILDREN’S APRON DOLLY GIRL62011-12-09 12:50:002.1012680.0France2011-12-012011-08-015
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011-12-09 12:50:004.1512680.0France2011-12-012011-08-015
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011-12-09 12:50:004.1512680.0France2011-12-012011-08-015
54190858158722138BAKING SET 9 PIECE RETROSPOT32011-12-09 12:50:004.9512680.0France2011-12-012011-08-015

[!Caution] - Cohort Retention Period에 +1을 하는 이유

  • 코호트 분석에서는 “고객이 코호트에 처음 유입된 달을 기준으로 몇 개월이 경과했는가”를 계산합니다.
  • 그런데, 고객이 유입된 같은 달에 다시 구매했다면 경과 개월 수는 0개월이 되어버립니다.
  • 분석 관점에서는 이 ’0개월차’를 고객 활동의 첫 번째 달(Month 1)로 간주하는 것이 자연스럽습니다.
  • 따라서, 0개월차 → 1개월차로 맞추기 위해 Cohort Retention Period에 +1을 더해줍니다.

III. Visualization

# 실제 활동한 고유 고객의 수를 계산
cohort_data = df.groupby(['Cohort First Month', 'Cohort Retention Period'])['CustomerID'].apply(pd.Series.nunique).reset_index()
cohort_data
 Cohort First MonthCohort Retention PeriodCustomerID
02010-12-011948
12010-12-012362
22010-12-013317
32010-12-014367
42010-12-015341
862011-10-01293
872011-10-01346
882011-11-011321
892011-11-01243
902011-12-01141

91 rows × 3 columns

  • Caution
    • 실제 활동한 고유 고객의 수를 계산하는 것은 고객 한 명이 해당 기간 동안 10번 구매했더라도, 그 고객은 1명으로만 집계됨
# create a pivot table
cohort_table = cohort_data.pivot(index = 'Cohort First Month', columns = ['Cohort Retention Period'], values = 'CustomerID')
cohort_table
Cohort Retention Period12345678910111213
Cohort First Month             
2010-12-01948.0362.0317.0367.0341.0376.0360.0336.0336.0374.0354.0474.0260.0
2011-01-01421.0101.0119.0102.0138.0126.0110.0108.0131.0146.0155.063.0NaN
2011-02-01380.094.073.0106.0102.094.097.0107.098.0119.035.0NaNNaN
2011-03-01440.084.0112.096.0102.078.0116.0105.0127.039.0NaNNaNNaN
2011-04-01299.068.066.063.062.071.069.078.025.0NaNNaNNaNNaN
2011-05-01279.066.048.048.060.068.074.029.0NaNNaNNaNNaNNaN
2011-06-01235.049.044.064.058.079.024.0NaNNaNNaNNaNNaNNaN
2011-07-01191.040.039.044.052.022.0NaNNaNNaNNaNNaNNaNNaN
2011-08-01167.042.042.042.023.0NaNNaNNaNNaNNaNNaNNaNNaN
2011-09-01298.089.097.036.0NaNNaNNaNNaNNaNNaNNaNNaNNaN
2011-10-01352.093.046.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2011-11-01321.043.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2011-12-0141.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
# change index 형태 변경
cohort_table.index = cohort_table.index.strftime("%B %Y")
cohort_table
Cohort Retention Period12345678910111213
Cohort First Month             
December 2010948.0362.0317.0367.0341.0376.0360.0336.0336.0374.0354.0474.0260.0
January 2011421.0101.0119.0102.0138.0126.0110.0108.0131.0146.0155.063.0NaN
February 2011380.094.073.0106.0102.094.097.0107.098.0119.035.0NaNNaN
March 2011440.084.0112.096.0102.078.0116.0105.0127.039.0NaNNaNNaN
April 2011299.068.066.063.062.071.069.078.025.0NaNNaNNaNNaN
May 2011279.066.048.048.060.068.074.029.0NaNNaNNaNNaNNaN
June 2011235.049.044.064.058.079.024.0NaNNaNNaNNaNNaNNaN
July 2011191.040.039.044.052.022.0NaNNaNNaNNaNNaNNaNNaN
August 2011167.042.042.042.023.0NaNNaNNaNNaNNaNNaNNaNNaN
September 2011298.089.097.036.0NaNNaNNaNNaNNaNNaNNaNNaNNaN
October 2011352.093.046.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
November 2011321.043.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
December 201141.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
# Visualize our results in heatmap
plt.figure(figsize = (21, 10))
plt.title("Cohort")
sns.heatmap(cohort_table, annot = True, cmap = "Blues")
plt.show()
Cohort_1
# cohort table for percentage
# 고객 유지율 = 특정 경과 기간의 잔존 고객수 / 코호트 유입 시점의 총 고객 수
new_cohort_table = cohort_table.divide(cohort_table.iloc[:, 0], axis = 0)
new_cohort_table
Cohort Retention Period12345678910111213
Cohort First Month             
December 20101.00.3818570.3343880.3871310.3597050.3966240.3797470.3544300.3544300.3945150.3734180.5000000.274262
January 20111.00.2399050.2826600.2422800.3277910.2992870.2612830.2565320.3111640.3467930.3681710.149644NaN
February 20111.00.2473680.1921050.2789470.2684210.2473680.2552630.2815790.2578950.3131580.092105NaNNaN
March 20111.00.1909090.2545450.2181820.2318180.1772730.2636360.2386360.2886360.088636NaNNaNNaN
April 20111.00.2274250.2207360.2107020.2073580.2374580.2307690.2608700.083612NaNNaNNaNNaN
May 20111.00.2365590.1720430.1720430.2150540.2437280.2652330.103943NaNNaNNaNNaNNaN
June 20111.00.2085110.1872340.2723400.2468090.3361700.102128NaNNaNNaNNaNNaNNaN
July 20111.00.2094240.2041880.2303660.2722510.115183NaNNaNNaNNaNNaNNaNNaN
August 20111.00.2514970.2514970.2514970.137725NaNNaNNaNNaNNaNNaNNaNNaN
September 20111.00.2986580.3255030.120805NaNNaNNaNNaNNaNNaNNaNNaNNaN
October 20111.00.2642050.130682NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
November 20111.00.133956NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
December 20111.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
# create a percentages
plt.figure(figsize = (21, 10))
plt.title("Cohort")
sns.heatmap(new_cohort_table, annot = True, cmap = "Blues", fmt = '.0%')
plt.show()
Cohort_2
This post is licensed under CC BY 4.0 by the author.