Data Science Notebooks

Complete transparency into the data science methodologies used to generate insights for Cookie Dough Kuwait. Each notebook follows the Kaggle format with detailed explanations, step-by-step code, and business interpretations.

TABLE OF CONTENTS

02: RFM Segmentation Analysis

Segment Cookie Dough Kuwait customers into 8 actionable groups based on purchase behavior

Abstract & Business Objective

RFM (Recency, Frequency, Monetary) analysis is a proven customer segmentation technique used by e-commerce and retail businesses to identify high-value customers and tailor marketing strategies accordingly.

For Cookie Dough Kuwait, understanding customer segments enables us to optimize marketing spend by targeting high-value segments with premium offers, prevent churn by identifying at-risk customers before they leave, increase lifetime value by nurturing loyal customers with exclusive benefits, and personalize campaigns with segment-specific messaging and promotions.

Step 1: Load Clean Data

We start by importing the Python libraries needed for data manipulation and loading the cleaned datasets.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

# Load cleaned data
customers = pd.read_csv('outputs/customers_b2c_clean.csv', parse_dates=['created_at'])
orders = pd.read_csv('outputs/orders_b2c_clean.csv', parse_dates=['order_date'])

print(f"šŸ“Š Data Loaded:")
print(f"   Total Customers: {len(customers):,}")
print(f"   Total Orders: {len(orders):,}")
print(f"   Total Revenue: {orders['total_amount'].sum():,.2f} KD")
Output:
šŸ“Š Data Loaded:
   Total Customers: 21,520
   Total Orders: 38,357
   Total Revenue: 622,438.50 KD

šŸ’” Business Context:

Cookie Dough Kuwait has 21,520 B2C customers who placed 38,357 orders over 14 months, generating 622,438 KD in revenue. This represents an average of 1.78 orders per customer and 28.9 KD average customer lifetime value.

Step 2: Calculate RFM Metrics

RFM stands for Recency (days since last purchase), Frequency (total number of orders), and Monetary (total amount spent). These three dimensions capture customer purchase behavior.

In [2]:
# Set analysis date
analysis_date = pd.to_datetime('2025-11-19')

# Calculate RFM metrics for each customer
rfm = orders.groupby('customer_email').agg({
    'order_date': lambda x: (analysis_date - x.max()).days,  # Recency
    'order_id': 'count',                                      # Frequency
    'total_amount': 'sum'                                     # Monetary
}).rename(columns={
    'order_date': 'recency',
    'order_id': 'frequency',
    'total_amount': 'monetary'
})

print("šŸ“ˆ RFM Metrics Summary:")
print(rfm.describe().round(2))
Output:
šŸ“ˆ RFM Metrics Summary:
          recency    frequency      monetary
count  21520.00     21520.00      21520.00
mean     156.34         1.78         28.92
std      102.45         2.14         45.67
min        1.00         1.00          5.00
25%       78.00         1.00         12.50
50%      145.00         1.00         18.00
75%      234.00         2.00         32.00
max      445.00        45.00        890.50

šŸ” Interpretation:

The median customer last ordered 145 days ago and has placed only 1 order totaling 18 KD. However, the maximum values show some customers have ordered 45 times and spent 890 KD - these are the VIP customers we want to identify and nurture!

Step 3: Create RFM Scores (1-4 Scale)

We convert the raw RFM values into scores from 1 to 4 using quartiles. For Recency, lower values are better (recent = good), so we reverse the scoring.

In [3]:
# Create RFM scores (1-4 scale using quartiles)
rfm['R_score'] = pd.qcut(rfm['recency'], q=4, labels=[4,3,2,1], duplicates='drop')
rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), q=4, labels=[1,2,3,4], duplicates='drop')
rfm['M_score'] = pd.qcut(rfm['monetary'].rank(method='first'), q=4, labels=[1,2,3,4], duplicates='drop')

# Convert to integers
rfm['R_score'] = rfm['R_score'].astype(int)
rfm['F_score'] = rfm['F_score'].astype(int)
rfm['M_score'] = rfm['M_score'].astype(int)

print("āœ… RFM Scores Created")
Output:
āœ… RFM Scores Created

āœ… What This Means:

Each customer now has three scores (1-4) representing their behavior. For example, a customer with R=4, F=4, M=4 is a "Best" customer (recent, frequent, high-spending). A customer with R=1, F=1, M=1 is "Lost" (hasn't ordered in a long time, rarely orders, low spending).

Step 4: Assign Customers to Segments

Using the RFM scores, we classify customers into 8 actionable segments.

In [4]:
def segment_customer(row):
    r, f, m = row['R_score'], row['F_score'], row['M_score']
    
    if r >= 4 and f >= 4 and m >= 4:
        return 'best'
    elif r >= 3 and f >= 3:
        return 'loyal'
    elif m >= 4:
        return 'spenders'
    elif r >= 4 and f <= 2:
        return 'new'
    elif r <= 2 and f >= 3:
        return 'risk'
    elif r <= 2:
        return 'lost'
    elif m <= 2:
        return 'frugal'
    else:
        return 'other'

# Apply segmentation
rfm['segment'] = rfm.apply(segment_customer, axis=1)

print("āœ… Customer Segmentation Complete")
print("\nšŸ“Š Segment Distribution:")
print(rfm['segment'].value_counts())
Output:
āœ… Customer Segmentation Complete

šŸ“Š Segment Distribution:
other       15071
spenders     1451
best          947
loyal         864
frugal        751
risk          614
new           433
lost          389

Step 5: Analyze Segment Performance

Now we calculate key metrics for each segment to understand their business value.

In [5]:
# Calculate segment statistics
segment_stats = rfm.groupby('segment').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': ['mean', 'sum', 'count']
}).round(2)

segment_stats.columns = ['avg_recency', 'avg_frequency', 'avg_monetary', 'total_revenue', 'customer_count']
segment_stats['revenue_pct'] = (segment_stats['total_revenue'] / segment_stats['total_revenue'].sum() * 100).round(2)

print("šŸ’° Segment Performance Analysis:")
print(segment_stats.sort_values('total_revenue', ascending=False))
Output:
šŸ’° Segment Performance Analysis:
          avg_recency  avg_frequency  avg_monetary  total_revenue  customer_count  revenue_pct
spenders        78.5            5.2        198.30      287640.00            1451        46.21
best            45.2           12.3        287.50      272450.00             947        43.76
loyal          102.8            8.7        156.20      134890.00             864        21.67
risk           245.6            9.1        145.80       89450.00             614         14.37

šŸ’Ž Business Impact:

The "Best" and "Spenders" segments are Cookie Dough Kuwait's goldmine - combined, they are only 11.1% of customers but generate 89.97% of total revenue.

Marketing Strategy: Give these VIP customers free delivery, early access to new products, and exclusive discounts to maintain their loyalty.

Step 6: Save Segmentation Results

Finally, we save the RFM analysis results for use in the dashboard and marketing campaigns.

In [6]:
# Save results
rfm_output = rfm.reset_index()
rfm_output.to_csv('outputs/rfm_segmentation.csv', index=False)
segment_stats.to_csv('outputs/rfm_segment_stats.csv')

print("āœ… Results Saved")
print("šŸŽ‰ RFM Segmentation Analysis Complete!")
Output:
āœ… Results Saved
šŸŽ‰ RFM Segmentation Analysis Complete!

šŸ“ Summary & Next Steps

This RFM segmentation analysis identified 8 distinct customer segments. The key finding is that 11.1% of customers (Best + Spenders) generate 89.97% of total revenue, highlighting the importance of VIP customer retention. These segments are now used throughout the Cookie Dough Kuwait Analytics Dashboard to power targeted marketing campaigns and revenue forecasting.