
    How to Prepare Sales Data for an AI Pricing Model

    Introduction: Why Clean, Well-Prepared Data Is the Secret Ingredient in AI Pricing

    As distributors across every industry look to gain a competitive edge, AI-powered pricing models are becoming one of the most powerful tools available. These models can uncover hidden patterns in historical transactions, predict customer sensitivity to price changes, and recommend optimized prices that protect margin while staying competitive.

    But before an algorithm can learn anything, it needs clean, well-structured data. Most distributors already sit on a goldmine of information — product catalogs, customer order histories, cost data, and supplier terms — yet these valuable records are often scattered, inconsistent, or incomplete. That’s why the first step in building a successful model is learning how to prepare sales data for an AI pricing model.

    In this article, we’ll walk through how to collect, clean, and enhance your existing sales and operations data so it’s ready for machine learning. By the end, you’ll understand which data sources matter most, how to transform them into model-ready inputs, and how better data can translate directly into smarter, more profitable pricing decisions.

    The Data Distributors Already Have (and Why It’s a Goldmine)

    The good news for most distributors is that the foundation for an AI pricing model is already sitting in your systems — it just needs to be unlocked. Every quote and sales order tells a story about what your customers value, what they’re willing to pay, and how your prices perform in the market. By gathering this information into a clean, structured dataset, you can train a machine learning model to detect patterns that no human could ever spot at scale.

    Here are some of the most valuable types of data that distributors already possess:

    • Sales transactions: Each line item — product, gross profit margin, and customer — forms the backbone of your training data. These records show how price interacts with real-world buying behavior.
    • Product information: Descriptions, SKUs, product categories and groups, stock or non-stock status, and cost data help the model understand relationships between products and margin structures.
    • Customer data: Attributes such as industry, region, customer tier and/or customer type (e.g., contractor vs. OEM) allow the model to personalize pricing recommendations.
    • Supplier and cost data: Fluctuating supplier prices and terms can be key variables when predicting optimal selling prices.
    • Historical quotes and win/loss data: This often-overlooked data is extremely valuable for understanding price sensitivity and competitive dynamics. Valuable information includes quoted lead times, quantity on hand at the time of quote, quantity quoted, outcome (won or lost) and salesperson.
    • Seasonality and time-based data: Sales patterns by month, quarter, or season help the model adjust for demand cycles.

    When all of this information is combined, it becomes a dynamic pricing engine waiting to happen. The next step is ensuring that this data is clean, consistent, and machine-readable — which is where the real work (and value) begins.

    Cleaning and Normalizing: Turning Raw Data into Model-Ready Input

    Raw sales data, no matter how rich, is rarely ready for machine learning. It’s full of duplicates, missing fields, inconsistent formats, and outdated records. Before your AI model can recognize pricing patterns, it first has to trust the data it’s being trained on. That’s why data cleaning and normalization are so critical — they transform messy sales records into a structured, reliable dataset that a pricing algorithm can actually learn from.

Here are the most important steps in preparing distributor data for an AI pricing model (a short pandas sketch follows the list):

    • Remove duplicates and errors. Repeated invoice lines or miskeyed prices can distort model training. Even a few outliers can cause the model to learn incorrect pricing relationships.
    • Handle missing or incomplete data. When costs, quantities, customer category, product category or dates are missing, use business logic or statistical methods to fill gaps — or remove unusable rows entirely. Consistency is more important than volume.
• Fix or remove records with invalid data (prices or costs at or below zero, negative lead-time days, negative quantities, etc.). Also exclude records for internal transactions (for example, quotes or sales for internal transfers).
    • Normalize units and currencies. Distributors often sell the same product in different units (e.g., cases, boxes, or singles). Convert all transactions into a common base unit and currency so the model can compare apples to apples.
    • Align product and customer identifiers. Standardize SKUs, product categories, and customer IDs across all systems (ERP, CRM, quoting tools). A single, unified key for each entity prevents confusion during model training.
    • Tokenize categorical data. Many AI models can’t directly read text fields like “Region = Midwest” or “Customer Type = Contractor.” Tokenization — assigning numeric or encoded values — allows these labels to become usable inputs.
    • Group numerical fields into bins. Continuous fields such as “quantity sold” or “order size” can be bucketed into ranges (e.g., 1–10, 11–50, 51–100) to help the model identify threshold effects, such as volume-based discount behavior.
    • Detect and treat outliers. An occasional “$0.01” sale or “10,000-unit” order can throw off training results. Flag and investigate these before feeding them into your model.
• Remove any quotes or sales with pre-determined prices (for example, sales based on pre-agreed contracts or price sheets).
• Remove any quotes for items that have never been won, as the model may price these items too aggressively.
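
To make these steps concrete, here is a minimal pandas sketch covering a few of them. The file and column names (qty, uom, unit_price, unit_cost, units_per_case, customer_type) are illustrative placeholders rather than a required schema, so map them to your own ERP fields.

import pandas as pd

# Load raw quote lines; parse dates for later time-based features
df = pd.read_csv("quotes_raw.csv", parse_dates=["quote_date"])

# Remove exact duplicates and obviously invalid records
df = df.drop_duplicates()
df = df[(df["unit_price"] > 0) & (df["unit_cost"] > 0) & (df["qty"] > 0)]

# Normalize units: convert cases to a common base unit (eaches),
# assuming a units_per_case column is available
is_case = df["uom"].eq("CASE")
df.loc[is_case, "qty"] = df.loc[is_case, "qty"] * df.loc[is_case, "units_per_case"]
df.loc[is_case, "uom"] = "EA"

# Encode categorical labels and bucket continuous quantities
df["customer_type_code"] = df["customer_type"].astype("category").cat.codes
df["qty_bucket"] = pd.cut(df["qty"], bins=[0, 10, 50, 100, float("inf")],
                          labels=["1-10", "11-50", "51-100", "100+"])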

    By the end of this stage, your raw data becomes a standardized, trustworthy foundation. Only then can it reveal the true signals behind pricing performance — signals that a well-trained AI model can amplify into real margin improvement.

    Feature Engineering for Better Predictions

    Once your data is clean and consistent, the next step is to make it more informative. Feature engineering is the process of transforming raw data into new variables (or “features”) that help your AI pricing model recognize the subtle factors influencing customer behavior.

    Think of it as giving the model more context — the same way an experienced sales rep instinctively knows that a contractor ordering 1,000 units in May behaves differently from a retail customer ordering ten units in December.

Here are some practical ways distributors can enhance their datasets through feature engineering (see the pandas sketch after the list):

    • Create ratio-based features. Calculating fields such as margin percentage, discount from list price, or average revenue per customer helps the model see relationships that aren’t obvious in raw sales data.
    • Add time-based context. Derived features like days since last purchase, month of year, or season capture repeat buying patterns and seasonal demand.
    • Segment by customer and product attributes. Creating flags or encoded values such as “key account”, “preferred supplier”, or “new product launch” gives the model behavioral cues.
    • Aggregate transactional history. Summarizing data into higher-level metrics — like average order size or total spend per quarter — helps smooth out noise and reveal long-term trends.
    • Use tokenized and bucketed fields. Earlier steps like tokenizing categories or binning order quantities now become the building blocks for modeling how price elasticity changes across segments.
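
As an illustration, here is a small pandas sketch of a few of these derived features; column names such as unit_price, list_price, unit_cost, qty, and quote_id are assumptions to adapt to your own dataset.

# Ratio-based features
df["margin_pct"] = (df["unit_price"] - df["unit_cost"]) / df["unit_price"] * 100
df["discount_from_list_pct"] = (df["list_price"] - df["unit_price"]) / df["list_price"] * 100

# Time-based context (quote dates used as a proxy for purchase activity)
df["month"] = df["quote_date"].dt.month
df = df.sort_values(["customer_id", "quote_date"])
df["days_since_last_quote"] = df.groupby("customer_id")["quote_date"].diff().dt.days

# Aggregated customer history, merged back in by key
cust_stats = (df.groupby("customer_id")
                .agg(avg_order_qty=("qty", "mean"),
                     total_quotes=("quote_id", "count"))
                .reset_index())
df = df.merge(cust_stats, on="customer_id", how="left")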

    Good feature engineering transforms your sales database from a record of past transactions into a simulation of your market dynamics. When these enhanced features are used to train your AI pricing model, it doesn’t just learn what happened — it begins to infer why.

    From Raw Data to Model-Ready: An Example Schema

    To make this more tangible, let’s look at how typical distributor data evolves from raw quotes to model-ready training data.

    Typical raw quote data:

| Quote Id | Quote Date | Customer Id | Product Id | Qty Quoted | UoM | Quoted GP | SalesPerson Id |
|----------|------------|-------------|------------|------------|-----|-----------|----------------|
| 1001 | 2029-01-02 | CUST-001 | PROD-010 | 100 | EA | 13.5% | SALES-100 |
| 1002 | 2029-01-02 | CUST-002 | PROD-010 | 1 | CASE | 13.2% | |

    Normalize and populate missing data:

| Quote Id | Quote Date | Customer Id | Product Id | Qty Quoted | UoM | Quoted GP | SalesPerson Id |
|----------|------------|-------------|------------|------------|-----|-----------|----------------|
| 1001 | 2029-01-02 | CUST-001 | PROD-010 | 100 | EA | 13.5% | SALES-100 |
| 1002 | 2029-01-02 | CUST-002 | PROD-010 | 250 | EA | 13.2% | SALES-104 |

    Link related sales orders to identify won and lost quotes:

| Quote Id | Quote Date | Customer Id | Product Id | Qty Quoted | UoM | Quoted GP | SalesPerson Id | Outcome |
|----------|------------|-------------|------------|------------|-----|-----------|----------------|---------|
| 1001 | 2029-01-02 | CUST-001 | PROD-010 | 100 | EA | 13.5% | SALES-100 | lost |
| 1002 | 2029-01-02 | CUST-002 | PROD-010 | 250 | EA | 13.2% | SALES-104 | won |

Add additional details about the customer, product, etc.:

| Quote Id | Quote Date | Customer Id | Product Id | Qty Quoted | UoM | Quoted GP | SalesPerson Id | Outcome | Customer Tier | Customer Region | Product Category |
|----------|------------|-------------|------------|------------|-----|-----------|----------------|---------|---------------|-----------------|------------------|
| 1001 | 2029-01-02 | CUST-001 | PROD-010 | 100 | EA | 13.5% | SALES-100 | lost | 1 | West | Fasteners |
| 1002 | 2029-01-02 | CUST-002 | PROD-010 | 250 | EA | 13.2% | SALES-104 | won | 3 | North | Bolts |

    Enhance & Engineer the data:

| Quote Id | Quote Date | Customer Id | Product Id | Qty Quoted | UoM | Quoted GP | SalesPerson Id | Outcome | Customer Tier | Customer Region | Product Category | Qty Bucket | Discount from List | Month |
|----------|------------|-------------|------------|------------|-----|-----------|----------------|---------|---------------|-----------------|------------------|------------|--------------------|-------|
| 1001 | 2029-01-02 | CUST-001 | PROD-010 | 100 | EA | 13.5% | SALES-100 | lost | 1 | West | Fasteners | 0-100 | 5% | 1 |
| 1002 | 2029-01-02 | CUST-002 | PROD-010 | 250 | EA | 13.2% | SALES-104 | won | 3 | North | Bolts | 101-500 | 12% | 1 |

    At this stage:

    • Row-level ratios like margin% and discount% are new columns in the same table.
    • Aggregated metrics (e.g., customer lifetime value) may be calculated separately and merged back in by key (e.g., customer_id).
    • Tokenized fields allow categorical data to be processed numerically.
    • Bucketed fields (like quantity ranges) help the model learn threshold effects such as volume discounts.

    The result is a flattened, model-ready table where each row represents one transaction, but each column encodes valuable business knowledge. This is what modern AI pricing models are trained on — a single, rich, structured dataset that reflects both transactional detail and business context.
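
The sketch below shows how these linking and enrichment steps might look in pandas, under assumed table and column names (quotes.csv, orders.csv, customers.csv, products.csv keyed by quote_id, customer_id, and product_id).

import pandas as pd

# Hypothetical source tables -- adjust keys and column names to your systems
quotes = pd.read_csv("quotes.csv", parse_dates=["quote_date"])
orders = pd.read_csv("orders.csv")        # contains quote_id for quotes that became orders
customers = pd.read_csv("customers.csv")  # customer_tier, customer_region
products = pd.read_csv("products.csv")    # product_category

# Label outcomes: a quote is "won" if a linked sales order exists
quotes["outcome"] = quotes["quote_id"].isin(orders["quote_id"]).map({True: "won", False: "lost"})

# Merge in customer and product attributes by key
model_df = (quotes
            .merge(customers, on="customer_id", how="left")
            .merge(products, on="product_id", how="left"))

# Engineered columns from the earlier steps
model_df["month"] = model_df["quote_date"].dt.month
model_df["qty_bucket"] = pd.cut(model_df["qty_quoted"],
                                bins=[0, 100, 500, float("inf")],
                                labels=["0-100", "101-500", "500+"])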

    Enhancing Data with External Context

    Even the cleanest internal dataset can only describe what’s already happened inside your business. To train an AI pricing model that reacts to the market, not just your history, you’ll want to enrich your data with external signals. These contextual factors help the model recognize the why behind pricing shifts—things like seasonality, supplier volatility, or regional demand patterns.

Here are a few powerful types of external data you can integrate (a small merge example follows the list):

    • Market and commodity indexes. For distributors whose costs depend on raw materials (steel, copper, resin, etc.), linking supplier prices to public commodity indexes gives the model a real-world cost baseline.
    • Freight and logistics costs. Adding average freight rates or fuel costs by region can help the model understand variations in delivered pricing and margin erosion.
    • Economic indicators. Regional GDP growth, interest rates, or housing starts can all influence industrial demand. Including these variables lets your model anticipate pricing pressure before it shows up in sales data.
    • Weather and seasonality. For sectors tied to climate (HVAC, landscaping, construction materials), temperature or precipitation data can reveal when demand spikes are most likely.
    • Competitor or market pricing. Even limited competitive intelligence—such as average market prices from a benchmarking service—helps the model learn where your price points sit in context.
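
For example, a monthly commodity index can be merged in by calendar month so each quote carries the cost environment it was priced in. The file and column names below (copper_index_monthly.csv, copper_index) are hypothetical.

import pandas as pd

quotes = pd.read_csv("quotes_clean.csv", parse_dates=["quote_date"])
copper = pd.read_csv("copper_index_monthly.csv", parse_dates=["month"])

# Align both tables on year-month, then merge the external signal into each transaction
quotes["month_key"] = quotes["quote_date"].dt.to_period("M")
copper["month_key"] = copper["month"].dt.to_period("M")
quotes = quotes.merge(copper[["month_key", "copper_index"]], on="month_key", how="left")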

    When combined with your cleaned and engineered sales data, these external signals transform your pricing model from a reactive tool into a forward-looking one. The model can then spot correlations your teams might miss, like how freight volatility or regional construction activity subtly shifts price sensitivity.

    Ultimately, data enrichment bridges the gap between your transactional reality and the economic environment you operate in. That’s where the predictive power of AI becomes a genuine strategic advantage.

    Data Governance and Ongoing Maintenance

    Preparing your data for an AI pricing model isn’t a one-time task — it’s an ongoing discipline. The moment you start training models, your data pipeline becomes part of your daily operations. If the data feeding the model degrades, so will the model’s accuracy and trustworthiness.

Here are the key practices every distributor should adopt to keep their data healthy (a simple drift-check sketch follows the list):

    • Standardize data entry and definitions. Ensure that all departments use consistent product categories, customer classifications, and units of measure. Small inconsistencies compound quickly in large datasets.
    • Monitor for data drift. Over time, market conditions and internal processes change. Regularly compare current data distributions (like average margins or order sizes) to historical ones to spot shifts that may require retraining the model.
    • Schedule periodic audits. Review data integrity quarterly or semiannually. This might include sampling transactions for accuracy, checking for missing fields, and validating that external feeds (like commodity prices) are still updating correctly.
    • Document data lineage. Keep a record of where each data source originates, what transformations are applied, and who owns it. This transparency makes troubleshooting and compliance far easier down the line.
    • Retrain the model on a schedule. Even a perfectly prepared dataset becomes outdated as the market evolves. Set a cadence for retraining your AI pricing model—monthly, quarterly, or annually—depending on your sales volume and industry volatility.
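
A very simple drift check compares summary statistics of recent data against the training baseline; the 10% threshold, file names, and column names below are illustrative only.

import pandas as pd

baseline = pd.read_csv("training_snapshot.csv")   # data the current model was trained on
current = pd.read_csv("last_90_days.csv")         # most recent transactions

for col in ["quoted_gp_pct", "qty_quoted", "lead_time_days"]:
    base_mean, cur_mean = baseline[col].mean(), current[col].mean()
    shift = abs(cur_mean - base_mean) / abs(base_mean) if base_mean else float("nan")
    flag = "REVIEW" if shift > 0.10 else "ok"
    print(f"{col}: baseline={base_mean:.2f} current={cur_mean:.2f} shift={shift:.1%} [{flag}]")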

    Strong governance ensures that the effort you put into collecting, cleaning, and enhancing your data continues to pay off. Over time, this steady flow of reliable, enriched information becomes your most valuable competitive asset — powering not just pricing optimization, but smarter forecasting, inventory management, and customer insights across the business.

    Conclusion: Turning Clean Data into Profitable Intelligence

    For distributors, success with artificial intelligence begins long before the first line of code. The real magic happens when clean, consistent, and enriched data becomes the foundation for smarter decision-making. By collecting, preparing, and enhancing your sales data, you’re not just creating a dataset — you’re building a digital model of how your market behaves.

    With that foundation in place, you’re ready to take the next step: transforming your prepared data into a working AI pricing model. In the next article — How to Build an AI Pricing Model Using Machine Learning in Python — we’ll walk through how to feed this data into a machine learning framework, train your model, and start generating optimized pricing recommendations that boost both margin and competitiveness.

    Investing in data preparation today means unlocking long-term pricing intelligence tomorrow — a true strategic edge in an increasingly data-driven distribution landscape.

    How to Build an AI Pricing Model Using Machine Learning in Python

    In today’s competitive markets, pricing has become one of the most powerful levers for profitability — and one of the hardest to get right. Traditional pricing methods, like simple margin targets or “last price quoted” rules, can overlook complex relationships between cost, demand, competition, and customer behavior.

    That’s where Machine Learning (ML) comes in. By analyzing large volumes of historical quote and sales data, an AI Pricing Model can uncover hidden patterns — such as how product category, lead time, customer tier, or market conditions influence the probability of winning a quote or achieving a target gross profit.

    In this post, we’ll walk through how to build an AI Pricing Model using Machine Learning in Python, using tools like pandas, scikit-learn, and XGBoost. You’ll learn not just how the model works, but also why it can outperform traditional pricing logic — and how to interpret the model’s predictions in a way that’s practical for business decision-making.

    By the end, you’ll understand how to:

    • Prepare and clean pricing data for machine learning,
    • Train and evaluate a predictive pricing model,
    • Measure model accuracy using metrics like RMSE, R², and AUC,
    • and apply your AI pricing model to generate optimized prices that balance margin and win probability.

    Why Machine Learning Works for Pricing Optimization

    At its core, pricing optimization is about understanding how different factors — such as cost, competition, customer type, quantity, and lead time — influence a buyer’s willingness to pay and your ability to win profitable deals. The challenge is that these relationships are rarely linear or static. They can shift over time, differ across product categories, and interact in subtle ways that are hard to detect with traditional rule-based logic or spreadsheets.

    That’s exactly where Machine Learning (ML) excels.

    A Machine Learning model for pricing learns directly from historical quote and sales data. It doesn’t rely on hard-coded formulas; instead, it identifies patterns and correlations that may not be obvious to humans. For example, it might learn that:

    • A certain customer tier consistently pays higher prices for small-quantity orders,
    • Or that short lead times increase win probability but reduce achievable margins,
    • Or that specific product categories are more price-sensitive during certain market conditions.

    Because ML models can analyze millions of records at once, they’re capable of understanding these complex interactions and weighting them appropriately. The result is an AI Pricing Model that predicts outcomes such as:

    • Win probability for a given quote price, or
    • Expected gross profit (GP%) given market and customer conditions.

    With those predictions, pricing teams can simulate “what-if” scenarios — such as, “What happens to win probability if we increase price by 3%?” — and make more confident, data-driven decisions.

    Dynamic Learning and Adaptability

    Another reason Machine Learning works so well for pricing optimization is its ability to evolve. As new data is collected — from market shifts, supply chain changes, or customer behavior — the model can be retrained to learn from recent patterns. This makes it inherently adaptive, unlike static pricing rules that quickly become outdated.

    From Insight to Action

    When combined with explainability tools like SHAP (SHapley Additive exPlanations), businesses can even interpret why the model made certain pricing recommendations. This transparency builds trust and ensures that AI-driven pricing decisions are both accurate and explainable — not just “black box” outputs.

    Two Core Models Behind an AI Pricing Framework: Regression and Classification

    A strong AI Pricing Model using Machine Learning in Python usually isn’t just one model — it’s a combination of two that work together to predict both what price to quote and how likely you are to win at that price. These are called regression models and classification models, and each plays a distinct but complementary role.


    1. The Regression Model — “What GP% should we expect (or recommend)?”

    Think of the regression model as the price-setting brain of your AI Pricing system.

    It answers questions like:

    “Given these conditions — product group, customer tier, lead time, and market — what gross profit percentage (GP%) should we expect on this quote?”

    The model learns from past quotes where you know both the inputs (features such as customer, product, and quantity) and the outcome (the GP% you actually achieved).

    Over time, it learns relationships like:

    • “High-volume orders tend to have lower GP%,”
    • “Certain customers consistently negotiate tighter margins,”
    • “Stock items can carry higher GP% due to faster availability.”

    By understanding these patterns, the regression model can predict the most reasonable or competitive GP% for a new quote — effectively recommending a data-driven price point that aligns with historical success patterns.


    2. The Classification Model — “What’s the probability we’ll win this quote?”

    If the regression model helps you set the price, the classification model helps you evaluate the risk and opportunity of that price.

    This model predicts a win probability — essentially answering:

    “Given this quote’s characteristics and price level, what’s the likelihood that the customer will award us the order?”

    It learns from historical data labeled as won or lost. For each quote, it examines factors such as:

    • Quoted GP% (price competitiveness),
    • Customer relationship or tier,
    • Lead time,
    • Product category or market conditions.

    The output is a probability — for example, “There’s a 72% chance of winning this quote at this price.”

    With that, your pricing system can balance profit vs. win likelihood, enabling smarter trade-offs — such as lowering margin slightly on a high-probability deal or holding firm on price when the odds of winning are already low.


    When You Combine the Two

    When you use both models together, you get a complete AI Pricing Framework:

    • The Regression Model recommends a target price or GP%,
    • The Classification Model estimates the win probability at that price,
    • And together they create a feedback loop that helps your team find the optimal price point — the one that maximizes both revenue and likelihood of success.

    This combination mirrors what experienced sales or pricing analysts do intuitively — except the AI can analyze millions of records and update itself continuously as new data comes in.

    Tools and Frameworks in Python for Building an AI Pricing Model

    One of the biggest advantages of building an AI Pricing Model using Machine Learning in Python is that the Python ecosystem already provides powerful, production-ready libraries for every stage of the workflow — from cleaning your data to training, evaluating, and explaining the model’s predictions.

    Below are the key tools you’ll use, and the role each one plays in developing a pricing optimization framework.


    1. pandas — The Data Preparation Workhorse

    Before you can train a model, your quote and sales data needs to be cleaned and structured.

    That’s where pandas comes in. It’s a Python library designed for working with tabular data — like your pricing history — using simple, spreadsheet-like commands.

    With pandas, you can:

    • Load CSVs or Excel files into DataFrames,
    • Handle missing or invalid data,
    • Create new features (e.g., “lead_time_days” or “customer_tier”),
    • Filter, sort, and group data by product or customer,
    • Join multiple datasets (e.g., cost data with quote history).

    In short: pandas is where you prepare your pricing dataset for Machine Learning.

import pandas as pd

# Parse the date columns so date arithmetic works for derived features
df = pd.read_csv("quotes.csv", parse_dates=["quote_date", "expected_ship_date"])
df['lead_time_days'] = (df['expected_ship_date'] - df['quote_date']).dt.days
    

    2. scikit-learn — The Foundation for Machine Learning

    Once your data is ready, scikit-learn provides the essential tools for building and evaluating models.

    It includes algorithms for both regression and classification, along with utilities for:

    • Splitting your dataset into training and test sets,
    • Scaling and encoding data,
• Evaluating model accuracy using metrics like RMSE, MAE, and R² (for regression) or AUC (for classification),
    • Building pipelines that make your ML workflow repeatable and organized.

    Even if you later switch to more advanced models like XGBoost, scikit-learn remains the framework that ties everything together.

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error, r2_score
    

    3. XGBoost — High-Performance Predictive Modeling

    For real-world pricing problems, where data can include millions of quotes and dozens of features, XGBoost (Extreme Gradient Boosting) is a top performer.

    It’s a gradient-boosted decision tree algorithm known for:

    • Handling nonlinear relationships (e.g., between GP% and quantity),
    • Managing missing values gracefully,
    • Delivering high accuracy and fast training times.

    In your AI Pricing Model, XGBoost is typically used for both:

    • Regression → predicting the expected GP% for a quote,
    • Classification → predicting the probability of winning at that price.

    Its robustness and interpretability make it one of the most trusted algorithms for business-critical ML applications.

    import xgboost as xgb
    
    reg_model = xgb.XGBRegressor()
    clf_model = xgb.XGBClassifier()
    

    4. SHAP (SHapley Additive exPlanations) — Explaining the Model

    Even the most accurate AI pricing model isn’t useful if you can’t explain why it makes a certain recommendation.

    That’s where SHAP comes in. SHAP values quantify how much each feature — such as customer tier, lead time, or product group — contributes to a specific prediction.

    For example:

    • “Lead time contributed +1.2% to the GP% recommendation,”
    • “Customer tier lowered the win probability by 8%.”

    With SHAP visualizations, pricing analysts can see exactly what drives the model’s logic, turning complex AI outputs into actionable business insights.

import shap

# Note: X_test must be the same encoded/transformed feature matrix the model was trained on
explainer = shap.TreeExplainer(reg_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
    

    Bringing It All Together

    When combined, these Python tools provide a complete end-to-end solution for building an AI Pricing Model:

| Stage | Goal | Tool |
|-------|------|------|
| Data preparation | Clean and organize quote data | pandas |
| Model training | Build regression and classification models | scikit-learn, XGBoost |
| Model evaluation | Measure accuracy and predictive power | scikit-learn metrics |
| Model explainability | Visualize feature importance and logic | SHAP |

    Together, these frameworks turn raw pricing history into a living, learning system — one that continuously refines your pricing strategy based on data, not gut feel.

    Step-by-Step: How to Build an AI Pricing Model Using Machine Learning in Python

    This walkthrough shows how to build the two-model framework:

    1. a Regression model to recommend a target GP%, and
    2. a Classification model to estimate win probability at a given price.

    We’ll use pandas, scikit-learn, XGBoost, and SHAP.

    Mini data-sanity checklist (save the deep dive for the next post):

    • Remove or flag obvious data errors (negative quantities/costs, GP% > 100, etc.).
    • Avoid leakage: for the win model, only use features available before the decision (e.g., do not use “won/lost” derived fields or post-quote info).
    • Ensure time awareness: train on older data, test on newer (or do time-based CV).
    • Encode categories (customer tier, product group/category) and handle missing values.

    1) Setup & Load Data

    import pandas as pd
    import numpy as np
    
    # Load your quotes dataset
    df = pd.read_csv("quotes.csv", parse_dates=["quote_date"], low_memory=False)
    
    # Example expected columns (adjust to your schema):
    # 'quoted_price', 'quoted_quantity', 'cost', 'quoted_gp_pct', 'won_flag',
    # 'product_group', 'product_category', 'customer_tier', 'lead_time_days',
    # 'is_stock_item', 'on_hand_qty_at_quote'
    

    Light cleaning:

    # Basic filters/sanity
    df = df.dropna(subset=["quoted_gp_pct", "won_flag", "product_group", "customer_tier"])
    df = df[(df["quoted_gp_pct"] > -20) & (df["quoted_gp_pct"] < 100)]  # tweak if needed
    
    # Ensure types
    df["is_stock_item"] = df["is_stock_item"].astype(int)  # 0/1
    df["won_flag"] = df["won_flag"].astype(int)  # 0/1
    

    2) Feature Sets for Each Model

    • Regression (target = quoted_gp_pct)
      Inputs that influence achievable margin: ['product_group', 'product_category', 'customer_tier', 'lead_time_days', 'is_stock_item', 'on_hand_qty_at_quote']
    • Classification (target = won_flag)
      Include the price signal (e.g., quoted_gp_pct) plus context features: ['quoted_gp_pct', 'product_group', 'product_category', 'customer_tier', 'lead_time_days', 'is_stock_item', 'on_hand_qty_at_quote']

Tip: Keep feature names aligned between models so the system is easy to maintain. Avoid ‘feature leakage’ by excluding any feature that gives away the answer (for example, do not feed the actual achieved price or GP% to the regression model, and do not feed the outcome to the classification model). Also remember that whatever features you choose here are exactly the inputs you will need to supply later when using the models to predict outcomes.


    3) Train/Test Split (time-aware if possible)

    # Option A: random split (simple)
    from sklearn.model_selection import train_test_split
    
    reg_features = ['product_group','product_category','customer_tier',
                    'lead_time_days','is_stock_item','on_hand_qty_at_quote']
    clf_features = ['quoted_gp_pct'] + reg_features
    
    X_reg = df[reg_features]
    y_reg = df["quoted_gp_pct"]
    
    X_clf = df[clf_features]
    y_clf = df["won_flag"]
    
    Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
    Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)
    
    # Option B (recommended for production): split by date so test is newer period
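# A minimal sketch of Option B, assuming 'quote_date' was parsed as datetime:
# cutoff = df["quote_date"].quantile(0.8)   # oldest ~80% of quotes for training
# train_mask = df["quote_date"] <= cutoff
# Xr_train, Xr_test = X_reg[train_mask], X_reg[~train_mask]
# yr_train, yr_test = y_reg[train_mask], y_reg[~train_mask]
# Xc_train, Xc_test = X_clf[train_mask], X_clf[~train_mask]
# yc_train, yc_test = y_clf[train_mask], y_clf[~train_mask]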
    

    4) Preprocessing Pipelines

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, roc_auc_score
    import xgboost as xgb
    import numpy as np
    
    cat_cols = ['product_group','product_category','customer_tier']
    num_cols = ['lead_time_days','is_stock_item','on_hand_qty_at_quote']
    
    preprocessor = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
            ("num", "passthrough", num_cols)
        ]
    )
    

    5) Train the Regression Model (GP% recommender)

    reg_model = xgb.XGBRegressor(
        n_estimators=600,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.9,
        colsample_bytree=0.9,
        random_state=42,
        n_jobs=-1
    )
    
    reg_pipe = Pipeline([
        ("prep", preprocessor),
        ("model", reg_model)
    ])
    
    reg_pipe.fit(Xr_train, yr_train)
    
    # Evaluate
    yr_pred = reg_pipe.predict(Xr_test)
    rmse = np.sqrt(mean_squared_error(yr_test, yr_pred))
    mae = mean_absolute_error(yr_test, yr_pred)
    r2 = r2_score(yr_test, yr_pred)
    
    print(f"Regression — RMSE: {rmse:.3f} | MAE: {mae:.3f} | R²: {r2:.3f}")
    

    6) Train the Classification Model (win probability)

    clf_model = xgb.XGBClassifier(
        n_estimators=600,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.9,
        colsample_bytree=0.9,
        random_state=42,
        n_jobs=-1,
        eval_metric="auc"
    )
    
    # Preprocessor is the same structure, but includes quoted_gp_pct as numeric
    cat_cols_c = cat_cols
    num_cols_c = ['quoted_gp_pct','lead_time_days','is_stock_item','on_hand_qty_at_quote']
    
    preprocessor_clf = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols_c),
            ("num", "passthrough", num_cols_c)
        ]
    )
    
    clf_pipe = Pipeline([
        ("prep", preprocessor_clf),
        ("model", clf_model)
    ])
    
    clf_pipe.fit(Xc_train, yc_train)
    
    # Evaluate
    yc_pred_proba = clf_pipe.predict_proba(Xc_test)[:,1]
    auc = roc_auc_score(yc_test, yc_pred_proba)
    print(f"Classification — AUC: {auc:.3f}")
    

    7) Explainability with SHAP (optional but recommended)

    import shap
    
    # For tree-based models, explain on the transformed matrix
    # Grab a small sample to keep plots fast
    sample = Xr_test.sample(n=min(2000, len(Xr_test)), random_state=42)
    
    # Fit a TreeExplainer on the trained XGB model inside the pipeline
    # We need the model object (reg_model) and the transformed features
    X_sample_transformed = reg_pipe.named_steps["prep"].transform(sample)
    explainer = shap.TreeExplainer(reg_pipe.named_steps["model"])
    shap_values = explainer.shap_values(X_sample_transformed)
    
    # Summary plot (run in notebooks)
    # shap.summary_plot(shap_values, X_sample_transformed, feature_names=reg_pipe.named_steps["prep"].get_feature_names_out())
    

    Tip: For reports, capture SHAP bar plots for top features affecting GP% and win probability. This builds trust with commercial teams.


    8) Put the Models to Work: Recommend a Price & Simulate Win Probability

    Flow in production for a new quote:

    1. Use the regression model to recommend a baseline GP%.
    2. Convert GP% → price (based on cost).
    3. Create a small price ladder around that recommendation (±2–5 percentage points).
    4. For each rung, compute win probability via the classification model.
    5. Pick the rung that meets your business objective (e.g., maximize expected margin = margin × win_prob, or enforce a minimum win probability).

    Example:

def gp_to_price(cost, gp_pct):
    # gp_pct as percentage number, e.g., 25 means 25%
    return cost / (1 - gp_pct/100.0)

def simulate_ladder(row, reg_pipe, clf_pipe, ladder_pts=(-4,-2,0,2,4)):
    # 1) Predict baseline GP%
    # Build a one-row DataFrame from a dict so each column keeps a sensible dtype
    reg_input = pd.DataFrame([row[reg_features].to_dict()])
    gp_base = float(reg_pipe.predict(reg_input)[0])

    results = []
    for delta in ladder_pts:
        gp_try = max(min(gp_base + delta, 95), -5)  # clamp
        price_try = gp_to_price(row["cost"], gp_try)

        # 2) Score win probability at this GP% level
        clf_input = row[clf_features].to_dict()
        clf_input["quoted_gp_pct"] = gp_try
        win_prob = float(clf_pipe.predict_proba(pd.DataFrame([clf_input]))[0, 1])

        margin = price_try - row["cost"]
        expected_margin = margin * win_prob

        results.append({
            "gp_pct": round(gp_try, 2),
            "price": round(price_try, 2),
            "win_prob": round(win_prob, 3),
            "expected_margin": round(expected_margin, 2)
        })
    return pd.DataFrame(results).sort_values("expected_margin", ascending=False)
    
    # Example usage on a single quote row (replace with real row)
    # row = df.iloc[0]
    # ladder = simulate_ladder(row, reg_pipe, clf_pipe)
    # display(ladder)
    

    9) Save & Load Models

    import joblib
    joblib.dump(reg_pipe, "gp_regression_pipe.joblib")
    joblib.dump(clf_pipe, "win_classifier_pipe.joblib")
    
    # Later
    # reg_pipe = joblib.load("gp_regression_pipe.joblib")
    # clf_pipe = joblib.load("win_classifier_pipe.joblib")
    

    10) Production Tips

    • Time-based validation: use rolling windows to ensure robustness across market regimes.
    • Segmented models: consider separate models by product category if behavior differs drastically.
    • Guardrails: enforce GP% floors/ceilings by customer tier or category.
    • Monitoring: track drift in feature distributions and periodic re-train cadence (monthly/quarterly).
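
As one example of a guardrail, the sketch below clamps the model's recommended GP% to a floor and ceiling per customer tier; the tier bands are hypothetical placeholders.

# GP% guardrails by customer tier: (floor %, ceiling %) -- placeholder values
GP_GUARDRAILS = {
    1: (18.0, 45.0),
    2: (15.0, 40.0),
    3: (12.0, 35.0),
}

def apply_guardrails(recommended_gp_pct, customer_tier):
    # Clamp the model's recommended GP% to the tier's allowed band
    floor, ceiling = GP_GUARDRAILS.get(customer_tier, (10.0, 50.0))
    return max(floor, min(recommended_gp_pct, ceiling))

# Example: apply_guardrails(8.7, customer_tier=2) returns 15.0 (the tier-2 floor)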

    Conclusion: Turning Data Into Dynamic Pricing Decisions

    Building an AI pricing model with Python transforms pricing from guesswork into a measurable, data-driven strategy. By combining regression and classification models with tools like pandas, scikit-learn, XGBoost, and SHAP, businesses can predict both the optimal price and the probability of winning at that price — all while understanding why the model makes its recommendations. The result is a pricing framework that adapts to changing markets, maximizes profit margins, and empowers your team to make confident, intelligent decisions. As AI continues to reshape competitive industries, developing your own machine learning pricing model isn’t just a technical advantage — it’s a strategic one.