In today’s competitive markets, pricing has become one of the most powerful levers for profitability — and one of the hardest to get right. Traditional pricing methods, like simple margin targets or “last price quoted” rules, can overlook complex relationships between cost, demand, competition, and customer behavior.
That’s where Machine Learning (ML) comes in. By analyzing large volumes of historical quote and sales data, an AI Pricing Model can uncover hidden patterns — such as how product category, lead time, customer tier, or market conditions influence the probability of winning a quote or achieving a target gross profit.
In this post, we’ll walk through how to build an AI Pricing Model using Machine Learning in Python, using tools like pandas, scikit-learn, and XGBoost. You’ll learn not just how the model works, but also why it can outperform traditional pricing logic — and how to interpret the model’s predictions in a way that’s practical for business decision-making.
By the end, you’ll understand how to:
- Prepare and clean pricing data for machine learning,
- Train and evaluate a predictive pricing model,
- Measure model accuracy using metrics like RMSE, R², and AUC,
- Apply your AI pricing model to generate optimized prices that balance margin and win probability.
Why Machine Learning Works for Pricing Optimization
At its core, pricing optimization is about understanding how different factors — such as cost, competition, customer type, quantity, and lead time — influence a buyer’s willingness to pay and your ability to win profitable deals. The challenge is that these relationships are rarely linear or static. They can shift over time, differ across product categories, and interact in subtle ways that are hard to detect with traditional rule-based logic or spreadsheets.
That’s exactly where Machine Learning (ML) excels.
A Machine Learning model for pricing learns directly from historical quote and sales data. It doesn’t rely on hard-coded formulas; instead, it identifies patterns and correlations that may not be obvious to humans. For example, it might learn that:
- A certain customer tier consistently pays higher prices for small-quantity orders,
- Short lead times increase win probability but reduce achievable margins,
- Specific product categories are more price-sensitive during certain market conditions.
Because ML models can analyze millions of records at once, they’re capable of understanding these complex interactions and weighting them appropriately. The result is an AI Pricing Model that predicts outcomes such as:
- Win probability for a given quote price, or
- Expected gross profit (GP%) given market and customer conditions.
With those predictions, pricing teams can simulate “what-if” scenarios — such as, “What happens to win probability if we increase price by 3%?” — and make more confident, data-driven decisions.
Dynamic Learning and Adaptability
Another reason Machine Learning works so well for pricing optimization is its ability to evolve. As new data is collected — from market shifts, supply chain changes, or customer behavior — the model can be retrained to learn from recent patterns. This makes it inherently adaptive, unlike static pricing rules that quickly become outdated.
From Insight to Action
When combined with explainability tools like SHAP (SHapley Additive exPlanations), businesses can even interpret why the model made certain pricing recommendations. This transparency builds trust and ensures that AI-driven pricing decisions are both accurate and explainable — not just “black box” outputs.
Two Core Models Behind an AI Pricing Framework: Regression and Classification
A strong AI Pricing Model using Machine Learning in Python usually isn’t just one model — it’s a combination of two that work together to predict both what price to quote and how likely you are to win at that price. These are called regression models and classification models, and each plays a distinct but complementary role.
1. The Regression Model — “What GP% should we expect (or recommend)?”
Think of the regression model as the price-setting brain of your AI Pricing system.
It answers questions like:
“Given these conditions — product group, customer tier, lead time, and market — what gross profit percentage (GP%) should we expect on this quote?”
The model learns from past quotes where you know both the inputs (features such as customer, product, and quantity) and the outcome (the GP% you actually achieved).
Over time, it learns relationships like:
- “High-volume orders tend to have lower GP%,”
- “Certain customers consistently negotiate tighter margins,”
- “Stock items can carry higher GP% due to faster availability.”
By understanding these patterns, the regression model can predict the most reasonable or competitive GP% for a new quote — effectively recommending a data-driven price point that aligns with historical success patterns.
2. The Classification Model — “What’s the probability we’ll win this quote?”
If the regression model helps you set the price, the classification model helps you evaluate the risk and opportunity of that price.
This model predicts a win probability — essentially answering:
“Given this quote’s characteristics and price level, what’s the likelihood that the customer will award us the order?”
It learns from historical data labeled as won or lost. For each quote, it examines factors such as:
- Quoted GP% (price competitiveness),
- Customer relationship or tier,
- Lead time,
- Product category or market conditions.
The output is a probability — for example, “There’s a 72% chance of winning this quote at this price.”
With that, your pricing system can balance profit vs. win likelihood, enabling smarter trade-offs — such as lowering margin slightly on a high-probability deal or holding firm on price when the odds of winning are already low.
When You Combine the Two
When you use both models together, you get a complete AI Pricing Framework:
- The Regression Model recommends a target price or GP%,
- The Classification Model estimates the win probability at that price,
- And together they create a feedback loop that helps your team find the optimal price point — the one that maximizes both revenue and likelihood of success.
This combination mirrors what experienced sales or pricing analysts do intuitively — except the AI can analyze millions of records and update itself continuously as new data comes in.
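To make that trade-off concrete, here is a minimal sketch with made-up numbers: score each candidate price by expected margin, which is (price - cost) multiplied by the predicted win probability, and pick the highest.

# Hypothetical candidates: (price, cost, predicted win probability)
candidates = [
    (125.0, 100.0, 0.80),  # lower margin, higher win probability
    (135.0, 100.0, 0.55),  # higher margin, lower win probability
]

for price, cost, p_win in candidates:
    print(f"price={price:.2f} -> expected_margin={(price - cost) * p_win:.2f}")

# Expected margins: 20.00 vs 19.25, so the lower price wins this comparison
best = max(candidates, key=lambda c: (c[0] - c[1]) * c[2])
print(f"Best price by expected margin: {best[0]:.2f}")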
Tools and Frameworks in Python for Building an AI Pricing Model
One of the biggest advantages of building an AI Pricing Model using Machine Learning in Python is that the Python ecosystem already provides powerful, production-ready libraries for every stage of the workflow — from cleaning your data to training, evaluating, and explaining the model’s predictions.
Below are the key tools you’ll use, and the role each one plays in developing a pricing optimization framework.
1. pandas — The Data Preparation Workhorse
Before you can train a model, your quote and sales data needs to be cleaned and structured.
That’s where pandas comes in. It’s a Python library designed for working with tabular data — like your pricing history — using simple, spreadsheet-like commands.
With pandas, you can:
- Load CSVs or Excel files into DataFrames,
- Handle missing or invalid data,
- Create new features (e.g., “lead_time_days” or “customer_tier”),
- Filter, sort, and group data by product or customer,
- Join multiple datasets (e.g., cost data with quote history).
In short: pandas is where you prepare your pricing dataset for Machine Learning.
import pandas as pd

# Parse both date columns so the subtraction below yields a Timedelta
df = pd.read_csv("quotes.csv", parse_dates=["quote_date", "expected_ship_date"])
df["lead_time_days"] = (df["expected_ship_date"] - df["quote_date"]).dt.days
2. scikit-learn — The Foundation for Machine Learning
Once your data is ready, scikit-learn provides the essential tools for building and evaluating models.
It includes algorithms for both regression and classification, along with utilities for:
- Splitting your dataset into training and test sets,
- Scaling and encoding data,
- Evaluating model accuracy using metrics like RMSE, MAE, and R² (for regression) or AUC (for classification),
- Building pipelines that make your ML workflow repeatable and organized.
Even if you later switch to more advanced models like XGBoost, scikit-learn remains the framework that ties everything together.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
3. XGBoost — High-Performance Predictive Modeling
For real-world pricing problems, where data can include millions of quotes and dozens of features, XGBoost (Extreme Gradient Boosting) is a top performer.
It’s a gradient-boosted decision tree algorithm known for:
- Handling nonlinear relationships (e.g., between GP% and quantity),
- Managing missing values gracefully,
- Delivering high accuracy and fast training times.
In your AI Pricing Model, XGBoost is typically used for both:
- Regression → predicting the expected GP% for a quote,
- Classification → predicting the probability of winning at that price.
Its robustness and interpretability make it one of the most trusted algorithms for business-critical ML applications.
import xgboost as xgb
reg_model = xgb.XGBRegressor()
clf_model = xgb.XGBClassifier()
4. SHAP (SHapley Additive exPlanations) — Explaining the Model
Even the most accurate AI pricing model isn’t useful if you can’t explain why it makes a certain recommendation.
That’s where SHAP comes in. SHAP values quantify how much each feature — such as customer tier, lead time, or product group — contributes to a specific prediction.
For example:
- “Lead time contributed +1.2% to the GP% recommendation,”
- “Customer tier lowered the win probability by 8%.”
With SHAP visualizations, pricing analysts can see exactly what drives the model’s logic, turning complex AI outputs into actionable business insights.
import shap

# Assumes reg_model is a trained tree-based model (e.g., the XGBRegressor above,
# after fitting) and X_test is the matching feature matrix
explainer = shap.TreeExplainer(reg_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
Bringing It All Together
When combined, these Python tools provide a complete end-to-end solution for building an AI Pricing Model:
| Stage | Goal | Tool |
|---|---|---|
| Data preparation | Clean and organize quote data | pandas |
| Model training | Build regression and classification models | scikit-learn, XGBoost |
| Model evaluation | Measure accuracy and predictive power | scikit-learn metrics |
| Model explainability | Visualize feature importance and logic | SHAP |
Together, these frameworks turn raw pricing history into a living, learning system — one that continuously refines your pricing strategy based on data, not gut feel.
Step-by-Step: How to Build an AI Pricing Model Using Machine Learning in Python
This walkthrough shows how to build the two-model framework:
- a Regression model to recommend a target GP%, and
- a Classification model to estimate win probability at a given price.
We’ll use pandas, scikit-learn, XGBoost, and SHAP.
Mini data-sanity checklist (save the deep dive for the next post):
- Remove or flag obvious data errors (negative quantities/costs, GP% > 100, etc.).
- Avoid leakage: for the win model, only use features available before the decision (e.g., do not use “won/lost” derived fields or post-quote info).
- Ensure time awareness: train on older data, test on newer (or do time-based CV).
- Encode categories (customer tier, product group/category) and handle missing values.
1) Setup & Load Data
import pandas as pd
import numpy as np
# Load your quotes dataset
df = pd.read_csv("quotes.csv", parse_dates=["quote_date"], low_memory=False)
# Example expected columns (adjust to your schema):
# 'quoted_price', 'quoted_quantity', 'cost', 'quoted_gp_pct', 'won_flag',
# 'product_group', 'product_category', 'customer_tier', 'lead_time_days',
# 'is_stock_item', 'on_hand_qty_at_quote'
Light cleaning:
# Basic filters/sanity
df = df.dropna(subset=["quoted_gp_pct", "won_flag", "product_group", "customer_tier"])
df = df[(df["quoted_gp_pct"] > -20) & (df["quoted_gp_pct"] < 100)] # tweak if needed
# Ensure types
df["is_stock_item"] = df["is_stock_item"].astype(int) # 0/1
df["won_flag"] = df["won_flag"].astype(int) # 0/1
2) Feature Sets for Each Model
- Regression (target = quoted_gp_pct)
  Inputs that influence achievable margin: ['product_group', 'product_category', 'customer_tier', 'lead_time_days', 'is_stock_item', 'on_hand_qty_at_quote']
- Classification (target = won_flag)
  Include the price signal (e.g., quoted_gp_pct) plus the same context features: ['quoted_gp_pct', 'product_group', 'product_category', 'customer_tier', 'lead_time_days', 'is_stock_item', 'on_hand_qty_at_quote']
Tip: Keep feature names aligned between models so the system is easy to maintain. Avoid feature leakage: never hand a model a feature that gives away its own answer (for example, don't feed the regression model the quoted price or actual GP% as an input, and don't feed the classification model any field derived from the won/lost outcome). Also remember that whatever you choose as features is exactly what you'll need to supply to the models later at prediction time.
3) Train/Test Split (time-aware if possible)
# Option A: random split (simple)
from sklearn.model_selection import train_test_split
reg_features = ['product_group','product_category','customer_tier',
'lead_time_days','is_stock_item','on_hand_qty_at_quote']
clf_features = ['quoted_gp_pct'] + reg_features
X_reg = df[reg_features]
y_reg = df["quoted_gp_pct"]
X_clf = df[clf_features]
y_clf = df["won_flag"]
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)
# Option B (recommended for production): split by date so test is newer period
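# A minimal sketch of the date-based split (assumes quote_date was parsed in step 1):
cutoff = df["quote_date"].quantile(0.8)  # hold out the newest ~20% of quotes as the test set
train_mask = df["quote_date"] <= cutoff
Xr_train, Xr_test = X_reg[train_mask], X_reg[~train_mask]
yr_train, yr_test = y_reg[train_mask], y_reg[~train_mask]
Xc_train, Xc_test = X_clf[train_mask], X_clf[~train_mask]
yc_train, yc_test = y_clf[train_mask], y_clf[~train_mask]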
4) Preprocessing Pipelines
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, roc_auc_score
import xgboost as xgb
import numpy as np
cat_cols = ['product_group','product_category','customer_tier']
num_cols = ['lead_time_days','is_stock_item','on_hand_qty_at_quote']
preprocessor = ColumnTransformer(
transformers=[
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
("num", "passthrough", num_cols)
]
)
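Before wiring this into a pipeline, a quick optional sanity check is to fit the preprocessor on the training features and inspect the encoded output:

# Fit on training data only, then inspect the encoded matrix
X_encoded = preprocessor.fit_transform(Xr_train)
print("Encoded shape:", X_encoded.shape)
print("Sample feature names:", list(preprocessor.get_feature_names_out()[:5]))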
5) Train the Regression Model (GP% recommender)
reg_model = xgb.XGBRegressor(
n_estimators=600,
max_depth=6,
learning_rate=0.05,
subsample=0.9,
colsample_bytree=0.9,
random_state=42,
n_jobs=-1
)
reg_pipe = Pipeline([
("prep", preprocessor),
("model", reg_model)
])
reg_pipe.fit(Xr_train, yr_train)
# Evaluate
yr_pred = reg_pipe.predict(Xr_test)
rmse = np.sqrt(mean_squared_error(yr_test, yr_pred))
mae = mean_absolute_error(yr_test, yr_pred)
r2 = r2_score(yr_test, yr_pred)
print(f"Regression — RMSE: {rmse:.3f} | MAE: {mae:.3f} | R²: {r2:.3f}")
6) Train the Classification Model (win probability)
clf_model = xgb.XGBClassifier(
n_estimators=600,
max_depth=6,
learning_rate=0.05,
subsample=0.9,
colsample_bytree=0.9,
random_state=42,
n_jobs=-1,
eval_metric="auc"
)
# Preprocessor is the same structure, but includes quoted_gp_pct as numeric
cat_cols_c = cat_cols
num_cols_c = ['quoted_gp_pct','lead_time_days','is_stock_item','on_hand_qty_at_quote']
preprocessor_clf = ColumnTransformer(
transformers=[
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols_c),
("num", "passthrough", num_cols_c)
]
)
clf_pipe = Pipeline([
("prep", preprocessor_clf),
("model", clf_model)
])
clf_pipe.fit(Xc_train, yc_train)
# Evaluate
yc_pred_proba = clf_pipe.predict_proba(Xc_test)[:,1]
auc = roc_auc_score(yc_test, yc_pred_proba)
print(f"Classification — AUC: {auc:.3f}")
7) Explainability with SHAP (optional but recommended)
import shap
# For tree-based models, explain on the transformed matrix
# Grab a small sample to keep plots fast
sample = Xr_test.sample(n=min(2000, len(Xr_test)), random_state=42)
# Fit a TreeExplainer on the trained XGB model inside the pipeline
# We need the model object (reg_model) and the transformed features
X_sample_transformed = reg_pipe.named_steps["prep"].transform(sample)
# Densify if the transformer returned a sparse matrix (SHAP plots expect dense input)
if hasattr(X_sample_transformed, "toarray"):
    X_sample_transformed = X_sample_transformed.toarray()
explainer = shap.TreeExplainer(reg_pipe.named_steps["model"])
shap_values = explainer.shap_values(X_sample_transformed)
# Summary plot (run in notebooks)
# shap.summary_plot(shap_values, X_sample_transformed,
#                   feature_names=reg_pipe.named_steps["prep"].get_feature_names_out())
Tip: For reports, capture SHAP bar plots for top features affecting GP% and win probability. This builds trust with commercial teams.
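For those report-ready bar plots, the same summary_plot call accepts plot_type="bar", which ranks features by mean absolute SHAP value:

# Bar chart of mean |SHAP value| per feature (run in notebooks)
# shap.summary_plot(shap_values, X_sample_transformed,
#                   feature_names=reg_pipe.named_steps["prep"].get_feature_names_out(),
#                   plot_type="bar")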
8) Put the Models to Work: Recommend a Price & Simulate Win Probability
Flow in production for a new quote:
- Use the regression model to recommend a baseline GP%.
- Convert GP% → price (based on cost).
- Create a small price ladder around that recommendation (±2–5 percentage points).
- For each rung, compute win probability via the classification model.
- Pick the rung that meets your business objective (e.g., maximize expected margin = margin × win_prob, or enforce a minimum win probability).
Example:
def gp_to_price(cost, gp_pct):
    # gp_pct is a percentage number, e.g., 25 means 25% gross profit on price
    return cost / (1 - gp_pct / 100.0)

def simulate_ladder(row, reg_pipe, clf_pipe, ladder_pts=(-4, -2, 0, 2, 4)):
    # 1) Predict baseline GP% (infer_objects restores numeric dtypes,
    #    since slicing a single row yields object-dtype columns)
    reg_input = row[reg_features].to_frame().T.infer_objects()
    gp_base = float(reg_pipe.predict(reg_input)[0])
    results = []
    for delta in ladder_pts:
        gp_try = max(min(gp_base + delta, 95), -5)  # clamp to a sane range
        price_try = gp_to_price(row["cost"], gp_try)
        # 2) Ask the classifier: what is the win probability at this GP%?
        clf_input = row[clf_features].copy()
        clf_input["quoted_gp_pct"] = gp_try
        clf_df = clf_input.to_frame().T.infer_objects()
        win_prob = float(clf_pipe.predict_proba(clf_df)[:, 1][0])
        margin = price_try - row["cost"]  # per-unit margin
        expected_margin = margin * win_prob
        results.append({
            "gp_pct": round(gp_try, 2),
            "price": round(price_try, 2),
            "win_prob": round(win_prob, 3),
            "expected_margin": round(expected_margin, 2),
        })
    return pd.DataFrame(results).sort_values("expected_margin", ascending=False)
# Example usage on a single quote row (replace with real row)
# row = df.iloc[0]
# ladder = simulate_ladder(row, reg_pipe, clf_pipe)
# display(ladder)
9) Save & Load Models
import joblib
joblib.dump(reg_pipe, "gp_regression_pipe.joblib")
joblib.dump(clf_pipe, "win_classifier_pipe.joblib")
# Later
# reg_pipe = joblib.load("gp_regression_pipe.joblib")
# clf_pipe = joblib.load("win_classifier_pipe.joblib")
10) Production Tips
- Time-based validation: use rolling windows to ensure robustness across market regimes.
- Segmented models: consider separate models by product category if behavior differs drastically.
- Guardrails: enforce GP% floors/ceilings by customer tier or category.
- Monitoring: track drift in feature distributions and periodic re-train cadence (monthly/quarterly).
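For the monitoring tip, one commonly used drift measure is the Population Stability Index (PSI). Below is a minimal sketch, assuming you keep a snapshot of training-time feature values to compare against recent quotes (train_df and recent_df here are hypothetical):

import numpy as np

def psi(expected, actual, bins=10):
    # Population Stability Index between a baseline sample ("expected")
    # and a recent sample ("actual") of one numeric feature
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / max(len(expected), 1)
    a_pct = np.histogram(actual, bins=edges)[0] / max(len(actual), 1)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Commonly cited rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate
# train_df and recent_df are hypothetical snapshots of older vs. newer quotes:
# print(psi(train_df["quoted_gp_pct"], recent_df["quoted_gp_pct"]))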
Conclusion: Turning Data Into Dynamic Pricing Decisions
Building an AI pricing model with Python transforms pricing from guesswork into a measurable, data-driven strategy. By combining regression and classification models with tools like pandas, scikit-learn, XGBoost, and SHAP, businesses can predict both the optimal price and the probability of winning at that price — all while understanding why the model makes its recommendations. The result is a pricing framework that adapts to changing markets, maximizes profit margins, and empowers your team to make confident, intelligent decisions. As AI continues to reshape competitive industries, developing your own machine learning pricing model isn’t just a technical advantage — it’s a strategic one.