The aim of this project is to predict whether a car purchased at auction is a good or a bad buy. All the variables in the data set are defined below:
Field Name: Definition
RefId: Unique (sequential) number assigned to vehicles
IsBadBuy: Identifies if the kicked vehicle was an avoidable purchase
PurchDate: The Date the vehicle was Purchased at Auction
Auction: Auction provider at which the vehicle was purchased
VehYear: The manufacturer's year of the vehicle
VehicleAge: The Years elapsed since the manufacturer's year
Make: Vehicle Manufacturer
Model: Vehicle Model
Trim: Vehicle Trim Level
SubModel: Vehicle Submodel
Color: Vehicle Color
Transmission: Vehicle's transmission type (Automatic, Manual)
WheelTypeID: The type id of the vehicle wheel
WheelType: The vehicle wheel type description (Alloy, Covers)
VehOdo: The vehicle's odometer reading
Nationality: The Manufacturer's country
Size: The size category of the vehicle (Compact, SUV, etc.)
TopThreeAmericanName: Identifies if the manufacturer is one of the top three American manufacturers
MMRAcquisitionAuctionAveragePrice: Acquisition price for this vehicle in average condition at time of purchase
MMRAcquisitionAuctionCleanPrice: Acquisition price for this vehicle in above-average condition at time of purchase
MMRAcquisitionRetailAveragePrice: Acquisition price for this vehicle in the retail market in average condition at time of purchase
MMRAcquisitonRetailCleanPrice: Acquisition price for this vehicle in the retail market in above average condition at time of purchase
MMRCurrentAuctionAveragePrice: Acquisition price for this vehicle in average condition as of current day
MMRCurrentAuctionCleanPrice: Acquisition price for this vehicle in above-average condition as of current day
MMRCurrentRetailAveragePrice: Acquisition price for this vehicle in the retail market in average condition as of current day
MMRCurrentRetailCleanPrice: Acquisition price for this vehicle in the retail market in above average condition as of current day
PRIMEUNIT: Identifies if the vehicle would have a higher demand than a standard purchase
AcquisitionType: Identifies how the vehicle was acquired (Auction buy, trade in, etc.)
AUCGUART: The level of guarantee provided by the auction for the vehicle (Green light - Guaranteed/arbitratable, Yellow light - caution/issue, Red light - sold as is)
KickDate: Date the vehicle was kicked back to the auction
BYRNO: Unique number assigned to the buyer that purchased the vehicle
VNZIP1: Zipcode where the car was purchased
VNST: State where the car was purchased
VehBCost: Acquisition cost paid for the vehicle at time of purchase
IsOnlineSale: Identifies if the vehicle was originally purchased online
WarrantyCost: Warranty price (term = 36 months, mileage = 36,000 miles)
The data contains missing values. The dependent variable (IsBadBuy) is binary, and there are 32 independent variables. The data set is split into 60% training and 40% testing.
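For reference, a 60/40 split like the one described above can be produced with scikit-learn's `train_test_split`. The snippet below is an illustrative sketch on a toy DataFrame (the rows and values are hypothetical), stratifying on the binary target so both parts keep the same class ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the real data (values are hypothetical)
toy = pd.DataFrame({
    'VehicleAge': range(10),
    'IsBadBuy':  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# 60% train / 40% test, preserving the IsBadBuy class ratio in both splits
train_part, test_part = train_test_split(
    toy, test_size=0.4, stratify=toy['IsBadBuy'], random_state=0)

print(len(train_part), len(test_part))  # 6 4
```

Stratification matters here because the bad-buy class is the minority: an unstratified split could, by chance, leave one side with almost no positive examples.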
#import basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#import sklearn module for Machine Learning
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgbm
from sklearn.metrics import accuracy_score, classification_report,precision_score, \
recall_score,precision_recall_curve,roc_auc_score,confusion_matrix,roc_curve
# load data
train = pd.read_csv('training.csv').set_index('RefId')
test = pd.read_csv('test.csv').set_index('RefId')
train['split'] = 'train'
test['split'] = 'test'
data = train.copy()
data.head()
print("Dataset size: ",len(data))
print('Missing values in each column: \n', data.isnull().sum())
# print the size of each class in the dependent binary IsBadBuy variable
print(train.groupby("IsBadBuy").size())
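The two classes are typically quite imbalanced in this problem, which is what motivates the `class_weight='balanced'` option used for the models further below. scikit-learn derives those weights as `n_samples / (n_classes * count(class))`; a small sketch with hypothetical counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical labels: 8 good buys (0) and 2 bad buys (1)
y = np.array([0] * 8 + [1] * 2)

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(weights)  # [0.625 2.5] -> the minority class is upweighted
```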
# obtain a rough picture of correlations between the variables in the dataset
corr = data.corr()
corr.style.background_gradient()
# plot histograms of vehicle cost, broken down by the IsBadBuy variable
fig, ax = plt.subplots(1, 2, figsize = (12,4))
data['VehBCost'].hist(by=data['IsBadBuy'], bins=30, xrot=360, ax=ax)
ax[0].set_title("IsBadBuy = 0")
ax[1].set_title("IsBadBuy = 1")
ax[0].set_xlabel("Vehicle cost")
ax[1].set_xlabel("Vehicle cost")
plt.show()
fig, ax = plt.subplots(1, 2, figsize = (12,4))
#plot VehicleAge vs. IsBadBuy
data.groupby('VehicleAge')['IsBadBuy'].agg(['mean', 'size']).query('size > 250')['mean'].plot(ax=ax[0], title = "VehicleAge Vs IsBadBuy")
# create new variable essentially rounding the vehicle cost to the last two digits
data2 = data.copy()
data2['RoundVehBCost'] = round(data['VehBCost'],-2)
# plot the rounded vehicle cost vs. IsBadBuy
data2.groupby('RoundVehBCost')['IsBadBuy'].agg(['mean', 'size']).query('size > 250')['mean'].plot(ax=ax[1], title = "RoundVehBCost Vs IsBadBuy")
plt.show()
# create a bar plot of the probability of a bad buy by vehicle year
data.groupby("VehYear")["IsBadBuy"].mean().plot.bar(title = "VehYear Vs IsBadBuy")
plt.show()
data.groupby('Make')['IsBadBuy'].agg(['mean', 'size']).\
query('size > 50')['mean'].plot.bar(figsize=(14,5), title = "Vehicle Manufacturer Vs IsBadBuy")
plt.show()
data.groupby('WheelType')['IsBadBuy'].mean().plot.bar(title = "IsBadBuy Vs WheelType")
plt.show()
data.groupby('VNST')['IsBadBuy'].agg(['mean', 'size']).\
query('size > 250')['mean'].plot.bar(figsize=(12,5), title = "State where the car was purchased Vs IsBadBuy")
plt.show()
# plot odometer reading vs WarrantyCost
plt.scatter(data['VehOdo'], data['WarrantyCost'], alpha=0.5)
plt.xlabel('Odometer')
plt.ylabel('WarrantyCost')
plt.show()
Let's create a function that fits the following well-known ML algorithms, in order of increasing complexity: logistic regression, random forest, and XGBoost.
def ml_models(x_train, y_train):
    models = {}
    logreg = LogisticRegression(class_weight='balanced', random_state=25)
    randfor = RandomForestClassifier(n_estimators=75, max_features=5, max_depth=20,
                                     min_samples_split=100, class_weight='balanced', random_state=25)
    xgboost = xgb.XGBClassifier(objective='binary:logistic', colsample_bytree=0.3,
                                learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10, random_state=25)
    models["LogisticRegression"] = logreg
    models["RandomForest"] = randfor
    models["XGBoost"] = xgboost
    # fit every model on the training data
    for name, model in models.items():
        model.fit(x_train, y_train)
    return models
def cross_validation(models, x_train, y_train, k_fold, metric):
    AUC = {}
    for name, model in models.items():
        model_results = model_selection.cross_val_score(model, x_train, y_train, cv=k_fold, scoring=metric)
        mean_auc = model_results.mean()
        std = model_results.std()
        # print the mean and standard deviation of the cross-validated training score
        print('The model {} has AUC {} and STD {}.'.format(name, mean_auc, std))
        AUC[name] = mean_auc
    return AUC
def show_results(models, AUC, x_test, y_test):
    print('-------------- Model Summary --------------\n')
    plt.figure()
    for name, model in models.items():
        model_predicted = model.predict(x_test)
        # ROC AUC should be computed on predicted probabilities, not hard labels
        model_proba = model.predict_proba(x_test)[:, 1]
        print('Model accuracy for {} = {}\n'.format(name, accuracy_score(y_test, model_predicted)))
        model_roc_auc = roc_auc_score(y_test, model_proba)
        print('Model ROC AUC for {} = {}\n'.format(name, model_roc_auc))
        print(classification_report(y_test, model_predicted))
        print("\n")
        model_matrix = confusion_matrix(y_test, model_predicted)
        print('Confusion Matrix for model {} : \n {}\n'.format(name, model_matrix))
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, model_proba)
        # plot the ROC curve for this model
        plt.plot(false_positive_rate, true_positive_rate, label='%s (area = %0.2f)' % (name, AUC[name]))
    # plot the chance-level (base rate) ROC once, after all models
    plt.plot([0, 1], [0, 1], label='Base Rate')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Plot')
    plt.legend(loc="lower right")
    plt.show()
def get_feature_importance(models, x_train):
    for name, model in models.items():
        if hasattr(model, 'feature_importances_'):
            feature_importances = pd.DataFrame(model.feature_importances_,
                                               index=x_train.columns,
                                               columns=['importance']).sort_values('importance', ascending=False)
            feature_importances = feature_importances.reset_index()
            print("Feature importances for model {} are \n {}".format(name, feature_importances))
            feature_importances.plot.bar()
            plt.show()
        else:
            print("Feature importances unavailable for model", name)
def split_categ_contin_cols(df, columns):
    # separate column names into categorical (object dtype) and continuous
    categ_cols = []
    contin_cols = []
    for col in columns:
        if df[col].dtype == 'object':
            categ_cols.append(col)
        else:
            contin_cols.append(col)
    return categ_cols, contin_cols
def fillNAvalues(df, null_categcols, null_contincols):
    # fill missing categorical values with an explicit 'NA' level
    df_nullcategcols = df[null_categcols].fillna('NA')
    # impute missing continuous values with the column mean
    # (assign the result rather than using inplace on a slice, which
    #  triggers SettingWithCopyWarning and may not modify the frame)
    df_nullcontincols = df[null_contincols].fillna(df[null_contincols].mean())
    columns = list(set(df.columns) - set(null_categcols) - set(null_contincols))
    df_fillna = pd.concat([df[columns], df_nullcategcols, df_nullcontincols], axis=1)
    return df_fillna
def findNullVals(df):
    null_vals = df.isnull().sum().sort_values()
    df_null = pd.DataFrame({'nullcols': null_vals.index, 'countval': null_vals.values})
    df_null = df_null[df_null.countval > 0]
    print("Null variables with values :", df_null)
    print("Duplicated values :", df_null.duplicated().sum())
    null_categcol, null_contincol = split_categ_contin_cols(df, df_null.nullcols)
    return null_categcol, null_contincol
def find_outliers_scale(df, columns):
    for col in columns:
        # get summary statistics for the variable
        stats = df[col].describe()
        IQR = stats['75%'] - stats['25%']
        upper = stats['75%'] + 1.5 * IQR
        lower = stats['25%'] - 1.5 * IQR
        print('The upper and lower bounds of {} for candidate outliers are {} and {}.'.format(col, upper, lower))
        print("Values less than lower bound : ", len(df[df[col] < lower]))
        print("Values greater than upper bound : ", len(df[df[col] > upper]))
        # compress the scale (and soften outliers) with log1p
        df[col] = np.log1p(df[col])
    return df[columns]
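As a quick numeric check of the 1.5 x IQR rule used in `find_outliers_scale`, consider a hypothetical column with one obvious outlier:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])  # 100 is the outlier
stats = s.describe()
IQR = stats['75%'] - stats['25%']   # 8.5 - 3.5 = 5.0
upper = stats['75%'] + 1.5 * IQR    # 16.0
lower = stats['25%'] - 1.5 * IQR    # -4.0
print((s > upper).sum())            # 1 -> only the value 100 is flagged
```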
def label_encode(df, columns):
    label_enc = preprocessing.LabelEncoder()
    for col in columns:
        # encode category labels as integers, then apply log1p as for the continuous columns
        df[col] = np.log1p(label_enc.fit_transform(df[col]))
    return df[columns]
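To make the behaviour of `label_encode` concrete: `LabelEncoder` maps the sorted unique categories to the integers 0..k-1, and the function above then applies `log1p` to those codes. A hypothetical three-colour column:

```python
import numpy as np
from sklearn import preprocessing

colors = ['RED', 'GREEN', 'BLUE', 'GREEN']
enc = preprocessing.LabelEncoder()
codes = enc.fit_transform(colors)
print(list(enc.classes_))  # ['BLUE', 'GREEN', 'RED'] -- sorted alphabetically
print(list(codes))         # [2, 1, 0, 1]
print(np.log1p(codes))     # what label_encode would store in the frame
```

Note that the integer codes (and hence their logs) impose an arbitrary alphabetical ordering on the categories; tree-based models tolerate this reasonably well, but for logistic regression a one-hot encoding would avoid the spurious ordering.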
# drop redundant columns
data_dropped = data.drop(['AUCGUART','PRIMEUNIT','Nationality','VNZIP1','VNST',\
'BYRNO','WheelTypeID','PurchDate','VehYear'],axis=1)
all_columns = data_dropped.columns
categcols, contincols = split_categ_contin_cols(data_dropped, all_columns)
print ("Categorical columns: ", categcols)
print("\n")
uid = ['RefId']
target = ['IsBadBuy']
contincols = list(set(contincols) - set(uid) - set(target))
features = categcols + contincols
print ("Continuous variables after target and id removal: ", contincols)
print("\n")
# some manual overwriting of column names and NA values
data_dropped.loc[data_dropped.Transmission == 'Manual', 'Transmission'] = 'MANUAL'
data_dropped.loc[data_dropped.Color == 'NOT AVAIL', 'Color'] = 'NA'
data_dropped.loc[data_dropped.Color == 'OTHER', 'Color'] = 'NA'
data_dropped.loc[data_dropped.TopThreeAmericanName == 'OTHER', 'TopThreeAmericanName'] = 'NA'
null_categ_cols, null_contin_cols = findNullVals(data_dropped)
data_dropped_new = fillNAvalues(data_dropped, null_categ_cols, null_contin_cols)
# find outliers and scale the continuous variables
data_contin = find_outliers_scale(data_dropped_new, contincols)
# encode labels for the categorical variables
data_categ = label_encode(data_dropped_new, categcols)
data_train = pd.concat([data_categ, data_contin, data_dropped_new[target]], axis=1)
# get the train and test splits
x_train, x_test, y_train, y_test = train_test_split(data_train[features],data_train[target],test_size=0.2,random_state=7)
kfold = model_selection.KFold(n_splits=10)
metric = 'roc_auc'
models = ml_models(x_train, y_train)
model_auc = cross_validation(models, x_train, y_train, kfold, metric)
show_results(models, model_auc, x_test, y_test)
get_feature_importance(models, x_train)
As we can see, the models show good results despite their simplicity and the absence of hyperparameter tuning. The most important features can be read off the feature-importance plots above.