The aim of this project is to predict whether a car purchased at auction is a good or a bad buy. All the variables in the data set are defined below:
Field Name: Definition
RefId: Unique (sequential) number assigned to vehicles
IsBadBuy: Identifies if the kicked vehicle was an avoidable purchase
PurchDate: The Date the vehicle was Purchased at Auction
Auction: Auction provider at which the vehicle was purchased
VehYear: The manufacturer's year of the vehicle
VehicleAge: The Years elapsed since the manufacturer's year
Make: Vehicle Manufacturer
Model: Vehicle Model
Trim: Vehicle Trim Level
SubModel: Vehicle Submodel
Color: Vehicle Color
Transmission: Vehicle's transmission type (Automatic, Manual)
WheelTypeID: The type id of the vehicle wheel
WheelType: The vehicle wheel type description (Alloy, Covers)
VehOdo: The vehicle's odometer reading
Nationality: The Manufacturer's country
Size: The size category of the vehicle (Compact, SUV, etc.)
TopThreeAmericanName: Identifies if the manufacturer is one of the top three American manufacturers
MMRAcquisitionAuctionAveragePrice: Acquisition price for this vehicle in average condition at time of purchase
MMRAcquisitionAuctionCleanPrice: Acquisition price for this vehicle in above-average condition at time of purchase
MMRAcquisitionRetailAveragePrice: Acquisition price for this vehicle in the retail market in average condition at time of purchase
MMRAcquisitonRetailCleanPrice: Acquisition price for this vehicle in the retail market in above average condition at time of purchase
MMRCurrentAuctionAveragePrice: Acquisition price for this vehicle in average condition as of current day
MMRCurrentAuctionCleanPrice: Acquisition price for this vehicle in above-average condition as of current day
MMRCurrentRetailAveragePrice: Acquisition price for this vehicle in the retail market in average condition as of current day
MMRCurrentRetailCleanPrice: Acquisition price for this vehicle in the retail market in above average condition as of current day
PRIMEUNIT: Identifies if the vehicle would have a higher demand than a standard purchase
AcquisitionType: Identifies how the vehicle was acquired (Auction buy, trade in, etc.)
AUCGUART: The level of guarantee provided by the auction for the vehicle (Green light - Guaranteed/arbitratable, Yellow light - caution/issue, Red light - sold as is)
KickDate: Date the vehicle was kicked back to the auction
BYRNO: Unique number assigned to the buyer that purchased the vehicle
VNZIP1: Zipcode where the car was purchased
VNST: State where the car was purchased
VehBCost: Acquisition cost paid for the vehicle at time of purchase
IsOnlineSale: Identifies if the vehicle was originally purchased online
WarrantyCost: Warranty price (term = 36 months, mileage = 36,000 miles)
The data contains missing values. The dependent variable (IsBadBuy) is binary, and there are 32 independent variables. The data set is split into 60% training and 40% testing.
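For reference, a 60/40 split like the one described above can be produced with scikit-learn's `train_test_split`. The snippet below is an illustrative sketch on a toy DataFrame (the rows and values are hypothetical), stratifying on the binary target so both parts keep the same class ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the real data (values are hypothetical)
toy = pd.DataFrame({
    'VehicleAge': range(10),
    'IsBadBuy':  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# 60% train / 40% test, preserving the IsBadBuy class ratio in both splits
train_part, test_part = train_test_split(
    toy, test_size=0.4, stratify=toy['IsBadBuy'], random_state=0)

print(len(train_part), len(test_part))  # 6 4
```

Stratification matters here because the bad-buy class is the minority: an unstratified split could, by chance, leave one side with almost no positive examples.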
#import basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#import sklearn module for Machine Learning
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgbm
from sklearn.metrics import accuracy_score, classification_report,precision_score, \
recall_score,precision_recall_curve,roc_auc_score,confusion_matrix,roc_curve
# load data
train = pd.read_csv('training.csv').set_index('RefId')
test = pd.read_csv('test.csv').set_index('RefId')
train['split'] = 'train'
test['split'] = 'test'
data = train.copy()
data.head()
print("Dataset size: ",len(data))
print('Missing values in each column: \n', data.isnull().sum())
# print the size of each class in the dependent binary IsBadBuy variable
print(train.groupby("IsBadBuy").size())
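The two classes are typically quite imbalanced in this problem, which is what motivates the `class_weight='balanced'` option used for the models further below. scikit-learn derives those weights as `n_samples / (n_classes * count(class))`; a small sketch with hypothetical counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical labels: 8 good buys (0) and 2 bad buys (1)
y = np.array([0] * 8 + [1] * 2)

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(weights)  # [0.625 2.5] -> the minority class is upweighted
```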
# obtain a rough picture of correlations between the variables in the dataset
corr = data.corr()
corr.style.background_gradient()
# plot histograms of vehicle cost, broken down by the IsBadBuy variable
fig, ax = plt.subplots(1, 2, figsize = (12,4))
data['VehBCost'].hist(by=data['IsBadBuy'], bins=30, xrot=360, ax=ax)
ax[0].set_title("IsBadBuy = 0")
ax[1].set_title("IsBadBuy = 1")
ax[0].set_xlabel("Vehicle cost")
ax[1].set_xlabel("Vehicle cost")
plt.show()
fig, ax = plt.subplots(1, 2, figsize = (12,4))
#plot VehicleAge vs. IsBadBuy
data.groupby('VehicleAge')['IsBadBuy'].agg(['mean', 'size']).query('size > 250')['mean'].plot(ax=ax[0], title = "VehicleAge Vs IsBadBuy")
# create new variable essentially rounding the vehicle cost to the last two digits
data2 = data.copy()
data2['RoundVehBCost'] = round(data['VehBCost'],-2)
# plot the rounded vehicle cost vs. IsBadBuy
data2.groupby('RoundVehBCost')['IsBadBuy'].agg(['mean', 'size']).query('size > 250')['mean'].plot(ax=ax[1], title = "RoundVehBCost Vs IsBadBuy")
plt.show()
# create a bar plot of the probability of a bad buy by vehicle year
data.groupby("VehYear")["IsBadBuy"].mean().plot.bar(title = "VehYear Vs IsBadBuy")
plt.show()
data.groupby('Make')['IsBadBuy'].agg(['mean', 'size']).\
query('size > 50')['mean'].plot.bar(figsize=(14,5), title = "Vehicle Manufacturer Vs IsBadBuy")
plt.show()
data.groupby('WheelType')['IsBadBuy'].mean().plot.bar(title = "IsBadBuy Vs WheelType")
plt.show()
data.groupby('VNST')['IsBadBuy'].agg(['mean', 'size']).\
query('size > 250')['mean'].plot.bar(figsize=(12,5), title = "State where the car was purchased Vs IsBadBuy")
plt.show()
# plot odometer reading vs WarrantyCost
plt.scatter(data['VehOdo'], data['WarrantyCost'], alpha=0.5)
plt.xlabel('Odometer')
plt.ylabel('WarrantyCost')
plt.show()
Let's create a function that fits the following well-known ML algorithms, in order of increasing complexity: logistic regression, random forest, and XGBoost.
def ml_models(x_train, y_train):
    models = {}
    logreg = LogisticRegression(class_weight='balanced', random_state=25)
    randfor = RandomForestClassifier(n_estimators=75, max_features=5, max_depth=20,
                                     min_samples_split=100, class_weight='balanced', random_state=25)
    xgboost = xgb.XGBClassifier(objective='binary:logistic', colsample_bytree=0.3,
                                learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10, random_state=25)
    models["LogisticRegression"] = logreg
    models["RandomForest"] = randfor
    models["XGBoost"] = xgboost
    # fit every model on the training data
    for name, model in models.items():
        model.fit(x_train, y_train)
    return models
def cross_validation(models, x_train, y_train, k_fold, metric):
    AUC = {}
    for name, model in models.items():
        model_results = model_selection.cross_val_score(model, x_train, y_train, cv=k_fold, scoring=metric)
        mean_auc = model_results.mean()
        std = model_results.std()
        # print the mean and standard deviation of the cross-validated training score
        print('The model {} has AUC {} and STD {}.'.format(name, mean_auc, std))
        AUC[name] = mean_auc
    return AUC
def show_results(models, AUC, x_test, y_test):
    print('-------------- Model Summary --------------\n')
    plt.figure()
    for name, model in models.items():
        model_predicted = model.predict(x_test)
        # ROC AUC should be computed on predicted probabilities, not hard labels
        model_proba = model.predict_proba(x_test)[:, 1]
        print('Model accuracy for {} = {}\n'.format(name, accuracy_score(y_test, model_predicted)))
        model_roc_auc = roc_auc_score(y_test, model_proba)
        print('Model ROC AUC for {} = {}\n'.format(name, model_roc_auc))
        print(classification_report(y_test, model_predicted))
        print("\n")
        model_matrix = confusion_matrix(y_test, model_predicted)
        print('Confusion Matrix for model {} : \n {}\n'.format(name, model_matrix))
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, model_proba)
        # plot the ROC curve for this model
        plt.plot(false_positive_rate, true_positive_rate, label='%s (area = %0.2f)' % (name, AUC[name]))
    # plot the chance-level (base rate) ROC once, after all models
    plt.plot([0, 1], [0, 1], label='Base Rate')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Plot')
    plt.legend(loc="lower right")
    plt.show()
def get_feature_importance(models, x_train):
    for name, model in models.items():
        if hasattr(model, 'feature_importances_'):
            feature_importances = pd.DataFrame(model.feature_importances_,
                                               index=x_train.columns,
                                               columns=['importance']).sort_values('importance', ascending=False)
            feature_importances = feature_importances.reset_index()
            print("Feature importances for model {} are \n {}".format(name, feature_importances))
            feature_importances.plot.bar()
            plt.show()
        else:
            print("Feature importances unavailable for model", name)
def split_categ_contin_cols(df, columns):
    # separate column names into categorical (object dtype) and continuous
    categ_cols = []
    contin_cols = []
    for col in columns:
        if df[col].dtype == 'object':
            categ_cols.append(col)
        else:
            contin_cols.append(col)
    return categ_cols, contin_cols
def fillNAvalues(df, null_categcols, null_contincols):
    # fill missing categorical values with an explicit 'NA' level
    df_nullcategcols = df[null_categcols].fillna('NA')
    # impute missing continuous values with the column mean
    # (assign the result rather than using inplace on a slice, which
    #  triggers SettingWithCopyWarning and may not modify the frame)
    df_nullcontincols = df[null_contincols].fillna(df[null_contincols].mean())
    columns = list(set(df.columns) - set(null_categcols) - set(null_contincols))
    df_fillna = pd.concat([df[columns], df_nullcategcols, df_nullcontincols], axis=1)
    return df_fillna
def findNullVals(df):
    null_vals = df.isnull().sum().sort_values()
    df_null = pd.DataFrame({'nullcols': null_vals.index, 'countval': null_vals.values})
    df_null = df_null[df_null.countval > 0]
    print("Null variables with values :", df_null)
    print("Duplicated values :", df_null.duplicated().sum())
    null_categcol, null_contincol = split_categ_contin_cols(df, df_null.nullcols)
    return null_categcol, null_contincol
def find_outliers_scale(df, columns):
    for col in columns:
        # get summary statistics for the variable
        stats = df[col].describe()
        IQR = stats['75%'] - stats['25%']
        upper = stats['75%'] + 1.5 * IQR
        lower = stats['25%'] - 1.5 * IQR
        print('The upper and lower bounds of {} for candidate outliers are {} and {}.'.format(col, upper, lower))
        print("Values less than lower bound : ", len(df[df[col] < lower]))
        print("Values greater than upper bound : ", len(df[df[col] > upper]))
        # compress the scale (and soften outliers) with log1p
        df[col] = np.log1p(df[col])
    return df[columns]
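As a quick numeric check of the 1.5 x IQR rule used in `find_outliers_scale`, consider a hypothetical column with one obvious outlier:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])  # 100 is the outlier
stats = s.describe()
IQR = stats['75%'] - stats['25%']   # 8.5 - 3.5 = 5.0
upper = stats['75%'] + 1.5 * IQR    # 16.0
lower = stats['25%'] - 1.5 * IQR    # -4.0
print((s > upper).sum())            # 1 -> only the value 100 is flagged
```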
def label_encode(df, columns):
    label_enc = preprocessing.LabelEncoder()
    for col in columns:
        # encode category labels as integers, then apply log1p as for the continuous columns
        df[col] = np.log1p(label_enc.fit_transform(df[col]))
    return df[columns]
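To make the behaviour of `label_encode` concrete: `LabelEncoder` maps the sorted unique categories to the integers 0..k-1, and the function above then applies `log1p` to those codes. A hypothetical three-colour column:

```python
import numpy as np
from sklearn import preprocessing

colors = ['RED', 'GREEN', 'BLUE', 'GREEN']
enc = preprocessing.LabelEncoder()
codes = enc.fit_transform(colors)
print(list(enc.classes_))  # ['BLUE', 'GREEN', 'RED'] -- sorted alphabetically
print(list(codes))         # [2, 1, 0, 1]
print(np.log1p(codes))     # what label_encode would store in the frame
```

Note that the integer codes (and hence their logs) impose an arbitrary alphabetical ordering on the categories; tree-based models tolerate this reasonably well, but for logistic regression a one-hot encoding would avoid the spurious ordering.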
# drop redundant columns
data_dropped = data.drop(['AUCGUART','PRIMEUNIT','Nationality','VNZIP1','VNST',\
'BYRNO','WheelTypeID','PurchDate','VehYear'],axis=1)
all_columns = data_dropped.columns
categcols, contincols = split_categ_contin_cols(data_dropped, all_columns)
print ("Categorical columns: ", categcols)
print("\n")
uid = ['RefId']
target = ['IsBadBuy']
contincols = list(set(contincols) - set(uid) - set(target))
features = categcols + contincols
print ("Continuous variables after target and id removal: ", contincols)
print("\n")
# some manual overwriting of column names and NA values
data_dropped.loc[data_dropped.Transmission == 'Manual', 'Transmission'] = 'MANUAL'
data_dropped.loc[data_dropped.Color == 'NOT AVAIL', 'Color'] = 'NA'
data_dropped.loc[data_dropped.Color == 'OTHER', 'Color'] = 'NA'
data_dropped.loc[data_dropped.TopThreeAmericanName == 'OTHER', 'TopThreeAmericanName'] = 'NA'
null_categ_cols, null_contin_cols = findNullVals(data_dropped)
data_dropped_new = fillNAvalues(data_dropped, null_categ_cols, null_contin_cols)
# find outliers and scale the continuous variables
data_contin = find_outliers_scale(data_dropped_new, contincols)
# encode labels for the categorical variables
data_categ = label_encode(data_dropped_new, categcols)
data_train = pd.concat([data_categ, data_contin, data_dropped_new[target]], axis=1)
# get the train and test splits
x_train, x_test, y_train, y_test = train_test_split(data_train[features],data_train[target],test_size=0.2,random_state=7)
kfold = model_selection.KFold(n_splits=10)
metric = 'roc_auc'
models = ml_models(x_train, y_train)
model_auc = cross_validation(models, x_train, y_train, kfold, metric)
show_results(models, model_auc, x_test, y_test)
get_feature_importance(models, x_train)
As we can see, the models show good results despite their simplicity and the absence of hyperparameter tuning. The most important features can be read off the feature-importance plots above.