Introduction

We have a data set representing shared-bike rentals according to several parameters. The objective is to predict the number of bikes rented per hour in the city (variable “count”). Throughout this case study, we will follow these steps: dataset presentation, data cleaning and preparation, statistical analysis, and building and evaluating a prediction model.

Before presenting the data set, here is the list of all the libraries used.

# management and data analysis 
import numpy as np
import pandas as pd
from datetime import datetime

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine Learning
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

# support for Machine Learning
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# other (math, stat, etc.)
from random import choice, gauss, sample
from scipy import stats
from copy import deepcopy
np.random.seed(91948172)


Dataset presentation

The following lines of code show that the set contains 10,886 instances and 12 variables; the overview below gives a first look at them.

# read the data set 
df = pd.read_csv("data/data.csv")
# dimensions of dataframe 
df.shape
(10886, 12)
# overview
df.head()
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1

Prior assumptions and remarks

Target variable(s)

The objective is to predict the number of bikes rented per hour in the city (variable “count”). In addition, we have two variables, “casual” and “registered”, whose sum is “count”.

# check the completeness of the data on the variables to predict 
print((len(df["casual"]) == 10886) 
      and (len(df["registered"]) == 10886) 
      and (len(df["count"]) == 10886))

# check that "count" is the sum of "casual" and "registered" 
print((df["casual"] + df["registered"] == df["count"]).unique())
True
[True]

Predictors

This step comes before the statistical analysis. Here are some factors (in addition to the variables already in the dataset) that could potentially influence the number of rentals:

Here are some other preliminary observations:

Data cleaning

Missing data

As shown below, there are no missing data: every column has 10,886 non-null values.

# describe the dataframe
df.describe()
season holiday workingday weather temp atemp humidity windspeed casual registered count
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000
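
In addition to describe(), a more direct check can be made; here is a minimal sketch (its output is not reproduced here):

# count the missing values in each column
df.isnull().sum()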

Data type correction

Here we convert the variables that are categorical in nature but currently stored as numerical types.

# information on the data columns, including their types 
df.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
# data type transformation to category
df['weather'] = df['weather'].astype('category')
df['holiday'] = df['holiday'].astype('category')
df['workingday'] = df['workingday'].astype('category')
df['season'] = df['season'].astype('category')

Variable creation

Mining and preparation of variables

The “datetime” column contains a lot of information. “workingday”, “holiday” and “season” are already available as separate columns; here, we extract the remaining pieces (hour, day, month, year and day of the week).

# extract the hour 
def getHour(x):
    dt = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    return dt.hour

df['hour'] = df['datetime'].apply(getHour)
# extract the day 
def getDay(x):
    dt = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    return dt.day

df['day'] = df['datetime'].apply(getDay)
# extract the month 
def getMonth(x):
    dt = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    return dt.month

df['month'] = df['datetime'].apply(getMonth)
# extract the year 
def getYear(x):
    dt = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    return dt.year

df['year'] = df['datetime'].apply(getYear)
# extract the day of the week 
# ... 0 for Monday, 6 for Sunday 
def getWeekday(x):
    dt = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    return dt.weekday()

df['weekday'] = df['datetime'].apply(getWeekday)
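
Note that the same features can be extracted more concisely with the pandas datetime accessors; here is a minimal sketch, equivalent to the functions above (assuming the same “datetime” format):

# parse the column once, then use the .dt accessors
dt_col = pd.to_datetime(df['datetime'], format="%Y-%m-%d %H:%M:%S")
df['hour'] = dt_col.dt.hour
df['day'] = dt_col.dt.day
df['month'] = dt_col.dt.month
df['year'] = dt_col.dt.year
df['weekday'] = dt_col.dt.weekday  # 0 for Monday, 6 for Sunday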

Check the data distribution

Before the statistical analysis, let us have a look at the data distribution and homogeneity. Here are some of our observations:

# data is distributed over the two years 2011 and 2012 
df.groupby('year')['count'].count().reset_index()
year count
0 2011 5422
1 2012 5464
# data is distributed over 12 months
count_groupby_var = df.groupby('month')['count'].count().reset_index()

fig, axs = plt.subplots(1, 1, figsize=(15, 4))
axs.bar(count_groupby_var['month'], count_groupby_var['count'])
axs.set_xticks(count_groupby_var['month'].unique())
plt.show()

png

# data is distributed over 19 days
count_groupby_var = df.groupby('day')['count'].count().reset_index()

fig, axs = plt.subplots(1, 1, figsize=(15, 4))
axs.bar(count_groupby_var['day'], count_groupby_var['count'])
axs.set_xticks(count_groupby_var['day'].unique())
plt.show()

png

# data is distributed over 24 hours 
count_groupby_var = df.groupby('hour')['count'].count().reset_index()

fig, axs = plt.subplots(1, 1, figsize=(15, 4))
axs.bar(count_groupby_var['hour'], count_groupby_var['count'])
axs.set_xticks(count_groupby_var['hour'].unique())
plt.show()

png

Statistical analysis

Throughout this statistical analysis, we will create functions that help us visualize the variations of the variables to be predicted and identify the influential factors.

Analysis of discrete numerical variables

# function: visualize the change of the dependent variables VS discrete variables
def visualize_by_discr_var(var):

    # means
    casual_by_var = df.groupby(var)['casual'].agg(lambda x: x.mean()).reset_index()
    registered_by_var = df.groupby(var)['registered'].agg(lambda x: x.mean()).reset_index()
    count_by_var = df.groupby(var)['count'].agg(lambda x: x.mean()).reset_index()

    # visualisation
    fig, axes = plt.subplots(2, 3, sharey="row", figsize=(15, 8))
    plt.subplots_adjust(wspace=0.3, hspace=0.3);
    fig.suptitle('')
    
    # row 1 
    axt = axes[0, 0]
    axt.plot(casual_by_var[var], casual_by_var['casual'])
    axt.set_title('fig.1 : ' + 'mean_casual_by_' + var)
    axt.grid(True)

    axt = axes[0, 1]
    axt.plot(registered_by_var[var], registered_by_var['registered'])
    axt.set_title('fig.2 : ' + 'mean_registered_by_' + var)
    axt.grid(True)

    axt = axes[0, 2]
    axt.plot(count_by_var[var], count_by_var['count'])
    axt.set_title('fig.3 : ' + 'mean_count_by_' + var)
    axt.grid(True)
    
    # row 2 
    axt = axes[1, 0]
    df.boxplot(column='casual', by=var, ax=axes[1, 0])
    axt.set_title('fig.4 : ' + 'casual_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')

    axt = axes[1, 1]
    df.boxplot(column='registered', by=var, ax=axes[1, 1])
    axt.set_title('fig.5 : ' + 'registered_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')

    axt = axes[1, 2]
    df.boxplot(column='count', by=var, ax=axes[1, 2])
    axt.set_title('fig.6 : ' + 'count_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')
    
    plt.show()

visualize_by_discr_var('hour')

png

We see two “phenomena”:

We also notice that there are “outliers”. However, they are probably “natural outliers”. To mitigate their effect, we take the logarithms of the dependent variables and run the analysis once again.

# creation of logarithms of the dependent variables
df['casual_log'] = np.log(df['casual'] + 1) 
df['registered_log'] = np.log(df['registered'] + 1) 
df['count_log'] = np.log(df['count'] + 1) 

# function: visualize the change in the ... 
# ... log of the dependent variables with regards to discrete variable 
def visualize_log_by_discr_var(var):

    # mean of the dependent variables
    casual_by_var = df.groupby(var)['casual_log'].agg(lambda x: x.mean()).reset_index()
    registered_by_var = df.groupby(var)['registered_log'].agg(lambda x: x.mean()).reset_index()
    count_by_var = df.groupby(var)['count_log'].agg(lambda x: x.mean()).reset_index()

    # visualisation
    fig, axes = plt.subplots(2, 3, sharey="row", figsize=(15, 8))
    plt.subplots_adjust(wspace=0.3, hspace=0.3);
    fig.suptitle('')
    
    # row 1 
    axt = axes[0, 0]
    axt.plot(casual_by_var[var], casual_by_var['casual_log'])
    axt.set_title('fig.1 : ' + 'mean_casual_by_' + var)
    axt.grid(True)

    axt = axes[0, 1]
    axt.plot(registered_by_var[var], registered_by_var['registered_log'])
    axt.set_title('fig.2 : ' + 'mean_registered_by_' + var)
    axt.grid(True)

    axt = axes[0, 2]
    axt.plot(count_by_var[var], count_by_var['count_log'])
    axt.set_title('fig.3 : ' + 'mean_count_by_' + var)
    axt.grid(True)
    
    # row 2 
    axt = axes[1, 0]
    df.boxplot(column='casual_log', by=var, ax=axes[1, 0])
    axt.set_title('fig.4 : ' + 'casual_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')

    axt = axes[1, 1]
    df.boxplot(column='registered_log', by=var, ax=axes[1, 1])
    axt.set_title('fig.5 : ' + 'registered_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')

    axt = axes[1, 2]
    df.boxplot(column='count_log', by=var, ax=axes[1, 2])
    axt.set_title('fig.6 : ' + 'count_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')
    
    plt.show()

visualize_log_by_discr_var('hour')

png

There are indeed fewer outliers for “casual” users. However, this is less conclusive for “registered” users. In what follows, we do the same for the columns “month” and “day”. We will see that:

visualize_by_discr_var('month')

png

visualize_log_by_discr_var('month')

png

visualize_log_by_discr_var('day')

png

We observed that the dependent variables depend on the hour, and perhaps particularly on working hours. Let us assume that the behavior differs on weekdays versus weekends, at least for “registered” users. We will see below that this is in fact true for both “registered” and “casual” users. As the contrast is very clear, we decide to create a variable “is_weekend” to be used instead of “weekday”.

# aggregations of dependent variables based on hours and days of the week 
hour_agg_casual = pd.DataFrame(df.groupby(["hour","weekday"],sort=True)["casual"].mean()).reset_index()
hour_agg_registered = pd.DataFrame(df.groupby(["hour","weekday"],sort=True)["registered"].mean()).reset_index()
hour_agg_count = pd.DataFrame(df.groupby(["hour","weekday"],sort=True)["count"].mean()).reset_index()

# visualisations 
fig, axes = plt.subplots(1, 3, sharey="row", figsize=(15, 5))

axt = axes[0]
sns.pointplot(x=hour_agg_casual["hour"], y=hour_agg_casual["casual"],
              hue=hour_agg_casual["weekday"], data=hour_agg_casual, join=True, ax=axt)
axt.set(xlabel='hour', ylabel='mean casual', title='fig 1 : mean_casual_by_hour_and_weekday')

axt = axes[1]
sns.pointplot(x=hour_agg_registered["hour"], y=hour_agg_registered["registered"],
              hue=hour_agg_registered["weekday"], data=hour_agg_registered, join=True, ax=axt)
axt.set(xlabel='hour', ylabel='mean registered', title='fig 2 : mean_registered_by_hour_and_weekday')

axt = axes[2]
sns.pointplot(x=hour_agg_count["hour"], y=hour_agg_count["count"],
              hue=hour_agg_count["weekday"], data=hour_agg_count, join=True, ax=axt)
axt.set(xlabel='hour', ylabel='mean count', title='fig 3 : mean_count_by_hour_and_weekday')

plt.show()

png

# creation of "is_weekend"
def is_weekend(weekday):   
    return 1 if ((weekday == 5) or (weekday == 6)) else 0

df['is_weekend'] = df['weekday'].apply(is_weekend)

# data type: category
df['is_weekend'] = df['is_weekend'].astype('category')

We turn our attention to the variable “year”. We notice a phenomenon: there is an increase between 2011 and 2012. We decide to keep this variable. However, we cannot assess its relevance for future years (since we have only two years of data, it is difficult to predict the evolution at year n + 2).

df['year_month'] = df['year'] * 100 + df['month']
# mean
casual_by_ym = df.groupby('year_month')['casual'].mean().reset_index()
registered_by_ym = df.groupby('year_month')['registered'].mean().reset_index()
count_by_ym = df.groupby('year_month')['count'].mean().reset_index()

# visualisation
fig, axes = plt.subplots(1, 3, sharey="row", figsize=(15, 4))

axt = axes[0]
axt.plot(casual_by_ym['casual'])
axt.set_title('fig.1 : ' + 'mean_casual_by_year-month')
axt.grid(True)

axt = axes[1]
axt.plot(registered_by_ym['registered'])
axt.set_title('fig.2 : ' + 'mean_registered_by_year-month')
axt.grid(True)

axt = axes[2]
axt.plot(count_by_ym['count'])
axt.set_title('fig.3 : ' + 'mean_count_by_year-month')
axt.grid(True)

plt.show()

png

Analysis of categorical variables

We directly observe the distribution of the log of the dependent variables. It does not appear that these variables are particularly influential, except for “season” on “casual” users. However, we will not keep “season”, because we believe that “month” probably has greater explanatory power.

# function to visualize the change of the ... 
# ... log of dependent variables  
def visualize_log_by_cat_var(var):
    

    # visualisation 
    fig, axes = plt.subplots(1, 3, sharey="row", figsize=(15, 4))
    
    axt = axes[0]
    df.boxplot(column='casual_log', by=var, ax=axes[0])
    axt.set_title('fig 1 : ' + 'casual_log_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')

    axt = axes[1]
    df.boxplot(column='registered_log', by=var, ax=axes[1])
    axt.set_title('fig 2 : ' + 'registered_log_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')

    axt = axes[2]
    df.boxplot(column='count_log', by=var, ax=axes[2])
    axt.set_title('fig 3 : ' + 'count_log_by_' + var)
    axt.grid(True)
    axt.get_figure().suptitle('')
    
    plt.show()

visualize_log_by_cat_var('season')

png

visualize_log_by_cat_var('holiday')

png

visualize_log_by_cat_var('workingday')

png

visualize_log_by_cat_var('weather')

png

Analysis of continuous variables

We first look at the correlation matrix. We observe that:

# calculate correlation 
# ... note that we will take the absolute value 
corr_mat = df[["temp","atemp","humidity","windspeed","casual","registered", "count"]].corr()
corr_mat = abs(corr_mat)

fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
sns.heatmap(corr_mat, square=True, annot=True)
plt.show()

png

# function to visualize the variation of the ... 
# ... dependent variables with respect to a continuous variable 
def visualize_by_cont_var(var):
    
    # scatter plots of the dependent variables as a function of the variable "var"
    f, axarr = plt.subplots(1, 3, sharey=True, figsize=(15, 4))
    
    axarr[0].scatter(df[var], df['casual'], s=1)
    axarr[0].set_title('casual_by_' + var)
    axarr[0].grid(True)

    axarr[1].scatter(df[var], df['registered'], s=1)
    axarr[1].set_title('registered_by_' + var)
    axarr[1].grid(True)

    axarr[2].scatter(df[var], df['count'], s=1)
    axarr[2].set_title('count_by_' + var)
    axarr[2].grid(True)

    plt.show()

visualize_by_cont_var('temp')

png

The scatter visualization is not very clear. We therefore decide to discretize the continuous variables.

# discretization of the variable "temp"
df['temp_disc'] = df['temp'] // 5

visualize_log_by_cat_var('temp_disc')

png

# discretization of the variable "humidity"
df['humidity_disc'] = df['humidity'] // 5

visualize_log_by_cat_var('humidity_disc')

png

# discretization of the variable "windspeed"
df['windspeed_disc'] = df['windspeed'] // 5

visualize_log_by_cat_var('windspeed_disc')

png

Complementary question

Assuming we have access to the sex and age of the subscribed users, the statistical procedure that would allow us to say whether the two populations (women and men) have the same mean age is Student's t-test.

This test has conditions such as:

Note that if these conditions are not met, other tests may be applied.

There are two possible outcomes:

Let’s take an example. We build a dummy population of 1,000 individuals: we randomly and independently generate sex and age, then merge the two lists as columns. After applying the test, we get a p-value well above the usual threshold. This result is expected, because the ages were generated independently of sex.

# 1000 "sex" randomly 
sex_list = []
for i in range(1000):
    sex_list.append( choice(["Female", "Male"]) )
    
# 1000 "age" randomly 
age_list = [round(gauss(50,15)) for i in range(1000)]
    
# build a dummy data set on subscribers by combining the two lists 
df_registered = pd.DataFrame({
        "sex": sex_list,
        "age": age_list
    })

# separate the 2 groups 
df_registered_male = df_registered.loc[df_registered["sex"] == "Male",]
df_registered_female = df_registered.loc[df_registered["sex"] == "Female",]
# Student test
stats.ttest_ind(df_registered_male['age'],df_registered_female['age'])
Ttest_indResult(statistic=0.51163904173187325, pvalue=0.6090168396254797)
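
If the conditions of the t-test were in doubt (normality in particular), a non-parametric alternative could be applied to the same two groups; here is a minimal sketch using scipy's Mann-Whitney U test (not part of the original analysis):

# non-parametric comparison of the two age samples
stats.mannwhitneyu(df_registered_male['age'], df_registered_female['age'], alternative='two-sided')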

Build a prediction model

Given the observations we have made previously, we make the following choices:

Process

To design the two models, here are the steps we shall follow:

Choice of models

We have pre-selected several models and will keep the one that shows the best performance with the best settings:

Other algorithms could be used but are not considered here, for example:

Before training models, let us present the evaluation criterion.

In order to evaluate the performance, we will compute the “mean squared error” on the log of the dependent variables (i.e. the “mean squared logarithmic error” on the variables themselves). Other criteria exist, such as the “mean absolute error”, but we prefer the squared logarithmic error because it penalizes larger errors more heavily.
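
For reference, this criterion can be written as a small helper computed from the raw (non-log) values; a minimal sketch (the name msle_score is ours, the cells below simply call mean_squared_error on the log-transformed targets):

# mean squared logarithmic error computed from raw values
def msle_score(y_true, y_pred):
    return mean_squared_error(np.log1p(y_true), np.log1p(y_pred))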

Model training

Here we follow the previously described steps.

# reserve 20% of the data for the final test 
df_train, df_test = train_test_split(df, test_size=0.2, random_state=91948172)

# predictor for "casual"
predictors_4_casual = ['hour', 'is_weekend', 'month', 'year', 'temp', 'humidity']
# preparation of the matrix X and the vector y for "casual"
df_train_X_4_casual = df_train[predictors_4_casual]
df_train_y_casual_log = df_train['casual_log']

# predictor for "registered"
predictors_4_registered = ['hour', 'is_weekend', 'month', 'year', 'temp']
# preparation of the matrix X and the vector y for "registered"
df_train_X_4_registered = df_train[predictors_4_registered]
df_train_y_registered_log = df_train['registered_log']

# ...'season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed',
# ...'hour', 'month', 'year', 'day', is_weekend', 

To train the models, we use “scikit-learn” 0.19.1 (the latest stable version at the time of the case study).

# global parameters

# evaluation method 
scoring_param = 'neg_mean_squared_error'

# number of groups for cross-validation
cv_param = 10
# "linear regression" for "casual" 
# note: no cross-validation here, since plain linear regression with so few features is unlikely to overfit
model = LinearRegression()

model.fit(df_train_X_4_casual, df_train_y_casual_log)
pred = model.predict(df_train_X_4_casual)
score_value = mean_squared_error(df_train_y_casual_log, pred)
print("MSLE : %0.3f" % (score_value)) 
MSLE : 0.888


# "linear regression" for "registered" 
# note: no cross-validation here, since plain linear regression with so few features is unlikely to overfit
model = LinearRegression()

model.fit(df_train_X_4_registered, df_train_y_registered_log)
pred = model.predict(df_train_X_4_registered)
score_value = mean_squared_error(df_train_y_registered_log, pred)
print("MSLE : %0.3f" % (score_value)) 
MSLE : 1.099
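
Note that Ridge, PolynomialFeatures and Pipeline were imported above but do not appear in the cells below; as an illustration, here is a minimal sketch of how such a polynomial ridge model could be scored with the same protocol (not part of the reported results):

# (sketch) polynomial ridge regression for "casual", cross-validated like the other models
model = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("ridge", Ridge(alpha=1.0)),
])
scores = cross_val_score(model, df_train_X_4_casual, df_train_y_casual_log, cv=cv_param, scoring=scoring_param)
print("MSLE : %0.3f" % (scores.mean()))
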
# "k nearest 'neighbors" for "casual" 
# note: using GridSearchCV (cross validation and determination of optimal parameters) 
model = KNeighborsRegressor()

hyperparameters = {
    "n_neighbors": range(1,49,10)
}

grid = GridSearchCV(model, param_grid=hyperparameters, cv=cv_param, scoring=scoring_param)
grid.fit(df_train_X_4_casual, df_train_y_casual_log)

print(grid.best_params_)
print("MSLE : %0.3f" % (grid.best_score_)) 
{'n_neighbors': 11}
MSLE : -0.505
# "k nearest 'neighbors" for "registered" 
# note: using GridSearchCV (cross validation and determination of optimal parameters) 
model = KNeighborsRegressor()

hyperparameters = {
    "n_neighbors": range(1,49,10)
}

grid = GridSearchCV(model, param_grid=hyperparameters, cv=cv_param, scoring=scoring_param)
grid.fit(df_train_X_4_registered, df_train_y_registered_log)

print(grid.best_params_)
print("MSLE : %0.3f" % (grid.best_score_)) 
{'n_neighbors': 11}
MSLE : -0.244
# "random forest" for "casual" 
# note: we use here cross validation and empirical determination of the parameters 
# ... GridSearchCV is too greedy here 
model = RandomForestRegressor(n_estimators=120, min_samples_split=10, min_samples_leaf=5)

scores = cross_val_score(model, df_train_X_4_casual, df_train_y_casual_log, cv=cv_param, scoring=scoring_param)
print("MSLE : %0.3f" % (scores.mean())) 
MSLE : -0.277
# random forest" for "registered"
# note: we use here cross validation and empirical determination of the parameters 
# ... GridSearchCV is too greedy here 
model = RandomForestRegressor(n_estimators=90, min_samples_split=10, min_samples_leaf=5)

scores = cross_val_score(model, df_train_X_4_registered, df_train_y_registered_log, cv=cv_param, scoring=scoring_param)
print("MSLE : %0.3f" % (scores.mean())) 
MSLE : -0.143

Choosing the type of model and parameters

Here are our picks:

Final evaluation

We use the model with the best performance to evaluate the test data (which the model has not “seen”). We will see that the performance is not very different from what we already obtained, which may indicate that we do not need additional data (without having to compare learning curves). However, if predictions have to be made for periods outside the study interval, the same performance is not guaranteed.

# casual: test data 
df_test_X_4_casual = df_test[predictors_4_casual]
df_test_y_casual_log = df_test['casual_log']

# casual: reconstruction of the best model on the training data 
model_casual = RandomForestRegressor(n_estimators=120, min_samples_split=10, min_samples_leaf=5)
model_casual.fit(df_train_X_4_casual, df_train_y_casual_log)

# casual: final evaluation 
pred_casual = model_casual.predict(df_test_X_4_casual)
msle = mean_squared_error(df_test_y_casual_log, pred_casual)
print("MSLE : %0.3f" % (msle)) 
MSLE : 0.286
# registered: test data 
df_test_X_4_registered = df_test[predictors_4_registered]
df_test_y_registered_log = df_test['registered_log']

# registered: reconstruction of the best model on the training data 
model_registered = RandomForestRegressor(n_estimators=90, min_samples_split=10, min_samples_leaf=5)
model_registered.fit(df_train_X_4_registered, df_train_y_registered_log)

# registered: final evaluation 
pred_registered = model_registered.predict(df_test_X_4_registered)
msle = mean_squared_error(df_test_y_registered_log, pred_registered)
print("MSLE : %0.3f" % (msle)) 
MSLE : 0.151
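
For completeness, the two sets of test predictions could also be combined to estimate the error on “count” itself; here is a minimal sketch (not part of the original evaluation):

# back-transform both predictions, sum them, and score the result on the log scale
pred_count = (np.exp(pred_casual) - 1) + (np.exp(pred_registered) - 1)
print("MSLE : %0.3f" % (mean_squared_error(np.log1p(df_test['count']), np.log1p(pred_count))))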

Final model

Let us train the type of model that had the best performance (with its optimal settings) one last time. This time, we do it on all the data.

# casual: all data 
df_X_4_casual = df[predictors_4_casual]
df_y_casual_log = df['casual_log']

# casual: build once again and finally the model on all data
model_casual = RandomForestRegressor(n_estimators=120, min_samples_split=10, min_samples_leaf=5)
model_casual.fit(df_X_4_casual, df_y_casual_log)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=5, min_samples_split=10,
           min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
# registered: all data 
df_X_4_registered = df[predictors_4_registered]
df_y_registered_log = df['registered_log']

# registered: build once again and finally the model on all data
model_registered = RandomForestRegressor(n_estimators=90, min_samples_split=10, min_samples_leaf=5)
model_registered.fit(df_X_4_registered, df_y_registered_log)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=5, min_samples_split=10,
           min_weight_fraction_leaf=0.0, n_estimators=90, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Let us show how to predict “count” for new data. Note that the transformations applied to our current dataset (feature extraction, type conversions, etc.) will also have to be applied to the new set.
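
As an illustration, here is a minimal sketch of such a preparation step on raw data (the helper name prepare_features is ours; it is not used below, since the dummy dataset already contains the derived columns, and it assumes “temp” and “humidity” are present in the raw file):

# (sketch) derive the features needed by the models from a raw dataframe
def prepare_features(df_raw):
    dt_col = pd.to_datetime(df_raw['datetime'], format="%Y-%m-%d %H:%M:%S")
    df_raw['hour'] = dt_col.dt.hour
    df_raw['month'] = dt_col.dt.month
    df_raw['year'] = dt_col.dt.year
    df_raw['is_weekend'] = (dt_col.dt.weekday >= 5).astype(int).astype('category')
    return df_raw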

Here is an example of how we would proceed with a dummy dataset “df_new” (which is a copy of some of the data we already have).

# let us take as an example a sample of 100 rows among the available dataset 
df_new = deepcopy(df.iloc[sample(range(len(df)), 100)])
# casual: extraction of the matrix of predictors 
df_new_X_4_casual = df_new[predictors_4_casual]

# model application
pred_casual = model_casual.predict(df_new_X_4_casual)

# data transformation (inverse of the previously applied log transformation)
df_new['casual'] = np.exp(pred_casual) - 1
# registered: extraction of the matrix of predictors 
df_new_X_4_registered = df_new[predictors_4_registered]

# model application
pred_registered = model_registered.predict(df_new_X_4_registered)

# data transformation (inverse of the previously applied log transformation)
df_new['registered'] = np.exp(pred_registered) - 1
# finally addition of two variables in order to predict "count" 
df_new['count'] = df_new['casual'] + df_new['registered']

Outlook

Here are some areas for improvement: