Logistic Regression¶
- To demonstrate logistic regression, we will create a binary label: Did it snow, or not?
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
RANDOM_STATE = 25
In [3]:
# Read in data
df = pd.read_csv('data/streamflow_prediction_dataset_averaged_cols.csv')
df = df.set_index('date')
# Create binary snow variable
df['Snow'] = np.where((df['WTEQ_BisonLake'] > 0) | (df['WTEQ_McClurePass'] > 0), 1, 0)
df = df.drop(columns=['WTEQ_BisonLake', 'WTEQ_McClurePass'])
# Normalize (z-score) the feature columns; leave the binary label untouched
feature_cols = df.columns.drop('Snow')
df[feature_cols] = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
display(df)
date | PREC_Avg | TAVG_Avg | soilmoisture_Avg_2ft | soilmoisture_Avg_4ft | soilmoisture_Avg_8ft | soilmoisture_Avg_20ft | Snow
---|---|---|---|---|---|---|---
2008-03-12 | 0.415169 | -0.780541 | -0.320587 | -1.005767 | -0.061212 | -0.087585 | 1 |
2008-03-15 | 0.460106 | -1.221430 | -0.320587 | -0.995909 | -0.034313 | -0.087551 | 1 |
2008-03-17 | 0.472362 | -1.111968 | -0.311604 | -0.995909 | -0.003572 | -0.087577 | 1 |
2008-03-18 | 0.472362 | -1.203187 | -0.302621 | -0.966333 | 0.000271 | -0.087568 | 1 |
2008-03-19 | 0.472362 | -0.737972 | -0.311604 | -0.966333 | 0.000271 | -0.087577 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
2021-07-23 | 0.268103 | 1.208021 | 0.375604 | -0.650859 | -0.664505 | -0.090355 | 0 |
2021-07-24 | 0.284443 | 1.107681 | 0.200433 | -0.690293 | -0.706774 | -0.090407 | 0 |
2021-07-25 | 0.304869 | 1.065113 | 0.474418 | -0.719869 | -0.733672 | -0.090518 | 0 |
2021-07-26 | 0.304869 | 1.305321 | 0.631623 | -0.739586 | -0.729830 | -0.090664 | 0 |
2021-07-27 | 0.304869 | 1.460393 | 0.339672 | -0.788879 | -0.764413 | -0.090793 | 0 |
2996 rows × 7 columns
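Before modeling, it is worth checking how balanced the new label is, since accuracy is only meaningful relative to the majority-class baseline. A minimal check:

# Class balance of the binary label; always predicting the majority class sets the baseline
print(df['Snow'].value_counts())
print(f"Majority-class baseline: {df['Snow'].value_counts(normalize=True).max():.3f}")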
In [4]:
X_train, X_test, y_train, y_test = train_test_split(
df.drop(columns='Snow'), df['Snow'], test_size=0.2, random_state=RANDOM_STATE)
print(f"{X_train.shape[0]} samples in training data")
print(f"{X_test.shape[0]} samples in testing data")
display(X_train)
display(X_test)
2396 samples in training data
600 samples in testing data
date | PREC_Avg | TAVG_Avg | soilmoisture_Avg_2ft | soilmoisture_Avg_4ft | soilmoisture_Avg_8ft | soilmoisture_Avg_20ft
---|---|---|---|---|---|---
2013-11-07 | -1.288350 | -0.458236 | 0.195942 | 1.370146 | 1.260654 | -0.087105 |
2011-09-16 | 2.167711 | 0.253268 | 0.761878 | 1.705337 | 0.745742 | -0.087937 |
2010-06-25 | 0.652110 | 1.277956 | -1.488392 | -0.621283 | -0.015100 | -0.086127 |
2014-05-13 | 0.525469 | -0.920409 | 1.741037 | 1.715195 | 2.332749 | 2.041110 |
2021-04-16 | -0.152671 | -0.956897 | 1.651206 | -2.011341 | 1.229913 | -0.086205 |
... | ... | ... | ... | ... | ... | ... |
2021-05-27 | 0.076099 | 0.581655 | 1.260441 | -2.011341 | 0.983985 | -0.085030 |
2017-11-22 | -1.455842 | -0.160255 | 0.707979 | -1.173363 | 0.234671 | -0.087285 |
2019-08-06 | 2.041070 | 1.420865 | 0.451961 | -0.118497 | -0.783627 | -0.090458 |
2009-07-12 | 1.248546 | 1.277956 | -0.846099 | 0.305421 | 0.250042 | -0.085750 |
2012-02-19 | -0.532592 | -1.233593 | -0.244231 | -0.374819 | 0.080966 | -0.085767 |
2396 rows × 6 columns
date | PREC_Avg | TAVG_Avg | soilmoisture_Avg_2ft | soilmoisture_Avg_4ft | soilmoisture_Avg_8ft | soilmoisture_Avg_20ft
---|---|---|---|---|---|---
2010-06-18 | 0.652110 | 1.016463 | -0.212790 | 0.187119 | 0.868706 | -0.084172 |
2009-12-04 | -1.280180 | -2.039356 | -0.688895 | -1.478978 | -0.683718 | -0.086410 |
2010-09-24 | 1.195438 | 0.499558 | -1.012287 | -0.700152 | -0.983444 | -0.089309 |
2014-03-20 | 0.035248 | -0.601145 | -0.127450 | 1.340570 | 0.595879 | -0.086299 |
2018-07-25 | 0.047503 | 1.326605 | -0.365503 | -0.217082 | -1.421504 | -0.092525 |
... | ... | ... | ... | ... | ... | ... |
2009-01-23 | -0.418207 | -0.573779 | -0.572114 | -0.552274 | -0.422419 | -0.089721 |
2011-11-30 | -1.092262 | -0.561617 | 0.537300 | 1.301136 | 0.653518 | -0.085047 |
2011-07-10 | 1.787789 | 0.864432 | 1.053829 | 1.705337 | 1.306766 | -0.087259 |
2018-01-19 | -1.165795 | -0.327489 | -0.899998 | -1.291666 | -1.206316 | -0.089095 |
2009-03-21 | 0.202740 | 0.192456 | 0.833743 | 1.103965 | 0.807224 | -0.086771 |
600 rows × 6 columns
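The split above is not stratified, so a quick sanity check (a small sketch) that the snow/no-snow mix is similar in both halves:

# Positive-class rate in each half of the unstratified split
print(f"Train snow rate: {y_train.mean():.3f}")
print(f"Test snow rate:  {y_test.mean():.3f}")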
In [5]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(f"Model Accuracy: {lr.score(X_test, y_test)}\n")
print("Coefficients:")
for col, coef in zip(X_train.columns, lr.coef_[0]):
print(f"[{col}] * {round(coef,4)}")
# make dataframe with predictions
df_pred = df.copy()
df_pred['Snow_Prediction'] = lr.predict(df.drop(columns='Snow'))
# confusion matrix
cm_lr = confusion_matrix(y_test, lr.predict(X_test))
Model Accuracy: 0.9566666666666667

Coefficients:
[PREC_Avg] * -0.0706
[TAVG_Avg] * -3.4961
[soilmoisture_Avg_2ft] * -0.1441
[soilmoisture_Avg_4ft] * -1.7677
[soilmoisture_Avg_8ft] * 3.3042
[soilmoisture_Avg_20ft] * 0.2693
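To make these numbers concrete: logistic regression models the log-odds of Snow as a linear function of the standardized features, so a predicted probability can be rebuilt by hand from lr.intercept_ and lr.coef_. A minimal sketch (the names z, p_manual, and p_sklearn are ours) checking the arithmetic against predict_proba for one test sample:

# Log-odds for the first test sample, then squash through the sigmoid
z = lr.intercept_[0] + X_test.iloc[0] @ lr.coef_[0]
p_manual = 1 / (1 + np.exp(-z))
p_sklearn = lr.predict_proba(X_test.iloc[[0]])[0, 1]
print(f"manual: {p_manual:.6f}  sklearn: {p_sklearn:.6f}")

The signs line up with intuition: the strongly negative TAVG_Avg coefficient pushes the predicted snow probability down as average temperature rises.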
Multinomial Naive Bayes¶
- We reuse the same binary snow label, this time with a Multinomial Naive Bayes approach.
- Multinomial Naive Bayes is typically used for count-based or discrete feature sets, while Gaussian Naive Bayes is generally more appropriate for a continuous (and roughly normally distributed) data set like ours. For exploration purposes, however, we want to see how logistic regression compares to Multinomial Naive Bayes, so we discretize (bin) the data before fitting; see the sketch after this list.
- See the next section, Naive Bayes, for further testing and discussion.
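As a quick illustration of what the binning step does (a toy sketch on made-up values, not the project data): with strategy='uniform', KBinsDiscretizer splits each feature's observed range into equal-width bins, and encode='ordinal' replaces each value with its integer bin index.

# Hypothetical one-feature example: 4 equal-width bins over [0, 1]
toy = np.array([[0.00], [0.24], [0.50], [0.99], [1.00]])
kbd_demo = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
print(kbd_demo.fit_transform(toy).ravel())  # [0. 0. 2. 3. 3.]
print(kbd_demo.bin_edges_[0])               # [0.   0.25 0.5  0.75 1.  ]

One side effect to watch for, visible in the binned arrays below: with equal-width bins, a feature dominated by a few extreme values (soilmoisture_Avg_20ft lands almost entirely in bin 0) carries little signal into the classifier; strategy='quantile' would be one way to spread such a feature across bins.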
In [6]:
# Read in data
df = pd.read_csv('data/streamflow_prediction_dataset_averaged_cols.csv')
df = df.set_index('date')
# Create binary snow variable
df['Snow'] = np.where((df['WTEQ_BisonLake'] > 0) | (df['WTEQ_McClurePass'] > 0), 1, 0)
series_pred = df['Snow']
df = df.drop(columns=['WTEQ_BisonLake', 'WTEQ_McClurePass'])
# Discretize data for Multinomial Naive Bayes
kbd = KBinsDiscretizer(n_bins=20, encode='ordinal', strategy='uniform',
subsample=200000, random_state=RANDOM_STATE)
# Fit the binning on the full feature set; the train/test split happens in the next cell
X_discrete = kbd.fit_transform(df.drop(columns=['Snow']))
In [7]:
X_train_discrete, X_test_discrete, y_train, y_test = train_test_split(
    X_discrete, series_pred, test_size=0.2, random_state=RANDOM_STATE)
print(f"{X_train_discrete.shape[0]} samples in training data")
print(f"{X_test_discrete.shape[0]} samples in testing data")
display(X_train_discrete)
display(X_test_discrete)
2396 samples in training data
600 samples in testing data
array([[ 2., 10.,  9., 16., 12.,  0.],
       [19., 13., 12., 18., 10.,  0.],
       [12., 17.,  3.,  6.,  7.,  0.],
       ...,
       [19., 18., 11.,  9.,  3.,  0.],
       [15., 17.,  5., 11.,  8.,  0.],
       [ 6.,  7.,  8.,  8.,  7.,  0.]])
array([[12., 16.,  8., 10., 11.,  0.],
       [ 2.,  3.,  6.,  2.,  4.,  0.],
       [14., 14.,  4.,  6.,  3.,  0.],
       ...,
       [17., 16., 13., 18., 12.,  0.],
       [ 2., 10.,  5.,  3.,  2.,  0.],
       [ 9., 13., 12., 15., 10.,  0.]])
In [8]:
mnb = MultinomialNB()
mnb.fit(X_train_discrete, y_train)
print(f'Multinomial Model Accuracy: {mnb.score(X_test_discrete, y_test)}\n')
# confusion matrix
cm_mnb = confusion_matrix(y_test, mnb.predict(X_test_discrete))
print(f'Confusion Matrix:\n{cm_mnb}')
Multinomial Model Accuracy: 0.86

Confusion Matrix:
[[172  24]
 [ 60 344]]
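The confusion matrix carries more information than the single accuracy figure. A small sketch deriving precision and recall for the Snow class directly from cm_mnb (sklearn puts true labels on rows and predictions on columns, so ravel() yields tn, fp, fn, tp):

# Decompose the MNB confusion matrix into per-class error rates
tn, fp, fn, tp = cm_mnb.ravel()
precision = tp / (tp + fp)  # of predicted-snow days, the fraction that truly had snow
recall = tp / (tp + fn)     # of true snow days, the fraction the model caught
print(f"Snow precision: {precision:.3f}  Snow recall: {recall:.3f}")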
Model Compare¶
In [9]:
print(f"Logistic Regression Accuracy: {lr.score(X_test, y_test)}\n")
print(f"Multinomial Naive Bayes Accuracy: {mnb.score(X_test_discrete, y_test)}\n")
Logistic Regression Accuracy: 0.9566666666666667
Multinomial Naive Bayes Accuracy: 0.86
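Accuracy alone can flatter a model when the label is imbalanced, so a fuller side-by-side view is useful. A sketch using sklearn's classification_report, assuming the fitted lr and mnb models and the splits from the cells above:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for both models on the test split
print("Logistic Regression:")
print(classification_report(y_test, lr.predict(X_test)))
print("Multinomial Naive Bayes:")
print(classification_report(y_test, mnb.predict(X_test_discrete)))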
Visualize¶
In [10]:
# plot lr coefficients
plt.figure(figsize=(10, 6))
sns.barplot(x=X_train.columns, y=lr.coef_[0])
plt.xticks(rotation=90)
plt.title("Logistic Regression Coefficients")
plt.show()
# plot lr confusion matrix
plt.figure(figsize=(10, 6))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Logistic Regression Confusion Matrix')
plt.show()
# plot mnb confusion matrix
plt.figure(figsize=(10, 6))
sns.heatmap(cm_mnb, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Multinomial Naive Bayes Confusion Matrix')
plt.show()
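Since df_pred (built in the logistic regression cell) holds a prediction for every date, a time-series view can show where the errors cluster; note it mixes training and testing dates, so this is a qualitative look only. A sketch, with df_ts and mismatch as our own names:

# Plot the actual snow label over time and mark the dates the model gets wrong
df_ts = df_pred.copy()
df_ts.index = pd.to_datetime(df_ts.index)
mismatch = df_ts['Snow'] != df_ts['Snow_Prediction']
plt.figure(figsize=(12, 4))
plt.plot(df_ts.index, df_ts['Snow'], lw=0.8, label='Actual Snow')
plt.scatter(df_ts.index[mismatch], df_ts['Snow'][mismatch],
            color='red', s=10, label='Misclassified')
plt.yticks([0, 1])
plt.title('Snow Label Over Time with Logistic Regression Errors')
plt.legend()
plt.show()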