Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.980
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 741.1
Date: Sat, 07 Nov 2020 Prob (F-statistic): 6.23e-39
Time: 11:34:25 Log-Likelihood: -3.0227
No. Observations: 50 AIC: 14.05
Df Residuals: 46 BIC: 21.69
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.2072 0.091 57.006 0.000 5.023 5.391
x1 0.4696 0.014 33.337 0.000 0.441 0.498
x2 0.4563 0.055 8.239 0.000 0.345 0.568
x3 -0.0176 0.001 -14.223 0.000 -0.020 -0.015
==============================================================================
Omnibus: 10.776 Durbin-Watson: 2.145
Prob(Omnibus): 0.005 Jarque-Bera (JB): 13.024
Skew: -0.755 Prob(JB): 0.00149
Kurtosis: 4.992 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.76736342 5.20902136 5.61506369 5.9606248 6.22981294 6.4183212
6.53413514 6.59622074 6.6314084 6.66998472 6.74071656 6.86612468
7.05878367 7.31925595 7.6359994 7.98726368 8.34466375 8.67784345
8.95946295 9.16969058 9.29946112 9.35196523 9.34212516 9.29414314
9.23752499 9.20223238 9.21376024 9.28894802 9.43321299 9.6396602
9.89021532 10.1585942 10.41462126 10.62918728 10.77903194 10.85056542
10.8421012 10.76413314 10.63761134 10.49050013 10.35318359 10.25347355
10.21204017 10.23901591 10.33233075 10.47805236 10.65267507 10.82698183
10.97084449 11.05817373]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.06298811 10.94765308 10.73006132 10.45098799 10.16410754 9.92285236
9.76733063 9.71450645 9.75404645 9.85084976]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 5.207157
x1 0.469631
np.sin(x1) 0.456257
I((x1 - 5) ** 2) -0.017592
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.062988
1 10.947653
2 10.730061
3 10.450988
4 10.164108
5 9.922852
6 9.767331
7 9.714506
8 9.754046
9 9.850850
dtype: float64