Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.982
Model: OLS Adj. R-squared: 0.981
Method: Least Squares F-statistic: 847.1
Date: Tue, 08 Sep 2020 Prob (F-statistic): 3.05e-40
Time: 17:35:08 Log-Likelihood: -1.5314
No. Observations: 50 AIC: 11.06
Df Residuals: 46 BIC: 18.71
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.9557 0.089 55.897 0.000 4.777 5.134
x1 0.5119 0.014 37.438 0.000 0.484 0.539
x2 0.5263 0.054 9.790 0.000 0.418 0.634
x3 -0.0212 0.001 -17.678 0.000 -0.024 -0.019
==============================================================================
Omnibus: 0.100 Durbin-Watson: 2.074
Prob(Omnibus): 0.951 Jarque-Bera (JB): 0.207
Skew: 0.098 Prob(JB): 0.902
Kurtosis: 2.754 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.42518256 4.92609653 5.38562042 5.77507388 6.07612714 6.28381257
6.40734081 6.46858748 6.49849914 6.53200877 6.60229647 6.73533798
6.94563701 7.23384226 7.5866409 7.97894578 8.37801721 8.74884198
9.05988595 9.28827498 9.42355316 9.46940059 9.44302885 9.3723531
9.29140576 9.23474503 9.23177713 9.30192489 9.45143667 9.67235974
9.94384694 10.23558174 10.51275923 10.74180443 10.89588855 10.95933693
10.93020432 10.82059445 10.65467147 10.46468982 10.28569493 10.14976533
10.08074229 10.09031269 10.17608898 10.32200139 10.50093818 10.67919999
10.82203659 10.89935622]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.87581224 10.71322024 10.43221789 10.07983582 9.71798289 9.40828874
9.19701469 9.10372738 9.11650817 9.19487121]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 4.955748
x1 0.511913
np.sin(x1) 0.526253
I((x1 - 5) ** 2) -0.021223
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.875812
1 10.713220
2 10.432218
3 10.079836
4 9.717983
5 9.408289
6 9.197015
7 9.103727
8 9.116508
9 9.194871
dtype: float64