Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.985
Model: OLS Adj. R-squared: 0.984
Method: Least Squares F-statistic: 1027.
Date: Sun, 09 Aug 2020 Prob (F-statistic): 3.93e-42
Time: 21:12:11 Log-Likelihood: 2.9050
No. Observations: 50 AIC: 2.190
Df Residuals: 46 BIC: 9.838
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.9967 0.081 61.588 0.000 4.833 5.160
x1 0.4902 0.013 39.174 0.000 0.465 0.515
x2 0.5709 0.049 11.606 0.000 0.472 0.670
x3 -0.0183 0.001 -16.678 0.000 -0.021 -0.016
==============================================================================
Omnibus: 2.560 Durbin-Watson: 2.102
Prob(Omnibus): 0.278 Jarque-Bera (JB): 1.454
Skew: -0.055 Prob(JB): 0.483
Kurtosis: 2.172 Cond. No. 221.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.5386542 5.03704758 5.49210698 5.87271985 6.15900199 6.3455645
6.44239911 6.47323659 6.47164785 6.47552834 6.52087193 6.6358572
6.83621783 7.12265755 7.48073452 7.88323413 8.29464025 8.67697041
8.99601627 9.22696389 9.35847064 9.39452873 9.35380932 9.2665949
9.16980393 9.1009253 9.09185897 9.16367489 9.32315131 9.56166109
9.85658913 10.17504798 10.47928113 10.73286632 10.90669959 10.98377701
10.96198836 10.85446383 10.68741675 10.49583687 10.31774159 10.18792983
10.13226465 10.16342406 10.27881801 10.46101396 10.680601 10.90102196
11.08457965 11.19863041]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.21218748 11.07725494 10.81622059 10.48010344 10.13606243 9.85095361
9.67496145 9.62931189 9.70107555 9.84633324]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 4.996705
x1 0.490160
np.sin(x1) 0.570882
I((x1 - 5) ** 2) -0.018322
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.212187
1 11.077255
2 10.816221
3 10.480103
4 10.136062
5 9.850954
6 9.674961
7 9.629312
8 9.701076
9 9.846333
dtype: float64