Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     1027.
Date:                Sun, 09 Aug 2020   Prob (F-statistic):           3.93e-42
Time:                        21:12:11   Log-Likelihood:                 2.9050
No. Observations:                  50   AIC:                             2.190
Df Residuals:                      46   BIC:                             9.838
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9967      0.081     61.588      0.000       4.833       5.160
x1             0.4902      0.013     39.174      0.000       0.465       0.515
x2             0.5709      0.049     11.606      0.000       0.472       0.670
x3            -0.0183      0.001    -16.678      0.000      -0.021      -0.016
==============================================================================
Omnibus:                        2.560   Durbin-Watson:                   2.102
Prob(Omnibus):                  0.278   Jarque-Bera (JB):                1.454
Skew:                          -0.055   Prob(JB):                        0.483
Kurtosis:                       2.172   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.5386542   5.03704758  5.49210698  5.87271985  6.15900199  6.3455645
  6.44239911  6.47323659  6.47164785  6.47552834  6.52087193  6.6358572
  6.83621783  7.12265755  7.48073452  7.88323413  8.29464025  8.67697041
  8.99601627  9.22696389  9.35847064  9.39452873  9.35380932  9.2665949
  9.16980393  9.1009253   9.09185897  9.16367489  9.32315131  9.56166109
  9.85658913 10.17504798 10.47928113 10.73286632 10.90669959 10.98377701
 10.96198836 10.85446383 10.68741675 10.49583687 10.31774159 10.18792983
 10.13226465 10.16342406 10.27881801 10.46101396 10.680601   10.90102196
 11.08457965 11.19863041]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.21218748 11.07725494 10.81622059 10.48010344 10.13606243  9.85095361
  9.67496145  9.62931189  9.70107555  9.84633324]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
../../../_images/examples_notebooks_generated_predict_12_0.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           4.996705
x1                  0.490160
np.sin(x1)          0.570882
I((x1 - 5) ** 2)   -0.018322
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.212187
1    11.077255
2    10.816221
3    10.480103
4    10.136062
5     9.850954
6     9.674961
7     9.629312
8     9.701076
9     9.846333
dtype: float64