Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
/build/statsmodels-763zc8/statsmodels-0.8.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     597.7
Date:                Sat, 02 Mar 2019   Prob (F-statistic):           7.81e-37
Time:                        00:23:27   Log-Likelihood:                -9.5126
No. Observations:                  50   AIC:                             27.03
Df Residuals:                      46   BIC:                             34.67
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1151      0.104     49.182      0.000       4.906       5.324
x1             0.4857      0.016     30.283      0.000       0.453       0.518
x2             0.4351      0.063      6.900      0.000       0.308       0.562
x3            -0.0186      0.001    -13.239      0.000      -0.021      -0.016
==============================================================================
Omnibus:                        0.045   Durbin-Watson:                   2.179
Prob(Omnibus):                  0.978   Jarque-Bera (JB):                0.212
Skew:                          -0.048   Prob(JB):                        0.899
Kurtosis:                       2.696   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.64897563  5.09290943  5.50225975  5.85331668  6.13092709  6.33098426
  6.46110257  6.53936652  6.59135946  6.64596036  6.73059902  6.86674949
  7.06640191  7.33009256  7.64681578  7.99583215  8.3500759   8.68060179
  8.96134067  9.17338262  9.30808377  9.36848644  9.36881937  9.33215989
  9.28664245  9.26083615  9.27905099  9.35734372  9.50087983  9.70308485
  9.94672452 10.2067363  10.45434713 10.66180073 10.80691787 10.87674038
 10.86966011 10.79568318 10.67478598 10.5336331  10.40119643 10.30399517
 10.26173875 10.2840887  10.36907122 10.5034015  10.6646664  10.82500682
 10.9556949  11.0318539 ]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.0268838  10.90641338 10.68750372 10.40903481 10.12218643  9.87790751
  9.71444208  9.6479658   9.66862556  9.74295192]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we don't want any expansion magic from using **2

In [8]:
res.params
Out[8]:
Intercept           5.115089
x1                  0.485731
np.sin(x1)          0.435051
I((x1 - 5) ** 2)   -0.018645
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    11.026884
1    10.906413
2    10.687504
3    10.409035
4    10.122186
5     9.877908
6     9.714442
7     9.647966
8     9.668626
9     9.742952
dtype: float64