Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
/build/statsmodels-0.8.0/.pybuild/pythonX.Y_3.5/build/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     1194.
Date:                Wed, 27 Sep 2017   Prob (F-statistic):           1.29e-43
Time:                        19:59:00   Log-Likelihood:                 6.9259
No. Observations:                  50   AIC:                            -5.852
Df Residuals:                      46   BIC:                             1.796
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9119      0.075     65.612      0.000       4.761       5.063
x1             0.5146      0.012     44.574      0.000       0.491       0.538
x2             0.5346      0.045     11.779      0.000       0.443       0.626
x3            -0.0215      0.001    -21.167      0.000      -0.023      -0.019
==============================================================================
Omnibus:                        0.125   Durbin-Watson:                   2.336
Prob(Omnibus):                  0.940   Jarque-Bera (JB):                0.194
Skew:                           0.109   Prob(JB):                        0.908
Kurtosis:                       2.787   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.37543664   4.88169946   5.34594855   5.73904781   6.0423762
   6.25088714   6.37393762   6.43375093   6.46176536   6.49346902
   6.56256927   6.69545463   6.90685928   7.19744224   7.55367912
   7.95008411   8.35339726   8.72804898   9.04200399   9.2720247
   9.40748902   9.45213554   9.42344934   9.34978919   9.2657283
   9.20637405   9.20159993   9.27113743   9.4213342    9.64411114
   9.9182898   10.2130719   10.49309944  10.7242639   10.87931017
  10.94231451  10.91130073  10.79856492  10.62865523  10.43433863
  10.25121746  10.11188002  10.04054641  10.04908926  10.13508306
  10.28220269  10.46290556  10.64295681  10.787054    10.86462717]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[ 10.83956096  10.67328828  10.38677474  10.02779835   9.65925174
   9.3437439    9.12827131   9.03271151   9.04495648   9.123877  ]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we don't want any expansion magic from using **2

In [8]:
res.params
Out[8]:
Intercept           4.911876
x1                  0.514631
np.sin(x1)          0.534616
I((x1 - 5) ** 2)   -0.021458
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.839561
1    10.673288
2    10.386775
3    10.027798
4     9.659252
5     9.343744
6     9.128271
7     9.032712
8     9.044956
9     9.123877
dtype: float64