Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1111.
Date:                Tue, 17 Sep 2019   Prob (F-statistic):           6.60e-43
Time:                        20:50:15   Log-Likelihood:                 5.0310
No. Observations:                  50   AIC:                            -2.062
Df Residuals:                      46   BIC:                             5.586
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9198      0.078     63.274      0.000       4.763       5.076
x1             0.5060      0.012     42.199      0.000       0.482       0.530
x2             0.4851      0.047     10.290      0.000       0.390       0.580
x3            -0.0203      0.001    -19.300      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        1.097   Durbin-Watson:                   2.461
Prob(Omnibus):                  0.578   Jarque-Bera (JB):                0.432
Skew:                           0.140   Prob(JB):                        0.806
Kurtosis:                       3.358   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.41180023  4.89043725  5.33067023  5.70606301  5.99972008  6.20706251
  6.33658022  6.40843715  6.45115831  6.4969433   6.57637592  6.71339938
  6.9213825   7.20092323  7.53975043  7.91473997  8.29571396  8.6503989
  8.94972812  9.17261734  9.30942879  9.36355463  9.35085965  9.29707472
  9.2335693   9.1921975   9.20006454  9.27507349  9.42298377  9.63646494
  9.89630107 10.17454786 10.43912378 10.65908102 10.80969009 10.87650284
 10.85772622 10.76451688 10.61914818 10.45135071 10.29342768 10.17494756
 10.11788608 10.13301565 10.21813554 10.3584334  10.52891893 10.69852968
 10.83523452 10.91129632]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.89512083 10.75039842 10.49615191 10.17573187  9.84620295  9.5643724
  9.3728817   9.28976644  9.30404058  9.37838622]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we don't want any expansion magic from using **2

In [8]:
res.params
Out[8]:
Intercept           4.919805
x1                  0.506034
np.sin(x1)          0.485075
I((x1 - 5) ** 2)   -0.020320
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.895121
1    10.750398
2    10.496152
3    10.175732
4     9.846203
5     9.564372
6     9.372882
7     9.289766
8     9.304041
9     9.378386
dtype: float64