Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.980
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     741.1
Date:                Sat, 07 Nov 2020   Prob (F-statistic):           6.23e-39
Time:                        11:34:25   Log-Likelihood:                -3.0227
No. Observations:                  50   AIC:                             14.05
Df Residuals:                      46   BIC:                             21.69
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.2072      0.091     57.006      0.000       5.023       5.391
x1             0.4696      0.014     33.337      0.000       0.441       0.498
x2             0.4563      0.055      8.239      0.000       0.345       0.568
x3            -0.0176      0.001    -14.223      0.000      -0.020      -0.015
==============================================================================
Omnibus:                       10.776   Durbin-Watson:                   2.145
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               13.024
Skew:                          -0.755   Prob(JB):                      0.00149
Kurtosis:                       4.992   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.76736342  5.20902136  5.61506369  5.9606248   6.22981294  6.4183212
  6.53413514  6.59622074  6.6314084   6.66998472  6.74071656  6.86612468
  7.05878367  7.31925595  7.6359994   7.98726368  8.34466375  8.67784345
  8.95946295  9.16969058  9.29946112  9.35196523  9.34212516  9.29414314
  9.23752499  9.20223238  9.21376024  9.28894802  9.43321299  9.6396602
  9.89021532 10.1585942  10.41462126 10.62918728 10.77903194 10.85056542
 10.8421012  10.76413314 10.63761134 10.49050013 10.35318359 10.25347355
 10.21204017 10.23901591 10.33233075 10.47805236 10.65267507 10.82698183
 10.97084449 11.05817373]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.06298811 10.94765308 10.73006132 10.45098799 10.16410754  9.92285236
  9.76733063  9.71450645  9.75404645  9.85084976]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
../../../_images/examples_notebooks_generated_predict_12_0.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           5.207157
x1                  0.469631
np.sin(x1)          0.456257
I((x1 - 5) ** 2)   -0.017592
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.062988
1    10.947653
2    10.730061
3    10.450988
4    10.164108
5     9.922852
6     9.767331
7     9.714506
8     9.754046
9     9.850850
dtype: float64