Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     775.7
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           2.22e-39
Time:                        22:05:33   Log-Likelihood:                -3.2976
No. Observations:                  50   AIC:                             14.60
Df Residuals:                      46   BIC:                             22.24
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9880      0.092     54.307      0.000       4.803       5.173
x1             0.4967      0.014     35.068      0.000       0.468       0.525
x2             0.4287      0.056      7.700      0.000       0.317       0.541
x3            -0.0197      0.001    -15.862      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.076   Durbin-Watson:                   2.118
Prob(Omnibus):                  0.963   Jarque-Bera (JB):                0.128
Skew:                          -0.081   Prob(JB):                        0.938
Kurtosis:                       2.812   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.49476846  4.94493203  5.36056249  5.71829358  6.00319179  6.21120989
  6.34985194  6.43694032  6.4976876   6.56055419  6.65257237  6.79490494
  6.99936828  7.26649089  7.58542662  7.93573662  8.29074741  8.62193324
  8.90360279  9.11712021  9.25396688  9.31714107  9.32066527  9.28728236
  9.24471902  9.22113032  9.24047389  9.31857381  9.46052068  9.65983531
  9.89953323 10.15491515 10.397625   10.60030878 10.74110871 10.80725443
 10.79716109 10.7206898  10.59752741 10.45395207 10.31851578 10.21735338
 10.16988866 10.18564311 10.26267155 10.38788171 10.53918527 10.68912691
 10.80939512 10.87547372]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.85896487 10.72795313 10.49925229 10.21117887  9.91417087  9.65843883
  9.48167249  9.39981308  9.40315029  9.45869975]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fdebd8777c0>
../../../_images/examples_notebooks_generated_predict_12_1.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           4.987954
x1                  0.496738
np.sin(x1)          0.428746
I((x1 - 5) ** 2)   -0.019727
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.858965
1    10.727953
2    10.499252
3    10.211179
4     9.914171
5     9.658439
6     9.481672
7     9.399813
8     9.403150
9     9.458700
dtype: float64