Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     893.9
Date:                Sun, 16 Aug 2020   Prob (F-statistic):           9.05e-41
Time:                        18:00:44   Log-Likelihood:               -0.79447
No. Observations:                  50   AIC:                             9.589
Df Residuals:                      46   BIC:                             17.24
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8511      0.087     55.528      0.000       4.675       5.027
x1             0.5194      0.013     38.548      0.000       0.492       0.546
x2             0.4956      0.053      9.357      0.000       0.389       0.602
x3            -0.0216      0.001    -18.278      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        0.646   Durbin-Watson:                   2.055
Prob(Omnibus):                  0.724   Jarque-Bera (JB):                0.543
Skew:                          -0.246   Prob(JB):                        0.762
Kurtosis:                       2.862   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.31050146  4.80386045  5.25769481  5.64499445  5.94849709  6.16352437
  6.29875055  6.37477728  6.42074887  6.46956393  6.55247024  6.69393079
  6.90760463  7.1941027   7.54088745  7.92433275  8.31360556  8.67573173
  8.98101357  9.20790909  9.34657126  9.4004659   9.38580221  9.32886966
  9.26171874  9.21689507  9.22209235  9.29560227  9.44330932  9.65772391
  9.91921294 10.19922545 10.46498344 10.68486723 10.83361049 10.89645138
 10.87155794 10.77032902 10.61552152 10.43751148 10.26930339 10.14110762
 10.07537699 10.08311779 10.1620815  10.2971342  10.46274307 10.62717129
 10.75769239 10.82596705]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.79748741 10.63723969 10.36465968 10.02403908  9.67368138  9.37162707
  9.16144338  9.06155743  9.06074468  9.12087707]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
../../../_images/examples_notebooks_generated_predict_12_0.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           4.851059
x1                  0.519373
np.sin(x1)          0.495606
I((x1 - 5) ** 2)   -0.021622
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.797487
1    10.637240
2    10.364660
3    10.024039
4     9.673681
5     9.371627
6     9.161443
7     9.061557
8     9.060745
9     9.120877
dtype: float64