Getting started
This short case study is designed to get you up and running quickly with
statsmodels. Starting from raw data, we show the steps needed to
estimate a statistical model and to draw a diagnostic plot. We use only
functions provided by statsmodels or its pandas and patsy
dependencies.
Loading modules and functions
After installing statsmodels and its dependencies, we load a few modules and functions:
In [1]: import statsmodels.api as sm
In [2]: import pandas
In [3]: from patsy import dmatrices
pandas builds on numpy arrays to provide
rich data structures and data analysis tools. The pandas.DataFrame class
provides labelled arrays of (potentially heterogeneous) data, similar to the
R “data.frame”. The pandas.read_csv function can be used to convert a
comma-separated values file to a DataFrame object.
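To make the read_csv step concrete, here is a minimal sketch that parses a small in-memory CSV. The column names and values are invented for illustration and are not part of the Guerry data.

```python
import io

import pandas

# Hypothetical CSV content, made up for this example.
csv_text = "name,score,group\na,1.5,x\nb,2.0,y\nc,3.25,x\n"

# read_csv accepts a path, a URL, or a file-like object;
# StringIO keeps the example self-contained.
toy = pandas.read_csv(io.StringIO(csv_text))

# Each column keeps its own dtype (strings and floats here).
print(toy.dtypes)
print(toy.shape)
```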
patsy is a Python library for describing
statistical models and building Design Matrices using R-like formulas.
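As a rough illustration of what dmatrices does with an R-like formula, the sketch below builds design matrices from a toy DataFrame. The variable names y, x and g are hypothetical and chosen only for this example.

```python
import pandas
from patsy import dmatrices

# Toy data, invented for illustration: one numeric and one
# categorical predictor.
toy = pandas.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "g": ["a", "b", "a", "b"],
})

# The left-hand side of "~" becomes the response matrix y_mat, the
# right-hand side the design matrix X. patsy adds an Intercept column
# by default and dummy-codes the categorical variable g.
y_mat, X = dmatrices("y ~ x + g", data=toy, return_type="dataframe")

print(X.columns.tolist())
print(X.shape, y_mat.shape)
```

Passing return_type="dataframe" returns pandas objects rather than patsy's own DesignMatrix arrays, which keeps the column labels easy to inspect.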
Data
We download the Guerry dataset, a
collection of historical data used in support of Andre-Michel Guerry’s 1833
Essay on the Moral Statistics of France. The data set is hosted online in
comma-separated values format (CSV) by the Rdatasets repository.
We could download the file locally and then load it using read_csv, but
the get_rdataset function in statsmodels downloads and parses it for us:
In [4]: df = sm.datasets.get_rdataset("Guerry", "HistData", cache=True).data
The Input/Output doc page shows how to import from various other formats.
We select the variables of interest and look at the bottom 5 rows:
In [5]: vars = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
In [6]: df = df[vars]
In [7]: df[-5:]
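The two operations used here, selecting columns with a list of names and slicing the last rows, can be tried on a toy DataFrame without downloading anything. The data below is invented for illustration.

```python
import pandas

# Toy data standing in for the real dataset.
toy = pandas.DataFrame({
    "a": range(8),
    "b": list("stuvwxyz"),
    "c": [x * 0.5 for x in range(8)],
})

# Indexing with a list of names keeps only those columns.
subset = toy[["a", "b"]]

# A negative slice selects rows by position: here, the last five.
last = subset[-5:]

print(last.index.tolist())  # [3, 4, 5, 6, 7]

# tail(5) is an equivalent, more explicit spelling.
print(last.equals(subset.tail(5)))  # True
```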
