Recipe Objective
We often have groups in a dataset and want to summarize each of them, for example by computing the standard error of the mean within each group.
So this recipe is a short example of how to compute the standard error of the mean of groups in pandas. Let’s get started.
Table of Contents
- Recipe Objective
- Step 1 — Import the library
- Step 2 — Setup the Data
- Step 3 — Finding standard error of the groups
- Step 4 — Let’s look at our dataset now
Step 1 — Import the library
import pandas as pd
import seaborn as sb
Let’s pause and look at these imports. Pandas is used here for the data wrangling and the aggregation itself; seaborn is used only to load the example dataset.
Step 2 — Setup the Data
df = sb.load_dataset('tips')
print(df.head())
Here we have imported the tips dataset from the seaborn library.
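The grouping keys used in the next step (sex, smoker, day, time and size) are all columns of this frame; a quick inspection sketch confirms that:
# Inspect the columns and dtypes so the grouping keys used in Step 3
# are easy to verify before grouping.
print(df.columns.tolist())
print(df.dtypes)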
Step 3 — Finding standard error of the groups
print(df.groupby(['sex','smoker','day','time','size']).sem())
Here we have grouped the data on several columns and then computed the standard error of the mean within each group.
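To sanity-check what sem computes, we can compare it against the textbook definition: the sample standard deviation divided by the square root of the group size. A minimal sketch, assuming the tips frame from Step 2 and grouping on a single column for readability:
# Cross-check: the standard error of the mean equals std (ddof=1)
# divided by the square root of the number of observations per group.
grouped = df.groupby('sex')['total_bill']
print(grouped.sem())
print(grouped.std() / grouped.count() ** 0.5)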
Step 4 — Let’s look at our dataset now
Once we run the above code snippet, we will see the standard error of the mean computed for each group.
Scroll down to the IPython notebook to look at the full results.
Pandas does this as follows:
def nansem(values, axis=None, skipna=True, ddof=1):
    var = nanvar(values, axis, skipna, ddof=ddof)

    mask = isnull(values)
    if not is_float_dtype(values.dtype):
        values = values.astype('f8')
    # count of non-null observations along the axis
    count, _ = _get_counts_nanvar(mask, axis, ddof, values.dtype)
    var = nanvar(values, axis, skipna, ddof=ddof)

    # SEM = sqrt(variance) / sqrt(count)
    return np.sqrt(var) / np.sqrt(count)
The definitions of the functions it calls can be found in the linked source file.
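As a rough standalone sketch of the same computation, using plain NumPy instead of the pandas internals (the helper name sem_1d and the 1-D restriction are just for illustration):
import numpy as np

def sem_1d(values, ddof=1):
    # Standard error of the mean for a 1-D array:
    # sqrt(variance with the given ddof) / sqrt(number of non-NaN values).
    values = np.asarray(values, dtype='f8')
    count = np.count_nonzero(~np.isnan(values))
    var = np.nanvar(values, ddof=ddof)
    return np.sqrt(var) / np.sqrt(count)

print(sem_1d([1.0, 2.0, np.nan, 4.0]))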
scipy.stats offers far more functionality, and it works nicely with pandas structures. For example:
In [83]: from scipy.stats import *
In [84]: sem(series)
Out[84]: 0.22002671363672216
In [85]: series.sem()
Out[85]: 0.22002671363672216
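Both calls agree because both use the same convention: the sample standard deviation with ddof=1 divided by the square root of the number of observations. A small sketch, assuming a series without missing values:
import numpy as np
import pandas as pd
from scipy.stats import sem

series = pd.Series(np.random.randn(100))
# All three expressions default to ddof=1, so they coincide when
# the data contains no NaNs.
print(sem(series.values))
print(series.sem())
print(series.std() / np.sqrt(len(series)))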
As for what the axis argument is responsible for, it is easiest to show with a pandas DataFrame example:
In [1]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=list('abc'))
In [2]: df
Out[2]:
   a  b  c
0  0  1  7
1  8  1  1
2  8  2  7
3  1  3  8
4  1  0  4

In [3]: df.sum(axis=0)
Out[3]:
a    18
b     7
c    27
dtype: int64

In [4]: df.sum(axis=1)
Out[4]:
0     8
1    10
2    17
3    12
4     5
dtype: int64
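The same axis argument applies to sem: axis=0 produces one standard error per column, axis=1 one per row. A quick sketch with the frame above:
# Per-column vs. per-row standard error of the mean for the same frame.
print(df.sem(axis=0))
print(df.sem(axis=1))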
Comments
A very common operation when trying to work with data is to find out the error range for the data. In scientific research, including error ranges is required.
There are two main ways to do this: standard deviation and standard error of the mean. Pandas has an optimized std aggregation method for both dataframe and groupby. However, it does not have an optimized standard error method, meaning users who want to compute error ranges have to rely on the unoptimized scipy method.
Since computing error ranges is such a common operation, I think it would be very useful if there was an optimized sem method like there is for std.
Does statsmodels do this?
Not as far as I can find. And I don’t think it really belongs in statsmodels. In my opinion it is a pretty basic data-wrangling task, like getting a mean or standard deviation, not the more advanced statistical modeling provided by statsmodels.
can u point to the scipy method?
http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.sem.html
@toddrjen What do you mean with an optimized method? std is optimized, so you don’t have to rely on an ‘unoptimized’ scipy.stats method, you can just do: df.std()/np.sqrt(len(df))
And by the way, scipy.stats.sem is not that ‘unoptimized’. In fact, it is even faster, as it does not do, e.g., the extra NaN-checking that pandas does:
In [2]: s = pd.Series(np.random.randn(1000))
In [7]: from scipy import stats
In [8]: stats.sem(s.values)
Out[8]: 0.031635197968083853
In [9]: s.std() / np.sqrt(len(s))
Out[9]: 0.031635197968083832
In [11]: %timeit stats.sem(s.values)
10000 loops, best of 3: 46.2 µs per loop
In [12]: %timeit s.std() / np.sqrt(len(s))
10000 loops, best of 3: 85.7 µs per loop
But of course, the question still remains: do we provide a shortcut to this functionality in the form of a sem method, or do we just expect our users to divide the std themselves?
would be code-bloat IMHO, closing
thanks for the suggestion.
if you disagree, pls comment.
@jreback i don’t think this is code bloat relative to the alternative:
You can’t really use scipy.stats.sem because it doesn’t handle nans:
In [19]: from scipy.stats import sem
In [20]: df = DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
In [21]: df
Out[21]:
        a       b       c
0  1.1658  0.2184 -2.0823
1  0.5625 -0.5034  0.7028
2 -0.8424  0.1333 -1.1065
3  0.9335 -0.6088  1.4308
4 -0.1027 -0.1888 -0.5816
5 -0.5202  0.3210 -0.9942
6 -0.8666  0.8711 -0.5691
7 -0.7701 -2.1855 -0.4302
8  1.0664 -1.2672  0.7117
9 -0.7530 -0.8466  0.0194

[10 rows x 3 columns]
In [22]: sem(df[df > 0])
Out[22]: array([ nan, nan, nan])
Okay, so let’s try it with scipy.stats.mstats.sem:
In [26]: from scipy.stats.mstats import sem as sem
In [27]: sem(df[df > 0])
Out[27]:
masked_array(data = [-- -- --],
mask = [ True True True],
fill_value = 1e+20)
That’s hardly what I would expect here, and masked arrays are almost as fun as recarrays. I’m +1 on reopening this.
Here’s what it would take to get the desired result from scipy:
In [32]: Series(sem(np.ma.masked_invalid(df[df > 0])),index=df.columns)
Out[32]:
a 0.1321
b 0.1662
c 0.2881
dtype: float64
In [33]: df[df > 0].std() / sqrt(df[df > 0].count())
Out[33]:
a 0.1321
b 0.1662
c 0.2881
dtype: float64
no, but isn’t this just s.std()/np.sqrt(len(s))? and even that’s ‘arbitrary’ in my book
not an issue with the code-bloat per se, but the definition
agreed. that’s really simple. i was just making a point about the nan handling, you can’t just do len because that counts nans. not a huge deal
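To make the nan-counting point concrete, a minimal sketch:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(len(s))        # 4 -- len() counts the NaN
print(s.count())     # 3 -- count() drops it
print(s.std() / np.sqrt(s.count()))   # NaN-aware standard error of the mean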
not averse to this, but it just seems so simple that a user should do this (as I might want a different definition); that said if this is pretty ‘standard’ then would be ok
every science institution i’ve ever worked in (just 3 really so not a whole lot of weight there) has used sem at some point (even if just to get a rough idea of error ranges). i see your point about different definitions, maybe other folks want to chime in
ok…will reopen for consideration in 0.15 then
I have also been at three different institutions, and they also all used SEM. And I have seen it on hundreds of papers, presentations, and posters.
@toddrjen ok…that’s fine then, pls submit a PR! (needs to go in core/nanops.py) with some updating in core/ops.py
Pull request submitted: #7133
Pandas now has a df.sem() method, as well as Series.sem() and a sem aggregation on groupby objects.
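A short usage sketch of the method that eventually landed, assuming the default skipna=True and ddof=1:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
df.iloc[0, 0] = np.nan                      # NaNs are skipped by default

print(df.sem())                             # one value per column
print(df['a'].sem())                        # Series.sem
print(df.groupby(df['b'] > 0)['a'].sem())   # also available on groupby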