# v0.15.0 (October 18, 2014)

This is a major release from 0.14.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

pandas >= 0.15.0 no longer supports NumPy versions < 1.7.0. If you want to use the latest versions of pandas, please upgrade to NumPy >= 1.7.0 (GH7711).

- Highlights include:
  - The `Categorical` type was integrated as a first-class pandas type, see Categoricals in Series/DataFrame
  - New scalar type `Timedelta`, and a new index type `TimedeltaIndex`, see TimedeltaIndex/Scalar
  - New datetimelike properties accessor `.dt` for Series, see Datetimelike Properties
  - New DataFrame default display for `df.info()` to include memory usage, see Memory Usage
  - `read_csv` will now by default ignore blank lines when parsing, see Backwards incompatible API changes
  - API change in using Indexes in set operations, see Deprecations
  - Enhancements in the handling of timezones, see Timezone handling improvements
  - A lot of improvements to the rolling and expanding moment functions, see Rolling/Expanding Moments improvements
  - Internal refactoring of the `Index` class to no longer sub-class `ndarray`, see Internal Refactoring
  - Dropping support for `PyTables` less than version 3.0.0, and `numexpr` less than version 2.1 (GH7990)
  - Split indexing documentation into Indexing and Selecting Data and MultiIndex / Advanced Indexing
  - Split out string methods documentation into Working with Text Data
- Check the API Changes and deprecations before updating
- Other Enhancements
- Performance Improvements
- Bug Fixes

In 0.15.0 `Index` has internally been refactored to no longer sub-class `ndarray` but instead subclass `PandasObject`, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (see the Internal Refactoring section).

The refactoring in `Categorical` changed the two argument constructor from “codes/labels and levels” to “values and levels (now called ‘categories’)”. This can lead to subtle bugs. If you use `Categorical` directly, please audit your code before updating to this pandas version and change it to use the `from_codes()` constructor. See more on `Categorical` under Categoricals in Series/DataFrame.

## New features

### Categoricals in Series/DataFrame

`Categorical` can now be included in Series and DataFrames and gained new methods to manipulate them. Thanks to Jan Schulz for much of this API/implementation. (GH3943, GH5313, GH5314, GH7444, GH7839, GH7848, GH7864, GH7914, GH7768, GH8006, GH3678, GH8075, GH8076, GH8143, GH8453, GH8518).

For full docs, see the categorical introduction and the API documentation.

```
In [1]: df = DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
In [2]: df["grade"] = df["raw_grade"].astype("category")
In [3]: df["grade"]
Out[3]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, Length: 6, dtype: category
Categories (3, object): [a, b, e]
# Rename the categories
In [4]: df["grade"].cat.categories = ["very good", "good", "very bad"]
# Reorder the categories and simultaneously add the missing categories
In [5]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [6]: df["grade"]
Out[6]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, Length: 6, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
In [7]: df.sort_values("grade")
Out[7]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
[6 rows x 3 columns]
In [8]: df.groupby("grade").size()
Out[8]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
Length: 5, dtype: int64
```

- `pandas.core.group_agg` and `pandas.core.factor_agg` were removed. As an alternative, construct a dataframe and use `df.groupby(<group>).agg(<func>)`.
- Supplying “codes/labels and levels” to the `Categorical` constructor is not supported anymore. Supplying two arguments to the constructor is now interpreted as “values and levels (now called ‘categories’)”. Please change your code to use the `from_codes()` constructor.
- The `Categorical.labels` attribute was renamed to `Categorical.codes` and is read only. If you want to manipulate codes, please use one of the API methods on Categoricals.
- The `Categorical.levels` attribute is renamed to `Categorical.categories`.

### TimedeltaIndex/Scalar

We introduce a new scalar type `Timedelta`, which is a subclass of `datetime.timedelta`, and behaves in a similar manner, but allows compatibility with `np.timedelta64` types as well as a host of custom representation, parsing, and attributes. This type is very similar to how `Timestamp` works for `datetimes`. It is a nice-API box for the type. See the docs. (GH3009, GH4533, GH8209, GH8187, GH8190, GH7869, GH7661, GH8345, GH8471)

`Timedelta` scalars (and `TimedeltaIndex`) component fields are *not the same* as the component fields on a `datetime.timedelta` object. For example, `.seconds` on a `datetime.timedelta` object returns the total number of seconds combined between `hours`, `minutes` and `seconds`. In contrast, the pandas `Timedelta` breaks out hours, minutes, microseconds and nanoseconds separately.

```
# Timedelta accessor
In [9]: tds = Timedelta('31 days 5 min 3 sec')
In [10]: tds.minutes
Out[10]: 5L
In [11]: tds.seconds
Out[11]: 3L
# datetime.timedelta accessor
# this is 5 minutes * 60 + 3 seconds
In [12]: tds.to_pytimedelta().seconds
Out[12]: 303
```

**Note**: this is no longer true starting from v0.16.0, where full compatibility with `datetime.timedelta` is introduced. See the 0.16.0 whatsnew entry.
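The `303` in the example above comes from how the standard library stores a duration. A minimal sketch using only `datetime.timedelta` (no pandas involved), with the same 31 days, 5 minutes and 3 seconds as above:

```python
from datetime import timedelta

# Same duration as the Timedelta example: 31 days, 5 minutes, 3 seconds.
td = timedelta(days=31, minutes=5, seconds=3)

# datetime.timedelta normalizes to (days, seconds, microseconds), so the
# 5 minutes are folded into the seconds field.
days = td.days
seconds = td.seconds          # 5 * 60 + 3
total = td.total_seconds()    # the whole duration expressed in seconds

print(days, seconds, total)
```

When the whole duration is needed rather than one component, `total_seconds()` avoids any ambiguity about what `.seconds` means.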

Prior to 0.15.0 `pd.to_timedelta` would return a `Series` for list-like/Series input, and a `np.timedelta64` for scalar input. It will now return a `TimedeltaIndex` for list-like input, `Series` for Series input, and `Timedelta` for scalar input.

The arguments to `pd.to_timedelta` are now `(arg, unit='ns', box=True, coerce=False)`; previously they were `(arg, box=True, unit='ns')`. The new order is more logical.
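The three return types can be checked directly. The sketch below assumes a recent pandas release and passes only the first argument, because the `box` and `coerce` keywords listed above were removed in later versions:

```python
import pandas as pd

# Scalar input -> Timedelta
scalar = pd.to_timedelta("1 days 06:05:01")

# List-like input -> TimedeltaIndex
from_list = pd.to_timedelta(["1 days", "2 days"])

# Series input -> Series (of timedelta64 dtype)
from_series = pd.to_timedelta(pd.Series(["1 days", "2 days"]))

for result in (scalar, from_list, from_series):
    print(type(result).__name__)
```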

Construct a scalar

```
In [9]: Timedelta('1 days 06:05:01.00003')
Out[9]: Timedelta('1 days 06:05:01.000030')
In [10]: Timedelta('15.5us')
Out[10]: Timedelta('0 days 00:00:00.000015')
In [11]: Timedelta('1 hour 15.5us')
Out[11]: Timedelta('0 days 01:00:00.000015')
# negative Timedeltas have this string repr
# to be more consistent with datetime.timedelta conventions
In [12]: Timedelta('-1us')
Out[12]: Timedelta('-1 days +23:59:59.999999')
# a NaT
In [13]: Timedelta('nan')
Out[13]: NaT
```

Access fields for a `Timedelta`

```
In [14]: td = Timedelta('1 hour 3m 15.5us')
In [15]: td.seconds
Out[15]: 3780
In [16]: td.microseconds
Out[16]: 15
In [17]: td.nanoseconds
Out[17]: 500
```

Construct a `TimedeltaIndex`

```
In [18]: TimedeltaIndex(['1 days','1 days, 00:00:05',
....: np.timedelta64(2,'D'),timedelta(days=2,seconds=2)])
....:
Out[18]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
'2 days 00:00:02'],
dtype='timedelta64[ns]', freq=None)
```

Constructing a `TimedeltaIndex` with a regular range

```
In [19]: timedelta_range('1 days',periods=5,freq='D')
Out[19]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')
In [20]: timedelta_range(start='1 days',end='2 days',freq='30T')
Out[20]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00',
'1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00',
'1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00',
'1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00',
'1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00',
'1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00',
'1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00',
'1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00',
'1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00',
'1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00',
'1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00',
'1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00',
'1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00',
'1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00',
'1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00',
'1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00',
'2 days 00:00:00'],
dtype='timedelta64[ns]', freq='30T')
```

You can now use a `TimedeltaIndex` as the index of a pandas object

```
In [21]: s = Series(np.arange(5),
....: index=timedelta_range('1 days',periods=5,freq='s'))
....:
In [22]: s
Out[22]:
1 days 00:00:00 0
1 days 00:00:01 1
1 days 00:00:02 2
1 days 00:00:03 3
1 days 00:00:04 4
Freq: S, Length: 5, dtype: int64
```

You can select with partial string selections

```
In [23]: s['1 day 00:00:02']
Out[23]: 2
In [24]: s['1 day':'1 day 00:00:02']
Out[24]:
1 days 00:00:00 0
1 days 00:00:01 1
1 days 00:00:02 2
Freq: S, Length: 3, dtype: int64
```

Finally, the combination of `TimedeltaIndex` with `DatetimeIndex` allows certain combination operations that are `NaT` preserving:

```
In [25]: tdi = TimedeltaIndex(['1 days',pd.NaT,'2 days'])
In [26]: tdi.tolist()
Out[26]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')]
In [27]: dti = date_range('20130101',periods=3)
In [28]: dti.tolist()
Out[28]:
[Timestamp('2013-01-01 00:00:00', freq='D'),
Timestamp('2013-01-02 00:00:00', freq='D'),
Timestamp('2013-01-03 00:00:00', freq='D')]
In [29]: (dti + tdi).tolist()
Out[29]: [Timestamp('2013-01-02 00:00:00'), NaT, Timestamp('2013-01-05 00:00:00')]
In [30]: (dti - tdi).tolist()
Out[30]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2013-01-01 00:00:00')]
```

- Iteration of a `Series`, e.g. `list(Series(...))`, of `timedelta64[ns]` would prior to v0.15.0 return `np.timedelta64` for each element. These will now be wrapped in `Timedelta`.
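A short way to see the wrapping, assuming a recent pandas release (the sample values are arbitrary):

```python
import pandas as pd

# A Series of timedelta64[ns] values.
s = pd.Series(pd.to_timedelta(["1 days", "2 days 00:00:05"]))

# Iterating yields pandas Timedelta scalars rather than np.timedelta64.
elements = list(s)
print([type(e).__name__ for e in elements])
```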

### Memory Usage

Implemented methods to find memory usage of a DataFrame. See the FAQ for more. (GH6852).

A new display option `display.memory_usage` (see Options and Settings) sets the default behavior of the `memory_usage` argument in the `df.info()` method. By default `display.memory_usage` is `True`.

```
In [31]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
....: 'complex128', 'object', 'bool']
....:
In [32]: n = 5000
In [33]: data = dict([ (t, np.random.randint(100, size=n).astype(t))
....: for t in dtypes])
....:
In [34]: df = DataFrame(data)
In [35]: df['categorical'] = df['object'].astype('category')
In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
int64 5000 non-null int64
float64 5000 non-null float64
datetime64[ns] 5000 non-null datetime64[ns]
timedelta64[ns] 5000 non-null timedelta64[ns]
complex128 5000 non-null complex128
object 5000 non-null object
bool 5000 non-null bool
categorical 5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 289.1+ KB
```

Additionally `memory_usage()` is an available method for a dataframe object which returns the memory usage of each column.

```
In [37]: df.memory_usage(index=True)
Out[37]:
Index 80
int64 40000
float64 40000
datetime64[ns] 40000
timedelta64[ns] 40000
complex128 80000
object 40000
bool 5000
categorical 10920
Length: 9, dtype: int64
```

### .dt accessor

`Series` has gained an accessor to succinctly return datetime like properties for the *values* of the Series, if it is a datetime/period like Series. (GH7207) This will return a Series, indexed like the existing Series. See the docs.

```
# datetime
In [38]: s = Series(date_range('20130101 09:10:12',periods=4))
In [39]: s
Out[39]:
0 2013-01-01 09:10:12
1 2013-01-02 09:10:12
2 2013-01-03 09:10:12
3 2013-01-04 09:10:12
Length: 4, dtype: datetime64[ns]
In [40]: s.dt.hour
Out[40]:
0 9
1 9
2 9
3 9
Length: 4, dtype: int64
In [41]: s.dt.second
Out[41]:
0 12
1 12
2 12
3 12
Length: 4, dtype: int64
In [42]: s.dt.day
Out[42]:
0 1
1 2
2 3
3 4
Length: 4, dtype: int64
In [43]: s.dt.freq
Out[43]: 'D'
```

This enables nice expressions like this:

```
In [44]: s[s.dt.day==2]
Out[44]:
1 2013-01-02 09:10:12
Length: 1, dtype: datetime64[ns]
```

You can easily produce tz aware transformations:

```
In [45]: stz = s.dt.tz_localize('US/Eastern')
In [46]: stz
Out[46]:
0 2013-01-01 09:10:12-05:00
1 2013-01-02 09:10:12-05:00
2 2013-01-03 09:10:12-05:00
3 2013-01-04 09:10:12-05:00
Length: 4, dtype: datetime64[ns, US/Eastern]
In [47]: stz.dt.tz
Out[47]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
```

You can also chain these types of operations:

```
In [48]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[48]:
0 2013-01-01 04:10:12-05:00
1 2013-01-02 04:10:12-05:00
2 2013-01-03 04:10:12-05:00
3 2013-01-04 04:10:12-05:00
Length: 4, dtype: datetime64[ns, US/Eastern]
```

The `.dt` accessor works for period and timedelta dtypes.

```
# period
In [49]: s = Series(period_range('20130101',periods=4,freq='D'))
In [50]: s
Out[50]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
Length: 4, dtype: period[D]
In [51]: s.dt.year
Out[51]:
0 2013
1 2013
2 2013
3 2013
Length: 4, dtype: int64
In [52]: s.dt.day
Out[52]:
0 1
1 2
2 3
3 4
Length: 4, dtype: int64
```

```
# timedelta
In [53]: s = Series(timedelta_range('1 day 00:00:05',periods=4,freq='s'))
In [54]: s
Out[54]:
0 1 days 00:00:05
1 1 days 00:00:06
2 1 days 00:00:07
3 1 days 00:00:08
Length: 4, dtype: timedelta64[ns]
In [55]: s.dt.days
Out[55]:
0 1
1 1
2 1
3 1
Length: 4, dtype: int64
In [56]: s.dt.seconds
Out[56]:
0 5
1 6
2 7
3 8
Length: 4, dtype: int64
In [57]: s.dt.components
Out[57]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 1 0 0 5 0 0 0
1 1 0 0 6 0 0 0
2 1 0 0 7 0 0 0
3 1 0 0 8 0 0 0
[4 rows x 7 columns]
```

### Timezone handling improvements

- `tz_localize(None)` for tz-aware `Timestamp` and `DatetimeIndex` now removes the timezone while keeping the local time; previously this resulted in `Exception` or `TypeError` (GH7812)

  ```
  In [58]: ts = Timestamp('2014-08-01 09:00', tz='US/Eastern')
  In [59]: ts
  Out[59]: Timestamp('2014-08-01 09:00:00-0400', tz='US/Eastern')
  In [60]: ts.tz_localize(None)
  Out[60]: Timestamp('2014-08-01 09:00:00')
  In [61]: didx = DatetimeIndex(start='2014-08-01 09:00', freq='H', periods=10, tz='US/Eastern')
  In [62]: didx
  Out[62]:
  DatetimeIndex(['2014-08-01 09:00:00-04:00', '2014-08-01 10:00:00-04:00',
  '2014-08-01 11:00:00-04:00', '2014-08-01 12:00:00-04:00',
  '2014-08-01 13:00:00-04:00', '2014-08-01 14:00:00-04:00',
  '2014-08-01 15:00:00-04:00', '2014-08-01 16:00:00-04:00',
  '2014-08-01 17:00:00-04:00', '2014-08-01 18:00:00-04:00'],
  dtype='datetime64[ns, US/Eastern]', freq='H')
  In [63]: didx.tz_localize(None)
  Out[63]:
  DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00',
  '2014-08-01 11:00:00', '2014-08-01 12:00:00',
  '2014-08-01 13:00:00', '2014-08-01 14:00:00',
  '2014-08-01 15:00:00', '2014-08-01 16:00:00',
  '2014-08-01 17:00:00', '2014-08-01 18:00:00'],
  dtype='datetime64[ns]', freq='H')
  ```

- `tz_localize` now accepts the `ambiguous` keyword which allows for passing an array of bools indicating whether the date belongs in DST or not, ‘NaT’ for setting transition times to NaT, ‘infer’ for inferring DST/non-DST, and ‘raise’ (default) for an `AmbiguousTimeError` to be raised. See the docs for more details (GH7943)
- `DataFrame.tz_localize` and `DataFrame.tz_convert` now accept an optional `level` argument for localizing a specific level of a MultiIndex (GH7846)
- `Timestamp.tz_localize` and `Timestamp.tz_convert` now raise `TypeError` in error cases, rather than `Exception` (GH8025)
- A timeseries/index localized to UTC when inserted into a Series/DataFrame will preserve the UTC timezone (rather than being a naive `datetime64[ns]`) as `object` dtype (GH8411)
- `Timestamp.__repr__` displays `dateutil.tz.tzoffset` info (GH7907)
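The `ambiguous` keyword has no example above, so here is a minimal sketch of two of its modes. It assumes a recent pandas release with timezone data available; the timestamp is chosen because 01:30 on 2014-11-02 occurs twice in `US/Eastern`, when daylight saving time ends:

```python
from datetime import timedelta

import numpy as np
import pandas as pd

# A wall-clock time that exists twice on the day DST ends.
idx = pd.DatetimeIndex(["2014-11-02 01:30:00"])

# 'NaT' replaces ambiguous times with NaT instead of raising.
as_nat = idx.tz_localize("US/Eastern", ambiguous="NaT")

# A boolean array states explicitly whether each time is in DST (True) or not (False).
as_dst = idx.tz_localize("US/Eastern", ambiguous=np.array([True]))
as_std = idx.tz_localize("US/Eastern", ambiguous=np.array([False]))

print(as_nat[0], as_dst[0], as_std[0])
```

Passing the boolean array is the explicit choice when the source data records whether DST was in effect; `'infer'` is only usable when the order of a longer sequence makes the transition detectable.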

### Rolling/Expanding Moments improvements

- `rolling_min()`, `rolling_max()`, `rolling_cov()`, and `rolling_corr()` now return objects with all `NaN` when `len(arg) < min_periods <= window` rather than raising. (This makes all rolling functions consistent in this behavior). (GH7766)

  Prior to 0.15.0

  ```
  In [64]: s = Series([10, 11, 12, 13])
  ```

  ```
  In [15]: rolling_min(s, window=10, min_periods=5)
  ValueError: min_periods (5) must be <= window (4)
  ```

  New behavior

  ```
  In [4]: pd.rolling_min(s, window=10, min_periods=5)
  Out[4]:
  0 NaN
  1 NaN
  2 NaN
  3 NaN
  dtype: float64
  ```

- `rolling_max()`, `rolling_min()`, `rolling_sum()`, `rolling_mean()`, `rolling_median()`, `rolling_std()`, `rolling_var()`, `rolling_skew()`, `rolling_kurt()`, `rolling_quantile()`, `rolling_cov()`, `rolling_corr()`, `rolling_corr_pairwise()`, `rolling_window()`, and `rolling_apply()` with `center=True` previously would return a result of the same structure as the input `arg` with `NaN` in the final `(window-1)/2` entries.

  Now the final `(window-1)/2` entries of the result are calculated as if the input `arg` were followed by `(window-1)/2` `NaN` values (or with shrinking windows, in the case of `rolling_apply()`). (GH7925, GH8269)

  Prior behavior (note final value is `NaN`):

  ```
  In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True)
  Out[7]:
  0 1
  1 3
  2 6
  3 NaN
  dtype: float64
  ```

  New behavior (note final value is `5 = sum([2, 3, NaN])`):

  ```
  In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True)
  Out[7]:
  0 1
  1 3
  2 6
  3 5
  dtype: float64
  ```

- `rolling_window()` now normalizes the weights properly in rolling mean mode (mean=True) so that the calculated weighted means (e.g. ‘triang’, ‘gaussian’) are distributed about the same means as those calculated without weighting (i.e. ‘boxcar’). See the note on normalization for further details. (GH7618)

  ```
  In [65]: s = Series([10.5, 8.8, 11.4, 9.7, 9.3])
  ```

  Behavior prior to 0.15.0:

  ```
  In [39]: rolling_window(s, window=3, win_type='triang', center=True)
  Out[39]:
  0 NaN
  1 6.583333
  2 6.883333
  3 6.683333
  4 NaN
  dtype: float64
  ```

  New behavior

  ```
  In [10]: pd.rolling_window(s, window=3, win_type='triang', center=True)
  Out[10]:
  0 NaN
  1 9.875
  2 10.325
  3 10.025
  4 NaN
  dtype: float64
  ```

- Removed `center` argument from all `expanding_` functions (see list), as the results produced when `center=True` did not make much sense. (GH7925)
- Added optional `ddof` argument to `expanding_cov()` and `rolling_cov()`. The default value of `1` is backwards-compatible. (GH8279)
- Documented the `ddof` argument to `expanding_var()`, `expanding_std()`, `rolling_var()`, and `rolling_std()`. These functions’ support of a `ddof` argument (with a default value of `1`) was previously undocumented. (GH8064)
- `ewma()`, `ewmstd()`, `ewmvol()`, `ewmvar()`, `ewmcov()`, and `ewmcorr()` now interpret `min_periods` in the same manner that the `rolling_*()` and `expanding_*()` functions do: a given result entry will be `NaN` if the (expanding, in this case) window does not contain at least `min_periods` values. The previous behavior was to set to `NaN` the `min_periods` entries starting with the first non-`NaN` value. (GH7977)

  Prior behavior (note values start at index `2`, which is `min_periods` after index `0` (the index of the first non-empty value)):

  ```
  In [66]: s = Series([1, None, None, None, 2, 3])
  ```

  ```
  In [51]: ewma(s, com=3., min_periods=2)
  Out[51]:
  0 NaN
  1 NaN
  2 1.000000
  3 1.000000
  4 1.571429
  5 2.189189
  dtype: float64
  ```

  New behavior (note values start at index `4`, the location of the 2nd (since `min_periods=2`) non-empty value):

  ```
  In [2]: pd.ewma(s, com=3., min_periods=2)
  Out[2]:
  0 NaN
  1 NaN
  2 NaN
  3 NaN
  4 1.759644
  5 2.383784
  dtype: float64
  ```

- `ewmstd()`, `ewmvol()`, `ewmvar()`, `ewmcov()`, and `ewmcorr()` now have an optional `adjust` argument, just like `ewma()` does, affecting how the weights are calculated. The default value of `adjust` is `True`, which is backwards-compatible. See Exponentially weighted moment functions for details. (GH7911)
- `ewma()`, `ewmstd()`, `ewmvol()`, `ewmvar()`, `ewmcov()`, and `ewmcorr()` now have an optional `ignore_na` argument. When `ignore_na=False` (the default), missing values are taken into account in the weights calculation. When `ignore_na=True` (which reproduces the pre-0.15.0 behavior), missing values are ignored in the weights calculation. (GH7543)

  ```
  In [7]: pd.ewma(Series([None, 1., 8.]), com=2.)
  Out[7]:
  0 NaN
  1 1.0
  2 5.2
  dtype: float64
  In [8]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=True) # pre-0.15.0 behavior
  Out[8]:
  0 1.0
  1 1.0
  2 5.2
  dtype: float64
  In [9]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=False) # new default
  Out[9]:
  0 1.000000
  1 1.000000
  2 5.846154
  dtype: float64
  ```

  By default (`ignore_na=False`) the `ewm*()` functions’ weights calculation in the presence of missing values is different than in pre-0.15.0 versions. To reproduce the pre-0.15.0 calculation of weights in the presence of missing values one must specify explicitly `ignore_na=True`.

- Bug in `expanding_cov()`, `expanding_corr()`, `rolling_cov()`, `rolling_corr()`, `ewmcov()`, and `ewmcorr()` returning results with columns sorted by name and producing an error for non-unique columns; now handles non-unique columns and returns columns in original order (except for the case of two DataFrames with `pairwise=False`, where behavior is unchanged) (GH7542)
- Bug in `rolling_count()` and `expanding_*()` functions unnecessarily producing error message for zero-length data (GH8056)
- Bug in `rolling_apply()` and `expanding_apply()` interpreting `min_periods=0` as `min_periods=1` (GH8080)
- Bug in `expanding_std()` and `expanding_var()` for a single value producing a confusing error message (GH7900)
- Bug in `rolling_std()` and `rolling_var()` for a single value producing `0` rather than `NaN` (GH7900)
- Bug in `ewmstd()`, `ewmvol()`, `ewmvar()`, and `ewmcov()` calculation of de-biasing factors when `bias=False` (the default). Previously an incorrect constant factor was used, based on `adjust=True`, `ignore_na=True`, and an infinite number of observations. Now a different factor is used for each entry, based on the actual weights (analogous to the usual `N/(N-1)` factor). In particular, for a single point a value of `NaN` is returned when `bias=False`, whereas previously a value of (approximately) `0` was returned.

  For example, consider the following pre-0.15.0 results for `ewmvar(..., bias=False)`, and the corresponding debiasing factors:

  ```
  In [67]: s = Series([1., 2., 0., 4.])
  ```

  ```
  In [89]: ewmvar(s, com=2., bias=False)
  Out[89]:
  0 -2.775558e-16
  1 3.000000e-01
  2 9.556787e-01
  3 3.585799e+00
  dtype: float64
  In [90]: ewmvar(s, com=2., bias=False) / ewmvar(s, com=2., bias=True)
  Out[90]:
  0 1.25
  1 1.25
  2 1.25
  3 1.25
  dtype: float64
  ```

  Note that entry `0` is approximately 0, and the debiasing factors are a constant 1.25. By comparison, the following 0.15.0 results have a `NaN` for entry `0`, and the debiasing factors are decreasing (towards 1.25):

  ```
  In [14]: pd.ewmvar(s, com=2., bias=False)
  Out[14]:
  0 NaN
  1 0.500000
  2 1.210526
  3 4.089069
  dtype: float64
  In [15]: pd.ewmvar(s, com=2., bias=False) / pd.ewmvar(s, com=2., bias=True)
  Out[15]:
  0 NaN
  1 2.083333
  2 1.583333
  3 1.425439
  dtype: float64
  ```

  See Exponentially weighted moment functions for details. (GH7912)
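The `5.2` and `5.846154` values in the `ignore_na` example can be reproduced by hand. The sketch below is an illustration under two assumptions: it uses the method-style `Series.ewm(...).mean()` of recent pandas releases in place of the top-level `pd.ewma` named above (which later versions removed), and it relies on the default `adjust=True` weighting with `alpha = 1 / (1 + com)`:

```python
import pandas as pd

s = pd.Series([1.0, None, 8.0])
com = 2.0
alpha = 1.0 / (1.0 + com)

# ignore_na=False: the missing entry still ages the first observation,
# so its weight relative to the last value is (1 - alpha) ** 2.
w_default = (1 - alpha) ** 2
expected_default = (w_default * 1.0 + 8.0) / (w_default + 1.0)

# ignore_na=True: the missing entry is skipped, so the weight is (1 - alpha) ** 1.
w_ignore = 1 - alpha
expected_ignore = (w_ignore * 1.0 + 8.0) / (w_ignore + 1.0)

default_result = s.ewm(com=com, ignore_na=False).mean()
ignore_result = s.ewm(com=com, ignore_na=True).mean()

print(default_result.iloc[2], expected_default)
print(ignore_result.iloc[2], expected_ignore)
```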

### Improvements in the sql io module

- Added support for a `chunksize` parameter to `to_sql` function. This allows DataFrame to be written in chunks and avoid packet-size overflow errors (GH8062).
- Added support for a `chunksize` parameter to `read_sql` function. Specifying this argument will return an iterator through chunks of the query result (GH2908).
- Added support for writing `datetime.date` and `datetime.time` object columns with `to_sql` (GH6932).
- Added support for specifying a `schema` to read from/write to with `read_sql_table` and `to_sql` (GH7441, GH7952). For example:

  ```
  df.to_sql('table', engine, schema='other_schema')
  pd.read_sql_table('table', engine, schema='other_schema')
  ```

- Added support for writing `NaN` values with `to_sql` (GH2754).
- Added support for writing datetime64 columns with `to_sql` for all database flavors (GH7103).
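The two `chunksize` parameters can be tried against an in-memory SQLite database. This is a minimal sketch assuming a recent pandas release; the table name `measurements` and its columns are hypothetical:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database; table and column names are made up for the example.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": list(range(10)), "value": [i * 0.5 for i in range(10)]})

# Write the frame in batches of 4 rows.
df.to_sql("measurements", conn, index=False, chunksize=4)

# With chunksize set, read_sql returns an iterator of DataFrames.
chunks = pd.read_sql("SELECT * FROM measurements", conn, chunksize=4)
chunk_sizes = [len(chunk) for chunk in chunks]
total_rows = sum(chunk_sizes)
conn.close()

print(chunk_sizes, total_rows)
```

Iterating over chunks keeps only one batch of rows in memory at a time, which is the main reason to set `chunksize` when reading large query results.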

## Backwards incompatible API changes

### Breaking changes

API changes related to `Categorical` (see Categoricals in Series/DataFrame for more details):

- The `Categorical` constructor with two arguments changed from “codes/labels and levels” to “values and levels (now called ‘categories’)”. This can lead to subtle bugs. If you use `Categorical` directly, please audit your code by changing it to use the `from_codes()` constructor.

  An old function call like (prior to 0.15.0):

  ```
  pd.Categorical([0,1,0,2,1], levels=['a', 'b', 'c'])
  ```

  will have to be adapted to the following to keep the same behaviour:

  ```
  In [2]: pd.Categorical.from_codes([0,1,0,2,1], categories=['a', 'b', 'c'])
  Out[2]:
  [a, b, a, c, b]
  Categories (3, object): [a, b, c]
  ```
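The “subtle bug” is easiest to see by passing the same two arguments both ways. A minimal sketch, assuming a recent pandas release:

```python
import pandas as pd

codes = [0, 1, 0, 2, 1]
categories = ["a", "b", "c"]

# Interprets the integers as positions into `categories`.
from_codes = pd.Categorical.from_codes(codes, categories=categories)

# Interprets the integers as values; none of them is one of the
# categories, so every entry becomes missing (code -1).
as_values = pd.Categorical(codes, categories=categories)

print(list(from_codes))
print(as_values.isna().all())
```

No exception is raised in the second call, which is why code that still passes codes to the plain constructor needs to be audited rather than relied on to fail loudly.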

API changes related to the introduction of the `Timedelta` scalar (see above for more details):

- Prior to 0.15.0 `to_timedelta()` would return a `Series` for list-like/Series input, and a `np.timedelta64` for scalar input. It will now return a `TimedeltaIndex` for list-like input, `Series` for Series input, and `Timedelta` for scalar input.

For API changes related to the rolling and expanding functions, see detailed overview above.

Other notable API changes:

- Consistency when indexing with `.loc` and a list-like indexer when no values are found.

  ```
  In [68]: df = DataFrame([['a'],['b']],index=[1,2])
  In [69]: df
  Out[69]:
  0
  1 a
  2 b
  [2 rows x 1 columns]
  ```

  In prior versions there was a difference in these two constructs:

  - `df.loc[[3]]` would return a frame reindexed by 3 (with all `np.nan` values)
  - `df.loc[[3],:]` would raise `KeyError`.

  Both will now raise a `KeyError`. The rule is that *at least 1* indexer must be found when using a list-like and `.loc` (GH7999)

  Furthermore in prior versions these were also different:

  - `df.loc[[1,3]]` would return a frame reindexed by [1,3]
  - `df.loc[[1,3],:]` would raise `KeyError`.

  Both will now return a frame reindexed by [1,3]. E.g.

  ```
  In [3]: df.loc[[1,3]]
  Out[3]:
  0
  1 a
  3 NaN
  In [4]: df.loc[[1,3],:]
  Out[4]:
  0
  1 a
  3 NaN
  ```

  This can also be seen in multi-axis indexing with a `Panel`.

  ```
  In [70]: p = Panel(np.arange(2*3*4).reshape(2,3,4),
  ....: items=['ItemA','ItemB'],
  ....: major_axis=[1,2,3],
  ....: minor_axis=['A','B','C','D'])
  ....:
  In [71]: p
  Out[71]:
  <class 'pandas.core.panel.Panel'>
  Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
  Items axis: ItemA to ItemB
  Major_axis axis: 1 to 3
  Minor_axis axis: A to D
  ```

  The following would raise `KeyError` prior to 0.15.0:

  ```
  In [5]:
  Out[5]:
  ItemA ItemD
  1 3 NaN
  2 7 NaN
  3 11 NaN
  ```

  Furthermore, `.loc` will raise if no values are found in a MultiIndex with a list-like indexer:

  ```
  In [72]: s = Series(np.arange(3,dtype='int64'),
  ....: index=MultiIndex.from_product([['A'],['foo','bar','baz']],
  ....: names=['one','two'])
  ....: ).sort_index()
  ....:
  In [73]: s
  Out[73]:
  one two
  A bar 1
  baz 2
  foo 0
  Length: 3, dtype: int64
  In [74]: try:
  ....: s.loc[['D']]
  ....: except KeyError as e:
  ....: print("KeyError: " + str(e))
  ....:
  KeyError: "['D'] not in index"
  ```

- Assigning values to `None` now considers the dtype when choosing an ‘empty’ value (GH7941).

  Previously, assigning to `None` in numeric containers changed the dtype to object (or errored, depending on the call). It now uses `NaN`:

  ```
  In [75]: s = Series([1, 2, 3])
  In [76]: s.loc[0] = None
  In [77]: s
  Out[77]:
  0 NaN
  1 2.0
  2 3.0
  Length: 3, dtype: float64
  ```

  `NaT` is now used similarly for datetime containers.

  For object containers, we now preserve `None` values (previously these were converted to `NaN` values).

  ```
  In [78]: s = Series(["a", "b", "c"])
  In [79]: s.loc[0] = None
  In [80]: s
  Out[80]:
  0 None
  1 b
  2 c
  Length: 3, dtype: object
  ```

  To insert a `NaN`, you must explicitly use `np.nan`. See the docs.

- In prior versions, updating a pandas object inplace would not reflect in other python references to this object. (GH8511, GH5104)

  ```
  In [81]: s = Series([1, 2, 3])
  In [82]: s2 = s
  In [83]: s += 1.5
  ```

  Behavior prior to v0.15.0

  ```
  # the original object
  In [5]: s
  Out[5]:
  0 2.5
  1 3.5
  2 4.5
  dtype: float64
  # a reference to the original object
  In [7]: s2
  Out[7]:
  0 1
  1 2
  2 3
  dtype: int64
  ```

  This is now the correct behavior

  ```
  # the original object
  In [84]: s
  Out[84]:
  0 2.5
  1 3.5
  2 4.5
  Length: 3, dtype: float64
  # a reference to the original object
  In [85]: s2
  Out[85]:
  0 2.5
  1 3.5
  2 4.5
  Length: 3, dtype: float64
  ```

- Made both the C-based and Python engines for `read_csv` and `read_table` ignore empty lines in input as well as white space-filled lines, as long as `sep` is not white space. This is an API change that can be controlled by the keyword parameter `skip_blank_lines`. See the docs (GH4466)
- A timeseries/index localized to UTC when inserted into a Series/DataFrame will preserve the UTC timezone and be inserted as `object` dtype rather than being converted to a naive `datetime64[ns]` (GH8411).
- Bug in passing a `DatetimeIndex` with a timezone that was not being retained in DataFrame construction from a dict (GH7822)

  In prior versions this would drop the timezone, now it retains the timezone, but gives a column of `object` dtype:

  ```
  In [86]: i = date_range('1/1/2011', periods=3, freq='10s', tz = 'US/Eastern')
  In [87]: i
  Out[87]:
  DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-01 00:00:10-05:00',
  '2011-01-01 00:00:20-05:00'],
  dtype='datetime64[ns, US/Eastern]', freq='10S')
  In [88]: df = DataFrame( {'a' : i } )
  In [89]: df
  Out[89]:
  a
  0 2011-01-01 00:00:00-05:00
  1 2011-01-01 00:00:10-05:00
  2 2011-01-01 00:00:20-05:00
  [3 rows x 1 columns]
  In [90]: df.dtypes
  Out[90]:
  a datetime64[ns, US/Eastern]
  Length: 1, dtype: object
  ```

  Previously this would have yielded a column of `datetime64` dtype, but without timezone info.

  The behaviour of assigning a column to an existing dataframe as `df['a'] = i` remains unchanged (this already returned an `object` column with a timezone).

- When passing multiple levels to `stack()`, it will now raise a `ValueError` when the levels aren’t all level names or all level numbers (GH7660). See Reshaping by stacking and unstacking.
- Raise a `ValueError` in `df.to_hdf` with ‘fixed’ format, if `df` has non-unique columns as the resulting file will be broken (GH7761)
- `SettingWithCopy` raise/warnings (according to the option `mode.chained_assignment`) will now be issued when setting a value on a sliced mixed-dtype DataFrame using chained-assignment. (GH7845, GH7950)

  ```
  In [1]: df = DataFrame(np.arange(0,9), columns=['count'])
  In [2]: df['group'] = 'b'
  In [3]: df.iloc[0:5]['group'] = 'a'
  /usr/local/bin/ipython:1: SettingWithCopyWarning:
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  ```

- `merge`, `DataFrame.merge`, and `ordered_merge` now return the same type as the `left` argument (GH7737).
- Previously an enlargement with a mixed-dtype frame would act unlike `.append` which will preserve dtypes (related GH2578, GH8176):

  ```
  In [91]: df = DataFrame([[True, 1],[False, 2]],
  ....: columns=["female","fitness"])
  ....:
  In [92]: df
  Out[92]:
  female fitness
  0 True 1
  1 False 2
  [2 rows x 2 columns]
  In [93]: df.dtypes
  Out[93]:
  female bool
  fitness int64
  Length: 2, dtype: object
  # dtypes are now preserved
  In [94]: df.loc[2] = df.loc[1]
  In [95]: df
  Out[95]:
  female fitness
  0 True 1
  1 False 2
  2 False 2
  [3 rows x 2 columns]
  In [96]: df.dtypes
  Out[96]:
  female bool
  fitness int64
  Length: 2, dtype: object
  ```

- `Series.to_csv()` now returns a string when `path=None`, matching the behaviour of `DataFrame.to_csv()` (GH8215).
- `read_hdf` now raises `IOError` when a file that doesn’t exist is passed in. Previously, a new, empty file was created, and a `KeyError` raised (GH7715).
- `DataFrame.info()` now ends its output with a newline character (GH8114)
- Concatenating no objects will now raise a `ValueError` rather than a bare `Exception`.
- Merge errors will now be sub-classes of `ValueError` rather than raw `Exception` (GH8501)
- `DataFrame.plot` and `Series.plot` keywords now have consistent orders (GH8037)
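A few of the changes in the list above are easy to check interactively. The sketch below assumes a recent pandas release, where these behaviours still hold; the variable names are illustrative only:

```python
import pandas as pd

# In-place updates are visible through every reference to the object,
# even when the operation changes the dtype (int64 -> float64 here).
s = pd.Series([1, 2, 3])
alias = s
s += 1.5
same_object = alias is s

# Series.to_csv() with no path returns the CSV text as a string.
csv_text = pd.Series([1, 2, 3]).to_csv()

# Concatenating no objects raises ValueError rather than a bare Exception.
try:
    pd.concat([])
    concat_error = None
except ValueError as exc:
    concat_error = exc

print(same_object, list(alias))
print(type(csv_text).__name__)
print(type(concat_error).__name__)
```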

### Internal Refactoring¶

In 0.15.0 `Index` has internally been refactored to no longer sub-class `ndarray` but instead subclass `PandasObject`, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (GH5080, GH7439, GH7796, GH8024, GH8367, GH7997, GH8522):

- You may need to unpickle pandas version < 0.15.0 pickles using `pd.read_pickle` rather than `pickle.load`. See pickle docs
- When plotting with a `PeriodIndex`, the matplotlib internal axes will now be arrays of `Period` rather than a `PeriodIndex` (this is similar to how a `DatetimeIndex` passes arrays of `datetimes` now)
- MultiIndexes will now raise similarly to other pandas objects w.r.t. truth testing, see here (GH7897).
- When plotting a DatetimeIndex directly with matplotlib’s plot function, the axis labels will no longer be formatted as dates but as integers (the internal representation of a `datetime64`). **UPDATE** This is fixed in 0.15.1, see here.
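As a rough illustration of the pickle note above, the sketch below round-trips a frame through `DataFrame.to_pickle` and `pd.read_pickle`. The temporary file path is a made-up example, and the round trip here uses a freshly written pickle; it only shows the call shape recommended for loading older pickles, not the compatibility handling itself.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index(["x", "y", "z"]))

# Hypothetical location for the pickle file.
path = os.path.join(tempfile.mkdtemp(), "frame.pkl")
df.to_pickle(path)

# Prefer pd.read_pickle over pickle.load when loading pandas objects.
restored = pd.read_pickle(path)
```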

### Deprecations¶

- The `Categorical` attributes `labels` and `levels` are deprecated and renamed to `codes` and `categories`.
- The `outtype` argument to `pd.DataFrame.to_dict` has been deprecated in favor of `orient`. (GH7840)
- The `convert_dummies` method has been deprecated in favor of `get_dummies` (GH8140)
- The `infer_dst` argument in `tz_localize` will be deprecated in favor of `ambiguous` to allow for more flexibility in dealing with DST transitions. Replace `infer_dst=True` with `ambiguous='infer'` for the same behavior (GH7943). See the docs for more details.
- The top-level `pd.value_range` has been deprecated and can be replaced by `.describe()` (GH8481)
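Two of the replacements listed above can be sketched briefly: the renamed `Categorical` attributes and the `ambiguous='infer'` argument to `tz_localize`. The sample values are illustrative assumptions; the timestamps are chosen to straddle the 2011 US/Eastern fall-back transition, where 01:00 occurs twice.

```python
from datetime import timedelta

import pandas as pd

# `codes` and `categories` replace the deprecated `labels` and `levels`.
cat = pd.Categorical(["a", "b", "a", "c"])
codes = cat.codes            # integer position of each value in `categories`
categories = cat.categories  # the distinct category values

# `ambiguous='infer'` replaces `infer_dst=True`: the repeated 01:00 hour is
# resolved from the order of the timestamps.
naive = pd.DatetimeIndex([
    "2011-11-06 00:00",
    "2011-11-06 01:00",
    "2011-11-06 01:00",
    "2011-11-06 02:00",
])
localized = naive.tz_localize("US/Eastern", ambiguous="infer")
offsets = [ts.utcoffset() for ts in localized]
```

The first 01:00 is taken as daylight time (UTC-4) and the second as standard time (UTC-5).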

- The `Index` set operations `+` and `-` were deprecated in order to provide these for numeric type operations on certain index types. `+` can be replaced by `.union()` or `|`, and `-` by `.difference()`. Further the method name `Index.diff()` is deprecated and can be replaced by `Index.difference()` (GH8226)

      # +
      Index(['a','b','c']) + Index(['b','c','d'])

      # should be replaced by
      Index(['a','b','c']).union(Index(['b','c','d']))

      # -
      Index(['a','b','c']) - Index(['b','c','d'])

      # should be replaced by
      Index(['a','b','c']).difference(Index(['b','c','d']))

- The `infer_types` argument to `read_html()` now has no effect and is deprecated (GH7762, GH7032).
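For the `Index` set-operation deprecation above, a small runnable sketch of the method-based replacements follows. The variable names are illustrative; only `.union()` and `.difference()` are shown, since those are the spellings the note recommends.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

# Replacement for the deprecated `left + right` set union.
combined = left.union(right)

# Replacement for the deprecated `left - right` (and for `Index.diff()`).
only_left = left.difference(right)
```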

### Removal of prior version deprecations/changes¶

- Remove `DataFrame.delevel` method in favor of `DataFrame.reset_index`

## Enhancements¶

Enhancements in the importing/exporting of Stata files:

- Added support for bool, uint8, uint16 and uint32 data types in `to_stata` (GH7097, GH7365)
- Added conversion option when importing Stata files (GH8527)
- `DataFrame.to_stata` and `StataWriter` check string length for compatibility with limitations imposed in dta files where fixed-width strings must contain 244 or fewer characters. Attempting to write Stata dta files with strings longer than 244 characters raises a `ValueError`. (GH7858)
- `read_stata` and `StataReader` can import missing data information into a `DataFrame` by setting the argument `convert_missing` to `True`. When using this option, missing values are returned as `StataMissingValue` objects and columns containing missing values have `object` data type. (GH8045)
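The `convert_missing` behaviour described above can be sketched with a small round trip. The frame, column name, and temporary path are assumptions made for illustration; the sketch only compares the default read with a read that passes `convert_missing=True`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data: one float column with a single missing entry.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

path = os.path.join(tempfile.mkdtemp(), "example.dta")
df.to_stata(path, write_index=False)

# Default: Stata missing values come back as NaN in a float column.
plain = pd.read_stata(path)

# With convert_missing=True the column containing missing values becomes
# object dtype and the missing entries are StataMissingValue objects.
with_missing = pd.read_stata(path, convert_missing=True)
```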

Enhancements in the plotting functions:

- Added `layout` keyword to `DataFrame.plot`. You can pass a tuple of `(rows, columns)`, one of which can be `-1` to automatically infer (GH6667, GH8071).
- Allow passing multiple axes to `DataFrame.plot`, `hist` and `boxplot` (GH5353, GH6970, GH7069)
- Added support for `c`, `colormap` and `colorbar` arguments for `DataFrame.plot` with `kind='scatter'` (GH7780)
- Histogram from `DataFrame.plot` with `kind='hist'` (GH7809), See the docs.
- Boxplot from `DataFrame.plot` with `kind='box'` (GH7998), See the docs.

Other:

- `read_csv` now has a keyword parameter `float_precision` which specifies which floating-point converter the C engine should use during parsing, see here (GH8002, GH8044)
- Added `searchsorted` method to `Series` objects (GH7447)
- `describe()` on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via the `include`/`exclude` arguments. See the docs (GH8164).

      In [97]: df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
         ....:                 'catB': ['a', 'b', 'c', 'd'] * 6,
         ....:                 'numC': np.arange(24),
         ....:                 'numD': np.arange(24.) + .5})
         ....:

      In [98]: df.describe(include=["object"])
      Out[98]:
             catA catB
      count    24   24
      unique    2    4
      top     foo    b
      freq     16    6

      [4 rows x 2 columns]

      In [99]: df.describe(include=["number", "object"], exclude=["float"])
      Out[99]:
             catA catB       numC
      count    24   24  24.000000
      unique    2    4        NaN
      top     foo    b        NaN
      freq     16    6        NaN
      mean    NaN  NaN  11.500000
      std     NaN  NaN   7.071068
      min     NaN  NaN   0.000000
      25%     NaN  NaN   5.750000
      50%     NaN  NaN  11.500000
      75%     NaN  NaN  17.250000
      max     NaN  NaN  23.000000

      [11 rows x 3 columns]

  Requesting all columns is possible with the shorthand `'all'`

      In [100]: df.describe(include='all')
      Out[100]:
             catA catB       numC       numD
      count    24   24  24.000000  24.000000
      unique    2    4        NaN        NaN
      top     foo    b        NaN        NaN
      freq     16    6        NaN        NaN
      mean    NaN  NaN  11.500000  12.000000
      std     NaN  NaN   7.071068   7.071068
      min     NaN  NaN   0.000000   0.500000
      25%     NaN  NaN   5.750000   6.250000
      50%     NaN  NaN  11.500000  12.000000
      75%     NaN  NaN  17.250000  17.750000
      max     NaN  NaN  23.000000  23.500000

      [11 rows x 4 columns]

  Without those arguments, `describe` will behave as before, including only numerical columns or, if none are, only categorical columns. See also the docs
- Added `split` as an option to the `orient` argument in `pd.DataFrame.to_dict`. (GH7840)
- The `get_dummies` method can now be used on DataFrames. By default only categorical columns are encoded as 0’s and 1’s, while other columns are left untouched.

      In [101]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
         .....:                 'C': [1, 2, 3]})
         .....:

      In [102]: pd.get_dummies(df)
      Out[102]:
         C  A_a  A_b  B_b  B_c
      0  1    1    0    0    1
      1  2    0    1    0    1
      2  3    1    0    1    0

      [3 rows x 5 columns]

- `PeriodIndex` supports `resolution` in the same way as `DatetimeIndex` (GH7708)
- `pandas.tseries.holiday` has added support for additional holidays and ways to observe holidays (GH7070)
- `pandas.tseries.holiday.Holiday` now supports a list of offsets in Python3 (GH7070)
- `pandas.tseries.holiday.Holiday` now supports a days_of_week parameter (GH7070)
- `GroupBy.nth()` now supports selecting multiple nth values (GH7910)

      In [103]: business_dates = date_range(start='4/1/2014', end='6/30/2014', freq='B')

      In [104]: df = DataFrame(1, index=business_dates, columns=['a', 'b'])

      # get the first, 4th, and last date index for each month
      In [105]: df.groupby([df.index.year, df.index.month]).nth([0, 3, -1])
      Out[105]:
              a  b
      2014 4  1  1
           4  1  1
           4  1  1
           5  1  1
           5  1  1
           5  1  1
           6  1  1
           6  1  1
           6  1  1

      [9 rows x 2 columns]

- `Period` and `PeriodIndex` support addition/subtraction with `timedelta`-likes (GH7966)
- If `Period` freq is `D`, `H`, `T`, `S`, `L`, `U`, `N`, a `Timedelta`-like can be added if the result can have the same freq. Otherwise, only the same `offsets` can be added.

      In [106]: idx = pd.period_range('2014-07-01 09:00', periods=5, freq='H')

      In [107]: idx
      Out[107]:
      PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00',
                   '2014-07-01 12:00', '2014-07-01 13:00'],
                  dtype='period[H]', freq='H')

      In [108]: idx + pd.offsets.Hour(2)
      Out[108]:
      PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
                   '2014-07-01 14:00', '2014-07-01 15:00'],
                  dtype='period[H]', freq='H')

      In [109]: idx + Timedelta('120m')
      Out[109]:
      PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
                   '2014-07-01 14:00', '2014-07-01 15:00'],
                  dtype='period[H]', freq='H')

      In [110]: idx = pd.period_range('2014-07', periods=5, freq='M')

      In [111]: idx
      Out[111]: PeriodIndex(['2014-07', '2014-08', '2014-09', '2014-10', '2014-11'], dtype='period[M]', freq='M')

      In [112]: idx + pd.offsets.MonthEnd(3)
      Out[112]: PeriodIndex(['2014-10', '2014-11', '2014-12', '2015-01', '2015-02'], dtype='period[M]', freq='M')

- Added experimental compatibility with `openpyxl` for versions >= 2.0. The `DataFrame.to_excel` method `engine` keyword now recognizes `openpyxl1` and `openpyxl2` which will explicitly require openpyxl v1 and v2 respectively, failing if the requested version is not available. The `openpyxl` engine is now a meta-engine that automatically uses whichever version of openpyxl is installed. (GH7177)
- `DataFrame.fillna` can now accept a `DataFrame` as a fill value (GH8377)
- Passing multiple levels to `stack()` will now work when multiple level numbers are passed (GH7660). See Reshaping by stacking and unstacking.
- `set_names()`, `set_labels()`, and `set_levels()` methods now take an optional `level` keyword argument to allow modification of specific level(s) of a MultiIndex. Additionally `set_names()` now accepts a scalar string value when operating on an `Index` or on a specific level of a `MultiIndex` (GH7792)

      In [113]: idx = MultiIndex.from_product([['a'], range(3), list("pqr")], names=['foo', 'bar', 'baz'])

      In [114]: idx.set_names('qux', level=0)
      Out[114]:
      MultiIndex(levels=[['a'], [0, 1, 2], ['p', 'q', 'r']],
                 labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
                 names=['qux', 'bar', 'baz'])

      In [115]: idx.set_names(['qux','corge'], level=[0,1])
      Out[115]:
      MultiIndex(levels=[['a'], [0, 1, 2], ['p', 'q', 'r']],
                 labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
                 names=['qux', 'corge', 'baz'])

      In [116]: idx.set_levels(['a','b','c'], level='bar')
      Out[116]:
      MultiIndex(levels=[['a'], ['a', 'b', 'c'], ['p', 'q', 'r']],
                 labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
                 names=['foo', 'bar', 'baz'])

      In [117]: idx.set_levels([['a','b','c'],[1,2,3]], level=[1,2])
      Out[117]:
      MultiIndex(levels=[['a'], ['a', 'b', 'c'], [1, 2, 3]],
                 labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
                 names=['foo', 'bar', 'baz'])

- `Index.isin` now supports a `level` argument to specify which index level to use for membership tests (GH7892, GH7890)

      In [1]: idx = MultiIndex.from_product([[0, 1], ['a', 'b', 'c']])

      In [2]: idx.values
      Out[2]: array([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], dtype=object)

      In [3]: idx.isin(['a', 'c', 'e'], level=1)
      Out[3]: array([ True, False,  True,  True, False,  True], dtype=bool)

- `Index` now supports `duplicated` and `drop_duplicates`. (GH4060)

      In [118]: idx = Index([1, 2, 3, 4, 1, 2])

      In [119]: idx
      Out[119]: Int64Index([1, 2, 3, 4, 1, 2], dtype='int64')

      In [120]: idx.duplicated()
      Out[120]: array([False, False, False, False,  True,  True], dtype=bool)

      In [121]: idx.drop_duplicates()
      Out[121]: Int64Index([1, 2, 3, 4], dtype='int64')

- Added `copy=True` argument to `pd.concat` to enable pass through of complete blocks (GH8252)
- Added support for numpy 1.8+ data types (`bool_`, `int_`, `float_`, `string_`) for conversion to R dataframe (GH8400)
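Several of the enhancements listed above come without an example. The sketch below, using made-up data, illustrates three of them as they behave in current pandas: `Series.searchsorted`, the `split` option of `DataFrame.to_dict`, and `DataFrame.fillna` with a `DataFrame` as the fill value.

```python
import numpy as np
import pandas as pd

# Series.searchsorted: position at which a value would be inserted to keep
# an already sorted Series sorted.
sorted_values = pd.Series([1, 2, 3, 5])
position = sorted_values.searchsorted(4)

# to_dict(orient="split"): separate entries for index, columns and data.
frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
as_split = frame.to_dict(orient="split")

# fillna with a DataFrame: each missing cell is filled from the cell with the
# same row and column label in the fill frame.
holes = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})
fill = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]})
filled = holes.fillna(fill)
```

Only the cells that were missing in `holes` are taken from `fill`; existing values are left unchanged.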

## Performance¶

- Performance improvements in `DatetimeIndex.__iter__` to allow faster iteration (GH7683)
- Performance improvements in `Period` creation (and `PeriodIndex` setitem) (GH5155)
- Improvements in Series.transform for significant performance gains (revised) (GH6496)
- Performance improvements in `StataReader` when reading large files (GH8040, GH8073)
- Performance improvements in `StataWriter` when writing large files (GH8079)
- Performance and memory usage improvements in multi-key `groupby` (GH8128)
- Performance improvements in groupby `.agg` and `.apply` where builtins max/min were not mapped to numpy/cythonized versions (GH7722)
- Performance improvement in writing to sql (`to_sql`) of up to 50% (GH8208).
- Performance benchmarking of groupby for large value of ngroups (GH6787)
- Performance improvement in `CustomBusinessDay`, `CustomBusinessMonth` (GH8236)
- Performance improvement for `MultiIndex.values` for multi-level indexes containing datetimes (GH8543)

## Bug Fixes¶

- Bug in pivot_table, when using margins and a dict aggfunc (GH8349)
- Bug in `read_csv` where `squeeze=True` would return a view (GH8217)
- Bug in checking of table name in `read_sql` in certain cases (GH7826).
- Bug in `DataFrame.groupby` where `Grouper` does not recognize level when frequency is specified (GH7885)
- Bug in multiindexes dtypes getting mixed up when DataFrame is saved to SQL table (GH8021)
- Bug in `Series` 0-division with a float and integer operand dtypes (GH7785)
- Bug in `Series.astype("unicode")` not calling `unicode` on the values correctly (GH7758)
- Bug in `DataFrame.as_matrix()` with mixed `datetime64[ns]` and `timedelta64[ns]` dtypes (GH7778)
- Bug in `HDFStore.select_column()` not preserving UTC timezone info when selecting a `DatetimeIndex` (GH7777)
- Bug in `to_datetime` when `format='%Y%m%d'` and `coerce=True` are specified, where previously an object array was returned (rather than a coerced time-series with `NaT`) (GH7930)
- Bug in `DatetimeIndex` and `PeriodIndex` where in-place addition and subtraction gave a different result from the normal operation (GH6527)
- Bug where adding and subtracting `PeriodIndex` with `PeriodIndex` raised `TypeError` (GH7741)
- Bug where `combine_first` with `PeriodIndex` data raised `TypeError` (GH3367)
- Bug in MultiIndex slicing with missing indexers (GH7866)
- Bug in MultiIndex slicing with various edge cases (GH8132)
- Regression in MultiIndex indexing with a non-scalar type object (GH7914)
- Bug in `Timestamp` comparisons with `==` and `int64` dtype (GH8058)
- Bug where pickles containing `DateOffset` may raise `AttributeError` when the `normalize` attribute is referred to internally (GH7748)
- Bug in `Panel` when using `major_xs` and `copy=False` is passed (deprecation warning fails because of missing `warnings`) (GH8152).
- Bug in pickle deserialization that failed for pre-0.14.1 containers with dup items trying to avoid ambiguity when matching block and manager items, when there’s only one block there’s no ambiguity (GH7794)
- Bug in putting a `PeriodIndex` into a `Series` would convert to `int64` dtype, rather than `object` of `Periods` (GH7932)
- Bug in `HDFStore` iteration when passing a where (GH8014)
- Bug in `DataFrameGroupby.transform` when transforming with a passed non-sorted key (GH8046, GH8430)
- Bug in repeated timeseries line and area plot may result in `ValueError` or incorrect kind (GH7733)
- Bug in inference in a `MultiIndex` with `datetime.date` inputs (GH7888)
- Bug in `get` where an `IndexError` would not cause the default value to be returned (GH7725)
- Bug in `offsets.apply`, `rollforward` and `rollback` may reset nanosecond (GH7697)
- Bug in `offsets.apply`, `rollforward` and `rollback` may raise `AttributeError` if `Timestamp` has `dateutil` tzinfo (GH7697)
- Bug in sorting a MultiIndex frame with a `Float64Index` (GH8017)
- Bug in inconsistent panel setitem with a rhs of a `DataFrame` for alignment (GH7763)
- Bug in `is_superperiod` and `is_subperiod` cannot handle higher frequencies than `S` (GH7760, GH7772, GH7803)
- Bug in 32-bit platforms with `Series.shift` (GH8129)
- Bug in `PeriodIndex.unique` returns int64 `np.ndarray` (GH7540)
- Bug in `groupby.apply` with a non-affecting mutation in the function (GH8467)
- Bug in `DataFrame.reset_index` which has `MultiIndex` contains `PeriodIndex` or `DatetimeIndex` with tz raises `ValueError` (GH7746, GH7793)
- Bug in `DataFrame.plot` with `subplots=True` may draw unnecessary minor xticks and yticks (GH7801)
- Bug in `StataReader` which did not read variable labels in 117 files due to difference between Stata documentation and implementation (GH7816)
- Bug in `StataReader` where strings were always converted to 244 characters-fixed width irrespective of underlying string size (GH7858)
- Bug in `DataFrame.plot` and `Series.plot` may ignore `rot` and `fontsize` keywords (GH7844)
- Bug in `DatetimeIndex.value_counts` doesn’t preserve tz (GH7735)
- Bug in `PeriodIndex.value_counts` results in `Int64Index` (GH7735)
- Bug in `DataFrame.join` when doing left join on index and there are multiple matches (GH5391)
- Bug in `GroupBy.transform()` where int groups with a transform that didn’t preserve the index were incorrectly truncated (GH7972).
- Bug in `groupby` where callable objects without name attributes would take the wrong path, and produce a `DataFrame` instead of a `Series` (GH7929)
- Bug in `groupby` error message when a DataFrame grouping column is duplicated (GH7511)
- Bug in `read_html` where the `infer_types` argument forced coercion of date-likes incorrectly (GH7762, GH7032).
- Bug in `Series.str.cat` with an index which was filtered as to not include the first item (GH7857)
- Bug in `Timestamp` cannot parse `nanosecond` from string (GH7878)
- Bug in `Timestamp` with string offset and `tz` results incorrect (GH7833)
- Bug in `tslib.tz_convert` and `tslib.tz_convert_single` may return different results (GH7798)
- Bug in `DatetimeIndex.intersection` of non-overlapping timestamps with tz raises `IndexError` (GH7880)
- Bug in alignment with TimeOps and non-unique indexes (GH8363)
- Bug in `GroupBy.filter()` where fast path vs. slow path made the filter return a non scalar value that appeared valid but wasn’t (GH7870).
- Bug in `date_range()`/`DatetimeIndex()` when the timezone was inferred from input dates yet incorrect times were returned when crossing DST boundaries (GH7835, GH7901).
- Bug in `to_excel()` where a negative sign was being prepended to positive infinity and was absent for negative infinity (GH7949)
- Bug in area plot draws legend with incorrect `alpha` when `stacked=True` (GH8027)
- `Period` and `PeriodIndex` addition/subtraction with `np.timedelta64` results in incorrect internal representations (GH7740)
- Bug in `Holiday` with no offset or observance (GH7987)
- Bug in `DataFrame.to_latex` formatting when columns or index is a `MultiIndex` (GH7982).
- Bug in `DateOffset` around Daylight Savings Time produces unexpected results (GH5175).
- Bug in `DataFrame.shift` where empty columns would throw `ZeroDivisionError` on numpy 1.7 (GH8019)
- Bug in installation where `html_encoding/*.html` wasn’t installed and therefore some tests were not running correctly (GH7927).
- Bug in `read_html` where `bytes` objects were not tested for in `_read` (GH7927).
- Bug in `DataFrame.stack()` when one of the column levels was a datelike (GH8039)
- Bug in broadcasting numpy scalars with `DataFrame` (GH8116)
- Bug in `pivot_table` performed with nameless `index` and `columns` raises `KeyError` (GH8103)
- Bug in `DataFrame.plot(kind='scatter')` draws points and errorbars with different colors when the color is specified by `c` keyword (GH8081)
- Bug in `Float64Index` where `iat` and `at` were not testing and were failing (GH8092).
- Bug in `DataFrame.boxplot()` where y-limits were not set correctly when producing multiple axes (GH7528, GH5517).
- Bug in `read_csv` where line comments were not handled correctly given a custom line terminator or `delim_whitespace=True` (GH8122).
- Bug in `read_html` where empty tables caused a `StopIteration` (GH7575)
- Bug in casting when setting a column in a same-dtype block (GH7704)
- Bug in accessing groups from a `GroupBy` when the original grouper was a tuple (GH8121).
- Bug in `.at` that would accept integer indexers on a non-integer index and do fallback (GH7814)
- Bug with kde plot and NaNs (GH8182)
- Bug in `GroupBy.count` with float32 data type where nan values were not excluded (GH8169).
- Bug with stacked barplots and NaNs (GH8175).
- Bug in resample with non evenly divisible offsets (e.g. ‘7s’) (GH8371)
- Bug in interpolation methods with the `limit` keyword when no values needed interpolating (GH7173).
- Bug where `col_space` was ignored in `DataFrame.to_string()` when `header=False` (GH8230).
- Bug with `DatetimeIndex.asof` incorrectly matching partial strings and returning the wrong date (GH8245).
- Bug in plotting methods modifying the global matplotlib rcParams (GH8242).
- Bug in `DataFrame.__setitem__` that caused errors when setting a dataframe column to a sparse array (GH8131)
- Bug where `DataFrame.boxplot()` failed when entire column was empty (GH8181).
- Bug with messed variables in `radviz` visualization (GH8199).
- Bug in `to_clipboard` that would clip long column data (GH8305)
- Bug in `DataFrame` terminal display: Setting max_column/max_rows to zero did not trigger auto-resizing of dfs to fit terminal width/height (GH7180).
- Bug in OLS where running with “cluster” and “nw_lags” parameters did not work correctly, but also did not throw an error (GH5884).
- Bug in `DataFrame.dropna` that interpreted non-existent columns in the subset argument as the ‘last column’ (GH8303)
- Bug in `Index.intersection` on non-monotonic non-unique indexes (GH8362).
- Bug in masked series assignment where mismatching types would break alignment (GH8387)
- Bug in `NDFrame.equals` gives false negatives with dtype=object (GH8437)
- Bug in assignment with indexer where type diversity would break alignment (GH8258)
- Bug in `NDFrame.loc` indexing when row/column names were lost when target was a list/ndarray (GH6552)
- Regression in `NDFrame.loc` indexing when rows/columns were converted to Float64Index if target was an empty list/ndarray (GH7774)
- Bug in `Series` that allows it to be indexed by a `DataFrame` which has unexpected results. Such indexing is no longer permitted (GH8444)
- Bug in item assignment of a `DataFrame` with MultiIndex columns where right-hand-side columns were not aligned (GH7655)
- Suppress FutureWarning generated by NumPy when comparing object arrays containing NaN for equality (GH7065)
- Bug in `DataFrame.eval()` where the dtype of the `not` operator (`~`) was not correctly inferred as `bool`.