# What’s New¶

These are new features and improvements of note in each release.

## v0.24.0 (Month XX, 2018)¶

Warning

Starting January 1, 2019, pandas feature releases will support Python 3 only. See Plan for dropping Python 2.7 for more.

### New features¶

- `merge()` now directly allows merge between objects of type `DataFrame` and named `Series`, without the need to convert the `Series` object into a `DataFrame` beforehand (GH21220)
- `ExcelWriter` now accepts `mode` as a keyword argument, enabling append to existing workbooks when using the `openpyxl` engine (GH3441)
- `DataFrame.to_parquet()` now accepts `index` as an argument, allowing the user to override the engine's default behavior to include or omit the dataframe's indexes from the resulting Parquet file (GH20768)
- `DataFrame.corr()` and `Series.corr()` now accept a callable for generic calculation methods of correlation, e.g. histogram intersection (GH22684); see the sketch after this list
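A minimal sketch of the callable `method` for `corr()`; `histogram_intersection` here is our own helper, not a pandas function:

```
import numpy as np
import pandas as pd

# The callable receives two 1-D arrays and must return a scalar.
def histogram_intersection(a, b):
    return np.minimum(a, b).sum()

s1 = pd.Series([0.2, 0.0, 0.6, 0.2])
s2 = pd.Series([0.3, 0.6, 0.0, 0.1])
s1.corr(s2, method=histogram_intersection)  # 0.3
```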

#### `ExtensionArray` Operator Support¶

A `Series` based on an `ExtensionArray` now supports arithmetic and comparison operators (GH19577). There are two approaches for providing operator support for an `ExtensionArray`:

- Define each of the operators on your `ExtensionArray` subclass.
- Use an operator implementation from pandas that depends on operators that are already defined on the underlying elements (scalars) of the `ExtensionArray` (a sketch of this approach follows below).

See the ExtensionArray Operator Support documentation section for details on both ways of adding operator support.
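A minimal sketch of the second approach, assuming the opt-in `ExtensionScalarOpsMixin` described in the extending documentation; `MyExtensionArray` is a hypothetical subclass and the required interface methods are elided:

```
from pandas.api.extensions import ExtensionArray, ExtensionScalarOpsMixin

class MyExtensionArray(ExtensionScalarOpsMixin, ExtensionArray):
    # ... implement the required ExtensionArray interface here
    # (_from_sequence, __getitem__, __len__, dtype, isna, take, copy, ...)
    pass

# Build arithmetic and comparison operators from the element-wise
# (scalar) operators already defined on the array's elements:
MyExtensionArray._add_arithmetic_ops()
MyExtensionArray._add_comparison_ops()
```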

#### Optional Integer NA Support¶

Pandas has gained the ability to hold integer dtypes with missing values. This long-requested feature is enabled through the use of extension types. Here is an example of the usage.

We can construct a `Series` with the specified dtype. The dtype string `Int64` is a pandas `ExtensionDtype`. Specifying a list or array using the traditional missing value marker of `np.nan` will infer to integer dtype. The display of the `Series` will also use the `NaN` to indicate missing values in string outputs (GH20700, GH20747, GH22441, GH21789, GH22346).

```
In [1]: s = pd.Series([1, 2, np.nan], dtype='Int64')
In [2]: s
Out[2]:
0 1
1 2
2 NaN
dtype: Int64
```

Operations on these dtypes will propagate `NaN`, as with other pandas operations.

```
# arithmetic
In [3]: s + 1
Out[3]:
0 2
1 3
2 NaN
dtype: Int64
# comparison
In [4]: s == 1
Out[4]:
0 True
1 False
2 False
dtype: bool
# indexing
In [5]: s.iloc[1:3]
Out[5]:
1 2
2 NaN
dtype: Int64
# operate with other dtypes
In [6]: s + s.iloc[1:3].astype('Int8')
Out[6]:
0 NaN
1 4
2 NaN
dtype: Int64
# coerce when needed
In [7]: s + 0.01
Out[7]:
0 1.01
1 2.01
2 NaN
dtype: float64
```

These dtypes can operate as part of a `DataFrame`.

```
In [8]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
In [9]: df
Out[9]:
A B C
0 1 1 a
1 2 1 a
2 NaN 3 b
In [10]: df.dtypes
Out[10]:
A Int64
B int64
C object
dtype: object
```

These dtypes can be merged, reshaped, and cast.

```
In [11]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
Out[11]:
A Int64
B int64
C object
dtype: object
In [12]: df['A'].astype(float)
Out[12]:
0 1.0
1 2.0
2 NaN
Name: A, dtype: float64
```

Reduction and groupby operations such as `sum` work.

```
In [13]: df.sum()
Out[13]:
A 3
B 5
C aab
dtype: object
In [14]: df.groupby('B').A.sum()
Out[14]:
B
1 3
3 0
Name: A, dtype: int64
```

Warning

The Integer NA support currently uses the capitalized dtype version, e.g. `Int8` as compared to the traditional `int8`. This may be changed at a future date.

#### `read_html` Enhancements¶

`read_html()` previously ignored `colspan` and `rowspan` attributes. Now it understands them, treating them as sequences of cells with the same value (GH17054).

```
In [15]: result = pd.read_html("""
....: <table>
....: <thead>
....: <tr>
....: <th>A</th><th>B</th><th>C</th>
....: </tr>
....: </thead>
....: <tbody>
....: <tr>
....: <td colspan="2">1</td><td>2</td>
....: </tr>
....: </tbody>
....: </table>""")
....:
```

Previous Behavior:

```
In [13]: result
Out [13]:
[ A B C
0 1 2 NaN]
```

Current Behavior:

```
In [16]: result
Out[16]:
[ A B C
0 1 1 2]
```

#### Storing Interval Data in Series and DataFrame¶

Interval data may now be stored in a `Series` or `DataFrame`, in addition to an `IntervalIndex` as before (GH19453).

```
In [17]: ser = pd.Series(pd.interval_range(0, 5))
In [18]: ser
Out[18]:
0 (0, 1]
1 (1, 2]
2 (2, 3]
3 (3, 4]
4 (4, 5]
dtype: interval
In [19]: ser.dtype
Out[19]: interval[int64]
```

Previously, these would be cast to a NumPy array of `Interval` objects. In general, this should result in better performance when storing an array of intervals in a `Series`.

Note that the `.values` of a `Series` containing intervals is no longer a NumPy array, but rather an `ExtensionArray`:

```
In [20]: ser.values
Out[20]:
IntervalArray([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
closed='right',
dtype='interval[int64]')
```

This is the same behavior as `Series.values` for categorical data. See IntervalIndex.values is now an IntervalArray for more.

#### Other Enhancements¶

- `to_datetime()` now supports the `%Z` and `%z` directives when passed into `format` (GH13486)
- `Series.mode()` and `DataFrame.mode()` now support the `dropna` parameter, which can be used to specify whether `NaN`/`NaT` values should be considered (GH17534)
- `to_csv()` now supports the `compression` keyword when a file handle is passed (GH21227)
- `Index.droplevel()` is now implemented also for flat indexes, for compatibility with `MultiIndex` (GH21115)
- `Series.droplevel()` and `DataFrame.droplevel()` are now implemented (GH20342)
- Added support for reading from/writing to Google Cloud Storage via the `gcsfs` library (GH19454, GH23094)
- `to_gbq()` and `read_gbq()` signature and documentation updated to reflect changes from the Pandas-GBQ library version 0.6.0 (GH21627, GH22557)
- New method `HDFStore.walk()` will recursively walk the group hierarchy of an HDF5 file (GH10932)
- `read_html()` copies cell data across `colspan` and `rowspan`, and it treats all-`th` table rows as headers if the `header` kwarg is not given and there is no `thead` (GH17054)
- `Series.nlargest()`, `Series.nsmallest()`, `DataFrame.nlargest()`, and `DataFrame.nsmallest()` now accept the value `"all"` for the `keep` argument. This keeps all ties for the nth largest/smallest value (GH16818); see the sketch after this list
- `IntervalIndex` has gained the `set_closed()` method to change the existing `closed` value (GH21670)
- `to_csv()` and `to_json()` now support `compression='infer'` to infer compression based on filename extension (GH15008). The default compression for the `to_csv`, `to_json`, and `to_pickle` methods has been updated to `'infer'` (GH22004)
- `to_timedelta()` now supports ISO-formatted timedelta strings (GH21877)
- `Series` and `DataFrame` now support `Iterable` objects in the constructor (GH2193)
- `DatetimeIndex` has gained the `DatetimeIndex.timetz` attribute, which returns the local time with timezone information (GH21358)
- `round()`, `ceil()`, and `floor()` for `DatetimeIndex` and `Timestamp` now support an `ambiguous` argument for handling datetimes that are rounded to ambiguous times (GH18946)
- `Resampler` is now iterable like `GroupBy` (GH15314)
- `Series.resample()` and `DataFrame.resample()` have gained the `Resampler.quantile()` method (GH15023)
- `Index.to_frame()` now supports overriding column name(s) (GH22580)
- New attribute `__git_version__` will return the git commit sha of the current build (GH21295)
- Compatibility with Matplotlib 3.0 (GH22790)
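Two of the enhancements above in a quick sketch (a hand-written illustration, not captured console output):

```
import pandas as pd

s = pd.Series([3, 2, 2, 1])
# keep="all" retains every value tied for the nth largest slot (GH16818):
s.nlargest(2, keep="all")      # 3, 2, 2 -- both tied 2s are kept

idx = pd.interval_range(0, 3)  # intervals closed on the right by default
idx.set_closed("both")         # same breaks, now closed on both sides (GH21670)
```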

### Backwards incompatible API changes¶

- A newly constructed empty `DataFrame` with integer as the `dtype` will now only be cast to `float64` if `index` is specified (GH22858); a hedged illustration follows
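Our reading of this change, as a sketch (the exact console output is not reproduced here):

```
import pandas as pd

# No index: the frame is truly empty, so the requested int64 dtype is kept.
pd.DataFrame(columns=["a"], dtype="int64").dtypes
# a    int64

# With an index but no data, the values are NaN, so the column is still
# cast to float64 -- the only case where the cast now happens.
pd.DataFrame(columns=["a"], index=[0], dtype="int64").dtypes
# a    float64
```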

#### Dependencies have increased minimum versions¶

We have updated our minimum supported versions of dependencies (GH21242). If installed, we now require:

| Package | Minimum Version | Required |
|---|---|---|
| numpy | 1.12.0 | X |
| bottleneck | 1.2.0 | |
| matplotlib | 2.0.0 | |
| numexpr | 2.6.1 | |
| pytables | 3.4.2 | |
| scipy | 0.18.1 | |

#### `IntervalIndex.values` is now an `IntervalArray`¶

The `values` attribute of an `IntervalIndex` now returns an `IntervalArray`, rather than a NumPy array of `Interval` objects (GH19453).

Previous Behavior:

```
In [1]: idx = pd.interval_range(0, 4)
In [2]: idx.values
Out[2]:
array([Interval(0, 1, closed='right'), Interval(1, 2, closed='right'),
Interval(2, 3, closed='right'), Interval(3, 4, closed='right')],
dtype=object)
```

New Behavior:

```
In [21]: idx = pd.interval_range(0, 4)
In [22]: idx.values
Out[22]:
IntervalArray([(0, 1], (1, 2], (2, 3], (3, 4]],
closed='right',
dtype='interval[int64]')
```

This mirrors `CategoricalIndex.values`, which returns a `Categorical`.

For situations where you need an `ndarray` of `Interval` objects, use `numpy.asarray()` or `idx.astype(object)`.

```
In [23]: np.asarray(idx)
Out[23]:
array([Interval(0, 1, closed='right'), Interval(1, 2, closed='right'),
Interval(2, 3, closed='right'), Interval(3, 4, closed='right')], dtype=object)
In [24]: idx.values.astype(object)
Out[24]:
array([Interval(0, 1, closed='right'), Interval(1, 2, closed='right'),
Interval(2, 3, closed='right'), Interval(3, 4, closed='right')], dtype=object)
```

#### Parsing Datetime Strings with Timezone Offsets¶

Previously, parsing datetime strings with UTC offsets with `to_datetime()` or `DatetimeIndex` would automatically convert the datetime to UTC without timezone localization. This was inconsistent with parsing the same datetime string with `Timestamp`, which would preserve the UTC offset in the `tz` attribute. Now, `to_datetime()` preserves the UTC offset in the `tz` attribute when all the datetime strings have the same UTC offset (GH17697, GH11736, GH22457).

*Previous Behavior*:

```
In [2]: pd.to_datetime("2015-11-18 15:30:00+05:30")
Out[2]: Timestamp('2015-11-18 10:00:00')
In [3]: pd.Timestamp("2015-11-18 15:30:00+05:30")
Out[3]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')
# Different UTC offsets would automatically convert the datetimes to UTC (without a UTC timezone)
In [4]: pd.to_datetime(["2015-11-18 15:30:00+05:30", "2015-11-18 16:30:00+06:30"])
Out[4]: DatetimeIndex(['2015-11-18 10:00:00', '2015-11-18 10:00:00'], dtype='datetime64[ns]', freq=None)
```

*Current Behavior*:

```
In [25]: pd.to_datetime("2015-11-18 15:30:00+05:30")
Out[25]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')
In [26]: pd.Timestamp("2015-11-18 15:30:00+05:30")
Out[26]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')
```

Parsing datetime strings with the same UTC offset will preserve the UTC offset in the `tz` attribute:

```
In [27]: pd.to_datetime(["2015-11-18 15:30:00+05:30"] * 2)
Out[27]: DatetimeIndex(['2015-11-18 15:30:00+05:30', '2015-11-18 15:30:00+05:30'], dtype='datetime64[ns, pytz.FixedOffset(330)]', freq=None)
```

Parsing datetime strings with different UTC offsets will now create an Index of `datetime.datetime` objects with different UTC offsets:

```
In [28]: idx = pd.to_datetime(["2015-11-18 15:30:00+05:30", "2015-11-18 16:30:00+06:30"])
In [29]: idx
Out[29]: Index([2015-11-18 15:30:00+05:30, 2015-11-18 16:30:00+06:30], dtype='object')
In [30]: idx[0]
Out[30]: datetime.datetime(2015, 11, 18, 15, 30, tzinfo=tzoffset(None, 19800))
In [31]: idx[1]
Out[31]: datetime.datetime(2015, 11, 18, 16, 30, tzinfo=tzoffset(None, 23400))
```

Passing `utc=True` will mimic the previous behavior but will correctly indicate that the dates have been converted to UTC:

```
In [32]: pd.to_datetime(["2015-11-18 15:30:00+05:30", "2015-11-18 16:30:00+06:30"], utc=True)
Out[32]: DatetimeIndex(['2015-11-18 10:00:00+00:00', '2015-11-18 10:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
```

#### CalendarDay Offset¶

`Day` and the associated frequency alias `'D'` were documented to represent a calendar day; however, arithmetic and operations with `Day` sometimes respected absolute time instead (i.e. `Day(n)` acted identically to `Timedelta(days=n)`).

*Previous Behavior*:

```
In [2]: ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')
# Respects calendar arithmetic
In [3]: pd.date_range(start=ts, freq='D', periods=3)
Out[3]:
DatetimeIndex(['2016-10-30 00:00:00+03:00', '2016-10-31 00:00:00+02:00',
'2016-11-01 00:00:00+02:00'],
dtype='datetime64[ns, Europe/Helsinki]', freq='D')
# Respects absolute arithmetic
In [4]: ts + pd.tseries.frequencies.to_offset('D')
Out[4]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
```

`CalendarDay` and the associated frequency alias `'CD'` are now available and respect calendar day arithmetic, while `Day` and the frequency alias `'D'` will now respect absolute time (GH22274, GH20596, GH16980, GH8774). See the documentation here for more information.

Addition with `CalendarDay` across a daylight savings time transition:

```
In [33]: ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')
In [34]: ts + pd.offsets.Day(1)
Out[34]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
In [35]: ts + pd.offsets.CalendarDay(1)
Out[35]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')
```

#### Time values in `dt.end_time` and `to_timestamp(how='end')`¶

The time values in `Period` and `PeriodIndex` objects are now set to '23:59:59.999999999' when calling `Series.dt.end_time`, `Period.end_time`, `PeriodIndex.end_time`, `Period.to_timestamp()` with `how='end'`, or `PeriodIndex.to_timestamp()` with `how='end'` (GH17157).

Previous Behavior:

```
In [2]: p = pd.Period('2017-01-01', 'D')
In [3]: pi = pd.PeriodIndex([p])
In [4]: pd.Series(pi).dt.end_time[0]
Out[4]: Timestamp(2017-01-01 00:00:00)
In [5]: p.end_time
Out[5]: Timestamp(2017-01-01 23:59:59.999999999)
```

Current Behavior:

Calling `Series.dt.end_time` will now result in a time of '23:59:59.999999999', as is the case with `Period.end_time`, for example:

```
In [36]: p = pd.Period('2017-01-01', 'D')
In [37]: pi = pd.PeriodIndex([p])
In [38]: pd.Series(pi).dt.end_time[0]
Out[38]: Timestamp('2017-01-01 23:59:59.999999999')
In [39]: p.end_time
Out[39]: Timestamp('2017-01-01 23:59:59.999999999')
```

#### Sparse Data Structure Refactor¶

`SparseArray`, the array backing `SparseSeries` and the columns in a `SparseDataFrame`, is now an extension array (GH21978, GH19056, GH22835). To conform to this interface and for consistency with the rest of pandas, some API breaking changes were made:

- `SparseArray` is no longer a subclass of `numpy.ndarray`. To convert a `SparseArray` to a NumPy array, use `numpy.asarray()`.
- `SparseArray.dtype` and `SparseSeries.dtype` are now instances of `SparseDtype`, rather than `np.dtype`. Access the underlying dtype with `SparseDtype.subtype` (see the sketch at the end of this section).
- `numpy.asarray(sparse_array)` now returns a dense array with all the values, not just the non-fill-value values (GH14167)
- `SparseArray.take` now matches the API of `pandas.api.extensions.ExtensionArray.take()` (GH19506):
  - The default value of `allow_fill` has changed from `False` to `True`.
  - The `out` and `mode` parameters are no longer accepted (previously, this raised if they were specified).
  - Passing a scalar for `indices` is no longer allowed.
- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a `SparseSeries`.
- `SparseDataFrame.combine` and `DataFrame.combine_first` no longer support combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype `SparseArray`.
- Setting `SparseArray.fill_value` to a fill value with a different dtype is now allowed.

Some new warnings are issued for operations that require or are likely to materialize a large dense array:

- An `errors.PerformanceWarning` is issued when using `fillna` with a `method`, as a dense array is constructed to create the filled array. Filling with a `value` is the efficient way to fill a sparse array.
- An `errors.PerformanceWarning` is now issued when concatenating sparse Series with differing fill values. The fill value from the first sparse array continues to be used.

In addition to these API breaking changes, many performance improvements and bug fixes have been made.
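A short sketch of the first two API changes above (a hedged illustration):

```
import numpy as np
import pandas as pd

arr = pd.SparseArray([0, 0, 1, 2])

# numpy.asarray now densifies to *all* values, not just the non-fill values:
np.asarray(arr)      # array([0, 0, 1, 2])

# The dtype is a SparseDtype; the underlying numpy dtype lives on .subtype:
arr.dtype            # Sparse[int64, 0]
arr.dtype.subtype    # dtype('int64')
```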

#### Raise ValueError in `DataFrame.to_dict(orient='index')`¶

Bug in `DataFrame.to_dict()`: it now raises a `ValueError` when used with `orient='index'` and a non-unique index, instead of losing data (GH22801):

```
In [40]: df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.75]}, index=['A', 'A'])
In [41]: df
Out[41]:
a b
A 1 0.50
A 2 0.75
In [42]: df.to_dict(orient='index')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-f5309a7c6adb> in <module>()
----> 1 df.to_dict(orient='index')
~/build/pandas-dev/pandas/pandas/core/frame.py in to_dict(self, orient, into)
1229 if not self.index.is_unique:
1230 raise ValueError(
-> 1231 "DataFrame index must be unique for orient='index'."
1232 )
1233 return into_c((t[0], dict(zip(self.columns, t[1:])))
ValueError: DataFrame index must be unique for orient='index'.
```

#### Tick DateOffset Normalize Restrictions¶

Creating a `Tick` object (`Day`, `Hour`, `Minute`, `Second`, `Milli`, `Micro`, `Nano`) with `normalize=True` is no longer supported. This prevents unexpected behavior where addition could fail to be monotone or associative (GH21427).

*Previous Behavior*:

```
In [2]: ts = pd.Timestamp('2018-06-11 18:01:14')
In [3]: ts
Out[3]: Timestamp('2018-06-11 18:01:14')
In [4]: tic = pd.offsets.Hour(n=2, normalize=True)
...:
In [5]: tic
Out[5]: <2 * Hours>
In [6]: ts + tic
Out[6]: Timestamp('2018-06-11 00:00:00')
In [7]: ts + tic + tic + tic == ts + (tic + tic + tic)
Out[7]: False
```

*Current Behavior*:

```
In [43]: ts = pd.Timestamp('2018-06-11 18:01:14')
In [44]: tic = pd.offsets.Hour(n=2)
In [45]: ts + tic + tic + tic == ts + (tic + tic + tic)
Out[45]: True
```

#### Period Subtraction¶

Subtraction of a `Period` from another `Period` will give a `DateOffset` instead of an integer (GH21314):

```
In [46]: june = pd.Period('June 2018')
In [47]: april = pd.Period('April 2018')
In [48]: june - april
Out[48]: <2 * MonthEnds>
```

Previous Behavior:

```
In [2]: june = pd.Period('June 2018')
In [3]: april = pd.Period('April 2018')
In [4]: june - april
Out [4]: 2
```

Similarly, subtraction of a `Period` from a `PeriodIndex` will now return an `Index` of `DateOffset` objects instead of an `Int64Index`:

```
In [49]: pi = pd.period_range('June 2018', freq='M', periods=3)
In [50]: pi - pi[0]
Out[50]: Index([<0 * MonthEnds>, <MonthEnd>, <2 * MonthEnds>], dtype='object')
```

Previous Behavior:

```
In [2]: pi = pd.period_range('June 2018', freq='M', periods=3)
In [3]: pi - pi[0]
Out[3]: Int64Index([0, 1, 2], dtype='int64')
```

#### Addition/Subtraction of `NaN` from `DataFrame`¶

Adding or subtracting `NaN` from a `DataFrame` column with `timedelta64[ns]` dtype will now raise a `TypeError` instead of returning all-`NaT`. This is for compatibility with `TimedeltaIndex` and `Series` behavior (GH22163):

```
In [51]: df = pd.DataFrame([pd.Timedelta(days=1)])
In [52]: df - np.nan
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-2fbc21b58712> in <module>()
----> 1 df - np.nan
~/build/pandas-dev/pandas/pandas/core/ops.py in f(self, other, axis, level, fill_value)
1879
1880 pass_op = op if lib.is_scalar(other) else na_op
-> 1881 return self._combine_const(other, pass_op, try_cast=True)
1882
1883 f.__name__ = op_name
~/build/pandas-dev/pandas/pandas/core/frame.py in _combine_const(self, other, func, errors, try_cast)
4952 def _combine_const(self, other, func, errors='raise', try_cast=True):
4953 if lib.is_scalar(other) or np.ndim(other) == 0:
-> 4954 return ops.dispatch_to_series(self, other, func)
4955
4956 new_data = self._data.eval(func=func, other=other,
~/build/pandas-dev/pandas/pandas/core/ops.py in dispatch_to_series(left, right, func, str_rep, axis)
1722 raise NotImplementedError(right)
1723
-> 1724 new_data = expressions.evaluate(column_op, str_rep, left, right)
1725
1726 result = left._constructor(new_data, index=left.index, copy=False)
~/build/pandas-dev/pandas/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
203 use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)
204 if use_numexpr:
--> 205 return _evaluate(op, op_str, a, b, **eval_kwargs)
206 return _evaluate_standard(op, op_str, a, b)
207
~/build/pandas-dev/pandas/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
118
119 if result is None:
--> 120 result = _evaluate_standard(op, op_str, a, b)
121
122 return result
~/build/pandas-dev/pandas/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
63 _store_test_result(False)
64 with np.errstate(all='ignore'):
---> 65 return op(a, b)
66
67
~/build/pandas-dev/pandas/pandas/core/ops.py in column_op(a, b)
1693 def column_op(a, b):
1694 return {i: func(a.iloc[:, i], b)
-> 1695 for i in range(len(a.columns))}
1696
1697 elif isinstance(right, ABCDataFrame):
~/build/pandas-dev/pandas/pandas/core/ops.py in <dictcomp>(.0)
1693 def column_op(a, b):
1694 return {i: func(a.iloc[:, i], b)
-> 1695 for i in range(len(a.columns))}
1696
1697 elif isinstance(right, ABCDataFrame):
~/build/pandas-dev/pandas/pandas/core/ops.py in wrapper(left, right)
1308
1309 elif is_timedelta64_dtype(left):
-> 1310 result = dispatch_to_index_op(op, left, right, pd.TimedeltaIndex)
1311 return construct_result(left, result,
1312 index=left.index, name=res_name,
~/build/pandas-dev/pandas/pandas/core/ops.py in dispatch_to_index_op(op, left, right, index_class)
1360 left_idx = left_idx._shallow_copy(freq=None)
1361 try:
-> 1362 result = op(left_idx, right)
1363 except NullFrequencyError:
1364 # DatetimeIndex and TimedeltaIndex with freq == None raise ValueError
TypeError: unsupported operand type(s) for -: 'TimedeltaIndex' and 'float'
```

Previous Behavior:

```
In [4]: df = pd.DataFrame([pd.Timedelta(days=1)])
In [5]: df - np.nan
Out[5]:
0
0 NaT
```

#### DataFrame Arithmetic Operations Broadcasting Changes¶

`DataFrame` arithmetic operations, when operating with 2-dimensional `np.ndarray` objects, now broadcast in the same way as `np.ndarray` objects broadcast (GH23000).

Previous Behavior:

```
In [3]: arr = np.arange(6).reshape(3, 2)
In [4]: df = pd.DataFrame(arr)
In [5]: df + arr[[0], :] # 1 row, 2 columns
...
ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2)
In [6]: df + arr[:, [1]] # 1 column, 3 rows
...
ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (3, 1)
```

*Current Behavior*:
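The operations above now succeed, broadcasting as NumPy would. A sketch with hand-computed results (this block is our illustration, not captured console output):

```
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr)

df + arr[[0], :]   # the single row broadcasts down the frame
#    0  1
# 0  0  2
# 1  2  4
# 2  4  6

df + arr[:, [1]]   # the single column broadcasts across the frame
#     0   1
# 0   1   2
# 1   5   6
# 2   9  10
```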

#### ExtensionType Changes¶

**`pandas.api.extensions.ExtensionDtype` Equality and Hashability**

Pandas now requires that extension dtypes be hashable. The base class implements a default `__eq__` and `__hash__`. If you have a parametrized dtype, you should update the `ExtensionDtype._metadata` tuple to match the signature of your `__init__` method. See `pandas.api.extensions.ExtensionDtype` for more (GH22476).
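A minimal sketch of a parametrized dtype; `unit` is a hypothetical parameter, and a real dtype must also implement `name`, `type`, and the rest of the interface:

```
from pandas.api.extensions import ExtensionDtype

class MyParametrizedDtype(ExtensionDtype):
    # _metadata names the attributes that parametrize the dtype; the base
    # class derives the default __eq__ and __hash__ from them.
    _metadata = ("unit",)

    def __init__(self, unit="ns"):
        self.unit = unit
```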

**Other changes**

- `ExtensionArray` has gained the abstract method `.dropna()` (GH21185)
- `ExtensionDtype` has gained the ability to instantiate from string dtypes, e.g. `decimal` would instantiate a registered `DecimalDtype`; furthermore the `ExtensionDtype` has gained the method `construct_array_type` (GH21185)
- An `ExtensionArray` with a boolean dtype now works correctly as a boolean indexer. `pandas.api.types.is_bool_dtype()` now properly considers them boolean (GH22326)
- Added `ExtensionDtype._is_numeric` for controlling whether an extension dtype is considered numeric (GH22290)
- The `ExtensionArray` constructor, `_from_sequence`, now takes the keyword argument `copy=False` (GH21185)
- Bug in `Series.get()` for `Series` using `ExtensionArray` and integer index (GH21257)
- `shift()` now dispatches to `ExtensionArray.shift()` (GH22386)
- `Series.combine()` works correctly with `ExtensionArray` inside of `Series` (GH20825)
- `Series.combine()` with scalar argument now works for any function type (GH21248)
- `Series.astype()` and `DataFrame.astype()` now dispatch to `ExtensionArray.astype()` (GH21185)
- Slicing a single row of a `DataFrame` with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (GH22784)
- Added `pandas.api.types.register_extension_dtype()` to register an extension type with pandas (GH22664)
- Series backed by an `ExtensionArray` now work with `util.hash_pandas_object()` (GH23066)
- Updated the `.type` attribute for `PeriodDtype`, `DatetimeTZDtype`, and `IntervalDtype` to be instances of the dtype (`Period`, `Timestamp`, and `Interval` respectively) (GH22938)
- `ExtensionArray.isna()` is allowed to return an `ExtensionArray` (GH22325)
- Support for reduction operations such as `sum`, `mean` via opt-in base class method override (GH22762)

#### Series and Index Data-Dtype Incompatibilities¶

`Series` and `Index` constructors now raise when the data is incompatible with a passed `dtype=` (GH15832).

Previous Behavior:

```
In [4]: pd.Series([-1], dtype="uint64")
Out [4]:
0 18446744073709551615
dtype: uint64
```

Current Behavior:

```
In [4]: pd.Series([-1], dtype="uint64")
Out [4]:
...
OverflowError: Trying to coerce negative values to unsigned integers
```

#### Crosstab Preserves Dtypes¶

`crosstab()` will now preserve dtypes in some cases that previously would cast from integer dtype to floating dtype (GH22019).

Previous Behavior:

```
In [3]: df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4],
...: 'c': [1, 1, np.nan, 1, 1]})
In [4]: pd.crosstab(df.a, df.b, normalize='columns')
Out[4]:
b 3 4
a
1 0.5 0.0
2 0.5 1.0
```

Current Behavior:

```
In [3]: df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4],
...: 'c': [1, 1, np.nan, 1, 1]})
In [4]: pd.crosstab(df.a, df.b, normalize='columns')
```

#### Datetimelike API Changes¶

- For `DatetimeIndex` and `TimedeltaIndex` with a non-`None` `freq` attribute, addition or subtraction of an integer-dtyped array or `Index` will return an object of the same class (GH19959)
- `DateOffset` objects are now immutable. Attempting to alter one of these will now raise `AttributeError` (GH21341)
- `PeriodIndex` subtraction of another `PeriodIndex` will now return an object-dtype `Index` of `DateOffset` objects instead of raising a `TypeError` (GH20049)
- `cut()` and `qcut()` now return `DatetimeIndex` or `TimedeltaIndex` bins when the input is datetime or timedelta dtype respectively and `retbins=True` (GH19891)
- `DatetimeIndex.to_period()` and `Timestamp.to_period()` will issue a warning when timezone information will be lost (GH21333)

#### Other API Changes¶

- `DatetimeIndex` now accepts `Int64Index` arguments as epoch timestamps (GH20997)
- Accessing a level of a `MultiIndex` with a duplicate name (e.g. in `get_level_values()`) now raises a `ValueError` instead of a `KeyError` (GH21678)
- Invalid construction of `IntervalDtype` will now always raise a `TypeError` rather than a `ValueError` if the subdtype is invalid (GH21185)
- Trying to reindex a `DataFrame` with a non-unique `MultiIndex` now raises a `ValueError` instead of an `Exception` (GH21770)
- `PeriodIndex.tz_convert()` and `PeriodIndex.tz_localize()` have been removed (GH21781)
- `Index` subtraction will attempt to operate element-wise instead of raising `TypeError` (GH19369)
- `pandas.io.formats.style.Styler` supports a `number-format` property when using `to_excel()` (GH22015)
- `DataFrame.corr()` and `Series.corr()` now raise a `ValueError` along with a helpful error message instead of a `KeyError` when supplied with an invalid method (GH22298)
- `shift()` will now always return a copy, instead of the previous behaviour of returning self when shifting by 0 (GH22397)
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (GH22784)
- The `DateOffset` attribute `_cacheable` and method `_should_cache` have been removed (GH23118)

### Deprecations¶

- `DataFrame.to_stata()`, `read_stata()`, `StataReader` and `StataWriter` have deprecated the `encoding` argument. The encoding of a Stata dta file is determined by the file type and cannot be changed (GH21244)
- `MultiIndex.to_hierarchical()` is deprecated and will be removed in a future version (GH21613)
- `Series.ptp()` is deprecated. Use `numpy.ptp` instead (GH21614)
- `Series.compress()` is deprecated. Use `Series[condition]` instead (GH18262)
- The signature of `Series.to_csv()` has been made uniform with that of `DataFrame.to_csv()`: the name of the first argument is now `path_or_buf`, the order of subsequent arguments has changed, and the `header` argument now defaults to `True` (GH19715)
- `Categorical.from_codes()` has deprecated providing float values for the `codes` argument (GH21767)
- `pandas.read_table()` is deprecated. Instead, use `pandas.read_csv()` passing `sep='\t'` if necessary (GH21948); see the sketch after this list
- `Series.str.cat()` has deprecated using arbitrary list-likes *within* list-likes. A list-like container may still contain many `Series`, `Index` or 1-dimensional `np.ndarray`, or alternatively, only scalar values (GH21950)
- `FrozenNDArray.searchsorted()` has deprecated the `v` parameter in favor of `value` (GH14645)
- `DatetimeIndex.shift()` and `PeriodIndex.shift()` now accept a `periods` argument instead of `n` for consistency with `Index.shift()` and `Series.shift()`. Using `n` throws a deprecation warning (GH22458, GH22912)
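For the `pandas.read_table()` deprecation above, the replacement is a one-liner (sketch; `"data.tsv"` is a hypothetical tab-separated file):

```
import pandas as pd

# Instead of the deprecated pd.read_table("data.tsv"):
df = pd.read_csv("data.tsv", sep="\t")
```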

### Removal of prior version deprecations/changes¶

- The `LongPanel` and `WidePanel` classes have been removed (GH10892)
- `Series.repeat()` has renamed the `reps` argument to `repeats` (GH14645)
- Several private functions were removed from the (non-public) module `pandas.core.common` (GH22001)
- Removal of the previously deprecated module `pandas.core.datetools` (GH14105, GH14094)
- Strings passed into `DataFrame.groupby()` that refer to both column and index levels will raise a `ValueError` (GH14432)
- `Index.repeat()` and `MultiIndex.repeat()` have renamed the `n` argument to `repeats` (GH14645)
- Removal of the previously deprecated `as_indexer` keyword completely from `str.match()` (GH22356, GH6581)
- Removed the `pandas.formats.style` shim for `pandas.io.formats.style.Styler` (GH16059)
- `Categorical.searchsorted()` and `Series.searchsorted()` have renamed the `v` argument to `value` (GH14645)
- `TimedeltaIndex.searchsorted()`, `DatetimeIndex.searchsorted()`, and `PeriodIndex.searchsorted()` have renamed the `key` argument to `value` (GH14645)
- Removal of the previously deprecated module `pandas.json` (GH19944)

### Performance Improvements¶

- Very large improvement in performance of slicing when the index is a `CategoricalIndex`, both when indexing by label (using `.loc`) and position (`.iloc`). Likewise, slicing a `CategoricalIndex` itself (i.e. `ci[100:200]`) shows similar speed improvements (GH21659)
- Improved performance of `Series.describe()` in case of numeric dtypes (GH21274)
- Improved performance of `pandas.core.groupby.GroupBy.rank()` when dealing with tied rankings (GH21237)
- Improved performance of `DataFrame.set_index()` with columns consisting of `Period` objects (GH21582, GH21606)
- Improved performance of membership checks in `Categorical` and `CategoricalIndex` (i.e. `x in cat`-style checks are much faster). `CategoricalIndex.contains()` is likewise much faster (GH21369, GH21508)
- Improved performance of `HDFStore.groups()` and dependent functions like `keys()` (i.e. `x in store` checks are much faster) (GH21372)
- Improved the performance of `pandas.get_dummies()` with `sparse=True` (GH21997)
- Improved performance of `IndexEngine.get_indexer_non_unique()` for sorted, non-unique indexes (GH9466)
- Improved performance of `PeriodIndex.unique()` (GH23083)

### Documentation Changes¶

- Added sphinx spelling extension, updated documentation on how to use the spell check (GH21079)

### Bug Fixes¶

#### Categorical¶

- Bug in `Categorical.from_codes()` where `NaN` values in `codes` were silently converted to `0` (GH21767). In the future this will raise a `ValueError`. Also changes the behavior of `.from_codes([1.1, 2.0])`.
- Bug when indexing with a boolean-valued `Categorical`. Now a boolean-valued `Categorical` is treated as a boolean mask (GH22665)
- Constructing a `CategoricalIndex` with empty values and boolean categories was raising a `ValueError` after a change to dtype coercion (GH22702)

#### Datetimelike¶

- Fixed bug where two `DateOffset` objects with different `normalize` attributes could evaluate as equal (GH21404)
- Fixed bug where `Timestamp.resolution()` incorrectly returned a 1-microsecond `timedelta` instead of a 1-nanosecond `Timedelta` (GH21336, GH21365)
- Bug in `to_datetime()` that did not consistently return an `Index` when `box=True` was specified (GH21864)
- Bug in `DatetimeIndex` comparisons where string comparisons incorrectly raised `TypeError` (GH22074)
- Bug in `DatetimeIndex` comparisons when comparing against `timedelta64[ns]` dtyped arrays; in some cases `TypeError` was incorrectly raised, in others it incorrectly failed to raise (GH22074)
- Bug in `DatetimeIndex` comparisons when comparing against object-dtyped arrays (GH22074)
- Bug in `DataFrame` with `datetime64[ns]` dtype addition and subtraction with `Timedelta`-like objects (GH22005, GH22163)
- Bug in `DataFrame` with `datetime64[ns]` dtype addition and subtraction with `DateOffset` objects returning an `object` dtype instead of `datetime64[ns]` dtype (GH21610, GH22163)
- Bug in `DataFrame` with `datetime64[ns]` dtype comparing against `NaT` incorrectly (GH22242, GH22163)
- Bug in `DataFrame` with `datetime64[ns]` dtype subtracting a `Timestamp`-like object that incorrectly returned `datetime64[ns]` dtype instead of `timedelta64[ns]` dtype (GH8554, GH22163)
- Bug in `DataFrame` with `datetime64[ns]` dtype subtracting an `np.datetime64` object with non-nanosecond unit failing to convert to nanoseconds (GH18874, GH22163)
- Bug in `DataFrame` comparisons against `Timestamp`-like objects failing to raise `TypeError` for inequality checks with mismatched types (GH8932, GH22163)
- Bug in `DataFrame` with mixed dtypes including `datetime64[ns]` incorrectly raising `TypeError` on equality comparisons (GH13128, GH22163)
- Bug in `DataFrame.eq()` comparison against `NaT` incorrectly returning `True` or `NaN` (GH15697, GH22163)
- Bug in `DatetimeIndex` subtraction that incorrectly failed to raise `OverflowError` (GH22492, GH22508)
- Bug in `DatetimeIndex` incorrectly allowing indexing with a `Timedelta` object (GH20464)
- Bug in `DatetimeIndex` where frequency was being set if the original frequency was `None` (GH22150)
- Bug in rounding methods of `DatetimeIndex` (`round()`, `ceil()`, `floor()`) and `Timestamp` (`round()`, `ceil()`, `floor()`) that could give rise to loss of precision (GH22591)
- Bug in `to_datetime()` with an `Index` argument that would drop the `name` from the result (GH21697)
- Bug in `PeriodIndex` where adding or subtracting a `timedelta` or `Tick` object produced incorrect results (GH22988)

#### Timedelta¶

- Bug in `DataFrame` with `timedelta64[ns]` dtype division by a `Timedelta`-like scalar incorrectly returning `timedelta64[ns]` dtype instead of `float64` dtype (GH20088, GH22163)
- Bug in adding an `Index` with object dtype to a `Series` with `timedelta64[ns]` dtype incorrectly raising (GH22390)
- Bug in multiplying a `Series` with numeric dtype against a `timedelta` object (GH22390)
- Bug in `Series` with numeric dtype when adding or subtracting an array or `Series` with `timedelta64` dtype (GH22390)
- Bug in `Index` with numeric dtype when multiplying or dividing an array with dtype `timedelta64` (GH22390)
- Bug in `TimedeltaIndex` incorrectly allowing indexing with a `Timestamp` object (GH20464)
- Fixed bug where subtracting a `Timedelta` from an object-dtyped array would raise `TypeError` (GH21980)
- Fixed bug in adding a `DataFrame` with all-`timedelta64[ns]` dtypes to a `DataFrame` with all-integer dtypes returning incorrect results instead of raising `TypeError` (GH22696)

#### Timezones¶

- Bug in `DatetimeIndex.shift()` where an `AssertionError` would raise when shifting across DST (GH8616)
- Bug in `Timestamp` constructor where passing an invalid timezone offset designator (`Z`) would not raise a `ValueError` (GH8910)
- Bug in `Timestamp.replace()` where replacing at a DST boundary would retain an incorrect offset (GH7825)
- Bug in `Series.replace()` with `datetime64[ns, tz]` data when replacing `NaT` (GH11792)
- Bug in `Timestamp` when passing different string date formats with a timezone offset would produce different timezone offsets (GH12064)
- Bug when comparing a tz-naive `Timestamp` to a tz-aware `DatetimeIndex` which would coerce the `DatetimeIndex` to tz-naive (GH12601)
- Bug in `Series.truncate()` with a tz-aware `DatetimeIndex` which would cause a core dump (GH9243)
- Bug in `Series` constructor which would coerce tz-aware and tz-naive `Timestamp` to tz-aware (GH13051)
- Bug in `Index` with `datetime64[ns, tz]` dtype that did not localize integer data correctly (GH20964)
- Bug in `DatetimeIndex` where constructing with an integer and tz would not localize correctly (GH12619)
- Fixed bug where `DataFrame.describe()` and `Series.describe()` on tz-aware datetimes did not show first and last result (GH21328)
- Bug in `DatetimeIndex` comparisons failing to raise `TypeError` when comparing timezone-aware `DatetimeIndex` against `np.datetime64` (GH22074)
- Bug in `DataFrame` assignment with a timezone-aware scalar (GH19843)
- Bug in `DataFrame.asof()` that raised a `TypeError` when attempting to compare tz-naive and tz-aware timestamps (GH21194)
- Bug when constructing a `DatetimeIndex` with `Timestamp` objects constructed with the `replace` method across DST (GH18785)
- Bug when setting a new value with `DataFrame.loc()` with a `DatetimeIndex` with a DST transition (GH18308, GH20724)
- Bug in `DatetimeIndex.unique()` that did not re-localize tz-aware dates correctly (GH21737)
- Bug when indexing a `Series` with a DST transition (GH21846)
- Bug in `DataFrame.resample()` and `Series.resample()` where an `AmbiguousTimeError` or `NonExistentTimeError` would raise if a timezone-aware timeseries ended on a DST transition (GH19375, GH10117)

#### Offsets¶

#### Numeric¶

- Bug in `Series.__rmatmul__` that did not support matrix-vector multiplication (GH21530)
- Bug in `factorize()` failing with a read-only array (GH12813)
- Fixed bug in `unique()` that handled signed zeros inconsistently: for some inputs 0.0 and -0.0 were treated as equal and for some inputs as different. Now they are treated as equal for all inputs (GH21866)
- Bug in `DataFrame.agg()`, `DataFrame.transform()` and `DataFrame.apply()` where, when supplied with a list of functions and `axis=1` (e.g. `df.apply(['sum', 'mean'], axis=1)`), a `TypeError` was wrongly raised. For all three methods such calculations are now done correctly (GH16679)
- Bug in `Series` comparison against datetime-like scalars and arrays (GH22074)
- Bug in `DataFrame` multiplication between boolean dtype and integer returning `object` dtype instead of integer dtype (GH22047, GH22163)
- Bug in `DataFrame.apply()` where, when supplied with a string argument and additional positional or keyword arguments (e.g. `df.apply('sum', min_count=1)`), a `TypeError` was wrongly raised (GH22376)
- Bug in `DataFrame.astype()` to extension dtype that may raise `AttributeError` (GH22578)
- Bug in `DataFrame` with `timedelta64[ns]` dtype arithmetic operations with `ndarray` with integer dtype incorrectly treating the ndarray as `timedelta64[ns]` dtype (GH23114)

#### Strings¶

#### Interval¶

- Bug in the `IntervalIndex` constructor where the `closed` parameter did not always override the inferred `closed` (GH19370)
- Bug in the `IntervalIndex` repr where a trailing comma was missing after the list of intervals (GH20611)
- Bug in `Interval` where scalar arithmetic operations did not retain the `closed` value (GH22313)
- Bug in `IntervalIndex` where indexing with datetime-like values raised a `KeyError` (GH20636)

#### Indexing¶

- The traceback from a `KeyError` when asking `.loc` for a single missing label is now shorter and more clear (GH21557)
- When `.ix` is asked for a missing integer label in a `MultiIndex` with a first level of integer type, it now raises a `KeyError`, consistently with the case of a flat `Int64Index`, rather than falling back to positional indexing (GH21593)
- Bug in `DatetimeIndex.reindex()` when reindexing a tz-naive and tz-aware `DatetimeIndex` (GH8306)
- Bug in `DataFrame` when setting values with `.loc` and a timezone-aware `DatetimeIndex` (GH11365)
- `DataFrame.__getitem__` now accepts dictionaries and dictionary keys as list-likes of labels, consistently with `Series.__getitem__` (GH21294)
- Fixed `DataFrame[np.nan]` when columns are non-unique (GH21428)
- Bug when indexing `DatetimeIndex` with nanosecond resolution dates and timezones (GH11679)
- Bug where indexing with a NumPy array containing negative values would mutate the indexer (GH21867)
- Bug where mixed indexes wouldn't allow integers for `.at` (GH19860)
- `Float64Index.get_loc` now raises `KeyError` when a boolean key is passed (GH19087)
- Bug in `DataFrame.loc()` when indexing with an `IntervalIndex` (GH19977)
- `Index` no longer mangles `None`, `NaN` and `NaT`, i.e. they are treated as three different keys. However, for a numeric Index all three are still coerced to `NaN` (GH22332)
- Bug in `scalar in Index` checks if the scalar is a float while the `Index` is of integer dtype (GH22085)

#### Missing¶

- Bug in `DataFrame.fillna()` where a `ValueError` would raise when one column contained a `datetime64[ns, tz]` dtype (GH15522)
- Bug in `Series.hasnans()` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (GH19700)
- `Series.isin()` now treats all NaN-floats as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (GH22119); a small sketch follows this list
- `unique()` no longer mangles NaN-floats and the `NaT`-object for `np.object`-dtype, i.e. `NaT` is no longer coerced to a NaN-value and is treated as a different entity (GH22295)
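A small sketch of the `Series.isin()` change above (our illustration):

```
import numpy as np
import pandas as pd

s = pd.Series([np.nan], dtype=object)
# NaN-floats now compare equal for object dtype too, matching float64:
s.isin([np.nan])   # 0    True
```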

#### MultiIndex¶

- Removed compatibility for `MultiIndex` pickles prior to version 0.8.0; compatibility with `MultiIndex` pickles from version 0.13 forward is maintained (GH21654)
- `MultiIndex.get_loc_level()` (and as a consequence, `.loc` on a `MultiIndex`-ed object) will now raise a `KeyError`, rather than returning an empty `slice`, if asked for a label which is present in the `levels` but is unused (GH22221)
- Fix `TypeError` in Python 3 when creating a `MultiIndex` in which some levels have mixed types, e.g. when some labels are tuples (GH15457)

#### I/O¶

- `read_html()` no longer ignores all-whitespace `<tr>` within `<thead>` when considering the `skiprows` and `header` arguments. Previously, users had to decrease their `header` and `skiprows` values on such tables to work around the issue (GH21641)
- `read_excel()` will correctly show the deprecation warning for previously deprecated `sheetname` (GH17994)
- `read_csv()` and `read_table()` will throw `UnicodeError` and not coredump on badly encoded strings (GH22748)
- `read_csv()` will correctly parse timezone-aware datetimes (GH22256)
- `read_sas()` will correctly parse numbers in sas7bdat files that have width less than 8 bytes (GH21616)
- `read_sas()` will correctly parse sas7bdat files with many columns (GH22628)
- `read_sas()` will correctly parse sas7bdat files with data page types having also bit 7 set (so page type is 128 + 256 = 384) (GH16615)
- Bug in `detect_client_encoding()` where a potential `IOError` goes unhandled when importing in a mod_wsgi process due to restricted access to stdout (GH21552)
- Bug in `to_string()` that broke column alignment when `index=False` and the width of the first column's values is greater than the width of the first column's header (GH16839, GH13032)

#### Plotting¶

- Bug in `DataFrame.plot.scatter()` and `DataFrame.plot.hexbin()` that caused the x-axis label and ticklabels to disappear when the colorbar was on in the IPython inline backend (GH10611, GH10678, and GH20455)
- Bug in plotting a Series with datetimes using `matplotlib.axes.Axes.scatter()` (GH22039)

#### Groupby/Resample/Rolling¶

- Bug in `pandas.core.groupby.GroupBy.first()` and `pandas.core.groupby.GroupBy.last()` with `as_index=False` leading to the loss of timezone information (GH15884)
- Bug in `DatetimeIndex.resample()` when downsampling across a DST boundary (GH8531)
- Bug where a `ValueError` is wrongly raised when calling the `count()` method of a `SeriesGroupBy` when the grouping variable only contains NaNs and numpy version < 1.13 (GH21956)
- Multiple bugs in `pandas.core.Rolling.min()` with `closed='left'` and a datetime-like index leading to incorrect results and also segfault (GH21704)
- Bug in `Resampler.apply()` when passing positional arguments to the applied func (GH14615)
- Bug in `Series.resample()` when passing `numpy.timedelta64` to the `loffset` kwarg (GH7687)
- Bug in `Resampler.asfreq()` when the frequency of a `TimedeltaIndex` is a subperiod of a new frequency (GH13022)
- Bug in `SeriesGroupBy.mean()` when values were integral but could not fit inside of int64, overflowing instead (GH22487)
- `RollingGroupby.agg()` and `ExpandingGroupby.agg()` now support multiple aggregation functions as parameters (GH15072)
- Bug in `DataFrame.resample()` and `Series.resample()` when resampling by a weekly offset (`'W'`) across a DST transition (GH9119, GH21459)

#### Reshaping¶

- Bug in `pandas.concat()` when joining resampled DataFrames with a timezone-aware index (GH13783)
- Bug in `Series.combine_first()` with `datetime64[ns, tz]` dtype which would return a tz-naive result (GH21469)
- Bug in `Series.where()` and `DataFrame.where()` with `datetime64[ns, tz]` dtype (GH21546)
- Bug in `Series.mask()` and `DataFrame.mask()` with `list` conditionals (GH21891)
- Bug in `DataFrame.replace()` raising `RecursionError` when converting OutOfBounds `datetime64[ns, tz]` (GH20380)
- `pandas.core.groupby.GroupBy.rank()` now raises a `ValueError` when an invalid value is passed for the argument `na_option` (GH22124)
- Bug in `get_dummies()` with Unicode attributes in Python 2 (GH22084)
- Bug in `DataFrame.replace()` raising `RecursionError` when replacing empty lists (GH22083)
- Bug in `Series.replace()` and `DataFrame.replace()` when a dict is used as the `to_replace` value and one key in the dict is another key's value; the results were inconsistent between using integer keys and using string keys (GH20656)
- Bug in `DataFrame.drop_duplicates()` for an empty `DataFrame` which incorrectly raised an error (GH20516)
- Bug in `pandas.wide_to_long()` when a string is passed to the stubnames argument and a column name is a substring of that stubname (GH22468)
- Bug in `merge()` when merging `datetime64[ns, tz]` data that contained a DST transition (GH18885)
- Bug in `merge_asof()` when merging on float values within a defined tolerance (GH22981)
- Bug in `pandas.concat()` when concatenating a multicolumn DataFrame with tz-aware data against a DataFrame with a different number of columns (GH22796)
- Bug in `merge_asof()` where a confusing error message was raised when attempting to merge with missing values (GH23189)

#### Sparse¶

- Updating a boolean, datetime, or timedelta column to be Sparse now works (GH22367)
- Bug in `Series.to_sparse()` with a Series already holding sparse data not constructing properly (GH22389)
- Providing a `sparse_index` to the `SparseArray` constructor no longer defaults the na-value to `np.nan` for all dtypes. The correct na_value for `data.dtype` is now used.
- Bug in `SparseArray.nbytes` under-reporting its memory usage by not including the size of its sparse index.
- Improved performance of `Series.shift()` for non-NA `fill_value`, as values are no longer converted to a dense array.
- Bug in `DataFrame.groupby` not including `fill_value` in the groups for non-NA `fill_value` when grouping by a sparse column (GH5078)
- Bug in the unary inversion operator (`~`) on a `SparseSeries` with boolean values. The performance of this has also been improved (GH22835)

#### Build Changes¶

- Building pandas for development now requires `cython >= 0.28.2` (GH21688)
- Testing pandas now requires `hypothesis>=3.58`. You can find the Hypothesis docs here, and a pandas-specific introduction in the contributing guide (GH22280)

#### Other¶

- `background_gradient()` now takes a `text_color_threshold` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range (GH21258); see the sketch after this list
- Require at least version 0.28.2 of `cython` to support read-only memoryviews (GH21688)
- `background_gradient()` now also supports tablewise application (in addition to rowwise and columnwise) with `axis=None` (GH15204)
- `DataFrame.nlargest()` and `DataFrame.nsmallest()` now return the correct n values when keep != 'all' also when tied on the first columns (GH22752)
- `bar()` now also supports tablewise application (in addition to rowwise and columnwise) with `axis=None` and setting the clipping range with `vmin` and `vmax` (GH21548 and GH21526). `NaN` values are also handled properly.
- Logical operations `&, |, ^` between `Series` and `Index` will no longer raise `ValueError` (GH22092)
- Bug in `DataFrame.combine_first()` in which column types were unexpectedly converted to float (GH20699)
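A hedged sketch combining the two `background_gradient()` items above (best viewed in a notebook; the data is arbitrary):

```
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# axis=None applies the gradient over the whole table at once;
# text_color_threshold lightens the text shown on dark cells.
df.style.background_gradient(axis=None, text_color_threshold=0.5)
```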

## v0.23.4 (August 3, 2018)¶

This is a minor bug-fix release in the 0.23.x series and includes some small regression fixes and bug fixes. We recommend that all users upgrade to this version.

Warning

Starting January 1, 2019, pandas feature releases will support Python 3 only. See Plan for dropping Python 2.7 for more.

What’s new in v0.23.4

### Fixed Regressions¶

- Python 3.7 with Windows gave all missing values for rolling variance calculations (GH21813)

### Bug Fixes¶

**Groupby/Resample/Rolling**

- Bug where calling `DataFrameGroupBy.agg()` with a list of functions including `ohlc` as the non-initial element would raise a `ValueError` (GH21716)
- Bug in `roll_quantile` that caused a memory leak when calling `.rolling(...).quantile(q)` with `q` in (0,1) (GH21965)

**Missing**

- Bug in `Series.clip()` and `DataFrame.clip()` which could not accept a list-like threshold containing `NaN` (GH19992)

## v0.23.3 (July 7, 2018)¶

This release fixes a build issue with the sdist for Python 3.7 (GH21785). There are no other changes.

## v0.23.2 (July 5, 2018)¶

This is a minor bug-fix release in the 0.23.x series and includes some small regression fixes and bug fixes. We recommend that all users upgrade to this version.

Note

Pandas 0.23.2 is the first pandas release that's compatible with Python 3.7 (GH20552)

Warning

Starting January 1, 2019, pandas feature releases will support Python 3 only. See Plan for dropping Python 2.7 for more.

What’s new in v0.23.2

### Logical Reductions over Entire DataFrame¶

`DataFrame.all()` and `DataFrame.any()` now accept `axis=None` to reduce over all axes to a scalar (GH19976)

```
In [1]: df = pd.DataFrame({"A": [1, 2], "B": [True, False]})
In [2]: df.all(axis=None)
Out[2]: False
```

This also provides compatibility with NumPy 1.15, which now dispatches to `DataFrame.all`. With NumPy 1.15 and pandas 0.23.1 or earlier, `numpy.all()` will no longer reduce over every axis:

```
>>> # NumPy 1.15, pandas 0.23.1
>>> np.any(pd.DataFrame({"A": [False], "B": [False]}))
A False
B False
dtype: bool
```

With pandas 0.23.2, that will correctly return False, as it did with NumPy < 1.15.

```
In [3]: np.any(pd.DataFrame({"A": [False], "B": [False]}))
Out[3]: False
```

### Fixed Regressions¶

- Fixed regression in `to_csv()` that handled file-like objects incorrectly (GH21471)
- Re-allowed duplicate level names of a `MultiIndex`. Accessing a level that has a duplicate name by name still raises an error (GH19029)
- Bug in both `DataFrame.first_valid_index()` and `Series.first_valid_index()` that raised for a row index having duplicate values (GH21441)
- Fixed printing of DataFrames with hierarchical columns with long names (GH21180)
- Fixed regression in `reindex()` and `groupby()` with a MultiIndex or multiple keys that contains categorical datetime-like values (GH21390)
- Fixed regression in unary negative operations with object dtype (GH21380)
- Bug in `Timestamp.ceil()` and `Timestamp.floor()` when the timestamp is a multiple of the rounding frequency (GH21262)
- Fixed regression in `to_clipboard()` that defaulted to copying dataframes with space delimited instead of tab delimited (GH21104)

### Build Changes¶

- The source and binary distributions no longer include test data files, resulting in smaller download sizes. Tests relying on these data files will be skipped when using `pandas.test()` (GH19320)

### Bug Fixes¶

**Conversion**

- Bug in constructing `Index` with an iterator or generator (GH21470)
- Bug in `Series.nlargest()` for signed and unsigned integer dtypes when the minimum value is present (GH21426)

**Indexing**

- Bug in `Index.get_indexer_non_unique()` with a categorical key (GH21448)
- Bug in comparison operations for `MultiIndex` where an error was raised on an equality/inequality comparison involving a MultiIndex with `nlevels == 1` (GH21149)
- Bug in `DataFrame.drop()` where behaviour was not consistent for unique and non-unique indexes (GH21494)
- Bug in `DataFrame.duplicated()` with a large number of columns causing a 'maximum recursion depth exceeded' error (GH21524)

**I/O**

- Bug in `read_csv()` that caused it to incorrectly raise an error when `nrows=0`, `low_memory=True`, and `index_col` was not `None` (GH21141)
- Bug in `json_normalize()` when formatting the `record_prefix` with integer columns (GH21536)

**Categorical**

**Timezones**

- Bug in `Timestamp` and `DatetimeIndex` where passing a `Timestamp` localized after a DST transition would return a datetime before the DST transition (GH20854)
- Bug in comparing `DataFrame`s with tz-aware `DatetimeIndex` columns with a DST transition that raised a `KeyError` (GH19970)

**Timedelta**

## v0.23.1 (June 12, 2018)¶

This is a minor bug-fix release in the 0.23.x series and includes some small regression fixes and bug fixes. We recommend that all users upgrade to this version.

Warning

Starting January 1, 2019, pandas feature releases will support Python 3 only. See Plan for dropping Python 2.7 for more.

What’s new in v0.23.1

### Fixed Regressions¶

**Comparing Series with datetime.date**

We’ve reverted a 0.23.0 change to comparing a `Series` holding datetimes and a `datetime.date` object (GH21152). In pandas 0.22 and earlier, comparing a Series holding datetimes and `datetime.date` objects would coerce the `datetime.date` to a datetime before comparing. This was inconsistent with Python, NumPy, and `DatetimeIndex`, which never consider a datetime and `datetime.date` equal.

In 0.23.0, we unified operations between DatetimeIndex and Series, and in the process changed comparisons between a Series of datetimes and `datetime.date` without warning.

We’ve temporarily restored the 0.22.0 behavior, so datetimes and dates may again compare equal, but will restore the 0.23.0 behavior in a future release.

To summarize, here’s the behavior in 0.22.0, 0.23.0, 0.23.1:

```
# 0.22.0... Silently coerce the datetime.date
>>> Series(pd.date_range('2017', periods=2)) == datetime.date(2017, 1, 1)
0 True
1 False
dtype: bool
# 0.23.0... Do not coerce the datetime.date
>>> Series(pd.date_range('2017', periods=2)) == datetime.date(2017, 1, 1)
0 False
1 False
dtype: bool
# 0.23.1... Coerce the datetime.date with a warning
>>> Series(pd.date_range('2017', periods=2)) == datetime.date(2017, 1, 1)
/bin/python:1: FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the
'datetime.date' is coerced to a datetime. In the future pandas will
not coerce, and the values not compare equal to the 'datetime.date'.
To retain the current behavior, convert the 'datetime.date' to a
datetime with 'pd.Timestamp'.
#!/bin/python3
0 True
1 False
dtype: bool
```

In addition, ordering comparisons will raise a `TypeError` in the future.

**Other Fixes**

- Reverted the ability of `to_sql()` to perform multivalue inserts as this caused regression in certain cases (GH21103). In the future this will be made configurable.
- Fixed regression in the `DatetimeIndex.date` and `DatetimeIndex.time` attributes in case of timezone-aware data: `DatetimeIndex.time` returned a tz-aware time instead of tz-naive (GH21267) and `DatetimeIndex.date` returned an incorrect date when the input date had a non-UTC timezone (GH21230)
- Fixed regression in `pandas.io.json.json_normalize()` when called with `None` values in nested levels in JSON, and to not drop keys with value as None (GH21158, GH21356)
- Bug in `to_csv()` causing an encoding error when compression and encoding are specified (GH21241, GH21118)
- Bug preventing pandas from being importable with -OO optimization (GH21071)
- Bug in `Categorical.fillna()` incorrectly raising a `TypeError` when the individual categories are iterable and `value` is an iterable (GH21097, GH19788)
- Fixed regression in constructors coercing NA values like `None` to strings when passing `dtype=str` (GH21083)
- Regression in `pivot_table()` where an ordered `Categorical` with missing values for the pivot's `index` would give a mis-aligned result (GH21133)
- Fixed regression in merging on boolean index/columns (GH21119)

### Performance Improvements¶

### Bug Fixes¶

**Groupby/Resample/Rolling**

- Bug in `DataFrame.agg()` where applying multiple aggregation functions to a `DataFrame` with duplicated column names would cause a stack overflow (GH21063)
- Bug in `pandas.core.groupby.GroupBy.ffill()` and `pandas.core.groupby.GroupBy.bfill()` where the fill within a grouping would not always be applied as intended due to the implementations’ use of a non-stable sort (GH21207)
- Bug in `pandas.core.groupby.GroupBy.rank()` where results did not scale to 100% when specifying `method='dense'` and `pct=True`
- Bug in `pandas.DataFrame.rolling()` and `pandas.Series.rolling()` which incorrectly accepted a 0 window size rather than raising (GH21286)

**Data-type specific**

- Bug in `Series.str.replace()` where the method throws `TypeError` on Python 3.5.2 (GH21078)
- Bug in `Timedelta` where passing a float with a unit would prematurely round the float precision (GH14156)
- Bug in `pandas.testing.assert_index_equal()` which raised `AssertionError` incorrectly, when comparing two `CategoricalIndex` objects with param `check_categorical=False` (GH19776)

**Sparse**

- Bug in `SparseArray.shape` which previously only returned the shape of `SparseArray.sp_values` (GH21126)

**Indexing**

- Bug in `Series.reset_index()` where an appropriate error was not raised with an invalid level name (GH20925)
- Bug in `interval_range()` when `start`/`periods` or `end`/`periods` are specified with a float `start` or `end` (GH21161)
- Bug in `MultiIndex.set_names()` where an error was raised for a `MultiIndex` with `nlevels == 1` (GH21149)
- Bug in `IntervalIndex` constructors where creating an `IntervalIndex` from categorical data was not fully supported (GH21243, GH21253)
- Bug in `MultiIndex.sort_index()` which was not guaranteed to sort correctly with `level=1`; this was also causing data misalignment in particular `DataFrame.stack()` operations (GH20994, GH20945, GH21052)

**Plotting**

- New keywords (`sharex`, `sharey`) to turn on/off sharing of x/y-axis by subplots generated with `DataFrame.groupby().boxplot()` (GH20968); a short sketch follows this list
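
A minimal sketch of the new keywords (requires matplotlib; the group and column names here are illustrative only):

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'x': np.random.randn(4),
                   'y': np.random.randn(4)})

# One boxplot per group; sharey=True keeps the y-axis scale comparable
# across the generated subplots
df.groupby('g').boxplot(sharey=True)
```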

**I/O**

- Bug in IO methods specifying `compression='zip'` which produced uncompressed zip archives (GH17778, GH21144)
- Bug in `DataFrame.to_stata()` which prevented exporting DataFrames to buffers and most file-like objects (GH21041)
- Bug in `read_stata()` and `StataReader` which did not correctly decode utf-8 strings on Python 3 from Stata 14 files (dta version 118) (GH21244)
- Bug in `read_json()` where reading an empty JSON schema with `orient='table'` back to a `DataFrame` caused an error (GH21287)

**Reshaping**

- Bug in `concat()` where an error was raised when concatenating `Series` with numpy scalar and tuple names (GH21015)
- Bug in `concat()` warning message providing the wrong guidance for future behavior (GH21101)

**Other**

## v0.23.0 (May 15, 2018)¶

This is a major release from 0.22.0 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- Round-trippable JSON format with ‘table’ orient.
- Instantiation from dicts respects order for Python 3.6+.
- Dependent column arguments for assign.
- Merging / sorting on a combination of columns and index levels.
- Extending Pandas with custom types.
- Excluding unobserved categories from groupby.
- Changes to make output shape of DataFrame.apply consistent.

Check the API Changes and deprecations before updating.

What’s new in v0.23.0

- New features
  - JSON read/write round-trippable with `orient='table'`
  - `.assign()` accepts dependent arguments
  - Merging on a combination of columns and index levels
  - Sorting by a combination of columns and index levels
  - Extending Pandas with Custom Types (Experimental)
  - New `observed` keyword for excluding unobserved categories in `groupby`
  - Rolling/Expanding.apply() accepts `raw=False` to pass a `Series` to the function
  - `DataFrame.interpolate` has gained the `limit_area` kwarg
  - `get_dummies` now supports `dtype` argument
  - Timedelta mod method
  - `.rank()` handles `inf` values when `NaN` are present
  - `Series.str.cat` has gained the `join` kwarg
  - `DataFrame.astype` performs column-wise conversion to `Categorical`
  - Other Enhancements
- Backwards incompatible API changes
  - Dependencies have increased minimum versions
  - Instantiation from dicts preserves dict insertion order for python 3.6+
  - Deprecate Panel
  - pandas.core.common removals
  - Changes to make output of `DataFrame.apply` consistent
  - Concatenation will no longer sort
  - Build Changes
  - Index Division By Zero Fills Correctly
  - Extraction of matching patterns from strings
  - Default value for the `ordered` parameter of `CategoricalDtype`
  - Better pretty-printing of DataFrames in a terminal
  - Datetimelike API Changes
  - Other API Changes
- Deprecations
- Removal of prior version deprecations/changes
- Performance Improvements
- Documentation Changes
- Bug Fixes

### New features¶

#### JSON read/write round-trippable with `orient='table'`¶

A `DataFrame` can now be written to and subsequently read back via JSON while preserving metadata through usage of the `orient='table'` argument (see GH18912 and GH9146). Previously, none of the available `orient` values guaranteed the preservation of dtypes and index names, amongst other metadata.

```
In [1]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
...: 'bar': ['a', 'b', 'c', 'd'],
...: 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
...: 'qux': pd.Categorical(['a', 'b', 'c', 'c'])
...: }, index=pd.Index(range(4), name='idx'))
...:
In [2]: df
Out[2]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [3]: df.dtypes
Out[3]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
In [4]: df.to_json('test.json', orient='table')
In [5]: new_df = pd.read_json('test.json', orient='table')
In [6]: new_df
Out[6]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [7]: new_df.dtypes
Out[7]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
```

Please note that an index named `'index'` is not supported with the round-trip format, as that name is used by default in `write_json` to indicate a missing index name.

```
In [8]: df.index.name = 'index'
In [9]: df.to_json('test.json', orient='table')
In [10]: new_df = pd.read_json('test.json', orient='table')
In [11]: new_df
Out[11]:
foo bar baz qux
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [12]: new_df.dtypes
Out[12]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
```

#### `.assign()` accepts dependent arguments¶

`DataFrame.assign()` now accepts dependent keyword arguments on Python 3.6 and later (see also PEP 468). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the documentation here (GH14207)

```
In [13]: df = pd.DataFrame({'A': [1, 2, 3]})
In [14]: df
Out[14]:
A
0 1
1 2
2 3
In [15]: df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
Out[15]:
A B C
0 1 1 2
1 2 2 4
2 3 3 6
```

Warning

This may subtly change the behavior of your code when you’re using `.assign()` to update an existing column. Previously, callables referring to other variables being updated would get the “old” values.

Previous Behavior:

```
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
Out[3]:
A C
0 2 -1
1 3 -2
2 4 -3
```

New Behavior:

```
In [16]: df.assign(A=df.A + 1, C=lambda df: df.A * -1)
Out[16]:
A C
0 2 -2
1 3 -3
2 4 -4
```

#### Merging on a combination of columns and index levels¶

Strings passed to `DataFrame.merge()` as the `on`, `left_on`, and `right_on` parameters may now refer to either column names or index level names. This enables merging `DataFrame` instances on a combination of index levels and columns without resetting indexes. See the Merge on columns and levels documentation section. (GH14355)

```
In [17]: left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
In [18]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
....: 'B': ['B0', 'B1', 'B2', 'B3'],
....: 'key2': ['K0', 'K1', 'K0', 'K1']},
....: index=left_index)
....:
In [19]: right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
In [20]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
....: 'D': ['D0', 'D1', 'D2', 'D3'],
....: 'key2': ['K0', 'K0', 'K0', 'K1']},
....: index=right_index)
....:
In [21]: left.merge(right, on=['key1', 'key2'])
Out[21]:
A B key2 C D
key1
K0 A0 B0 K0 C0 D0
K1 A2 B2 K0 C1 D1
K2 A3 B3 K1 C3 D3
```

#### Sorting by a combination of columns and index levels¶

Strings passed to `DataFrame.sort_values()` as the `by` parameter may now refer to either column names or index level names. This enables sorting `DataFrame` instances by a combination of index levels and columns without resetting indexes. See the Sorting by Indexes and Values documentation section. (GH14353)

```
# Build MultiIndex
In [22]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
....: ('b', 2), ('b', 1), ('b', 1)])
....:
In [23]: idx.names = ['first', 'second']
# Build DataFrame
In [24]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
....: index=idx)
....:
In [25]: df_multi
Out[25]:
A
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1
# Sort by 'second' (index) and 'A' (column)
In [26]: df_multi.sort_values(by=['second', 'A'])
Out[26]:
A
first second
b 1 1
1 2
a 1 6
b 2 3
a 2 4
2 5
```

#### Extending Pandas with Custom Types (Experimental)¶

Pandas now supports storing array-like objects that aren’t necessarily 1-D NumPy arrays as columns in a DataFrame or values in a Series. This allows third-party libraries to implement extensions to NumPy’s types, similar to how pandas implemented categoricals, datetimes with timezones, periods, and intervals.

As a demonstration, we’ll use cyberpandas, which provides an `IPArray` type for storing IP addresses.

```
In [1]: from cyberpandas import IPArray
In [2]: values = IPArray([
...: 0,
...: 3232235777,
...: 42540766452641154071740215577757643572
...: ])
...:
...:
```

`IPArray` isn’t a normal 1-D NumPy array, but because it’s a pandas `ExtensionArray`, it can be stored properly inside pandas’ containers.

```
In [3]: ser = pd.Series(values)
In [4]: ser
Out[4]:
0 0.0.0.0
1 192.168.1.1
2 2001:db8:85a3::8a2e:370:7334
dtype: ip
```

Notice that the dtype is `ip`. The missing value semantics of the underlying array are respected:

```
In [5]: ser.isna()
Out[5]:
0 True
1 False
2 False
dtype: bool
```

For more, see the extension types documentation. If you build an extension array, publicize it on our ecosystem page.

#### New `observed` keyword for excluding unobserved categories in `groupby`¶

Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categorical columns, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in a large number of groups. We have added a keyword `observed` to control this behavior; it defaults to `observed=False` for backward-compatibility. (GH14942, GH8138, GH15217, GH17594, GH8669, GH20583, GH20902)

```
In [27]: cat1 = pd.Categorical(["a", "a", "b", "b"],
....: categories=["a", "b", "z"], ordered=True)
....:
In [28]: cat2 = pd.Categorical(["c", "d", "c", "d"],
....: categories=["c", "d", "y"], ordered=True)
....:
In [29]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
In [30]: df['C'] = ['foo', 'bar'] * 2
In [31]: df
Out[31]:
A B values C
0 a c 1 foo
1 a d 2 bar
2 b c 3 foo
3 b d 4 bar
```

To show all values, the previous behavior:

```
In [32]: df.groupby(['A', 'B', 'C'], observed=False).count()
Out[32]:
values
A B C
a c bar NaN
foo 1.0
d bar 1.0
foo NaN
y bar NaN
foo NaN
b c bar NaN
... ...
y foo NaN
z c bar NaN
foo NaN
d bar NaN
foo NaN
y bar NaN
foo NaN
[18 rows x 1 columns]
```

To show only observed values:

```
In [33]: df.groupby(['A', 'B', 'C'], observed=True).count()
Out[33]:
values
A B C
a c foo 1
d bar 1
b c foo 1
d bar 1
```

For pivoting operations, this behavior is *already* controlled by the `dropna` keyword:

```
In [34]: cat1 = pd.Categorical(["a", "a", "b", "b"],
....: categories=["a", "b", "z"], ordered=True)
....:
In [35]: cat2 = pd.Categorical(["c", "d", "c", "d"],
....: categories=["c", "d", "y"], ordered=True)
....:
In [36]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
In [37]: df
Out[37]:
A B values
0 a c 1
1 a d 2
2 b c 3
3 b d 4
```

```
In [38]: pd.pivot_table(df, values='values', index=['A', 'B'],
....: dropna=True)
....:
Out[38]:
values
A B
a c 1
d 2
b c 3
d 4
In [39]: pd.pivot_table(df, values='values', index=['A', 'B'],
....: dropna=False)
....:
Out[39]:
values
A B
a c 1.0
d 2.0
y NaN
b c 3.0
d 4.0
y NaN
z c NaN
d NaN
y NaN
```

#### Rolling/Expanding.apply() accepts `raw=False` to pass a `Series` to the function¶

`Series.rolling().apply()`, `DataFrame.rolling().apply()`, `Series.expanding().apply()`, and `DataFrame.expanding().apply()` have gained a `raw=None` parameter. This is similar to `DataFrame.apply()`. This parameter, if `True`, allows one to send a `np.ndarray` to the applied function. If `False`, a `Series` will be passed. The default is `None`, which preserves backward compatibility, so this will default to `True`, sending an `np.ndarray`. In a future version the default will be changed to `False`, sending a `Series`. (GH5071, GH20584)

```
In [40]: s = pd.Series(np.arange(5), np.arange(5) + 1)
In [41]: s
Out[41]:
1 0
2 1
3 2
4 3
5 4
dtype: int64
```

Pass a `Series`:

```
In [42]: s.rolling(2, min_periods=1).apply(lambda x: x.iloc[-1], raw=False)
Out[42]:
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
dtype: float64
```

Mimic the original behavior of passing a ndarray:

```
In [43]: s.rolling(2, min_periods=1).apply(lambda x: x[-1], raw=True)
Out[43]:
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
dtype: float64
```

#### `DataFrame.interpolate` has gained the `limit_area` kwarg¶

`DataFrame.interpolate()` has gained a `limit_area` parameter to allow further control of which `NaN`s are replaced. Use `limit_area='inside'` to fill only `NaN`s surrounded by valid values, or use `limit_area='outside'` to fill only `NaN`s outside the existing valid values while preserving those inside. (GH16284) See the full documentation here.

```
In [44]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
In [45]: ser
Out[45]:
0 NaN
1 NaN
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
dtype: float64
```

Fill one consecutive inside value in both directions

```
In [46]: ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
Out[46]:
0 NaN
1 NaN
2 5.0
3 7.0
4 NaN
5 11.0
6 13.0
7 NaN
8 NaN
dtype: float64
```

Fill all consecutive outside values backward

```
In [47]: ser.interpolate(limit_direction='backward', limit_area='outside')
Out[47]:
0 5.0
1 5.0
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
dtype: float64
```

Fill all consecutive outside values in both directions

```
In [48]: ser.interpolate(limit_direction='both', limit_area='outside')
Out[48]:
0 5.0
1 5.0
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 13.0
8 13.0
dtype: float64
```

#### `get_dummies` now supports `dtype` argument¶

`get_dummies()` now accepts a `dtype` argument, which specifies a dtype for the new columns. The default remains `uint8`. (GH18330)

```
In [49]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
In [50]: pd.get_dummies(df, columns=['c']).dtypes
Out[50]:
a int64
b int64
c_5 uint8
c_6 uint8
dtype: object
In [51]: pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
Out[51]:
a int64
b int64
c_5 bool
c_6 bool
dtype: object
```

#### Timedelta mod method¶

`mod` (%) and `divmod` operations are now defined on `Timedelta` objects when operating with either timedelta-like or with numeric arguments. See the documentation here. (GH19365)

```
In [52]: td = pd.Timedelta(hours=37)
In [53]: td % pd.Timedelta(minutes=45)
Out[53]: Timedelta('0 days 00:15:00')
```
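
`divmod` works analogously, returning the integer quotient together with the remainder; a minimal sketch building on the example above:

```
# 37 hours contains 49 full 45-minute intervals, with 15 minutes left over,
# so this returns (49, Timedelta('0 days 00:15:00'))
divmod(td, pd.Timedelta(minutes=45))
```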

#### `.rank()` handles `inf` values when `NaN` are present¶

In previous versions, `.rank()` would assign `inf` elements `NaN` as their ranks. Now ranks are calculated properly. (GH6945)

```
In [54]: s = pd.Series([-np.inf, 0, 1, np.nan, np.inf])
In [55]: s
Out[55]:
0 -inf
1 0.000000
2 1.000000
3 NaN
4 inf
dtype: float64
```

Previous Behavior:

```
In [11]: s.rank()
Out[11]:
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
```

Current Behavior:

```
In [56]: s.rank()
Out[56]:
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
dtype: float64
```

Furthermore, previously if you ranked `inf` or `-inf` values together with `NaN` values, the calculation would not distinguish `NaN` from infinity when using the ‘top’ or ‘bottom’ argument.

```
In [57]: s = pd.Series([np.nan, np.nan, -np.inf, -np.inf])
In [58]: s
Out[58]:
0 NaN
1 NaN
2 -inf
3 -inf
dtype: float64
```

Previous Behavior:

```
In [15]: s.rank(na_option='top')
Out[15]:
0 2.5
1 2.5
2 2.5
3 2.5
dtype: float64
```

Current Behavior:

```
In [59]: s.rank(na_option='top')
Out[59]:
0 1.5
1 1.5
2 3.5
3 3.5
dtype: float64
```

These bugs were squashed:

- Bug in `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` in which percentile ranks were not being used with the number of distinct observations (GH15630)
- Bug in `Series.rank()` and `DataFrame.rank()` when `ascending='False'` failed to return correct ranks for infinity if `NaN` were present (GH19538)
- Bug in `DataFrameGroupBy.rank()` where ranks were incorrect when both infinity and `NaN` were present (GH20561)

#### `Series.str.cat` has gained the `join` kwarg¶

Previously, `Series.str.cat()` did not – in contrast to most of `pandas` – align `Series` on their index before concatenation (see GH18657). The method has now gained a keyword `join` to control the manner of alignment; see examples below and here.

In v0.23 `join` will default to `None` (meaning no alignment), but this default will change to `'left'` in a future version of pandas.

```
In [60]: s = pd.Series(['a', 'b', 'c', 'd'])
In [61]: t = pd.Series(['b', 'd', 'e', 'c'], index=[1, 3, 4, 2])
In [62]: s.str.cat(t)
Out[62]:
0 ab
1 bd
2 ce
3 dc
dtype: object
In [63]: s.str.cat(t, join='left', na_rep='-')
Out[63]:
0 a-
1 bb
2 cc
3 dd
dtype: object
```

Furthermore, `Series.str.cat()` now works for `CategoricalIndex` as well (previously this raised a `ValueError`; see GH20842).

#### `DataFrame.astype` performs column-wise conversion to `Categorical`¶

`DataFrame.astype()` can now perform column-wise conversion to `Categorical` by supplying the string `'category'` or a `CategoricalDtype`. Previously, attempting this would raise a `NotImplementedError`. See the Object Creation section of the documentation for more details and examples. (GH12860, GH18099)

Supplying the string `'category'` performs column-wise conversion, with only labels appearing in a given column set as categories:

```
In [64]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
In [65]: df = df.astype('category')
In [66]: df['A'].dtype
Out[66]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
In [67]: df['B'].dtype
Out[67]: CategoricalDtype(categories=['b', 'c', 'd'], ordered=False)
```

Supplying a `CategoricalDtype` will make the categories in each column consistent with the supplied dtype:

```
In [68]: from pandas.api.types import CategoricalDtype
In [69]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
In [70]: cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
In [71]: df = df.astype(cdt)
In [72]: df['A'].dtype
Out[72]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
In [73]: df['B'].dtype
Out[73]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
```

#### Other Enhancements¶

- Unary `+` is now permitted for `Series` and `DataFrame` as a numeric operator (GH16073)
- Better support for `to_excel()` output with the `xlsxwriter` engine (GH16149)
- `pandas.tseries.frequencies.to_offset()` now accepts leading ‘+’ signs, e.g. ‘+1h’ (GH18171)
- `MultiIndex.unique()` now supports the `level=` argument, to get unique values from a specific index level (GH17896)
- `pandas.io.formats.style.Styler` now has method `hide_index()` to determine whether the index will be rendered in output (GH14194)
- `pandas.io.formats.style.Styler` now has method `hide_columns()` to determine whether columns will be hidden in output (GH14194)
- Improved wording of `ValueError` raised in `to_datetime()` when `unit=` is passed with a non-convertible value (GH14350)
- `Series.fillna()` now accepts a Series or a dict as a `value` for a categorical dtype (GH17033)
- `pandas.read_clipboard()` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python 3 and multiple python-qt bindings (GH17722)
- Improved wording of `ValueError` raised in `read_csv()` when the `usecols` argument cannot match all columns (GH17301)
- `DataFrame.corrwith()` now silently drops non-numeric columns when passed a Series. Before, an exception was raised (GH18570).
- `IntervalIndex` now supports time zone aware `Interval` objects (GH18537, GH18538)
- `Series()`/`DataFrame()` tab completion also returns identifiers in the first level of a `MultiIndex()` (GH16326)
- `read_excel()` has gained the `nrows` parameter (GH16645)
- `DataFrame.append()` can now in more cases preserve the type of the calling dataframe’s columns (e.g. if both are `CategoricalIndex`) (GH18359)
- `DataFrame.to_json()` and `Series.to_json()` now accept an `index` argument which allows the user to exclude the index from the JSON output (GH17394)
- `IntervalIndex.to_tuples()` has gained the `na_tuple` parameter to control whether NA is returned as a tuple of NA, or NA itself (GH18756)
- `Categorical.rename_categories`, `CategoricalIndex.rename_categories` and `Series.cat.rename_categories` can now take a callable as their argument (GH18862)
- `Interval` and `IntervalIndex` have gained a `length` attribute (GH18789)
- `Resampler` objects now have a functioning `pipe` method. Previously, calls to `pipe` were diverted to the `mean` method (GH17905).
- `is_scalar()` now returns `True` for `DateOffset` objects (GH18943).
- `DataFrame.pivot()` now accepts a list for the `values=` kwarg (GH17160).
- Added `pandas.api.extensions.register_dataframe_accessor()`, `pandas.api.extensions.register_series_accessor()`, and `pandas.api.extensions.register_index_accessor()`, accessors for libraries downstream of pandas to register custom accessors like `.cat` on pandas objects. See Registering Custom Accessors for more (GH14781).
- `IntervalIndex.astype` now supports conversions between subtypes when passed an `IntervalDtype` (GH19197)
- `IntervalIndex` and its associated constructor methods (`from_arrays`, `from_breaks`, `from_tuples`) have gained a `dtype` parameter (GH19262)
- Added `pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing()` and `pandas.core.groupby.SeriesGroupBy.is_monotonic_decreasing()` (GH17015)
- For subclassed `DataFrames`, `DataFrame.apply()` will now preserve the `Series` subclass (if defined) when passing the data to the applied function (GH19822)
- `DataFrame.from_dict()` now accepts a `columns` argument that can be used to specify the column names when `orient='index'` is used (GH18529)
- Added option `display.html.use_mathjax` so MathJax can be disabled when rendering tables in `Jupyter` notebooks (GH19856, GH19824)
- `DataFrame.replace()` now supports the `method` parameter, which can be used to specify the replacement method when `to_replace` is a scalar, list or tuple and `value` is `None` (GH19632)
- `Timestamp.month_name()`, `DatetimeIndex.month_name()`, and `Series.dt.month_name()` are now available (GH12805)
- `Timestamp.day_name()` and `DatetimeIndex.day_name()` are now available to return day names with a specified locale (GH12806)
- `DataFrame.to_sql()` now performs a multi-value insert if the underlying connection supports it, rather than inserting row by row. `SQLAlchemy` dialects supporting multi-value inserts include: `mysql`, `postgresql`, `sqlite` and any dialect with `supports_multivalues_insert` (GH14315, GH8953)
- `read_html()` now accepts a `displayed_only` keyword argument to control whether or not hidden elements are parsed (`True` by default) (GH20027)
- `read_html()` now reads all `<tbody>` elements in a `<table>`, not just the first (GH20690)
- `Rolling.quantile()` and `Expanding.quantile()` now accept the `interpolation` keyword, `linear` by default (GH20497)
- zip compression is supported via `compression=zip` in `DataFrame.to_pickle()`, `Series.to_pickle()`, `DataFrame.to_csv()`, `Series.to_csv()`, `DataFrame.to_json()`, `Series.to_json()` (GH17778)
- `WeekOfMonth` constructor now supports `n=0` (GH20517).
- `DataFrame` and `Series` now support the matrix multiplication (`@`) operator (GH10259) for Python >= 3.5; a short sketch follows this list
- Updated `DataFrame.to_gbq()` and `pandas.read_gbq()` signature and documentation to reflect changes from the Pandas-GBQ library version 0.4.0. Adds intersphinx mapping to Pandas-GBQ library. (GH20564)
- Added new writer for exporting Stata dta files in version 117, `StataWriter117`. This format supports exporting strings with lengths up to 2,000,000 characters (GH16450)
- `to_hdf()` and `read_hdf()` now accept an `errors` keyword argument to control encoding error handling (GH20835)
- `cut()` has gained the `duplicates='raise'|'drop'` option to control whether to raise on duplicated edges (GH20947)
- `date_range()`, `timedelta_range()`, and `interval_range()` now return a linearly spaced index if `start`, `stop`, and `periods` are specified, but `freq` is not (GH20808, GH20983, GH20976)
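
As one example from the list above, the matrix multiplication operator can be used like this (a minimal sketch; the frames are illustrative):

```
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
other = pd.DataFrame(np.ones((2, 3)), index=['a', 'b'])

# df @ other is equivalent to df.dot(other); df's columns must align
# with other's index
result = df @ other
```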

### Backwards incompatible API changes¶

#### Dependencies have increased minimum versions¶

We have updated our minimum supported versions of dependencies (GH15184). If installed, we now require:

| Package | Minimum Version | Required | Issue |
|---|---|---|---|
| python-dateutil | 2.5.0 | X | GH15184 |
| openpyxl | 2.4.0 | | GH15184 |
| beautifulsoup4 | 4.2.1 | | GH20082 |
| setuptools | 24.2.0 | | GH20698 |

#### Instantiation from dicts preserves dict insertion order for python 3.6+¶

Until Python 3.6, dicts in Python had no formally defined ordering. For Python version 3.6 and later, dicts are ordered by insertion order, see PEP 468. Pandas will use the dict’s insertion order when creating a `Series` or `DataFrame` from a dict if you’re using Python version 3.6 or higher. (GH19884)

Previous Behavior (and current behavior if on Python < 3.6):

```
pd.Series({'Income': 2000,
'Expenses': -1500,
'Taxes': -200,
'Net result': 300})
Expenses -1500
Income 2000
Net result 300
Taxes -200
dtype: int64
```

Note the Series above is ordered alphabetically by the index values.

New Behavior (for Python >= 3.6):

```
In [74]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300})
....:
Out[74]:
Income 2000
Expenses -1500
Taxes -200
Net result 300
dtype: int64
```

Notice that the Series is now ordered by insertion order. This new behavior is used for all relevant pandas types (`Series`, `DataFrame`, `SparseSeries` and `SparseDataFrame`).

If you wish to retain the old behavior while using Python >= 3.6, you can use `.sort_index()`:

```
In [75]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300}).sort_index()
....:
Out[75]:
Expenses -1500
Income 2000
Net result 300
Taxes -200
dtype: int64
```

#### Deprecate Panel¶

`Panel` was deprecated in the 0.20.x release, showing as a `DeprecationWarning`. Using `Panel` will now show a `FutureWarning`. The recommended way to represent 3-D data is with a `MultiIndex` on a `DataFrame` via the `to_frame()` method, or with the xarray package. Pandas provides a `to_xarray()` method to automate this conversion. For more details see the Deprecate Panel documentation. (GH13563, GH18324).

```
In [76]: p = tm.makePanel()
In [77]: p
Out[77]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
```

Convert to a MultiIndex DataFrame

```
In [78]: p.to_frame()
Out[78]:
ItemA ItemB ItemC
major minor
2000-01-03 A 1.474071 -0.964980 -1.197071
B 0.781836 1.846883 -0.858447
C 2.353925 -1.717693 0.384316
D -0.744471 0.901805 0.476720
2000-01-04 A -0.064034 -0.845696 -1.066969
B -1.071357 -1.328865 0.306996
C 0.583787 0.888782 1.574159
D 0.758527 1.171216 0.473424
2000-01-05 A -1.282782 -1.340896 -0.303421
B 0.441153 1.682706 -0.028665
C 0.221471 0.228440 1.588931
D 1.729689 0.520260 -0.242861
```

Convert to an xarray DataArray

```
In [79]: p.to_xarray()
Out[79]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 1.474071, 0.781836, 2.353925, -0.744471],
[-0.064034, -1.071357, 0.583787, 0.758527],
[-1.282782, 0.441153, 0.221471, 1.729689]],
[[-0.96498 , 1.846883, -1.717693, 0.901805],
[-0.845696, -1.328865, 0.888782, 1.171216],
[-1.340896, 1.682706, 0.22844 , 0.52026 ]],
[[-1.197071, -0.858447, 0.384316, 0.47672 ],
[-1.066969, 0.306996, 1.574159, 0.473424],
[-0.303421, -0.028665, 1.588931, -0.242861]]])
Coordinates:
* items (items) object 'ItemA' 'ItemB' 'ItemC'
* major_axis (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
* minor_axis (minor_axis) object 'A' 'B' 'C' 'D'
```

#### pandas.core.common removals¶

The following error & warning messages are removed from `pandas.core.common` (GH13634, GH19769):

- `PerformanceWarning`
- `UnsupportedFunctionCall`
- `UnsortedIndexError`
- `AbstractMethodError`

These are available for import from `pandas.errors` (since 0.19.0).
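
Code that imported these names from `pandas.core.common` can switch to `pandas.errors` (a minimal sketch):

```
# These imports replace the removed pandas.core.common locations
from pandas.errors import (
    AbstractMethodError,
    PerformanceWarning,
    UnsortedIndexError,
    UnsupportedFunctionCall,
)
```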

#### Changes to make output of `DataFrame.apply` consistent¶

`DataFrame.apply()` was inconsistent when applying an arbitrary user-defined function that returned a list-like with `axis=1`. Several bugs and inconsistencies are resolved. If the applied function returns a Series, then pandas will return a DataFrame; otherwise a Series will be returned. This includes the case where a list-like (e.g. a `tuple` or `list`) is returned (GH16353, GH17437, GH17970, GH17348, GH17892, GH18573, GH17602, GH18775, GH18901, GH18919).

```
In [80]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1, columns=['A', 'B', 'C'])
In [81]: df
Out[81]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
```

Previous Behavior: if the returned shape happened to match the length of original columns, this would return a `DataFrame`. If the return shape did not match, a `Series` with lists was returned.

```
In [3]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[3]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
In [4]: df.apply(lambda x: [1, 2], axis=1)
Out[4]:
0 [1, 2]
1 [1, 2]
2 [1, 2]
3 [1, 2]
4 [1, 2]
5 [1, 2]
dtype: object
```

New Behavior: When the applied function returns a list-like, this will now *always* return a `Series`.

```
In [82]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[82]:
0 [1, 2, 3]
1 [1, 2, 3]
2 [1, 2, 3]
3 [1, 2, 3]
4 [1, 2, 3]
5 [1, 2, 3]
dtype: object
In [83]: df.apply(lambda x: [1, 2], axis=1)
Out[83]:
0 [1, 2]
1 [1, 2]
2 [1, 2]
3 [1, 2]
4 [1, 2]
5 [1, 2]
dtype: object
```

To have expanded columns, you can use `result_type='expand'`:

```
In [84]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')
Out[84]:
0 1 2
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
```

To broadcast the result across the original columns (the old behaviour for list-likes of the correct length), you can use `result_type='broadcast'`. The shape must match the original columns.

```
In [85]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[85]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
```

Returning a `Series` allows one to control the exact return structure and column names:

```
In [86]: df.apply(lambda x: pd.Series([1, 2, 3], index=['D', 'E', 'F']), axis=1)
Out[86]:
D E F
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
```

#### Concatenation will no longer sort¶

In a future version of pandas `pandas.concat()` will no longer sort the non-concatenation axis when it is not already aligned. The current behavior is the same as the previous (sorting), but now a warning is issued when `sort` is not specified and the non-concatenation axis is not aligned (GH4588).

```
In [87]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=['b', 'a'])
In [88]: df2 = pd.DataFrame({"a": [4, 5]})
In [89]: pd.concat([df1, df2])
Out[89]:
a b
0 1 1.0
1 2 2.0
0 4 NaN
1 5 NaN
```

To keep the previous behavior (sorting) and silence the warning, pass `sort=True`:

```
In [90]: pd.concat([df1, df2], sort=True)
Out[90]:
a b
0 1 1.0
1 2 2.0
0 4 NaN
1 5 NaN
```

To accept the future behavior (no sorting), pass `sort=False`:
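
A minimal sketch, reusing `df1` and `df2` from above:

```
# Opt in to the future behavior: columns keep their existing order
# instead of being sorted alphabetically
pd.concat([df1, df2], sort=False)
```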

Note that this change also applies to `DataFrame.append()`, which has also received a `sort` keyword for controlling this behavior.

#### Build Changes¶

#### Index Division By Zero Fills Correctly¶

Division operations on `Index` and subclasses will now fill division of positive numbers by zero with `np.inf`, division of negative numbers by zero with `-np.inf` and 0 / 0 with `np.nan`. This matches existing `Series` behavior. (GH19322, GH19347)

Previous Behavior:

```
In [6]: index = pd.Int64Index([-1, 0, 1])
In [7]: index / 0
Out[7]: Int64Index([0, 0, 0], dtype='int64')
# Previous behavior yielded different results depending on the type of zero in the divisor
In [8]: index / 0.0
Out[8]: Float64Index([-inf, nan, inf], dtype='float64')
In [9]: index = pd.UInt64Index([0, 1])
In [10]: index / np.array([0, 0], dtype=np.uint64)
Out[10]: UInt64Index([0, 0], dtype='uint64')
In [11]: pd.RangeIndex(1, 5) / 0
ZeroDivisionError: integer division or modulo by zero
```

Current Behavior:

```
In [91]: index = pd.Int64Index([-1, 0, 1])
# division by zero gives -infinity where negative, +infinity where positive, and NaN for 0 / 0
In [92]: index / 0
Out[92]: Float64Index([-inf, nan, inf], dtype='float64')
# The result of division by zero should not depend on whether the zero is int or float
In [93]: index / 0.0
Out[93]: Float64Index([-inf, nan, inf], dtype='float64')
In [94]: index = pd.UInt64Index([0, 1])
In [95]: index / np.array([0, 0], dtype=np.uint64)
Out[95]: Float64Index([nan, inf], dtype='float64')
In [96]: pd.RangeIndex(1, 5) / 0
Out[96]: Float64Index([inf, inf, inf, inf], dtype='float64')
```

#### Extraction of matching patterns from strings¶

By default, extracting matching patterns from strings with `str.extract()` used to return a `Series` if a single group was being extracted (a `DataFrame` if more than one group was extracted). As of Pandas 0.23.0 `str.extract()` always returns a `DataFrame`, unless `expand` is set to `False`. Finally, `None` was an accepted value for the `expand` parameter (which was equivalent to `False`), but now raises a `ValueError`. (GH11386)

Previous Behavior:

```
In [1]: s = pd.Series(['number 10', '12 eggs'])
In [2]: extracted = s.str.extract('.*(\d\d).*')
In [3]: extracted
Out [3]:
0 10
1 12
dtype: object
In [4]: type(extracted)
Out [4]:
pandas.core.series.Series
```

New Behavior:

```
In [97]: s = pd.Series(['number 10', '12 eggs'])
In [98]: extracted = s.str.extract('.*(\d\d).*')
In [99]: extracted
Out[99]:
0
0 10
1 12
In [100]: type(extracted)
Out[100]: pandas.core.frame.DataFrame
```

To restore previous behavior, simply set `expand` to `False`:

```
In [101]: s = pd.Series(['number 10', '12 eggs'])
In [102]: extracted = s.str.extract('.*(\d\d).*', expand=False)
In [103]: extracted
Out[103]:
0 10
1 12
dtype: object
In [104]: type(extracted)
Out[104]: pandas.core.series.Series
```

#### Default value for the `ordered` parameter of `CategoricalDtype`¶

The default value of the `ordered` parameter for `CategoricalDtype` has changed from `False` to `None` to allow updating of `categories` without impacting `ordered`. Behavior should remain consistent for downstream objects, such as `Categorical` (GH18790)

In previous versions, the default value for the `ordered` parameter was `False`. This could potentially lead to the `ordered` parameter unintentionally being changed from `True` to `False` when users attempt to update `categories` if `ordered` is not explicitly specified, as it would silently default to `False`. The new behavior for `ordered=None` is to retain the existing value of `ordered`.

New Behavior:

```
In [105]: from pandas.api.types import CategoricalDtype
In [106]: cat = pd.Categorical(list('abcaba'), ordered=True, categories=list('cba'))
In [107]: cat
Out[107]:
[a, b, c, a, b, a]
Categories (3, object): [c < b < a]
In [108]: cdt = CategoricalDtype(categories=list('cbad'))
In [109]: cat.astype(cdt)
Out[109]:
[a, b, c, a, b, a]
Categories (4, object): [c < b < a < d]
```

Notice in the example above that the converted `Categorical` has retained `ordered=True`. Had the default value for `ordered` remained as `False`, the converted `Categorical` would have become unordered, despite `ordered=False` never being explicitly specified. To change the value of `ordered`, explicitly pass it to the new dtype, e.g. `CategoricalDtype(categories=list('cbad'), ordered=False)`.

Note that the unintentional conversion of `ordered` discussed above did not arise in previous versions due to separate bugs that prevented `astype` from doing any type of category-to-category conversion (GH10696, GH18593). These bugs have been fixed in this release, and motivated changing the default value of `ordered`.

#### Better pretty-printing of DataFrames in a terminal¶

Previously, the default value for the maximum number of columns was `pd.options.display.max_columns=20`. This meant that relatively wide data frames would not fit within the terminal width, and pandas would introduce line breaks to display these 20 columns. This resulted in an output that was relatively difficult to read.

If Python runs in a terminal, the maximum number of columns is now determined automatically so that the printed data frame fits within the current terminal width (`pd.options.display.max_columns=0`) (GH17023). If Python runs as a Jupyter kernel (such as the Jupyter QtConsole or a Jupyter notebook, as well as in many IDEs), this value cannot be inferred automatically and is thus set to 20 as in previous versions. In a terminal, this results in a much nicer output.

Note that if you don’t like the new default, you can always set this option yourself. To revert to the old setting, you can run this line:

```
pd.options.display.max_columns = 20
```

#### Datetimelike API Changes¶

- The default `Timedelta` constructor now accepts an `ISO 8601 Duration` string as an argument (GH19040); a short sketch follows this list
- Subtracting `NaT` from a `Series` with `dtype='datetime64[ns]'` returns a `Series` with `dtype='timedelta64[ns]'` instead of `dtype='datetime64[ns]'` (GH18808)
- Addition or subtraction of `NaT` from `TimedeltaIndex` will return `TimedeltaIndex` instead of `DatetimeIndex` (GH19124)
- `DatetimeIndex.shift()` and `TimedeltaIndex.shift()` will now raise `NullFrequencyError` (which subclasses `ValueError`, which was raised in older versions) when the index object frequency is `None` (GH19147)
- Addition and subtraction of `NaN` from a `Series` with `dtype='timedelta64[ns]'` will raise a `TypeError` instead of treating the `NaN` as `NaT` (GH19274)
- `NaT` division with `datetime.timedelta` will now return `NaN` instead of raising (GH17876)
- Operations between a `Series` with `dtype='datetime64[ns]'` and a `PeriodIndex` will correctly raise `TypeError` (GH18850)
- Subtraction of `Series` with timezone-aware `dtype='datetime64[ns]'` with mismatched timezones will raise `TypeError` instead of `ValueError` (GH18817)
- `Timestamp` will no longer silently ignore unused or invalid `tz` or `tzinfo` keyword arguments (GH17690)
- `Timestamp` will no longer silently ignore invalid `freq` arguments (GH5168)
- `CacheableOffset` and `WeekDay` are no longer available in the `pandas.tseries.offsets` module (GH17830)
- `pandas.tseries.frequencies.get_freq_group()` and `pandas.tseries.frequencies.DAYS` are removed from the public API (GH18034)
- `Series.truncate()` and `DataFrame.truncate()` will raise a `ValueError` if the index is not sorted instead of an unhelpful `KeyError` (GH17935)
- `Series.first` and `DataFrame.first` will now raise a `TypeError` rather than `NotImplementedError` when the index is not a `DatetimeIndex` (GH20725).
- `Series.last` and `DataFrame.last` will now raise a `TypeError` rather than `NotImplementedError` when the index is not a `DatetimeIndex` (GH20725).
- Restricted `DateOffset` keyword arguments. Previously, `DateOffset` subclasses allowed arbitrary keyword arguments which could lead to unexpected behavior. Now, only valid arguments will be accepted (GH17176, GH18226).
- `pandas.merge()` provides a more informative error message when trying to merge on timezone-aware and timezone-naive columns (GH15800)
- For `DatetimeIndex` and `TimedeltaIndex` with `freq=None`, addition or subtraction of an integer-dtyped array or `Index` will raise `NullFrequencyError` instead of `TypeError` (GH19895)
- The `Timestamp` constructor now accepts a `nanosecond` keyword or positional argument (GH18898)
- `DatetimeIndex` will now raise an `AttributeError` when the `tz` attribute is set after instantiation (GH3746)
- `DatetimeIndex` with a `pytz` timezone will now return a consistent `pytz` timezone (GH18595)
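
For instance, the new `Timedelta` constructor behavior from the first item can be sketched as follows (the duration string is illustrative):

```
import pandas as pd

# An ISO 8601 duration: 1 day, 12 hours, 30 minutes, 5 seconds
pd.Timedelta('P1DT12H30M5S')   # Timedelta('1 days 12:30:05')
```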

#### Other API Changes¶

- `Series.astype()` and `Index.astype()` with an incompatible dtype will now raise a `TypeError` rather than a `ValueError` (GH18231)
- `Series` construction with an `object` dtyped tz-aware datetime and `dtype=object` specified will now return an `object` dtyped `Series`; previously this would infer the datetime dtype (GH18231)
- A `Series` of `dtype=category` constructed from an empty `dict` will now have categories of `dtype=object` rather than `dtype=float64`, consistently with the case in which an empty list is passed (GH18515)
- All-NaN levels in a `MultiIndex` are now assigned `float` rather than `object` dtype, promoting consistency with `Index` (GH17929).
- Level names of a `MultiIndex` (when not None) are now required to be unique: trying to create a `MultiIndex` with repeated names will raise a `ValueError` (GH18872)
- Both construction and renaming of `Index`/`MultiIndex` with non-hashable `name`/`names` will now raise `TypeError` (GH20527)
- `Index.map()` can now accept `Series` and dictionary input objects (GH12756, GH18482, GH18509).
- `DataFrame.unstack()` will now default to filling with `np.nan` for `object` columns (GH12815)
- `IntervalIndex` constructor will raise if the `closed` parameter conflicts with how the input data is inferred to be closed (GH18421)
- Inserting missing values into indexes will work for all types of indexes and automatically insert the correct type of missing value (`NaN`, `NaT`, etc.) regardless of the type passed in (GH18295)
- When created with duplicate labels, `MultiIndex` now raises a `ValueError` (GH17464)
- `Series.fillna()` now raises a `TypeError` instead of a `ValueError` when passed a list, tuple or DataFrame as a `value` (GH18293)
- `pandas.DataFrame.merge()` no longer casts a `float` column to `object` when merging on `int` and `float` columns (GH16572)
- `pandas.merge()` now raises a `ValueError` when trying to merge on incompatible data types (GH9780)
- The default NA value for `UInt64Index` has changed from 0 to `NaN`, which impacts methods that mask with NA, such as `UInt64Index.where()` (GH18398)
- Refactored `setup.py` to use `find_packages` instead of explicitly listing out all subpackages (GH18535)
- Rearranged the order of keyword arguments in `read_excel()` to align with `read_csv()` (GH16672)
- `wide_to_long()` previously kept numeric-like suffixes as `object` dtype. Now they are cast to numeric if possible (GH17627)
- In `read_excel()`, the `comment` argument is now exposed as a named parameter (GH18735)
- The options `html.border` and `mode.use_inf_as_null` were deprecated in prior versions; these will now show a `FutureWarning` rather than a `DeprecationWarning` (GH19003)
- `IntervalIndex` and `IntervalDtype` no longer support categorical, object, and string subtypes (GH19016)
- `IntervalDtype` now returns `True` when compared against `'interval'` regardless of subtype, and `IntervalDtype.name` now returns `'interval'` regardless of subtype (GH18980)
- `KeyError` now raises instead of `ValueError` in `drop()` when dropping a non-existent element in an axis with duplicates (GH19186)
- `Series.to_csv()` now accepts a `compression` argument that works in the same way as the `compression` argument in `DataFrame.to_csv()` (GH18958)
- Set operations (union, difference…) on `IntervalIndex` with incompatible index types will now raise a `TypeError` rather than a `ValueError` (GH19329)
- `DateOffset` objects render more simply, e.g. `<DateOffset: days=1>` instead of `<DateOffset: kwds={'days': 1}>` (GH19403)
- `Categorical.fillna` now validates its `value` and `method` keyword arguments. It now raises when both or none are specified, matching the behavior of `Series.fillna()` (GH19682)
- `pd.to_datetime('today')` now returns a datetime, consistent with `pd.Timestamp('today')`; previously `pd.to_datetime('today')` returned a `.normalized()` datetime (GH19935)
- `Series.str.replace()` now takes an optional `regex` keyword which, when set to `False`, uses literal string replacement rather than regex replacement (GH16808); a short sketch follows this list
- `DatetimeIndex.strftime()` and `PeriodIndex.strftime()` now return an `Index` instead of a numpy array to be consistent with similar accessors (GH20127)
- Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (GH19714, GH20391).
- `DataFrame.to_dict()` with `orient='index'` no longer casts int columns to float for a DataFrame with only int and float columns (GH18580)
- A user-defined function that is passed to `Series.rolling().aggregate()`, `DataFrame.rolling().aggregate()`, or its expanding cousins, will now *always* be passed a `Series`, rather than a `np.array`; `.apply()` only has the `raw` keyword, see here. This is consistent with the signatures of `.aggregate()` across pandas (GH20584)
- Rolling and Expanding types raise `NotImplementedError` upon iteration (GH11704).
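
As an example of the new `Series.str.replace()` keyword mentioned above (a minimal sketch):

```
import pandas as pd

s = pd.Series(['foo.bar'])

# With regex=False the pattern is treated literally...
s.str.replace('.', '_', regex=False)   # gives 'foo_bar'

# ...whereas the default regex behavior matches any character
s.str.replace('.', '_')                # gives '_______'
```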

### Deprecations¶

- `Series.from_array` and `SparseSeries.from_array` are deprecated. Use the normal constructor `Series(..)` and `SparseSeries(..)` instead (GH18213).
- `DataFrame.as_matrix` is deprecated. Use `DataFrame.values` instead (GH18458).
- `Series.asobject`, `DatetimeIndex.asobject`, `PeriodIndex.asobject` and `TimeDeltaIndex.asobject` have been deprecated. Use `.astype(object)` instead (GH18572)
- Grouping by a tuple of keys now emits a `FutureWarning` and is deprecated. In the future, a tuple passed to `'by'` will always refer to a single key that is the actual tuple, instead of treating the tuple as multiple keys. To retain the previous behavior, use a list instead of a tuple (GH18314)
- `Series.valid` is deprecated. Use `Series.dropna()` instead (GH18800).
- `read_excel()` has deprecated the `skip_footer` parameter. Use `skipfooter` instead (GH18836)
- `ExcelFile.parse()` has deprecated `sheetname` in favor of `sheet_name` for consistency with `read_excel()` (GH20920).
- The `is_copy` attribute is deprecated and will be removed in a future version (GH18801).
- `IntervalIndex.from_intervals` is deprecated in favor of the `IntervalIndex` constructor (GH19263)
- `DataFrame.from_items` is deprecated. Use `DataFrame.from_dict()` instead, or `DataFrame.from_dict(OrderedDict())` if you wish to preserve the key order (GH17320, GH17312); a short sketch follows this list
- Indexing a `MultiIndex` or a `FloatIndex` with a list containing some missing keys will now show a `FutureWarning`, which is consistent with other types of indexes (GH17758).
- The `broadcast` parameter of `.apply()` is deprecated in favor of `result_type='broadcast'` (GH18577)
- The `reduce` parameter of `.apply()` is deprecated in favor of `result_type='reduce'` (GH18577)
- The `order` parameter of `factorize()` is deprecated and will be removed in a future release (GH19727)
- `Timestamp.weekday_name`, `DatetimeIndex.weekday_name`, and `Series.dt.weekday_name` are deprecated in favor of `Timestamp.day_name()`, `DatetimeIndex.day_name()`, and `Series.dt.day_name()` (GH12806)
- `pandas.tseries.plotting.tsplot` is deprecated. Use `Series.plot()` instead (GH18627)
- `Index.summary()` is deprecated and will be removed in a future version (GH18217)
- `NDFrame.get_ftype_counts()` is deprecated and will be removed in a future version (GH18243)
- The `convert_datetime64` parameter in `DataFrame.to_records()` has been deprecated and will be removed in a future version. The NumPy bug motivating this parameter has been resolved. The default value for this parameter has also changed from `True` to `None` (GH18160).
- `Series.rolling().apply()`, `DataFrame.rolling().apply()`, `Series.expanding().apply()`, and `DataFrame.expanding().apply()` have deprecated passing an `np.array` by default. One will need to pass the new `raw` parameter to be explicit about what is passed (GH20584)
- The `data`, `base`, `strides`, `flags` and `itemsize` properties of the `Series` and `Index` classes have been deprecated and will be removed in a future version (GH20419).
- `DatetimeIndex.offset` is deprecated. Use `DatetimeIndex.freq` instead (GH20716)
- Floor division between an integer ndarray and a `Timedelta` is deprecated. Divide by `Timedelta.value` instead (GH19761)
- Setting `PeriodIndex.freq` (which was not guaranteed to work correctly) is deprecated. Use `PeriodIndex.asfreq()` instead (GH20678)
- `Index.get_duplicates()` is deprecated and will be removed in a future version (GH20239)
- The previous default behavior of negative indices in `Categorical.take` is deprecated. In a future version it will change from meaning missing values to meaning positional indices from the right. The future behavior is consistent with `Series.take()` (GH20664).
- Passing multiple axes to the `axis` parameter in `DataFrame.dropna()` has been deprecated and will be removed in a future version (GH20987)
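
For the `DataFrame.from_items` deprecation above, the suggested replacement looks like this (a minimal sketch):

```
from collections import OrderedDict

import pandas as pd

# Replacement for the deprecated DataFrame.from_items; OrderedDict
# preserves the key (column) order
df = pd.DataFrame.from_dict(OrderedDict([('A', [1, 2]), ('B', [3, 4])]))
```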

### Removal of prior version deprecations/changes¶

- Warnings against the obsolete usage `Categorical(codes, categories)`, which were emitted for instance when the first two arguments to `Categorical()` had different dtypes, and recommended the use of `Categorical.from_codes`, have now been removed (GH8074)
- The `levels` and `labels` attributes of a `MultiIndex` can no longer be set directly (GH4039).
- `pd.tseries.util.pivot_annual` has been removed (deprecated since v0.19). Use `pivot_table` instead (GH18370)
- `pd.tseries.util.isleapyear` has been removed (deprecated since v0.19). Use the `.is_leap_year` property in Datetime-likes instead (GH18370)
- `pd.ordered_merge` has been removed (deprecated since v0.19). Use `pd.merge_ordered` instead (GH18459)
- The `SparseList` class has been removed (GH14007)
- The `pandas.io.wb` and `pandas.io.data` stub modules have been removed (GH13735)
- `Categorical.from_array` has been removed (GH13854)
- The `freq` and `how` parameters have been removed from the `rolling`/`expanding`/`ewm` methods of DataFrame and Series (deprecated since v0.18). Instead, resample before calling the methods. (GH18601 & GH18668)
- `DatetimeIndex.to_datetime`, `Timestamp.to_datetime`, `PeriodIndex.to_datetime`, and `Index.to_datetime` have been removed (GH8254, GH14096, GH14113)
- `read_csv()` has dropped the `skip_footer` parameter (GH13386)
- `read_csv()` has dropped the `as_recarray` parameter (GH13373)
- `read_csv()` has dropped the `buffer_lines` parameter (GH13360)
- `read_csv()` has dropped the `compact_ints` and `use_unsigned` parameters (GH13323)
- The `Timestamp` class has dropped the `offset` attribute in favor of `freq` (GH13593)
- The `Series`, `Categorical`, and `Index` classes have dropped the `reshape` method (GH13012)
- `pandas.tseries.frequencies.get_standard_freq` has been removed in favor of `pandas.tseries.frequencies.to_offset(freq).rule_code` (GH13874)
- The `freqstr` keyword has been removed from `pandas.tseries.frequencies.to_offset` in favor of `freq` (GH13874)
- The `Panel4D` and `PanelND` classes have been removed (GH13776)
- The `Panel` class has dropped the `to_long` and `toLong` methods (GH19077)
- The options `display.line_width` and `display.height` are removed in favor of `display.width` and `display.max_rows` respectively (GH4391, GH19107)
- The `labels` attribute of the `Categorical` class has been removed in favor of `Categorical.codes` (GH7768)
- The `flavor` parameter has been removed from the `to_sql()` method (GH13611)
- The modules `pandas.tools.hashing` and `pandas.util.hashing` have been removed (GH16223)
- The top-level functions `pd.rolling_*`, `pd.expanding_*` and `pd.ewm*` have been removed (deprecated since v0.18). Instead, use the DataFrame/Series methods `rolling`, `expanding` and `ewm` (GH18723); a short sketch follows this list
- Imports from `pandas.core.common` for functions such as `is_datetime64_dtype` are now removed. These are located in `pandas.api.types`. (GH13634, GH19769)
- The `infer_dst` keyword in `Series.tz_localize()`, `DatetimeIndex.tz_localize()` and `DatetimeIndex` have been removed. `infer_dst=True` is equivalent to `ambiguous='infer'`, and `infer_dst=False` to `ambiguous='raise'` (GH7963).
- When `.resample()` was changed from an eager to a lazy operation, like `.groupby()` in v0.18.0, we put in place compatibility (with a `FutureWarning`), so operations would continue to work. This is now fully removed, so a `Resampler` will no longer forward compat operations (GH20554)
- Removed the long-deprecated `axis=None` parameter from `.replace()` (GH20271)
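
For the removed top-level `pd.rolling_*`/`pd.expanding_*`/`pd.ewm*` functions, the method form looks like this (a minimal sketch):

```
import pandas as pd

s = pd.Series(range(5))

# Method equivalents of the removed top-level functions
s.rolling(window=2).mean()   # replaces pd.rolling_mean(s, window=2)
s.expanding().sum()          # replaces pd.expanding_sum(s)
s.ewm(span=3).mean()         # replaces pd.ewma(s, span=3)
```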

### Performance Improvements¶

- Indexers on `Series` or `DataFrame` no longer create a reference cycle (GH17956)
- Added a keyword argument, `cache`, to `to_datetime()` that improved the performance of converting duplicate datetime arguments (GH11665); see the sketch after this list
- `DateOffset` arithmetic performance is improved (GH18218)
- Converting a `Series` of `Timedelta` objects to days, seconds, etc. was sped up through vectorization of the underlying methods (GH18092)
- Improved performance of `.map()` with a `Series`/`dict` input (GH15081)
- The overridden `Timedelta` properties of days, seconds and microseconds have been removed, leveraging their built-in Python versions instead (GH18242)
- `Series` construction will reduce the number of copies made of the input data in certain cases (GH17449)
- Improved performance of `Series.dt.date()` and `DatetimeIndex.date()` (GH18058)
- Improved performance of `Series.dt.time()` and `DatetimeIndex.time()` (GH18461)
- Improved performance of `IntervalIndex.symmetric_difference()` (GH18475)
- Improved performance of `DatetimeIndex` and `Series` arithmetic operations with Business-Month and Business-Quarter frequencies (GH18489)
- `Series()`/`DataFrame()` tab completion limits to 100 values, for better performance (GH18587)
- Improved performance of `DataFrame.median()` with `axis=1` when bottleneck is not installed (GH16468)
- Improved performance of `MultiIndex.get_loc()` for large indexes, at the cost of a reduction in performance for small ones (GH18519)
- Improved performance of `MultiIndex.remove_unused_levels()` when there are no unused levels, at the cost of a reduction in performance when there are (GH19289)
- Improved performance of `Index.get_loc()` for non-unique indexes (GH19478)
- Improved performance of pairwise `.rolling()` and `.expanding()` with `.cov()` and `.corr()` operations (GH17917)
- Improved performance of `pandas.core.groupby.GroupBy.rank()` (GH15779)
- Improved performance of variable `.rolling()` on `.min()` and `.max()` (GH19521)
- Improved performance of `pandas.core.groupby.GroupBy.ffill()` and `pandas.core.groupby.GroupBy.bfill()` (GH11296)
- Improved performance of `pandas.core.groupby.GroupBy.any()` and `pandas.core.groupby.GroupBy.all()` (GH15435)
- Improved performance of `pandas.core.groupby.GroupBy.pct_change()` (GH19165)
- Improved performance of `Series.isin()` in the case of categorical dtypes (GH20003)
- Improved performance of `getattr(Series, attr)` when the Series has certain index types. This manifested in slow printing of large Series with a `DatetimeIndex` (GH19764)
- Fixed a performance regression for `GroupBy.nth()` and `GroupBy.last()` with some object columns (GH19283)
- Improved performance of `pandas.core.arrays.Categorical.from_codes()` (GH18501)
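
As a minimal sketch of the `cache` keyword named above (the data here is illustrative, not taken from the release notes):

```
import pandas as pd

# Many repeated strings: with cache=True, each distinct string is parsed
# once and the result is reused for its duplicates (GH11665).
dates = ['2017-01-01', '2017-01-02'] * 100000

parsed = pd.to_datetime(dates, cache=True)
print(parsed[:2])
```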

### Documentation Changes¶

Thanks to all of the contributors who participated in the Pandas Documentation Sprint, which took place on March 10th. We had about 500 participants from over 30 locations across the world. You should notice that many of the API docstrings have greatly improved.

There were too many simultaneous contributions to include a release note for each improvement, but this GitHub search should give you an idea of how many docstrings were improved.

Special thanks to Marc Garcia for organizing the sprint. For more information, read the NumFOCUS blogpost recapping the sprint.

- Changed spelling of “numpy” to “NumPy”, and “python” to “Python”. (GH19017)
- Consistency when introducing code samples, using either colon or period. Rewrote some sentences for greater clarity, added more dynamic references to functions, methods and classes. (GH18941, GH18948, GH18973, GH19017)
- Added a reference to `DataFrame.assign()` in the concatenate section of the merging documentation (GH18665)

### Bug Fixes¶

#### Categorical¶

Warning

A class of bugs was introduced in pandas 0.21 with `CategoricalDtype` that affects the correctness of operations like `merge`, `concat`, and indexing when comparing multiple unordered `Categorical` arrays that have the same categories, but in a different order. We highly recommend upgrading or manually aligning your categories before doing these operations.
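
For example, one way to align the categories by hand before such operations (a sketch with invented data, using `Categorical.set_categories`):

```
import pandas as pd

a = pd.Categorical(['a', 'b'], categories=['a', 'b'])
b = pd.Categorical(['a', 'b'], categories=['b', 'a'])  # same categories, different order

# Reorder b's categories to match a's before merge/concat/indexing comparisons.
b_aligned = b.set_categories(a.categories)
print(list(b_aligned.categories))  # ['a', 'b']
```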

- Bug in `Categorical.equals` returning the wrong result when comparing two unordered `Categorical` arrays with the same categories, but in a different order (GH16603)
- Bug in `pandas.api.types.union_categoricals()` returning the wrong result for unordered categoricals with the categories in a different order. This affected `pandas.concat()` with Categorical data (GH19096)
- Bug in `pandas.merge()` returning the wrong result when joining on an unordered `Categorical` that had the same categories but in a different order (GH19551)
- Bug in `CategoricalIndex.get_indexer()` returning the wrong result when `target` was an unordered `Categorical` that had the same categories as `self` but in a different order (GH19551)
- Bug in `Index.astype()` with a categorical dtype where the resultant index is not converted to a `CategoricalIndex` for all types of index (GH18630)
- Bug in `Series.astype()` and `Categorical.astype()` where existing categorical data does not get updated (GH10696, GH18593)
- Bug in `Series.str.split()` with `expand=True` incorrectly raising an `IndexError` on empty strings (GH20002)
- Bug in the `Index` constructor with `dtype=CategoricalDtype(...)` where `categories` and `ordered` are not maintained (GH19032)
- Bug in the `Series` constructor with scalar and `dtype=CategoricalDtype(...)` where `categories` and `ordered` are not maintained (GH19565)
- Bug in `Categorical.__iter__` not converting to Python types (GH19909)
- Bug in `pandas.factorize()` returning the unique codes for the `uniques`. This now returns a `Categorical` with the same dtype as the input (GH19721)
- Bug in `pandas.factorize()` including an item for missing values in the `uniques` return value (GH19721)
- Bug in `Series.take()` with categorical data interpreting `-1` in the indices as missing value markers, rather than the last element of the Series (GH20664)

#### Datetimelike¶

- Bug in `Series.__sub__()` where subtracting a non-nanosecond `np.datetime64` object from a `Series` gave incorrect results (GH7996)
- Bug in `DatetimeIndex` and `TimedeltaIndex` where addition and subtraction of zero-dimensional integer arrays gave incorrect results (GH19012)
- Bug in `DatetimeIndex` and `TimedeltaIndex` where adding or subtracting an array-like of `DateOffset` objects either raised (`np.array`, `pd.Index`) or broadcast incorrectly (`pd.Series`) (GH18849)
- Bug in `Series.__add__()` where adding a Series with dtype `timedelta64[ns]` to a timezone-aware `DatetimeIndex` incorrectly dropped timezone information (GH13905)
- Adding a `Period` object to a `datetime` or `Timestamp` object will now correctly raise a `TypeError` (GH17983)
- Bug in `Timestamp` where comparison with an array of `Timestamp` objects would result in a `RecursionError` (GH15183)
- Bug in `Series` floor-division where operating on a scalar `timedelta` raises an exception (GH18846)
- Bug in `DatetimeIndex` where the repr was not showing high-precision time values at the end of a day (e.g., 23:59:59.999999999) (GH19030)
- Bug in `.astype()` to non-ns timedelta units where the result held the incorrect dtype (GH19176, GH19223, GH12425)
- Bug in subtracting a `Series` from `NaT` incorrectly returning `NaT` (GH19158)
- Bug in `Series.truncate()` which raises `TypeError` with a monotonic `PeriodIndex` (GH17717)
- Bug in `pct_change()` where using `periods` and `freq` returned different length outputs (GH7292)
- Bug in comparison of `DatetimeIndex` against `None` or `datetime.date` objects raising `TypeError` for `==` and `!=` comparisons instead of all-`False` and all-`True`, respectively (GH19301)
- Bug in `Timestamp` and `to_datetime()` where a string representing a barely out-of-bounds timestamp would be incorrectly rounded down instead of raising `OutOfBoundsDatetime` (GH19382)
- Bug in `Timestamp.floor()` and `DatetimeIndex.floor()` where timestamps far in the future and past were not rounded correctly (GH19206)
- Bug in `to_datetime()` where passing an out-of-bounds datetime with `errors='coerce'` and `utc=True` would raise `OutOfBoundsDatetime` instead of parsing to `NaT` (GH19612)
- Bug in `DatetimeIndex` and `TimedeltaIndex` addition and subtraction where the name of the returned object was not always set consistently (GH19744)
- Bug in `DatetimeIndex` and `TimedeltaIndex` addition and subtraction where operations with numpy arrays raised `TypeError` (GH19847)
- Bug in `DatetimeIndex` and `TimedeltaIndex` where setting the `freq` attribute was not fully supported (GH20678)

#### Timedelta¶

- Bug in `Timedelta.__mul__()` where multiplying by `NaT` returned `NaT` instead of raising a `TypeError` (GH19819)
- Bug in `Series` with `dtype='timedelta64[ns]'` where addition or subtraction of `TimedeltaIndex` had results cast to `dtype='int64'` (GH17250)
- Bug in `Series` with `dtype='timedelta64[ns]'` where addition or subtraction of `TimedeltaIndex` could return a `Series` with an incorrect name (GH19043)
- Bug in `Timedelta.__floordiv__()` and `Timedelta.__rfloordiv__()` where dividing by many incompatible numpy objects was incorrectly allowed (GH18846)
- Bug where dividing a scalar timedelta-like object with `TimedeltaIndex` performed the reciprocal operation (GH19125)
- Bug in `TimedeltaIndex` where division by a `Series` would return a `TimedeltaIndex` instead of a `Series` (GH19042)
- Bug in `Timedelta.__add__()`, `Timedelta.__sub__()` where adding or subtracting a `np.timedelta64` object would return another `np.timedelta64` instead of a `Timedelta` (GH19738)
- Bug in `Timedelta.__floordiv__()`, `Timedelta.__rfloordiv__()` where operating with a `Tick` object would raise a `TypeError` instead of returning a numeric value (GH19738)
- Bug in `Period.asfreq()` where periods near `datetime(1, 1, 1)` could be converted incorrectly (GH19643, GH19834)
- Bug in `Timedelta.total_seconds()` causing precision errors, for example `Timedelta('30S').total_seconds() == 30.000000000000004` (GH19458)
- Bug in `Timedelta.__rmod__()` where operating with a `numpy.timedelta64` returned a `timedelta64` object instead of a `Timedelta` (GH19820)
- Multiplication of `TimedeltaIndex` by `TimedeltaIndex` will now raise `TypeError` instead of raising `ValueError` in cases of length mismatch (GH19333)
- Bug in indexing a `TimedeltaIndex` with a `np.timedelta64` object which was raising a `TypeError` (GH20393)

#### Timezones¶

- Bug in creating a `Series` from an array that contains both tz-naive and tz-aware values, where the resulting `Series` had a tz-aware dtype instead of object (GH16406)
- Bug in comparison of timezone-aware `DatetimeIndex` against `NaT` incorrectly raising `TypeError` (GH19276)
- Bug in `DatetimeIndex.astype()` when converting between timezone-aware dtypes, and converting from timezone-aware to naive (GH18951)
- Bug in comparing `DatetimeIndex`, which failed to raise `TypeError` when attempting to compare timezone-aware and timezone-naive datetimelike objects (GH18162)
- Bug in localization of a naive, datetime string in a `Series` constructor with a `datetime64[ns, tz]` dtype (GH17415)
- `Timestamp.replace()` will now handle Daylight Savings transitions gracefully (GH18319)
- Bug in tz-aware `DatetimeIndex` where addition/subtraction with a `TimedeltaIndex` or array with `dtype='timedelta64[ns]'` was incorrect (GH17558)
- Bug in `DatetimeIndex.insert()` where inserting `NaT` into a timezone-aware index incorrectly raised (GH16357)
- Bug in the `DataFrame` constructor, where a tz-aware `DatetimeIndex` and a given column name would result in an empty `DataFrame` (GH19157)
- Bug in `Timestamp.tz_localize()` where localizing a timestamp near the minimum or maximum valid values could overflow and return a timestamp with an incorrect nanosecond value (GH12677)
- Bug when iterating over a `DatetimeIndex` that was localized with a fixed timezone offset, which rounded nanosecond precision to microseconds (GH19603)
- Bug in `DataFrame.diff()` that raised an `IndexError` with tz-aware values (GH18578)
- Bug in `melt()` that converted tz-aware dtypes to tz-naive (GH15785)
- Bug in `DataFrame.count()` that raised a `ValueError` if `DataFrame.dropna()` was called for a single column with timezone-aware values (GH13407)

#### Offsets¶

- Bug in `WeekOfMonth` and `Week` where addition and subtraction did not roll correctly (GH18510, GH18672, GH18864)
- Bug in `WeekOfMonth` and `LastWeekOfMonth` where default keyword arguments for the constructor raised `ValueError` (GH19142)
- Bug in `FY5253Quarter`, `LastWeekOfMonth` where rollback and rollforward behavior was inconsistent with addition and subtraction behavior (GH18854)
- Bug in `FY5253` where `datetime` addition and subtraction incremented incorrectly for dates on the year-end but not normalized to midnight (GH18854)
- Bug in `FY5253` where date offsets could incorrectly raise an `AssertionError` in arithmetic operations (GH14774)

#### Numeric¶

- Bug in the `Series` constructor with an int or float list where specifying `dtype=str`, `dtype='str'` or `dtype='U'` failed to convert the data elements to strings (GH16605)
- Bug in `Index` multiplication and division methods where operating with a `Series` would return an `Index` object instead of a `Series` object (GH19042)
- Bug in the `DataFrame` constructor in which data containing very large positive or very large negative numbers was causing `OverflowError` (GH18584)
- Bug in the `Index` constructor with `dtype='uint64'` where int-like floats were not coerced to `UInt64Index` (GH18400)
- Bug in `DataFrame` flex arithmetic (e.g. `df.add(other, fill_value=foo)`) with a `fill_value` other than `None` which failed to raise `NotImplementedError` in corner cases where either the frame or `other` has length zero (GH19522)
- Multiplication and division of numeric-dtyped `Index` objects with timedelta-like scalars returns `TimedeltaIndex` instead of raising `TypeError` (GH19333)
- Bug where `NaN` was returned instead of 0 by `Series.pct_change()` and `DataFrame.pct_change()` when `fill_method` is not `None` (GH19873)

#### Strings¶

- Bug in `Series.str.get()` with a dictionary in the values and the index not in the keys, raising `KeyError` (GH20671)

#### Indexing¶

- Bug in `Index` construction from a list of mixed type tuples (GH18505)
- Bug in `Index.drop()` when passing a list of both tuples and non-tuples (GH18304)
- Bug in `DataFrame.drop()`, `Panel.drop()`, `Series.drop()`, `Index.drop()` where no `KeyError` is raised when dropping a non-existent element from an axis that contains duplicates (GH19186)
- Bug in indexing a datetimelike `Index` that raised `ValueError` instead of `IndexError` (GH18386)
- `Index.to_series()` now accepts `index` and `name` kwargs (GH18699)
- `DatetimeIndex.to_series()` now accepts `index` and `name` kwargs (GH18699)
- Bug in indexing a non-scalar value from a `Series` with a non-unique `Index`, which returned the value flattened (GH17610)
- Bug in indexing with an iterator containing only missing keys, which raised no error (GH20748)
- Fixed inconsistency in `.ix` between list and scalar keys when the index has integer dtype and does not include the desired keys (GH20753)
- Bug in `__setitem__` when indexing a `DataFrame` with a 2-d boolean ndarray (GH18582)
- Bug in `str.extractall` where an empty `Index` was returned instead of the appropriate `MultiIndex` when there were no matches (GH19034)
- Bug in `IntervalIndex` where empty and purely NA data was constructed inconsistently depending on the construction method (GH18421)
- Bug in `IntervalIndex.symmetric_difference()` where the symmetric difference with a non-`IntervalIndex` did not raise (GH18475)
- Bug in `IntervalIndex` where set operations that returned an empty `IntervalIndex` had the wrong dtype (GH19101)
- Bug in `DataFrame.drop_duplicates()` where no `KeyError` is raised when passing in columns that don't exist on the `DataFrame` (GH19726)
- Bug in `Index` subclass constructors that ignore unexpected keyword arguments (GH19348)
- Bug in `Index.difference()` when taking the difference of an `Index` with itself (GH20040)
- Bug in `DataFrame.first_valid_index()` and `DataFrame.last_valid_index()` in the presence of entire rows of NaNs in the middle of values (GH20499)
- Bug in `IntervalIndex` where some indexing operations were not supported for overlapping or non-monotonic `uint64` data (GH20636)
- Bug in `Series.is_unique` where extraneous output in stderr is shown if the Series contains objects with `__ne__` defined (GH20661)
- Bug in `.loc` assignment with a single-element list-like incorrectly assigning as a list (GH19474)
- Bug in partial string indexing on a `Series`/`DataFrame` with a monotonic decreasing `DatetimeIndex` (GH19362)
- Bug in performing in-place operations on a `DataFrame` with a duplicate `Index` (GH17105)
- Bug in `IntervalIndex.get_loc()` and `IntervalIndex.get_indexer()` when used with an `IntervalIndex` containing a single interval (GH17284, GH20921)
- Bug in `.loc` with a `uint64` indexer (GH20722)

#### MultiIndex¶

- Bug in `MultiIndex.__contains__()` where non-tuple keys would return `True` even if they had been dropped (GH19027)
- Bug in `MultiIndex.set_labels()` which would cause casting (and potentially clipping) of the new labels if the `level` argument is not 0 or a list like `[0, 1, ...]` (GH19057)
- Bug in `MultiIndex.get_level_values()` which would return an invalid index on a level of ints with missing values (GH17924)
- Bug in `MultiIndex.unique()` when called on an empty `MultiIndex` (GH20568)
- Bug in `MultiIndex.unique()` which would not preserve level names (GH20570)
- Bug in `MultiIndex.remove_unused_levels()` which would fill nan values (GH18417)
- Bug in `MultiIndex.from_tuples()` which would fail to take zipped tuples in Python 3 (GH18434)
- Bug in `MultiIndex.get_loc()` which would fail to automatically cast values between float and int (GH18818, GH15994)
- Bug in `MultiIndex.get_loc()` which would cast boolean to integer labels (GH19086)
- Bug in `MultiIndex.get_loc()` which would fail to locate keys containing `NaN` (GH18485)
- Bug in `MultiIndex.get_loc()` in a large `MultiIndex`, which would fail when levels had different dtypes (GH18520)
- Bug in indexing where nested indexers having only numpy arrays are handled incorrectly (GH19686)

#### I/O¶

- `read_html()` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (GH17975)
- `DataFrame.to_html()` now has an option to add an id to the leading `<table>` tag (GH8496)
- Bug in `read_msgpack()` when a non-existent file is passed in Python 2 (GH15296)
- Bug in `read_csv()` where a `MultiIndex` with duplicate columns was not being mangled appropriately (GH18062)
- Bug in `read_csv()` where missing values were not being handled properly when `keep_default_na=False` with dictionary `na_values` (GH19227)
- Bug in `read_csv()` causing heap corruption on 32-bit, big-endian architectures (GH20785)
- Bug in `read_sas()` where a file with 0 variables gave an `AttributeError` incorrectly. Now it gives an `EmptyDataError` (GH18184)
- Bug in `DataFrame.to_latex()` where pairs of braces meant to serve as invisible placeholders were escaped (GH18667)
- Bug in `DataFrame.to_latex()` where a `NaN` in a `MultiIndex` would cause an `IndexError` or incorrect output (GH14249)
- Bug in `DataFrame.to_latex()` where a non-string index-level name would result in an `AttributeError` (GH19981)
- Bug in `DataFrame.to_latex()` where the combination of an index name and the `index_names=False` option would result in incorrect output (GH18326)
- Bug in `DataFrame.to_latex()` where a `MultiIndex` with an empty string as its name would result in incorrect output (GH18669)
- Bug in `DataFrame.to_latex()` where missing space characters caused wrong escaping and produced non-valid LaTeX in some cases (GH20859)
- Bug in `read_json()` where large numeric values were causing an `OverflowError` (GH18842)
- Bug in `DataFrame.to_parquet()` where an exception was raised if the write destination is S3 (GH19134)
- `Interval` now supported in `DataFrame.to_excel()` for all Excel file types (GH19242)
- `Timedelta` now supported in `DataFrame.to_excel()` for all Excel file types (GH19242, GH9155, GH19900)
- Bug in `pandas.io.stata.StataReader.value_labels()` raising an `AttributeError` when called on very old files. Now returns an empty dict (GH19417)
- Bug in `read_pickle()` when unpickling objects with `TimedeltaIndex` or `Float64Index` created with pandas prior to version 0.20 (GH19939)
- Bug in `pandas.io.json.json_normalize()` where sub-records are not properly normalized if any sub-record values are NoneType (GH20030)
- Bug in the `usecols` parameter in `read_csv()` where an error was not raised correctly when passing a string (GH20529)
- Bug in `HDFStore.keys()` where reading a file with a soft link caused an exception (GH20523)
- Bug in `HDFStore.select_column()` where a key which is not a valid store raised an `AttributeError` instead of a `KeyError` (GH17912)

#### Plotting¶

- Better error message when attempting to plot but matplotlib is not installed (GH19810)
- `DataFrame.plot()` now raises a `ValueError` when the `x` or `y` argument is improperly formed (GH18671)
- Bug in `DataFrame.plot()` where `x` and `y` arguments given as positions caused incorrectly referenced columns for line, bar and area plots (GH20056)
- Bug in formatting tick labels with `datetime.time()` and fractional seconds (GH18478)
- `Series.plot.kde()` has exposed the args `ind` and `bw_method` in the docstring (GH18461). The argument `ind` may now also be an integer (number of sample points)
- `DataFrame.plot()` now supports multiple columns to the `y` argument (GH19699)

#### Groupby/Resample/Rolling¶

- Bug when grouping by a single column and aggregating with a class like `list` or `tuple` (GH18079)
- Fixed regression in `DataFrame.groupby()` which would not emit an error when called with a tuple key not in the index (GH18798)
- Bug in `DataFrame.resample()` which silently ignored unsupported (or mistyped) options for `label`, `closed` and `convention` (GH19303)
- Bug in `DataFrame.groupby()` where tuples were interpreted as lists of keys rather than as keys (GH17979, GH18249)
- Bug in `DataFrame.groupby()` where aggregation by `first`/`last`/`min`/`max` was causing timestamps to lose precision (GH19526)
- Bug in `DataFrame.transform()` where particular aggregation functions were being incorrectly cast to match the dtype(s) of the grouped data (GH19200)
- Bug in `DataFrame.groupby()` when passing the `on=` kwarg and subsequently using `.apply()` (GH17813)
- Bug in `DataFrame.resample().aggregate` not raising a `KeyError` when aggregating a non-existent column (GH16766, GH19566)
- Bug in `DataFrameGroupBy.cumsum()` and `DataFrameGroupBy.cumprod()` when `skipna` was passed (GH19806)
- Bug in `DataFrame.resample()` that dropped timezone information (GH13238)
- Bug in `DataFrame.groupby()` where transformations using `np.all` and `np.any` were raising a `ValueError` (GH20653)
- Bug in `DataFrame.resample()` where `ffill`, `bfill`, `pad`, `backfill`, `fillna`, `interpolate`, and `asfreq` were ignoring `loffset` (GH20744)
- Bug in `DataFrame.groupby()` when applying a function that has mixed data types and the user-supplied function can fail on the grouping column (GH20949)
- Bug in `DataFrameGroupBy.rolling().apply()` where operations performed against the associated `DataFrameGroupBy` object could impact the inclusion of the grouped item(s) in the result (GH14013)

#### Sparse¶

- Bug in which creating a `SparseDataFrame` from a dense `Series` or an unsupported type raised an uncontrolled exception (GH19374)
- Bug in `SparseDataFrame.to_csv` causing an exception (GH19384)
- Bug in `SparseSeries.memory_usage` which caused a segfault by accessing non-sparse elements (GH19368)
- Bug in constructing a `SparseArray`: if `data` is a scalar and `index` is defined, it will coerce to `float64` regardless of the scalar's dtype (GH19163)

#### Reshaping¶

- Bug in `DataFrame.merge()` where referencing a `CategoricalIndex` by name with the `by` kwarg would raise a `KeyError` (GH20777)
- Bug in `DataFrame.stack()` which fails trying to sort mixed type levels under Python 3 (GH18310)
- Bug in `DataFrame.unstack()` which casts int to float if `columns` is a `MultiIndex` with unused levels (GH17845)
- Bug in `DataFrame.unstack()` which raises an error if `index` is a `MultiIndex` with unused labels on the unstacked level (GH18562)
- Fixed construction of a `Series` from a `dict` containing `NaN` as key (GH18480)
- Fixed construction of a `DataFrame` from a `dict` containing `NaN` as key (GH18455)
- Disabled construction of a `Series` where `len(index) > len(data) = 1`, which previously would broadcast the data item, and now raises a `ValueError` (GH18819)
- Suppressed error in the construction of a `DataFrame` from a `dict` containing scalar values when the corresponding keys are not included in the passed index (GH18600)
- Fixed (changed from `object` to `float64`) the dtype of a `DataFrame` initialized with axes, no data, and `dtype=int` (GH19646)
- Bug in `Series.rank()` where a `Series` containing `NaT` modifies the `Series` in place (GH18521)
- Bug in `cut()` which fails when using readonly arrays (GH18773)
- Bug in `DataFrame.pivot_table()` which fails when the `aggfunc` arg is of type string. The behavior is now consistent with other methods like `agg` and `apply` (GH18713)
- Bug in `DataFrame.merge()` in which merging using `Index` objects as vectors raised an Exception (GH19038)
- Bug in `DataFrame.stack()`, `DataFrame.unstack()`, `Series.unstack()` which were not returning subclasses (GH15563)
- Bug in timezone comparisons, manifesting as a conversion of the index to UTC in `.concat()` (GH18523)
- Bug in `concat()` when concatenating sparse and dense series; it returned only a `SparseDataFrame` but should return a `DataFrame` (GH18914, GH18686, and GH16874)
- Improved error message for `DataFrame.merge()` when there is no common merge key (GH19427)
- Bug in `DataFrame.join()` which does an `outer` instead of a `left` join when being called with multiple DataFrames and some have non-unique indices (GH19624)
- `Series.rename()` now accepts `axis` as a kwarg (GH18589)
- Bug in `rename()` where an Index of same-length tuples was converted to a MultiIndex (GH19497)
- Comparisons between `Series` and `Index` would return a `Series` with an incorrect name, ignoring the `Index`'s name attribute (GH19582)
- Bug in `qcut()` where datetime and timedelta data with `NaT` present raised a `ValueError` (GH19768)
- Bug in `DataFrame.iterrows()`, which would infer strings not compliant with ISO 8601 as datetimes (GH19671)
- Bug in the `Series` constructor with `Categorical` where a `ValueError` is not raised when an index of different length is given (GH19342)
- Bug in `DataFrame.astype()` where column metadata is lost when converting to categorical or a dictionary of dtypes (GH19920)
- Bug in `cut()` and `qcut()` where timezone information was dropped (GH19872)
- Bug in the `Series` constructor with a `dtype=str`, previously raised in some cases (GH19853)
- Bug in `get_dummies()` and `select_dtypes()`, where duplicate column names caused incorrect behavior (GH20848)
- Bug in `isna()`, which could not handle ambiguously typed lists (GH20675)
- Bug in `concat()` which raises an error when concatenating tz-aware dataframes and all-NaT dataframes (GH12396)
- Bug in `concat()` which raises an error when concatenating empty tz-aware series (GH18447)

#### Other¶

- Improved error message when attempting to use a Python keyword as an identifier in a `numexpr` backed query (GH18221)
- Bug in accessing a `pandas.get_option()`, which raised `KeyError` rather than `OptionError` when looking up a non-existent option key in some cases (GH19789)
- Bug in `testing.assert_series_equal()` and `testing.assert_frame_equal()` for Series or DataFrames with differing unicode data (GH20503)

## v0.22.0 (December 29, 2017)¶

This is a major release from 0.21.1 and includes a single, API-breaking change. We recommend that all users upgrade to this version after carefully reading the release note (singular!).

### Backwards incompatible API changes¶

Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The summary is that

- The sum of an empty or all-*NA* `Series` is now `0`
- The product of an empty or all-*NA* `Series` is now `1`
- We've added a `min_count` parameter to `.sum()` and `.prod()` controlling the minimum number of valid values for the result to be valid. If fewer than `min_count` non-*NA* values are present, the result is *NA*. The default is `0`. To return `NaN`, the 0.21 behavior, use `min_count=1`.

Some background: In pandas 0.21, we fixed a long-standing inconsistency in the return value of all-*NA* series depending on whether or not bottleneck was installed. See Sum/Prod of all-NaN or empty Series/DataFrames is now consistently NaN. At the same time, we changed the sum and prod of an empty `Series` to also be `NaN`.

Based on feedback, we’ve partially reverted those changes.

#### Arithmetic Operations¶

The default sum for an empty or all-*NA* `Series` is now `0`.

*pandas 0.21.x*

```
In [1]: pd.Series([]).sum()
Out[1]: nan
In [2]: pd.Series([np.nan]).sum()
Out[2]: nan
```

*pandas 0.22.0*

```
In [1]: pd.Series([]).sum()
Out[1]: 0.0
In [2]: pd.Series([np.nan]).sum()
Out[2]: 0.0
```

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It also matches the behavior of NumPy's `np.nansum` on empty and all-*NA* arrays.

To have the sum of an empty series return `NaN` (the default behavior of pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the `min_count` keyword.

```
In [3]: pd.Series([]).sum(min_count=1)
Out[3]: nan
```

Thanks to the `skipna` parameter, the `.sum` on an all-*NA* series is conceptually the same as the `.sum` of an empty one with `skipna=True` (the default).

```
In [4]: pd.Series([np.nan]).sum(min_count=1) # skipna=True by default
Out[4]: nan
```

The `min_count` parameter refers to the minimum number of *non-null* values required for a non-NA sum or product.

`Series.prod()` has been updated to behave the same as `Series.sum()`, returning `1` instead.

```
In [5]: pd.Series([]).prod()
Out[5]: 1.0
In [6]: pd.Series([np.nan]).prod()
Out[6]: 1.0
In [7]: pd.Series([]).prod(min_count=1)
Out[7]: nan
```

These changes affect `DataFrame.sum()` and `DataFrame.prod()` as well. Finally, a few less obvious places in pandas are affected by this change.

#### Grouping by a Categorical¶

Grouping by a `Categorical` and summing now returns `0` instead of `NaN` for categories with no observations. The product now returns `1` instead of `NaN`.

*pandas 0.21.x*

```
In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a 3.0
b NaN
dtype: float64
```

*pandas 0.22*

```
In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a 3
b 0
dtype: int64
```

To restore the 0.21 behavior of returning `NaN` for unobserved groups, use `min_count>=1`.

```
In [10]: pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
Out[10]:
a 3.0
b NaN
dtype: float64
```

#### Resample¶

The sum and product of all-*NA* bins have changed from `NaN` to `0` for sum and `1` for product.

*pandas 0.21.x*

```
In [11]: s = pd.Series([1, 1, np.nan, np.nan],
...: index=pd.date_range('2017', periods=4))
...: s
Out[11]:
2017-01-01 1.0
2017-01-02 1.0
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64
In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64
```

*pandas 0.22.0*

```
In [11]: s = pd.Series([1, 1, np.nan, np.nan],
....: index=pd.date_range('2017', periods=4))
....:
In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 0.0
dtype: float64
```

To restore the 0.21 behavior of returning `NaN`, use `min_count>=1`.

```
In [13]: s.resample('2d').sum(min_count=1)
Out[13]:
2017-01-01 2.0
2017-01-03 NaN
dtype: float64
```

In particular, upsampling and taking the sum or product is affected, as upsampling introduces missing values even if the original series was entirely valid.

*pandas 0.21.x*

```
In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[15]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64
```

*pandas 0.22.0*

```
In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
In [15]: pd.Series([1, 2], index=idx).resample("12H").sum()
Out[15]:
2017-01-01 00:00:00 1
2017-01-01 12:00:00 0
2017-01-02 00:00:00 2
Freq: 12H, dtype: int64
```

Once again, the `min_count` keyword is available to restore the 0.21 behavior.

```
In [16]: pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
Out[16]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64
```

#### Rolling and Expanding¶

Rolling and expanding already have a `min_periods` keyword that behaves similarly to `min_count`. The only case that changes is when doing a rolling or expanding sum with `min_periods=0`. Previously this returned `NaN` when fewer than `min_periods` non-*NA* values were in the window. Now it returns `0`.

*pandas 0.21.1*

```
In [17]: s = pd.Series([np.nan, np.nan])
In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0 NaN
1 NaN
dtype: float64
```

*pandas 0.22.0*

```
In [17]: s = pd.Series([np.nan, np.nan])
In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0 0.0
1 0.0
dtype: float64
```

The default behavior of `min_periods=None`, implying that `min_periods` equals the window size, is unchanged.

### Compatibility¶

If you maintain a library that should work across pandas versions, it may be easiest to exclude pandas 0.21 from your requirements. Otherwise, all your `sum()` calls would need to check if the `Series` is empty before summing.

With setuptools, in your `setup.py` use:

```
install_requires=['pandas!=0.21.*', ...]
```

With conda, use

```
requirements:
run:
- pandas !=0.21.0,!=0.21.1
```

Note that the inconsistency in the return value for all-*NA* series is still
there for pandas 0.20.3 and earlier. Avoiding pandas 0.21 will only help with
the empty case.

## v0.21.1 (December 12, 2017)¶

This is a minor bug-fix release in the 0.21.x series and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

- Temporarily restore matplotlib datetime plotting functionality. This should resolve issues for users who implicitly relied on pandas to plot datetimes with matplotlib. See here.
- Improvements to the Parquet IO functions introduced in 0.21.0. See here.

What’s new in v0.21.1

### Restore Matplotlib datetime Converter Registration¶

Pandas implements some matplotlib converters for nicely formatting the axis labels on plots with `datetime` or `Period` values. Prior to pandas 0.21.0, these were implicitly registered with matplotlib, as a side effect of `import pandas`.

In pandas 0.21.0, we required users to explicitly register the converter. This caused problems for some users who relied on those converters being present for regular `matplotlib.pyplot` plotting methods, so we're temporarily reverting that change; pandas 0.21.1 again registers the converters on import, just like before 0.21.0.

We've added a new option to control the converters: `pd.options.plotting.matplotlib.register_converters`. By default, they are registered. Toggling this to `False` removes pandas' formatters and restores any converters we overwrote when registering them (GH18301).
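
A minimal sketch of toggling this option (illustrative; assumes matplotlib is installed):

```
import pandas as pd

# Opt out of pandas' automatic converter registration, keeping any
# custom matplotlib datetime converters you have set yourself.
pd.set_option('plotting.matplotlib.register_converters', False)

# Opt back in explicitly.
pd.set_option('plotting.matplotlib.register_converters', True)
```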

We’re working with the matplotlib developers to make this easier. We’re trying to balance user convenience (automatically registering the converters) with import performance and best practices (importing pandas shouldn’t have the side effect of overwriting any custom converters you’ve already set). In the future we hope to have most of the datetime formatting functionality in matplotlib, with just the pandas-specific converters in pandas. We’ll then gracefully deprecate the automatic registration of converters in favor of users explicitly registering them when they want them.

### New features¶

#### Improvements to the Parquet IO functionality¶

- `DataFrame.to_parquet()` will now write non-default indexes when the underlying engine supports it. The indexes will be preserved when reading back in with `read_parquet()` (GH18581).
- `read_parquet()` now allows specifying the columns to read from a parquet file (GH18154); see the sketch below
- `read_parquet()` now allows specifying kwargs which are passed to the respective engine (GH18216)
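
A short sketch of the `columns` keyword (assuming pyarrow or fastparquet is installed; `example.parquet` is a hypothetical local path):

```
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
df.to_parquet('example.parquet')

# Read back only a subset of columns (GH18154).
subset = pd.read_parquet('example.parquet', columns=['a'])
print(subset.columns.tolist())  # ['a']
```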

#### Other Enhancements¶

- `Timestamp.timestamp()` is now available in Python 2.7 (GH17329)
- `Grouper` and `TimeGrouper` now have a friendly repr output (GH18203)

### Deprecations¶

- `pandas.tseries.register` has been renamed to `pandas.plotting.register_matplotlib_converters()` (GH18301)

### Bug Fixes¶

#### Conversion¶

- Bug in `TimedeltaIndex` subtraction which could incorrectly overflow when `NaT` is present (GH17791)
- Bug in `DatetimeIndex` where subtracting datetimelike from a `DatetimeIndex` could fail to raise `OverflowError` (GH18020)
- Bug in `IntervalIndex.copy()` when copying an `IntervalIndex` with non-default `closed` (GH18339)
- Bug in `DataFrame.to_dict()` where columns of datetime that are tz-aware were not converted to the required arrays when used with `orient='records'`, raising `TypeError` (GH18372)
- Bug in `DatetimeIndex` and `date_range()` where mismatching tz-aware `start` and `end` timezones would not raise an error if `end.tzinfo` is None (GH18431)
- Bug in `Series.fillna()` which raised when passed a long integer on Python 2 (GH18159)

#### Indexing¶

- Bug in a boolean comparison of a `datetime.datetime` and a `datetime64[ns]` dtype Series (GH17965)
- Bug where a `MultiIndex` with more than a million records was not raising `AttributeError` when trying to access a missing attribute (GH18165)
- Bug in the `IntervalIndex` constructor when a list of intervals is passed with non-default `closed` (GH18334)
- Bug in `Index.putmask` when an invalid mask is passed (GH18368)
- Bug in masked assignment of a `timedelta64[ns]` dtype `Series`, incorrectly coerced to float (GH18493)

#### I/O¶

- Bug in `pandas.io.stata.StataReader` not converting date/time columns with display formatting (GH17990). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.
- Bug in `read_csv()` when reading a compressed UTF-16 encoded file (GH18071)
- Bug in `read_csv()` for handling null values in index columns when specifying `na_filter=False` (GH5239)
- Bug in `read_csv()` when reading numeric category fields with high cardinality (GH18186)
- Bug in `DataFrame.to_csv()` when the table had `MultiIndex` columns, and a list of strings was passed in for `header` (GH5539)
- Bug in parsing integer datetime-like columns with specified format in `read_sql` (GH17855)
- Bug in `DataFrame.to_msgpack()` when serializing data of the `numpy.bool_` datatype (GH18390)
- Bug in `read_json()` not decoding when reading line-delimited JSON from S3 (GH17200)
- Bug in `pandas.io.json.json_normalize()` to avoid modification of `meta` (GH18610)
- Bug in `to_latex()` where repeated MultiIndex values were not printed even though a higher-level index differed from the previous row (GH14484)
- Bug when reading NaN-only categorical columns in `HDFStore` (GH18413)
- Bug in `DataFrame.to_latex()` with `longtable=True` where a LaTeX multicolumn always spanned over three columns (GH17959)

#### Plotting¶

- Bug in `DataFrame.plot()` and `Series.plot()` with `DatetimeIndex` where a figure generated by them is not pickleable in Python 3 (GH18439)

#### Groupby/Resample/Rolling¶

- Bug in `DataFrame.resample(...).apply(...)` when there is a callable that returns different columns (GH15169)
- Bug in `DataFrame.resample(...)` when there is a time change (DST) and resampling frequency is 12h or higher (GH15549)
- Bug in `pd.DataFrameGroupBy.count()` when counting over a datetimelike column (GH13393)
- Bug in `rolling.var` where calculation is inaccurate with a zero-valued array (GH18430)

#### Reshaping¶

- Error message in `pd.merge_asof()` for key datatype mismatch now includes the datatypes of the left and right key (GH18068)
- Bug in `pd.concat` when empty and non-empty DataFrames or Series are concatenated (GH18178, GH18187)
- Bug in `DataFrame.filter(...)` when `unicode` is passed as a condition in Python 2 (GH13101)
- Bug when merging empty DataFrames when `np.seterr(divide='raise')` is set (GH17776)

#### Numeric¶

- Bug in `pd.Series.rolling.skew()` and `rolling.kurt()` where results had floating-point precision issues when all values were equal (GH18044)

#### Categorical¶

- Bug in `DataFrame.astype()` where casting to 'category' on an empty `DataFrame` causes a segmentation fault (GH18004)
- Error messages in the testing module have been improved when items have different `CategoricalDtype` (GH18069)
- `CategoricalIndex` can now correctly take a `pd.api.types.CategoricalDtype` as its dtype (GH18116)
- Bug in `Categorical.unique()` returning a read-only `codes` array when all categories were `NaN` (GH18051)
- Bug in `DataFrame.groupby(axis=1)` with a `CategoricalIndex` (GH18432)

#### String¶

- `Series.str.split()` will now propagate `NaN` values across all expanded columns instead of `None` (GH18450); see the sketch below
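
A small illustrative sketch of the expansion behavior (data invented for the example):

```
import numpy as np
import pandas as pd

s = pd.Series(['a_b', np.nan])

# With expand=True, the missing entry now yields NaN in every expanded
# column rather than None (GH18450).
print(s.str.split('_', expand=True))
```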

## v0.21.0 (October 27, 2017)¶

This is a major release from 0.20.3 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- Integration with Apache Parquet, including a new top-level `read_parquet()` function and `DataFrame.to_parquet()` method, see here.
- New user-facing `pandas.api.types.CategoricalDtype` for specifying categoricals independent of the data, see here.
- The behavior of `sum` and `prod` on all-NaN Series/DataFrames is now consistent and no longer depends on whether bottleneck is installed, and `sum` and `prod` on empty Series now return NaN instead of 0, see here.
- Compatibility fixes for pypy, see here.
- Additions to the `drop`, `reindex` and `rename` API to make them more consistent, see here.
- Addition of the new methods `DataFrame.infer_objects` (see here) and `GroupBy.pipe` (see here).
- Indexing with a list of labels, where one or more of the labels is missing, is deprecated and will raise a KeyError in a future version, see here.

Check the API Changes and deprecations before updating.

What’s new in v0.21.0

- New features
  - Integration with Apache Parquet file format
  - `infer_objects` type conversion
  - Improved warnings when attempting to create columns
  - `drop` now also accepts index/columns keywords
  - `rename`, `reindex` now also accept axis keyword
  - `CategoricalDtype` for specifying categoricals
  - `GroupBy` objects now have a `pipe` method
  - `Categorical.rename_categories` accepts a dict-like
  - Other Enhancements
- Backwards incompatible API changes
  - Dependencies have increased minimum versions
  - Sum/Prod of all-NaN or empty Series/DataFrames is now consistently NaN
  - Indexing with a list with missing labels is Deprecated
  - NA naming Changes
  - Iteration of Series/Index will now return Python scalars
  - Indexing with a Boolean Index
  - `PeriodIndex` resampling
  - Improved error handling during item assignment in pd.eval
  - Dtype Conversions
  - MultiIndex Constructor with a Single Level
  - UTC Localization with Series
  - Consistency of Range Functions
  - No Automatic Matplotlib Converters
  - Other API Changes
- Deprecations
- Removal of prior version deprecations/changes
- Performance Improvements
- Documentation Changes
- Bug Fixes

### New features¶

#### Integration with Apache Parquet file format¶

Integration with Apache Parquet, including a new top-level `read_parquet()` function and `DataFrame.to_parquet()` method, see here (GH15838, GH17438).

Apache Parquet provides a cross-language, binary file format for reading and writing data frames efficiently. Parquet is designed to faithfully serialize and de-serialize `DataFrame`s, supporting all of the pandas dtypes, including extension dtypes such as datetime with timezones.

This functionality depends on either the pyarrow or fastparquet library. For more details, see the IO docs on Parquet.
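
A minimal round-trip sketch (assuming one of those engines is installed; the path is hypothetical):

```
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': pd.date_range('2017', periods=3, tz='US/Eastern')})

df.to_parquet('roundtrip.parquet')
restored = pd.read_parquet('roundtrip.parquet')
print(restored.dtypes)  # the tz-aware datetime dtype survives the round trip
```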

#### `infer_objects` type conversion¶

The `DataFrame.infer_objects()` and `Series.infer_objects()` methods have been added to perform dtype inference on object columns, replacing some of the functionality of the deprecated `convert_objects` method. See the documentation here for more details. (GH11221)

This method only performs soft conversions on object columns, converting Python objects to native types, but not any coercive conversions. For example:

```
In [1]: df = pd.DataFrame({'A': [1, 2, 3],
...: 'B': np.array([1, 2, 3], dtype='object'),
...: 'C': ['1', '2', '3']})
...:
In [2]: df.dtypes
Out[2]:
A int64
B object
C object
dtype: object
In [3]: df.infer_objects().dtypes
Out[3]:
A int64
B int64
C object
dtype: object
```

Note that column `'C'` was not converted - only scalar numeric types will be converted to a new type. Other types of conversion should be accomplished using the `to_numeric()` function (or `to_datetime()`, `to_timedelta()`).

```
In [4]: df = df.infer_objects()
In [5]: df['C'] = pd.to_numeric(df['C'], errors='coerce')
In [6]: df.dtypes
Out[6]:
A int64
B int64
C int64
dtype: object
```

#### Improved warnings when attempting to create columns¶

New users are often puzzled by the relationship between column operations and attribute access on `DataFrame` instances (GH7175). One specific instance of this confusion is attempting to create a new column by setting an attribute on the `DataFrame`:

```
In[1]: df = pd.DataFrame({'one': [1., 2., 3.]})
In[2]: df.two = [4, 5, 6]
```

This does not raise any obvious exceptions, but also does not create a new column:

```
In[3]: df
Out[3]:
one
0 1.0
1 2.0
2 3.0
```

Setting a list-like data structure into a new attribute now raises a `UserWarning` about the potential for unexpected behavior. See Attribute Access.

#### `drop` now also accepts index/columns keywords¶

The `drop()` method has gained `index`/`columns` keywords as an alternative to specifying the `axis`. This is similar to the behavior of `reindex` (GH12392).

For example:

```
In [7]: df = pd.DataFrame(np.arange(8).reshape(2,4),
...: columns=['A', 'B', 'C', 'D'])
...:
In [8]: df
Out[8]:
A B C D
0 0 1 2 3
1 4 5 6 7
In [9]: df.drop(['B', 'C'], axis=1)
Out[9]:
A D
0 0 3
1 4 7
# the following is now equivalent
In [10]: df.drop(columns=['B', 'C'])
Out[10]:
A D
0 0 3
1 4 7
```

#### `rename`, `reindex` now also accept axis keyword¶

The `DataFrame.rename()` and `DataFrame.reindex()` methods have gained the `axis` keyword to specify the axis to target with the operation (GH12392).

Here's `rename`:

```
In [11]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
In [12]: df.rename(str.lower, axis='columns')
Out[12]:
a b
0 1 4
1 2 5
2 3 6
In [13]: df.rename(id, axis='index')
Out[13]:
A B
94650055490048 1 4
94650055490080 2 5
94650055490112 3 6
```

And `reindex`:

```
In [14]: df.reindex(['A', 'B', 'C'], axis='columns')
Out[14]:
A B C
0 1 4 NaN
1 2 5 NaN
2 3 6 NaN
In [15]: df.reindex([0, 1, 3], axis='index')
Out[15]:
A B
0 1.0 4.0
1 2.0 5.0
3 NaN NaN
```

The “index, columns” style continues to work as before.

```
In [16]: df.rename(index=id, columns=str.lower)
Out[16]:
a b
94650055490048 1 4
94650055490080 2 5
94650055490112 3 6
In [17]: df.reindex(index=[0, 1, 3], columns=['A', 'B', 'C'])
Out[17]:
A B C
0 1.0 4.0 NaN
1 2.0 5.0 NaN
3 NaN NaN NaN
```

We *highly* encourage using named arguments to avoid confusion when using either
style.

#### `CategoricalDtype` for specifying categoricals¶

`pandas.api.types.CategoricalDtype` has been added to the public API and expanded to include the `categories` and `ordered` attributes. A `CategoricalDtype` can be used to specify the set of categories and orderedness of an array, independent of the data. This can be useful, for example, when converting string data to a `Categorical` (GH14711, GH15078, GH16015, GH17643):

```
In [18]: from pandas.api.types import CategoricalDtype
In [19]: s = pd.Series(['a', 'b', 'c', 'a']) # strings
In [20]: dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
In [21]: s.astype(dtype)
Out[21]:
0 a
1 b
2 c
3 a
dtype: category
Categories (4, object): [a < b < c < d]
```

One place that deserves special mention is in `read_csv()`. Previously, with `dtype={'col': 'category'}`, the returned values and categories would always be strings.

```
In [22]: data = 'A,B\na,1\nb,2\nc,3'
In [23]: pd.read_csv(StringIO(data), dtype={'B': 'category'}).B.cat.categories
Out[23]: Index(['1', '2', '3'], dtype='object')
```

Notice the “object” dtype.

With a `CategoricalDtype` of all numerics, datetimes, or timedeltas, we can automatically convert to the correct type

```
In [24]: dtype = {'B': CategoricalDtype([1, 2, 3])}
In [25]: pd.read_csv(StringIO(data), dtype=dtype).B.cat.categories
Out[25]: Int64Index([1, 2, 3], dtype='int64')
```

The values have been correctly interpreted as integers.

The `.dtype` property of a `Categorical`, `CategoricalIndex` or a `Series` with categorical type will now return an instance of `CategoricalDtype`. While the repr has changed, `str(CategoricalDtype())` is still the string `'category'`. We'll take this moment to remind users that the *preferred* way to detect categorical data is to use `pandas.api.types.is_categorical_dtype()`, and not `str(dtype) == 'category'`.

See the CategoricalDtype docs for more.

#### `GroupBy` objects now have a `pipe` method¶

`GroupBy` objects now have a `pipe` method, similar to the one on `DataFrame` and `Series`, that allows for functions that take a `GroupBy` to be composed in a clean, readable syntax. (GH17871)

For a concrete example on combining `.groupby` and `.pipe`, imagine having a DataFrame with columns for stores, products, revenue and sold quantity. We'd like to do a groupwise calculation of *prices* (i.e. revenue/quantity) per store and per product. We could do this in a multi-step operation, but expressing it in terms of piping can make the code more readable.

First we set the data:

```
In [26]: import numpy as np
In [27]: n = 1000
In [28]: df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
....: 'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
....: 'Revenue': (np.random.random(n)*50+10).round(2),
....: 'Quantity': np.random.randint(1, 10, size=n)})
....:
In [29]: df.head(2)
Out[29]:
Store Product Revenue Quantity
0 Store_1 Product_3 54.28 3
1 Store_2 Product_2 30.91 1
```

Now, to find prices per store/product, we can simply do:

```
In [30]: (df.groupby(['Store', 'Product'])
....: .pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
....: .unstack().round(2))
....:
Out[30]:
Product Product_1 Product_2 Product_3
Store
Store_1 6.37 6.98 7.49
Store_2 7.60 7.01 7.13
```

See the documentation for more.

#### `Categorical.rename_categories` accepts a dict-like¶

`rename_categories()` now accepts a dict-like argument for `new_categories`. The previous categories are looked up in the dictionary's keys and replaced if found. The behavior of missing and extra keys is the same as in `DataFrame.rename()`.

```
In [31]: c = pd.Categorical(['a', 'a', 'b'])
In [32]: c.rename_categories({"a": "eh", "b": "bee"})
Out[32]:
[eh, eh, bee]
Categories (2, object): [eh, bee]
```

Warning

To assist with upgrading pandas, `rename_categories` treats `Series` as list-like. Typically, Series are considered to be dict-like (e.g. in `.rename`, `.map`). In a future version of pandas `rename_categories` will change to treat them as dict-like. Follow the warning message's recommendations for writing future-proof code.

```
In [33]: c.rename_categories(pd.Series([0, 1], index=['a', 'c']))
FutureWarning: Treating Series 'new_categories' as a list-like and using the values.
In a future version, 'rename_categories' will treat Series like a dictionary.
For dict-like, use 'new_categories.to_dict()'
For list-like, use 'new_categories.values'.
Out[33]:
[0, 0, 1]
Categories (2, int64): [0, 1]
```

#### Other Enhancements¶

##### New functions or methods¶

##### New keywords¶

- Added a `skipna` parameter to `infer_dtype()` to support type inference in the presence of missing values (GH17059).
- `Series.to_dict()` and `DataFrame.to_dict()` now support an `into` keyword which allows you to specify the `collections.Mapping` subclass that you would like returned. The default is `dict`, which is backwards compatible. (GH16122) A short sketch follows this list.
- `Series.set_axis()` and `DataFrame.set_axis()` now support the `inplace` parameter. (GH14636)
- `Series.to_pickle()` and `DataFrame.to_pickle()` have gained a `protocol` parameter (GH16252). By default, this parameter is set to HIGHEST_PROTOCOL
- `read_feather()` has gained the `nthreads` parameter for multi-threaded operations (GH16359)
- `DataFrame.clip()` and `Series.clip()` have gained an `inplace` argument. (GH15388)
- `crosstab()` has gained a `margins_name` parameter to define the name of the row / column that will contain the totals when `margins=True`. (GH15972)
- `read_json()` now accepts a `chunksize` parameter that can be used when `lines=True`. If `chunksize` is passed, read_json now returns an iterator which reads in `chunksize` lines with each iteration. (GH17048)
- `read_json()` and `to_json()` now accept a `compression` argument which allows them to transparently handle compressed files. (GH17798)
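
As a quick sketch of the `into` keyword from the list above (using the standard library's `OrderedDict`):

```
from collections import OrderedDict

import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

# Return an OrderedDict instead of a plain dict (GH16122).
d = s.to_dict(into=OrderedDict)
print(type(d).__name__, d)
```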

##### Various enhancements¶

- Improved the import time of pandas by about 2.25x. (GH16764)
- Support for PEP 519 – Adding a file system path protocol on most readers (e.g. `read_csv()`) and writers (e.g. `DataFrame.to_csv()`) (GH13823).
- Added a `__fspath__` method to `pd.HDFStore`, `pd.ExcelFile`, and `pd.ExcelWriter` to work properly with the file system path protocol (GH13823).
- The `validate` argument for `merge()` now checks whether a merge is one-to-one, one-to-many, many-to-one, or many-to-many. If a merge is found to not be an example of the specified merge type, an exception of type `MergeError` will be raised. For more, see here (GH16270). A short sketch follows this list.
- Added support for PEP 518 (`pyproject.toml`) to the build system (GH16745)
- `RangeIndex.append()` now returns a `RangeIndex` object when possible (GH16212)
- `Series.rename_axis()` and `DataFrame.rename_axis()` with `inplace=True` now return `None` while renaming the axis inplace. (GH15704)
- `api.types.infer_dtype()` now infers decimals. (GH15690)
- `DataFrame.select_dtypes()` now accepts scalar values for include/exclude as well as list-like. (GH16855)
- `date_range()` now accepts 'YS' in addition to 'AS' as an alias for start of year. (GH9313)
- `date_range()` now accepts 'Y' in addition to 'A' as an alias for end of year. (GH9313)
- `DataFrame.add_prefix()` and `DataFrame.add_suffix()` now accept strings containing the '%' character. (GH17151)
- Read/write methods that infer compression (`read_csv()`, `read_table()`, `read_pickle()`, and `to_pickle()`) can now infer from path-like objects, such as `pathlib.Path`. (GH17206)
- `read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files. (GH15871)
- `DataFrame.items()` and `Series.items()` are now present in both Python 2 and 3 and are lazy in all cases. (GH13918, GH17213)
- `pandas.io.formats.style.Styler.where()` has been implemented as a convenience for `pandas.io.formats.style.Styler.applymap()`. (GH17474)
- `MultiIndex.is_monotonic_decreasing()` has been implemented. Previously returned `False` in all cases. (GH16554)
- `read_excel()` raises `ImportError` with a better message if `xlrd` is not installed. (GH17613)
- `DataFrame.assign()` will preserve the original order of `**kwargs` for Python 3.6+ users instead of sorting the column names. (GH14207)
- `Series.reindex()`, `DataFrame.reindex()`, `Index.get_indexer()` now support a list-like argument for `tolerance`. (GH17367)
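
The sketch of the `validate` argument referenced in the list above (hypothetical frames; the right-hand key is deliberately duplicated so the one-to-one check fails):

```
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
right = pd.DataFrame({'key': [1, 1], 'b': ['p', 'q']})  # duplicated key

try:
    pd.merge(left, right, on='key', validate='one_to_one')
except pd.errors.MergeError as exc:
    print('merge is not one-to-one:', exc)
```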

### Backwards incompatible API changes¶

#### Dependencies have increased minimum versions¶

We have updated our minimum supported versions of dependencies (GH15206, GH15543, GH15214). If installed, we now require:

| Package | Minimum Version | Required |
| --- | --- | --- |
| Numpy | 1.9.0 | X |
| Matplotlib | 1.4.3 | |
| Scipy | 0.14.0 | |
| Bottleneck | 1.0.0 | |

Additionally, support has been dropped for Python 3.4 (GH15251).

#### Sum/Prod of all-NaN or empty Series/DataFrames is now consistently NaN¶

Note

The changes described here have been partially reverted. See the v0.22.0 Whatsnew for more.

The behavior of `sum` and `prod` on all-NaN Series/DataFrames no longer depends on whether bottleneck is installed, and the return value of `sum` and `prod` on an empty Series has changed (GH9422, GH15507).

Calling `sum` or `prod` on an empty or all-`NaN` `Series`, or columns of a `DataFrame`, will result in `NaN`. See the docs.

```
In [33]: s = pd.Series([np.nan])
```

Previously WITHOUT `bottleneck` installed:

```
In [2]: s.sum()
Out[2]: np.nan
```

Previously WITH `bottleneck`:

```
In [2]: s.sum()
Out[2]: 0.0
```

New behavior, regardless of the bottleneck installation (note: the output below was generated under a later pandas version, in which this change was partially reverted; under 0.21.0 the result is `NaN`):

```
In [34]: s.sum()
Out[34]: 0.0
```

Note that this also changes the sum of an empty `Series`. Previously this always returned 0 regardless of a `bottleneck` installation:

```
In [1]: pd.Series([]).sum()
Out[1]: 0
```

but for consistency with the all-NaN case, this was changed to return `NaN` as well (again, the output below reflects the later, partially reverted behavior):

```
In [35]: pd.Series([]).sum()
Out[35]: 0.0
```

#### Indexing with a list with missing labels is Deprecated¶

Previously, selecting with a list of labels, where one or more labels were missing, would always succeed, returning `NaN` for missing labels. This will now show a `FutureWarning`. In the future this will raise a `KeyError` (GH15747). This warning will trigger on a `DataFrame` or a `Series` for using `.loc[]` or `[[]]` when passing a list-of-labels with at least 1 missing label. See the deprecation docs.

```
In [36]: s = pd.Series([1, 2, 3])
In [37]: s
Out[37]:
0 1
1 2
2 3
dtype: int64
```

Previous Behavior

```
In [4]: s.loc[[1, 2, 3]]
Out[4]:
1 2.0
2 3.0
3 NaN
dtype: float64
```

Current Behavior

```
In [4]: s.loc[[1, 2, 3]]
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[4]:
1 2.0
2 3.0
3 NaN
dtype: float64
```

The idiomatic way to select potentially not-found elements is via `.reindex()`:

```
In [38]: s.reindex([1, 2, 3])
Out[38]:
1 2.0
2 3.0
3 NaN
dtype: float64
```

Selection with all keys found is unchanged.

```
In [39]: s.loc[[1, 2]]
Out[39]:
1 2
2 3
dtype: int64
```

#### NA naming Changes¶

In order to promote more consistency in the pandas API, we have added additional top-level
functions `isna()` and `notna()`, which are aliases for `isnull()` and `notnull()`.
The naming scheme is now more consistent with methods like `.dropna()` and `.fillna()`. Furthermore,
in all cases where the `.isnull()` and `.notnull()` methods are defined, additional methods
named `.isna()` and `.notna()` are also defined; these are included for the classes `Categorical`,
`Index`, `Series`, and `DataFrame`. (GH15001)

The configuration option `pd.options.mode.use_inf_as_null` is deprecated, and `pd.options.mode.use_inf_as_na` is added as a replacement.
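A minimal sketch of the new aliases (the sample values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

# Top-level aliases are interchangeable with the older names.
assert pd.isna(np.nan) == pd.isnull(np.nan)
assert pd.notna(1.0) == pd.notnull(1.0)

# The method aliases agree as well (also on Index, DataFrame, Categorical).
assert s.isna().equals(s.isnull())
assert s.notna().equals(s.notnull())
```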

#### Iteration of Series/Index will now return Python scalars¶

Previously, when using certain iteration methods for a `Series` with dtype `int` or `float`, you would receive a `numpy` scalar, e.g. a `np.int64`, rather than a Python `int`. Issue (GH10904) corrected this for `Series.tolist()` and `list(Series)`. This change makes all iteration methods consistent, in particular for `__iter__()` and `.map()`; note that this only affects int/float dtypes. (GH13236, GH13258, GH14216)

```
In [40]: s = pd.Series([1, 2, 3])
In [41]: s
Out[41]:
0 1
1 2
2 3
dtype: int64
```

Previously:

```
In [2]: type(list(s)[0])
Out[2]: numpy.int64
```

New Behavior:

```
In [42]: type(list(s)[0])
Out[42]: int
```

Furthermore, this will now correctly box the results of iteration for `DataFrame.to_dict()` as well.

```
In [43]: d = {'a': [1], 'b': ['b']}
In [44]: df = pd.DataFrame(d)
```

Previously:

```
In [8]: type(df.to_dict()['a'][0])
Out[8]: numpy.int64
```

New Behavior:

```
In [45]: type(df.to_dict()['a'][0])
Out[45]: int
```

#### Indexing with a Boolean Index¶

Previously, when passing a boolean `Index` to `.loc`, if the index of the `Series`/`DataFrame` had `boolean` labels,
you would get a label-based selection, potentially duplicating result labels, rather than a boolean indexing selection
(where `True` selects elements). This was inconsistent with how a boolean numpy array indexes. The new behavior is to
act like a boolean numpy array indexer. (GH17738)

Previous Behavior:

```
In [46]: s = pd.Series([1, 2, 3], index=[False, True, False])
In [47]: s
Out[47]:
False 1
True 2
False 3
dtype: int64
```

```
In [59]: s.loc[pd.Index([True, False, True])]
Out[59]:
True 2
False 1
False 3
True 2
dtype: int64
```

Current Behavior:

```
In [48]: s.loc[pd.Index([True, False, True])]
Out[48]:
False 1
False 3
dtype: int64
```

Furthermore, previously if you had an index that was non-numeric (e.g. strings), then a boolean `Index` would raise a `KeyError`.
This will now be treated as a boolean indexer.

Previous Behavior:

```
In [49]: s = pd.Series([1,2,3], index=['a', 'b', 'c'])
In [50]: s
Out[50]:
a 1
b 2
c 3
dtype: int64
```

```
In [39]: s.loc[pd.Index([True, False, True])]
KeyError: "None of [Index([True, False, True], dtype='object')] are in the [index]"
```

Current Behavior:

```
In [51]: s.loc[pd.Index([True, False, True])]
Out[51]:
a 1
c 3
dtype: int64
```

#### `PeriodIndex` resampling¶

In previous versions of pandas, resampling a `Series`/`DataFrame` indexed by a `PeriodIndex` returned a `DatetimeIndex` in some cases (GH12884). Resampling to a multiplied frequency now returns a `PeriodIndex` (GH15944). As a minor enhancement, resampling a `PeriodIndex` can now handle `NaT` values (GH13224).

Previous Behavior:

```
In [1]: pi = pd.period_range('2017-01', periods=12, freq='M')
In [2]: s = pd.Series(np.arange(12), index=pi)
In [3]: resampled = s.resample('2Q').mean()
In [4]: resampled
Out[4]:
2017-03-31 1.0
2017-09-30 5.5
2018-03-31 10.0
Freq: 2Q-DEC, dtype: float64
In [5]: resampled.index
Out[5]: DatetimeIndex(['2017-03-31', '2017-09-30', '2018-03-31'], dtype='datetime64[ns]', freq='2Q-DEC')
```

New Behavior:

```
In [52]: pi = pd.period_range('2017-01', periods=12, freq='M')
In [53]: s = pd.Series(np.arange(12), index=pi)
In [54]: resampled = s.resample('2Q').mean()
In [55]: resampled
Out[55]:
2017Q1 2.5
2017Q3 8.5
Freq: 2Q-DEC, dtype: float64
In [56]: resampled.index
Out[56]: PeriodIndex(['2017Q1', '2017Q3'], dtype='period[2Q-DEC]', freq='2Q-DEC')
```

Upsampling and calling `.ohlc()` previously returned a `Series`, basically identical to calling `.asfreq()`. OHLC upsampling now returns a DataFrame with columns `open`, `high`, `low` and `close` (GH13083). This is consistent with downsampling and `DatetimeIndex` behavior.

Previous Behavior:

```
In [1]: pi = pd.PeriodIndex(start='2000-01-01', freq='D', periods=10)
In [2]: s = pd.Series(np.arange(10), index=pi)
In [3]: s.resample('H').ohlc()
Out[3]:
2000-01-01 00:00 0.0
...
2000-01-10 23:00 NaN
Freq: H, Length: 240, dtype: float64
In [4]: s.resample('M').ohlc()
Out[4]:
open high low close
2000-01 0 9 0 9
```

New Behavior:

```
In [57]: pi = pd.PeriodIndex(start='2000-01-01', freq='D', periods=10)
In [58]: s = pd.Series(np.arange(10), index=pi)
In [59]: s.resample('H').ohlc()
Out[59]:
open high low close
2000-01-01 00:00 0.0 0.0 0.0 0.0
2000-01-01 01:00 NaN NaN NaN NaN
2000-01-01 02:00 NaN NaN NaN NaN
2000-01-01 03:00 NaN NaN NaN NaN
2000-01-01 04:00 NaN NaN NaN NaN
2000-01-01 05:00 NaN NaN NaN NaN
2000-01-01 06:00 NaN NaN NaN NaN
... ... ... ... ...
2000-01-10 17:00 NaN NaN NaN NaN
2000-01-10 18:00 NaN NaN NaN NaN
2000-01-10 19:00 NaN NaN NaN NaN
2000-01-10 20:00 NaN NaN NaN NaN
2000-01-10 21:00 NaN NaN NaN NaN
2000-01-10 22:00 NaN NaN NaN NaN
2000-01-10 23:00 NaN NaN NaN NaN
[240 rows x 4 columns]
In [60]: s.resample('M').ohlc()
Out[60]:
open high low close
2000-01 0 9 0 9
```

#### Improved error handling during item assignment in pd.eval¶

`eval()` will now raise a `ValueError` when item assignment malfunctions, or when
inplace operations are specified but there is no item assignment in the expression (GH16732).

```
In [61]: arr = np.array([1, 2, 3])
```

Previously, if you attempted the following expression, you would get a not very helpful error message:

```
In [3]: pd.eval("a = 1 + 2", target=arr, inplace=True)
...
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`)
and integer or boolean arrays are valid indices
```

This is a very long way of saying numpy arrays don’t support string-item indexing. With this change, the error message is now this:

```
In [3]: pd.eval("a = 1 + 2", target=arr, inplace=True)
...
ValueError: Cannot assign expression output to target
```

It also used to be possible to evaluate expressions inplace, even if there was no item assignment:

```
In [4]: pd.eval("1 + 2", target=arr, inplace=True)
Out[4]: 3
```

However, this input does not make much sense because the output is not being assigned to
the target. Now, a `ValueError` will be raised when such an input is passed in:

```
In [4]: pd.eval("1 + 2", target=arr, inplace=True)
...
ValueError: Cannot operate inplace if there is no assignment
```

#### Dtype Conversions¶

Previously assignments, `.where()` and `.fillna()` with a `bool` assignment would coerce to the same type (e.g. int / float), or raise for datetimelikes. These will now preserve the bools with `object` dtype. (GH16821)

```
In [62]: s = pd.Series([1, 2, 3])
```

Previous Behavior

```
In [5]: s[1] = True
In [6]: s
Out[6]:
0 1
1 1
2 3
dtype: int64
```

New Behavior

```
In [63]: s[1] = True
In [64]: s
Out[64]:
0 1
1 True
2 3
dtype: object
```

Previously, assignment to a datetimelike with a non-datetimelike would coerce the non-datetimelike item being assigned (GH14145).

```
In [65]: s = pd.Series([pd.Timestamp('2011-01-01'), pd.Timestamp('2012-01-01')])
```

```
In [1]: s[1] = 1
In [2]: s
Out[2]:
0 2011-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000001
dtype: datetime64[ns]
```

These now coerce to `object` dtype.

```
In [66]: s[1] = 1
In [67]: s
Out[67]:
0 2011-01-01 00:00:00
1 1
dtype: object
```

#### MultiIndex Constructor with a Single Level¶

The `MultiIndex` constructors no longer squeeze a MultiIndex with all
length-one levels down to a regular `Index`. This affects all the
`MultiIndex` constructors. (GH17178)

Previous behavior:

```
In [2]: pd.MultiIndex.from_tuples([('a',), ('b',)])
Out[2]: Index(['a', 'b'], dtype='object')
```

Length 1 levels are no longer special-cased. They behave exactly as if you had
length 2+ levels, so a `MultiIndex` is always returned from all of the
`MultiIndex` constructors:

```
In [68]: pd.MultiIndex.from_tuples([('a',), ('b',)])
Out[68]:
MultiIndex(levels=[['a', 'b']],
labels=[[0, 1]])
```

#### UTC Localization with Series¶

Previously, `to_datetime()` did not localize datetime `Series` data when `utc=True` was passed. Now, `to_datetime()` will correctly localize `Series` with a `datetime64[ns, UTC]` dtype to be consistent with how list-like and `Index` data are handled. (GH6415)

Previous Behavior

```
In [69]: s = pd.Series(['20130101 00:00:00'] * 3)
```

```
In [12]: pd.to_datetime(s, utc=True)
Out[12]:
0 2013-01-01
1 2013-01-01
2 2013-01-01
dtype: datetime64[ns]
```

New Behavior

```
In [70]: pd.to_datetime(s, utc=True)
Out[70]:
0 2013-01-01 00:00:00+00:00
1 2013-01-01 00:00:00+00:00
2 2013-01-01 00:00:00+00:00
dtype: datetime64[ns, UTC]
```

Additionally, DataFrames with datetime columns that were parsed by `read_sql_table()` and `read_sql_query()` will also be localized to UTC only if the original SQL columns were timezone aware datetime columns.

#### Consistency of Range Functions¶

In previous versions, there were some inconsistencies between the various range functions: `date_range()`, `bdate_range()`, `period_range()`, `timedelta_range()`, and `interval_range()` (GH17471).

One of the inconsistent behaviors occurred when the `start`, `end` and `periods` parameters were all specified, potentially leading to ambiguous ranges. When all three parameters were passed, `interval_range` ignored the `periods` parameter, `period_range` ignored the `end` parameter, and the other range functions raised. To promote consistency among the range functions, and avoid potentially ambiguous ranges, `interval_range` and `period_range` will now raise when all three parameters are passed.

Previous Behavior:

```
In [2]: pd.interval_range(start=0, end=4, periods=6)
Out[2]:
IntervalIndex([(0, 1], (1, 2], (2, 3]]
closed='right',
dtype='interval[int64]')
In [3]: pd.period_range(start='2017Q1', end='2017Q4', periods=6, freq='Q')
Out[3]: PeriodIndex(['2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2'], dtype='period[Q-DEC]', freq='Q-DEC')
```

New Behavior:

```
In [2]: pd.interval_range(start=0, end=4, periods=6)
---------------------------------------------------------------------------
ValueError: Of the three parameters: start, end, and periods, exactly two must be specified
In [3]: pd.period_range(start='2017Q1', end='2017Q4', periods=6, freq='Q')
---------------------------------------------------------------------------
ValueError: Of the three parameters: start, end, and periods, exactly two must be specified
```

Additionally, the endpoint parameter `end` was not included in the intervals produced by `interval_range`. However, all other range functions include `end` in their output. To promote consistency among the range functions, `interval_range` will now include `end` as the right endpoint of the final interval, except if `freq` is specified in a way which skips `end`.

Previous Behavior:

```
In [4]: pd.interval_range(start=0, end=4)
Out[4]:
IntervalIndex([(0, 1], (1, 2], (2, 3]]
closed='right',
dtype='interval[int64]')
```

New Behavior:

```
In [71]: pd.interval_range(start=0, end=4)
Out[71]:
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4]],
closed='right',
dtype='interval[int64]')
```

#### No Automatic Matplotlib Converters¶

Pandas no longer registers our `date`, `time`, `datetime`, `datetime64`, and `Period` converters with matplotlib when pandas is imported. Matplotlib plot methods (`plt.plot`, `ax.plot`, …) will not nicely format the x-axis for `DatetimeIndex` or `PeriodIndex` values. You must explicitly register these converters:
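A minimal registration sketch, assuming the 0.21-era module location `pandas.tseries.converter` (later releases expose the same step as `pandas.plotting.register_matplotlib_converters()`):

```python
from pandas.tseries import converter

# Register the date/time/datetime64/Period converters with matplotlib.
converter.register()
```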

Pandas built-in `Series.plot` and `DataFrame.plot` *will* register these converters on first-use (GH17710).

Note

This change has been temporarily reverted in pandas 0.21.1, for more details see here.

#### Other API Changes¶

- The `Categorical` constructor no longer accepts a scalar for the `categories` keyword. (GH16022)
- Accessing a non-existent attribute on a closed `HDFStore` will now raise an `AttributeError` rather than a `ClosedFileError` (GH16301)
- `read_csv()` now issues a `UserWarning` if the `names` parameter contains duplicates (GH17095)
- `read_csv()` now treats `'null'` and `'n/a'` strings as missing values by default (GH16471, GH16078)
- `pandas.HDFStore`'s string representation is now faster and less detailed. For the previous behavior, use `pandas.HDFStore.info()`. (GH16503)
- Compression defaults in HDF stores now follow pytables standards. The default is no compression; if `complib` is missing and `complevel` > 0, `zlib` is used (GH15943)
- `Index.get_indexer_non_unique()` now returns an ndarray indexer rather than an `Index`; this is consistent with `Index.get_indexer()` (GH16819)
- Removed the `@slow` decorator from `pandas.util.testing`, which caused issues for some downstream packages' test suites. Use `@pytest.mark.slow` instead, which achieves the same thing (GH16850)
- Moved the definition of `MergeError` to the `pandas.errors` module.
- The signature of `Series.set_axis()` and `DataFrame.set_axis()` has been changed from `set_axis(axis, labels)` to `set_axis(labels, axis=0)`, for consistency with the rest of the API. The old signature is deprecated and will show a `FutureWarning`; see the sketch after this list (GH14636)
- `Series.argmin()` and `Series.argmax()` will now raise a `TypeError` when used with `object` dtypes, instead of a `ValueError` (GH13595)
- `Period` is now immutable, and will now raise an `AttributeError` when a user tries to assign a new value to the `ordinal` or `freq` attributes (GH17116)
- `to_datetime()` when passed a tz-aware `origin=` kwarg will now raise a more informative `ValueError` rather than a `TypeError` (GH16842)
- `to_datetime()` now raises a `ValueError` when format includes `%W` or `%U` without also including day of the week and calendar year (GH16774)
- Renamed non-functional `index` to `index_col` in `read_stata()` to improve API consistency (GH16342)
- Bug in `DataFrame.drop()` caused boolean labels `False` and `True` to be treated as labels 0 and 1 respectively when dropping indices from a numeric index. This will now raise a `ValueError` (GH16877)
- Restricted `DateOffset` keyword arguments. Previously, `DateOffset` subclasses allowed arbitrary keyword arguments which could lead to unexpected behavior. Now, only valid arguments will be accepted. (GH17176)
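To make the `set_axis` item above concrete, a short sketch of the new call order (the labels are illustrative; `inplace=False` reflects the 0.21-era keyword, which later versions removed):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Old, deprecated call order was s.set_axis(0, ['a', 'b', 'c']).
# The new order puts the labels first; inplace=False returns a relabeled copy.
s2 = s.set_axis(['a', 'b', 'c'], axis=0, inplace=False)
print(s2.index.tolist())  # ['a', 'b', 'c']
```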

### Deprecations¶

- `DataFrame.from_csv()` and `Series.from_csv()` have been deprecated in favor of `read_csv()` (GH4191)
- `read_excel()` has deprecated `sheetname` in favor of `sheet_name` for consistency with `.to_excel()` (GH10559)
- `read_excel()` has deprecated `parse_cols` in favor of `usecols` for consistency with `read_csv()` (GH4988)
- `read_csv()` has deprecated the `tupleize_cols` argument. Column tuples will always be converted to a `MultiIndex` (GH17060)
- `DataFrame.to_csv()` has deprecated the `tupleize_cols` argument. MultiIndex columns will always be written as rows in the CSV file (GH17060)
- The `convert` parameter has been deprecated in the `.take()` method, as it was not being respected (GH16948)
- `pd.options.html.border` has been deprecated in favor of `pd.options.display.html.border` (GH15793)
- `SeriesGroupBy.nth()` has deprecated `True` in favor of `'all'` for its kwarg `dropna` (GH11038)
- `DataFrame.as_blocks()` is deprecated, as this is exposing the internal implementation (GH17302)
- `pd.TimeGrouper` is deprecated in favor of `pandas.Grouper` (GH16747)
- `cdate_range` has been deprecated in favor of `bdate_range()`, which has gained `weekmask` and `holidays` parameters for building custom frequency date ranges. See the documentation for more details (GH17596)
- Passing `categories` or `ordered` kwargs to `Series.astype()` is deprecated, in favor of passing a CategoricalDtype (GH17636)
- `.get_value` and `.set_value` on `Series`, `DataFrame`, `Panel`, `SparseSeries`, and `SparseDataFrame` are deprecated in favor of using the `.iat[]` or `.at[]` accessors (GH15269)
- Passing a non-existent column in `.to_excel(..., columns=)` is deprecated and will raise a `KeyError` in the future (GH17295)
- The `raise_on_error` parameter to `Series.where()`, `Series.mask()`, `DataFrame.where()`, `DataFrame.mask()` is deprecated, in favor of `errors=` (GH14968)
- Using `DataFrame.rename_axis()` and `Series.rename_axis()` to alter index or column *labels* is now deprecated in favor of using `.rename`. `rename_axis` may still be used to alter the name of the index or columns (GH17833)
- `reindex_axis()` has been deprecated in favor of `reindex()`; see the sketch after this list, and see here for more (GH17833)
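To make the `reindex_axis` item above concrete, a minimal migration sketch (the column labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Deprecated: df.reindex_axis(['B', 'A'], axis=1)
# Preferred: pass the labels to reindex via the appropriate keyword.
df2 = df.reindex(columns=['B', 'A'])
```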

#### Series.select and DataFrame.select¶

The `Series.select()` and `DataFrame.select()` methods are deprecated in favor of using `df.loc[labels.map(crit)]` (GH12401)

```
In [72]: df = pd.DataFrame({'A': [1, 2, 3]}, index=['foo', 'bar', 'baz'])
```

```
In [3]: df.select(lambda x: x in ['bar', 'baz'])
FutureWarning: select is deprecated and will be removed in a future release. You can use .loc[crit] as a replacement
Out[3]:
A
bar 2
baz 3
```

```
In [73]: df.loc[df.index.map(lambda x: x in ['bar', 'baz'])]
Out[73]:
A
bar 2
baz 3
```

#### Series.argmax and Series.argmin¶

The behavior of `Series.argmax()` and `Series.argmin()` has been deprecated in favor of `Series.idxmax()` and `Series.idxmin()`, respectively (GH16830).

For compatibility with NumPy arrays, `pd.Series` implements `argmax` and
`argmin`. Since pandas 0.13.0, `argmax` has been an alias for
`pandas.Series.idxmax()`, and `argmin` has been an alias for
`pandas.Series.idxmin()`. They return the *label* of the maximum or minimum,
rather than the *position*.

We’ve deprecated the current behavior of `Series.argmax` and
`Series.argmin`. Using either of these will emit a `FutureWarning`. Use
`Series.idxmax()` if you want the label of the maximum. Use
`Series.values.argmax()` if you want the position of the maximum. Likewise for
the minimum. In a future release `Series.argmax` and `Series.argmin` will
return the position of the maximum or minimum.
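A small sketch of the two replacement idioms (index labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 30, 20], index=['a', 'b', 'c'])

s.idxmax()         # label of the maximum: 'b'
s.values.argmax()  # position of the maximum: 1
```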

### Removal of prior version deprecations/changes¶

- `read_excel()` has dropped the `has_index_names` parameter (GH10967)
- The `pd.options.display.height` configuration has been dropped (GH3663)
- The `pd.options.display.line_width` configuration has been dropped (GH2881)
- The `pd.options.display.mpl_style` configuration has been dropped (GH12190)
- `Index` has dropped the `.sym_diff()` method in favor of `.symmetric_difference()` (GH12591)
- `Categorical` has dropped the `.order()` and `.sort()` methods in favor of `.sort_values()` (GH12882)
- `eval()` and `DataFrame.eval()` have changed the default of `inplace` from `None` to `False` (GH11149)
- The function `get_offset_name` has been dropped in favor of the `.freqstr` attribute for an offset (GH11834)
- pandas no longer tests for compatibility with hdf5-files created with pandas < 0.11 (GH17404)

### Performance Improvements¶

- Improved performance of instantiating `SparseDataFrame` (GH16773)
- `Series.dt` no longer performs frequency inference, yielding a large speedup when accessing the attribute (GH17210)
- Improved performance of `set_categories()` by not materializing the values (GH17508)
- `Timestamp.microsecond` no longer re-computes on attribute access (GH17331)
- Improved performance of the `CategoricalIndex` for data that is already of categorical dtype (GH17513)
- Improved performance of `RangeIndex.min()` and `RangeIndex.max()` by using `RangeIndex` properties to perform the computations (GH17607)

### Documentation Changes¶

### Bug Fixes¶

#### Conversion¶

- Bug in assignment against datetime-like data with `int` may incorrectly convert to datetime-like (GH14145)
- Bug in assignment against `int64` data with `np.ndarray` with `float64` dtype may keep `int64` dtype (GH14001)
- Fixed the return type of `IntervalIndex.is_non_overlapping_monotonic` to be a Python `bool` for consistency with similar attributes/methods. Previously returned a `numpy.bool_`. (GH17237)
- Bug in `IntervalIndex.is_non_overlapping_monotonic` when intervals are closed on both sides and overlap at a point (GH16560)
- Bug in `Series.fillna()` returning a frame when `inplace=True` and `value` is a dict (GH16156)
- Bug in `Timestamp.weekday_name` returning a UTC-based weekday name when localized to a timezone (GH17354)
- Bug in `Timestamp.replace` when replacing `tzinfo` around DST changes (GH15683)
- Bug in `Timedelta` construction and arithmetic that would not propagate the `Overflow` exception (GH17367)
- Bug in `astype()` converting to object dtype when passed extension type classes (`DatetimeTZDtype`, `CategoricalDtype`) rather than instances. Now a `TypeError` is raised when a class is passed (GH17780)
- Bug in `to_numeric()` in which elements were not always being coerced to numeric when `errors='coerce'` (GH17007, GH17125)
- Bug in `DataFrame` and `Series` constructors where `range` objects are converted to `int32` dtype on Windows instead of `int64` (GH16804)

#### Indexing¶

- When called with a null slice (e.g. `df.iloc[:]`), the `.iloc` and `.loc` indexers return a shallow copy of the original object. Previously they returned the original object. (GH13873)
- When called on an unsorted `MultiIndex`, the `loc` indexer now will raise `UnsortedIndexError` only if proper slicing is used on non-sorted levels (GH16734)
- Fixes regression in 0.20.3 when indexing with a string on a `TimedeltaIndex` (GH16896)
- Fixed `TimedeltaIndex.get_loc()` handling of `np.timedelta64` inputs (GH16909)
- Fix `MultiIndex.sort_index()` ordering when the `ascending` argument is a list, but not all levels are specified, or they are in a different order (GH16934)
- Fixes bug where indexing with `np.inf` caused an `OverflowError` to be raised (GH16957)
- Bug in reindexing on an empty `CategoricalIndex` (GH16770)
- Fixes `DataFrame.loc` for setting with alignment and tz-aware `DatetimeIndex` (GH16889)
- Avoids `IndexError` when passing an Index or Series to `.iloc` with older numpy (GH17193)
- Allow unicode empty strings as placeholders in multilevel columns in Python 2 (GH17099)
- Bug in `.iloc` when used with inplace addition or assignment and an int indexer on a `MultiIndex` causing the wrong indexes to be read from and written to (GH17148)
- Bug in `.isin()` in which checking membership in empty `Series` objects raised an error (GH16991)
- Bug in `CategoricalIndex` reindexing in which specified indices containing duplicates were not being respected (GH17323)
- Bug in intersection of `RangeIndex` with negative step (GH17296)
- Bug in `IntervalIndex` where performing a scalar lookup fails for included right endpoints of non-overlapping monotonic decreasing indexes (GH16417, GH17271)
- Bug in `DataFrame.first_valid_index()` and `DataFrame.last_valid_index()` when no valid entry exists (GH17400)
- Bug in `Series.rename()` when called with a callable, which incorrectly altered the name of the `Series` rather than the name of the `Index` (GH17407)
- Bug in `String.str_get()` raising `IndexError` instead of inserting NaNs when using a negative index (GH17704)

#### I/O¶

- Bug in `read_hdf()` when reading a timezone aware index from `fixed` format HDFStore (GH17618)
- Bug in `read_csv()` in which columns were not being thoroughly de-duplicated (GH17060)
- Bug in `read_csv()` in which specified column names were not being thoroughly de-duplicated (GH17095)
- Bug in `read_csv()` in which non integer values for the header argument generated an unhelpful / unrelated error message (GH16338)
- Bug in `read_csv()` in which memory management issues in exception handling, under certain conditions, would cause the interpreter to segfault (GH14696, GH16798)
- Bug in `read_csv()` when called with `low_memory=False` in which a CSV with at least one column > 2GB in size would incorrectly raise a `MemoryError` (GH16798)
- Bug in `read_csv()` when called with a single-element list `header` would return a `DataFrame` of all NaN values (GH7757)
- Bug in `DataFrame.to_csv()` defaulting to 'ascii' encoding in Python 3, instead of 'utf-8' (GH17097)
- Bug in `read_stata()` where value labels could not be read when using an iterator (GH16923)
- Bug in `read_stata()` where the index was not set (GH16342)
- Bug in `read_html()` where import check fails when run in multiple threads (GH16928)
- Bug in `read_csv()` where automatic delimiter detection caused a `TypeError` to be thrown when a bad line was encountered rather than the correct error message (GH13374)
- Bug in `DataFrame.to_html()` with `notebook=True` where DataFrames with named indices or non-MultiIndex indices had undesired horizontal or vertical alignment for column or row labels, respectively (GH16792)
- Bug in `DataFrame.to_html()` in which there was no validation of the `justify` parameter (GH17527)
- Bug in `HDFStore.select()` when reading a contiguous mixed-data table featuring VLArray (GH17021)
- Bug in `to_json()` where several conditions (including objects with unprintable symbols, objects with deep recursion, overlong labels) caused segfaults instead of raising the appropriate exception (GH14256)

#### Plotting¶

- Bug in plotting methods using `secondary_y` and `fontsize` not setting secondary axis font size (GH12565)
- Bug when plotting `timedelta` and `datetime` dtypes on y-axis (GH16953)
- Line plots no longer assume monotonic x data when calculating xlims; they now show the entire lines even for unsorted x data (GH11310, GH11471)
- With matplotlib 2.0.0 and above, calculation of x limits for line plots is left to matplotlib, so that its new default settings are applied (GH15495)
- Bug in `Series.plot.bar` or `DataFrame.plot.bar` with `y` not respecting user-passed `color` (GH16822)
- Bug causing `plotting.parallel_coordinates` to reset the random seed when using random colors (GH17525)

#### Groupby/Resample/Rolling¶

- Bug in `DataFrame.resample(...).size()` where an empty `DataFrame` did not return a `Series` (GH14962)
- Bug in `infer_freq()` causing indices with 2-day gaps during the working week to be wrongly inferred as business daily (GH16624)
- Bug in `.rolling(...).quantile()` which incorrectly used different defaults than `Series.quantile()` and `DataFrame.quantile()` (GH9413, GH16211)
- Bug in `groupby.transform()` that would coerce boolean dtypes back to float (GH16875)
- Bug in `Series.resample(...).apply()` where an empty `Series` modified the source index and did not return the name of a `Series` (GH14313)
- Bug in `.rolling(...).apply(...)` with a `DataFrame` with a `DatetimeIndex`, a `window` of a timedelta-convertible, and `min_periods >= 1` (GH15305)
- Bug in `DataFrame.groupby` where index and column keys were not recognized correctly when the number of keys equaled the number of elements on the groupby axis (GH16859)
- Bug in `groupby.nunique()` with `TimeGrouper`, which could not handle `NaT` correctly (GH17575)
- Bug in `DataFrame.groupby` where a single level selection from a `MultiIndex` unexpectedly sorts (GH17537)
- Bug in `DataFrame.groupby` where a spurious warning is raised when a `Grouper` object is used to override an ambiguous column name (GH17383)
- Bug in `TimeGrouper` where behavior differed when it was passed as a list versus as a scalar (GH17530)

#### Sparse¶

- Bug in `SparseSeries` raising `AttributeError` when a dictionary is passed in as data (GH16905)
- Bug in `SparseDataFrame.fillna()` not filling all NaNs when the frame was instantiated from a SciPy sparse matrix (GH16112)
- Bug in `SparseSeries.unstack()` and `SparseDataFrame.stack()` (GH16614, GH15045)
- Bug in `make_sparse()` treating two numeric/boolean values that have the same bits as identical when the array `dtype` is `object` (GH17574)
- `SparseArray.all()` and `SparseArray.any()` are now implemented to handle `SparseArray`; these were used but not implemented (GH17570)

#### Reshaping¶

- Joining/merging with a non-unique `PeriodIndex` raised a `TypeError` (GH16871)
- Bug in `crosstab()` where non-aligned series of integers were cast to float (GH17005)
- Bug in merging categorical dtypes with datetimelikes, which incorrectly raised a `TypeError` (GH16900)
- Bug when using `isin()` on a large object series and a large comparison array (GH16012)
- Fixes regression from 0.20: `Series.aggregate()` and `DataFrame.aggregate()` allow dictionaries as return values again (GH16741)
- Fixes dtype of result with integer dtype input, from `pivot_table()` when called with `margins=True` (GH17013)
- Bug in `crosstab()` where passing two `Series` with the same name raised a `KeyError` (GH13279)
- `Series.argmin()`, `Series.argmax()`, and their counterparts on `DataFrame` and groupby objects work correctly with floating point data that contains infinite values (GH13595)
- Bug in `unique()` where checking a tuple of strings raised a `TypeError` (GH17108)
- Bug in `concat()` where the order of the result index was unpredictable if it contained non-comparable elements (GH17344)
- Fixes regression when sorting by multiple columns on a `datetime64` dtype `Series` with `NaT` values (GH16836)
- Bug in `pivot_table()` where the result's columns did not preserve the categorical dtype of `columns` when `dropna` was `False` (GH17842)
- Bug in `DataFrame.drop_duplicates` where dropping with non-unique column names raised a `ValueError` (GH17836)
- Bug in `unstack()` which, when called on a list of levels, would discard the `fillna` argument (GH13971)
- Bug in the alignment of `range` objects and other list-likes with `DataFrame`, leading to operations being performed row-wise instead of column-wise (GH17901)

#### Numeric¶

- Bug in `.clip()` with `axis=1` when a list-like is passed for `threshold`; previously this raised a `ValueError` (GH15390)
- `Series.clip()` and `DataFrame.clip()` now treat NA values for the upper and lower arguments as `None` instead of raising a `ValueError` (GH17276)

#### Categorical¶

- Bug in `Series.isin()` when called with a categorical (GH16639)
- Bug in the categorical constructor with empty values and categories causing the `.categories` to be an empty `Float64Index` rather than an empty `Index` with object dtype (GH17248)
- Bug in categorical operations with Series.cat not preserving the original Series' name (GH17509)
- Bug in `DataFrame.merge()` failing for categorical columns with boolean/int data types (GH17187)
- Bug in constructing a `Categorical`/`CategoricalDtype` when the specified `categories` are of categorical type (GH17884)

#### PyPy¶

- Compatibility with PyPy in `read_csv()` with `usecols=[<unsorted ints>]` and `read_json()` (GH17351)
- Split tests into cases for CPython and PyPy where needed, which highlights the fragility of index matching with `float('nan')`, `np.nan` and `NaT` (GH17351)
- Fix `DataFrame.memory_usage()` to support PyPy. Objects on PyPy do not have a fixed size, so an approximation is used instead (GH17228)

## v0.20.3 (July 7, 2017)¶

This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes and bug fixes. We recommend that all users upgrade to this version.

What’s new in v0.20.3

### Bug Fixes¶

- Fixed a bug in failing to compute rolling computations of a column-MultiIndexed `DataFrame` (GH16789, GH16825)
- Fixed a pytest marker failing downstream packages' test suites (GH16680)

#### Conversion¶

- Bug in pickle compat prior to the v0.20.x series, when `UTC` is a timezone in a Series/DataFrame/Index (GH16608)
- Bug in `Series` construction when passing a `Series` with `dtype='category'` (GH16524)
- Bug in `DataFrame.astype()` when passing a `Series` as the `dtype` kwarg (GH16717)

#### Indexing¶

- Bug in `Float64Index` causing an empty array instead of `None` to be returned from `.get(np.nan)` on a Series whose index did not contain any `NaN`s (GH8569)
- Bug in `MultiIndex.isin` causing an error when passing an empty iterable (GH16777)
- Fixed a bug in slicing a DataFrame/Series that has a `TimedeltaIndex` (GH16637)

#### I/O¶

- Bug in `read_csv()` in which files weren't opened as binary files by the C engine on Windows, causing EOF characters mid-field, which would fail (GH16039, GH16559, GH16675)
- Bug in `read_hdf()` in which reading a `Series` saved to an HDF file in 'fixed' format fails when an explicit `mode='r'` argument is supplied (GH16583)
- Bug in `DataFrame.to_latex()` where `bold_rows` was wrongly specified to be `True` by default, whereas in reality row labels remained non-bold regardless of the parameter provided (GH16707)
- Fixed an issue with `DataFrame.style()` where generated element ids were not unique (GH16780)
- Fixed loading a `DataFrame` with a `PeriodIndex`, from a `format='fixed'` HDFStore, in Python 3, that was written in Python 2 (GH16781)

#### Plotting¶

- Fixed regression that prevented RGB and RGBA tuples from being used as color arguments (GH16233)
- Fixed an issue with `DataFrame.plot.scatter()` that incorrectly raised a `KeyError` when categorical data is used for plotting (GH16199)

#### Reshaping¶

## v0.20.2 (June 4, 2017)¶

This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.

What’s new in v0.20.2

### Enhancements¶

- Unblocked access to additional compression types supported in pytables: 'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd' (GH14478)
- `Series` provides a `to_latex` method (GH16180)
- A new groupby method `ngroup()`, parallel to the existing `cumcount()`, has been added to return the group order (GH11642); see here and the sketch after this list.
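A minimal sketch contrasting `ngroup()` with `cumcount()` (the data is an illustrative assumption):

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaba')})
g = df.groupby('A')

g.ngroup()    # group number for each row: 0, 0, 1, 0
g.cumcount()  # position of each row within its group: 0, 1, 0, 2
```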

### Performance Improvements¶

- Performance regression fix when indexing with a list-like (GH16285)
- Performance regression fix for MultiIndexes (GH16319, GH16346)
- Improved performance of `.clip()` with scalar arguments (GH15400)
- Improved performance of groupby with categorical groupers (GH16413)
- Improved performance of `MultiIndex.remove_unused_levels()` (GH16556)

### Bug Fixes¶

- Silenced a warning on some Windows environments about "tput: terminal attributes: No such device or address" when detecting the terminal size. This fix only applies to python 3 (GH16496)
- Bug in using `pathlib.Path` or `py.path.local` objects with io functions (GH16291)
- Bug in `Index.symmetric_difference()` on two equal MultiIndex's, resulting in a `TypeError` (GH13490)
- Bug in `DataFrame.update()` with `overwrite=False` and `NaN` values (GH15593)
- Passing an invalid engine to `read_csv()` now raises an informative `ValueError` rather than `UnboundLocalError` (GH16511)
- Bug in `unique()` on an array of tuples (GH16519)
- Bug in `cut()` when `labels` are set, resulting in incorrect label ordering (GH16459)
- Fixed a compatibility issue with IPython 6.0's tab completion showing deprecation warnings on `Categoricals` (GH16409)

#### Conversion¶

- Bug in `to_numeric()` in which empty data inputs were causing a segfault of the interpreter (GH16302)
- Silenced numpy warnings when broadcasting a `DataFrame` to a `Series` with comparison ops (GH16378, GH16306)

#### Indexing¶

- Bug in `DataFrame.reset_index(level=)` with a single level index (GH16263)
- Bug in partial string indexing with a monotonic, but not strictly-monotonic, index incorrectly reversing the slice bounds (GH16515)
- Bug in `MultiIndex.remove_unused_levels()` that would not return a `MultiIndex` equal to the original (GH16556)

#### I/O¶

- Bug in `read_csv()` when `comment` is passed in a space delimited text file (GH16472)
- Bug in `read_csv()` not raising an exception with nonexistent columns in `usecols` when it had the correct length (GH14671)
- Bug that would force importing of the clipboard routines unnecessarily, potentially causing an import error on startup (GH16288)
- Bug that raised `IndexError` when HTML-rendering an empty `DataFrame` (GH15953)
- Bug in `read_csv()` in which tarfile object inputs were raising an error in Python 2.x for the C engine (GH16530)
- Bug where `DataFrame.to_html()` ignored the `index_names` parameter (GH16493)
- Bug where `pd.read_hdf()` returns numpy strings for index names (GH13492)
- Bug in `HDFStore.select_as_multiple()` where start/stop arguments were not respected (GH16209)

#### Plotting¶

#### Groupby/Resample/Rolling¶

#### Reshaping¶

- Bug in `DataFrame.stack` with unsorted levels in `MultiIndex` columns (GH16323)
- Bug in `pd.wide_to_long()` where no error was raised when `i` was not a unique identifier (GH16382)
- Bug in `Series.isin(..)` with a list of tuples (GH16394)
- Bug in construction of a `DataFrame` with mixed dtypes including an all-NaT column (GH16395)
- Bug in `DataFrame.agg()` and `Series.agg()` with aggregating on non-callable attributes (GH16405)

#### Numeric¶

- Bug in `.interpolate()`, where `limit_direction` was not respected when `limit=None` (default) was passed (GH16282)

## v0.20.1 (May 5, 2017)¶

This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- New `.agg()` API for Series/DataFrame similar to the groupby-rolling-resample APIs, see here
- Integration with the `feather-format`, including a new top-level `pd.read_feather()` and `DataFrame.to_feather()` method, see here
- The `.ix` indexer has been deprecated, see here
- `Panel` has been deprecated, see here
- Addition of an `IntervalIndex` and `Interval` scalar type, see here
- Improved user API when grouping by index levels in `.groupby()`, see here
- Improved support for `UInt64` dtypes, see here
- A new orient for JSON serialization, `orient='table'`, that uses the Table Schema spec and that gives the possibility for a more interactive repr in the Jupyter Notebook, see here
- Experimental support for exporting styled DataFrames (`DataFrame.style`) to Excel, see here
- Window binary corr/cov operations now return a MultiIndexed `DataFrame` rather than a `Panel`, as `Panel` is now deprecated, see here
- Support for S3 handling now uses `s3fs`, see here
- Google BigQuery support now uses the `pandas-gbq` library, see here

Warning

Pandas has changed the internal structure and layout of the code base.
This can affect imports that are not from the top-level `pandas.*` namespace; please see the changes here.

Check the API Changes and deprecations before updating.

Note

This is a combined release for 0.20.0 and 0.20.1.
Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas' `utils` routines. (GH16250)

What’s new in v0.20.0

- New features
  - `agg` API for DataFrame/Series
  - `dtype` keyword for data IO
  - `.to_datetime()` has gained an `origin` parameter
  - Groupby Enhancements
  - Better support for compressed URLs in `read_csv`
  - Pickle file I/O now supports compression
  - UInt64 Support Improved
  - GroupBy on Categoricals
  - Table Schema Output
  - SciPy sparse matrix from/to SparseDataFrame
  - Excel output for styled DataFrames
  - IntervalIndex
  - Other Enhancements

- Backwards incompatible API changes
  - Possible incompatibility for HDF5 formats created with pandas < 0.13.0
  - Map on Index types now return other Index types
  - Accessing datetime fields of Index now return Index
  - pd.unique will now be consistent with extension types
  - S3 File Handling
  - Partial String Indexing Changes
  - Concat of different float dtypes will not automatically upcast
  - Pandas Google BigQuery support has moved
  - Memory Usage for Index is more Accurate
  - DataFrame.sort_index changes
  - Groupby Describe Formatting
  - Window Binary Corr/Cov operations return a MultiIndex DataFrame
  - HDFStore where string comparison
  - Index.intersection and inner join now preserve the order of the left Index
  - Pivot Table always returns a DataFrame
  - Other API Changes

- Reorganization of the library: Privacy Changes
- Deprecations
- Removal of prior version deprecations/changes
- Performance Improvements
- Bug Fixes

### New features¶

#### `agg` API for DataFrame/Series¶

Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API
from groupby, window operations, and resampling. This allows aggregation operations in a concise way
by using `agg()` and `transform()`. The full documentation
is here (GH1623).

Here is a sample:

```
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
...: index=pd.date_range('1/1/2000', periods=10))
...:
In [2]: df.iloc[3:7] = np.nan
In [3]: df
Out[3]:
A B C
2000-01-01 1.682600 0.413582 1.689516
2000-01-02 -2.099110 -1.180182 1.595661
2000-01-03 -0.419048 0.522165 -1.208946
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.955435 -0.133009 2.011466
2000-01-09 0.578780 0.897126 -0.980013
2000-01-10 -0.045748 0.361601 -0.208039
```

One can operate using string function names, callables, lists, or dictionaries of these.

Using a single function is equivalent to `.apply`.

```
In [4]: df.agg('sum')
Out[4]:
A 0.652908
B 0.881282
C 2.899645
dtype: float64
```

Multiple aggregations with a list of functions.

```
In [5]: df.agg(['sum', 'min'])
Out[5]:
A B C
sum 0.652908 0.881282 2.899645
min -2.099110 -1.180182 -1.208946
```

Using a dict provides the ability to apply specific aggregations per column.
You will get a matrix-like output of all of the aggregators: the output has one row
per unique function, and functions that are not applied to a particular column produce `NaN`:

```
In [6]: df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
Out[6]:
A B
max NaN 0.897126
min -2.099110 -1.180182
sum 0.652908 NaN
```

The API also supports a `.transform()` function for broadcasting results.

```
In [7]: df.transform(['abs', lambda x: x - x.min()])
Out[7]:
A B C
abs <lambda> abs <lambda> abs <lambda>
2000-01-01 1.682600 3.781710 0.413582 1.593764 1.689516 2.898461
2000-01-02 2.099110 0.000000 1.180182 0.000000 1.595661 2.804606
2000-01-03 0.419048 1.680062 0.522165 1.702346 1.208946 0.000000
2000-01-04 NaN NaN NaN NaN NaN NaN
2000-01-05 NaN NaN NaN NaN NaN NaN
2000-01-06 NaN NaN NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN NaN NaN
2000-01-08 0.955435 3.054545 0.133009 1.047173 2.011466 3.220412
2000-01-09 0.578780 2.677890 0.897126 2.077307 0.980013 0.228932
2000-01-10 0.045748 2.053362 0.361601 1.541782 0.208039 1.000907
```

When presented with mixed dtypes that cannot be aggregated, `.agg()` will only take the valid
aggregations. This is similar to how groupby `.agg()` works. (GH15015)

```
In [8]: df = pd.DataFrame({'A': [1, 2, 3],
...: 'B': [1., 2., 3.],
...: 'C': ['foo', 'bar', 'baz'],
...: 'D': pd.date_range('20130101', periods=3)})
...:
In [9]: df.dtypes
Out[9]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
```

```
In [10]: df.agg(['min', 'sum'])
Out[10]:
A B C D
min 1 1.0 bar 2013-01-01
sum 6 6.0 foobarbaz NaT
```

#### `dtype` keyword for data IO¶

The `'python'` engine for `read_csv()`, as well as the `read_fwf()` function for parsing
fixed-width text files and `read_excel()` for parsing Excel files, now accept the `dtype` keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.

```
# assumes StringIO has been imported, e.g. from io import StringIO
In [11]: data = "a b\n1 2\n3 4"
In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]:
a int64
b int64
dtype: object
In [13]: pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes
Out[13]:
a float64
b object
dtype: object
```

#### `.to_datetime()` has gained an `origin` parameter¶

`to_datetime()` has gained a new parameter, `origin`, to define a reference date
from which to compute the resulting timestamps when parsing numerical values with a specific `unit` specified. (GH11276, GH11745)

For example, with 1960-01-01 as the starting date:

```
In [14]: pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
Out[14]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
```

The default is `origin='unix'`, which corresponds to `1970-01-01 00:00:00`,
commonly called the 'unix epoch' or POSIX time. This was the previous default, so this is a backward compatible change.

```
In [15]: pd.to_datetime([1, 2, 3], unit='D')
Out[15]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
```

#### Groupby Enhancements¶

Strings passed to `DataFrame.groupby()` as the `by` parameter may now reference either column names or index level names. Previously, only column names could be referenced. This makes it easy to group by a column and an index level at the same time. (GH5677)

```
In [16]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [17]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
In [18]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
....: 'B': np.arange(8)},
....: index=index)
....:
In [19]: df
Out[19]:
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
In [20]: df.groupby(['second', 'A']).sum()
Out[20]:
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
```

#### Better support for compressed URLs in `read_csv`¶

The compression code was refactored (GH12688). As a result, reading
dataframes from URLs in `read_csv()` or `read_table()` now supports
additional compression methods: `xz`, `bz2`, and `zip` (GH14570).
Previously, only `gzip` compression was supported. By default, compression of
URLs and paths is now inferred using their file extensions. Additionally,
support for bz2 compression in the Python 2 C-engine was improved (GH14874).

```
url = 'https://github.com/{repo}/raw/{branch}/{path}'.format(
    repo='pandas-dev/pandas',
    branch='master',
    path='pandas/tests/io/parser/data/salaries.csv.bz2',
)
df = pd.read_table(url, compression='infer')  # default, infer compression
df = pd.read_table(url, compression='bz2')    # explicitly specify compression
df.head(2)
```

#### Pickle file I/O now supports compression¶

`read_pickle()`, `DataFrame.to_pickle()` and `Series.to_pickle()`
can now read from and write to compressed pickle files. Compression methods
can be an explicit parameter or be inferred from the file extension.
See the docs here.

```
In [21]: df = pd.DataFrame({
....: 'A': np.random.randn(1000),
....: 'B': 'foo',
....: 'C': pd.date_range('20130101', periods=1000, freq='s')})
....:
```

Using an explicit compression type

```
In [22]: df.to_pickle("data.pkl.compress", compression="gzip")
In [23]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")
In [24]: rt.head()
Out[24]:
A B C
0 1.578227 foo 2013-01-01 00:00:00
1 -0.230575 foo 2013-01-01 00:00:01
2 0.695530 foo 2013-01-01 00:00:02
3 -0.466001 foo 2013-01-01 00:00:03
4 -0.154972 foo 2013-01-01 00:00:04
```

The default is to infer the compression type from the extension (`compression='infer'`):

```
In [25]: df.to_pickle("data.pkl.gz")
In [26]: rt = pd.read_pickle("data.pkl.gz")
In [27]: rt.head()
Out[27]:
A B C
0 1.578227 foo 2013-01-01 00:00:00
1 -0.230575 foo 2013-01-01 00:00:01
2 0.695530 foo 2013-01-01 00:00:02
3 -0.466001 foo 2013-01-01 00:00:03
4 -0.154972 foo 2013-01-01 00:00:04
In [28]: df["A"].to_pickle("s1.pkl.bz2")
In [29]: rt = pd.read_pickle("s1.pkl.bz2")
In [30]: rt.head()
Out[30]:
0 1.578227
1 -0.230575
2 0.695530
3 -0.466001
4 -0.154972
Name: A, dtype: float64
```

#### UInt64 Support Improved¶

Pandas has significantly improved support for operations involving unsigned,
or purely non-negative, integers. Previously, handling these integers would
result in improper rounding or data-type casting, leading to incorrect results.
Notably, a new numerical index, `UInt64Index`, has been created (GH14937).
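A minimal sketch of the dtype-preserving behavior (assuming a 0.20-era pandas where `UInt64Index` exists; `2**63` exceeds the `int64` range, which is exactly the case an unsigned index handles):

```python
import numpy as np
import pandas as pd

# Constructing an index from uint64 data yields a UInt64Index,
# preserving values too large for int64.
idx = pd.Index(np.array([1, 2, 2**63], dtype='uint64'))
print(idx.dtype)  # uint64
```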
