What’s New in 0.25.0 (April XX, 2019)


Starting with the 0.25.x series of releases, pandas only supports Python 3.5 and higher. See Plan for dropping Python 2.7 for more details.


Panel has been fully removed. For N-D labeled data structures, please use xarray.

These are the changes in pandas 0.25.0. See Release Notes for a full changelog including other versions of pandas.


Groupby Aggregation with Relabeling

Pandas has added special groupby behavior, known as “named aggregation”, for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).

In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})

In [2]: animals
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   ...: )
      min_height  max_height  average_weight
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Pass the desired column names as **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection, and the second element is the aggregation function to apply. Pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.

In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', np.mean),
   ...: )
      min_height  max_height  average_weight
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there's no need for column selection, the values can just be the functions to apply.

In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
      min_height  max_height
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).

See Named Aggregation for more.
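As noted above, plain tuples can stand in for pd.NamedAgg. A minimal sketch (the output names tallest and lightest are illustrative, not from the examples above):

```python
import pandas as pd

# same animals data as in the examples above
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

# plain (column, aggfunc) tuples are accepted in place of pd.NamedAgg
out = animals.groupby('kind').agg(tallest=('height', 'max'),
                                  lightest=('weight', 'min'))
```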

Other Enhancements

Backwards incompatible API changes

Indexing with date strings with UTC offsets

Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH24076, GH16785)

In [6]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [7]: df
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous Behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
2019-01-01 00:00:00-08:00  0

New Behavior:

In [8]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
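The slice in the new behavior still matches the stored row because, once the offset is respected, the query times and the index value denote the same instant; a quick sanity check:

```python
import pandas as pd

# 12:00 at UTC+04:00 is the same instant as 00:00 at UTC-08:00
query = pd.Timestamp('2019-01-01 12:00:00+04:00')
stored = pd.Timestamp('2019-01-01 00:00:00-08:00')
same = query == stored
```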

MultiIndex constructed from levels and codes

Constructing a MultiIndex with NaN levels or with a codes value < -1 was previously allowed. Now, construction with a codes value < -1 raises a ValueError, and codes corresponding to NaN levels are reassigned to -1. (GH19387)

Previous Behavior:

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
                   codes=[[0, -2]])

New Behavior:

In [9]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
MultiIndex(levels=[[nan, None, NaT, 128, 2]],
           codes=[[-1, -1, -1, -1, 3, 4]])

In [10]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
ValueError                                Traceback (most recent call last)
<ipython-input-10-225a01af3975> in <module>
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

~/build/pandas-dev/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    186                 else:
    187                     kwargs[new_arg_name] = new_arg_value
--> 188             return func(*args, **kwargs)
    189         return wrapper
    190     return _deprecate_kwarg

~/build/pandas-dev/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
    245         if verify_integrity:
--> 246             new_codes = result._verify_integrity()
    247             result._codes = new_codes

~/build/pandas-dev/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
    317                 raise ValueError("On level {level}, code value ({code})"
    318                                  " < -1".format(
--> 319                                      level=i, code=level_codes.min()))
    320             if not level.is_unique:
    321                 raise ValueError("Level values must be unique: {values} on "

ValueError: On level 0, code value (-2) < -1
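The NaN-level reassignment can be checked directly; a minimal sketch (the level values here are illustrative):

```python
import numpy as np
import pandas as pd

# codes pointing at a NaN level are reassigned to -1, the missing-value code
mi = pd.MultiIndex(levels=[[np.nan, 'a', 'b']], codes=[[0, 1, 2]])
codes = list(mi.codes[0])
```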

GroupBy.apply on DataFrame evaluates first group only once

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function twice on the first group to infer whether it was safe to use a fast code path. Particularly for functions with side effects, this was undesirable and could lead to surprises. (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417)

Now every group is evaluated only a single time.

In [11]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [12]: df
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [13]: def func(group):
   ....:     print(group.name)
   ....:     return group

Previous Behavior:

In [3]: df.groupby('a').apply(func)
   a  b
0  x  1
1  y  2

New Behavior:

In [14]: df.groupby("a").apply(func)
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]
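The single-evaluation guarantee is easiest to see with a side-effecting function that records each call; a small sketch:

```python
import pandas as pd

calls = []

def func(group):
    calls.append(group.name)  # record every evaluation of the function
    return group

df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2]})
df.groupby('a').apply(func)
# each group is now evaluated exactly once
```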

Concatenating Sparse Values

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).

In [15]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

Previous Behavior:

In [2]: type(pd.concat([df, df]))
Out[2]: pandas.core.sparse.frame.SparseDataFrame

New Behavior:

In [16]: type(pd.concat([df, df]))
Out[16]: pandas.core.frame.DataFrame

This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.
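A minimal check of the new behavior (using pd.arrays.SparseArray, the spelling that persists in later pandas versions):

```python
import pandas as pd

# concat of sparse-valued DataFrames returns a plain DataFrame
df = pd.DataFrame({'A': pd.arrays.SparseArray([0, 1])})
out = pd.concat([df, df])
# the column keeps its sparse dtype in the result
result_type = type(out)
```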

The .str-accessor performs stricter type checks

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.

Previous Behavior:

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
0     True
1    False
2    False
dtype: bool

New Behavior:

In [17]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [18]: s
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [19]: s.str.startswith(b'a')
TypeError                                 Traceback (most recent call last)
<ipython-input-19-ac784692b361> in <module>
----> 1 s.str.startswith(b'a')

~/build/pandas-dev/pandas/pandas/core/strings.py in wrapper(self, *args, **kwargs)
   1814                        '{inf_type!r}.'.format(name=func_name,
   1815                                               inf_type=self._inferred_dtype))
-> 1816                 raise TypeError(msg)
   1817             return func(self, *args, **kwargs)
   1818         wrapper.__name__ = func_name

TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
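Since Series.str.decode() is still allowed on bytes data, the usual migration is to decode first; a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
# decode the bytes to str, then string methods work as before
result = s.str.decode('ascii').str.startswith('a')
```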

Incompatible Index Type Unions

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH23525).

Previous Behavior:

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')

New Behavior:

In [20]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[20]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')

In [21]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[21]: Index([1, 2, 3], dtype='object')
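The commutativity claim is about the elements of the result, not their order; a minimal check with incompatible dtypes:

```python
import pandas as pd

ints = pd.Index([1, 2, 3])
strs = pd.Index(['x', 'y'])
# incompatible dtypes now union to a base object-dtype Index
u1 = ints.union(strs)
u2 = strs.union(ints)
```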

DataFrame groupby ffill/bfill no longer return group labels

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned. (GH21521)

In [22]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [23]: df
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

Previous Behavior:

In [3]: df.groupby("a").ffill()
   a  b
0  x  1
1  y  2

New Behavior:

In [24]: df.groupby("a").ffill()
   b
0  1
1  2

[2 rows x 1 columns]
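A sketch with an actual missing value, showing that only the filled column comes back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [1.0, np.nan, 2.0]})
out = df.groupby('a').ffill()
# the group-label column 'a' is no longer part of the result
```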

DataFrame describe on an empty categorical / object column will return top and freq

When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' rows were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' rows will always be included, with numpy.nan in the case of an empty DataFrame (GH26397)

In [25]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [26]: df
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]

Previous Behavior:

In [3]: df.describe()
count           0
unique          0

New Behavior:

In [27]: df.describe()
count         0.0
unique        0.0
top           NaN
freq          NaN

[4 rows x 1 columns]
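A quick check that all four statistics are now present for an empty categorical column:

```python
import pandas as pd

df = pd.DataFrame({'empty_col': pd.Categorical([])})
desc = df.describe()
# count, unique, top and freq are all reported, with NaN where empty
stats = list(desc.index)
```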

__str__ methods now call __repr__ rather than vice versa

Pandas has until now mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a specific __repr__ was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__ if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust them (GH26495).
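For subclasses this means the display logic belongs in __repr__, with str() falling through to it; a minimal sketch (MySeries is a hypothetical subclass):

```python
import pandas as pd

class MySeries(pd.Series):
    # since 0.25, defining __repr__ alone is enough: str() falls through to it
    def __repr__(self):
        return f'MySeries(n={len(self)})'

s = MySeries([1, 2, 3])
```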

Increased minimum versions for dependencies

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:

Package          Minimum Version  Required
numpy            1.13.3           X
pytz             2015.4           X
python-dateutil  2.6.1            X
bottleneck       1.2.1
numexpr          2.6.2
pytest (dev)     4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version
beautifulsoup4  4.6.0
fastparquet     0.2.1
matplotlib      2.2.2
openpyxl        2.4.8
pyarrow         0.9.0
pytables        3.4.2
scipy           0.19.0
sqlalchemy      1.1.4
xarray          0.8.2
xlrd            1.1.0
xlsxwriter      0.9.8
xlwt            1.2.0

See Dependencies and Optional Dependencies for more.

Other API Changes


Sparse Subclasses

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better-provided by a Series or DataFrame with sparse values.

Previous Way

In [28]: df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})

In [29]: df.dtypes
A    Sparse[int64, nan]
Length: 1, dtype: object

New Way

In [30]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})

In [31]: df.dtypes
A    Sparse[int64, 0]
Length: 1, dtype: object

The memory usage of the two approaches is identical. See Migrating for more (GH19239).
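Sparse-specific functionality remains reachable through the .sparse accessor on such a Series; a sketch (again using the pd.arrays.SparseArray spelling):

```python
import pandas as pd

s = pd.Series(pd.arrays.SparseArray([0, 0, 1, 2]))
# two of four values differ from the fill value 0, so density is 0.5
density = s.sparse.density
```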

Other Deprecations

Removal of prior version deprecations/changes

  • Removed Panel (GH25047, GH25191, GH25231)
  • Removed the previously deprecated sheetname keyword in read_excel() (GH16442, GH20938)
  • Removed the previously deprecated TimeGrouper (GH16942)
  • Removed the previously deprecated parse_cols keyword in read_excel() (GH16488)
  • Removed the previously deprecated pd.options.html.border (GH16970)
  • Removed the previously deprecated convert_objects (GH11221)
  • Removed the previously deprecated select method of DataFrame and Series (GH17633)

Performance Improvements

  • Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
  • DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH25045)
  • Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)
  • Improved performance of pandas.core.groupby.GroupBy.quantile() (GH20405)
  • Improved performance of slicing and other selected operation on a RangeIndex (GH26565, GH26617, GH26722)
  • Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH25784)
  • Improved performance of read_csv() by faster parsing of N/A and boolean values (GH25804)
  • Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH24813)
  • Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH25708)
  • Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH25922)
  • Improved performance of nanops for dtypes that cannot store NaNs. Speedup is particularly prominent for Series.all() and Series.any() (GH25070)
  • Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH23785)
  • Improved performance of IntervalIndex.intersection() (GH24813)
  • Improved performance of read_csv() by faster concatenating date columns without extra conversion to string for integer/float zero and float NaN; by faster checking the string for the possibility of being a date (GH25754)
  • Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH24813)
  • Restored performance of DatetimeIndex.__iter__() by re-enabling specialized code path (GH26702)
  • Improved performance when building MultiIndex with at least one CategoricalIndex level (GH22044)

Bug Fixes

Datetimelike

  • Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified instead of raising OutOfBoundsDatetime (GH23830)
  • Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305)
  • Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast to naive (GH25843)
  • Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH25851)
  • Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH25445)
  • Bug in to_datetime() which does not replace the invalid argument with NaT when errors is set to coerce (GH26122)
  • Bug in adding DateOffset with nonzero month to DatetimeIndex would raise ValueError (GH26258)
  • Bug in to_datetime() which raises unhandled OverflowError when called with mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH25512)
  • Bug in isin() for datetimelike indexes; DatetimeIndex, TimedeltaIndex and PeriodIndex where the levels parameter was ignored. (GH26675)
  • Bug in to_datetime() which raises TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
  • Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH26689)

Timedelta

  • Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH25913)
  • Bug with comparisons between Timedelta and NaT raising TypeError (GH26039)
  • Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH26381)
  • Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH26689)

Numeric

  • Bug in to_numeric() in which large negative numbers were being improperly handled (GH24910)
  • Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH24910)
  • Bug in to_numeric() in which invalid values for errors were being allowed (GH26466)
  • Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH25514)
  • Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable. (GH25729)
  • Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH25557)
  • interpolate() now raises a helpful exception when a non-numeric index is used with methods which require a numeric index (GH21662)
  • Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH25928)
  • Fixed bug where casting all-boolean array to integer extension array failed (GH25211)


I/O

  • Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH17004)
  • Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH25040)
  • Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH21345)
  • Bug in read_json() for orient='table' and float index, as it infers index dtype by default, which is not applicable because index dtype is already defined in the JSON schema (GH25433)
  • Bug in read_json() for orient='table' and string of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH25435)
  • Bug in json_normalize() for errors='ignore' where missing values in the input data, were filled in resulting DataFrame with the string "nan" instead of numpy.nan (GH25468)
  • DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH25608)
  • Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH16718)
  • Bug in read_csv() not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (GH15086)
  • Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH25772)
  • Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH17280)
  • Bug in read_hdf() not properly closing store after a KeyError is raised (GH25766)
  • Bug in read_csv which would not raise ValueError if a column index in usecols was out of bounds (GH25623)
  • Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH25772)
  • Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH25960)
  • Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH25941)
  • Fixed bug in loading objects from S3 that contain # characters in the URL (GH25945)
  • Adds use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries. (GH26104)
  • Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH24889)
  • Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH26168)
  • Added cache_dates=True parameter to read_csv(), which allows to cache unique dates when they are parsed (GH25990)
  • DataFrame.to_excel() now raises a ValueError when the caller’s dimensions exceed the limitations of Excel (GH26051)
  • Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH26545)
  • read_excel() now raises a ValueError when input is of type pandas.io.excel.ExcelFile and engine param is passed since pandas.io.excel.ExcelFile has an engine defined (GH26566)
  • Bug while selecting from HDFStore with where='' specified (GH26610).

Reshaping

  • Bug in pandas.merge() which appended the string "None" to the column name when None was passed in suffixes, instead of leaving the column name as-is (GH24782).
  • Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH24212, GH25009)
  • to_records() now accepts dtypes to its column_dtypes parameter (GH24895)
  • Bug in concat() where order of OrderedDict (and dict in Python 3.6+) is not respected, when passed in as objs argument (GH21510)
  • Bug in pivot_table() where columns with NaN values are dropped even if dropna argument is False, when the aggfunc argument contains a list (GH22159)
  • Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH3232).
  • Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH22501)
  • Bug in DataFrame instantiating with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH26349).
  • Bug in DataFrame instantiating with a range (e.g. pd.DataFrame(range(3))) raised an error (GH26342).
  • Bug in DataFrame constructor when passing non-empty tuples would cause a segmentation fault (GH25691)
  • Bug in Series.apply() failed when the series is a timezone aware DatetimeIndex (GH25959)
  • Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH26045)
  • Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH26053)
  • Bug in Series.nlargest() which treated True as smaller than False (GH26154)
  • Bug in DataFrame.pivot_table() with a IntervalIndex as pivot index would raise TypeError (GH25814)

Sparse

  • Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
  • Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH16807)
  • Bug in SparseDataFrame when adding a column in which the length of values does not match length of index, AssertionError is raised instead of raising ValueError (GH25484)
  • Introduce a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs that are not coo matrices (GH26554)

Other

  • Removed unused C functions from vendored UltraJSON implementation (GH26198)
  • Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH25696).
  • Allow Index and RangeIndex to be passed to numpy min and max functions (GH26125)

