What’s New in 0.25.0 (April XX, 2019)

Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5 and higher. See Plan for dropping Python 2.7 for more details.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray instead.

These are the changes in pandas 0.25.0. See Release Notes for a full changelog including other versions of pandas.

Other Enhancements

Backwards incompatible API changes

Indexing with date strings with UTC offsets

Indexing a DataFrame or Series that has a DatetimeIndex with a date string containing a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH24076, GH16785)

In [1]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [2]: df
Out[2]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous Behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0

New Behavior:

In [3]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[3]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
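The difference can be seen directly by slicing with offsets that do and do not correspond to the row's UTC instant (a minimal sketch of the new behavior; it assumes the 'US/Pacific' timezone database entry is available):

```python
import pandas as pd

# Re-create the frame from the example above.
df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

# The row sits at 2019-01-01 00:00:00-08:00, i.e. 08:00 UTC.
# A slice at midnight +04:00 (20:00 UTC the previous day) no longer
# matches it, because the offset in the string is now respected:
no_match = df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
print(len(no_match))  # 0

# The UTC-equivalent slice does match:
match = df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
print(len(match))  # 1
```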

GroupBy.apply on DataFrame evaluates first group only once

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function twice on the first group to infer whether it was safe to use a fast code path. Particularly for functions with side effects, this was undesired behavior and could lead to surprises. (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417)

Now every group is evaluated only a single time.

In [4]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [5]: df
Out[5]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [6]: def func(group):
   ...:     print(group.name)
   ...:     return group
   ...: 

Previous Behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2

New Behavior:

In [7]: df.groupby("a").apply(func)
x
y
Out[7]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]
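The single-evaluation guarantee matters most for side-effecting functions. A sketch using the same frame, recording every invocation in a list instead of printing:

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

seen = []

def func(group):
    seen.append(group.name)  # side effect: record every invocation
    return group

df.groupby("a").apply(func)

# Each group is now evaluated exactly once, so the side effect fires
# once per group rather than twice for the first group:
print(seen)  # ['x', 'y']
```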

Concatenating Sparse Values

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).

In [8]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

Previous Behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame

New Behavior:

In [9]: type(pd.concat([df, df]))
Out[9]: pandas.core.frame.DataFrame

This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.
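A minimal sketch of the new return type (pd.SparseArray is also accessible as pd.arrays.SparseArray since 0.24; the latter spelling is used here):

```python
import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})

result = pd.concat([df, df])
print(type(result).__name__)   # DataFrame, not SparseDataFrame
print(result["A"].dtype)       # a Sparse dtype -- the values stay sparse
```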

Increased minimum versions for dependencies

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:

Package       Minimum Version  Required
------------  ---------------  --------
numpy         1.13.3           X
pytz          2015.4           X
bottleneck    1.2.1
numexpr       2.6.2
pytest (dev)  4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package      Minimum Version
-----------  ---------------
fastparquet  0.2.1
matplotlib   2.2.2
openpyxl     2.4.0
pyarrow      0.9.0
pytables     3.4.2
scipy        0.19.0
sqlalchemy   1.1.4
xarray       0.8.2
xlrd         1.0.0
xlsxwriter   0.7.7
xlwt         1.0.0

Other API Changes

Deprecations

Removal of prior version deprecations/changes

Performance Improvements

  • Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)

  • DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH25045)

  • Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)

  • Improved performance of pandas.core.groupby.GroupBy.quantile() (GH20405)

  • Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH25784)

  • Improved performance of read_csv() by faster parsing of N/A and boolean values (GH25804)

  • Improved performance of IntervalIndex.is_monotonic(), IntervalIndex.is_monotonic_increasing() and IntervalIndex.is_monotonic_decreasing() by removing conversion to MultiIndex (GH24813)

  • Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH25708)

  • Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH25922)

Bug Fixes

Categorical

Datetimelike

  • Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified instead of raising OutOfBoundsDatetime (GH23830)

  • Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305)

  • Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast to naive (GH25843)

  • Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH25851)

  • Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH25445)

Timedelta

Timezones

Numeric

  • Bug in to_numeric() in which large negative numbers were being improperly handled (GH24910)

  • Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH24910)

  • Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH25514)

  • Fixed misleading error messages in DataFrame.corr() and Series.corr(), and added support for passing a callable as the correlation method (GH25729)

  • Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH25557)

  • interpolate() now raises a helpful exception when a non-numeric index is used with methods that require a numeric index (GH21662)

  • Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH25928)
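The Series.divmod() fix above is easy to observe. A minimal sketch: on 0.25 and later, divmod returns a (quotient, remainder) pair of Series rather than raising ValueError:

```python
import pandas as pd

s = pd.Series([10, 7])

# Series.divmod returns a pair of Series: floor quotient and remainder
q, r = s.divmod(3)
print(list(q))  # [3, 2]
print(list(r))  # [1, 1]
```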

Conversion

Strings

Interval

Indexing

  • Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH25753).

  • Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError would be thrown in the future when the data to be appended contains new columns (GH22252).

Missing

MultiIndex

I/O

  • Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH17004)

  • Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH25040)

  • Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH21345)

  • Bug in read_json() for orient='table' and float index, as it infers index dtype by default, which is not applicable because index dtype is already defined in the JSON schema (GH25433)

  • Bug in read_json() for orient='table' and string of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH25435)

  • Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the resulting DataFrame with the string "nan" instead of numpy.nan (GH25468)

  • DataFrame.to_html() now raises TypeError instead of AssertionError when an invalid type is used for the classes parameter (GH25608)

  • Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH16718)

  • Bug in read_csv() not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (GH15086)

  • Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH25772)

  • Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH17280)

  • Bug in read_hdf() not properly closing store after a KeyError is raised (GH25766)

  • Bug in read_csv() which would not raise ValueError if a column index in usecols was out of bounds (GH25623)

  • Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH25772)

  • Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH25960)

  • Fixed bug in loading objects from S3 that contain # characters in the URL (GH25945)

  • Adds use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries. (GH26104)
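The usecols fix above can be checked with an in-memory CSV (a sketch; depending on the parser engine the exception may be pandas.errors.ParserError, which subclasses ValueError):

```python
import io
import pandas as pd

# Two-column CSV; column index 3 is out of bounds.
data = io.StringIO("a,b\n1,2\n")

try:
    pd.read_csv(data, usecols=[0, 3])
    raised = False
except ValueError:
    # Raised for the out-of-bounds column index since 0.25
    raised = True

print(raised)  # True
```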

Plotting

Groupby/Resample/Rolling

Reshaping

  • Bug in pandas.merge() that appended the string "None" to a column name when None was passed in suffixes, instead of leaving the column name as-is (GH24782).

  • Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (GH24212)

  • to_records() now accepts dtypes to its column_dtypes parameter (GH24895)

  • Bug in concat() where the order of an OrderedDict (and of a dict in Python 3.6+) was not respected when passed as the objs argument (GH21510)

  • Bug in pivot_table() where columns with NaN values are dropped even if dropna argument is False, when the aggfunc argument contains a list (GH22159)

  • Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH3232).

  • Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH22501)

  • Bug in DataFrame constructor when passing non-empty tuples would cause a segmentation fault (GH25691)

  • Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH26045)

  • Bug in Series.nlargest() that treated True as smaller than False (GH26154)
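The Series.nlargest() boolean fix can be verified with a small sketch:

```python
import pandas as pd

s = pd.Series([True, False, True])

# True now sorts above False, as expected for boolean data:
print(list(s.nlargest(2)))  # [True, True]
```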

Sparse

  • Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)

  • Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH16807)

  • Bug in SparseDataFrame where adding a column whose length of values did not match the length of the index raised AssertionError instead of ValueError (GH25484)

Other

Contributors
