v0.17.1 (November 21, 2015)

Note

We are proud to announce that pandas has become a sponsored project of the NumFOCUS organization. This will help ensure the success of development of pandas as a world-class open-source project.

This is a minor bug-fix release from 0.17.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

  • Support for Conditional HTML Formatting, see the Conditional HTML Formatting section below
  • Releasing the GIL on the csv reader & other ops, see the Performance Improvements section below
  • Fixed regression in DataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)

New features

Conditional HTML Formatting

Warning

This is a new feature and is under active development. We’ll be adding features and possibly making breaking changes in future releases. Feedback is welcome.

We’ve added experimental support for conditional HTML formatting: the visual styling of a DataFrame based on the data. The styling is accomplished with HTML and CSS. Access the Styler class through the pandas.DataFrame.style attribute, which returns an instance of Styler with your data attached.

Here’s a quick example:

In [1]: np.random.seed(123)

In [2]: df = DataFrame(np.random.randn(10, 5), columns=list('abcde'))

In [3]: html = df.style.background_gradient(cmap='viridis', low=.5)

We can render the HTML to get the following table.

           a          b          c          d          e
0  -1.085631   0.997345   0.282978  -1.506295    -0.5786
1   1.651437  -2.426679  -0.428913   1.265936   -0.86674
2  -0.678886  -0.094709    1.49139  -0.638902  -0.443982
3  -0.434351    2.20593   2.186786   1.004054   0.386186
4   0.737369   1.490732  -0.935834   1.175829  -1.253881
5  -0.637752   0.907105  -1.428681  -0.140069  -0.861755
6  -0.255619  -2.798589  -1.771533  -0.699877   0.927462
7  -0.173636   0.002846   0.688223  -0.879536   0.283627
8  -0.805367  -1.727669    -0.3909   0.573806   0.338589
9   -0.01183   2.392365   0.412912   0.978736   2.238143

Styler interacts nicely with the Jupyter Notebook. See the documentation for more.

Enhancements

  • DatetimeIndex now supports conversion to strings with astype(str) (GH10442)

  • Support for compression (gzip/bz2) in pandas.DataFrame.to_csv() (GH7615)

  • pd.read_* functions can now also accept pathlib.Path or py._path.local.LocalPath objects for the filepath_or_buffer argument. (GH11033)

  • The DataFrame and Series functions .to_csv(), .to_html() and .to_latex() can now handle paths beginning with tildes (e.g. ~/Documents/) (GH11438)

  • DataFrame now uses the fields of a namedtuple as columns, if columns are not supplied (GH11181)

  • DataFrame.itertuples() now returns namedtuple objects, when possible. (GH11269, GH11625)

  • Added axvlines_kwds to parallel coordinates plot (GH10709)

  • Added an option to .info() and .memory_usage() for deep introspection of memory consumption. This can be expensive to compute and is therefore an optional parameter. (GH11595)

    In [4]: df = DataFrame({'A' : ['foo']*1000})
    
    In [5]: df['B'] = df['A'].astype('category')
    
    # shows the '+' as we have object dtypes
    In [6]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 2 columns):
    A    1000 non-null object
    B    1000 non-null category
    dtypes: category(1), object(1)
    memory usage: 9.0+ KB
    
    # we have an accurate memory assessment (but can be expensive to compute this)
    In [7]: df.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 2 columns):
    A    1000 non-null object
    B    1000 non-null category
    dtypes: category(1), object(1)
    memory usage: 75.4 KB
    
  • Index now has a fillna method (GH10089)

    In [8]: pd.Index([1, np.nan, 3]).fillna(2)
    Out[8]: Float64Index([1.0, 2.0, 3.0], dtype='float64')
    
  • Series of type category now make .str.<...> and .dt.<...> accessor methods / properties available, if the categories are of that type. (GH10661)

    In [9]: s = pd.Series(list('aabb')).astype('category')
    
    In [10]: s
    Out[10]: 
    0    a
    1    a
    2    b
    3    b
    Length: 4, dtype: category
    Categories (2, object): [a, b]
    
    In [11]: s.str.contains("a")
    Out[11]: 
    0     True
    1     True
    2    False
    3    False
    Length: 4, dtype: bool
    
    In [12]: date = pd.Series(pd.date_range('1/1/2015', periods=5)).astype('category')
    
    In [13]: date
    Out[13]: 
    0   2015-01-01
    1   2015-01-02
    2   2015-01-03
    3   2015-01-04
    4   2015-01-05
    Length: 5, dtype: category
    Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
    
    In [14]: date.dt.day
    Out[14]: 
    0    1
    1    2
    2    3
    3    4
    4    5
    Length: 5, dtype: int64
    
  • pivot_table now has a margins_name argument so you can use something other than the default of ‘All’ (GH3335)

  • Implement export of datetime64[ns, tz] dtypes with a fixed HDF5 store (GH11411)

  • Pretty printing sets (e.g. in DataFrame cells) now uses set literal syntax ({x, y}) instead of legacy Python syntax (set([x, y])) (GH11215)

  • Improve the error message in pandas.io.gbq.to_gbq() when a streaming insert fails (GH11285) and when the DataFrame does not match the schema of the destination table (GH11359)
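
Several of the enhancements above can be tried together in a short script. The following is a minimal sketch rather than an excerpt from the release: the Point record type, the sales data, and the "Total" label are illustrative assumptions, and it assumes a pandas version where these features are available.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical record type; its field names become the column labels (GH11181).
Point = namedtuple("Point", ["x", "y"])
points = pd.DataFrame([Point(1, 2), Point(3, 4)])
print(list(points.columns))

# itertuples() yields namedtuples, so fields can be read by name (GH11269).
first = next(points.itertuples())
print(first.x, first.y)

# margins_name replaces the default 'All' label on the margins (GH3335).
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south"],
        "product": ["a", "b", "a"],
        "units": [1, 2, 3],
    }
)
table = pd.pivot_table(
    sales,
    values="units",
    index="region",
    columns="product",
    aggfunc="sum",
    margins=True,
    margins_name="Total",
)
print(table)

# deep=True inspects the stored values instead of only the array buffers (GH11595).
strings = pd.DataFrame({"A": ["foo"] * 1000})
shallow = strings.memory_usage().sum()
deep = strings.memory_usage(deep=True).sum()
print(shallow, deep)
```

The deep figure is at least as large as the shallow one; how much larger depends on the dtype and the pandas version in use.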

API changes

  • raise NotImplementedError in Index.shift for non-supported index types (GH8038)
  • min and max reductions on datetime64 and timedelta64 dtyped series now result in NaT and not nan (GH11245).
  • Indexing with a null key will raise a TypeError, instead of a ValueError (GH11356)
  • Series.ptp will now ignore missing values by default (GH11163)
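
The NaT change for min and max reductions can be checked with all-missing series. This is a small sketch under the assumption of a reasonably recent pandas; the variable names are illustrative only.

```python
import pandas as pd

# Series whose values are all missing, one per datetimelike dtype.
missing_dates = pd.Series([pd.NaT, pd.NaT], dtype="datetime64[ns]")
missing_deltas = pd.Series([pd.NaT], dtype="timedelta64[ns]")

# Both reductions return NaT rather than a float nan (GH11245).
date_min = missing_dates.min()
delta_max = missing_deltas.max()
print(date_min, delta_max)
```

Returning NaT keeps the result in the same datetimelike family as the input, so downstream code does not need a separate float check.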

Deprecations

  • The pandas.io.ga module which implements google-analytics support is deprecated and will be removed in a future version (GH11308)
  • Deprecate the engine keyword in .to_csv(), which will be removed in a future version (GH11274)

Performance Improvements

  • Checking monotonic-ness before sorting on an index (GH11080)
  • Series.dropna performance improvement when its dtype can’t contain NaN (GH11159)
  • Release the GIL on most datetime field operations (e.g. DatetimeIndex.year, Series.dt.year), normalization, and conversion to and from Period, DatetimeIndex.to_period and PeriodIndex.to_timestamp (GH11263)
  • Release the GIL on some rolling algos: rolling_median, rolling_mean, rolling_max, rolling_min, rolling_var, rolling_kurt, rolling_skew (GH11450)
  • Release the GIL when reading and parsing text files in read_csv, read_table (GH11272)
  • Improved performance of rolling_median (GH11450)
  • Improved performance of to_excel (GH11352)
  • Performance bug in repr of Categorical categories, which was rendering the strings before chopping them for display (GH11305)
  • Performance improvement in Categorical.remove_unused_categories (GH11643).
  • Improved performance of Series constructor with no data and DatetimeIndex (GH11433)
  • Improved performance of shift, cumprod, and cumsum with groupby (GH4095)
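
Because read_csv releases the GIL while parsing, several files can in principle be parsed concurrently from ordinary threads. The sketch below only shows the pattern; the make_csv and parse helpers are hypothetical, the data is generated in memory, and any actual speedup depends on the workload and machine.

```python
import io
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def make_csv(n_rows):
    """Build a small two-column CSV document as a string (illustrative data)."""
    lines = ["a,b"] + [f"{i},{i * 2}" for i in range(n_rows)]
    return "\n".join(lines)


def parse(text):
    """Parse one CSV document; the parser can run without holding the GIL."""
    return pd.read_csv(io.StringIO(text))


texts = [make_csv(1000) for _ in range(4)]

# Each worker thread parses one document.
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(parse, texts))

print([frame.shape for frame in frames])
```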

Bug Fixes

  • SparseArray.__iter__() now does not cause PendingDeprecationWarning in Python 3.5 (GH11622)
  • Regression from 0.16.2 for output formatting of long floats/nan, restored in (GH11302)
  • Series.sort_index() now correctly handles the inplace option (GH11402)
  • Incorrectly distributed .c file in the build on PyPI, which caused an exception when reading a csv of floats and passing na_values=<a scalar> (GH11374)
  • Bug in .to_latex() output broken when the index has a name (GH10660)
  • Bug in HDFStore.append with strings whose encoded length exceeded the max unencoded length (GH11234)
  • Bug in merging datetime64[ns, tz] dtypes (GH11405)
  • Bug in HDFStore.select when comparing with a numpy scalar in a where clause (GH11283)
  • Bug in using DataFrame.ix with a MultiIndex indexer (GH11372)
  • Bug in date_range with ambiguous endpoints (GH11626)
  • Prevent adding new attributes to the accessors .str, .dt and .cat. Retrieving such a value was not possible, so error out on setting it. (GH10673)
  • Bug in tz-conversions with an ambiguous time and .dt accessors (GH11295)
  • Bug in output formatting when using an index of ambiguous times (GH11619)
  • Bug in comparisons of Series vs list-likes (GH11339)
  • Bug in DataFrame.replace with a datetime64[ns, tz] and a non-compat to_replace (GH11326, GH11153)
  • Bug in isnull where numpy.datetime64('NaT') in a numpy.array was not determined to be null (GH11206)
  • Bug in list-like indexing with a mixed-integer Index (GH11320)
  • Bug in pivot_table with margins=True when indexes are of Categorical dtype (GH10993)
  • Bug in DataFrame.plot where hex string colors could not be used (GH10299)
  • Regression in DataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)
  • Bug in pd.eval where unary ops in a list error (GH11235)
  • Bug in squeeze() with zero length arrays (GH11230, GH8999)
  • Bug in describe() dropping column names for hierarchical indexes (GH11517)
  • Bug in DataFrame.pct_change() not propagating axis keyword on .fillna method (GH11150)
  • Bug in .to_csv() when a mix of integer and string column names are passed as the columns parameter (GH11637)
  • Bug in indexing with a range (GH11652)
  • Bug in inference of numpy scalars and preserving dtype when setting columns (GH11638)
  • Bug in to_sql using unicode column names giving UnicodeEncodeError (GH11431).
  • Fix regression in setting of xticks in plot (GH11529).
  • Bug in holiday.dates where observance rules could not be applied to holiday and doc enhancement (GH11477, GH11533)
  • Fix plotting issues when having plain Axes instances instead of SubplotAxes (GH11520, GH11556).
  • Bug in DataFrame.to_latex() producing an extra rule when header=False (GH7124)
  • Bug in df.groupby(...).apply(func) when a func returns a Series containing a new datetimelike column (GH11324)
  • Bug in pandas.json when file to load is big (GH11344)
  • Bugs in to_excel with duplicate columns (GH11007, GH10982, GH10970)
  • Fixed a bug that prevented the construction of an empty series of dtype datetime64[ns, tz] (GH11245).
  • Bug in read_excel with MultiIndex containing integers (GH11317)
  • Bug in to_excel with openpyxl 2.2+ and merging (GH11408)
  • Bug in DataFrame.to_dict() producing a np.datetime64 object instead of Timestamp when only datetime is present in data (GH11327)
  • Bug in DataFrame.corr() raising an exception when computing the Kendall correlation for DataFrames with boolean and non-boolean columns (GH11560)
  • Link-time error caused by C inline functions on FreeBSD 10+ (with clang) (GH10510)
  • Bug in DataFrame.to_csv in passing through arguments for formatting MultiIndexes, including date_format (GH7791)
  • Bug in DataFrame.join() with how='right' producing a TypeError (GH11519)
  • Bug in Series.quantile with empty list results has Index with object dtype (GH11588)
  • Bug in pd.merge results in empty Int64Index rather than Index(dtype=object) when the merge result is empty (GH11588)
  • Bug in Categorical.remove_unused_categories when having NaN values (GH11599)
  • Bug in DataFrame.to_sparse() losing column names for MultiIndexes (GH11600)
  • Bug in DataFrame.round() with non-unique column index producing a Fatal Python error (GH11611)
  • Bug in DataFrame.round() with decimals being a non-unique indexed Series producing extra columns (GH11618)
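
A few of the fixes above are easy to confirm interactively. The snippet below is a hedged sketch of the corrected behavior, using small made-up inputs; it does not reproduce the original bug reports.

```python
import numpy as np
import pandas as pd

# NaT inside a numpy array is recognized as null (GH11206).
values = np.array(["NaT", "2015-11-21"], dtype="datetime64[ns]")
null_mask = pd.isnull(values)
print(null_mask)

# drop_duplicates on integer columns keeps one row per distinct pair (GH11376).
frame = pd.DataFrame({"a": [1, 1, 2], "b": [10, 10, 20]})
deduped = frame.drop_duplicates()
print(deduped)

# A right join keeps the index of the right-hand frame (GH11519).
left = pd.DataFrame({"x": [1, 2]}, index=["k1", "k2"])
right = pd.DataFrame({"y": [3, 4]}, index=["k2", "k3"])
joined = left.join(right, how="right")
print(joined)
```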

Contributors
