v0.15.2 (December 12, 2014)

This is a minor release from 0.15.1 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. A small number of API changes were necessary to fix existing bugs. We recommend that all users upgrade to this version.

API changes

  • Indexing in MultiIndex beyond lex-sort depth is now supported, though a lexically sorted index will perform better. (GH2646)

    In [1]: df = pd.DataFrame({'jim':[0, 0, 1, 1],
       ...:                    'joe':['x', 'x', 'z', 'y'],
       ...:                    'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])
       ...: 
    
    In [2]: df
    Out[2]: 
                jolie
    jim joe          
    0   x    0.126970
        x    0.966718
    1   z    0.260476
        y    0.897237
    
    [4 rows x 1 columns]
    
    In [3]: df.index.lexsort_depth
    Out[3]: 1
    
    # in prior versions this would raise a KeyError
    # will now show a PerformanceWarning
    In [4]: df.loc[(1, 'z')]
    Out[4]: 
                jolie
    jim joe          
    1   z    0.260476
    
    [1 rows x 1 columns]
    
    # lexically sorting
    In [5]: df2 = df.sort_index()
    
    In [6]: df2
    Out[6]: 
                jolie
    jim joe          
    0   x    0.126970
        x    0.966718
    1   y    0.897237
        z    0.260476
    
    [4 rows x 1 columns]
    
    In [7]: df2.index.lexsort_depth
    Out[7]: 2
    
    In [8]: df2.loc[(1,'z')]
    Out[8]: 
                jolie
    jim joe          
    1   z    0.260476
    
    [1 rows x 1 columns]
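
    As a complementary sketch, sortedness can be checked before doing repeated lookups. This assumes a recent pandas release where MultiIndex.is_monotonic_increasing is available, and it uses fixed values in place of the random column above; the variable names are illustrative only.

```python
import warnings

import numpy as np
import pandas as pd

# Same shape as the frame above, with fixed values for reproducibility.
df = pd.DataFrame({'jim': [0, 0, 1, 1],
                   'joe': ['x', 'x', 'z', 'y'],
                   'jolie': [0.1, 0.2, 0.3, 0.4]}).set_index(['jim', 'joe'])

# The (1, 'z') row precedes (1, 'y'), so the index is not fully sorted.
sorted_before = df.index.is_monotonic_increasing

with warnings.catch_warnings():
    # Silence the PerformanceWarning that an unsorted lookup may emit.
    warnings.simplefilter('ignore')
    unsorted_hit = np.asarray(df.loc[(1, 'z'), 'jolie']).ravel().tolist()

# Sort once, then repeat the lookup on the lexically sorted index.
df2 = df.sort_index()
sorted_after = df2.index.is_monotonic_increasing
sorted_hit = np.asarray(df2.loc[(1, 'z'), 'jolie']).ravel().tolist()
```

    Sorting once up front is usually the better choice when many lookups follow.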
    
  • Bug in unique of Series with category dtype, which returned all categories regardless of whether they were “used” or not (see GH8559 for the discussion). Previous behaviour was to return all categories:

    In [3]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
    
    In [4]: cat
    Out[4]:
    [a, b, a]
    Categories (3, object): [a < b < c]
    
    In [5]: cat.unique()
    Out[5]: array(['a', 'b', 'c'], dtype=object)
    

    Now, only the categories that do effectively occur in the array are returned:

    In [9]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
    
    In [10]: cat.unique()
    Out[10]: 
    [a, b]
    Categories (2, object): [a, b]
    
  • Series.all and Series.any now support the level and skipna parameters. Series.all, Series.any, Index.all, and Index.any no longer support the out and keepdims parameters, which existed for compatibility with ndarray. Various index types no longer support the all and any aggregation functions and will now raise TypeError. (GH8302).

  • Allow equality comparisons of Series with a categorical dtype and object dtype; previously these would raise TypeError (GH8938)
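
    A minimal sketch of such a comparison, assuming two Series of equal length that are compared position by position; the variable names are illustrative only.

```python
import pandas as pd

cat_series = pd.Series(['a', 'b', 'a'], dtype='category')
obj_series = pd.Series(['a', 'b', 'c'], dtype=object)

# Element-wise, positional equality between category and object dtype.
equal_mask = (cat_series == obj_series).tolist()
```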

  • Bug in NDFrame: conflicting attribute/column names now behave consistently between getting and setting. Previously, when both a column and attribute named y existed, data.y would return the attribute, while data.y = z would update the column (GH8994)

    In [11]: data = pd.DataFrame({'x':[1, 2, 3]})
    
    In [12]: data.y = 2
    
    In [13]: data['y'] = [2, 4, 6]
    
    In [14]: data
    Out[14]: 
       x  y
    0  1  2
    1  2  4
    2  3  6
    
    [3 rows x 2 columns]
    
    # this assignment was inconsistent
    In [15]: data.y = 5
    

    Old behavior:

    In [6]: data.y
    Out[6]: 2
    
    In [7]: data['y'].values
    Out[7]: array([5, 5, 5])
    

    New behavior:

    In [16]: data.y
    Out[16]: 5
    
    In [17]: data['y'].values
    Out[17]: array([2, 4, 6])
    
  • Timestamp('now') is now equivalent to Timestamp.now() in that it returns the local time rather than UTC. Also, Timestamp('today') is now equivalent to Timestamp.today() and both have tz as a possible argument. (GH9000)
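
    The equivalence can be sketched as follows. The two calls run moments apart, so the values are compared by their gap rather than for equality; the variable names are illustrative only.

```python
import pandas as pd

from_string = pd.Timestamp('now')
from_method = pd.Timestamp.now()

# Both are local times taken moments apart, so the gap should be tiny.
gap_seconds = abs((from_method - from_string).total_seconds())

# A timezone can be requested explicitly through the tz argument.
today_utc = pd.Timestamp.today(tz='UTC')
```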

  • Fix negative step support for label-based slices (GH8753)

    Old behavior:

    In [1]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])
    Out[1]:
    a    0
    b    1
    c    2
    dtype: int64
    
    In [2]: s.loc['c':'a':-1]
    Out[2]:
    c    2
    dtype: int64
    

    New behavior:

    In [18]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])
    
    In [19]: s.loc['c':'a':-1]
    Out[19]: 
    c    2
    b    1
    a    0
    Length: 3, dtype: int64
    

Enhancements

Categorical enhancements:

  • Added ability to export Categorical data to Stata (GH8633). See here for limitations of categorical variables exported to Stata data files.
  • Added flag order_categoricals to StataReader and read_stata to select whether to order imported categorical data (GH8836). See here for more information on importing categorical variables from Stata data files.
  • Added ability to export Categorical data to/from HDF5 (GH7621). Queries work the same as if it was an object array. However, the category dtyped data is stored in a more efficient manner. See here for an example and caveats w.r.t. prior versions of pandas.
  • Added support for searchsorted() on Categorical class (GH8420).
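
As a small illustration of searchsorted() on a Categorical, the sketch below assumes an ordered Categorical whose values are already sorted by category order. In recent pandas releases a list of values returns an array of insertion points.

```python
import pandas as pd

# Values are already sorted according to the category order a < b < c.
cat = pd.Categorical(['a', 'a', 'b', 'c'],
                     categories=['a', 'b', 'c'], ordered=True)

# Leftmost insertion points that would keep the array sorted.
positions = cat.searchsorted(['b', 'c']).tolist()
```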

Other enhancements:

  • Added the ability to specify the SQL type of columns when writing a DataFrame to a database (GH8778). For example, specifying to use the sqlalchemy String type instead of the default Text type for string columns:

    from sqlalchemy.types import String
    data.to_sql('data_dtype', engine, dtype={'Col_1': String})
    
  • Series.all and Series.any now support the level and skipna parameters (GH8302):

    In [20]: s = pd.Series([False, True, False], index=[0, 0, 1])
    
    In [21]: s.any(level=0)
    Out[21]: 
    0     True
    1    False
    Length: 2, dtype: bool
    
  • Panel now supports the all and any aggregation functions (GH8302):

    In [22]: p = pd.Panel(np.random.rand(2, 5, 4) > 0.1)
    
    In [23]: p.all()
    Out[23]: 
          0      1     2      3
    0  True  False  True   True
    1  True   True  True  False
    2  True   True  True   True
    3  True   True  True   True
    4  True   True  True   True
    
    [5 rows x 4 columns]
    
  • Added support for utcfromtimestamp(), fromtimestamp(), and combine() on Timestamp class (GH5351).
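
    A short sketch of two of these constructors, mirroring their datetime.datetime counterparts. The date, time, and POSIX value are arbitrary examples.

```python
import datetime

import pandas as pd

# combine() joins a date and a time into a single Timestamp.
combined = pd.Timestamp.combine(datetime.date(2014, 12, 12),
                                datetime.time(9, 30))

# fromtimestamp() interprets POSIX seconds in local time, like datetime.
posix_seconds = 1418376600
from_posix = pd.Timestamp.fromtimestamp(posix_seconds)
reference = pd.Timestamp(datetime.datetime.fromtimestamp(posix_seconds))
```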

  • Added Google Analytics (pandas.io.ga) basic documentation (GH8835). See here.

  • Timedelta arithmetic returns NotImplemented in unknown cases, allowing extensions by custom classes (GH8813).
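
    To illustrate why returning NotImplemented matters, the sketch below defines a hypothetical Seconds class (not part of pandas). Because Timedelta does not recognize it, Python falls back to the reflected method on the custom class.

```python
import pandas as pd


class Seconds:
    """Hypothetical user-defined duration type, for illustration only."""

    def __init__(self, amount):
        self.amount = amount

    def __radd__(self, other):
        # Reached only after the left operand returned NotImplemented.
        if isinstance(other, pd.Timedelta):
            return other + pd.Timedelta(seconds=self.amount)
        return NotImplemented


total = pd.Timedelta(seconds=1) + Seconds(2)
```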

  • Timedelta now supports arithmetic with numpy.ndarray objects of the appropriate dtype (numpy 1.8 or newer only) (GH8884).

  • Added Timedelta.to_timedelta64() method to the public API (GH8884).
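
    The two Timedelta additions above can be sketched together, assuming an array of dtype timedelta64[ns] as the "appropriate dtype"; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

one_hour = pd.Timedelta(hours=1)
offsets = np.array([0, 3600 * 10**9], dtype='timedelta64[ns]')

# The scalar is broadcast over the timedelta64 array.
shifted = one_hour + offsets
shifted_seconds = shifted.astype('timedelta64[s]').astype(int).tolist()

# to_timedelta64() exposes the value as a numpy scalar.
as_numpy = one_hour.to_timedelta64()
```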

  • Added gbq.generate_bq_schema() function to the gbq module (GH8325).

  • Series now works with map objects the same way as generators (GH8909).
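
    For example, under Python 3, where map returns a lazy iterator, a Series can be built from it directly. This is a minimal sketch with illustrative names.

```python
import pandas as pd

letters = ['a', 'b', 'c']

# A map object is consumed the same way as a generator expression.
from_map = pd.Series(map(str.upper, letters))
from_generator = pd.Series(ch.upper() for ch in letters)
```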

  • Added context manager to HDFStore for automatic closing (GH8791).

  • to_datetime gains an exact keyword. When it is False, the input does not have to match the provided format string exactly. exact defaults to True, so exact matching is still the default (GH8904)
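
    A minimal sketch of the difference, using a made-up input string with extra text around the date:

```python
import pandas as pd

text = 'report 12/12/2014'
fmt = '%d/%m/%Y'

# exact=False lets the format match anywhere inside the string.
loose = pd.to_datetime(text, format=fmt, exact=False)

# The default, exact=True, requires the whole string to match.
try:
    pd.to_datetime(text, format=fmt)
    strict_failed = False
except ValueError:
    strict_failed = True
```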

  • Added axvlines boolean option to the parallel_coordinates plot function. It determines whether vertical lines are drawn and defaults to True.

  • Added ability to read table footers to read_html (GH8552)

  • to_sql now infers data types of non-NA values for columns that contain NA values and have dtype object (GH8778).

Performance

  • Reduce memory usage when skiprows is an integer in read_csv (GH8681)
  • Performance boost for to_datetime conversions with a passed format= and exact=False (GH8904)

Bug Fixes

  • Bug in concat of Series with category dtype, which was coercing to object. (GH8641)
  • Bug in Timestamp-Timestamp not returning a Timedelta type and datelike-datelike ops with timezones (GH8865)
  • Made the timezone mismatch exception consistent (either a tz operated with None or incompatible timezones); it is now a TypeError rather than a ValueError (a couple of edge cases only) (GH8865)
  • Bug in using a pd.Grouper(key=...) with no level/axis or level only (GH8795, GH8866)
  • Report a TypeError when invalid/no parameters are passed in a groupby (GH8015)
  • Bug in packaging pandas with py2app/cx_Freeze (GH8602, GH8831)
  • Bug in groupby signatures that didn’t include *args or **kwargs (GH8733).
  • io.data.Options now raises RemoteDataError when no expiry dates are available from Yahoo and when it receives no data from Yahoo (GH8761), (GH8783).
  • Unclear error message in csv parsing when passing dtype and names and the parsed data is a different data type (GH8833)
  • Bug in slicing a MultiIndex with an empty list and at least one boolean indexer (GH8781)
  • Timedelta kwargs may now be numpy ints and floats (GH8757).
  • Fixed several outstanding bugs for Timedelta arithmetic and comparisons (GH8813, GH5963, GH5436).
  • sql_schema now generates dialect appropriate CREATE TABLE statements (GH8697)
  • slice string method now takes step into account (GH8754)
  • Bug in BlockManager where setting values with different type would break block integrity (GH8850)
  • Bug in DatetimeIndex when using time object as key (GH8667)
  • Bug in merge where how='left' and sort=False would not preserve left frame order (GH7331)
  • Bug in MultiIndex.reindex where reindexing at level would not reorder labels (GH4088)
  • Bug in certain operations with dateutil timezones, manifesting with dateutil 2.3 (GH8639)
  • Regression in DatetimeIndex iteration with a Fixed/Local offset timezone (GH8890)
  • Bug in to_datetime when parsing nanoseconds using the %f format (GH8989)
  • Fix: The font size was only set on x axis if vertical or the y axis if horizontal. (GH8765)
  • Fixed division by 0 when reading big csv files in python 3 (GH8621)
  • Bug in outputting a MultiIndex with to_html(index=False), which would add an extra column (GH8452)
  • Imported categorical variables from Stata files retain the ordinal information in the underlying data (GH8836).
  • Defined .size attribute across NDFrame objects to provide compat with numpy >= 1.9.1; buggy with np.array_split (GH8846)
  • Skip testing of histogram plots for matplotlib <= 1.2 (GH8648).
  • Bug where get_data_google returned object dtypes (GH3995)
  • Bug in DataFrame.stack(..., dropna=False) when the DataFrame’s columns is a MultiIndex whose labels do not reference all its levels. (GH8844)
  • Bug in that Option context applied on __enter__ (GH8514)
  • Bug in resample that causes a ValueError when resampling across multiple days and the last offset is not calculated from the start of the range (GH8683)
  • Bug where DataFrame.plot(kind='scatter') fails when checking if an np.array is in the DataFrame (GH8852)
  • Bug in pd.infer_freq/DataFrame.inferred_freq that prevented proper sub-daily frequency inference when the index contained DST days (GH8772).
  • Bug where index name was still used when plotting a series with use_index=False (GH8558).
  • Bugs when trying to stack multiple columns, when some (or all) of the level names are numbers (GH8584).
  • Bug in MultiIndex where __contains__ returns wrong result if index is not lexically sorted or unique (GH7724)
  • Bug in CSV parsing with trailing white space in skipped rows (GH8679, GH8661, GH8983)
  • Regression where Timestamp did not parse the ‘Z’ zone designator for UTC (GH8771)
  • Bug in StataWriter that wrote strings with 244 characters irrespective of actual size (GH8969)
  • Fixed ValueError raised by cummin/cummax when datetime64 Series contains NaT. (GH8965)
  • Bug in DataReader where object dtype was returned if there are missing values (GH8980)
  • Bug in plotting if sharex was enabled and index was a timeseries, would show labels on multiple axes (GH3964).
  • Bug where passing a unit to the TimedeltaIndex constructor applied the nanosecond conversion twice (GH9011).
  • Bug in plotting of a period-like array (GH9012)
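
Two of the fixes above are easy to check directly. The sketch below uses made-up data to exercise the step argument of the slice string method (GH8754) and the row order of a left merge with sort=False (GH7331).

```python
import pandas as pd

# Series.str.slice honours the step argument (GH8754).
every_other = pd.Series(['abcdef', 'uvwxyz']).str.slice(step=2).tolist()

# A left merge with sort=False keeps the left frame's row order (GH7331).
left = pd.DataFrame({'key': ['c', 'a', 'b'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'c'], 'rval': [10, 20, 30]})
merged = left.merge(right, on='key', how='left', sort=False)
```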

Contributors
