pandas_streaming.df#

Streaming#

The main class is an interface that mimics the pandas.DataFrame interface to offer a short list of methods which apply to an iterator of dataframes. This provides a streaming version of pandas: creating an instance is fast because no data is processed until it is needed. Iterators can be chained, as many map-reduce frameworks do.

The module also implements additional helper functions which are not necessarily tied to the streaming version of the dataframes. Many methods have been rewritten to support streaming, among them the IO methods: read_csv, read_df, read_json.
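As a minimal sketch of the streaming idea (using plain pandas, not the library itself), a dataframe can be processed as an iterator of chunks, with transformations chained lazily so that nothing is computed until the iterator is consumed:

```python
import io
import pandas

csv = "a,b\n1,x\n2,y\n3,x\n4,y\n"


def chunks():
    # pandas.read_csv with chunksize returns an iterator of DataFrames
    return pandas.read_csv(io.StringIO(csv), chunksize=2)


def keep_x(it):
    # lazily filter each chunk; nothing runs until iteration starts
    for chunk in it:
        yield chunk[chunk.b == "x"]


# materialize the chained iterators into a regular DataFrame
result = pandas.concat(keep_x(chunks()), ignore_index=True)
print(result)
```

The same pattern underlies the streaming dataframe: each method wraps the previous iterator with a new one, and the data flows through all steps chunk by chunk.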

Data Manipulation#

pandas_streaming.df.dataframe_helpers.dataframe_hash_columns(df, cols=None, hash_length=10, inplace=False)[source]#

Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.

Parameters:
  • df – dataframe

  • cols – columns to hash, or None for all of them

  • hash_length – for strings only, length of the hash

  • inplace – modifies inplace

Returns:

new dataframe

This might be useful to anonymize data before making it public.

Hashes a set of columns in a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_hash_columns

df = pandas.DataFrame(
    [
        dict(a=1, b="e", c=5.6, ind="a1", ai=1),
        dict(b="f", c=5.7, ind="a2", ai=2),
        dict(a=4, b="g", ind="a3", ai=3),
        dict(a=8, b="h", c=5.9, ai=4),
        dict(a=16, b="i", c=6.2, ind="a5", ai=5),
    ]
)
print(df)
print("--------------")
df2 = dataframe_hash_columns(df)
print(df2)

>>>

          a  b    c  ind  ai
    0   1.0  e  5.6   a1   1
    1   NaN  f  5.7   a2   2
    2   4.0  g  NaN   a3   3
    3   8.0  h  5.9  NaN   4
    4  16.0  i  6.2   a5   5
    --------------
                  a           b             c         ind        ai
    0  4.648669e+11  3f79bb7b43  3.355454e+11  f55ff16f66  65048080
    1           NaN  252f10c836  5.803745e+11  2c3a4249d7   1214325
    2  2.750847e+11  cd0aa98561           NaN  f46dd28a54  80131111
    3  1.940968e+11  aaa9402664  9.635096e+10         NaN  19167269
    4  1.083806e+12  de7d1b721a  3.183198e+11  66220e7159   8788782
pandas_streaming.df.connex_split.dataframe_shuffle(df, random_state=None)[source]#

Shuffles a dataframe.

Parameters:
  • df – dataframe

  • random_state – seed

Returns:

new pandas.DataFrame

Shuffles the rows of a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_shuffle

df = pandas.DataFrame(
    [
        dict(a=1, b="e", c=5.6, ind="a1"),
        dict(a=2, b="f", c=5.7, ind="a2"),
        dict(a=4, b="g", c=5.8, ind="a3"),
        dict(a=8, b="h", c=5.9, ind="a4"),
        dict(a=16, b="i", c=6.2, ind="a5"),
    ]
)
print(df)
print("----------")

shuffled = dataframe_shuffle(df, random_state=0)
print(shuffled)

>>>

        a  b    c ind
    0   1  e  5.6  a1
    1   2  f  5.7  a2
    2   4  g  5.8  a3
    3   8  h  5.9  a4
    4  16  i  6.2  a5
    ----------
        a  b    c ind
    2   4  g  5.8  a3
    0   1  e  5.6  a1
    1   2  f  5.7  a2
    3   8  h  5.9  a4
    4  16  i  6.2  a5
pandas_streaming.df.dataframe_helpers.dataframe_unfold(df, col, new_col=None, sep=',')[source]#

One column may contain concatenated values. This function splits these values and multiplies the rows for each split value.

Parameters:
  • df – dataframe

  • col – column with the concatenated values (strings)

  • new_col – new column name, if None, use a default value

  • sep – separator

Returns:

a new dataframe

Unfolds a column of a dataframe.

<<<

import pandas
import numpy
from pandas_streaming.df import dataframe_unfold

df = pandas.DataFrame([dict(a=1, b="e,f"), dict(a=2, b="g"), dict(a=3)])
print(df)
df2 = dataframe_unfold(df, "b")
print("----------")
print(df2)

# To fold:
folded = df2.groupby("a").apply(
    # each group is a sub-dataframe, not a single row
    lambda gr: ",".join(gr["b_unfold"].dropna())
    if len(gr["b_unfold"].dropna()) > 0
    else numpy.nan
)
print("----------")
print(folded)

>>>

       a    b
    0  1  e,f
    1  2    g
    2  3  NaN
    ----------
       a    b b_unfold
    0  1  e,f        e
    1  1  e,f        f
    2  2    g        g
    3  3  NaN      NaN
    ----------
    a
    1    e,f
    2      g
    3    NaN
    dtype: object
pandas_streaming.df.dataframe_helpers.pandas_groupby_nan(df, by, axis=0, as_index=False, suffix=None, nanback=True, **kwargs)[source]#

Does a groupby including keeping missing values (nan).

Parameters:
  • df – dataframe

  • by – column or list of columns

  • axis – only 0 is allowed

  • as_index – should be False

  • suffix – None or a string

  • nanback – put nan back in the index, otherwise it leaves a replacement for nan. (does not work when grouping by multiple columns)

  • kwargs – other parameters sent to groupby

Returns:

groupby results

See groupby and missing values. If no nan is detected, the function falls back to the regular pandas.DataFrame.groupby, which has the following behavior.

Group a dataframe by one column including nan values

The regular groupby of a pandas.DataFrame removes nan values from the index.

<<<

from pandas import DataFrame

data = [dict(a=2, ind="a", n=1), dict(a=2, ind="a"), dict(a=3, ind="b"), dict(a=30)]
df = DataFrame(data)
print(df)
gr = df.groupby(["ind"]).sum()
print(gr)

>>>

        a  ind    n
    0   2    a  1.0
    1   2    a  NaN
    2   3    b  NaN
    3  30  NaN  NaN
         a    n
    ind        
    a    4  1.0
    b    3  0.0

Function pandas_groupby_nan keeps them.

<<<

from pandas import DataFrame
from pandas_streaming.df import pandas_groupby_nan

data = [dict(a=2, ind="a", n=1), dict(a=2, ind="a"), dict(a=3, ind="b"), dict(a=30)]
df = DataFrame(data)
gr2 = pandas_groupby_nan(df, ["ind"]).sum()
print(gr2)

>>>

       ind   a    n
    0    a   4  1.0
    1    b   3  0.0
    2  NaN  30  0.0

Complex splits#

Splitting a database into train and test sets is usually simple, except when rows are not independent and share some ids. In that case, the following functions try to build two partitions keeping the ids separate, or as separate as possible: train_test_apart_stratify, train_test_connex_split, train_test_split_weights.
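As a minimal illustration of the underlying idea (plain pandas with hypothetical column names, not the library's implementation), an id-aware split partitions the unique ids rather than the rows, so that rows sharing an id always land on the same side:

```python
import pandas

# hypothetical data: several rows may share the same user id
df = pandas.DataFrame(
    dict(user=["u1", "u1", "u2", "u3", "u3", "u4"], x=[1, 2, 3, 4, 5, 6])
)

# shuffle the unique ids, then split the ids (not the rows)
ids = df["user"].drop_duplicates().sample(frac=1.0, random_state=0)
train_ids = set(ids[: len(ids) // 2])

train = df[df["user"].isin(train_ids)]
test = df[~df["user"].isin(train_ids)]

# no user appears on both sides of the split
assert not (set(train["user"]) & set(test["user"]))
```

The library's functions refine this idea, for example by balancing the partition weights or stratifying on another column, which a naive id split does not guarantee.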

Extensions#