pandas_streaming.df¶
Streaming¶
The main class is an interface which mimics the
pandas.DataFrame
interface and offers
a short list of methods which apply to an
iterator of dataframes. This provides, in effect,
a streaming version of it. As a result, the creation
of an instance is fast as long as the data is not
processed. Iterators can be chained, as many map reduce
frameworks do.
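For instance, a minimal sketch of the intended usage, assuming a CSV file named data.csv is available (the file name is a placeholder, output not shown):
<<<
from pandas_streaming.df import StreamingDataFrame

# Creating the streaming dataframe is cheap: nothing is read here,
# the file is parsed chunk by chunk only when the data is consumed.
sdf = StreamingDataFrame.read_csv("data.csv")  # "data.csv" is a placeholder

# Every chunk is a regular pandas.DataFrame.
for chunk in sdf:
    print(chunk.shape)

# Materializes everything into a single pandas.DataFrame
# (only reasonable if it fits in memory).
df = sdf.to_dataframe()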
The module also implements additional useful functions
which are not necessarily tied to the streaming version of the dataframes.
Many methods have been rewritten to support
streaming. Among them, the IO methods:
read_csv, read_df, read_json.
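As a sketch of the IO side, read_df wraps an existing pandas.DataFrame into a streaming one (the dataframe content below is made up for the example, output not shown):
<<<
import pandas
from pandas_streaming.df import StreamingDataFrame

df = pandas.DataFrame(dict(a=[5, 6, 7], b=["x", "y", "z"]))

# Wraps an in-memory dataframe into a StreamingDataFrame.
sdf = StreamingDataFrame.read_df(df)
print(sdf.to_dataframe())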
Data Manipulation¶
- pandas_streaming.df.dataframe_helpers.dataframe_hash_columns(df, cols=None, hash_length=10, inplace=False)[source][source]¶
Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.
- Parameters:
df – dataframe
cols – columns to hash or None for all of them
hash_length – for strings only, length of the hash
inplace – modifies inplace
- Returns:
new dataframe
This might be useful to anonymize data before making it public.
Hashes a set of columns in a dataframe
<<<
import pandas
from pandas_streaming.df import dataframe_hash_columns

df = pandas.DataFrame(
    [
        dict(a=1, b="e", c=5.6, ind="a1", ai=1),
        dict(b="f", c=5.7, ind="a2", ai=2),
        dict(a=4, b="g", ind="a3", ai=3),
        dict(a=8, b="h", c=5.9, ai=4),
        dict(a=16, b="i", c=6.2, ind="a5", ai=5),
    ]
)
print(df)
print("--------------")
df2 = dataframe_hash_columns(df)
print(df2)
>>>
      a  b    c  ind  ai
0   1.0  e  5.6   a1   1
1   NaN  f  5.7   a2   2
2   4.0  g  NaN   a3   3
3   8.0  h  5.9  NaN   4
4  16.0  i  6.2   a5   5
--------------
              a           b             c         ind        ai
0  4.648669e+11  3f79bb7b43  3.355454e+11  f55ff16f66  65048080
1           NaN  252f10c836  5.803745e+11  2c3a4249d7   1214325
2  2.750847e+11  cd0aa98561           NaN  f46dd28a54  80131111
3  1.940968e+11  aaa9402664  9.635096e+10         NaN  19167269
4  1.083806e+12  de7d1b721a  3.183198e+11  66220e7159   8788782
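The cols parameter restricts the hashing to a subset of columns; a small sketch with made-up data (output not shown):
<<<
import pandas
from pandas_streaming.df import dataframe_hash_columns

df = pandas.DataFrame([dict(a=1, b="e", ind="a1"), dict(a=2, b="f", ind="a2")])

# Only column "ind" is hashed, "a" and "b" are left untouched.
df2 = dataframe_hash_columns(df, cols=["ind"])
print(df2)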
- pandas_streaming.df.connex_split.dataframe_shuffle(df, random_state=None)[source][source]¶
Shuffles a dataframe.
- Parameters:
df – pandas.DataFrame
random_state – seed
- Returns:
new pandas.DataFrame
Shuffles the rows of a dataframe
<<<
import pandas
from pandas_streaming.df import dataframe_shuffle

df = pandas.DataFrame(
    [
        dict(a=1, b="e", c=5.6, ind="a1"),
        dict(a=2, b="f", c=5.7, ind="a2"),
        dict(a=4, b="g", c=5.8, ind="a3"),
        dict(a=8, b="h", c=5.9, ind="a4"),
        dict(a=16, b="i", c=6.2, ind="a5"),
    ]
)
print(df)
print("----------")
shuffled = dataframe_shuffle(df, random_state=0)
print(shuffled)
>>>
    a  b    c ind
0   1  e  5.6  a1
1   2  f  5.7  a2
2   4  g  5.8  a3
3   8  h  5.9  a4
4  16  i  6.2  a5
----------
    a  b    c ind
2   4  g  5.8  a3
0   1  e  5.6  a1
1   2  f  5.7  a2
3   8  h  5.9  a4
4  16  i  6.2  a5
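As the output above shows, the original index travels with the rows, so the initial order can be recovered with plain pandas, for instance (output not shown):
<<<
import pandas
from pandas_streaming.df import dataframe_shuffle

df = pandas.DataFrame(dict(a=[1, 2, 4], ind=["a1", "a2", "a3"]))
shuffled = dataframe_shuffle(df, random_state=0)

# The index values are shuffled along with the rows:
# sorting on the index restores the initial order.
print(shuffled.sort_index().equals(df))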
- pandas_streaming.df.dataframe_helpers.dataframe_unfold(df, col, new_col=None, sep=',')[source][source]¶
One column may contain concatenated values. This function splits these values and repeats the row once for each split value.
- Parameters:
df – dataframe
col – column with the concatenated values (strings)
new_col – new column name, if None, a default value is used
sep – separator
- Returns:
a new dataframe
Unfolds a column of a dataframe.
<<<
import pandas
import numpy
from pandas_streaming.df import dataframe_unfold

df = pandas.DataFrame([dict(a=1, b="e,f"), dict(a=2, b="g"), dict(a=3)])
print(df)
df2 = dataframe_unfold(df, "b")
print("----------")
print(df2)

# To fold:
folded = df2.groupby("a").apply(
    lambda row: (
        ",".join(row["b_unfold"].dropna())
        if len(row["b_unfold"].dropna()) > 0
        else numpy.nan
    )
)
print("----------")
print(folded)
>>>
   a    b
0  1  e,f
1  2    g
2  3  NaN
----------
   a    b b_unfold
0  1  e,f        e
1  1  e,f        f
2  2    g        g
3  3  NaN      NaN
:18: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
----------
a
1    e,f
2      g
3    NaN
dtype: object
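The sep and new_col parameters choose the separator and the name of the added column; a short sketch with a custom separator and made-up data (output not shown):
<<<
import pandas
from pandas_streaming.df import dataframe_unfold

df = pandas.DataFrame([dict(a=1, b="e|f"), dict(a=2, b="g")])

# Splits on "|" and names the new column explicitly.
df2 = dataframe_unfold(df, "b", new_col="b_item", sep="|")
print(df2)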
- pandas_streaming.df.dataframe_helpers.pandas_groupby_nan(df, by, axis=0, as_index=False, suffix=None, nanback=True, **kwargs)[source][source]¶
Does a groupby including keeping missing values (nan).
- Parameters:
df – dataframe
by – column or list of columns to group by
axis – passed to pandas.DataFrame.groupby
as_index – passed to pandas.DataFrame.groupby
suffix – used to build the temporary replacement for nan values
nanback – puts nan values back in the result, otherwise keeps the replacement value
kwargs – additional parameters sent to pandas.DataFrame.groupby
- Returns:
groupby results
See groupby and missing values. If no nan is detected, the function falls back to the regular pandas.DataFrame.groupby, which has the following behavior.
Group a dataframe by one column including nan values
The regular pandas.DataFrame.groupby removes every nan value from the index.
<<<
from pandas import DataFrame

data = [dict(a=2, ind="a", n=1), dict(a=2, ind="a"), dict(a=3, ind="b"), dict(a=30)]
df = DataFrame(data)
print(df)
gr = df.groupby(["ind"]).sum()
print(gr)
>>>
    a  ind    n
0   2    a  1.0
1   2    a  NaN
2   3    b  NaN
3  30  NaN  NaN
      a    n
ind
a     4  1.0
b     3  0.0
Function pandas_groupby_nan keeps them.
<<<
from pandas import DataFrame
from pandas_streaming.df import pandas_groupby_nan

data = [dict(a=2, ind="a", n=1), dict(a=2, ind="a"), dict(a=3, ind="b"), dict(a=30)]
df = DataFrame(data)
gr2 = pandas_groupby_nan(df, ["ind"]).sum()
print(gr2)
>>>
/home/xadupre/github/pandas-streaming/pandas_streaming/df/dataframe_helpers.py:394: FutureWarning: The 'axis' keyword in DataFrame.groupby is deprecated and will be removed in a future version.
  res = df.groupby(by, axis=axis, as_index=as_index, dropna=False, **kwargs)
   ind   a    n
0    a   4  1.0
1    b   3  0.0
2  NaN  30  0.0
Complex splits¶
Splitting a database into train and test is usually simple, except
if rows are not independent and share some ids. In that case,
the following functions try to build two partitions which keep
the ids separate, or as separate as possible:
train_test_apart_stratify, train_test_connex_split, train_test_split_weights.
An example follows below.
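For instance, a minimal sketch with train_test_connex_split, assuming rows sharing the same user, prod or card id must fall on the same side of the split (the column names, the dataframe content and the fail_imbalanced value are illustrative, output not shown):
<<<
import pandas
from pandas_streaming.df import train_test_connex_split

df = pandas.DataFrame(
    [
        dict(user="UA", prod="PAA", card="C1"),
        dict(user="UA", prod="PB", card="C1"),
        dict(user="UB", prod="PC", card="C2"),
        dict(user="UB", prod="PD", card="C2"),
        dict(user="UC", prod="PAA", card="C3"),
        dict(user="UC", prod="PF", card="C4"),
    ]
)

# Rows connected through a shared id (user, prod or card) end up
# in the same partition: the split works on connected components.
train, test = train_test_connex_split(
    df, groups=["user", "prod", "card"], test_size=0.5, fail_imbalanced=0.6
)
print(train)
print(test)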