pandas_streaming.df.dataframe_split
- pandas_streaming.df.dataframe_split.sklearn_train_test_split(self, path_or_buf=None, export_method='to_csv', names=None, **kwargs)[source]
Randomly splits a dataframe into smaller pieces. The function returns streams of file names. It relies on sklearn.model_selection.train_test_split() and does not handle the stratified version of it.
- Parameters:
self – see StreamingDataFrame
path_or_buf – a string, a list of strings or buffers; if it is a string, it must contain {}, like partition{}.txt
export_method – method used to store the partitions, by default pandas.DataFrame.to_csv()
names – partition names, by default ('train', 'test')
kwargs – parameters for the export function and sklearn.model_selection.train_test_split()
- Returns:
outputs of the export functions
The function cannot return two iterators or two StreamingDataFrame because running through one means running through the other. We assume both splits do not hold in memory, and we cannot run through the same iterator again as the random draws would differ. The results therefore have to be stored in files or buffers.
Warning
The method export_method must write the data in append mode and support streaming.
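A minimal usage sketch, not part of the original documentation: the import locations and StreamingDataFrame.read_df are assumptions made for the example. It shows the {} placeholder in path_or_buf and that kwargs are forwarded to the export function and to sklearn.model_selection.train_test_split().

import pandas
from pandas_streaming.df import StreamingDataFrame  # assumed import location
from pandas_streaming.df.dataframe_split import sklearn_train_test_split

df = pandas.DataFrame(dict(x=range(100), y=[i % 3 for i in range(100)]))
sdf = StreamingDataFrame.read_df(df)  # assumed constructor for this sketch

# path_or_buf must contain '{}' so that each partition gets its own file,
# here partition_train.txt and partition_test.txt
# (names defaults to ('train', 'test')).
# test_size is part of kwargs and is passed on to train_test_split.
outputs = sklearn_train_test_split(
    sdf, path_or_buf="partition_{}.txt",
    export_method="to_csv", test_size=0.25)
# 'outputs' collects whatever the export method returns (see Returns above);
# the data itself lands in the two partition files.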
- pandas_streaming.df.dataframe_split.sklearn_train_test_split_streaming(self, test_size=0.25, train_size=None, stratify=None, hash_size=9, unique_rows=False)[source]
Randomly splits a dataframe into smaller pieces. The function returns two StreamingDataFrame, one for the train partition and one for the test partition. It relies on sklearn.model_selection.train_test_split() and handles the stratified version of it.
- Parameters:
self – see StreamingDataFrame
test_size – ratio for the test partition (if train_size is not specified)
train_size – ratio for the train partition
stratify – column holding the stratification
hash_size – size of the hash used to cache information about the partitions
unique_rows – ensures that rows are unique
- Returns:
Two StreamingDataFrame, one for train, one for test.
The function returns two iterators or two StreamingDataFrame. It tries to do everything without writing anything to disk, but it needs to store the partition assignment somehow. The function hashes every row and maps the hash to a partition (train or test). This cache must hold in memory, otherwise the function fails. The two returned iterators must not be iterated over for the first time simultaneously: the first pass is used to build the cache. The function changes the order of rows if the parameter stratify is not null. The cache has a side effect: identical rows always end up in the same partition. If that is not what you want, you should add an index column or a random one.
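A minimal usage sketch, not part of the original documentation: the import locations, StreamingDataFrame.read_df and to_dataframe() are assumptions made for the example. It illustrates consuming the two returned StreamingDataFrame one after the other, since the first pass over either of them builds the in-memory hash cache.

import pandas
from pandas_streaming.df import StreamingDataFrame  # assumed import location
from pandas_streaming.df.dataframe_split import sklearn_train_test_split_streaming

df = pandas.DataFrame(dict(x=range(100), label=[i % 2 for i in range(100)]))
sdf = StreamingDataFrame.read_df(df)  # assumed constructor for this sketch

train_sdf, test_sdf = sklearn_train_test_split_streaming(
    sdf, test_size=0.25, stratify="label")

# Consume the streams sequentially rather than in parallel: the first
# iteration fills the cache mapping row hashes to partitions.
train_df = train_sdf.to_dataframe()  # assumed materialization method
test_df = test_sdf.to_dataframe()
print(train_df.shape, test_df.shape)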