pandas_streaming.df.dataframe_split
- pandas_streaming.df.dataframe_split.sklearn_train_test_split(self, path_or_buf=None, export_method='to_csv', names=None, **kwargs)[source]
Randomly splits a dataframe into smaller pieces. The function returns streams of file names. It relies on sklearn.model_selection.train_test_split() and does not handle the stratified version of it.
- Parameters:
self – see StreamingDataFrame
path_or_buf – a string, a list of strings or buffers; if it is a string, it must contain {}, like partition{}.txt
export_method – method used to store the partitions, by default pandas.DataFrame.to_csv()
names – partition names, by default ('train', 'test')
kwargs – parameters for the export function and sklearn.model_selection.train_test_split()
- Returns:
outputs of the export functions
The function cannot return two iterators or two StreamingDataFrame because running through one means running through the other. We assume both splits do not hold in memory, and we cannot run through the same iterator again as the random draws would differ. The results therefore have to be stored in files or buffers.
Warning
The method export_method must write the data in append mode and support streaming.
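A minimal usage sketch, not part of the original documentation: the import locations and StreamingDataFrame.read_df are assumptions made for the example. It shows the {} placeholder in path_or_buf and that kwargs are forwarded to the export function and to sklearn.model_selection.train_test_split().

import pandas
from pandas_streaming.df import StreamingDataFrame  # assumed import location
from pandas_streaming.df.dataframe_split import sklearn_train_test_split

df = pandas.DataFrame(dict(x=range(100), y=[i % 3 for i in range(100)]))
sdf = StreamingDataFrame.read_df(df)  # assumed constructor for this sketch

# path_or_buf must contain '{}' so that each partition gets its own file,
# here partition_train.txt and partition_test.txt
# (names defaults to ('train', 'test')).
# test_size is part of kwargs and is passed on to train_test_split.
outputs = sklearn_train_test_split(
    sdf, path_or_buf="partition_{}.txt",
    export_method="to_csv", test_size=0.25)
# 'outputs' collects whatever the export method returns (see Returns above);
# the data itself lands in the two partition files.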
- pandas_streaming.df.dataframe_split.sklearn_train_test_split_streaming(self, test_size=0.25, train_size=None, stratify=None, hash_size=9, unique_rows=False)[source]
Randomly splits a dataframe into smaller pieces. The function returns two StreamingDataFrame, one for the train partition and one for the test partition. It relies on sklearn.model_selection.train_test_split() and handles the stratified version of it.
- Parameters:
self – see StreamingDataFrame
test_size – ratio for the test partition (if train_size is not specified)
train_size – ratio for the train partition
stratify – column holding the stratification
hash_size – size of the hash used to cache information about the partitions
unique_rows – ensures that rows are unique
- Returns:
Two StreamingDataFrame, one for train, one for test.
The function returns two iterators or two StreamingDataFrame. It tries to do everything without writing anything to disk, but it needs to store the partition assignment somehow. The function hashes every row and maps the hash to a partition (train or test). This cache must hold in memory, otherwise the function fails. The two returned iterators must not be iterated over for the first time simultaneously: the first pass is used to build the cache. The function changes the order of rows if the parameter stratify is not null. The cache has a side effect: identical rows always end up in the same partition. If that is not what you want, you should add an index column or a random one.
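A minimal usage sketch, not part of the original documentation: the import locations, StreamingDataFrame.read_df and to_dataframe() are assumptions made for the example. It illustrates consuming the two returned StreamingDataFrame one after the other, since the first pass over either of them builds the in-memory hash cache.

import pandas
from pandas_streaming.df import StreamingDataFrame  # assumed import location
from pandas_streaming.df.dataframe_split import sklearn_train_test_split_streaming

df = pandas.DataFrame(dict(x=range(100), label=[i % 2 for i in range(100)]))
sdf = StreamingDataFrame.read_df(df)  # assumed constructor for this sketch

train_sdf, test_sdf = sklearn_train_test_split_streaming(
    sdf, test_size=0.25, stratify="label")

# Consume the streams sequentially rather than in parallel: the first
# iteration fills the cache mapping row hashes to partitions.
train_df = train_sdf.to_dataframe()  # assumed materialization method
test_df = test_sdf.to_dataframe()
print(train_df.shape, test_df.shape)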