module pandas_streaming.df.dataframe_split

Implements different methods to split a dataframe.

Functions

  • sklearn_train_test_split – Randomly splits a dataframe into smaller pieces. The function returns streams of file names. The function relies …

  • sklearn_train_test_split_streaming – Randomly splits a dataframe into smaller pieces. The function returns streams of file names. The function relies …
Documentation


pandas_streaming.df.dataframe_split.sklearn_train_test_split(self, path_or_buf=None, export_method='to_csv', names=None, **kwargs)

Randomly splits a dataframe into smaller pieces. The function returns streams of file names. The function relies on sklearn.model_selection.train_test_split. It does not handle the stratified version of it.

Parameters:
Returns:

the outputs of the export functions

The function cannot return two iterators or two StreamingDataFrame because running through one means running through the other. Neither split is assumed to fit in memory, and we cannot run through the same iterator twice as the random draws would differ. We therefore need to store the results in files or buffers.

Warning

The method export_method must write the data in append mode and support streaming.

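A minimal sketch of the idea behind this function, using plain pandas and scikit-learn rather than the library's own implementation: each chunk is split with sklearn.model_selection.train_test_split and appended to a train and a test destination, which is why export_method must support append mode. The buffer names below are illustrative.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Two in-memory buffers standing in for the train/test files.
train_buf, test_buf = io.StringIO(), io.StringIO()

df = pd.DataFrame({"x": range(100), "y": [i % 3 for i in range(100)]})

# Process the dataframe chunk by chunk, as a streaming dataframe would.
for start in range(0, len(df), 25):
    chunk = df.iloc[start:start + 25]
    tr, te = train_test_split(chunk, test_size=0.25, random_state=0)
    # Append each piece; the header is written only for the first chunk.
    tr.to_csv(train_buf, index=False, header=(start == 0))
    te.to_csv(test_buf, index=False, header=(start == 0))

train_df = pd.read_csv(io.StringIO(train_buf.getvalue()))
test_df = pd.read_csv(io.StringIO(test_buf.getvalue()))
```

Every row ends up in exactly one of the two outputs, and neither split ever needs to hold the full dataframe in memory.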

pandas_streaming.df.dataframe_split.sklearn_train_test_split_streaming(self, test_size=0.25, train_size=None, stratify=None, hash_size=9, unique_rows=False)

Randomly splits a dataframe into smaller pieces. The function returns streams of file names. The function relies on sklearn.model_selection.train_test_split. It handles the stratified version of it.

Parameters:
  • self – StreamingDataFrame

  • test_size – ratio for the test partition (if train_size is not specified)

  • train_size – ratio for the train partition

  • stratify – column holding the stratification

  • hash_size – size of the hash to cache information about partition

  • unique_rows – ensures that rows are unique

Returns:

Two StreamingDataFrame, one for train, one for test.

The function returns two iterators or two StreamingDataFrame. It tries to do everything without writing anything to disk, but it must store the partition assignment somehow. The function hashes every row and maps each hash to a partition (train or test). This cache must hold in memory, otherwise the function fails. The two returned iterators must not be iterated for the first time simultaneously, as the first pass builds the cache. The function changes the order of rows if the parameter stratify is not null. The cache has a side effect: every exact copy of a row is put in the same partition. If that is not what you want, you should add an index column or a random one.
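The cache behaviour described above can be sketched as follows. This is an illustration with hypothetical names, not the library's code: each row is hashed, the hash is truncated to hash_size characters, and the partition drawn for that hash is remembered, so identical rows always land in the same partition.

```python
import hashlib
import random

import pandas as pd


def assign_partitions(df, test_size=0.25, hash_size=9, seed=0):
    """Map every row to 'train' or 'test' via a hash cache (illustrative)."""
    rng = random.Random(seed)
    cache = {}  # truncated row hash -> partition name
    parts = []
    for row in df.itertuples(index=False):
        h = hashlib.sha1(repr(row).encode()).hexdigest()[:hash_size]
        if h not in cache:
            # Draw the partition once per distinct row content.
            cache[h] = "test" if rng.random() < test_size else "train"
        parts.append(cache[h])
    return parts


df = pd.DataFrame({"x": [0, 1, 0, 2], "y": ["a", "b", "a", "c"]})
parts = assign_partitions(df)
# Identical rows (positions 0 and 2) share a hash, hence a partition.
```

This also shows why the cache must fit in memory: it holds one entry per distinct row, and why adding an index or random column breaks the tie between duplicate rows.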
