pandas_streaming.df

Streaming

The main class is an interface which mimics the pandas.DataFrame interface to offer a short list of methods which apply to an iterator of dataframes. This effectively provides a streaming version of it. As a result, creating an instance is fast as long as the data is not processed. Iterators can be chained, as many map-reduce frameworks do.

pandas_streaming.df.StreamingDataFrame (self, iter_creation, check_schema = True, stable = True)

Defines a streaming dataframe. The goal is to reduce the memory footprint. The class takes a function which creates an iterator on dataframes. We assume this function can be called multiple times. As a matter of fact, the function is called every time the class needs to walk through the stream with the following loop…
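A minimal sketch of how an instance can be built from such a function (the generator and its sample data below are made up for illustration):

    import pandas
    from pandas_streaming.df import StreamingDataFrame

    def iter_creation():
        # called again every time the stream is walked
        yield pandas.DataFrame(dict(a=[1, 2], b=["x", "y"]))
        yield pandas.DataFrame(dict(a=[3, 4], b=["z", "w"]))

    sdf = StreamingDataFrame(iter_creation)
    for chunk in sdf:  # nothing is computed before this loop starts
        print(chunk)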

The module also implements additional useful functions which are not necessarily specific to the streaming version of the dataframes. Many methods have been rewritten to support streaming. Among them, IO methods:

pandas_streaming.df.StreamingDataFrame.read_csv (*args, **kwargs)

Reads a csv file or buffer as an iterator on DataFrame. The signature is the same as pandas.read_csv. The important parameter is chunksize, which defines the number of rows to parse in a single block. If not specified, it defaults to 100000.
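For instance, assuming a file data.csv exists (a hypothetical name), the whole file is never loaded at once:

    from pandas_streaming.df import StreamingDataFrame

    # same signature as pandas.read_csv; chunksize sets the block size
    sdf = StreamingDataFrame.read_csv("data.csv", chunksize=100000)
    for chunk in sdf:
        print(chunk.shape)  # at most 100000 rows per chunk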

pandas_streaming.df.StreamingDataFrame.read_df (df, chunksize = None, check_schema = True)

Splits a DataFrame into small chunks mostly for unit testing purposes.
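A short sketch, useful to test streaming code against a regular DataFrame (the sample data is made up):

    import pandas
    from pandas_streaming.df import StreamingDataFrame

    df = pandas.DataFrame(dict(a=[1, 2, 3, 4], b=["x", "y", "z", "w"]))
    sdf = StreamingDataFrame.read_df(df, chunksize=2)
    for chunk in sdf:
        print(chunk)  # two chunks of two rows each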

pandas_streaming.df.StreamingDataFrame.read_json (*args, chunksize = 100000, flatten = False, **kwargs)

Reads a json file or buffer as an iterator on DataFrame. The signature is the same as pandas.read_json. The important parameter is chunksize, which defines the number of rows to parse in a single block, and it must be defined to return an iterator. If lines is True, the function falls back to pandas.read_json; otherwise it uses enumerate_json_items. If lines is 'stream', enumerate_json_items is called with parameter lines=True. Parameter flatten uses the trick described at Flattening JSON objects in Python. Examples…
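A minimal sketch on a JSON Lines buffer, assuming the lines='stream' mode described above applies to such data:

    from io import BytesIO
    from pandas_streaming.df import StreamingDataFrame

    data = b'{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}'  # made-up sample
    sdf = StreamingDataFrame.read_json(BytesIO(data), lines="stream")
    for chunk in sdf:
        print(chunk)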

Data Manipulation

pandas_streaming.df.dataframe_hash_columns (df, cols = None, hash_length = 10, inplace = False)

Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.
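A short sketch (column names and values are made up for the example):

    import pandas
    from pandas_streaming.df import dataframe_hash_columns

    df = pandas.DataFrame(dict(name=["alice", "bob", None], age=[30, 40, 50]))
    # hashes every column unless cols restricts them; None values are skipped
    df2 = dataframe_hash_columns(df, cols=["name"], hash_length=10)
    print(df2)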

pandas_streaming.df.dataframe_shuffle (df, random_state = None)

Shuffles a dataframe.
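A short sketch; random_state makes the shuffle reproducible:

    import pandas
    from pandas_streaming.df import dataframe_shuffle

    df = pandas.DataFrame(dict(a=[1, 2, 3, 4], b=["x", "y", "z", "w"]))
    print(dataframe_shuffle(df, random_state=0))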

pandas_streaming.df.dataframe_unfold (df, col, new_col = None, sep = ',')

One column may contain concatenated values. This function splits these values and duplicates the row once for each split value.
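A short sketch with a comma-separated column (the data is made up):

    import pandas
    from pandas_streaming.df import dataframe_unfold

    df = pandas.DataFrame(dict(id=[1, 2], tags=["a,b", "c"]))
    # the row with id=1 is duplicated: one row per value found in 'tags'
    print(dataframe_unfold(df, "tags"))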

pandas_streaming.df.pandas_groupby_nan (df, by, axis = 0, as_index = False, suffix = None, nanback = True, **kwargs)

Does a groupby which keeps missing values (nan) instead of dropping them.
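A short sketch, assuming the returned object behaves like a regular groupby; a plain df.groupby("key") would silently drop the NaN row:

    import pandas
    from pandas_streaming.df import pandas_groupby_nan

    df = pandas.DataFrame(dict(key=["a", "b", None], value=[1, 2, 3]))
    gr = pandas_groupby_nan(df, by="key")
    print(gr.sum())  # the NaN group is kept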

Complex splits

Splitting a database into train and test is usually simple, except if rows are not independent and share some ids. In that case, the following functions will try to build two partitions keeping ids separate, or as separate as possible.

pandas_streaming.df.train_test_apart_stratify (df, group, test_size = 0.25, train_size = None, stratify = None, force = False, random_state = None, fLOG = None)

This split is for a specific case where data is linked in one way. Let's assume we have two ids, as we have for online sales: (product id, category id). A product can have multiple categories. We need to have distinct products in the train and test sets but common categories on both sides.
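A minimal sketch with made-up product/category pairs, assuming the function returns the two partitions:

    import pandas
    from pandas_streaming.df import train_test_apart_stratify

    df = pandas.DataFrame(dict(product=["p1", "p1", "p2", "p3"],
                               category=["c1", "c2", "c1", "c2"]))
    train, test = train_test_apart_stratify(
        df, group="product", stratify="category", test_size=0.5)
    # every product falls entirely in train or entirely in test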

pandas_streaming.df.train_test_connex_split (df, groups, test_size = 0.25, train_size = None, stratify = None, hash_size = 9, unique_rows = False, shuffle = True, fail_imbalanced = 0.05, keep_balance = None, stop_if_bigger = None, return_cnx = False, must_groups = None, random_state = None, fLOG = None)

This split is for a specific case where data is linked in many ways. Let's assume we have three ids, as we have for online sales: (product id, user id, card id). As we may need to compute aggregated features, we need every id not to be present in both the train and test sets. The function computes the connected components and breaks each of them into two parts, one for train and one for test.
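A minimal sketch with two linked ids (the values are made up), assuming the function returns the two partitions:

    import pandas
    from pandas_streaming.df import train_test_connex_split

    df = pandas.DataFrame(dict(user=["u1", "u1", "u2", "u3"],
                               prod=["p1", "p2", "p2", "p3"]))
    # (u1, u2, p1, p2) form one connected component, (u3, p3) another
    train, test = train_test_connex_split(df, groups=["user", "prod"],
                                          test_size=0.5)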

pandas_streaming.df.train_test_split_weights (df, weights = None, test_size = 0.25, train_size = None, shuffle = True, fail_imbalanced = 0.05, random_state = None)

Splits a database into train/test sets given weights; every row can have a different weight.
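A minimal sketch where heavier rows weigh more in the balance (the weights are made up), assuming the function returns the two partitions:

    import pandas
    from pandas_streaming.df import train_test_split_weights

    df = pandas.DataFrame(dict(x=range(10)))
    weights = [1] * 5 + [5] * 5
    train, test = train_test_split_weights(df, weights=weights, test_size=0.3)
    print(len(train), len(test))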