module df.dataframe_helpers

Short summary

module pandas_streaming.df.dataframe_helpers

Helpers for dataframes.

source on GitHub

Functions

function – truncated documentation

  • dataframe_hash_columns – Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.

  • dataframe_shuffle – Shuffles a dataframe.

  • dataframe_unfold – One column may contain concatenated values. This function splits these values and multiplies the rows for each split …

  • hash_float – Hashes a float into a float.

  • hash_int – Hashes an integer into an integer.

  • hash_str – Hashes a string.

  • numpy_types – Returns the list of numpy available types.

  • pandas_fillna – Replaces NaN values with a value that is not NaN. Mostly used by pandas_groupby_nan().

  • pandas_groupby_nan – Does a groupby that keeps missing values (NaN).

Documentation

pandas_streaming.df.dataframe_helpers.dataframe_hash_columns(df, cols=None, hash_length=10, inplace=False)

Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.

Parameters:
  • df – dataframe

  • cols – columns to hash, or None for all columns

  • hash_length – for strings only, length of the hash

  • inplace – if True, modifies the dataframe in place

Returns:

new dataframe

This can be useful to anonymize data before making it public.

Hashes a set of columns in a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_hash_columns
df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1", ai=1),
                       dict(b="f", c=5.7, ind="a2", ai=2),
                       dict(a=4, b="g", ind="a3", ai=3),
                       dict(a=8, b="h", c=5.9, ai=4),
                       dict(a=16, b="i", c=6.2, ind="a5", ai=5)])
print(df)
print('--------------')
df2 = dataframe_hash_columns(df)
print(df2)

>>>

          a  b    c  ind  ai
    0   1.0  e  5.6   a1   1
    1   NaN  f  5.7   a2   2
    2   4.0  g  NaN   a3   3
    3   8.0  h  5.9  NaN   4
    4  16.0  i  6.2   a5   5
    --------------
                  a           b             c         ind        ai
    0  4.648669e+11  3f79bb7b43  3.355454e+11  f55ff16f66  65048080
    1           NaN  252f10c836  5.803745e+11  2c3a4249d7   1214325
    2  2.750847e+11  cd0aa98561           NaN  f46dd28a54  80131111
    3  1.940968e+11  aaa9402664  9.635096e+10         NaN  19167269
    4  1.083806e+12  de7d1b721a  3.183198e+11  66220e7159   8788782
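
The behaviour described above (hash each value, leave missing values as they are, return a new dataframe unless inplace is set) can be sketched with plain pandas. This is only an illustration, not the library's implementation: the helpers hash_text and hash_string_columns are hypothetical, the sketch handles string values only, and the use of SHA-256 is an assumption.

```python
import hashlib

import pandas


def hash_text(value, hash_length=10):
    # Assumption: SHA-256 of the UTF-8 bytes, truncated to hash_length characters.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:hash_length]


def hash_string_columns(df, cols, hash_length=10, inplace=False):
    # Hypothetical helper: hashes string values and leaves anything else
    # (missing values included) untouched.
    out = df if inplace else df.copy()
    for col in cols:
        out[col] = out[col].map(
            lambda v: hash_text(v, hash_length) if isinstance(v, str) else v)
    return out


df = pandas.DataFrame({"ind": ["a1", "a2", None, "a5"], "ai": [1, 2, 4, 5]})
hashed = hash_string_columns(df, ["ind"], hash_length=6)
print(hashed)
```

Because inplace defaults to False in this sketch, df itself is left unchanged and only the returned copy holds the hashed values.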

pandas_streaming.df.dataframe_helpers.dataframe_shuffle(df, random_state=None)

Shuffles a dataframe.

Parameters:
  • df – pandas.DataFrame

  • random_state – seed for the random number generator

Returns:

new pandas.DataFrame

Shuffles the rows of a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_shuffle

df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1"),
                       dict(a=2, b="f", c=5.7, ind="a2"),
                       dict(a=4, b="g", c=5.8, ind="a3"),
                       dict(a=8, b="h", c=5.9, ind="a4"),
                       dict(a=16, b="i", c=6.2, ind="a5")])
print(df)
print('----------')

shuffled = dataframe_shuffle(df, random_state=0)
print(shuffled)

>>>

        a  b    c ind
    0   1  e  5.6  a1
    1   2  f  5.7  a2
    2   4  g  5.8  a3
    3   8  h  5.9  a4
    4  16  i  6.2  a5
    ----------
        a  b    c ind
    2   4  g  5.8  a3
    0   1  e  5.6  a1
    1   2  f  5.7  a2
    3   8  h  5.9  a4
    4  16  i  6.2  a5
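
For comparison, plain pandas can produce a reproducible row permutation with DataFrame.sample. The snippet below sketches that alternative; it makes no claim about how dataframe_shuffle is implemented, nor that both produce the same order for a given seed.

```python
import pandas

df = pandas.DataFrame({"a": [1, 2, 4, 8, 16], "b": list("efghi")})

# frac=1 draws every row exactly once, which amounts to a permutation of the rows;
# random_state makes that permutation reproducible.
shuffled = df.sample(frac=1, random_state=0)
again = df.sample(frac=1, random_state=0)
print(shuffled)

# The original index travels with each row, so the initial order can be restored.
restored = shuffled.sort_index()
```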

pandas_streaming.df.dataframe_helpers.dataframe_unfold(df, col, new_col=None, sep=',')

One column may contain concatenated values. This function splits these values and multiplies the rows for each split value.

Parameters:
  • df – dataframe

  • col – column with the concatenated values (strings)

  • new_col – name of the new column; if None, a default name is used (b_unfold for column b in the example below)

  • sep – separator

Returns:

a new dataframe

Unfolds a column of a dataframe.

<<<

import pandas
import numpy
from pandas_streaming.df import dataframe_unfold

df = pandas.DataFrame([dict(a=1, b="e,f"),
                       dict(a=2, b="g"),
                       dict(a=3)])
print(df)
df2 = dataframe_unfold(df, "b")
print('----------')
print(df2)

# To fold:
folded = df2.groupby('a').apply(lambda row: ','.join(row['b_unfold'].dropna())
                                if len(row['b_unfold'].dropna()) > 0 else numpy.nan)
print('----------')
print(folded)

>>>

       a    b
    0  1  e,f
    1  2    g
    2  3  NaN
    ----------
       a    b b_unfold
    0  1  e,f        e
    1  1  e,f        f
    2  2    g        g
    3  3  NaN      NaN
    ----------
    a
    1    e,f
    2      g
    3    NaN
    dtype: object
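
The same split-and-multiply idea can be sketched with pandas alone, using Series.str.split followed by DataFrame.explode. The helper unfold_sketch is hypothetical, and the default column name (col followed by "_unfold") is an assumption inferred from the output above.

```python
import pandas


def unfold_sketch(df, col, new_col=None, sep=","):
    # Assumption: the default name is col + "_unfold", as the output above suggests.
    if new_col is None:
        new_col = col + "_unfold"
    out = df.copy()
    # Split each string into a list of values; missing values stay missing.
    out[new_col] = out[col].str.split(sep)
    # explode creates one row per list element and keeps the rows holding NaN.
    return out.explode(new_col).reset_index(drop=True)


df = pandas.DataFrame([dict(a=1, b="e,f"), dict(a=2, b="g"), dict(a=3)])
unfolded = unfold_sketch(df, "b")
print(unfolded)
```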

pandas_streaming.df.dataframe_helpers.hash_float(c, hash_length)

Hashes a float into a float.

Parameters:
  • c – value to hash

  • hash_length – length of the hash

Returns:

float
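
One way to hash a float into a float is sketched below. It assumes SHA-256 over the 8-byte binary representation of the value; both that choice and the name hash_float_sketch are assumptions made for illustration, so the values it returns need not match those of hash_float.

```python
import hashlib
import math
import struct


def hash_float_sketch(value, hash_length=10):
    # Missing values are returned unchanged.
    if math.isnan(value):
        return value
    # Hash the 8-byte binary representation of the float.
    digest = hashlib.sha256(struct.pack("d", value)).hexdigest()[:hash_length]
    # Read the truncated hex digest as an integer and return it as a float.
    # With hash_length <= 13 the integer is below 2**52 and converts exactly.
    return float(int(digest, 16))


hashed = hash_float_sketch(5.6)
missing = hash_float_sketch(float("nan"))
print(hashed, missing)
```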

pandas_streaming.df.dataframe_helpers.hash_int(c, hash_length)

Hashes an integer into an integer.

Parameters:
  • c – value to hash

  • hash_length – length of the hash

Returns:

int
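
Hashing an integer into an integer keeps equal keys equal, so hashed identifiers can still be matched across tables. The sketch below illustrates this under stated assumptions: hash_int_sketch is a hypothetical name, and hashing the decimal representation with SHA-256 is an arbitrary choice for the example.

```python
import hashlib


def hash_int_sketch(value, hash_length=10):
    # Assumption: hash the decimal representation of the integer with SHA-256.
    digest = hashlib.sha256(str(value).encode("ascii")).hexdigest()[:hash_length]
    # Read the truncated hex digest back as an integer so the type is preserved.
    return int(digest, 16)


customer_ids = [1, 2, 1, 4]
hashed_ids = [hash_int_sketch(i) for i in customer_ids]
print(hashed_ids)
```

The first and third entries of hashed_ids are identical because both come from the same input.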

pandas_streaming.df.dataframe_helpers.hash_str(c, hash_length)

Hashes a string.

Parameters:
  • c – value to hash

  • hash_length – length of the returned hash

Returns:

string
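
A minimal sketch of string hashing, assuming a SHA-256 hex digest of the UTF-8 bytes truncated to hash_length characters. The name hash_str_sketch is hypothetical and the choice of algorithm is an assumption, although it is consistent with the string hashes printed in the dataframe_hash_columns example (the value "e" maps to 3f79bb7b43 there).

```python
import hashlib


def hash_str_sketch(value, hash_length=10):
    # Full hexadecimal digest of the UTF-8 encoded string (64 characters).
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    # Keep only the first hash_length characters.
    return digest[:hash_length]


long_hash = hash_str_sketch("e", 10)
short_hash = hash_str_sketch("e", 4)
print(long_hash, short_hash)
```

A shorter hash_length makes the output more compact but raises the chance that two different strings share the same hash.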

pandas_streaming.df.dataframe_helpers.numpy_types()

Returns the list of numpy available types.

Returns:

list of types

pandas_streaming.df.dataframe_helpers.pandas_fillna(df, by, hasna=None, suffix=None)

Replaces NaN values with a value that is not NaN. Mostly used by pandas_groupby_nan.

Parameters:
  • df – dataframe

  • by – list of columns for which we need to replace nan

  • hasna – None or list of columns for which we need to replace NaN

  • suffix – string used to build the value that replaces NaN

Returns:

list of values chosen for each column, new dataframe (new copy)
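
The placeholder idea can be sketched as follows: pick a value that does not occur in the column, fill the missing entries with it, group, then put NaN back. The helper fill_with_placeholder and the "_nan" suffix are hypothetical, and the sketch covers a single string column only; it is not the library's implementation.

```python
import pandas


def fill_with_placeholder(df, col, suffix="_nan"):
    # Hypothetical helper: build a string that does not occur in the column.
    present = set(df[col].dropna())
    placeholder = suffix
    while placeholder in present:
        placeholder += suffix
    out = df.copy()
    out[col] = out[col].fillna(placeholder)
    return placeholder, out


df = pandas.DataFrame({"ind": ["a", "a", "b", None], "a": [2, 2, 3, 30]})
placeholder, filled = fill_with_placeholder(df, "ind")

# The placeholder now forms its own group instead of being dropped.
grouped = filled.groupby("ind", as_index=False)["a"].sum()

# Put NaN back where the placeholder was used.
grouped["ind"] = grouped["ind"].where(grouped["ind"] != placeholder)
print(grouped)
```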

pandas_streaming.df.dataframe_helpers.pandas_groupby_nan(df, by, axis=0, as_index=False, suffix=None, nanback=True, **kwargs)

Does a groupby including keeping missing values (nan).

Parameters:
  • df – dataframe

  • by – column or list of columns

  • axis – only 0 is allowed

  • as_index – should be False

  • suffix – None or a string

  • nanback – put nan back in the index, otherwise it leaves a replacement for nan. (does not work when grouping by multiple columns)

  • kwargs – other parameters sent to groupby

Returns:

groupby results

See groupby and missing values. If no NaN is detected, the function falls back on the regular pandas.DataFrame.groupby, which has the following behavior.

Group a dataframe by one column including nan values

The regular pandas.DataFrame.groupby removes every NaN value from the index.

<<<

from pandas import DataFrame

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)
print(df)
gr = df.groupby(["ind"]).sum()
print(gr)

>>>

        a  ind    n
    0   2    a  1.0
    1   2    a  NaN
    2   3    b  NaN
    3  30  NaN  NaN
         a    n
    ind        
    a    4  1.0
    b    3  0.0

Function pandas_groupby_nan keeps them.

<<<

from pandas import DataFrame
from pandas_streaming.df import pandas_groupby_nan

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)
gr2 = pandas_groupby_nan(df, ["ind"]).sum()
print(gr2)

>>>

       ind   a    n
    0    a   4  1.0
    1    b   3  0.0
    2  NaN  30  0.0
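
Since pandas 1.1, DataFrame.groupby also accepts dropna=False, which keeps NaN keys as a group of their own. The sketch below shows that built-in alternative on the same data; it does not describe how pandas_groupby_nan works internally.

```python
from pandas import DataFrame

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)

# dropna=False keeps the NaN key as a group (requires pandas >= 1.1).
gr3 = df.groupby("ind", dropna=False)["a"].sum()
print(gr3)
```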