module df.dataframe_helpers
Short summary
module pandas_streaming.df.dataframe_helpers
Helpers for dataframes.
Functions
| function | truncated documentation |
|---|---|
| dataframe_hash_columns | Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values. |
| dataframe_shuffle | Shuffles a dataframe. |
| dataframe_unfold | One column may contain concatenated values. This function splits these values and multiplies the rows for each split … |
| hash_float | Hashes a float into a float. |
| hash_int | Hashes an integer into an integer. |
| hash_str | Hashes a string. |
| numpy_types | Returns the list of numpy available types. |
| pandas_fillna | Replaces the nan values for something not nan. Mostly used by pandas_groupby_nan. |
| pandas_groupby_nan | Does a groupby including keeping missing values (nan). |
Documentation
Helpers for dataframes.
- pandas_streaming.df.dataframe_helpers.dataframe_hash_columns(df, cols=None, hash_length=10, inplace=False)
Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.
- Parameters:
df – dataframe
cols – columns to hash or None for all.
hash_length – for strings only, length of the hash
inplace – modifies inplace
- Returns:
new dataframe
This can be useful to anonymize data before making it public.
Hashes a set of columns in a dataframe
<<<
import pandas
from pandas_streaming.df import dataframe_hash_columns

df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1", ai=1),
                       dict(b="f", c=5.7, ind="a2", ai=2),
                       dict(a=4, b="g", ind="a3", ai=3),
                       dict(a=8, b="h", c=5.9, ai=4),
                       dict(a=16, b="i", c=6.2, ind="a5", ai=5)])
print(df)
print('--------------')
df2 = dataframe_hash_columns(df)
print(df2)
>>>
      a  b    c  ind  ai
0   1.0  e  5.6   a1   1
1   NaN  f  5.7   a2   2
2   4.0  g  NaN   a3   3
3   8.0  h  5.9  NaN   4
4  16.0  i  6.2   a5   5
--------------
              a           b             c         ind        ai
0  4.648669e+11  3f79bb7b43  3.355454e+11  f55ff16f66  65048080
1           NaN  252f10c836  5.803745e+11  2c3a4249d7   1214325
2  2.750847e+11  cd0aa98561           NaN  f46dd28a54  80131111
3  1.940968e+11  aaa9402664  9.635096e+10         NaN  19167269
4  1.083806e+12  de7d1b721a  3.183198e+11  66220e7159   8788782
- pandas_streaming.df.dataframe_helpers.dataframe_shuffle(df, random_state=None)
Shuffles a dataframe.
- Parameters:
df – pandas.DataFrame
random_state – seed
- Returns:
new pandas.DataFrame
Shuffles the rows of a dataframe
<<<
import pandas
from pandas_streaming.df import dataframe_shuffle

df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1"),
                       dict(a=2, b="f", c=5.7, ind="a2"),
                       dict(a=4, b="g", c=5.8, ind="a3"),
                       dict(a=8, b="h", c=5.9, ind="a4"),
                       dict(a=16, b="i", c=6.2, ind="a5")])
print(df)
print('----------')
shuffled = dataframe_shuffle(df, random_state=0)
print(shuffled)
>>>
    a  b    c ind
0   1  e  5.6  a1
1   2  f  5.7  a2
2   4  g  5.8  a3
3   8  h  5.9  a4
4  16  i  6.2  a5
----------
    a  b    c ind
2   4  g  5.8  a3
0   1  e  5.6  a1
1   2  f  5.7  a2
3   8  h  5.9  a4
4  16  i  6.2  a5
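For comparison, a reproducible row shuffle can also be written with plain pandas. The sketch below uses `DataFrame.sample(frac=1, random_state=...)`; it is not a description of how `dataframe_shuffle` is implemented, and the row order it produces is not guaranteed to match the output above.

```python
import pandas

df = pandas.DataFrame(dict(a=[1, 2, 4, 8, 16], b=list("efghi")))

# frac=1 draws every row exactly once, in random order;
# random_state makes the permutation reproducible.
shuffled = df.sample(frac=1, random_state=0)
print(shuffled)
```

The original index labels are kept, so `shuffled.sort_index()` restores the initial order.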
- pandas_streaming.df.dataframe_helpers.dataframe_unfold(df, col, new_col=None, sep=',')
One column may contain concatenated values. This function splits these values and multiplies the rows for each split value.
- Parameters:
df – dataframe
col – column with the concatenated values (strings)
new_col – new column name; if None, a default name is used (b_unfold for column b in the example below)
sep – separator
- Returns:
a new dataframe
Unfolds a column of a dataframe.
<<<
import pandas
import numpy
from pandas_streaming.df import dataframe_unfold

df = pandas.DataFrame([dict(a=1, b="e,f"),
                       dict(a=2, b="g"),
                       dict(a=3)])
print(df)
df2 = dataframe_unfold(df, "b")
print('----------')
print(df2)

# To fold:
folded = df2.groupby('a').apply(
    lambda row: ','.join(row['b_unfold'].dropna())
    if len(row['b_unfold'].dropna()) > 0 else numpy.nan)
print('----------')
print(folded)
>>>
   a    b
0  1  e,f
1  2    g
2  3  NaN
----------
   a    b b_unfold
0  1  e,f        e
1  1  e,f        f
2  2    g        g
3  3  NaN      NaN
----------
a
1    e,f
2      g
3    NaN
dtype: object
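The same split-and-multiply idea can be sketched with plain pandas, using `Series.str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later). This is an illustrative equivalent, not the implementation of `dataframe_unfold`; the column name `b_unfold` is chosen here only to mirror the example above.

```python
import pandas

df = pandas.DataFrame([dict(a=1, b="e,f"), dict(a=2, b="g"), dict(a=3)])

# Split each string on the separator; missing values stay NaN.
split = df.assign(b_unfold=df["b"].str.split(","))

# explode creates one row per list element and keeps NaN rows as they are.
unfolded = split.explode("b_unfold").reset_index(drop=True)
print(unfolded)
```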
- pandas_streaming.df.dataframe_helpers.hash_float(c, hash_length)
Hashes a float into a float.
- Parameters:
c – value to hash
hash_length – hash_length
- Returns:
float
- pandas_streaming.df.dataframe_helpers.hash_int(c, hash_length)
Hashes an integer into an integer.
- Parameters:
c – value to hash
hash_length – hash_length
- Returns:
int
- pandas_streaming.df.dataframe_helpers.hash_str(c, hash_length)
Hashes a string.
- Parameters:
c – value to hash
hash_length – length of the hash
- Returns:
string
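To illustrate what hashing a string to a fixed length can look like, the sketch below truncates a SHA-256 hex digest to `hash_length` characters. The helper name `sketch_hash_str` and the choice of SHA-256 with UTF-8 encoding are assumptions made for this example; the library's own `hash_str` may differ in its details.

```python
import hashlib


def sketch_hash_str(value, hash_length=10):
    """Hypothetical helper: truncated SHA-256 hex digest of a string."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    # Keep only the first hash_length hexadecimal characters.
    return digest[:hash_length]


print(sketch_hash_str("e"))
```

Under these assumptions, `sketch_hash_str("e")` gives `3f79bb7b43`, the value shown for `b="e"` in the `dataframe_hash_columns` example.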
- pandas_streaming.df.dataframe_helpers.numpy_types()
Returns the list of numpy available types.
- Returns:
list of types
- pandas_streaming.df.dataframe_helpers.pandas_fillna(df, by, hasna=None, suffix=None)
Replaces the nan values for something not nan. Mostly used by pandas_groupby_nan.
- Parameters:
df – dataframe
by – list of columns for which we need to replace nan
hasna – None or list of columns for which we need to replace NaN
suffix – string used to build the value that replaces NaN
- Returns:
list of values chosen for each column, new dataframe (new copy)
- pandas_streaming.df.dataframe_helpers.pandas_groupby_nan(df, by, axis=0, as_index=False, suffix=None, nanback=True, **kwargs)
Does a groupby including keeping missing values (nan).
- Parameters:
- Returns:
groupby results
See groupby and missing values. If no nan is detected, the function falls back to the regular pandas.DataFrame.groupby, which has the following behavior.
Group a dataframe by one column including nan values
The regular pandas.DataFrame.groupby removes every nan value from the index.
<<<
from pandas import DataFrame

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)
print(df)
gr = df.groupby(["ind"]).sum()
print(gr)
>>>
    a  ind    n
0   2    a  1.0
1   2    a  NaN
2   3    b  NaN
3  30  NaN  NaN
     a    n
ind
a    4  1.0
b    3  0.0
Function pandas_groupby_nan keeps them.
<<<
from pandas import DataFrame
from pandas_streaming.df import pandas_groupby_nan

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)
gr2 = pandas_groupby_nan(df, ["ind"]).sum()
print(gr2)
>>>
   ind   a    n
0    a   4  1.0
1    b   3  0.0
2  NaN  30  0.0
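Recent pandas versions (1.1 and later) also accept a `dropna` parameter on `DataFrame.groupby`, which keeps nan keys as their own group. The sketch below shows that built-in option on the same data; it is an alternative for comparison and says nothing about how `pandas_groupby_nan` works internally.

```python
from pandas import DataFrame

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)

# dropna=False keeps rows whose key is NaN and groups them together.
gr3 = df.groupby("ind", dropna=False).sum().reset_index()
print(gr3)
```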