# First steps with pandas_streaming

This notebook shows a few differences between [pandas](http://pandas.pydata.org/) and *pandas_streaming*.

In [1]:
from jyquickhelper import add_notebook_menu
add_notebook_menu()

## pandas to pandas_streaming

In [2]:
from pandas import DataFrame
df = DataFrame(data=dict(X=[4.5, 6, 7], Y=["a", "b", "c"]))
df

Unnamed: 0,X,Y
0,4.5,a
1,6.0,b
2,7.0,c


We create a streaming dataframe:

In [3]:
from pandas_streaming.df import StreamingDataFrame
sdf = StreamingDataFrame.read_df(df)
sdf

<pandas_streaming.df.dataframe.StreamingDataFrame at 0x15c2c606160>

In [4]:
sdf.to_dataframe()

Unnamed: 0,X,Y
0,4.5,a
1,6.0,b
2,7.0,c


Internally, StreamingDataFrame implements an iterator on dataframes and tries to replicate the interface of [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) wherever it is possible to manipulate data without loading everything into memory.
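
The idea of an iterator on dataframes can be sketched with plain pandas. The class `ChunkedFrame` below is a hypothetical illustration, not the actual implementation of StreamingDataFrame. It assumes the data is exposed as a function returning a fresh iterator of small dataframes, and that only an explicit call such as `to_dataframe` loads everything into memory.

```python
import pandas as pd


class ChunkedFrame:
    """Hypothetical stand-in for a streaming dataframe (not the library's class)."""

    def __init__(self, make_iter):
        # make_iter is a function returning a fresh iterator of DataFrames,
        # so the data can be traversed more than once.
        self.make_iter = make_iter

    @classmethod
    def from_df(cls, df, chunksize=2):
        # Slice an in-memory dataframe into chunks of at most `chunksize` rows.
        def make_iter():
            for start in range(0, len(df), chunksize):
                yield df.iloc[start:start + chunksize]
        return cls(make_iter)

    def __iter__(self):
        return self.make_iter()

    def to_dataframe(self):
        # Only this step materializes all the chunks in memory at once.
        return pd.concat(list(self), axis=0)


df = pd.DataFrame(dict(X=[4.5, 6, 7], Y=["a", "b", "c"]))
cf = ChunkedFrame.from_df(df, chunksize=2)
chunk_sizes = [len(chunk) for chunk in cf]
```

Any operation that can be applied chunk by chunk (filtering, adding a column, concatenating two streams) fits this design without holding the full dataset in memory.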

In [5]:
sdf2 = sdf.concat(sdf)
sdf2.to_dataframe()

Unnamed: 0,X,Y
0,4.5,a
1,6.0,b
2,7.0,c
0,4.5,a
1,6.0,b
2,7.0,c


In [6]:
m = DataFrame(dict(Y=["a", "b"], Z=[10, 20]))
m

Unnamed: 0,Y,Z
0,a,10
1,b,20


In [7]:
sdf3 = sdf2.merge(m, left_on="Y", right_on="Y", how="outer")
sdf3.to_dataframe()

Unnamed: 0,X,Y,Z
0,4.5,a,10.0
1,6.0,b,20.0
2,7.0,c,
0,4.5,a,10.0
1,6.0,b,20.0
2,7.0,c,


In [8]:
sdf2.to_dataframe().merge(m, left_on="Y", right_on="Y", how="outer")

Unnamed: 0,X,Y,Z
0,4.5,a,10.0
1,4.5,a,10.0
2,6.0,b,20.0
3,6.0,b,20.0
4,7.0,c,
5,7.0,c,


The rows are the same as those returned by the streaming merge above, but their order and their index might be different.
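
The difference in order can be reproduced with plain pandas. The sketch below assumes the streaming merge is applied chunk by chunk (here, chunks of three rows, an arbitrary choice) and compares it with a merge on the whole dataframe. The helper `normalize` is hypothetical and only sorts the rows so that the two results can be compared regardless of order.

```python
import pandas as pd

left = pd.DataFrame(dict(X=[4.5, 6, 7, 4.5, 6, 7],
                         Y=["a", "b", "c", "a", "b", "c"]))
m = pd.DataFrame(dict(Y=["a", "b"], Z=[10, 20]))

# Merge chunk by chunk (3 rows at a time), then concatenate the pieces.
chunks = [left.iloc[i:i + 3] for i in range(0, len(left), 3)]
by_chunk = pd.concat([c.merge(m, on="Y", how="outer") for c in chunks])

# Merge the whole dataframe at once.
at_once = left.merge(m, on="Y", how="outer")


def normalize(df):
    # Sort on all columns and drop the index so that only the content is compared.
    return df.sort_values(["X", "Y", "Z"]).reset_index(drop=True)


same_rows = normalize(by_chunk).equals(normalize(at_once))
```

Note that a chunkwise outer merge only matches the global one when every key of `m` appears in every chunk, which is the case in this small example.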

In [9]:
sdftr, sdfte = sdf2.train_test_split(test_size=0.5)
sdfte.head()

Unnamed: 0,X,Y
0,4.5,a
1,4.5,a


In [10]:
sdftr.head()

Unnamed: 0,X,Y
0,6.0,b
1,7.0,c
2,6.0,b
0,7.0,c


## split a big file

In [11]:
sdf2.to_csv("example.txt")

'example.txt'

In [12]:
new_sdf = StreamingDataFrame.read_csv("example.txt")
new_sdf.train_test_split("example.{}.txt", streaming=False)

['example.train.txt', 'example.test.txt']

In [13]:
import glob
glob.glob("ex*.txt")

['example.test.txt', 'example.train.txt', 'example.txt']
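
Splitting a file without loading it entirely can be sketched with `pandas.read_csv` and its `chunksize` parameter. The function `split_csv` below is a hypothetical helper, not the code behind `train_test_split`. It assumes each row is sent independently to the test file with probability `test_size`, so the proportions are only approximately respected.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def split_csv(path, train_path, test_path, test_size=0.25, chunksize=1000, seed=0):
    """Hypothetical helper: route each row of a CSV file to a train or a test file."""
    rng = np.random.default_rng(seed)
    first = True
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Draw one random number per row; rows below test_size go to the test file.
        is_test = rng.random(len(chunk)) < test_size
        mode = "w" if first else "a"
        # The header is written once, with the first chunk.
        chunk[~is_test].to_csv(train_path, mode=mode, header=first, index=False)
        chunk[is_test].to_csv(test_path, mode=mode, header=first, index=False)
        first = False


# Small demonstration in a temporary folder.
folder = tempfile.mkdtemp()
source = os.path.join(folder, "example.txt")
pd.DataFrame(dict(X=range(100), Y=["a", "b"] * 50)).to_csv(source, index=False)

train_file = os.path.join(folder, "example.train.txt")
test_file = os.path.join(folder, "example.test.txt")
split_csv(source, train_file, test_file, test_size=0.25, chunksize=10)

n_train = len(pd.read_csv(train_file))
n_test = len(pd.read_csv(test_file))
```

Only one chunk is held in memory at a time, so the same function applies to a file much larger than the available memory.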