module df.dataframe_io_helpers

Inheritance diagram of pandas_streaming.df.dataframe_io_helpers

Short summary

module pandas_streaming.df.dataframe_io_helpers

Saves a dataframe into a zip file and reads it back.

source on GitHub

Classes

JsonIterator2Stream

Transforms an iterator on JSON items into a stream which returns an item as a string every time method …

JsonPerRowsStream

Reads a JSON stream and adds ",", "[", "]" to convert a stream containing one JSON object …

Functions

enumerate_json_items

Enumerates items from a JSON file or string.

flatten_dictionary

Flattens a dictionary with nested structure to a dictionary with no hierarchy.

Methods

__init__

__init__

__iter__

Iterates on each row. The behaviour is a bit tricky. It is implemented to be swallowed by pandas.read_json() …

getvalue

Returns the whole stream content.

read

Reads the next item and returns it as a string.

read

Reads characters, adds ",", "[", "]" if needed. So the number of read characters is not necessarily …

readline

Reads a line, adds ",", "[", "]" if needed. So the number of read characters is not necessarily the …

seek

Change the stream position to the given byte offset.

seek

Change the stream position to the given byte offset.

write

The class does not write.

Documentation

Saves a dataframe into a zip file and reads it back.

source on GitHub

class pandas_streaming.df.dataframe_io_helpers.JsonIterator2Stream(it, **kwargs)

Bases: object

Transforms an iterator on JSON items into a stream which returns an item as a string every time method read is called. The iterator could be one returned by enumerate_json_items.

Parameters:
  • it – iterator

  • kwargs – arguments to json.dumps

Reshape a JSON file

The function enumerate_json_items reads any JSON even if every record is split over multiple lines. Class JsonIterator2Stream exposes this iterator as a stream. Each row is a single item.

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items, JsonIterator2Stream

text_json = b'''
    [
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": [{
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }]
            }
        }
    },
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": [{
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }]
                }
            }
        }
    }
    ]
'''

for item in JsonIterator2Stream(lambda: enumerate_json_items(text_json)):
    print(item)

>>>

    {"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":[{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}]}}}
    {"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":[{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}]}}}}

Changed in version 0.3: The class takes a function which outputs an iterator and not an iterator. JsonIterator2Stream(enumerate_json_items(text_json)) needs to be rewritten into JsonIterator2Stream(lambda: enumerate_json_items(text_json)).

source on GitHub

__init__(it, **kwargs)
__iter__()

Iterates on each row. The behaviour is a bit tricky. It is implemented to be swallowed by pandas.read_json() which uses itertools.islice() to go through the items. It calls __iter__ multiple times but expects the iterator to continue from where it stopped the last time.

source on GitHub
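
The following sketch illustrates that resuming behaviour. It is an assumption-laden example: the items and the lambda are made up, and it assumes that a second iteration continues with the items not consumed by itertools.islice, as the paragraph above describes.

<<<

import itertools
from pandas_streaming.df.dataframe_io_helpers import JsonIterator2Stream

# hypothetical data; since version 0.3 the constructor takes a function
# returning an iterator, not the iterator itself
stream = JsonIterator2Stream(lambda: iter([{"id": 1}, {"id": 2}, {"id": 3}]))

# consume the first two items the way pandas.read_json does with islice
first_two = list(itertools.islice(stream, 2))

# if the class resumes as described above, this second iteration
# should yield only the remaining third item
remaining = list(stream)
print(first_two)
print(remaining)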

read()

Reads the next item and returns it as a string.

source on GitHub
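
For instance, assuming the next item is serialized with json.dumps and the constructor kwargs are forwarded to it, two consecutive calls would return the two items in order (the data below is hypothetical):

<<<

from pandas_streaming.df.dataframe_io_helpers import JsonIterator2Stream

st = JsonIterator2Stream(lambda: iter([{"a": 1}, {"b": 2}]), sort_keys=True)
print(st.read())  # expected to be the first item as a JSON string
print(st.read())  # expected to be the second item as a JSON string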

seek(offset)

Change the stream position to the given byte offset.

Parameters:

offset – offset, only 0 is implemented

source on GitHub

write()

The class does not write.

source on GitHub

class pandas_streaming.df.dataframe_io_helpers.JsonPerRowsStream(st)

Bases: object

Reads a JSON stream and adds ",", "[", "]" to convert a stream containing one JSON object per row into a single JSON object. It only implements method readline.

Parameters:

st – stream

source on GitHub
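
A minimal sketch of the intended use, under the assumption that read returns the remaining content with the added brackets and commas so that it parses as one JSON array (the input data is made up):

<<<

from io import StringIO
from pandas_streaming.df.dataframe_io_helpers import JsonPerRowsStream

# one JSON object per row (JSON Lines), hypothetical data
rows = StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
st = JsonPerRowsStream(rows)

# read() is documented to add ",", "[", "]" when needed, so the result
# should parse as a JSON array holding the two records
text = st.read()
print(text)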

__init__(st)
getvalue()

Returns the whole stream content.

source on GitHub

read(size=-1)

Reads characters, adds ",", "[", "]" if needed. So the number of read characters is not necessarily the requested one but could be greater.

source on GitHub

readline(size=-1)

Reads a line, adds ",", "[", "]" if needed. So the number of read characters is not necessarily the requested one but could be greater.

source on GitHub

seek(offset)

Change the stream position to the given byte offset.

Parameters:

offset – offset, only 0 is implemented

source on GitHub

pandas_streaming.df.dataframe_io_helpers.enumerate_json_items(filename, encoding=None, lines=False, flatten=False, fLOG=None)

Enumerates items from a JSON file or string.

Parameters:
  • filename – filename or string or stream to parse

  • encoding – encoding

  • lines – one record per row

  • flatten – call flatten_dictionary

  • fLOG – logging function

Returns:

iterator on records at first level.

It assumes the syntax follows the format: [ {"id": 1, ...}, {"id": 2, ...}, ...]. However, if option lines is true, the function considers that the stream or file has one record per row as follows:

{"id": 1, ...}
{"id": 2, ...}
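
A short sketch of that case, assuming a bytes string holding one record per row is accepted the same way as the array form used in the example below (the records are made up):

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items

# hypothetical JSON Lines input, one record per row, trailing newline kept
text_lines = b'''{"id": 1, "value": "a"}
{"id": 2, "value": "b"}
'''

for record in enumerate_json_items(text_lines, lines=True):
    print(record)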

Processes a JSON file by streaming.

The module ijson can read a JSON file by streaming. This module is needed because a record can be written over multiple lines. This function leverages it and produces the following results.

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items

text_json = b'''
    [
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": [{
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }]
            }
        }
    },
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": [{
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }]
                }
            }
        }
    }
    ]
'''

for item in enumerate_json_items(text_json):
    print(item)

>>>

    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': [{'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}]}}}
    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': [{'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}]}}}}

The parsed JSON must have an empty line at the end, otherwise the following exception is raised: ijson.common.IncompleteJSONError: parse error: unallowed token at this point in JSON text.

source on GitHub

pandas_streaming.df.dataframe_io_helpers.flatten_dictionary(dico, sep='_')

Flattens a dictionary with nested structure to a dictionary with no hierarchy.

Parameters:
  • dico – dictionary to flatten

  • sep – string to separate dictionary keys by

Returns:

flattened dictionary

Inspired by flatten_json.
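
A small sketch, assuming nested keys are concatenated with the separator (the input dictionary is made up):

<<<

from pandas_streaming.df.dataframe_io_helpers import flatten_dictionary

nested = {"glossary": {"title": "example glossary", "GlossDiv": {"title": "S"}}}
# assuming keys are joined with sep (default '_'), this should produce
# something like {'glossary_title': 'example glossary', 'glossary_GlossDiv_title': 'S'}
print(flatten_dictionary(nested))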

source on GitHub