module df.dataframe_io_helpers

Inheritance diagram of pandas_streaming.df.dataframe_io_helpers

Short summary

module pandas_streaming.df.dataframe_io_helpers

Saves a dataframe into a zip file and reads it back.

source on GitHub

Classes

JsonIterator2Stream

Transforms an iterator on JSON items into a stream which returns an item as a string every time method …

JsonPerRowsStream

Reads a JSON stream and adds ",", "[", "]" to convert a stream containing one JSON object …

Functions

enumerate_json_items

Enumerates items from a JSON file or string.

flatten_dictionary

Flattens a dictionary with nested structure to a dictionary with no hierarchy.

Methods

__init__

__init__

__iter__

Iterates on each row. The behaviour is a bit tricky. It is implemented to be swallowed by pandas.read_json() …

getvalue

Returns the whole stream content.

read

Reads the next item and returns it as a string.

read

Reads characters, adds ",", "[", "]" if needed. So the number of read characters is not necessarily …

readline

Reads a line, adds ",", "[", "]" if needed. So the number of read characters is not necessarily the …

seek

Change the stream position to the given byte offset.

seek

Change the stream position to the given byte offset.

write

The class does not write.

Documentation

Saves a dataframe into a zip file and reads it back.

source on GitHub

class pandas_streaming.df.dataframe_io_helpers.JsonIterator2Stream(it, **kwargs)

Bases: object

Transforms an iterator on JSON items into a stream which returns an item as a string every time method read is called. The iterator could be one returned by enumerate_json_items.

Parameters:
  • it – iterator

  • kwargs – arguments to json.dumps

Reshape a JSON file

The function enumerate_json_items reads any JSON even if every record is split over multiple lines. Class JsonIterator2Stream exposes this iterator as a stream. Each row is a single item.

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items, JsonIterator2Stream

text_json = b'''
    [
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": [{
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }]
            }
        }
    },
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": [{
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }]
                }
            }
        }
    }
    ]
'''

for item in JsonIterator2Stream(lambda: enumerate_json_items(text_json)):
    print(item)

>>>

    {"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":[{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}]}}}
    {"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":[{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}]}}}}

Changed in version 0.3: The class takes a function which outputs an iterator and not an iterator. JsonIterator2Stream(enumerate_json_items(text_json)) needs to be rewritten into JsonIterator2Stream(lambda: enumerate_json_items(text_json)).

source on GitHub

__init__(it, **kwargs)
__iter__()

Iterates on each row. The behaviour is a bit tricky. It is implemented to be swallowed by pandas.read_json() which uses itertools.islice() to go through the items. It calls __iter__ multiple times but expects the iterator to continue from where it stopped the last time.

source on GitHub
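
The following sketch illustrates that resuming behaviour. It is an assumption-laden example: the items and the lambda are made up, and it assumes that a second iteration continues with the items not consumed by itertools.islice, as the paragraph above describes.

<<<

import itertools
from pandas_streaming.df.dataframe_io_helpers import JsonIterator2Stream

# hypothetical data; since version 0.3 the constructor takes a function
# returning an iterator, not the iterator itself
stream = JsonIterator2Stream(lambda: iter([{"id": 1}, {"id": 2}, {"id": 3}]))

# consume the first two items the way pandas.read_json does with islice
first_two = list(itertools.islice(stream, 2))

# if the class resumes as described above, this second iteration
# should yield only the remaining third item
remaining = list(stream)
print(first_two)
print(remaining)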

read()

Reads the next item and returns it as a string.

source on GitHub
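
For instance, assuming the next item is serialized with json.dumps and the constructor kwargs are forwarded to it, two consecutive calls would return the two items in order (the data below is hypothetical):

<<<

from pandas_streaming.df.dataframe_io_helpers import JsonIterator2Stream

st = JsonIterator2Stream(lambda: iter([{"a": 1}, {"b": 2}]), sort_keys=True)
print(st.read())  # expected to be the first item as a JSON string
print(st.read())  # expected to be the second item as a JSON string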

seek(offset)

Change the stream position to the given byte offset.

Parameters:

offset – offset, only 0 is implemented

source on GitHub

write()

The class does not write.

source on GitHub

class pandas_streaming.df.dataframe_io_helpers.JsonPerRowsStream(st)

Bases: object

Reads a JSON stream and adds ",", "[", "]" to convert a stream containing one JSON object per row into a single JSON object. It only implements method readline.

Parameters:

st – stream

source on GitHub
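
A minimal sketch of the intended use, under the assumption that read returns the remaining content with the added brackets and commas so that it parses as one JSON array (the input data is made up):

<<<

from io import StringIO
from pandas_streaming.df.dataframe_io_helpers import JsonPerRowsStream

# one JSON object per row (JSON Lines), hypothetical data
rows = StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
st = JsonPerRowsStream(rows)

# read() is documented to add ",", "[", "]" when needed, so the result
# should parse as a JSON array holding the two records
text = st.read()
print(text)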

__init__(st)
getvalue()

Returns the whole stream content.

source on GitHub

read(size=-1)

Reads characters, adds ",", "[", "]" if needed. So the number of read characters is not necessarily the requested one but could be greater.

source on GitHub

readline(size=-1)

Reads a line, adds ",", "[", "]" if needed. So the number of read characters is not necessarily the requested one but could be greater.

source on GitHub

seek(offset)

Change the stream position to the given byte offset.

Parameters:

offset – offset, only 0 is implemented

source on GitHub

pandas_streaming.df.dataframe_io_helpers.enumerate_json_items(filename, encoding=None, lines=False, flatten=False, fLOG=None)

Enumerates items from a JSON file or string.

Parameters:
  • filename – filename or string or stream to parse

  • encoding – encoding

  • lines – one record per row

  • flatten – call flatten_dictionary

  • fLOG – logging function

Returns:

iterator on records at first level.

It assumes the syntax follows the format: [ {"id": 1, ...}, {"id": 2, ...}, ...]. However, if option lines is true, the function considers that the stream or file has one record per row as follows:

{"id": 1, ...}
{"id": 2, ...}
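
A short sketch of that case, assuming a bytes string holding one record per row is accepted the same way as the array form used in the example below (the records are made up):

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items

# hypothetical JSON Lines input, one record per row, trailing newline kept
text_lines = b'''{"id": 1, "value": "a"}
{"id": 2, "value": "b"}
'''

for record in enumerate_json_items(text_lines, lines=True):
    print(record)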

Processes a JSON file by streaming.

The module ijson can read a JSON file by streaming. This module is needed because a record can be written over multiple lines. This function leverages it and produces the following results.

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items

text_json = b'''
    [
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": [{
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }]
            }
        }
    },
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": [{
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }]
                }
            }
        }
    }
    ]
'''

for item in enumerate_json_items(text_json):
    print(item)

>>>

    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': [{'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}]}}}
    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': [{'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}]}}}}

The parsed JSON must have an empty line at the end, otherwise the following exception is raised: ijson.common.IncompleteJSONError: parse error: unallowed token at this point in JSON text.

source on GitHub

pandas_streaming.df.dataframe_io_helpers.flatten_dictionary(dico, sep='_')

Flattens a dictionary with nested structure to a dictionary with no hierarchy.

Parameters:
  • dico – dictionary to flatten

  • sep – string to separate dictionary keys by

Returns:

flattened dictionary

Inspired by flatten_json.
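
A small sketch, assuming nested keys are concatenated with the separator (the input dictionary is made up):

<<<

from pandas_streaming.df.dataframe_io_helpers import flatten_dictionary

nested = {"glossary": {"title": "example glossary", "GlossDiv": {"title": "S"}}}
# assuming keys are joined with sep (default '_'), this should produce
# something like {'glossary_title': 'example glossary', 'glossary_GlossDiv_title': 'S'}
print(flatten_dictionary(nested))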

source on GitHub