module pandas_streaming.df.dataframe_io_helpers¶
Short summary¶
module pandas_streaming.df.dataframe_io_helpers
Classes¶
class | truncated documentation
---|---
JsonIterator2Stream | Transforms an iterator on JSON items into a stream which returns an item as a string every time method …
JsonPerRowsStream | Reads a JSON stream and adds …
Functions¶
function | truncated documentation
---|---
enumerate_json_items | Enumerates items from a JSON file or string.
flatten_dictionary | Flattens a dictionary with nested structure to a dictionary with no hierarchy.
Methods¶
method | truncated documentation
---|---
__iter__ | Iterates on each row. The behaviour is a bit tricky. It is implemented to be swallowed by …
getvalue | Returns the whole stream content.
read | Reads the next item and returns it as a string.
read | Reads characters, adds …
readline | Reads a line, adds …
seek | Change the stream position to the given byte offset.
seek | Change the stream position to the given byte offset.
write | The class does not write.
Documentation¶
Saves and reads a dataframe into a zip file.
- class pandas_streaming.df.dataframe_io_helpers.JsonIterator2Stream(it, **kwargs)¶
Bases:
object
Transforms an iterator on JSON items into a stream which returns an item as a string every time method read is called. The iterator could be one returned by enumerate_json_items.
- Parameters:
it – iterator
kwargs – arguments to json.dumps
Reshape a json file
The function enumerate_json_items reads any json even if every record is split over multiple lines. Class JsonIterator2Stream mocks this iterator as a stream. Each row is a single item.
<<<
from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items, JsonIterator2Stream

text_json = b'''
[
  {
    "glossary": {
      "title": "example glossary",
      "GlossDiv": {
        "title": "S",
        "GlossList": [{
          "GlossEntry": {
            "ID": "SGML",
            "SortAs": "SGML",
            "GlossTerm": "Standard Generalized Markup Language",
            "Acronym": "SGML",
            "Abbrev": "ISO 8879:1986",
            "GlossDef": {
              "para": "A meta-markup language, used to create markup languages such as DocBook.",
              "GlossSeeAlso": ["GML", "XML"]
            },
            "GlossSee": "markup"
          }
        }]
      }
    }
  },
  {
    "glossary": {
      "title": "example glossary",
      "GlossDiv": {
        "title": "S",
        "GlossList": {
          "GlossEntry": [{
            "ID": "SGML",
            "SortAs": "SGML",
            "GlossTerm": "Standard Generalized Markup Language",
            "Acronym": "SGML",
            "Abbrev": "ISO 8879:1986",
            "GlossDef": {
              "para": "A meta-markup language, used to create markup languages such as DocBook.",
              "GlossSeeAlso": ["GML", "XML"]
            },
            "GlossSee": "markup"
          }]
        }
      }
    }
  }
]
'''

for item in JsonIterator2Stream(lambda: enumerate_json_items(text_json)):
    print(item)
>>>
{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":[{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}]}}} {"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":[{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}]}}}}
Changed in version 0.3: The class takes a function which outputs an iterator and not an iterator. JsonIterator2Stream(enumerate_json_items(text_json)) needs to be rewritten into JsonIterator2Stream(lambda: enumerate_json_items(text_json)).
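The reason for this change can be illustrated with plain Python: a bare iterator exhausts after one pass, while a callable returning a fresh iterator lets the consumer restart the iteration. This is a minimal sketch of that behaviour, not code from the library itself.

```python
items = [1, 2, 3]

# A bare iterator can only be consumed once.
it = iter(items)
first_pass = list(it)
second_pass = list(it)
print(first_pass, second_pass)  # [1, 2, 3] []

# A callable returning a fresh iterator can restart the iteration,
# which matters when a consumer calls __iter__ more than once.
factory = lambda: iter(items)
print(list(factory()), list(factory()))  # [1, 2, 3] [1, 2, 3]
```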
- __init__(it, **kwargs)¶
- __iter__()¶
Iterates on each row. The behaviour is a bit tricky. It is implemented to be swallowed by pandas.read_json() which uses itertools.islice() to go through the items. It calls __iter__ multiple times but expects the iterator to continue from where it stopped the last time.
- read()¶
Reads the next item and returns it as a string.
- seek(offset)¶
Change the stream position to the given byte offset.
- Parameters:
offset – offset, only 0 is implemented
- write()¶
The class does not write.
- class pandas_streaming.df.dataframe_io_helpers.JsonPerRowsStream(st)¶
Bases:
object
Reads a JSON stream and adds `,`, `[`, `]` to convert a stream containing one JSON object per row into one single JSON object. It only implements method readline.
- Parameters:
st – stream
- __init__(st)¶
- getvalue()¶
Returns the whole stream content.
- read(size=-1)¶
Reads characters, adds `,`, `[`, `]` if needed, so the number of characters read is not necessarily the requested one but could be greater.
- readline(size=-1)¶
Reads a line, adds `,`, `[`, `]` if needed, so the number of characters read is not necessarily the requested one but could be greater.
- seek(offset)¶
Change the stream position to the given byte offset.
- Parameters:
offset – offset, only 0 is implemented
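The transformation JsonPerRowsStream performs can be sketched with plain Python: inserting `[`, `,` and `]` turns a text with one JSON object per row into a single parseable JSON array. This is an illustrative sketch, not the class's actual implementation; the input below is a made-up example.

```python
import json

# Hypothetical input: one JSON object per row, as JsonPerRowsStream expects.
rows_text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# Inserting "[", "," and "]" makes the whole stream parse as one JSON array.
rows = [line for line in rows_text.splitlines() if line.strip()]
as_single_json = "[" + ",".join(rows) + "]"

records = json.loads(as_single_json)
print(records)  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```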
- pandas_streaming.df.dataframe_io_helpers.enumerate_json_items(filename, encoding=None, lines=False, flatten=False, fLOG=None)¶
Enumerates items from a JSON file or string.
- Parameters:
filename – filename or string or stream to parse
encoding – encoding
lines – one record per row
flatten – call flatten_dictionary
fLOG – logging function
- Returns:
iterator on records at first level.
It assumes the syntax follows the format: [ {"id": 1, ...}, {"id": 2, ...}, ... ]. However, if option lines is true, the function considers that the stream or file has one record per row, as follows:
{"id": 1, ...}
{"id": 2, ...}
Processes a json file by streaming.
The module ijson can read a JSON file by streaming. This module is needed because a record can be written over multiple lines. This function leverages it and produces the following results.
<<<
from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items

text_json = b'''
[
  {
    "glossary": {
      "title": "example glossary",
      "GlossDiv": {
        "title": "S",
        "GlossList": [{
          "GlossEntry": {
            "ID": "SGML",
            "SortAs": "SGML",
            "GlossTerm": "Standard Generalized Markup Language",
            "Acronym": "SGML",
            "Abbrev": "ISO 8879:1986",
            "GlossDef": {
              "para": "A meta-markup language, used to create markup languages such as DocBook.",
              "GlossSeeAlso": ["GML", "XML"]
            },
            "GlossSee": "markup"
          }
        }]
      }
    }
  },
  {
    "glossary": {
      "title": "example glossary",
      "GlossDiv": {
        "title": "S",
        "GlossList": {
          "GlossEntry": [{
            "ID": "SGML",
            "SortAs": "SGML",
            "GlossTerm": "Standard Generalized Markup Language",
            "Acronym": "SGML",
            "Abbrev": "ISO 8879:1986",
            "GlossDef": {
              "para": "A meta-markup language, used to create markup languages such as DocBook.",
              "GlossSeeAlso": ["GML", "XML"]
            },
            "GlossSee": "markup"
          }]
        }
      }
    }
  }
]
'''

for item in enumerate_json_items(text_json):
    print(item)
>>>
{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': [{'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}]}}}
{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': [{'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}]}}}}
The parsed JSON must have an empty line at the end, otherwise the following exception is raised: ijson.common.IncompleteJSONError: parse error: unallowed token at this point in JSON text.
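For the lines=True case, the behaviour can be sketched with the standard library alone: each non-empty row is parsed as one record and yielded lazily. This generator is an illustrative stand-in for enumerate_json_items(..., lines=True), not the library's actual code, and its name is made up.

```python
import io
import json

def enumerate_lines(stream):
    # Illustrative stand-in: yields one record per non-empty row
    # without loading the whole stream into memory.
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

stream = io.StringIO('{"id": 1}\n{"id": 2}\n')
print(list(enumerate_lines(stream)))  # [{'id': 1}, {'id': 2}]
```

Note that this simple approach only works when every record fits on one line; records split over multiple lines need a streaming parser such as ijson, which is what enumerate_json_items relies on.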
- pandas_streaming.df.dataframe_io_helpers.flatten_dictionary(dico, sep='_')¶
Flattens a dictionary with nested structure to a dictionary with no hierarchy.
- Parameters:
dico – dictionary to flatten
sep – string to separate dictionary keys by
- Returns:
flattened dictionary
Inspired by flatten_json.
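The flattening behaviour can be sketched as a short recursive walk that joins nested keys with the separator. This is an illustrative re-implementation under the same name; the library's own function may differ in details such as list handling.

```python
def flatten_dictionary(dico, sep="_"):
    # Illustrative sketch: walk the nested dictionary and join the key
    # path with `sep`; leaves become entries of a flat dictionary.
    flat = {}

    def _walk(obj, prefix):
        for key, value in obj.items():
            full = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(value, dict):
                _walk(value, full)
            else:
                flat[full] = value

    _walk(dico, "")
    return flat

nested = {"glossary": {"title": "example", "GlossDiv": {"title": "S"}}}
print(flatten_dictionary(nested))
# {'glossary_title': 'example', 'glossary_GlossDiv_title': 'S'}
```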