module hackathon.json_helper
Short summary
module ensae_projects.hackathon.json_helper
Helpers for the hackathon 2017 (Label Emmaüs).
Functions
| function | truncated documentation |
|---|---|
| enumerate_json_items | Enumerates items from a JSON file or string. |
| extract_images_from_json_2017 | Extracts fields from a JSON file such as images. |
Documentation
- ensae_projects.hackathon.json_helper.enumerate_json_items(filename, encoding=None, fLOG=<function noLOG>)
Enumerates items from a JSON file or string.
- Parameters:
filename – filename or string or stream to parse
encoding – encoding
fLOG – logging function
- Returns:
iterator on records at first level.
It assumes the syntax follows the format:
[ {"id":1, ...}, {"id": 2, ...}, ...]
Processes a JSON file by streaming
The module ijson can read a JSON file by streaming. It is needed because a record can be written on multiple lines, so the file cannot simply be parsed line by line. This function leverages ijson and produces the following results.
<<<
from ensae_projects.hackathon import enumerate_json_items

text_json = '''
[
{
 "glossary": {
  "title": "example glossary",
  "GlossDiv": {
   "title": "S",
   "GlossList": [{
    "GlossEntry": {
     "ID": "SGML",
     "SortAs": "SGML",
     "GlossTerm": "Standard Generalized Markup Language",
     "Acronym": "SGML",
     "Abbrev": "ISO 8879:1986",
     "GlossDef": {
      "para": "A meta-markup language, used to create markup languages such as DocBook.",
      "GlossSeeAlso": ["GML", "XML"]
     },
     "GlossSee": "markup"
    }
   }]
  }
 }
},
{
 "glossary": {
  "title": "example glossary",
  "GlossDiv": {
   "title": "S",
   "GlossList": {
    "GlossEntry": [{
     "ID": "SGML",
     "SortAs": "SGML",
     "GlossTerm": "Standard Generalized Markup Language",
     "Acronym": "SGML",
     "Abbrev": "ISO 8879:1986",
     "GlossDef": {
      "para": "A meta-markup language, used to create markup languages such as DocBook.",
      "GlossSeeAlso": ["GML", "XML"]
     },
     "GlossSee": "markup"
    }]
   }
  }
 }
}
]
'''

for item in enumerate_json_items(text_json):
    print('------------')
    print(item)
>>>
------------
{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': [{'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}]}}}
------------
{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': [{'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}]}}}}
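The streaming idea can also be illustrated without ijson. The sketch below is not the implementation of enumerate_json_items and does not use ijson: it swaps in the standard library's json.JSONDecoder.raw_decode to pull one record at a time out of a top-level array, even when a record spans several lines. The helper name iter_top_level_records and its chunk_size parameter are hypothetical, and the sketch assumes every record is a JSON object, as in the format [ {"id":1, ...}, {"id": 2, ...}, ...].

```python
import io
import json


def iter_top_level_records(stream, chunk_size=64):
    """Yield each object of a top-level JSON array read from a text stream.

    Hypothetical helper, for illustration only. It assumes the records
    are JSON objects: an incomplete object never decodes, so a failed
    decode simply means "read more text".
    """
    decoder = json.JSONDecoder()
    buffer = ""
    started = False  # True once the opening '[' has been consumed
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        # Decode as many complete records as the buffer currently holds.
        while True:
            buffer = buffer.lstrip()
            if not started:
                if not buffer:
                    break
                if buffer[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buffer = buffer[1:]
                started = True
                continue
            if buffer.startswith(","):
                # Separator between two records.
                buffer = buffer[1:].lstrip()
            if buffer.startswith("]"):
                return  # end of the array
            try:
                record, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # record still incomplete, read another chunk
            yield record
            buffer = buffer[end:]
        if not chunk:
            raise ValueError("unexpected end of input")


# Records deliberately written on several lines.
text = '[\n {"id": 1,\n  "name": "a"},\n {"id": 2,\n  "tags": ["x", "y"]}\n]'
for record in iter_top_level_records(io.StringIO(text), chunk_size=5):
    print(record)
```

Re-decoding the buffer after every chunk is simple but slow for very large records; an event-based parser such as ijson avoids that cost, which is presumably why the module relies on it.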
- ensae_projects.hackathon.json_helper.extract_images_from_json_2017(filename, encoding=None, fLOG=<function noLOG>)
Extracts fields from a JSON file such as images.
- Parameters:
filename – filename
encoding – encoding
fLOG – logging function
- Returns:
iterator on images
Warning: copy between two iterations?
If you plan to store the enumerated dictionaries, you should copy them because the dictionaries are reused from one iteration to the next.
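The effect of a reused dictionary can be reproduced with a small stand-in. The generator enumerate_reusing_dict below is hypothetical and only assumes the behaviour the warning describes (one dictionary object refilled at each iteration); it is not the library's code.

```python
def enumerate_reusing_dict(rows):
    """Hypothetical iterator that refills one dictionary for every record."""
    record = {}
    for row in rows:
        record.clear()
        record.update(row)
        yield record


rows = [{"id": 1}, {"id": 2}, {"id": 3}]

# Without a copy, every stored element is the same object,
# so the list only reflects the last record.
stored = [item for item in enumerate_reusing_dict(rows)]
print(all(item is stored[0] for item in stored))

# With a copy, each record keeps its own values.
copied = [dict(item) for item in enumerate_reusing_dict(rows)]
print(copied == rows)
```

dict(item) makes a shallow copy, which is enough for flat records; copy.deepcopy would be the safer choice if nested values were also reused.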
The following example runs on dummy data that implements only a subset of the fields the JSON contains. The result can easily be converted into a dataframe.
<<<
from ensae_projects.hackathon import extract_images_from_json_2017

text_json = '''
[
  {"assigned_images": [],
   "best_offer": {
     "created_on": "2016-11-04T23:20:53+01:00",
     "images": [],
     "offer_longitude": null,
     "availability": "in_stock",
     "start_selling_date": null,
     "delay_before_shipping": 0.00,
     "free_return": null,
     "free_shipping": null,
     "assigned_images": [{"image_path": "https://coucou.JPEG"}],
     "id": 1306501,
     "eco_tax": 0.000000,
     "keywords": ["boutique", "test"],
     "sku": "AAAA27160018",
     "product": {"pk": 2550, "external_id": null, "id": 2580},
     "description": "livre l",
     "last_modified": "2016-11-04T23:27:01+01:00",
     "name": "les names",
     "language": "fr"},
   "id": 25540,
   "description": "livre 2",
   "slug": "les-l",
   "application_categories": [280, 283],
   "product_type": "physical",
   "name": "les l n",
   "language": "fr",
   "popularity": 99,
   "gender": null
  }
]
'''

items = []
for item in extract_images_from_json_2017(text_json):
    print(item)
    items.append(item)

from pandas import DataFrame
df = DataFrame(items)
print(df)
>>>
{'product_pk': 2550, 'product_id': 2580, 'id2': None, 'sku': 'AAAA27160018', 'created_on': None, 'keywords': None, 'availability': 'in_stock', 'eco_tax': Decimal('0.000000'), 'restock_date': None, 'status': None, 'number_of_items': None, 'price_with_vat': None, 'price_without_vat': None, 'previous_price_without_vat': None, 'max_order_quantity': None, 'stock': None, 'start_selling_date': None, 'description': 'livre 2', 'last_modified': '2016-11-04T23:27:01+01:00', 'name': 'les l n', 'product_type': 'physical', 'gender': None, 'popularity': 99, 'application_categories': '280,283', 'language': 'fr', 'image_path': 'https://coucou.JPEG'}
   product_pk  product_id  ...  language           image_path
0        2550        2580  ...        fr  https://coucou.JPEG

[1 rows x 26 columns]