module hackathon.json_helper

Short summary

module ensae_projects.hackathon.json_helper

Helpers for the hackathon 2017 (Label Emmaüs).

source on GitHub

Functions

function | truncated documentation
enumerate_json_items | Enumerates items from a JSON file or string.
extract_images_from_json_2017 | Extracts fields, such as images, from a JSON file.

Documentation


ensae_projects.hackathon.json_helper.enumerate_json_items(filename, encoding=None, fLOG=<function noLOG>)

Enumerates items from a JSON file or string.

Parameters:
  • filename – filename or string or stream to parse

  • encoding – encoding

  • fLOG – logging function

Returns:

iterator over the first-level records.

It assumes the syntax follows the format: [ {"id":1, ...}, {"id": 2, ...}, ...].

The JSON file is processed by streaming.

The module ijson can read a JSON file by streaming. It is needed because a record can be written on multiple lines, so the file cannot simply be parsed line by line. This function relies on ijson and produces the following results.

<<<

from ensae_projects.hackathon import enumerate_json_items

text_json = '''
    [
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": [{
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }]
            }
        }
    },
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": [{
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }]
                }
            }
        }
    }
    ]
'''

for item in enumerate_json_items(text_json):
    print('------------')
    print(item)

>>>

    ------------
    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': [{'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}]}}}
    ------------
    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': [{'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}]}}}}
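
The streaming idea can also be sketched with the standard library alone. The snippet below is not the implementation of enumerate_json_items (which relies on ijson); it is a minimal illustration, assuming the input is a top-level array whose records are JSON objects, as in the format described above. The name iter_top_level_records and the chunk_size parameter are hypothetical.

```python
import io
import json


def iter_top_level_records(stream, chunk_size=65536):
    """Yield each object of a top-level JSON array read from a text stream.

    Hypothetical helper: assumes every record is a JSON object ({...}).
    """
    decoder = json.JSONDecoder()
    buffer = ""
    started = False
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        while True:
            # Skip whitespace and the commas separating records.
            buffer = buffer.lstrip(" \t\r\n,")
            if not started:
                if not buffer:
                    break
                if buffer[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                started = True
                buffer = buffer[1:]
                continue
            if buffer.startswith("]"):
                return
            try:
                record, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # The record is incomplete: read another chunk and retry.
                break
            yield record
            buffer = buffer[end:]
        if not chunk:
            if buffer:
                raise ValueError("truncated JSON input")
            return


# Records may span several lines and several chunks.
text = '[ {"id": 1, "tags": ["a", "b"]},\n  {"id": 2,\n   "name": "multi-line"} ]'
records = list(iter_top_level_records(io.StringIO(text), chunk_size=7))
for record in records:
    print(record)
```

Each failed decoding attempt parses the pending record again from its start, so this sketch becomes slow on very large records. An event-based parser such as ijson avoids that cost.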


ensae_projects.hackathon.json_helper.extract_images_from_json_2017(filename, encoding=None, fLOG=<function noLOG>)

Extracts fields, such as images, from a JSON file.

Parameters:
  • filename – filename

  • encoding – encoding

  • fLOG – logging function

Returns:

iterator over images

Warning: copy between two iterations?

If you plan to store the enumerated dictionaries, you should copy them because the dictionaries are reused from one iteration to the next.
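
The effect of this reuse can be shown with a small stand-alone sketch. The generator reuse_dict_iterator below is hypothetical: it only mimics an iterator that recycles one dictionary and is not the code of extract_images_from_json_2017.

```python
def reuse_dict_iterator(rows):
    """Hypothetical generator that recycles a single dictionary."""
    record = {}
    for row in rows:
        record.clear()
        record.update(row)
        yield record


rows = [{"id": 1}, {"id": 2}, {"id": 3}]

# Without a copy, every stored element is the same object,
# which ends up holding the last record.
aliased = [rec for rec in reuse_dict_iterator(rows)]

# With a copy, each stored element keeps its own values.
copied = [rec.copy() for rec in reuse_dict_iterator(rows)]

print([r["id"] for r in aliased])  # [3, 3, 3]
print([r["id"] for r in copied])   # [1, 2, 3]
```

dict.copy() makes a shallow copy, which is enough when the values are scalars or strings. If the values were themselves mutable and reused, copy.deepcopy would be the safer choice.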

The example below runs on dummy data that implements only a subset of the fields the real JSON contains. The result can easily be converted into a dataframe.

<<<

from ensae_projects.hackathon import extract_images_from_json_2017

text_json = '''
    [
       {"assigned_images": [],
        "best_offer": {"created_on": "2016-11-04T23:20:53+01:00", "images": [], "offer_longitude": null, "availability": "in_stock",
                       "start_selling_date": null, "delay_before_shipping": 0.00, "free_return": null, "free_shipping": null,
                       "assigned_images": [{"image_path": "https://coucou.JPEG"}],
                       "id": 1306501, "eco_tax": 0.000000, "keywords": ["boutique", "test"],
        "sku": "AAAA27160018",
        "product": {"pk": 2550, "external_id": null, "id": 2580},
        "description": "livre l", "last_modified": "2016-11-04T23:27:01+01:00",
        "name": "les names", "language": "fr"}, "id": 25540,
        "description": "livre 2", "slug": "les-l",
        "application_categories": [280, 283], "product_type": "physical",
        "name": "les l n", "language": "fr", "popularity": 99, "gender": null
        }
    ]
    '''

items = []
for item in extract_images_from_json_2017(text_json):
    print(item)
    # Copy each record before storing it: the iterator reuses its dictionaries.
    items.append(item.copy())

from pandas import DataFrame
df = DataFrame(items)
print(df)

>>>

    {'product_pk': 2550, 'product_id': 2580, 'id2': None, 'sku': 'AAAA27160018', 'created_on': None, 'keywords': None, 'availability': 'in_stock', 'eco_tax': Decimal('0.000000'), 'restock_date': None, 'status': None, 'number_of_items': None, 'price_with_vat': None, 'price_without_vat': None, 'previous_price_without_vat': None, 'max_order_quantity': None, 'stock': None, 'start_selling_date': None, 'description': 'livre 2', 'last_modified': '2016-11-04T23:27:01+01:00', 'name': 'les l n', 'product_type': 'physical', 'gender': None, 'popularity': 99, 'application_categories': '280,283', 'language': 'fr', 'image_path': 'https://coucou.JPEG'}
       product_pk  product_id  ... language           image_path
    0        2550        2580  ...       fr  https://coucou.JPEG
    
    [1 rows x 26 columns]
