{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Binary classification with text features\n", "\n", "This notebook shows how to add text features to a binary classifier and measure how much they improve its performance. "]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## Getting the data\n", "\n", "The data can be downloaded from [Comp\u00e9tition 2017 - additifs alimentaires](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/questions/competition_2A.html#id1) or with the following code:"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": ["from pyensae.datasource import download_data\n", "data_train = download_data(\"off_train_all.zip\", \n", " url=\"https://raw.githubusercontent.com/sdpython/data/master/OpenFoodFacts/\")"]}, {"cell_type": "code", "execution_count": 3, "metadata": {"collapsed": true}, "outputs": [], "source": ["data_test = download_data(\"off_test_all.zip\", \n", " url=\"https://raw.githubusercontent.com/sdpython/data/master/OpenFoodFacts/\")"]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pandas\n", "df = pandas.read_csv(\"off_test_all.txt\", sep=\"\\t\", encoding=\"utf8\", low_memory=False)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" code url \\\n", "0 1.008255e+10 http://world-fr.openfoodfacts.org/produit/0010... \n", "1 1.182204e+10 http://world-fr.openfoodfacts.org/produit/0011... \n", "2 2.548401e+10 http://world-fr.openfoodfacts.org/produit/0025... \n", "3 1.229250e+10 http://world-fr.openfoodfacts.org/produit/0012... \n", "4 1.115054e+10 http://world-fr.openfoodfacts.org/produit/0011... \n", "\n", " creator created_t created_datetime last_modified_t \\\n", "0 usda-ndb-import 1489064583 2017-03-09T13:03:03Z 1489064583 \n", "1 usda-ndb-import 1489070197 2017-03-09T14:36:37Z 1489070197 \n", "2 usda-ndb-import 1489052024 2017-03-09T09:33:44Z 1489052024 \n", "3 usda-ndb-import 1489133493 2017-03-10T08:11:33Z 1489133493 \n", "4 usda-ndb-import 1489052892 2017-03-09T09:48:12Z 1489052892 \n", "\n", " last_modified_datetime product_name \\\n", "0 2017-03-09T13:03:03Z Golden Island, Pork Jerky, Grilled Barbecue \n", "1 2017-03-09T14:36:37Z Big Fizz, Soda, Orange \n", "2 2017-03-09T09:33:44Z Tofubaked Marinated Baked Tofu, Sesame Ginger \n", "3 2017-03-10T08:11:33Z Milk Chocolate Eggs \n", "4 2017-03-09T09:48:12Z Fresh Polish Sausage \n", "\n", " generic_name quantity ... collagen-meat-protein-ratio_100g cocoa_100g \\\n", "0 NaN NaN ... NaN NaN \n", "1 NaN NaN ... NaN NaN \n", "2 NaN NaN ... NaN NaN \n", "3 NaN NaN ... NaN NaN \n", "4 NaN NaN ... 
NaN NaN \n", "\n", " chlorophyl_100g carbon-footprint_100g nutrition-score-fr_100g \\\n", "0 NaN NaN 23.0 \n", "1 NaN NaN NaN \n", "2 NaN NaN -2.0 \n", "3 NaN NaN 23.0 \n", "4 NaN NaN 22.0 \n", "\n", " nutrition-score-uk_100g glycemic-index_100g water-hardness_100g hasE s100 \n", "0 23.0 NaN NaN False 17.0 \n", "1 NaN NaN NaN True 7.0 \n", "2 -2.0 NaN NaN True 17.0 \n", "3 23.0 NaN NaN True 17.0 \n", "4 22.0 NaN NaN True 17.0 \n", "\n", "[5 rows x 165 columns]"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head()"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" 0 \\\n", "code 1.00826e+10 \n", "url http://world-fr.openfoodfacts.org/produit/0010... \n", "creator usda-ndb-import \n", "created_t 1489064583 \n", "created_datetime 2017-03-09T13:03:03Z \n", "last_modified_t 1489064583 \n", "last_modified_datetime 2017-03-09T13:03:03Z \n", "product_name Golden Island, Pork Jerky, Grilled Barbecue \n", "generic_name NaN \n", "quantity NaN \n", "packaging NaN \n", "packaging_tags NaN \n", "brands Golden Island Jerky Inc. \n", "brands_tags golden-island-jerky-inc \n", "categories NaN \n", "categories_tags NaN \n", "categories_fr NaN \n", "origins NaN \n", "origins_tags NaN \n", "manufacturing_places NaN \n", "manufacturing_places_tags NaN \n", "labels NaN \n", "labels_tags NaN \n", "labels_fr NaN \n", "emb_codes NaN \n", "emb_codes_tags NaN \n", "first_packaging_code_geo NaN \n", "cities NaN \n", "cities_tags NaN \n", "purchase_places NaN \n", "stores NaN \n", "countries US \n", "countries_tags en:united-states \n", "countries_fr \u00c9tats-Unis \n", "ingredients_text Pork, sugar, water, brown sugar, gluten free s... \n", "allergens NaN \n", "allergens_fr NaN \n", "traces NaN \n", "traces_tags NaN \n", "traces_fr NaN \n", "serving_size 28 g (1 oz) \n", "no_nutriments NaN \n", "additives_n 0 \n", "additives en:2-or-less;en:brown-sugar;en:contain-rice;en... \n", "additives_tags NaN \n", "additives_fr NaN \n", "ingredients_from_palm_oil_n 0 \n", "ingredients_from_palm_oil NaN \n", "ingredients_from_palm_oil_tags NaN \n", "ingredients_that_may_be_from_palm_oil_n 0 \n", "\n", " 1 \n", "code 1.1822e+10 \n", "url http://world-fr.openfoodfacts.org/produit/0011... 
\n", "creator usda-ndb-import \n", "created_t 1489070197 \n", "created_datetime 2017-03-09T14:36:37Z \n", "last_modified_t 1489070197 \n", "last_modified_datetime 2017-03-09T14:36:37Z \n", "product_name Big Fizz, Soda, Orange \n", "generic_name NaN \n", "quantity NaN \n", "packaging NaN \n", "packaging_tags NaN \n", "brands Rite Aid Corporation \n", "brands_tags rite-aid-corporation \n", "categories NaN \n", "categories_tags NaN \n", "categories_fr NaN \n", "origins NaN \n", "origins_tags NaN \n", "manufacturing_places NaN \n", "manufacturing_places_tags NaN \n", "labels NaN \n", "labels_tags NaN \n", "labels_fr NaN \n", "emb_codes NaN \n", "emb_codes_tags NaN \n", "first_packaging_code_geo NaN \n", "cities NaN \n", "cities_tags NaN \n", "purchase_places NaN \n", "stores NaN \n", "countries US \n", "countries_tags en:united-states \n", "countries_fr \u00c9tats-Unis \n", "ingredients_text Carbonated water, high fructose corn syrup, ci... \n", "allergens NaN \n", "allergens_fr NaN \n", "traces NaN \n", "traces_tags NaN \n", "traces_fr NaN \n", "serving_size 240 ml (8 fl oz) \n", "no_nutriments NaN \n", "additives_n 6 \n", "additives en:and-brominated-vegetable-oil;en:carbonated-... \n", "additives_tags en:e110,en:e211,en:e330,en:e414,en:e443,en:e445 \n", "additives_fr E110 - Jaune orang\u00e9 S,E211 - Benzoate de sodiu... \n", "ingredients_from_palm_oil_n 0 \n", "ingredients_from_palm_oil NaN \n", "ingredients_from_palm_oil_tags NaN \n", "ingredients_that_may_be_from_palm_oil_n 0 "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head(n=2).T[:50]"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" 0 \\\n", "ingredients_that_may_be_from_palm_oil NaN \n", "ingredients_that_may_be_from_palm_oil_tags NaN \n", "nutrition_grade_uk NaN \n", "nutrition_grade_fr e \n", "pnns_groups_1 NaN \n", "pnns_groups_2 NaN \n", "states en:to-be-completed, en:nutrition-facts-complet... \n", "states_tags en:to-be-completed,en:nutrition-facts-complete... \n", "states_fr A compl\u00e9ter,Informations nutritionnelles compl... \n", "main_category NaN \n", "main_category_fr NaN \n", "image_url NaN \n", "image_small_url NaN \n", "energy_100g 1494 \n", "energy-from-fat_100g NaN \n", "fat_100g 7.14 \n", "saturated-fat_100g 1.79 \n", "butyric-acid_100g NaN \n", "caproic-acid_100g NaN \n", "caprylic-acid_100g NaN \n", "capric-acid_100g NaN \n", "lauric-acid_100g NaN \n", "myristic-acid_100g NaN \n", "palmitic-acid_100g NaN \n", "stearic-acid_100g NaN \n", "arachidic-acid_100g NaN \n", "behenic-acid_100g NaN \n", "lignoceric-acid_100g NaN \n", "cerotic-acid_100g NaN \n", "montanic-acid_100g NaN \n", "melissic-acid_100g NaN \n", "monounsaturated-fat_100g NaN \n", "polyunsaturated-fat_100g NaN \n", "omega-3-fat_100g NaN \n", "alpha-linolenic-acid_100g NaN \n", "eicosapentaenoic-acid_100g NaN \n", "docosahexaenoic-acid_100g NaN \n", "omega-6-fat_100g NaN \n", "linoleic-acid_100g NaN \n", "arachidonic-acid_100g NaN \n", "gamma-linolenic-acid_100g NaN \n", "dihomo-gamma-linolenic-acid_100g NaN \n", "omega-9-fat_100g NaN \n", "oleic-acid_100g NaN \n", "elaidic-acid_100g NaN \n", "gondoic-acid_100g NaN \n", "mead-acid_100g NaN \n", "erucic-acid_100g NaN \n", "nervonic-acid_100g NaN \n", "trans-fat_100g 0 \n", "\n", " 1 \n", "ingredients_that_may_be_from_palm_oil NaN \n", "ingredients_that_may_be_from_palm_oil_tags NaN \n", "nutrition_grade_uk NaN \n", "nutrition_grade_fr NaN \n", "pnns_groups_1 NaN \n", "pnns_groups_2 NaN \n", "states en:to-be-completed, en:nutrition-facts-complet... \n", "states_tags en:to-be-completed,en:nutrition-facts-complete... 
\n", "states_fr A compl\u00e9ter,Informations nutritionnelles compl... \n", "main_category NaN \n", "main_category_fr NaN \n", "image_url NaN \n", "image_small_url NaN \n", "energy_100g 226 \n", "energy-from-fat_100g NaN \n", "fat_100g 0 \n", "saturated-fat_100g NaN \n", "butyric-acid_100g NaN \n", "caproic-acid_100g NaN \n", "caprylic-acid_100g NaN \n", "capric-acid_100g NaN \n", "lauric-acid_100g NaN \n", "myristic-acid_100g NaN \n", "palmitic-acid_100g NaN \n", "stearic-acid_100g NaN \n", "arachidic-acid_100g NaN \n", "behenic-acid_100g NaN \n", "lignoceric-acid_100g NaN \n", "cerotic-acid_100g NaN \n", "montanic-acid_100g NaN \n", "melissic-acid_100g NaN \n", "monounsaturated-fat_100g NaN \n", "polyunsaturated-fat_100g NaN \n", "omega-3-fat_100g NaN \n", "alpha-linolenic-acid_100g NaN \n", "eicosapentaenoic-acid_100g NaN \n", "docosahexaenoic-acid_100g NaN \n", "omega-6-fat_100g NaN \n", "linoleic-acid_100g NaN \n", "arachidonic-acid_100g NaN \n", "gamma-linolenic-acid_100g NaN \n", "dihomo-gamma-linolenic-acid_100g NaN \n", "omega-9-fat_100g NaN \n", "oleic-acid_100g NaN \n", "elaidic-acid_100g NaN \n", "gondoic-acid_100g NaN \n", "mead-acid_100g NaN \n", "erucic-acid_100g NaN \n", "nervonic-acid_100g NaN \n", "trans-fat_100g NaN "]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head(n=2).T[50:100]"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" 0 1\n", "cholesterol_100g 0.089 NaN\n", "carbohydrates_100g 39.29 13.75\n", "sugars_100g 39.29 13.75\n", "sucrose_100g NaN NaN\n", "glucose_100g NaN NaN\n", "fructose_100g NaN NaN\n", "lactose_100g NaN NaN\n", "maltose_100g NaN NaN\n", "maltodextrins_100g NaN NaN\n", "starch_100g NaN NaN\n", "polyols_100g NaN NaN\n", "fiber_100g 0 NaN\n", "proteins_100g 28.57 0\n", "casein_100g NaN NaN\n", "serum-proteins_100g NaN NaN\n", "nucleotides_100g NaN NaN\n", "salt_100g 2.81178 0.04826\n", "sodium_100g 1.107 0.019\n", "alcohol_100g NaN NaN\n", "vitamin-a_100g 0 NaN\n", "beta-carotene_100g NaN NaN\n", "vitamin-d_100g NaN NaN\n", "vitamin-e_100g NaN NaN\n", "vitamin-k_100g NaN NaN\n", "vitamin-c_100g 0 NaN\n", "vitamin-b1_100g NaN NaN\n", "vitamin-b2_100g NaN NaN\n", "vitamin-pp_100g NaN NaN\n", "vitamin-b6_100g NaN NaN\n", "vitamin-b9_100g NaN NaN\n", "folates_100g NaN NaN\n", "vitamin-b12_100g NaN NaN\n", "biotin_100g NaN NaN\n", "pantothenic-acid_100g NaN NaN\n", "silica_100g NaN NaN\n", "bicarbonate_100g NaN NaN\n", "potassium_100g NaN NaN\n", "chloride_100g NaN NaN\n", "calcium_100g 0 NaN\n", "phosphorus_100g NaN NaN\n", "iron_100g 0.00257 NaN\n", "magnesium_100g NaN NaN\n", "zinc_100g NaN NaN\n", "copper_100g NaN NaN\n", "manganese_100g NaN NaN\n", "fluoride_100g NaN NaN\n", "selenium_100g NaN NaN\n", "chromium_100g NaN NaN\n", "molybdenum_100g NaN NaN\n", "iodine_100g NaN NaN"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head(n=2).T[100:150]"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" 0 1\n", "caffeine_100g NaN NaN\n", "taurine_100g NaN NaN\n", "ph_100g NaN NaN\n", "fruits-vegetables-nuts_100g NaN NaN\n", "fruits-vegetables-nuts-estimate_100g NaN NaN\n", "collagen-meat-protein-ratio_100g NaN NaN\n", "cocoa_100g NaN NaN\n", "chlorophyl_100g NaN NaN\n", "carbon-footprint_100g NaN NaN\n", "nutrition-score-fr_100g 23 NaN\n", "nutrition-score-uk_100g 23 NaN\n", "glycemic-index_100g NaN NaN\n", "water-hardness_100g NaN NaN\n", "hasE False True\n", "s100 17 7"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head(n=2).T[150:]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 : extraire un \u00e9chantillon al\u00e9atoire\n", "\n", "Les donn\u00e9es sont volumineuses. Prenons un \u00e9chantillon pour aller plus vite."]}, {"cell_type": "code", "execution_count": 10, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 2 : caler un premier mod\u00e8le de classification binaire\n", "\n", "Avec les variables num\u00e9riques uniquement. 
The target is the variable ``hasE``."]}, {"cell_type": "code", "execution_count": 11, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 3: plot a ROC curve"]}, {"cell_type": "code", "execution_count": 12, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 4: use the variable that contains the country\n", "\n", "Does the model perform better?"]}, {"cell_type": "code", "execution_count": 13, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 5: use the variable additives_tags?\n", "\n", "Don't the results look suspicious?"]}, {"cell_type": "code", "execution_count": 14, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 6: use the variable ingredients_text?\n", "\n", "What does the word *embedding* suggest to you?"]}, {"cell_type": "code", "execution_count": 15, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 7: use the variable product_name?"]}, {"cell_type": "code", "execution_count": 16, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 8: use GridSearch to optimize a hyperparameter of the model"]}, {"cell_type": "code", "execution_count": 17, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 9: run a cross-validation"]}, {"cell_type": "code", "execution_count": 18, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, 
"language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5"}}, "nbformat": 4, "nbformat_minor": 2}