{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Dataset with categories\n", "\n", "The [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult) consists almost entirely of categorical variables. This notebook explores different ways of handling them."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["from seaborn import clustermap\n", "\n", "clustermap(corr, center=0, cmap=\"vlag\", linewidths=.75, figsize=(15, 15));"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Ce n'est pas facile \u00e0 voir. Il faudrait essayer avec [bokeh](https://bokeh.pydata.org/en/latest/docs/gallery/les_mis.html) ou essayer de proc\u00e9der autrement."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## ACM\n", "\n", "Ce qui suit n'est pas tout-\u00e0-fait une [ACM](https://fr.wikipedia.org/wiki/Analyse_des_correspondances_multiples) mais cela s'en inspire. On consid\u00e8re les variables comme des observations et on les projette sur des plans d\u00e9finis par les axes d'une [ACP](https://fr.wikipedia.org/wiki/Analyse_en_composantes_principales). On normalise \u00e9galement car on m\u00e9lange variables continues et variables binaires d'ordre de grandeur diff\u00e9rents. Les calculs sont plus pr\u00e9cis lorsque les matrices ont des coefficients de m\u00eame ordre. Le dernier exercice de cet examen [Programmation ENSAE 2006](http://www.xavierdupre.fr/site2013/enseignements/tdnote/ecrit_2006.pdf) ach\u00e8vera de vous convaincre."]}, {"cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [{"data": {"text/html": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["ax = tr.plot(x='axe1', y='axe2', kind='scatter', figsize=(10, 10))\n", "for t, (x, y, z) in tr.iterrows():\n", " ax.text(x, y, t, fontsize=10, rotation=10)\n", "ax.set_title(\"ACP sur les variables - axe 1, 2\");"]}, {"cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["ax = tr.plot(x='axe1', y='axe3', kind='scatter', figsize=(10, 10))\n", "for t, (x, y, z) in tr.iterrows():\n", " ax.text(x, z, t, fontsize=10, rotation=10)\n", "ax.set_title(\"ACP sur les variables - axe 1, 3\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On voit quelques variables \u00e0 supprimer car tr\u00e8s corr\u00e9l\u00e9es comme la relation ou la situation maritale. On voit aussi que les deux genres homme/femme sont oppos\u00e9s. On voit aussi que certaines cat\u00e9gories sont tr\u00e8s proches comme Prof, Masters ou dipl\u00f4m\u00e9s. Il est probable que le mod\u00e8le de pr\u00e9diction ne p\u00e2tisse pas du regroupement de ces trois cat\u00e9gories. On utilise [bokeh](https://bokeh.pydata.org/en/latest/) pour pouvoir zoomer."]}, {"cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
"], "text/plain": [" TR-24878 TR-18056 TR-920\n", "age 20 21 23\n", "workclass Private ? Private\n", "education 12th Some-college Some-college\n", "education_num 8 10 10\n", "marital_status Never-married Never-married Never-married\n", "occupation Other-service ? Sales\n", "relationship Own-child Own-child Not-in-family\n", "race White White White\n", "sex Male Male Male\n", "capital_gain 0 0 0\n", "capital_loss 0 0 0\n", "hours_per_week 35 16 25\n", "native_country United-States United-States United-States\n", "0 0 0 0"]}, "execution_count": 56, "metadata": {}, "output_type": "execute_result"}], "source": ["train_nn = pandas.concat([X_train, y_train], axis=1).iloc[[24878, 18056, 920], :].T\n", "train_nn.columns = ['TR-' + str(_) for _ in train_nn.columns]\n", "train_nn"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il faut comparer la premi\u00e8re colonne avec la quatri\u00e8me, la seconde avec la ciinqui\u00e8me et la trois\u00e8me avec la sixi\u00e8me. Ces exemples sont voisins. On voit que les exemples sont tr\u00e8s proches. Il n'y a qu'une seule valeur qui change \u00e0 chaque fois et il est difficile d'expliquer les erreurs."]}, {"cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>TR-24878</th><th>TR-18056</th><th>TR-920</th><th>10408</th><th>5953</th><th>3059</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>age</th><td>20</td><td>21</td><td>23</td><td>22</td><td>20</td><td>22</td></tr>\n",
"<tr><th>capital_gain</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>capital_loss</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>education</th><td>12th</td><td>Some-college</td><td>Some-college</td><td>Some-college</td><td>12th</td><td>Some-college</td></tr>\n",
"<tr><th>education_num</th><td>8</td><td>10</td><td>10</td><td>10</td><td>8</td><td>10</td></tr>\n",
"<tr><th>hours_per_week</th><td>35</td><td>16</td><td>25</td><td>15</td><td>35</td><td>25</td></tr>\n",
"<tr><th>marital_status</th><td>Never-married</td><td>Never-married</td><td>Never-married</td><td>Never-married</td><td>Never-married</td><td>Never-married</td></tr>\n",
"<tr><th>native_country</th><td>United-States</td><td>United-States</td><td>United-States</td><td>?</td><td>United-States</td><td>United-States</td></tr>\n",
"<tr><th>occupation</th><td>Other-service</td><td>?</td><td>Sales</td><td>?</td><td>Other-service</td><td>Sales</td></tr>\n",
"<tr><th>race</th><td>White</td><td>White</td><td>White</td><td>White</td><td>Black</td><td>White</td></tr>\n",
"<tr><th>relationship</th><td>Own-child</td><td>Own-child</td><td>Not-in-family</td><td>Own-child</td><td>Own-child</td><td>Not-in-family</td></tr>\n",
"<tr><th>sex</th><td>Male</td><td>Male</td><td>Male</td><td>Male</td><td>Male</td><td>Male</td></tr>\n",
"<tr><th>workclass</th><td>Private</td><td>?</td><td>Private</td><td>?</td><td>Private</td><td>Private</td></tr>\n",
"<tr><th>0</th><td>0</td><td>0</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>y_test</th><td>NaN</td><td>NaN</td><td>NaN</td><td>1</td><td>1</td><td>1</td></tr>\n",
"<tr><th>pred1</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>P1(&gt;=50K)</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0.0146811</td><td>0.00972505</td><td>0.0145646</td></tr>\n",
"<tr><th>pred2</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>P2(&gt;=50K)</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>pred3</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>P3(&gt;=50K)</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0.00288558</td><td>0.00299048</td><td>0.00561586</td></tr>\n",
"<tr><th>pred4</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>P4(&gt;=50K)</th><td>NaN</td><td>NaN</td><td>NaN</td><td>0.000149386</td><td>0.000395875</td><td>0.000863302</td></tr>\n",
"</tbody>\n", "</table>
"], "text/plain": [" TR-24878 TR-18056 TR-920 10408 \\\n", "age 20 21 23 22 \n", "capital_gain 0 0 0 0 \n", "capital_loss 0 0 0 0 \n", "education 12th Some-college Some-college Some-college \n", "education_num 8 10 10 10 \n", "hours_per_week 35 16 25 15 \n", "marital_status Never-married Never-married Never-married Never-married \n", "native_country United-States United-States United-States ? \n", "occupation Other-service ? Sales ? \n", "race White White White White \n", "relationship Own-child Own-child Not-in-family Own-child \n", "sex Male Male Male Male \n", "workclass Private ? Private ? \n", "0 0 0 0 NaN \n", "y_test NaN NaN NaN 1 \n", "pred1 NaN NaN NaN 0 \n", "P1(>=50K) NaN NaN NaN 0.0146811 \n", "pred2 NaN NaN NaN 0 \n", "P2(>=50K) NaN NaN NaN 0 \n", "pred3 NaN NaN NaN 0 \n", "P3(>=50K) NaN NaN NaN 0.00288558 \n", "pred4 NaN NaN NaN 0 \n", "P4(>=50K) NaN NaN NaN 0.000149386 \n", "\n", " 5953 3059 \n", "age 20 22 \n", "capital_gain 0 0 \n", "capital_loss 0 0 \n", "education 12th Some-college \n", "education_num 8 10 \n", "hours_per_week 35 25 \n", "marital_status Never-married Never-married \n", "native_country United-States United-States \n", "occupation Other-service Sales \n", "race Black White \n", "relationship Own-child Not-in-family \n", "sex Male Male \n", "workclass Private Private \n", "0 NaN NaN \n", "y_test 1 1 \n", "pred1 0 0 \n", "P1(>=50K) 0.00972505 0.0145646 \n", "pred2 0 0 \n", "P2(>=50K) 0 0 \n", "pred3 0 0 \n", "P3(>=50K) 0.00299048 0.00561586 \n", "pred4 0 0 \n", "P4(>=50K) 0.000395875 0.000863302 "]}, "execution_count": 57, "metadata": {}, "output_type": "execute_result"}], "source": ["pandas.concat([train_nn, wrong_study], axis=1, sort=True)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Ethique\n", "\n", "Le mod\u00e8le qu'on a appris est-il \u00e9thique ? 
Il n'y a pas de r\u00e9ponse simples \u00e0 ce probl\u00e8me car il est difficile de transcrire math\u00e9matiquement le caract\u00e8re \u00e9thique d'un mod\u00e8le. Dans le cas pr\u00e9sent, admettons que l'on souhaite v\u00e9rifier que le mod\u00e8le ne retourne pas une r\u00e9ponse biais\u00e9e en fonction du genre. Une premi\u00e8re id\u00e9e consiste \u00e0 enlever la variable pour \u00eatre s\u00fbr sur le mod\u00e8le n'en tienne pas compte mais cela ne garantit que la variable genre n'est une variable corr\u00e9l\u00e9e aux autres. Et la corr\u00e9lation implique ici au sens du mod\u00e8le ce qui a un sens plus fort si le mod\u00e8le n'est pas lin\u00e9aire. On aura tout-\u00e0-faire enlev\u00e9 la variable genre si elle ne peut \u00eatre pr\u00e9dite \u00e0 partir des autres."]}, {"cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": ["X_train_sex = train.drop([label, 'sex'], axis=1)\n", "y_train_sex = train['sex'] == 'Male'\n", "X_test_sex = test.drop([label, 'sex'], axis=1)\n", "y_test_sex = test['sex'] == 'Male'"]}, {"cell_type": "code", "execution_count": 58, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["Pipeline(memory=None,\n", " steps=[('onehotencoder',\n", " OneHotEncoder(cols=['workclass', 'education', 'marital_status',\n", " 'occupation', 'relationship', 'race',\n", " 'native_country'],\n", " drop_invariant=False, handle_missing='value',\n", " handle_unknown='value', return_df=True,\n", " use_cat_names=False, verbose=0)),\n", " ('randomforestclassifier',\n", " RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,\n", " class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto',\n", " max_leaf_nodes=None, max_samples=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators=100, n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, 
warm_start=False))],\n", " verbose=False)"]}, "execution_count": 59, "metadata": {}, "output_type": "execute_result"}], "source": ["ce_sex = OneHotEncoder(cols=[_ for _ in cat_col if _ != 'sex'], \n", " handle_missing='value', drop_invariant=False,\n", " handle_unknown='value')\n", "model_sex = make_pipeline(ce_sex, RandomForestClassifier(n_estimators=100))\n", "model_sex.fit(X_train_sex, y_train_sex)"]},
{"cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.8368650574289048"]}, "execution_count": 60, "metadata": {}, "output_type": "execute_result"}], "source": ["model_sex.score(X_test_sex, y_test_sex)"]},
{"cell_type": "markdown", "metadata": {}, "source": ["Gender can be predicted from the other variables with roughly 84% accuracy. Removing the variable is therefore not enough to prevent the model from being biased with respect to this information, and it is not obvious how to remove all influence of this attribute. One option is to keep it, assuming the model will pick it over another variable if it turns out to be relevant for the prediction. In that case, we can count how often the model changes its prediction when the *sex* variable is flipped. As a reminder, here is the model's performance."]},
{"cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.8448498249493275"]}, "execution_count": 61, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe2.score(X_test, y_test)"]},
{"cell_type": "markdown", "metadata": {}, "source": ["The *sex* variable is replaced by the prediction of the other model."]},
{"cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": ["X_test_modified = X_test.copy()\n", "X_test_modified['sex'] = model_sex.predict(X_test_sex)\n", "X_test_modified['sex'] = X_test_modified['sex'].apply(lambda x: 'Male' if x else 'Female')"]},
{"cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.843928505620048"]}, "execution_count": 63, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe2.score(X_test_modified, y_test)"]},
{"cell_type": "markdown", "metadata": {}, "source": ["Almost no change. Now let's flip the column."]},
{"cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": ["X_test_inv = X_test.copy()\n", "X_test_inv['sex'] = X_test_inv['sex'].apply(lambda x: 'Female' if x == 'Male' else 'Male')"]},
{"cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.8505005835022419"]}, "execution_count": 65, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe2.score(X_test_inv, y_test)"]},
{"cell_type": "markdown", "metadata": {}, "source": ["Even better. Now let's look at the differences."]},
{"cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [{"data": {"text/plain": ["(1024, 16281, 0.06289539954548247)"]}, "execution_count": 66, "metadata": {}, "output_type": "execute_result"}], "source": ["diff1 = X_test['sex'] != X_test_inv['sex']\n", "diff2 = pipe2.predict(X_test) != pipe2.predict(X_test_inv)\n", "diff2.sum(), diff1.sum(), diff2.sum() / diff1.sum()"]},
{"cell_type": "markdown", "metadata": {}, "source": ["The prediction changes in about 6% of cases. For these observations and this model, gender certainly has an impact; this says nothing about the others. Let's look at the first five."]},
{"cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>2</th><th>14</th><th>17</th><th>19</th><th>28</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>age</th><td>28</td><td>48</td><td>43</td><td>40</td><td>54</td></tr>\n",
"<tr><th>workclass</th><td>Local-gov</td><td>Private</td><td>Private</td><td>Private</td><td>Private</td></tr>\n",
"<tr><th>education</th><td>Assoc-acdm</td><td>HS-grad</td><td>HS-grad</td><td>Doctorate</td><td>HS-grad</td></tr>\n",
"<tr><th>education_num</th><td>12</td><td>9</td><td>9</td><td>16</td><td>9</td></tr>\n",
"<tr><th>marital_status</th><td>Married-civ-spouse</td><td>Married-civ-spouse</td><td>Married-civ-spouse</td><td>Married-civ-spouse</td><td>Married-civ-spouse</td></tr>\n",
"<tr><th>occupation</th><td>Protective-serv</td><td>Machine-op-inspct</td><td>Adm-clerical</td><td>Prof-specialty</td><td>Craft-repair</td></tr>\n",
"<tr><th>relationship</th><td>Husband</td><td>Husband</td><td>Wife</td><td>Husband</td><td>Husband</td></tr>\n",
"<tr><th>race</th><td>White</td><td>White</td><td>White</td><td>Asian-Pac-Islander</td><td>White</td></tr>\n",
"<tr><th>sex</th><td>Male</td><td>Male</td><td>Female</td><td>Male</td><td>Male</td></tr>\n",
"<tr><th>capital_gain</th><td>0</td><td>3103</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>capital_loss</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>hours_per_week</th><td>40</td><td>48</td><td>30</td><td>45</td><td>35</td></tr>\n",
"<tr><th>native_country</th><td>United-States</td><td>United-States</td><td>United-States</td><td>?</td><td>United-States</td></tr>\n",
"<tr><th>y</th><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th>prediction_sex</th><td>True</td><td>True</td><td>False</td><td>True</td><td>True</td></tr>\n",
"</tbody>\n", "</table>
"], "text/plain": [" 2 14 17 \\\n", "age 28 48 43 \n", "workclass Local-gov Private Private \n", "education Assoc-acdm HS-grad HS-grad \n", "education_num 12 9 9 \n", "marital_status Married-civ-spouse Married-civ-spouse Married-civ-spouse \n", "occupation Protective-serv Machine-op-inspct Adm-clerical \n", "relationship Husband Husband Wife \n", "race White White White \n", "sex Male Male Female \n", "capital_gain 0 3103 0 \n", "capital_loss 0 0 0 \n", "hours_per_week 40 48 30 \n", "native_country United-States United-States United-States \n", "y 1 1 0 \n", "prediction_sex True True False \n", "\n", " 19 28 \n", "age 40 54 \n", "workclass Private Private \n", "education Doctorate HS-grad \n", "education_num 16 9 \n", "marital_status Married-civ-spouse Married-civ-spouse \n", "occupation Prof-specialty Craft-repair \n", "relationship Husband Husband \n", "race Asian-Pac-Islander White \n", "sex Male Male \n", "capital_gain 0 0 \n", "capital_loss 0 0 \n", "hours_per_week 45 35 \n", "native_country ? United-States \n", "y 1 0 \n", "prediction_sex True True "]}, "execution_count": 67, "metadata": {}, "output_type": "execute_result"}], "source": ["look = X_test.copy()\n", "look['y'] = y_test\n", "look['prediction_sex'] = model_sex.predict(X_test_sex)\n", "look[diff2].head().T"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il existe visible quelques observations \u00e0 v\u00e9rifier o\u00f9 la relation *relationship* et le genre *sex* semble \u00eatre en contradiction ou plut\u00f4t ne pas prendre en compte tous les types de relations possibles."]}, {"cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th>relationship</th><th>Husband</th><th>Not-in-family</th><th>Other-relative</th><th>Own-child</th><th>Unmarried</th><th>Wife</th></tr>\n", "<tr><th>sex</th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>Female</th><td>1</td><td>3875</td><td>430</td><td>2245</td><td>2654</td><td>1566</td></tr>\n",
"<tr><th>Male</th><td>13192</td><td>4430</td><td>551</td><td>2823</td><td>792</td><td>2</td></tr>\n",
"</tbody>\n", "</table>
"], "text/plain": ["relationship Husband Not-in-family Other-relative Own-child Unmarried \\\n", "sex \n", "Female 1 3875 430 2245 2654 \n", "Male 13192 4430 551 2823 792 \n", "\n", "relationship Wife \n", "sex \n", "Female 1566 \n", "Male 2 "]}, "execution_count": 68, "metadata": {}, "output_type": "execute_result"}], "source": ["X_train[['sex', 'relationship', 'age']].groupby(['sex', 'relationship'], as_index=False)\\\n", " .count().pivot('sex', 'relationship', 'age')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pas de conclusion \u00e0 ce stade. Il faut poursuivre l'exploration [Machine Learning \u00e9thique](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/td_2a_mlplus.html#machine-learning-ethique)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## S\u00e9lection des variables\n", "\n", "Il n'y a pas de m\u00e9thode optimale pour s\u00e9lectionner les variables. Il existe diff\u00e9rentes options comme celles propos\u00e9es par [scikit-learn/feature_selection](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection). Certaines partent des features brutes, d'autres utilisent le mod\u00e8le qui doit \u00eatre appris. Mais ce n'est pas toujours \u00e9vident de faire marcher ces m\u00e9thodes sur n'importe quel mod\u00e8le."]}, {"cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["could not convert string to float: 'State-gov'\n"]}], "source": ["from sklearn.feature_selection import RFE\n", "try:\n", " model = RFE(pipe2)\n", " model.fit(X_train, y_train)\n", "except Exception as e:\n", " print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Dans notre cas, on retire les variables une \u00e0 une en fonction de l'indicateur ``feature_importance`` ce qui n'est pas facile car les variables sont des modalit\u00e9s. 
Il faut en faire la somme..."]}, {"cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [{"data": {"text/html": ["