{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Visualize a scikit-learn pipeline\n", "\n", "Pipelines can grow large with *scikit-learn*. Let's dig into a visual way to look at them."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["import warnings\n", "warnings.simplefilter(\"ignore\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Simple model\n", "\n", "Let's visualize a simple pipeline: a single model, not even trained."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["LogisticRegression()"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "from sklearn import datasets\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "iris = datasets.load_iris()\n", "X = iris.data[:, :4]\n", "df = pandas.DataFrame(X)\n", "df.columns = [\"X1\", \"X2\", \"X3\", \"X4\"]\n", "clf = LogisticRegression()\n", "clf"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The trick consists in converting the pipeline into a graph described in the [DOT](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) language."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["digraph{\n", " orientation=portrait;\n", " nodesep=0.05;\n", " ranksep=0.25;\n", " sch0[label=\" X1| X2| X3| X4\",shape=record,fontsize=8];\n", "\n", " node1[label=\"union\",shape=box,style=\"filled,rounded\",color=cyan,fontsize=12];\n", " sch0:f0 -> node1;\n", " sch0:f1 -> node1;\n", " sch0:f2 -> node1;\n", " sch0:f3 -> node1;\n", " sch1[label=\" -v-0\",shape=record,fontsize=8];\n", " node1 -> sch1:f0;\n", "\n", " node2[label=\"LogisticRegression\",shape=box,style=\"filled,rounded\",color=yellow,fontsize=12];\n", " sch1:f0 -> node2;\n", " sch2[label=\" 
PredictedLabel| Probabilities\",shape=record,fontsize=8];\n", " node2 -> sch2:f0;\n", " node2 -> sch2:f1;\n", "}\n"]}], "source": ["from mlinsights.plotting import pipeline2dot\n", "dot = pipeline2dot(clf, df)\n", "print(dot)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["It is a lot better with an image."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["dot_file = \"graph.dot\"\n", "with open(dot_file, \"w\", encoding=\"utf-8\") as f:\n", "    f.write(dot)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["# might be needed on Windows\n", "import sys\n", "import os\n", "if sys.platform.startswith(\"win\") and \"Graphviz\" not in os.environ[\"PATH\"]:\n", "    os.environ['PATH'] = os.environ['PATH'] + r';C:\\Program Files (x86)\\Graphviz2.38\\bin'"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[run_cmd] execute dot -G=300 -Tpng graph.dot -ograph.dot.png\n", "end of execution dot -G=300 -Tpng graph.dot -ograph.dot.png\n"]}], "source": ["from pyquickhelper.loghelper import run_cmd\n", "cmd = \"dot -G=300 -Tpng {0} -o{0}.png\".format(dot_file)\n", "run_cmd(cmd, wait=True, fLOG=print);"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["from PIL import Image\n", "img = Image.open(\"graph.dot.png\")\n", "img"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Complex pipeline\n", "\n", "*scikit-learn* introduced a couple of transformers to play with features in a single pipeline. 
The following example is taken from [Column Transformer with Mixed Types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py)."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["from sklearn import datasets\n", "from sklearn.linear_model import LogisticRegression, LinearRegression\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.impute import SimpleImputer"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["Pipeline(steps=[('preprocessor',\n", " ColumnTransformer(transformers=[('num',\n", " Pipeline(steps=[('imputer',\n", " SimpleImputer(strategy='median')),\n", " ('scaler',\n", " StandardScaler())]),\n", " ['age', 'fare']),\n", " ('cat',\n", " Pipeline(steps=[('imputer',\n", " SimpleImputer(fill_value='missing',\n", " strategy='constant')),\n", " ('onehot',\n", " OneHotEncoder(handle_unknown='ignore'))]),\n", " ['embarked', 'sex',\n", " 'pclass'])])),\n", " ('classifier', LogisticRegression())])"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["columns = ['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare',\n", " 'cabin', 'embarked', 'boat', 'body', 'home.dest']\n", "\n", "numeric_features = ['age', 'fare']\n", "numeric_transformer = Pipeline(steps=[\n", " ('imputer', SimpleImputer(strategy='median')),\n", " ('scaler', StandardScaler())])\n", "\n", "categorical_features = ['embarked', 'sex', 'pclass']\n", "categorical_transformer = Pipeline(steps=[\n", " ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n", " ('onehot', 
OneHotEncoder(handle_unknown='ignore'))])\n", "\n", "preprocessor = ColumnTransformer(\n", " transformers=[\n", " ('num', numeric_transformer, numeric_features),\n", " ('cat', categorical_transformer, categorical_features),\n", " ])\n", "\n", "clf = Pipeline(steps=[('preprocessor', preprocessor),\n", " ('classifier', LogisticRegression(solver='lbfgs'))])\n", "clf"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's first look at it as simplified text."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Pipeline\n", " ColumnTransformer\n", " Pipeline(age,fare)\n", " SimpleImputer\n", " StandardScaler\n", " Pipeline(embarked,sex,pclass)\n", " SimpleImputer\n", " OneHotEncoder\n", " LogisticRegression\n"]}], "source": ["from mlinsights.plotting import pipeline2str\n", "print(pipeline2str(clf))"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": ["dot = pipeline2dot(clf, columns)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": ["dot_file = \"graph2.dot\"\n", "with open(dot_file, \"w\", encoding=\"utf-8\") as f:\n", "    f.write(dot)"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[run_cmd] execute dot -G=300 -Tpng graph2.dot -ograph2.dot.png\n", "end of execution dot -G=300 -Tpng graph2.dot -ograph2.dot.png\n"]}], "source": ["cmd = \"dot -G=300 -Tpng {0} -o{0}.png\".format(dot_file)\n", "run_cmd(cmd, wait=True, fLOG=print);"]}, {"cell_type": "code", "execution_count": 16, "metadata": {"scrolled": false}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["img = Image.open(\"graph2.dot.png\")\n", "img"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## With JavaScript"]}, {"cell_type": "code", "execution_count": 17, 
"metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import RenderJsDot\n", "RenderJsDot(dot)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Example with FeatureUnion"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.pipeline import FeatureUnion\n", "from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures\n", "\n", "model = Pipeline([('poly', PolynomialFeatures()),\n", " ('union', FeatureUnion([\n", " ('scaler2', MinMaxScaler()),\n", " ('scaler3', StandardScaler())]))])\n", "dot = pipeline2dot(model, columns)\n", "RenderJsDot(dot)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Compute intermediate outputs\n", "\n", "It is difficult to access intermediate outputs with *scikit-learn*, but it can be useful to do so. The function [alter_pipeline_for_debugging](find://alter_pipeline_for_debugging) modifies the pipeline to intercept intermediate outputs."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["Pipeline(steps=[('scaler1', StandardScaler()),\n", " ('union',\n", " FeatureUnion(transformer_list=[('scaler2', StandardScaler()),\n", " ('scaler3', MinMaxScaler())])),\n", " ('lr', LinearRegression())])"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["from numpy.random import randn\n", "\n", "model = Pipeline([('scaler1', StandardScaler()),\n", " ('union', FeatureUnion([\n", " ('scaler2', StandardScaler()),\n", " ('scaler3', MinMaxScaler())])),\n", " ('lr', LinearRegression())])\n", "\n", "X = randn(4, 5)\n", "y = randn(4)\n", "model.fit(X, y)"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Pipeline\n", " StandardScaler\n", " FeatureUnion\n", " StandardScaler\n", " MinMaxScaler\n", " LinearRegression\n"]}], "source": ["print(pipeline2str(model))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's now modify the pipeline to get the intermediate outputs."]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": 
["from mlinsights.helpers.pipeline import alter_pipeline_for_debugging\n", "alter_pipeline_for_debugging(model)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The function adds a member ``_debug`` which stores inputs and outputs in every piece of the pipeline."]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["BaseEstimatorDebugInformation(StandardScaler)"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["model.steps[0][1]._debug"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 0.73619378, 0.87936142, -0.56528874, -0.2675163 ])"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["model.predict(X)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The member was populated with inputs and outputs."]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["BaseEstimatorDebugInformation(StandardScaler)\n", " transform(\n", " shape=(4, 5) type=\n", " [[ 1.22836841 2.35164607 -0.37367786 0.61490475 -0.45377634]\n", " [-0.77187962 0.43540786 0.20465106 0.8910651 -0.23104796]\n", " [-0.36750208 0.35154324 1.78609517 -1.59325463 1.51595267]\n", " [ 1.37547609 1.59470748 -0.5932628 0.57822003 0.56034736]]\n", " ) -> (\n", " shape=(4, 5) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 ]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065]\n", " [ 1.06462242 0.49297719 -0.91287374 0.45636275 0.27503446]]\n", " )"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["model.steps[0][1]._debug"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Every piece behaves the same way."]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"name": "stdout", "output_type": 
"stream", "text": ["(0,)\n", "BaseEstimatorDebugInformation(Pipeline)\n", " predict(\n", " shape=(4, 5) type=\n", " [[ 1.22836841 2.35164607 -0.37367786 0.61490475 -0.45377634]\n", " [-0.77187962 0.43540786 0.20465106 0.8910651 -0.23104796]\n", " [-0.36750208 0.35154324 1.78609517 -1.59325463 1.51595267]\n", " [ 1.37547609 1.59470748 -0.5932628 0.57822003 0.56034736]]\n", " ) -> (\n", " shape=(4,) type=\n", " [ 0.73619378 0.87936142 -0.56528874 -0.2675163 ]\n", " )\n", "(0, 0)\n", "BaseEstimatorDebugInformation(StandardScaler)\n", " transform(\n", " shape=(4, 5) type=\n", " [[ 1.22836841 2.35164607 -0.37367786 0.61490475 -0.45377634]\n", " [-0.77187962 0.43540786 0.20465106 0.8910651 -0.23104796]\n", " [-0.36750208 0.35154324 1.78609517 -1.59325463 1.51595267]\n", " [ 1.37547609 1.59470748 -0.5932628 0.57822003 0.56034736]]\n", " ) -> (\n", " shape=(4, 5) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 ]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065]\n", " [ 1.06462242 0.49297719 -0.91287374 0.45636275 0.27503446]]\n", " )\n", "(0, 1)\n", "BaseEstimatorDebugInformation(FeatureUnion)\n", " transform(\n", " shape=(4, 5) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 ]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065]\n", " [ 1.06462242 0.49297719 -0.91287374 0.45636275 0.27503446]]\n", " ) -> (\n", " shape=(4, 10) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861 0.93149357\n", " 1. 0.09228748 0.88883864 0. ]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 0.\n", " 0.04193015 0.33534839 1. 
0.11307564]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065 0.18831419\n", " ...\n", " )\n", "(0, 1, 0)\n", "BaseEstimatorDebugInformation(StandardScaler)\n", " transform(\n", " shape=(4, 5) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 ]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065]\n", " [ 1.06462242 0.49297719 -0.91287374 0.45636275 0.27503446]]\n", " ) -> (\n", " shape=(4, 5) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 ]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065]\n", " [ 1.06462242 0.49297719 -0.91287374 0.45636275 0.27503446]]\n", " )\n", "(0, 1, 1)\n", "BaseEstimatorDebugInformation(MinMaxScaler)\n", " transform(\n", " shape=(4, 5) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 ]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065]\n", " [ 1.06462242 0.49297719 -0.91287374 0.45636275 0.27503446]]\n", " ) -> (\n", " shape=(4, 5) type=\n", " [[0.93149357 1. 0.09228748 0.88883864 0. ]\n", " [0. 0.04193015 0.33534839 1. 0.11307564]\n", " [0.18831419 0. 1. 0. 1. ]\n", " [1. 0.62155016 0. 0.87407214 0.51485443]]\n", " )\n", "(0, 2)\n", "BaseEstimatorDebugInformation(LinearRegression)\n", " predict(\n", " shape=(4, 10) type=\n", " [[ 0.90946066 1.4000516 -0.67682808 0.49311806 -1.03765861 0.93149357\n", " 1. 0.09228748 0.88883864 0. ]\n", " [-1.20030006 -0.89626498 -0.05514595 0.76980985 -0.7493565 0.\n", " 0.04193015 0.33534839 1. 
0.11307564]\n", " [-0.77378303 -0.99676381 1.64484777 -1.71929067 1.51198065 0.18831419\n", " ...\n", " ) -> (\n", " shape=(4,) type=\n", " [ 0.73619378 0.87936142 -0.56528874 -0.2675163 ]\n", " )\n"]}], "source": ["from mlinsights.helpers.pipeline import enumerate_pipeline_models\n", "# use distinct loop names so that `model` and the builtin `vars` are not shadowed\n", "for coor, step, variables in enumerate_pipeline_models(model):\n", "    print(coor)\n", "    print(step._debug)"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5"}}, "nbformat": 4, "nbformat_minor": 2}