.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gyexamples/plot_gbegin_dataframe.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_gyexamples_plot_gbegin_dataframe.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gyexamples_plot_gbegin_dataframe.py:


Dataframe as an input
=====================

.. index:: dataframe

A pipeline usually ingests data as a matrix. A dataframe can be
converted into a matrix if all of its columns share the same type.
But data held in a dataframe usually have multiple types:
float, integer, or string for categories. ONNX also supports
that case.

.. contents::
    :local:

A dataset with categories
+++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 19-58

.. code-block:: default


    from mlinsights.plotting import pipeline2dot
    import numpy
    import pprint
    from mlprodict.onnx_conv import guess_schema_from_data
    from onnxruntime import InferenceSession
    from pyquickhelper.helpgen.graphviz_helper import plot_graphviz
    from mlprodict.onnxrt import OnnxInference
    from mlprodict.onnx_conv import to_onnx as to_onnx_ext
    from skl2onnx import to_onnx
    from pandas import DataFrame
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier


    data = DataFrame([
        dict(CAT1='a', CAT2='c', num1=0.5, num2=0.6, y=0),
        dict(CAT1='b', CAT2='d', num1=0.4, num2=0.8, y=1),
        dict(CAT1='a', CAT2='d', num1=0.5, num2=0.56, y=0),
        dict(CAT1='a', CAT2='d', num1=0.55, num2=0.56, y=1),
        dict(CAT1='a', CAT2='c', num1=0.35, num2=0.86, y=0),
        dict(CAT1='a', CAT2='c', num1=0.5, num2=0.68, y=1),
    ])

    cat_cols = ['CAT1', 'CAT2']
    train_data = data.drop('y', axis=1)


    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer, cat_cols)],
        remainder='passthrough')
    pipe = Pipeline([('preprocess', preprocessor),
                     ('rf', RandomForestClassifier())])
    pipe.fit(train_data, data['y'])



.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Pipeline(steps=[('preprocess',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('cat',
                                                      Pipeline(steps=[('onehot',
                                                                       OneHotEncoder(handle_unknown='ignore',
                                                                                     sparse=False))]),
                                                      ['CAT1', 'CAT2'])])),
                    ('rf', RandomForestClassifier())])


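The introduction states that a dataframe only converts cleanly into a
matrix when all columns share one type. The following is a minimal
sketch of that point, not part of the generated example: the two-row
dataframe ``sample`` and the variables ``mixed`` and ``numeric`` are
hypothetical names introduced for illustration only.

.. code-block:: python

    import numpy
    from pandas import DataFrame

    # Hypothetical two-row sample mixing a string column and a float column.
    sample = DataFrame([
        dict(CAT1='a', num1=0.5),
        dict(CAT1='b', num1=0.4),
    ])

    # Inside the dataframe, each column keeps its own dtype.
    print(sample.dtypes)

    # Converting every column at once falls back to a generic object array,
    # whereas the numeric column alone converts to a float matrix.
    mixed = sample.to_numpy()
    numeric = sample[['num1']].to_numpy()
    print(mixed.dtype, numeric.dtype)

The fallback to ``object`` is why a single typed tensor cannot represent
such a dataframe.
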
.. GENERATED FROM PYTHON SOURCE LINES 59-60

Display.

.. GENERATED FROM PYTHON SOURCE LINES 60-66

.. code-block:: default


    dot = pipeline2dot(pipe, train_data)
    ax = plot_graphviz(dot)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)



.. image-sg:: /gyexamples/images/sphx_glr_plot_gbegin_dataframe_001.png
   :alt: plot gbegin dataframe
   :srcset: /gyexamples/images/sphx_glr_plot_gbegin_dataframe_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 67-71

Conversion to ONNX
++++++++++++++++++

Function *to_onnx* does not handle dataframes.

.. GENERATED FROM PYTHON SOURCE LINES 71-78

.. code-block:: default



    try:
        onx = to_onnx(pipe, train_data[:1])
    except NotImplementedError as e:
        print(e)



.. GENERATED FROM PYTHON SOURCE LINES 79-80

But it is possible to use an extended one.

.. GENERATED FROM PYTHON SOURCE LINES 80-86

.. code-block:: default



    onx = to_onnx_ext(
        pipe, train_data[:1],
        options={RandomForestClassifier: {'zipmap': False}})


.. GENERATED FROM PYTHON SOURCE LINES 87-89

Graph
+++++

.. GENERATED FROM PYTHON SOURCE LINES 89-97

.. code-block:: default



    oinf = OnnxInference(onx)
    ax = plot_graphviz(oinf.to_dot())
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)



.. image-sg:: /gyexamples/images/sphx_glr_plot_gbegin_dataframe_002.png
   :alt: plot gbegin dataframe
   :srcset: /gyexamples/images/sphx_glr_plot_gbegin_dataframe_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 98-102

Prediction with ONNX
++++++++++++++++++++

*onnxruntime* does not support dataframes.

.. GENERATED FROM PYTHON SOURCE LINES 102-111

.. code-block:: default



    sess = InferenceSession(onx.SerializeToString(),
                            providers=['CPUExecutionProvider'])
    try:
        sess.run(None, train_data)
    except Exception as e:
        print(e)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    run(): incompatible function arguments. The following argument types are supported:
        1. (self: onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession, arg0: List[str], arg1: Dict[str, object], arg2: onnxruntime.capi.onnxruntime_pybind11_state.RunOptions) -> List[object]

    Invoked with: , ['label', 'probabilities'],   CAT1 CAT2  num1  num2
    0    a    c  0.50  0.60
    1    b    d  0.40  0.80
    2    a    d  0.50  0.56
    3    a    d  0.55  0.56
    4    a    c  0.35  0.86
    5    a    c  0.50  0.68, None


.. GENERATED FROM PYTHON SOURCE LINES 112-113

Let's use a shortcut.

.. GENERATED FROM PYTHON SOURCE LINES 113-119

.. code-block:: default


    oinf = OnnxInference(onx)
    got = oinf.run(train_data)
    print(pipe.predict(train_data))
    print(got['label'])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 120-121

And probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 121-125

.. code-block:: default


    print(pipe.predict_proba(train_data))
    print(got['probabilities'])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [[0.78 0.22]
     [0.26 0.74]
     [0.69 0.31]
     [0.28 0.72]
     [0.73 0.27]
     [0.27 0.73]]
    [[0.78       0.22      ]
     [0.2600001  0.7399999 ]
     [0.69000006 0.30999997]
     [0.28000015 0.71999985]
     [0.73       0.26999995]
     [0.2700001  0.7299999 ]]


.. GENERATED FROM PYTHON SOURCE LINES 126-136

It looks ok. Let's dig into the details to directly use
*onnxruntime*.

Unhide conversion logic with a dataframe
++++++++++++++++++++++++++++++++++++++++

A dataframe can be seen as a set of columns with
different types. That's what ONNX should see:
a list of inputs, where each input name is a column name
and each input type is that column's type.

.. GENERATED FROM PYTHON SOURCE LINES 136-142

.. code-block:: default



    init = guess_schema_from_data(train_data)

    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', DoubleTensorType(shape=[None, 1])),
     ('num2', DoubleTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 143-144

Let's use float instead of double.

.. GENERATED FROM PYTHON SOURCE LINES 144-153

.. code-block:: default



    for c in train_data.columns:
        if c not in cat_cols:
            train_data[c] = train_data[c].astype(numpy.float32)

    init = guess_schema_from_data(train_data)
    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', FloatTensorType(shape=[None, 1])),
     ('num2', FloatTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 154-155

Let's convert with *skl2onnx* only.

.. GENERATED FROM PYTHON SOURCE LINES 155-160

.. code-block:: default


    onx2 = to_onnx(
        pipe, initial_types=init,
        options={RandomForestClassifier: {'zipmap': False}})


.. GENERATED FROM PYTHON SOURCE LINES 161-165

Let's run it with *onnxruntime*.
The dataframe must first be converted into a dictionary
where column names become keys and column values become
values, each reshaped into a single-column array.

.. GENERATED FROM PYTHON SOURCE LINES 165-170

.. code-block:: default


    inputs = {c: train_data[c].values.reshape((-1, 1))
              for c in train_data.columns}
    pprint.pprint(inputs)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'CAT1': array([['a'],
           ['b'],
           ['a'],
           ['a'],
           ['a'],
           ['a']], dtype=object),
     'CAT2': array([['c'],
           ['d'],
           ['d'],
           ['d'],
           ['c'],
           ['c']], dtype=object),
     'num1': array([[0.5 ],
           [0.4 ],
           [0.5 ],
           [0.55],
           [0.35],
           [0.5 ]], dtype=float32),
     'num2': array([[0.6 ],
           [0.8 ],
           [0.56],
           [0.56],
           [0.86],
           [0.68]], dtype=float32)}


.. GENERATED FROM PYTHON SOURCE LINES 171-172

Inference.

.. GENERATED FROM PYTHON SOURCE LINES 172-181

.. code-block:: default



    sess2 = InferenceSession(onx2.SerializeToString(),
                             providers=['CPUExecutionProvider'])

    got2 = sess2.run(None, inputs)

    print(pipe.predict(train_data))
    print(got2[0])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 182-183

And probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 183-186

.. code-block:: default


    print(pipe.predict_proba(train_data))
    print(got2[1])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [[0.78 0.22]
     [0.26 0.74]
     [0.69 0.31]
     [0.28 0.72]
     [0.73 0.27]
     [0.27 0.73]]
    [[0.78       0.22000003]
     [0.2600004  0.7399996 ]
     [0.69000006 0.30999997]
     [0.2800004  0.7199996 ]
     [0.73       0.27      ]
     [0.2700004  0.7299996 ]]


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 3.322 seconds)


.. _sphx_glr_download_gyexamples_plot_gbegin_dataframe.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_gbegin_dataframe.py <plot_gbegin_dataframe.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_gbegin_dataframe.ipynb <plot_gbegin_dataframe.ipynb>`


.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_