====================================
Inference with onnxruntime in Python
====================================

.. contents::
    :local:

Simple case
===========

The main class is :epkg:`InferenceSession`. It loads an ONNX graph
and executes all the nodes in it.

.. runpython::
    :showcode:

    import numpy
    from onnxruntime import InferenceSession
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from skl2onnx import to_onnx

    # creation of an ONNX graph
    data = load_diabetes()
    X, y = data.data, data.target
    X_train, X_test, y_train, __ = train_test_split(X, y, random_state=11)
    clr = LinearRegression()
    clr.fit(X_train, y_train)

    model_def = to_onnx(clr, X_train)

    # InferenceSession only accepts a file name or the serialized
    # ONNX graph.
    sess = InferenceSession(model_def.SerializeToString())

    # Method run takes two inputs: the first one is
    # the list of desired outputs or None for all,
    # the second one is the input tensors in a dictionary.
    result = sess.run(None, {'X': X_test[:5]})
    print(result)

    with open("linreg_model.onnx", "wb") as f:
        f.write(model_def.SerializeToString())

And visually:

.. gdot::
    :script: DOT-SECTION

    import onnx
    from mlprodict.onnxrt import OnnxInference

    with open("linreg_model.onnx", "rb") as f:
        onnx_model = onnx.load(f)

    print("DOT-SECTION", OnnxInference(onnx_model).to_dot(recursive=True))

Some information about the graph, such as its inputs and outputs,
can be retrieved through the class :epkg:`InferenceSession`.

.. runpython::
    :showcode:

    from onnxruntime import InferenceSession

    sess = InferenceSession("linreg_model.onnx")

    for t in sess.get_inputs():
        print("input:", t.name, t.type, t.shape)

    for t in sess.get_outputs():
        print("output:", t.name, t.type, t.shape)

The class :epkg:`InferenceSession` is not picklable. It must be restored
from the ONNX file. The C API is slightly different. The C object is
stored in attribute `_sess`.

.. runpython::
    :showcode:

    import numpy
    from onnxruntime import InferenceSession, RunOptions

    X = numpy.random.randn(5, 10).astype(numpy.float64)

    sess = InferenceSession("linreg_model.onnx")
    names = [o.name for o in sess._sess.outputs_meta]
    ro = RunOptions()
    result = sess._sess.run(names, {'X': X}, ro)
    print(result)

Session Options
===============

Many options can change the behaviour of the class during predictions.
The first class is :epkg:`SessionOptions`. The next sections describe
some of its members. This class can also be used to profile the execution
or to adjust graph optimization; that is covered in further sections.
The next sections only give an overview; see classes
:epkg:`SessionOptions` and :epkg:`RunOptions` for the full list.

::

    from onnxruntime import InferenceSession, SessionOptions

    so = SessionOptions()
    # so.... =
    sess = InferenceSession(...., so)

logging
~~~~~~~

Parameters *log_severity_level* and *log_verbosity_level* may change the
verbosity level when the model is loaded. The logging during execution
can be modified with the same attributes but in class :epkg:`RunOptions`.
This class is given to method `run`.

memory
~~~~~~

:epkg:`onnxruntime` focuses on efficiency first, memory peaks second.
Depending on which one should take priority, the following members may be
changed to trade efficiency against memory usage.

* *enable_cpu_mem_arena*: Enables the memory arena on CPU. The arena may
  pre-allocate memory for future usage. Set this option to false if you
  don't want it. Default is true.
* *enable_mem_pattern*: Enables the memory pattern optimization.
  Default is true.
* *enable_mem_reuse*: Enables the memory reuse optimization.
  Default is true.

multithreading
~~~~~~~~~~~~~~

By default, :epkg:`onnxruntime` parallelizes the execution within every
node but does not run multiple nodes at the same time. That can be
changed with the following members; a sketch combining them with the
logging and memory options appears after the list.

* *inter_op_num_threads*: Sets the number of threads used to parallelize
  the execution of the graph (across nodes). Default is 0 to let
  onnxruntime choose.
* *intra_op_num_threads*: Sets the number of threads used to parallelize
  the execution within nodes. Default is 0 to let onnxruntime choose.
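The snippet below is only a minimal sketch, assuming the
``linreg_model.onnx`` file created in the first example. It combines the
logging, memory and threading options described above on a single
:epkg:`SessionOptions` instance; the chosen values are illustrative, not
recommendations.

::

    import numpy
    from onnxruntime import InferenceSession, RunOptions, SessionOptions

    so = SessionOptions()
    # logging: 0=VERBOSE, 1=INFO, 2=WARNING, 3=ERROR, 4=FATAL
    so.log_severity_level = 3
    # memory: disable the CPU arena to lower the memory peak
    so.enable_cpu_mem_arena = False
    # multithreading: one thread across nodes, two threads within a node
    so.inter_op_num_threads = 1
    so.intra_op_num_threads = 2

    sess = InferenceSession("linreg_model.onnx", so)

    # the same logging attributes exist on RunOptions
    # and apply to the execution itself
    ro = RunOptions()
    ro.log_severity_level = 3

    X = numpy.random.randn(5, 10).astype(numpy.float64)
    print(sess.run(None, {'X': X}, ro))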
extensions
~~~~~~~~~~

Method `register_custom_ops_library` of class :epkg:`SessionOptions`
registers a library implementing the runtime for custom nodes.
:epkg:`onnxruntime-extensions` is one of these extensions, mostly focusing
on text processing (tokenizers) or simple text manipulations. An example
can be seen in section :ref:`l-custom-runtime-extensions`.

Providers
=========

A provider is usually a list of implementations of ONNX operators for a
specific environment. `CPUExecutionProvider` provides implementations for
all operators on CPU. `CUDAExecutionProvider` does the same for GPU with
the CUDA drivers. The list of all providers depends on the compilation
settings. The list of available providers is the subset available on the
machine :epkg:`onnxruntime` is running on.

.. runpython::
    :showcode:

    import pprint
    import onnxruntime

    print("all providers")
    pprint.pprint(onnxruntime.get_all_providers())

    print("available providers")
    pprint.pprint(onnxruntime.get_available_providers())

:epkg:`onnxruntime` selects `CPUExecutionProvider` if it is the only one
available. It raises an exception if more providers are available.
In that case, the providers to use for the execution must be selected by
filling argument `providers`:

::

    sess = InferenceSession(
        ...
        providers=['CUDAExecutionProvider',  # first one takes precedence
                   'CPUExecutionProvider'],
        ...)

Not all operators are available in all providers; using several providers
may improve the processing time. Switching from one provider to another
may mean moving data from one memory manager to another, such as the
transition from CPU to CUDA or the other way around.
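As a minimal sketch, with the ``linreg_model.onnx`` file created above,
the next snippet requests the CPU provider explicitly and prints the
providers the session finally selected. `CUDAExecutionProvider` could be
listed first on a machine running the CUDA build.

::

    from onnxruntime import InferenceSession

    # explicit list: only the CPU provider is requested here
    sess = InferenceSession("linreg_model.onnx",
                            providers=['CPUExecutionProvider'])

    # the providers the session selected, in order of precedence
    print(sess.get_providers())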
Inference on a device different from CPU
========================================

By default, everything happens on CPU. The next lines show how to do the
computation on GPU with :epkg:`onnxruntime`. Method `run` expects numpy
arrays; another method is needed to use another device. The choice is not
unique. Example :ref:`benchmark-ort-api` shows which API is the fastest.

C_OrtValue
~~~~~~~~~~

Method `run_with_ort_values` works the same way as `run`. The next
example shows how to call the API with any OrtValue whatever the device
it is stored on.

.. runpython::
    :showcode:

    import numpy
    from onnxruntime import InferenceSession
    from onnxruntime.capi._pybind_state import (  # pylint: disable=E0611
        OrtDevice as C_OrtDevice, OrtValue as C_OrtValue, OrtMemType)

    sess = InferenceSession("linreg_model.onnx")

    X = numpy.random.randn(5, 10).astype(numpy.float64)
    device = C_OrtDevice(C_OrtDevice.cpu(), OrtMemType.DEFAULT, 0)
    ort_X = C_OrtValue.ortvalue_from_numpy(X, device)

    names = [o.name for o in sess._sess.outputs_meta]
    result = sess._sess.run_with_ort_values({'X': ort_X}, names, None)
    print(result[0].numpy())

IOBinding
~~~~~~~~~

This API is slower than the previous one but is convenient when not all
inputs change between two calls to the API. It relies on an intermediate
structure :epkg:`SessionIOBinding`. The structure is used to bind an
array to an input name, given its shape, its type and its address.

.. runpython::
    :showcode:

    import numpy
    from onnxruntime import InferenceSession
    from onnxruntime.capi._pybind_state import (  # pylint: disable=E0611
        OrtDevice as C_OrtDevice, OrtValue as C_OrtValue, OrtMemType,
        SessionIOBinding)

    sess = InferenceSession("linreg_model.onnx")
    X = numpy.random.randn(5, 10).astype(numpy.float64)

    bind = SessionIOBinding(sess._sess)
    device = C_OrtDevice(C_OrtDevice.cpu(), OrtMemType.DEFAULT, 0)

    # Next line binds the array to the input name.
    bind.bind_input('X', device, X.dtype, X.shape,
                    X.__array_interface__['data'][0])

    # This line tells on which device the result should be stored.
    bind.bind_output('variable', device)

    # Inference.
    sess._sess.run_with_iobinding(bind, None)

    # Next line retrieves the outputs as a list of OrtValue.
    result = bind.get_outputs()

    # Conversion to numpy to see the result.
    print(result[0].numpy())

When the input is an OrtValue, another method is available.

.. runpython::
    :showcode:

    import numpy
    from onnxruntime import InferenceSession
    from onnxruntime.capi._pybind_state import (  # pylint: disable=E0611
        OrtDevice as C_OrtDevice, OrtValue as C_OrtValue, OrtMemType,
        SessionIOBinding)

    sess = InferenceSession("linreg_model.onnx")
    X = numpy.random.randn(5, 10).astype(numpy.float64)

    bind = SessionIOBinding(sess._sess)
    device = C_OrtDevice(C_OrtDevice.cpu(), OrtMemType.DEFAULT, 0)

    # Next line was changed: the input is bound as an OrtValue.
    ort_X = C_OrtValue.ortvalue_from_numpy(X, device)
    bind.bind_ortvalue_input('X', ort_X)

    bind.bind_output('variable', device)
    sess._sess.run_with_iobinding(bind, None)
    result = bind.get_outputs()
    print(result[0].numpy())

The last example binds the output to avoid a copy of the results. It
gives an existing and allocated OrtValue which receives this output, as
if it were done inplace.

.. runpython::
    :showcode:

    import numpy
    from onnxruntime import InferenceSession
    from onnxruntime.capi._pybind_state import (  # pylint: disable=E0611
        OrtDevice as C_OrtDevice, OrtValue as C_OrtValue, OrtMemType,
        SessionIOBinding)

    sess = InferenceSession("linreg_model.onnx")
    X = numpy.random.randn(5, 10).astype(numpy.float64)
    prediction = numpy.random.randn(5, 1).astype(numpy.float64)

    bind = SessionIOBinding(sess._sess)
    device = C_OrtDevice(C_OrtDevice.cpu(), OrtMemType.DEFAULT, 0)

    ort_X = C_OrtValue.ortvalue_from_numpy(X, device)
    bind.bind_ortvalue_input('X', ort_X)

    # This line tells on which device the result should be stored.
    ort_prediction = C_OrtValue.ortvalue_from_numpy(prediction, device)
    bind.bind_ortvalue_output('variable', ort_prediction)

    # Inference.
    sess._sess.run_with_iobinding(bind, None)

    # Result.
    print(prediction)
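The three examples above rely on the low-level binding object stored in
``sess._sess``. As a minimal sketch, assuming the same
``linreg_model.onnx`` file and its output name ``variable``, the public
API of :epkg:`InferenceSession` exposes the same mechanism through
method `io_binding`.

::

    import numpy
    from onnxruntime import InferenceSession

    sess = InferenceSession("linreg_model.onnx")
    X = numpy.random.randn(5, 10).astype(numpy.float64)

    # io_binding returns a wrapper around SessionIOBinding.
    binding = sess.io_binding()
    binding.bind_cpu_input('X', X)

    # The output stays on CPU, onnxruntime allocates it.
    binding.bind_output('variable')

    sess.run_with_iobinding(binding)

    # Copies the bound outputs back as numpy arrays.
    print(binding.copy_outputs_to_cpu()[0])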
Profiling
=========

:epkg:`onnxruntime` offers the possibility to profile the execution of a
graph. It measures the time spent in each operator. The user starts the
profiling when creating an instance of :epkg:`InferenceSession` and stops
it with method `end_profiling`. The results are stored in a json file
whose name is returned by the method. The end of the example uses a tool
to convert the json into a table.

.. runpython::
    :showcode:
    :warningout: DeprecationWarning

    import json
    import numpy
    from pandas import DataFrame
    from onnxruntime import InferenceSession, RunOptions, SessionOptions
    from sklearn.datasets import make_classification
    from sklearn.cluster import KMeans
    from skl2onnx import to_onnx
    from mlprodict.onnxrt.ops_whole.session import OnnxWholeSession

    # creation of an ONNX graph.
    X, y = make_classification(100000)
    km = KMeans(max_iter=10)
    km.fit(X)
    onx = to_onnx(km, X[:1].astype(numpy.float32))

    # creation of a session that enables the profiling
    so = SessionOptions()
    so.enable_profiling = True
    sess = InferenceSession(onx.SerializeToString(), so)

    # execution
    for i in range(0, 111):
        sess.run(None, {'X': X.astype(numpy.float32)})

    # profiling ends
    prof = sess.end_profiling()
    # and is collected in that file:
    print(prof)

    # what does it look like?
    with open(prof, "r") as f:
        js = json.load(f)
    print(js[:3])

    # a tool to convert it into a table
    df = DataFrame(OnnxWholeSession.process_profiling(js))
    # it has the following columns
    print(df.columns)
    # and looks this way
    print(df.head(n=10))
    df.to_csv("inference_profiling.csv", index=False)

.. plot::
    :include-source:

    import os
    import pandas
    import matplotlib.pyplot as plt

    full_name = os.path.normpath(os.path.abspath(
        os.path.join("..", "..", "inference_profiling.csv")))
    df = pandas.read_csv(full_name)

    # but a graph is usually better...
    gr_dur = df[['dur', "args_op_name"]].groupby(
        "args_op_name").sum().sort_values('dur')
    gr_n = df[['dur', "args_op_name"]].groupby(
        "args_op_name").count().sort_values('dur')
    gr_n = gr_n.loc[gr_dur.index, :]

    fig, ax = plt.subplots(1, 3, figsize=(8, 4))
    gr_dur.plot.barh(ax=ax[0])
    gr_dur /= gr_dur['dur'].sum()
    gr_dur.plot.barh(ax=ax[1])
    gr_n.plot.barh(ax=ax[2])
    ax[0].set_title("duration")
    ax[1].set_title("proportion")
    ax[2].set_title("n occurrences")
    for a in ax:
        a.legend().set_visible(False)
    plt.show()

Another example can be found in the tutorial: :ref:`l-profile-ort-api`.

Graph Optimisations
===================

By default, :epkg:`onnxruntime` optimizes an ONNX graph as much as it
can. It removes every node it can, merges duplicated initializers, and
fuses nodes into more complex but more efficient nodes, such as
*FusedMatMul*, which handles the transposition as well. There are four
levels of optimization and the final graph can be saved to disk to be
inspected.

::

    so = SessionOptions()
    so.graph_optimization_level = GraphOptimizationLevel.ORT_DISABLE_ALL
    # or GraphOptimizationLevel.ORT_ENABLE_BASIC
    # or GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    # or GraphOptimizationLevel.ORT_ENABLE_ALL
    so.optimized_model_filepath = "to_save_the_optimized_onnx_file.onnx"

The bigger the graph is, the more effective the optimizations are.
One example shows how to enable or disable optimizations on a simple
graph: :ref:`benchmark-ort-onnx-graph-opt`.

Class :epkg:`InferenceSession`, as any other class from
:epkg:`onnxruntime`, cannot be pickled. Everything must be created again
from the ONNX file it loads. It also means the graph optimizations are
computed again. To speed up the process, the optimized graph can be
saved and, next time, loaded with optimizations disabled. That saves the
optimization time.
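As a minimal sketch of that pattern, with an illustrative file name, the
first session below saves the optimized graph and the second one reloads
it with optimizations disabled.

::

    from onnxruntime import (
        GraphOptimizationLevel, InferenceSession, SessionOptions)

    # first run: optimize the graph once and save the optimized version
    so = SessionOptions()
    so.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
    so.optimized_model_filepath = "linreg_model.optimized.onnx"
    sess = InferenceSession("linreg_model.onnx", so)

    # later runs: load the already optimized graph and skip the optimization
    so2 = SessionOptions()
    so2.graph_optimization_level = GraphOptimizationLevel.ORT_DISABLE_ALL
    sess2 = InferenceSession("linreg_model.optimized.onnx", so2)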