Convert a pipeline with a LightGBM regressor
============================================

.. index:: LightGBM

The discrepancies observed when using floats with the TreeEnsemble operator (see :ref:`l-example-discrepencies-float-double`) explain why the converter for *LGBMRegressor* may introduce significant discrepancies even when it is used with float tensors.

The *lightgbm* library is implemented with doubles. A random forest regressor with multiple trees computes its prediction by adding the prediction of every tree. After being converted into ONNX, this summation becomes :math:`\left[\sum\right]_{i=1}^F float(T_i(x))`, where *F* is the number of trees in the forest, :math:`T_i(x)` is the output of tree *i* and :math:`\left[\sum\right]` is a float addition. The discrepancy can be expressed as :math:`D(x) = |\left[\sum\right]_{i=1}^F float(T_i(x)) - \sum_{i=1}^F T_i(x)|`. It grows with the number of trees in the forest.

To reduce the impact, an option was added to split the node *TreeEnsembleRegressor* into multiple ones and to add their outputs in double. If we assume the node is split into *a* nodes of :math:`F/a` trees each, the discrepancy becomes :math:`D'(x) = |\sum_{k=1}^a \left[\sum\right]_{i=1}^{F/a} float(T_{(k-1)F/a + i}(x)) - \sum_{i=1}^F T_i(x)|`.

In 2022, :epkg:`onnx` and :epkg:`onnxruntime` updated the specifications of the TreeEnsemble operators, which now support double thresholds (see `TreeEnsembleRegressor v3 `_). That is the recommended option to reduce the discrepancies.
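
The effect of the accumulation type can be reproduced without any model. The following is a minimal sketch, not the converter's implementation: random numbers stand in for the tree outputs :math:`T_i(x)`, and the values of ``F`` and ``a`` as well as all variable names are assumptions made for the illustration. It computes :math:`D(x)` with a single float accumulator and :math:`D'(x)` with *a* float accumulators whose results are added in double.

.. code-block:: python

    import numpy

    rng = numpy.random.default_rng(0)
    F, a = 1000, 10  # assumed: 1000 trees split into 10 groups of 100
    tree_outputs = rng.normal(size=F)  # stand-in for T_i(x), stored in double
    exact = tree_outputs.sum()  # reference summation in double

    # Every tree output is cast to float32 before being accumulated.
    outputs32 = tree_outputs.astype(numpy.float32)

    # Single node: one float32 accumulator over all trees.
    acc = numpy.float32(0)
    for t in outputs32:
        acc = numpy.float32(acc + t)
    d_single = abs(float(acc) - exact)

    # Split: float32 accumulation inside each group, double across groups.
    partials = []
    for group in outputs32.reshape(a, F // a):
        acc_group = numpy.float32(0)
        for t in group:
            acc_group = numpy.float32(acc_group + t)
        partials.append(float(acc_group))
    d_split = abs(sum(partials) - exact)

    print(f"D(x)={d_single}, D'(x)={d_split}")

Both values depend on the random draw, so neither is guaranteed to be the smaller one for a given input. The split version only shortens each float accumulation.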

.. contents::
    :local:

Train a LGBMRegressor
+++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 44-69

.. code-block:: default

    import warnings
    import time
    import timeit
    from packaging.version import Version
    import numpy
    from pandas import DataFrame
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    from lightgbm import LGBMRegressor
    from onnxruntime import InferenceSession
    from skl2onnx import update_registered_converter
    from skl2onnx.common.shape_calculator import calculate_linear_regressor_output_shapes  # noqa
    from mlprodict.onnx_conv import to_onnx
    from onnxmltools import __version__ as oml_version
    from onnxmltools.convert.lightgbm.operator_converters.LightGbm import convert_lightgbm  # noqa

    N = 1000
    X = numpy.random.randn(N, 20)
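    # numpy.random.randint(0, 1, ...) excludes the upper bound and only
    # returns 0, so the second term vanishes and y is standard normal noise.
    y = (numpy.random.randn(N) +
         numpy.random.randn(N) * 100 *
         numpy.random.randint(0, 1, 1000))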

    reg = LGBMRegressor(n_estimators=1000)
    reg.fit(X, y)

.. GENERATED FROM PYTHON SOURCE LINES 70-81

Register the converter for LGBMRegressor
++++++++++++++++++++++++++++++++++++++++

The converter is implemented in :epkg:`onnxmltools`:
`onnxmltools...LightGbm.py `_,
and so is the shape calculator:
`onnxmltools...Regressor.py `_.

.. GENERATED FROM PYTHON SOURCE LINES 81-102

.. code-block:: default

    def skl2onnx_convert_lightgbm(scope, operator, container):
        options = scope.get_options(operator.raw_operator)
        if 'split' in options:
            if Version(oml_version) < Version('1.9.2'):
                warnings.warn(
                    "Option split was released in version 1.9.2 but %s is "
                    "installed. It will be ignored." % oml_version)
            operator.split = options['split']
        else:
            operator.split = None
        convert_lightgbm(scope, operator, container)


    update_registered_converter(
        LGBMRegressor, 'LightGbmLGBMRegressor',
        calculate_linear_regressor_output_shapes,
        skl2onnx_convert_lightgbm,
        options={'split': None})

.. GENERATED FROM PYTHON SOURCE LINES 103-109

Convert
+++++++

We convert the same model in two scenarios: a single
TreeEnsembleRegressor node, or several of them.
The *split* parameter is the number of trees per TreeEnsembleRegressor node.

.. GENERATED FROM PYTHON SOURCE LINES 109-117

.. code-block:: default

    model_onnx = to_onnx(reg, X[:1].astype(numpy.float32),
                         target_opset={'': 17, 'ai.onnx.ml': 3})
    model_onnx_split = to_onnx(reg, X[:1].astype(numpy.float32),
                               target_opset={'': 17, 'ai.onnx.ml': 3},
                               options={'split': 100})

.. GENERATED FROM PYTHON SOURCE LINES 118-120

We create another model using `ai.onnx.ml == 3` and double inputs.
Node thresholds are then stored as doubles and not as floats anymore.

.. GENERATED FROM PYTHON SOURCE LINES 120-125

.. code-block:: default

    model_onnx_64 = to_onnx(reg, X[:1].astype(numpy.float64),
                            target_opset={'': 17, 'ai.onnx.ml': 3},
                            rewrite_ops=True)

.. GENERATED FROM PYTHON SOURCE LINES 126-128

Discrepancies
+++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 128-146

.. code-block:: default

    sess = InferenceSession(model_onnx.SerializeToString(),
                            providers=['CPUExecutionProvider'])
    sess_split = InferenceSession(model_onnx_split.SerializeToString(),
                                  providers=['CPUExecutionProvider'])

    X32 = X.astype(numpy.float32)[:500]
    expected = reg.predict(X32)
    got = sess.run(None, {'X': X32})[0].ravel()
    got_split = sess_split.run(None, {'X': X32})[0].ravel()

    disp = numpy.abs(got - expected).sum()
    disc_split = numpy.abs(got_split - expected).sum()

    print(f"sum of discrepancies 1 node: {disp}")
    print(f"sum of discrepancies split node: {disc_split}, "
          f"ratio: {disp / disc_split}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    sum of discrepancies 1 node: 6.853524745783303e-05
    sum of discrepancies split node: 2.0830457483485063e-05, ratio: 3.2901460523452055

.. GENERATED FROM PYTHON SOURCE LINES 147-149

In the run above, the sum of the discrepancies is reduced by a factor
of about 3.3. The maximum is much better too.

.. GENERATED FROM PYTHON SOURCE LINES 149-156

.. code-block:: default

    disc = numpy.abs(got - expected).max()
    disc_split = numpy.abs(got_split - expected).max()

    print("max discrepancies 1 node", disc)
    print("max discrepancies split node", disc_split, "ratio:", disc / disc_split)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    max discrepancies 1 node 1.3420434359368016e-06
    max discrepancies split node 3.822674381481761e-07 ratio: 3.5107448398903207

.. GENERATED FROM PYTHON SOURCE LINES 157-160

Let's compare with the double thresholds.
We cast the inputs to float first and then to double
so that both models receive the same values.

.. GENERATED FROM PYTHON SOURCE LINES 160-171

.. code-block:: default

    sess_64 = InferenceSession(model_onnx_64.SerializeToString(),
                               providers=['CPUExecutionProvider'])

    X64 = X32.astype(numpy.float64)
    expected_64 = reg.predict(X64)
    got_64 = sess_64.run(None, {'X': X64})[0].ravel()
    disc_64 = numpy.abs(got_64 - expected_64).sum()
    disc_max64 = numpy.abs(got_64 - expected_64).max()
    print(f"sum of discrepancies with doubles: sum={disc_64}, max={disc_max64}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    sum of discrepancies with doubles: sum=9.069053242884979e-06, max=1.1771164043494764e-07

.. GENERATED FROM PYTHON SOURCE LINES 172-176

Processing time
+++++++++++++++

The split model is slower, but not by much.

.. GENERATED FROM PYTHON SOURCE LINES 176-187

.. code-block:: default

    print("processing time no split",
          timeit.timeit(
              lambda: sess.run(None, {'X': X32})[0], number=150))
    print("processing time no split with double",
          timeit.timeit(
              lambda: sess_64.run(None, {'X': X64})[0], number=150))
    print("processing time split",
          timeit.timeit(
              lambda: sess_split.run(None, {'X': X32})[0], number=150))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    processing time no split 0.773850008030422
    processing time no split with double 0.7656510029919446
    processing time split 0.8690498230280355

.. GENERATED FROM PYTHON SOURCE LINES 188-193

Split influence
+++++++++++++++

Let's see how the maximum discrepancy changes
with the parameter *split*.

.. GENERATED FROM PYTHON SOURCE LINES 193-215

.. code-block:: default

    res = []
    for i in tqdm(list(range(20, 170, 20)) + [200, 300, 400, 500]):
        model_onnx_split = to_onnx(reg, X[:1].astype(numpy.float32),
                                   target_opset={'': 17, 'ai.onnx.ml': 3},
                                   options={'split': i})
        times = []
        for _ in range(0, 4):
            begin = time.perf_counter()
            sess_split = InferenceSession(model_onnx_split.SerializeToString(),
                                          providers=['CPUExecutionProvider'])
            times.append(time.perf_counter() - begin)
        times.sort()
        got_split = sess_split.run(None, {'X': X32})[0].ravel()
        disc_split = numpy.abs(got_split - expected).max()
        res.append(dict(split=i, max_diff=disc_split,
                        time=sum(times[1:3]) / 2))

    df = DataFrame(res).set_index('split')
    df["baseline"] = disc
    df["baseline_64"] = disc_max64
    print(df)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0%| | 0/12 [00:00

.. GENERATED FROM PYTHON SOURCE LINES 228-237

Conclusion
++++++++++

The time curve is too noisy to draw a conclusion. More measures should be made.
The double summation reduces the discrepancies but increases the processing time.
It is a tradeoff. The best option is to use doubles for both thresholds and
summation, but it requires the latest definition of TreeEnsemble,
`ai.onnx.ml=3`.
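
One way to collect more measures is to repeat every timing several times and to keep a robust statistic such as the median. The helper ``measure`` below is hypothetical (it is not part of the original script) and only relies on the standard library. The lambda is a placeholder for a call such as ``sess.run(None, {'X': X32})``.

.. code-block:: python

    import statistics
    import timeit


    def measure(fn, repeat=10, number=5):
        """Returns (median, minimum) duration in seconds of one call to fn."""
        # timeit.repeat returns one total duration per repetition,
        # each repetition calls fn `number` times.
        runs = timeit.repeat(fn, repeat=repeat, number=number)
        per_call = [r / number for r in runs]
        return statistics.median(per_call), min(per_call)


    # Placeholder workload, to be replaced by the inference call to time.
    median_time, best_time = measure(lambda: sum(range(1000)))
    print(f"median={median_time:.2e}s, best={best_time:.2e}s")

The median is less sensitive to an occasional slow run than the mean, and the minimum gives a lower bound on what the machine can achieve.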

.. GENERATED FROM PYTHON SOURCE LINES 237-240

.. code-block:: default

    # plt.show()

**Total running time of the script:** ( 2 minutes 46.095 seconds)