.. _l-bench-plot-onnxruntime-datasets-nul-reg: Benchmark (ONNX) for common datasets (regression) ================================================= .. contents:: :local: .. index:: onnxruntime, datasets, diabetes Overview ++++++++ The following graph plots the ratio between :epkg:`onnxruntime` and :epkg:`scikit-learn`. It looks into multiple models for a couple of datasets: `diabetes `_. It computes the prediction time for the following models: * *ADA*: ``AdaBoostRegressor()`` * *DT*: ``DecisionTreeRegressor(max_depth=6)`` * *GBT*: ``GradientBoostingRegressor(max_depth=6, n_estimators=100)`` * *KNN*: ``KNeighborsRegressor()`` * *KNN-cdist*: ``KNeighborsRegressor()``, the conversion to ONNX is run with option ``{'optim': 'cdist'}`` to use a specific operator to compute pairwise distances * *LGB*: ``LGBMRegressor(max_depth=6, n_estimators=100)`` * *LR*: ``LinearRegression(solver="liblinear", penalty="l2")`` * *MLP*: ``MLPRegressor()`` * *NuSVR*: ``NuSVC(probability=True)`` * *RF*: ``RandomForestRegressor(max_depth=6, n_estimators=100)`` * *SVR*: ``SVC(probability=True)`` * *XGB*: ``XGBRegressor(max_depth=6, n_estimators=100)`` The predictor follows a `StandardScaler `_ (or a `MinMaxScaler `_ if the model is a Naive Bayes one) in a pipeline if ``norm=True`` or is the only object is ``norm=False``. The pipeline looks like ``make_pipeline(StandardScaler(), estimator())``. Three runtimes are tested: * `skl`: :epkg:`scikit-learn`, * `ort`: :epkg:`onnxruntime`, * `pyrt`: :epkg:`mlprodict`, it relies on :epkg:`numpy` for most of the operators except trees and svm which use a modified version of the C++ code embedded in :epkg:`onnxruntime`, * :epkg:`pyrtc`: same runtime as *pyrt* but the graph logic is replaced by a function dynamically compiled when the ONNX file is loaded. .. plot:: import matplotlib.pyplot as plt import pandas from pymlbenchmark.plotting import plot_bench_xtime name = "../../onnx/results/bench_plot_datasets_num_reg.perf.csv" df = pandas.read_csv(name) fig, ax = plt.subplots(1, 3, figsize=(12, 5)) plot_bench_xtime(df[~df.norm], col_cols='dataset', hue_cols='model', fontsize=12, title="Numerical datasets - norm=False\nBenchmark scikit-learn / onnxruntime", ax=ax) fig.show() Graph X = number of observations to predict +++++++++++++++++++++++++++++++++++++++++++ .. plot:: import matplotlib.pyplot as plt import pandas from pymlbenchmark.plotting import plot_bench_results name = "../../onnx/results/bench_plot_datasets_num_reg.perf.csv" df = pandas.read_csv(name) plot_bench_results(df, row_cols='model', col_cols=('dataset', 'norm'), x_value='N', fontsize=24, title="Numerical datasets\nBenchmark scikit-learn / onnxruntime") plt.show() Graph computing time per observations +++++++++++++++++++++++++++++++++++++ The following graph shows the computing cost per observations depending on the batch size. :epkg:`scikit-learn` is clearly optimized for batch predictions (= training). .. plot:: import matplotlib.pyplot as plt import pandas from pymlbenchmark.plotting import plot_bench_results name = "../../onnx/results/bench_plot_datasets_num_reg.perf.csv" df = pandas.read_csv(name) for c in "min,max,mean,lower,upper,median".split(','): df[c] /= df['N'] plot_bench_results(df, row_cols='model', col_cols=('dataset', 'norm'), x_value='N', fontsize=24, title="Numerical datasets\nBenchmark scikit-learn / onnxruntime") plt.show() Graph of differences between scikit-learn and onnxruntime +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ .. plot:: import matplotlib.pyplot as plt import pandas from pymlbenchmark.plotting import plot_bench_results name = "../../onnx/results/bench_plot_datasets_num_reg.perf.csv" df = pandas.read_csv(name) plot_bench_results(df, row_cols='model', col_cols=('dataset', 'norm'), x_value='N', y_value='diff_ort', fontsize=24, err_value=('lower_diff_ort', 'upper_diff_ort'), title="Numerical datasets\Absolute difference scikit-learn / onnxruntime") plt.show() Graph of differences between scikit-learn and python runtime ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ .. plot:: import matplotlib.pyplot as plt import pandas from pymlbenchmark.plotting import plot_bench_results name = "../../onnx/results/bench_plot_datasets_num_reg.perf.csv" df = pandas.read_csv(name) plot_bench_results(df, row_cols='model', col_cols=('dataset', 'norm'), x_value='N', y_value='diff_pyrt', fontsize=24, err_value=('lower_diff_pyrt', 'upper_diff_pyrt'), title="Numerical datasets\Absolute difference scikit-learn / python runtime") plt.show() Configuration +++++++++++++ .. runpython:: :rst: :warningout: RuntimeWarning :showcode: from pyquickhelper.pandashelper import df2rst import pandas name = os.path.join(__WD__, "../../onnx/results/bench_plot_datasets_num_reg.time.csv") df = pandas.read_csv(name) print(df2rst(df, number_format=4)) Raw results +++++++++++ :download:`bench_plot_datasets_num.csv <../../onnx/results/bench_plot_datasets_num.perf.csv>` .. runpython:: :rst: :warningout: RuntimeWarning :showcode: :toggle: out from pyquickhelper.pandashelper import df2rst from pymlbenchmark.benchmark.bench_helper import bench_pivot import pandas name = os.path.join(__WD__, "../../onnx/results/bench_plot_datasets_num_reg.perf.csv") df = pandas.read_csv(name) print(df2rst(df, number_format=4)) Benchmark code ++++++++++++++ `bench_plot_datasets_num.py `_ .. literalinclude:: ../../onnx/bench_plot_datasets_num_reg.py :language: python