Train a linear regression with onnxruntime-training#

This example explores how onnxruntime-training can be used to train a simple linear regression with gradient descent. It compares the results with those obtained by sklearn.linear_model.SGDRegressor.

A simple linear regression with scikit-learn#

from pprint import pprint
import numpy
import onnx
from pandas import DataFrame
from onnxruntime import (
    InferenceSession, get_device)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.neural_network import MLPRegressor
from mlprodict.onnx_conv import to_onnx
from onnxcustom.plotting.plotting_onnx import plot_onnxs
from onnxcustom.utils.orttraining_helper import (
    add_loss_output, get_train_initializer)
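# module path missing in the source; assumed from the onnxcustom package layout
from onnxcustom.training.optimizers import OrtGradientOptimizer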

X, y = make_regression(n_features=2, bias=2)
X = X.astype(numpy.float32)
y = y.astype(numpy.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y)

lr = SGDRegressor(l1_ratio=0, max_iter=200, eta0=5e-2)
lr.fit(X, y)
# show the first five predictions
print(lr.predict(X[:5]))


[  72.08101464   22.88780946  -69.31312963   28.00639101 -213.67235927]

The trained coefficients are:

print("trained coefficients:", lr.coef_, lr.intercept_)


trained coefficients: [57.97554439 97.36608314] [1.99956221]

However, this model does not expose the training curve, so we switch to sklearn.neural_network.MLPRegressor. With no hidden layer and the identity activation, it is still a linear regression.

lr = MLPRegressor(hidden_layer_sizes=tuple(),
                  activation='identity', max_iter=200,
                  batch_size=10, solver='sgd',
                  alpha=0, learning_rate_init=1e-2,
                  momentum=0, nesterovs_momentum=False)
lr.fit(X, y)
# show the first five predictions
print(lr.predict(X[:5]))


somewhere/workspace/onnxcustom/onnxcustom_UT_39_std/_venv/lib/python3.9/site-packages/sklearn/neural_network/ ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
[  72.08537    22.891481  -69.32091    28.008839 -213.6913  ]

The trained coefficients are:

print("trained coefficients:", lr.coefs_, lr.intercepts_)


trained coefficients: [array([[57.979527],
       [97.37574 ]], dtype=float32)] [array([1.9999899], dtype=float32)]
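
An MLPRegressor without hidden layers and with the identity activation reduces to a single affine map. The short numpy sketch below illustrates this, reusing the coefficients printed above; the inputs `X_demo` and the helper `predict_linear` are hypothetical names introduced only for this illustration.

```python
import numpy

# Values copied from the output printed above.
coef = numpy.array([[57.979527], [97.37574]], dtype=numpy.float32)
intercept = numpy.array([1.9999899], dtype=numpy.float32)


def predict_linear(X, coef, intercept):
    """Forward pass of a network with no hidden layer and identity activation."""
    return X @ coef + intercept


# Hypothetical inputs, chosen only for illustration.
X_demo = numpy.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
                     dtype=numpy.float32)
pred = predict_linear(X_demo, coef, intercept)
print(pred.ravel())
```

The first row returns the intercept alone, and each unit vector adds one coefficient to it, which is exactly what a linear regression does.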

ONNX graph#

Training with onnxruntime-training starts with an ONNX graph which defines the model to learn. It is obtained by simply converting the previous linear regression into ONNX.

onx = to_onnx(lr, X_train[:1].astype(numpy.float32), target_opset=15)

Choosing a loss#

The training requires a loss function. By default, it is the squared error, but it could be the absolute error or include regularization. The function add_loss_output appends the loss to the ONNX graph.

onx_train = add_loss_output(onx)
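
As a rough illustration of what a squared error loss computes, here is a numpy sketch. It assumes the loss is the sum of squared differences returned as a (1, 1) array, a reading suggested by the inference check in this example, which divides the loss by the number of rows. It is not the code of add_loss_output, and `squared_error_loss` is a hypothetical helper.

```python
import numpy


def squared_error_loss(pred, label):
    """Sum of squared differences, returned as a (1, 1) array (assumed layout)."""
    diff = pred - label
    return numpy.sum(diff ** 2).reshape((1, 1))


# Hypothetical predictions and labels, one column each.
pred_demo = numpy.array([[1.0], [2.0], [3.0]], dtype=numpy.float32)
label_demo = numpy.array([[1.0], [1.0], [1.0]], dtype=numpy.float32)

loss = squared_error_loss(pred_demo, label_demo)
# Dividing by the number of rows turns the sum into a mean squared error.
mean_loss = loss[0, 0] / pred_demo.shape[0]
print(loss, mean_loss)
```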

plot_onnxs(onx, onx_train,
           title=['Linear Regression',
                  'Linear Regression + Loss with ONNX'])
Linear Regression, Linear Regression + Loss with ONNX


array([<AxesSubplot:title={'center':'Linear Regression'}>,
       <AxesSubplot:title={'center':'Linear Regression + Loss with ONNX'}>],

Let’s check inference is working.

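sess = InferenceSession(onx_train.SerializeToString(),
                        # explicit provider, needed by recent onnxruntime builds
                        providers=['CPUExecutionProvider'])
res = sess.run(None, {'X': X_test, 'label': y_test.reshape((-1, 1))})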
print("onnx loss=%r" % (res[0][0, 0] / X_test.shape[0]))


onnx loss=1.663961484155152e-08


Every initializer is a set of weights which can be trained, and a gradient is computed for each of them. However, an initializer used to modify a shape or to extract part of a tensor does not need training. Let’s remove those from the list of initializers to train.

inits = get_train_initializer(onx)
weights = {k: v for k, v in inits.items() if k != "shape_tensor"}
pprint(list((k, v[0].shape) for k, v in weights.items()))


[('coefficient', (2, 1)), ('intercepts', (1, 1))]

Train on CPU or GPU if available#

device = "cuda" if get_device().upper() == 'GPU' else 'cpu'
print("device=%r get_device()=%r" % (device, get_device()))


device='cpu' get_device()='CPU'

Stochastic Gradient Descent#

The training logic is hidden in class OrtGradientOptimizer. It follows the scikit-learn API (see SGDRegressor). The gradient graph is not available at this stage.

train_session = OrtGradientOptimizer(
    onx_train, list(weights), device=device, verbose=1, learning_rate=1e-2,
    warm_start=False, max_iter=200, batch_size=10,
    saved_gradient="saved_gradient.onnx")
train_session.fit(X, y)


  0%|          | 0/200 [00:00<?, ?it/s]
 11%|#1        | 22/200 [00:00<00:00, 211.32it/s]
 22%|##2       | 44/200 [00:00<00:00, 210.16it/s]
 33%|###3      | 66/200 [00:00<00:00, 209.81it/s]
 44%|####3     | 87/200 [00:00<00:00, 209.63it/s]
 55%|#####4    | 109/200 [00:00<00:00, 210.14it/s]
 66%|######5   | 131/200 [00:00<00:00, 210.07it/s]
 76%|#######6  | 153/200 [00:00<00:00, 210.22it/s]
 88%|########7 | 175/200 [00:00<00:00, 210.38it/s]
 98%|#########8| 197/200 [00:00<00:00, 210.23it/s]
100%|##########| 200/200 [00:00<00:00, 209.78it/s]

OrtGradientOptimizer(model_onnx='ir_version...', weights_to_train=['coefficient', 'intercepts'], loss_output_name='loss', max_iter=200, training_optimizer_name='SGDOptimizer', batch_size=10, learning_rate=LearningRateSGD(eta0=0.01, alpha=0.0001, power_t=0.25, learning_rate='invscaling'), value=0.0026591479484724943, device='cpu', warm_start=False, verbose=1, validation_every=20, saved_gradient='saved_gradient.onnx', sample_weight_name='weight')

And the trained coefficients are:

state_tensors = train_session.get_state()
pprint(["trained coefficients:", state_tensors])
print("last_losses:", train_session.train_losses_[-5:])

min_length = min(len(train_session.train_losses_), len(lr.loss_curve_))
df = DataFrame({'ort losses': train_session.train_losses_[:min_length],
                'skl losses': lr.loss_curve_[:min_length]})
df.plot(title="Train loss against iterations")
Train loss against iterations


['trained coefficients:',
 {'coefficient': array([[57.97954],
       [97.37549]], dtype=float32),
  'intercepts': array([[1.9999676]], dtype=float32)}]
last_losses: [1.7468278e-07, 1.9493348e-07, 2.0023383e-07, 1.9793268e-07, 2.1644743e-07]

<AxesSubplot:title={'center':'Train loss against iterations'}>
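
To make the loop behind such a loss curve concrete, here is a minimal numpy sketch of a mini-batch gradient descent for a linear regression. It is not the implementation of OrtGradientOptimizer: the function `sgd_linear_regression`, the constant learning rate, and the synthetic data are assumptions made for illustration. Only the parameter names `learning_rate`, `max_iter`, and `batch_size` mirror the ones used above.

```python
import numpy


def sgd_linear_regression(X, y, learning_rate=1e-2, max_iter=200,
                          batch_size=10, seed=0):
    """Mini-batch gradient descent on the mean squared error."""
    rng = numpy.random.default_rng(seed)
    n, d = X.shape
    coef = numpy.zeros((d, 1))
    intercept = numpy.zeros((1, 1))
    y = y.reshape((-1, 1))
    losses = []
    for _ in range(max_iter):
        order = rng.permutation(n)  # shuffle the rows at every iteration
        epoch_loss = 0.0
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            err = Xb @ coef + intercept - yb
            epoch_loss += float(numpy.sum(err ** 2))
            # gradient of the mean squared error over the batch
            grad_coef = 2 * Xb.T @ err / len(idx)
            grad_intercept = 2 * numpy.mean(err, axis=0, keepdims=True)
            coef -= learning_rate * grad_coef
            intercept -= learning_rate * grad_intercept
        losses.append(epoch_loss / n)
    return coef, intercept, losses


# Synthetic, noise-free data (hypothetical values).
rng = numpy.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))
true_coef = numpy.array([[3.0], [-2.0]])
y_demo = (X_demo @ true_coef + 2.0).ravel()

coef, intercept, losses = sgd_linear_regression(X_demo, y_demo)
print(coef.ravel(), intercept.ravel(), losses[-1])
```

The list `losses` plays the same role as `train_losses_` above: one averaged value per pass over the data.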

The training graph looks like the following:

with open("", "rb") as f:
    graph = onnx.load(f)
    for inode, node in enumerate(graph.graph.node):
        if '' in node.output:
            for i in range(len(node.output)):
                if node.output[i] == "":
                    node.output[i] = "n%d-%d" % (inode, i)

plot_onnxs(graph, title='Training graph')
Training graph


<AxesSubplot:title={'center':'Training graph'}>

The convergence speeds differ, but the two gradient descents do not update the gradient multiplier (the learning rate) the same way. onnxruntime-training does not implement any gradient descent; it only computes the gradient. Running the descent is the purpose of OrtGradientOptimizer. The next example digs into the implementation details.
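
The difference in learning rate updates can be sketched as follows. The repr of the optimizer above reports `learning_rate='invscaling'` with `eta0=0.01` and `power_t=0.25`, while the MLPRegressor call does not set `learning_rate`, so scikit-learn keeps its default constant rate. The snippet assumes the inverse scaling formula documented for scikit-learn's SGDRegressor, `eta0 / t ** power_t`; the helper names are hypothetical.

```python
def invscaling(eta0, power_t, t):
    """Inverse scaling schedule: the rate shrinks as the step counter t grows."""
    return eta0 / (t ** power_t)


def constant(eta0, t):
    """Constant schedule: the rate never changes."""
    return eta0


steps = [1, 16, 200]
inv_rates = [invscaling(0.01, 0.25, t) for t in steps]
const_rates = [constant(0.01, t) for t in steps]

for t, a, b in zip(steps, inv_rates, const_rates):
    print("t=%d invscaling=%.7f constant=%.7f" % (t, a, b))
```

Under that assumption, the rate at `t=200` agrees with the `value=0.0026591479484724943` shown in the repr after 200 iterations, about a quarter of the constant rate of 0.01.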

# import matplotlib.pyplot as plt

Total running time of the script: ( 0 minutes 7.621 seconds)