onnxruntime already parallelizes the computation across multiple cores when the execution runs on CPU only, and obviously on GPU. Recent machines have multiple GPUs but onnxruntime usually runs on a single GPU. These examples try to take advantage of that configuration. The first parallelizes the execution of the same model on every GPU; it assumes a single GPU can host the whole model. The second explores a way to split the model into pieces when the whole model does not fit on one single GPU. This is done with function split_onnx.
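The first strategy (one full copy of the model per GPU) can be sketched as a round-robin dispatch of input batches to per-device sessions. The snippet below is a minimal sketch only: `n_devices` is a hypothetical GPU count, and `make_session` is a stand-in for creating an `onnxruntime.InferenceSession` with `providers=[("CUDAExecutionProvider", {"device_id": i})]`, so the dispatch pattern can be shown without GPUs.

    from concurrent.futures import ThreadPoolExecutor

    n_devices = 2  # hypothetical number of GPUs


    def make_session(device_id):
        # Stand-in for onnxruntime.InferenceSession(model_bytes,
        # providers=[("CUDAExecutionProvider", {"device_id": device_id})]).
        def run(x):
            return [v * 2 for v in x]  # pretend inference
        return run


    sessions = [make_session(i) for i in range(n_devices)]


    def parallel_run(batches):
        # Round-robin: batch k goes to device k % n_devices.
        with ThreadPoolExecutor(max_workers=n_devices) as pool:
            futures = [pool.submit(sessions[k % n_devices], batch)
                       for k, batch in enumerate(batches)]
            return [f.result() for f in futures]


    results = parallel_run([[1, 2], [3, 4], [5, 6]])
    print(results)

With real sessions, each worker thread keeps its own `InferenceSession` bound to one device, so the copies run concurrently without sharing GPU state.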

The tutorial was tested with the following versions:


import sys
import numpy
import scipy
import onnx
import onnxruntime
import onnxcustom
import sklearn
import torch

print("python {}".format(sys.version_info))
mods = [numpy, scipy, sklearn, onnx,
        onnxruntime, onnxcustom, torch]
mods = [(m.__name__, m.__version__) for m in mods]
mx = max(len(_[0]) for _ in mods) + 1
for name, vers in sorted(mods):
    print("{}{}{}".format(name, " " * (mx - len(name)), vers))


    python sys.version_info(major=3, minor=9, micro=1, releaselevel='final', serial=0)
    numpy       1.24.1
    onnx        1.13.0
    onnxcustom  0.4.344
    onnxruntime 1.14.92+cpu
    scipy       1.10.0
    sklearn     1.2.0
    torch       1.13.1+cu117