BatchNormalization
BatchNormalization - 15
Version
domain: main
since_version: 15
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 15.
Summary
Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode it is being run in, there are five required inputs: ‘X’, ‘scale’, ‘B’, ‘input_mean’ and ‘input_var’. Note that ‘input_mean’ and ‘input_var’ are expected to be the estimated statistics in inference mode (training_mode=False, the default), and the running statistics in training mode (training_mode=True). There are multiple cases for the number of outputs, which we list below:
Output case #1: Y, running_mean, running_var (training_mode=True)
Output case #2: Y (training_mode=False)
When training_mode=False, extra outputs are invalid. The outputs are updated as follows when training_mode=True:
running_mean = input_mean * momentum + current_mean * (1 - momentum)
running_var = input_var * momentum + current_var * (1 - momentum)
Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
where:
current_mean = ReduceMean(X, axis=all_except_channel_index)
current_var = ReduceVar(X, axis=all_except_channel_index)
Notice that ReduceVar refers to the population variance, and it equals
sum(sqrd(x_i - x_avg)) / N
where N is the population size (this formula does not use the sample size N - 1).
The computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs.
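As an illustration only, the training-mode computation above can be written out in NumPy. This is a minimal sketch, not the normative definition: it assumes an (N x C x D1 … Dn) input with the channel axis at position 1, and relies on np.var computing the population variance (dividing by N, not N - 1):

import numpy as np

def batchnorm_training(X, scale, B, input_mean, input_var, epsilon=1e-5, momentum=0.9):
    # Reduce over every axis except the channel axis (axis 1).
    axes = tuple(i for i in range(X.ndim) if i != 1)
    current_mean = X.mean(axis=axes)
    current_var = X.var(axis=axes)  # population variance: divides by N
    # Reshape the per-channel statistics to (1, C, 1, ..., 1) for broadcasting.
    shape = [1] * X.ndim
    shape[1] = -1
    Y = ((X - current_mean.reshape(shape)) / np.sqrt(current_var.reshape(shape) + epsilon)
         * scale.reshape(shape) + B.reshape(shape))
    running_mean = input_mean * momentum + current_mean * (1 - momentum)
    running_var = input_var * momentum + current_var * (1 - momentum)
    return Y, running_mean, running_var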
When training_mode=False:
Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
For previous (deprecated) non-spatial cases, implementors are advised to flatten the input shape to (N x C * D1 * D2 * … * Dn) before a BatchNormalization Op. This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
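Because trailing optional outputs may simply be omitted, the two output cases translate directly into how a node is constructed. The sketch below uses onnx.helper.make_node; the tensor names such as "X", "scale" and "Y" are illustrative placeholders, not prescribed by the specification:

from onnx import helper

# Output case #2: inference; the trailing optional outputs are simply left out.
node_infer = helper.make_node(
    "BatchNormalization",
    inputs=["X", "scale", "B", "input_mean", "input_var"],
    outputs=["Y"],
    epsilon=1e-5,
)

# Output case #1: training; the running statistics are requested as well.
node_train = helper.make_node(
    "BatchNormalization",
    inputs=["X", "scale", "B", "input_mean", "input_var"],
    outputs=["Y", "running_mean", "running_var"],
    training_mode=1,
    momentum=0.9,
)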
Attributes
epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.
momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.
training_mode: If set to true, it indicates BatchNormalization is being used for training, and the optional outputs (running_mean and running_var) are also populated. Default value is 0.
Inputs
X (heterogeneous) - T: Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over the N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts single-dimension input of size N, in which case C is assumed to be 1.
scale (heterogeneous) - T1: Scale tensor of shape (C).
B (heterogeneous) - T1: Bias tensor of shape (C).
input_mean (heterogeneous) - T2: running (training) or estimated (testing) mean tensor of shape (C).
input_var (heterogeneous) - T2: running (training) or estimated (testing) variance tensor of shape (C).
Outputs
Between 1 and 3 outputs.
Y (heterogeneous) - T: The output tensor, of the same shape as X.
running_mean (optional, heterogeneous) - T2: The running mean after the BatchNormalization operator.
running_var (optional, heterogeneous) - T2: The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N - 1.
Type Constraints
T in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.
T1 in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain scale and bias types to float tensors.
T2 in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain mean and variance types to float tensors.
Examples
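The snippet below is a self-contained NumPy reference for the inference case (training_mode=False), in the spirit of the ONNX backend tests; the helper name and the random test data are illustrative assumptions, not part of the specification:

import numpy as np

def batchnorm_test_mode(x, s, bias, mean, var, epsilon=1e-5):
    # Reshape the per-channel parameters from (C,) to (C, 1, ..., 1)
    # so they broadcast over an (N, C, D1, ..., Dn) input.
    dim_ones = (1,) * (x.ndim - 2)
    s = s.reshape(-1, *dim_ones)
    bias = bias.reshape(-1, *dim_ones)
    mean = mean.reshape(-1, *dim_ones)
    var = var.reshape(-1, *dim_ones)
    return s * (x - mean) / np.sqrt(var + epsilon) + bias

x = np.random.randn(2, 3, 4, 5).astype(np.float32)
s = np.random.randn(3).astype(np.float32)
bias = np.random.randn(3).astype(np.float32)
mean = np.random.randn(3).astype(np.float32)
var = np.random.rand(3).astype(np.float32)  # variances must be non-negative
y = batchnorm_test_mode(x, s, bias, mean, var)
print(y.shape)  # (2, 3, 4, 5)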
Differences
Compared with version 14, version 15 makes two changes. First, the summary adds the note that the computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs. Second, the type constraints are split: ‘scale’ and ‘B’ move from constraint T to a new constraint T1, and ‘input_mean’, ‘input_var’, ‘running_mean’, and ‘running_var’ move from constraint U to a new constraint T2. All three constraints admit the same set of float tensor types.
BatchNormalization - 14
Version
domain: main
since_version: 14
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 14.
Summary
Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode it is being run in, there are five required inputs: ‘X’, ‘scale’, ‘B’, ‘input_mean’ and ‘input_var’. Note that ‘input_mean’ and ‘input_var’ are expected to be the estimated statistics in inference mode (training_mode=False, the default), and the running statistics in training mode (training_mode=True). There are multiple cases for the number of outputs, which we list below:
Output case #1: Y, running_mean, running_var (training_mode=True)
Output case #2: Y (training_mode=False)
When training_mode=False, extra outputs are invalid. The outputs are updated as follows when training_mode=True:
running_mean = input_mean * momentum + current_mean * (1 - momentum)
running_var = input_var * momentum + current_var * (1 - momentum)
Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
where:
current_mean = ReduceMean(X, axis=all_except_channel_index)
current_var = ReduceVar(X, axis=all_except_channel_index)
Notice that ReduceVar refers to the population variance, and it equals
sum(sqrd(x_i - x_avg)) / N
where N is the population size (this formula does not use the sample size N - 1).
When training_mode=False:
Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
For previous (deprecated) non-spatial cases, implementors are advised to flatten the input shape to (N x C * D1 * D2 * … * Dn) before a BatchNormalization Op. This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
Attributes
epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.
momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.
training_mode: If set to true, it indicates BatchNormalization is being used for training, and the optional outputs (running_mean and running_var) are also populated. Default value is 0.
Inputs
X (heterogeneous) - T: Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over the N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts single-dimension input of size N, in which case C is assumed to be 1.
scale (heterogeneous) - T: Scale tensor of shape (C).
B (heterogeneous) - T: Bias tensor of shape (C).
input_mean (heterogeneous) - U: running (training) or estimated (testing) mean tensor of shape (C).
input_var (heterogeneous) - U: running (training) or estimated (testing) variance tensor of shape (C).
Outputs
Between 1 and 3 outputs.
Y (heterogeneous) - T: The output tensor, of the same shape as X.
running_mean (optional, heterogeneous) - U: The running mean after the BatchNormalization operator.
running_var (optional, heterogeneous) - U: The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N - 1.
Type Constraints
T in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.
U in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain mean and variance types to float tensors. All float types are allowed for U.
Differences
Compared with version 9, version 14 rebuilds the operator around the new training_mode attribute. The summary now names the five required inputs, explains the inference/training semantics of ‘input_mean’ and ‘input_var’, and spells out the running-statistics update, the normalization formula, and the population-variance definition. The output cases change from ‘Y, mean, var, saved_mean, saved_var (training mode)’ / ‘Y (test mode)’ to ‘Y, running_mean, running_var (training_mode=True)’ / ‘Y (training_mode=False)’; the ‘saved_mean’ and ‘saved_var’ outputs are dropped, so the operator now has between 1 and 3 outputs instead of between 1 and 5. The inputs ‘mean’ and ‘var’ are renamed ‘input_mean’ and ‘input_var’ and, together with the renamed outputs ‘running_mean’ and ‘running_var’, move to a new type constraint U. tensor(bfloat16) is added to the type constraints.
BatchNormalization - 9
Version
domain: main
since_version: 9
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 9.
Summary
Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode it is being run in, there are multiple cases for the number of outputs, which we list below:
Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)
For previous (deprecated) non-spatial cases, implementors are advised to flatten the input shape to (N x C * D1 * D2 * … * Dn) before a BatchNormalization Op. This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
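For the flattening advice above, a small NumPy sketch (shapes chosen arbitrarily for illustration): collapsing C and the spatial dimensions into one axis makes every activation its own ‘channel’, which reproduces the per-activation statistics of the deprecated non-spatial mode:

import numpy as np

x = np.random.randn(2, 3, 4, 5).astype(np.float32)   # (N, C, D1, D2)
n = x.shape[0]
x_flat = x.reshape(n, -1)                            # (N, C * D1 * D2) = (2, 60)
# BatchNormalization applied to x_flat now computes one mean/variance
# per flattened channel, i.e. per original activation position.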
Attributes
epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.
momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.
Inputs
X (heterogeneous) - T: Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over the N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts single-dimension input of size N, in which case C is assumed to be 1.
scale (heterogeneous) - T: Scale tensor of shape (C).
B (heterogeneous) - T: Bias tensor of shape (C).
mean (heterogeneous) - T: running (training) or estimated (testing) mean tensor of shape (C).
var (heterogeneous) - T: running (training) or estimated (testing) variance tensor of shape (C).
Outputs
Between 1 and 5 outputs.
Y (heterogeneous) - T: The output tensor, of the same shape as X.
mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator.
var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator.
saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation.
saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation.
Type Constraints
T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.
Differences
Compared with version 7, version 9 removes the spatial attribute; the summary instead advises flattening the input to (N x C * D1 * D2 * … * Dn) for the deprecated non-spatial cases. The description of ‘X’ is generalized: statistics are computed for every channel over the N and D1 to Dn dimensions, and a single-dimension input of size N is accepted, with C assumed to be 1. The shapes of ‘scale’, ‘B’, ‘mean’, and ‘var’ are fixed to (C) rather than being conditional on spatial.
BatchNormalization - 7
Version
domain: main
since_version: 7
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 7.
Summary
Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode it is being run in, there are multiple cases for the number of outputs, which we list below:
Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)
This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
Attributes
epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.
momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.
spatial: If true, compute one mean and variance per channel, across all spatial elements; if false, compute a mean and variance per activation over each mini-batch (see the sketch after this list). Default value is 1.
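To make the spatial attribute concrete, the sketch below shows which axes the statistics are reduced over in each mode. It is an illustrative NumPy assumption, consistent with the (C) versus (C x D1 x … x Dn) parameter shapes listed under Inputs:

import numpy as np

x = np.random.randn(8, 3, 4, 4).astype(np.float32)  # (N, C, H, W)

# spatial=1 (default): one statistic per channel, reduced over N, H and W;
# scale, B, mean and var all have shape (C,).
mean_spatial = x.mean(axis=(0, 2, 3))               # shape (3,)

# spatial=0: one statistic per activation, reduced over the batch axis only;
# scale, B, mean and var all have shape (C, H, W).
mean_nonspatial = x.mean(axis=0)                    # shape (3, 4, 4)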
Inputs
X (heterogeneous) - T: Input data tensor from the previous operator; dimensions for the image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For the non-image case, the dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size.
scale (heterogeneous) - T: If spatial is true, the dimension of scale is (C). If spatial is false, the dimensions of scale are (C x D1 x … x Dn).
B (heterogeneous) - T: If spatial is true, the dimension of bias is (C). If spatial is false, the dimensions of bias are (C x D1 x … x Dn).
mean (heterogeneous) - T: If spatial is true, the dimension of the running mean (training) or the estimated mean (testing) is (C). If spatial is false, the dimensions of the running mean (training) or the estimated mean (testing) are (C x D1 x … x Dn).
var (heterogeneous) - T: If spatial is true, the dimension of the running variance (training) or the estimated variance (testing) is (C). If spatial is false, the dimensions of the running variance (training) or the estimated variance (testing) are (C x D1 x … x Dn).
Outputs
Between 1 and 5 outputs.
Y (heterogeneous) - T: The output tensor, of the same shape as X.
mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator.
var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator.
saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation.
saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation.
Type Constraints
T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.
Differences
Compared with version 6, version 7 removes the is_test attribute and restates the remaining attribute descriptions without the redundant ‘default is …’ phrasing. The descriptions of ‘scale’, ‘B’, ‘mean’, and ‘var’ become conditional on the spatial attribute, allowing shapes of (C x D1 x … x Dn) when spatial is false instead of always being 1-dimensional tensors of size C. The output ‘Y’ is no longer described as 4-dimensional, and the in-place requirements and ‘should not be used for testing’ caveats on the optional outputs are dropped.
BatchNormalization - 6
Version
domain: main
since_version: 6
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 6.
Summary
Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode it is being run in, there are multiple cases for the number of outputs, which we list below:
Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)
Attributes
epsilon: The epsilon value to use to avoid division by zero, default is 1e-5f. Default value is 9.999999747378752e-06.
is_test: If set to nonzero, run spatial batch normalization in test mode, default is 0. Default value is 0.
momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum), default is 0.9f. Default value is 0.8999999761581421.
spatial: If true, compute the mean and variance across all spatial elements (one statistic per channel); if false, compute the mean and variance per feature over each mini-batch, default is 1. Default value is 1.
Inputs
X (heterogeneous) - T: Input data tensor from the previous operator; dimensions for the image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For the non-image case, the dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size.
scale (heterogeneous) - T: The scale as a 1-dimensional tensor of size C to be applied to the output.
B (heterogeneous) - T: The bias as a 1-dimensional tensor of size C to be applied to the output.
mean (heterogeneous) - T: The running mean (training) or the estimated mean (testing) as a 1-dimensional tensor of size C.
var (heterogeneous) - T: The running variance (training) or the estimated variance (testing) as a 1-dimensional tensor of size C.
Outputs
Between 1 and 5 outputs.
Y (heterogeneous) - T: The output tensor, of the same shape as X.
mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator. Must be in-place with the input mean. Should not be used for testing.
var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.
saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation. Should not be used for testing.
saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation. Should not be used for testing.
Type Constraints
T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.
Differences
Compared with version 1, version 6 removes the required legacy attribute consumed_inputs and generalizes the input: ‘X’ changes from a 4-dimensional NCHW tensor to a tensor with dimensions (N x C x D1 x D2 … Dn), covering both the image and non-image cases, and ‘Y’ is correspondingly no longer described as 4-dimensional.
BatchNormalization - 1
Version
domain: main
since_version: 1
function: False
support_level: SupportType.COMMON
shape inference: False
This version of the operator has been available since version 1.
Summary
Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode it is being run in, there are multiple cases for the number of outputs, which we list below:
Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)
Attributes
consumed_inputs (required): legacy optimization attribute.
epsilon: The epsilon value to use to avoid division by zero, default is 1e-5f. Default value is 9.999999747378752e-06.
is_test: If set to nonzero, run spatial batch normalization in test mode, default is 0. Default value is 0.
momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum), default is 0.9f. Default value is 0.8999999761581421.
spatial: If true, compute the mean and variance across all spatial elements; if false, compute the mean and variance per feature, default is 1. Default value is 1.
Inputs
X (heterogeneous) - T: The input 4-dimensional tensor of shape NCHW.
scale (heterogeneous) - T: The scale as a 1-dimensional tensor of size C to be applied to the output.
B (heterogeneous) - T: The bias as a 1-dimensional tensor of size C to be applied to the output.
mean (heterogeneous) - T: The running mean (training) or the estimated mean (testing) as a 1-dimensional tensor of size C.
var (heterogeneous) - T: The running variance (training) or the estimated variance (testing) as a 1-dimensional tensor of size C.
Outputs
Between 1 and 5 outputs.
Y (heterogeneous) - T: The output 4-dimensional tensor of the same shape as X.
mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator. Must be in-place with the input mean. Should not be used for testing.
var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.
saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation. Should not be used for testing.
saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation. Should not be used for testing.
Type Constraints
T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.