RoiAlign
RoiAlign - 16
Version
name: RoiAlign (GitHub)
domain: main
since_version: 16
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 16.
Summary
Region of Interest (RoI) align operation described in the [Mask RCNN paper](https://arxiv.org/abs/1703.06870). RoiAlign consumes an input tensor X and regions of interest (rois) and applies pooling across each RoI; it produces a 4D tensor of shape (num_rois, C, output_height, output_width).
RoiAlign avoids misalignment by removing the quantization steps that occur when mapping from the original image to the feature map and from the feature map to the RoI feature; in each RoI bin, the values at the sampled locations are computed directly through bilinear interpolation.
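The bilinear interpolation mentioned above can be illustrated with a minimal pure-Python sketch. The function name `bilinear_sample` and the border handling (clamping coordinates to the valid range) are assumptions made for illustration; they are not taken from the specification, and real implementations may treat out-of-range samples differently.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2D feature map (list of rows) at a fractional (y, x)."""
    height, width = len(feature), len(feature[0])
    # Assumption: clamp the sample point into the feature map.
    y = min(max(y, 0.0), height - 1)
    x = min(max(x, 0.0), width - 1)
    # Integer corners surrounding the sample point.
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    # Fractional distances from the top-left corner.
    ly, lx = y - y0, x - x0
    top = feature[y0][x0] * (1 - lx) + feature[y0][x1] * lx
    bottom = feature[y1][x0] * (1 - lx) + feature[y1][x1] * lx
    return top * (1 - ly) + bottom * ly


feature = [[0.0, 1.0], [2.0, 3.0]]
center = bilinear_sample(feature, 0.5, 0.5)  # midpoint of the four values
```

No coordinate is rounded to an integer before the lookup, which is the quantization step that RoiAlign is described as removing.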
Attributes
coordinate_transformation_mode: Allowed values are ‘half_pixel’ and ‘output_half_pixel’. Use ‘half_pixel’ to shift the input coordinates by 0.5 pixel (the recommended behavior). Use ‘output_half_pixel’ to omit the pixel shift for the input (use this for backward-compatible behavior). Default value is 'half_pixel'.
mode: The pooling method. Two modes are supported: ‘avg’ and ‘max’. Default value is 'avg'.
output_height: Pooled output Y’s height. Default value is 1.
output_width: Pooled output Y’s width. Default value is 1.
sampling_ratio: Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, exactly sampling_ratio x sampling_ratio grid points are used. If == 0, an adaptive number of grid points is used (computed as ceil(roi_width / output_width), and likewise for height). Default value is 0.
spatial_scale: Multiplicative spatial scale factor to translate RoI coordinates from their input spatial scale to the scale used when pooling, i.e., the spatial scale of the input feature map X relative to the input image. Default value is 1.0.
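To make the interplay of spatial_scale, sampling_ratio, and coordinate_transformation_mode more concrete, the following sketch computes the sample coordinates along one axis of a RoI. The helper name `sample_coords` is hypothetical, and placing the samples at the centers of equal sub-cells within each bin is an assumption borrowed from common implementations rather than something the attribute text states. Safeguards that real implementations may add (for example, for degenerate RoIs) are left out.

```python
import math


def sample_coords(c1, c2, output_size, sampling_ratio=0, spatial_scale=1.0,
                  coordinate_transformation_mode="half_pixel"):
    """Return, for each output bin, the sample coordinates along one axis."""
    # 'half_pixel' shifts the scaled coordinates by 0.5; 'output_half_pixel' does not.
    offset = 0.5 if coordinate_transformation_mode == "half_pixel" else 0.0
    start = c1 * spatial_scale - offset
    end = c2 * spatial_scale - offset
    extent = end - start
    bin_size = extent / output_size
    if sampling_ratio > 0:
        grid = sampling_ratio
    else:
        # Adaptive case from the attribute text: ceil(roi_width / output_width).
        grid = math.ceil(extent / output_size)
    # Assumption: samples sit at the centers of `grid` equal sub-cells per bin.
    return [
        [start + b * bin_size + (i + 0.5) * bin_size / grid for i in range(grid)]
        for b in range(output_size)
    ]


legacy = sample_coords(0, 4, 2, sampling_ratio=2,
                       coordinate_transformation_mode="output_half_pixel")
shifted = sample_coords(0, 4, 2, sampling_ratio=2)
```

In this sketch the two modes produce the same sampling pattern, offset by half a pixel; the value at each coordinate would then be read from X by bilinear interpolation and reduced per bin according to mode.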
Inputs
X (heterogeneous)  T1: Input data tensor from the previous operator; 4D feature map of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data.
rois (heterogeneous)  T1: RoIs (Regions of Interest) to pool over; rois is 2D input of shape (num_rois, 4) given as [[x1, y1, x2, y2], …]. The RoIs’ coordinates are in the coordinate system of the input image. Each coordinate set has a 1:1 correspondence with the ‘batch_indices’ input.
batch_indices (heterogeneous)  T2: 1D tensor of shape (num_rois,) with each element denoting the index of the corresponding image in the batch.
Outputs
Y (heterogeneous)  T1: RoI pooled output, 4D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element Y[r-1] is a pooled feature map corresponding to the r-th RoI X[r-1].
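The shape relationships between the inputs and the output can be checked with a short sketch. The function `roi_align_output_shape` is a hypothetical helper written for illustration; it is not part of the onnx package and only encodes the shapes stated in the Inputs and Outputs sections.

```python
def roi_align_output_shape(x_shape, rois_shape, batch_indices_shape,
                           output_height=1, output_width=1):
    """Validate input shapes and return the shape of Y."""
    if len(x_shape) != 4:
        raise ValueError("X must be 4D: (N, C, H, W)")
    if len(rois_shape) != 2 or rois_shape[1] != 4:
        raise ValueError("rois must have shape (num_rois, 4)")
    num_rois = rois_shape[0]
    # batch_indices has one entry per RoI.
    if tuple(batch_indices_shape) != (num_rois,):
        raise ValueError("batch_indices must have shape (num_rois,)")
    channels = x_shape[1]
    return (num_rois, channels, output_height, output_width)


# Three RoIs on a single-channel 10 x 10 feature map, pooled to 5 x 5.
y_shape = roi_align_output_shape((1, 1, 10, 10), (3, 4), (3,), 5, 5)
```

Note that the batch size N does not appear in the output shape: each RoI selects its source image through batch_indices.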
Type Constraints
T1 in ( tensor(double), tensor(float), tensor(float16) ): Constrain types to float tensors.
T2 in ( tensor(int64) ): Constrain types to int tensors.
Examples
_roialign_aligned_false
node = onnx.helper.make_node(
    "RoiAlign",
    inputs=["X", "rois", "batch_indices"],
    outputs=["Y"],
    spatial_scale=1.0,
    output_height=5,
    output_width=5,
    sampling_ratio=2,
    coordinate_transformation_mode="output_half_pixel",
)
X, batch_indices, rois = get_roi_align_input_values()
# (num_rois, C, output_height, output_width)
Y = np.array(
    [
        [
            [
                [0.4664, 0.4466, 0.3405, 0.5688, 0.6068],
                [0.3714, 0.4296, 0.3835, 0.5562, 0.3510],
                [0.2768, 0.4883, 0.5222, 0.5528, 0.4171],
                [0.4713, 0.4844, 0.6904, 0.4920, 0.8774],
                [0.6239, 0.7125, 0.6289, 0.3355, 0.3495],
            ]
        ],
        [
            [
                [0.3022, 0.4305, 0.4696, 0.3978, 0.5423],
                [0.3656, 0.7050, 0.5165, 0.3172, 0.7015],
                [0.2912, 0.5059, 0.6476, 0.6235, 0.8299],
                [0.5916, 0.7389, 0.7048, 0.8372, 0.8893],
                [0.6227, 0.6153, 0.7097, 0.6154, 0.4585],
            ]
        ],
        [
            [
                [0.2384, 0.3379, 0.3717, 0.6100, 0.7601],
                [0.3767, 0.3785, 0.7147, 0.9243, 0.9727],
                [0.5749, 0.5826, 0.5709, 0.7619, 0.8770],
                [0.5355, 0.2566, 0.2141, 0.2796, 0.3600],
                [0.4365, 0.3504, 0.2887, 0.3661, 0.2349],
            ]
        ],
    ],
    dtype=np.float32,
)
expect(
    node,
    inputs=[X, rois, batch_indices],
    outputs=[Y],
    name="test_roialign_aligned_false",
)
_roialign_aligned_true
node = onnx.helper.make_node(
    "RoiAlign",
    inputs=["X", "rois", "batch_indices"],
    outputs=["Y"],
    spatial_scale=1.0,
    output_height=5,
    output_width=5,
    sampling_ratio=2,
    coordinate_transformation_mode="half_pixel",
)
X, batch_indices, rois = get_roi_align_input_values()
# (num_rois, C, output_height, output_width)
Y = np.array(
    [
        [
            [
                [0.5178, 0.3434, 0.3229, 0.4474, 0.6344],
                [0.4031, 0.5366, 0.4428, 0.4861, 0.4023],
                [0.2512, 0.4002, 0.5155, 0.6954, 0.3465],
                [0.3350, 0.4601, 0.5881, 0.3439, 0.6849],
                [0.4932, 0.7141, 0.8217, 0.4719, 0.4039],
            ]
        ],
        [
            [
                [0.3070, 0.2187, 0.3337, 0.4880, 0.4870],
                [0.1871, 0.4914, 0.5561, 0.4192, 0.3686],
                [0.1433, 0.4608, 0.5971, 0.5310, 0.4982],
                [0.2788, 0.4386, 0.6022, 0.7000, 0.7524],
                [0.5774, 0.7024, 0.7251, 0.7338, 0.8163],
            ]
        ],
        [
            [
                [0.2393, 0.4075, 0.3379, 0.2525, 0.4743],
                [0.3671, 0.2702, 0.4105, 0.6419, 0.8308],
                [0.5556, 0.4543, 0.5564, 0.7502, 0.9300],
                [0.6626, 0.5617, 0.4813, 0.4954, 0.6663],
                [0.6636, 0.3721, 0.2056, 0.1928, 0.2478],
            ]
        ],
    ],
    dtype=np.float32,
)
expect(
    node,
    inputs=[X, rois, batch_indices],
    outputs=[Y],
    name="test_roialign_aligned_true",
)
Differences
Version 16 differs from version 10 only in the Attributes section, which gains the coordinate_transformation_mode attribute (allowed values 'half_pixel' and 'output_half_pixel'; default value 'half_pixel'). The summary, the remaining attributes, the inputs, the outputs, and the type constraints are unchanged.
RoiAlign - 10
Version
name: RoiAlign (GitHub)
domain: main
since_version: 10
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 10.
Summary
Region of Interest (RoI) align operation described in the [Mask RCNN paper](https://arxiv.org/abs/1703.06870). RoiAlign consumes an input tensor X and regions of interest (rois) and applies pooling across each RoI; it produces a 4D tensor of shape (num_rois, C, output_height, output_width).
RoiAlign avoids misalignment by removing the quantization steps that occur when mapping from the original image to the feature map and from the feature map to the RoI feature; in each RoI bin, the values at the sampled locations are computed directly through bilinear interpolation.
Attributes
mode: The pooling method. Two modes are supported: ‘avg’ and ‘max’. Default value is 'avg'.
output_height: Pooled output Y’s height. Default value is 1.
output_width: Pooled output Y’s width. Default value is 1.
sampling_ratio: Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, exactly sampling_ratio x sampling_ratio grid points are used. If == 0, an adaptive number of grid points is used (computed as ceil(roi_width / output_width), and likewise for height). Default value is 0.
spatial_scale: Multiplicative spatial scale factor to translate RoI coordinates from their input spatial scale to the scale used when pooling, i.e., the spatial scale of the input feature map X relative to the input image. Default value is 1.0.
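Version 10 has no coordinate_transformation_mode attribute, while the version 16 attribute text describes ‘output_half_pixel’ as the backward-compatible choice. Under the assumption that this value reproduces version 10 behavior, a model upgrade could carry the attributes over as sketched below. The helper `upgrade_roialign_attrs` is hypothetical and operates on a plain dictionary, not on onnx node objects.

```python
def upgrade_roialign_attrs(attrs_v10):
    """Return version 16 attributes intended to keep version 10 semantics."""
    attrs_v16 = dict(attrs_v10)  # copy so the caller's dictionary is untouched
    # Assumption: 'output_half_pixel' (no 0.5 shift) matches version 10.
    # The version 16 default 'half_pixel' would otherwise shift the coordinates.
    attrs_v16["coordinate_transformation_mode"] = "output_half_pixel"
    return attrs_v16


upgraded = upgrade_roialign_attrs({"mode": "avg", "sampling_ratio": 2})
```

Setting the attribute explicitly matters because leaving it out of a version 16 node selects the default 'half_pixel'.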
Inputs
X (heterogeneous)  T1: Input data tensor from the previous operator; 4D feature map of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data.
rois (heterogeneous)  T1: RoIs (Regions of Interest) to pool over; rois is 2D input of shape (num_rois, 4) given as [[x1, y1, x2, y2], …]. The RoIs’ coordinates are in the coordinate system of the input image. Each coordinate set has a 1:1 correspondence with the ‘batch_indices’ input.
batch_indices (heterogeneous)  T2: 1D tensor of shape (num_rois,) with each element denoting the index of the corresponding image in the batch.
Outputs
Y (heterogeneous)  T1: RoI pooled output, 4D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element Y[r-1] is a pooled feature map corresponding to the r-th RoI X[r-1].
Type Constraints
T1 in ( tensor(double), tensor(float), tensor(float16) ): Constrain types to float tensors.
T2 in ( tensor(int64) ): Constrain types to int tensors.