.. _l-onnx-doccom.microsoft-QOrderedAttention:

=================================
com.microsoft - QOrderedAttention
=================================

.. contents::
    :local:

.. _l-onnx-opcom-microsoft-qorderedattention-1:

QOrderedAttention - 1 (com.microsoft)
=====================================

**Version**

* **name**: `QOrderedAttention (GitHub) `_
* **domain**: **com.microsoft**
* **since_version**: **1**
* **function**:
* **support_level**:
* **shape inference**:

This version of the operator has been available
**since version 1 of domain com.microsoft**.

**Summary**

Quantized version of simplified Multi-Head Self Attention (using int8 with a specific matrix layout).
Multi-Head Self Attention can be either unidirectional (like GPT-2) or bidirectional (like BERT).

The mask_index input is optional. Besides the raw attention mask with shape
(batch_size, past_sequence_length + sequence_length) or
(batch_size, sequence_length, past_sequence_length + sequence_length),
with value 0 for masked and 1 otherwise, two other formats are supported:
when the input has right-side padding, mask_index is one-dimensional with shape (batch_size),
where each element is the end position, i.e. the valid length of the actual sequence excluding padding;
when the input has left-side padding, mask_index has shape (2 * batch_size),
where the values are the exclusive end positions followed by the inclusive start positions
(see the sketch under **Examples** below).

When unidirectional is 1, each token only attends to previous tokens.
For GPT-2, both past and present state are optional. Present state can appear in the output
even when past state is not in the input.

The current version does not support past/present, extra_add and qkv_hidden_sizes.
TODO: Support them if needed in the future.

**Attributes**

* **num_heads** (required): Number of attention heads. Default value is ``?``.
* **order_input** (required): cublasLt order of the input matrix. See the schema of QuantizeWithOrder for the order definition. Default value is ``?``.
* **order_output** (required): cublasLt order of global bias. Default value is ``?``.
* **order_weight** (required): cublasLt order of the weight matrix. Default value is ``?``.
* **qkv_hidden_sizes**: Hidden layer sizes of Q, K, V paths in Attention. Default value is ``?``.
* **unidirectional**: Whether every token can only attend to previous tokens. Default value is ``0``.

**Inputs**

Between 17 and 20 inputs.

* **input** (heterogeneous) - **Q**: 3D input tensor with shape (batch_size, sequence_length, input_hidden_size)
* **scale_input** (heterogeneous) - **S**: scale of the input, currently a scalar value (per-tensor quantization)
* **scale_Q_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **scale_K_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **scale_V_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **Q_weight** (heterogeneous) - **Q**: 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
* **K_weight** (heterogeneous) - **Q**: 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
* **V_weight** (heterogeneous) - **Q**: 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
* **scale_Q_weight** (heterogeneous) - **S**: scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
* **scale_K_weight** (heterogeneous) - **S**: scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
* **scale_V_weight** (heterogeneous) - **S**: scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
* **Q_bias** (heterogeneous) - **S**: 1D input tensor with shape (hidden_size)
* **K_bias** (heterogeneous) - **S**: 1D input tensor with shape (hidden_size)
* **V_bias** (heterogeneous) - **S**: 1D input tensor with shape (hidden_size)
* **scale_QKT_gemm** (optional, heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **scale_QKT_softmax** (optional, heterogeneous) - **S**: scale of the softmax result - scalar (per-tensor quantization)
* **scale_values_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization). This is also the output scale of the operator.
* **mask_index** (optional, heterogeneous) - **G**: Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, past_sequence_length + sequence_length) or (batch_size, sequence_length, past_sequence_length + sequence_length), or an index with shape (batch_size) or (2 * batch_size).
* **past** (optional, heterogeneous) - **Q**: past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).
* **extra_add** (optional, heterogeneous) - **S**: additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).

**Outputs**

* **output** (heterogeneous) - **Q**: 3D output tensor with shape (batch_size, sequence_length, hidden_size)

**Examples**
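The schema does not ship an example here. The following is a minimal sketch of building the node
with ``onnx.helper``, supplying the first 17 inputs in schema order and omitting the trailing optional
inputs (mask_index, past, extra_add). The numeric values chosen for the ``order_*`` attributes are
placeholder assumptions; the actual encoding is defined by the QuantizeWithOrder schema, and the
referenced tensors must already be quantized and laid out accordingly::

    from onnx import helper

    num_heads = 12  # example value; hidden_size = num_heads * head_size

    # The first 17 inputs of the schema, in order.  The trailing optional
    # inputs (mask_index, past, extra_add) are omitted in this sketch.
    inputs = [
        "input",                                    # int8, (batch, seq, input_hidden_size)
        "scale_input",                              # float scalar
        "scale_Q_gemm", "scale_K_gemm", "scale_V_gemm",
        "Q_weight", "K_weight", "V_weight",         # int8, (input_hidden_size, hidden_size)
        "scale_Q_weight", "scale_K_weight", "scale_V_weight",
        "Q_bias", "K_bias", "V_bias",               # float, (hidden_size,)
        "scale_QKT_gemm", "scale_QKT_softmax",
        "scale_values_gemm",                        # also the output scale of the operator
    ]

    node = helper.make_node(
        "QOrderedAttention",
        inputs=inputs,
        outputs=["output"],
        domain="com.microsoft",
        num_heads=num_heads,
        # Placeholder order values; see QuantizeWithOrder for the real encoding.
        order_input=1,
        order_weight=0,
        order_output=1,
    )

The node must live in a graph that declares the ``com.microsoft`` opset import; only the node
construction is shown here.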
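The two index forms of mask_index described in the summary can be illustrated with a small
numpy sketch. The batch, lengths and the int32 dtype below are illustrative assumptions::

    import numpy as np

    # Two sequences with valid lengths 3 and 2, padded to length 5.

    # Right-side padding: raw mask and the equivalent (batch_size,) index,
    # where each element is the valid length (exclusive end position).
    raw_right = np.array([[1, 1, 1, 0, 0],
                          [1, 1, 0, 0, 0]], dtype=np.int32)
    mask_index_right = np.array([3, 2], dtype=np.int32)       # shape (batch_size,)

    # Left-side padding: the equivalent (2 * batch_size,) index holding the
    # exclusive end positions followed by the inclusive start positions.
    raw_left = np.array([[0, 0, 1, 1, 1],
                         [0, 0, 0, 1, 1]], dtype=np.int32)
    mask_index_left = np.array([5, 5, 2, 3], dtype=np.int32)  # ends, then starts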