.. _l-onnx-doccom.microsoft-QOrderedAttention:

=================================
com.microsoft - QOrderedAttention
=================================

.. contents::
    :local:

.. _l-onnx-opcom-microsoft-qorderedattention-1:

QOrderedAttention - 1 (com.microsoft)
=====================================

**Version**

* **name**: `QOrderedAttention (GitHub) `_
* **domain**: **com.microsoft**
* **since_version**: **1**
* **function**:
* **support_level**:
* **shape inference**:

This version of the operator has been available
**since version 1 of domain com.microsoft**.

**Summary**

Quantized version of simplified Multi-Head Self Attention (using int8 with a specific matrix layout).
Multi-Head Self Attention can be either unidirectional (like GPT-2) or bidirectional (like BERT).

The mask_index input is optional. Besides the raw attention mask with shape
(batch_size, past_sequence_length + sequence_length) or
(batch_size, sequence_length, past_sequence_length + sequence_length),
with value 0 for masked and 1 otherwise, two other formats are supported:
when the input has right-side padding, mask_index is one-dimensional with shape (batch_size),
where each element is the end position, i.e. the valid length of the actual sequence excluding padding;
when the input has left-side padding, mask_index has shape (2 * batch_size),
where the values are the exclusive end positions followed by the inclusive start positions
(see the sketch under **Examples** below).

When unidirectional is 1, each token only attends to previous tokens.
For GPT-2, both past and present state are optional. Present state can appear in the output
even when past state is not in the input.

The current version does not support past/present, extra_add and qkv_hidden_sizes.
TODO: Support them if needed in the future.

**Attributes**

* **num_heads** (required): Number of attention heads. Default value is ``?``.
* **order_input** (required): cublasLt order of the input matrix. See the schema of QuantizeWithOrder for the order definition. Default value is ``?``.
* **order_output** (required): cublasLt order of global bias. Default value is ``?``.
* **order_weight** (required): cublasLt order of the weight matrix. Default value is ``?``.
* **qkv_hidden_sizes**: Hidden layer sizes of Q, K, V paths in Attention. Default value is ``?``.
* **unidirectional**: Whether every token can only attend to previous tokens. Default value is ``0``.

**Inputs**

Between 17 and 20 inputs.

* **input** (heterogeneous) - **Q**: 3D input tensor with shape (batch_size, sequence_length, input_hidden_size)
* **scale_input** (heterogeneous) - **S**: scale of the input, currently a scalar value (per-tensor quantization)
* **scale_Q_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **scale_K_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **scale_V_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **Q_weight** (heterogeneous) - **Q**: 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
* **K_weight** (heterogeneous) - **Q**: 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
* **V_weight** (heterogeneous) - **Q**: 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
* **scale_Q_weight** (heterogeneous) - **S**: scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
* **scale_K_weight** (heterogeneous) - **S**: scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
* **scale_V_weight** (heterogeneous) - **S**: scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
* **Q_bias** (heterogeneous) - **S**: 1D input tensor with shape (hidden_size)
* **K_bias** (heterogeneous) - **S**: 1D input tensor with shape (hidden_size)
* **V_bias** (heterogeneous) - **S**: 1D input tensor with shape (hidden_size)
* **scale_QKT_gemm** (optional, heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization)
* **scale_QKT_softmax** (optional, heterogeneous) - **S**: scale of the softmax result - scalar (per-tensor quantization)
* **scale_values_gemm** (heterogeneous) - **S**: scale of the gemm - scalar (per-tensor quantization). This is also the output scale of the operator.
* **mask_index** (optional, heterogeneous) - **G**: Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, past_sequence_length + sequence_length) or (batch_size, sequence_length, past_sequence_length + sequence_length), or an index with shape (batch_size) or (2 * batch_size).
* **past** (optional, heterogeneous) - **Q**: past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).
* **extra_add** (optional, heterogeneous) - **S**: additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).

**Outputs**

* **output** (heterogeneous) - **Q**: 3D output tensor with shape (batch_size, sequence_length, hidden_size)

**Examples**
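The schema does not ship an example here. The following is a minimal sketch of building the node
with ``onnx.helper``, supplying the first 17 inputs in schema order and omitting the trailing optional
inputs (mask_index, past, extra_add). The numeric values chosen for the ``order_*`` attributes are
placeholder assumptions; the actual encoding is defined by the QuantizeWithOrder schema, and the
referenced tensors must already be quantized and laid out accordingly::

    from onnx import helper

    num_heads = 12  # example value; hidden_size = num_heads * head_size

    # The first 17 inputs of the schema, in order.  The trailing optional
    # inputs (mask_index, past, extra_add) are omitted in this sketch.
    inputs = [
        "input",                                    # int8, (batch, seq, input_hidden_size)
        "scale_input",                              # float scalar
        "scale_Q_gemm", "scale_K_gemm", "scale_V_gemm",
        "Q_weight", "K_weight", "V_weight",         # int8, (input_hidden_size, hidden_size)
        "scale_Q_weight", "scale_K_weight", "scale_V_weight",
        "Q_bias", "K_bias", "V_bias",               # float, (hidden_size,)
        "scale_QKT_gemm", "scale_QKT_softmax",
        "scale_values_gemm",                        # also the output scale of the operator
    ]

    node = helper.make_node(
        "QOrderedAttention",
        inputs=inputs,
        outputs=["output"],
        domain="com.microsoft",
        num_heads=num_heads,
        # Placeholder order values; see QuantizeWithOrder for the real encoding.
        order_input=1,
        order_weight=0,
        order_output=1,
    )

The node must live in a graph that declares the ``com.microsoft`` opset import; only the node
construction is shown here.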
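The two index forms of mask_index described in the summary can be illustrated with a small
numpy sketch. The batch, lengths and the int32 dtype below are illustrative assumptions::

    import numpy as np

    # Two sequences with valid lengths 3 and 2, padded to length 5.

    # Right-side padding: raw mask and the equivalent (batch_size,) index,
    # where each element is the valid length (exclusive end position).
    raw_right = np.array([[1, 1, 1, 0, 0],
                          [1, 1, 0, 0, 0]], dtype=np.int32)
    mask_index_right = np.array([3, 2], dtype=np.int32)       # shape (batch_size,)

    # Left-side padding: the equivalent (2 * batch_size,) index holding the
    # exclusive end positions followed by the inclusive start positions.
    raw_left = np.array([[0, 0, 1, 1, 1],
                         [0, 0, 0, 1, 1]], dtype=np.int32)
    mask_index_left = np.array([5, 5, 2, 3], dtype=np.int32)  # ends, then starts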