.. _l-onnx-doccom.microsoft-Attention:

=========================
com.microsoft - Attention
=========================

.. contents::
    :local:

.. _l-onnx-opcom-microsoft-attention-1:

Attention - 1 (com.microsoft)
=============================

**Version**

* **name**: `Attention (GitHub) `_
* **domain**: **com.microsoft**
* **since_version**: **1**
* **function**:
* **support_level**:
* **shape inference**:

This version of the operator has been available
**since version 1 of domain com.microsoft**.

**Summary**

Multi-Head Attention that can be either unidirectional (like GPT-2) or
bidirectional (like BERT).

The weights for the input projections of Q, K and V are merged. The data is
stacked on the second dimension: its shape is
(input_hidden_size, hidden_size + hidden_size + v_hidden_size). Here
hidden_size is the hidden dimension of Q and K, and v_hidden_size is that
of V.

The mask_index is optional. Besides a raw attention mask with shape
(batch_size, total_sequence_length) or
(batch_size, sequence_length, total_sequence_length), with value 0 for masked
and 1 otherwise, two other formats are supported. When the input has
right-side padding, mask_index is one-dimensional with shape (batch_size),
where each value is the actual sequence length excluding padding. When the
input has left-side padding, mask_index has shape (2 * batch_size), where the
values are the exclusive end positions followed by the inclusive start
positions.

When unidirectional is 1, each token attends only to previous tokens.

Both past and present states are optional. They shall be used together;
using only one of them is not allowed.

When weights is not provided, key and value are required. In that case the
MatMul for the input projection is skipped, and input is the query after
projection. The bias is still included for performance considerations.

The qkv_hidden_sizes attribute is required only when K and V have different
hidden sizes. When there is a past state, the hidden dimensions of Q, K and V
shall be the same.

The total_sequence_length is past_sequence_length + kv_sequence_length, where
kv_sequence_length is the length of K or V. For self attention,
kv_sequence_length equals sequence_length (the sequence length of Q). For
cross attention, query and key might have different lengths.
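For reference, the following is a minimal numpy sketch of the computation
described above, for the self-attention case with merged weights, identical
hidden sizes for Q, K and V, and a (batch_size) mask_index (right-side
padding). The function name, the explicit softmax and the use of ``-1e4`` as
the masked score are illustrative assumptions, not the kernel's actual
implementation.

.. code-block:: python

    import numpy as np

    def attention_reference(x, weights, bias, mask_index, num_heads):
        # x: (batch, seq, input_hidden_size); weights: (input_hidden_size, 3 * hidden)
        # bias: (3 * hidden,); mask_index: (batch,) actual sequence lengths (int)
        batch, seq, _ = x.shape
        hidden = weights.shape[1] // 3        # assumes hidden_size == v_hidden_size
        head = hidden // num_heads

        # Single MatMul against the merged Q/K/V weights, then split.
        qkv = x @ weights + bias              # (batch, seq, 3 * hidden)
        q, k, v = np.split(qkv, 3, axis=-1)

        def to_heads(t):                      # (batch, seq, hidden) -> (batch, heads, seq, head)
            return t.reshape(batch, seq, num_heads, head).transpose(0, 2, 1, 3)

        q, k, v = map(to_heads, (q, k, v))

        # Scaled dot-product attention scores.
        scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head)   # (batch, heads, seq, seq)

        # (batch_size,) mask_index: positions beyond the actual length are masked.
        pad = np.arange(seq)[None, :] >= mask_index[:, None]   # (batch, seq)
        scores = np.where(pad[:, None, None, :], -1e4, scores)

        # Softmax over the key dimension.
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)

        out = probs @ v                                          # (batch, heads, seq, head)
        return out.transpose(0, 2, 1, 3).reshape(batch, seq, hidden)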
**Attributes**

* **num_heads** (required):
  Number of attention heads.
  Default value is ``?``.
* **qkv_hidden_sizes**:
  Hidden dimensions of Q, K, V: hidden_size, hidden_size and v_hidden_size.
  Default value is ``?``.
* **unidirectional**:
  Whether every token can only attend to previous tokens.
  Default value is ``0``.

**Inputs**

Between 3 and 8 inputs.

* **input** (optional, heterogeneous) - **T**:
  Input tensor with shape (batch_size, sequence_length, input_hidden_size)
  when weights is available, or query tensor with shape
  (batch_size, sequence_length, hidden_size) when weights is not available.
* **weights** (optional, heterogeneous) - **T**:
  Merged Q/K/V weights with shape
  (input_hidden_size, hidden_size + hidden_size + v_hidden_size).
* **bias** (heterogeneous) - **T**:
  Bias tensor with shape (hidden_size + hidden_size + v_hidden_size) for the
  input projection.
* **mask_index** (optional, heterogeneous) - **M**:
  Attention mask with shape
  (batch_size, 1, max_sequence_length, max_sequence_length),
  (batch_size, total_sequence_length) or
  (batch_size, sequence_length, total_sequence_length), or index with shape
  (batch_size) or (2 * batch_size).
* **past** (optional, heterogeneous) - **T**:
  Past state for key and value with shape
  (2, batch_size, num_heads, past_sequence_length, head_size).
* **extra_add** (optional, heterogeneous) - **T**:
  Additional add to QxK' with shape
  (batch_size, num_heads, sequence_length, total_sequence_length).
* **key** (optional, heterogeneous) - **T**:
  Input for key with shape (batch_size, kv_sequence_length, hidden_size).
  Required when weights is not available.
* **value** (optional, heterogeneous) - **T**:
  Input for value with shape (batch_size, kv_sequence_length, v_hidden_size).
  Required when weights is not available.

**Outputs**

Between 1 and 2 outputs.

* **output** (heterogeneous) - **T**:
  3D output tensor with shape (batch_size, sequence_length, v_hidden_size).
* **present** (optional, heterogeneous) - **T**:
  Present state for key and value with shape
  (2, batch_size, num_heads, total_sequence_length, head_size).

**Examples**
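No example ships with this reference page. The snippet below is a minimal
sketch of how an ``Attention`` node could be built and evaluated with
onnxruntime; the tensor names, dimensions and opset versions are arbitrary
choices, and it assumes an onnxruntime build that registers the
com.microsoft contrib operators.

.. code-block:: python

    import numpy as np
    import onnx
    from onnx import TensorProto, helper
    import onnxruntime as ort

    # Self attention with merged weights: 2 heads, hidden_size = v_hidden_size = 8.
    batch, seq, in_hidden, hidden, heads = 1, 4, 8, 8, 2

    node = helper.make_node(
        "Attention",
        inputs=["input", "weights", "bias", "mask_index"],
        outputs=["output"],
        domain="com.microsoft",
        num_heads=heads,
    )

    graph = helper.make_graph(
        [node],
        "attention_example",
        [
            helper.make_tensor_value_info("input", TensorProto.FLOAT, [batch, seq, in_hidden]),
            helper.make_tensor_value_info("weights", TensorProto.FLOAT, [in_hidden, 3 * hidden]),
            helper.make_tensor_value_info("bias", TensorProto.FLOAT, [3 * hidden]),
            helper.make_tensor_value_info("mask_index", TensorProto.INT32, [batch]),
        ],
        [helper.make_tensor_value_info("output", TensorProto.FLOAT, [batch, seq, hidden])],
    )

    model = helper.make_model(
        graph,
        opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
    )

    sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
    feeds = {
        "input": np.random.randn(batch, seq, in_hidden).astype(np.float32),
        "weights": np.random.randn(in_hidden, 3 * hidden).astype(np.float32),
        "bias": np.random.randn(3 * hidden).astype(np.float32),
        # mask_index of shape (batch_size,): actual sequence lengths, no padding here.
        "mask_index": np.array([seq], dtype=np.int32),
    }
    print(sess.run(None, feeds)[0].shape)   # (1, 4, 8)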