com.microsoft - QOrderedLongformerAttention#

QOrderedLongformerAttention - 1 (com.microsoft)#

Version

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Quantized version of Longformer self-attention (using int8 with specific matrix layouts).

Attributes

  • num_heads (required): Number of attention heads.

  • order_global_weight (required): cublasLt order of the global weight matrix.

  • order_input (required): cublasLt order of the input matrix. See the schema of QuantizeWithOrder for the order definition.

  • order_output (required): cublasLt order of the output matrix.

  • order_weight (required): cublasLt order of the weight matrix.

  • window (required): One-sided attention window length W (half of the total window length).
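
As an illustrative sketch (not part of the schema), with a one-sided window length W, each token at position i attends to positions in [i - W, i + W]. A banded boolean mask expressing this local-attention pattern could be built as follows; the function name and shapes are assumptions for illustration only:

```python
import numpy as np

def sliding_window_mask(sequence_length: int, w: int) -> np.ndarray:
    """Illustrative banded mask: position i may attend to [i - w, i + w].

    `w` plays the role of the one-sided `window` attribute (half of the
    total window length). This sketches only the attention pattern, not
    the operator's internal memory layout.
    """
    i = np.arange(sequence_length)[:, None]
    j = np.arange(sequence_length)[None, :]
    return np.abs(i - j) <= w

mask = sliding_window_mask(8, 2)
```

Tokens marked in the `global` input additionally attend to (and are attended by) all positions, which is what distinguishes Longformer attention from a purely banded pattern.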

Inputs

  • input (heterogeneous) - Q: 3D input tensor with shape (batch_size, sequence_length, hidden_size), hidden_size = num_heads * head_size

  • scale_input (heterogeneous) - S: scale of the input

  • weight (heterogeneous) - Q: 2D input tensor with shape (hidden_size, 3 * hidden_size)

  • scale_weight (heterogeneous) - S: scale of the weight

  • bias (heterogeneous) - S: 1D input tensor with shape (3 * hidden_size), currently fp32 only.

  • scale_bias (heterogeneous) - S: reserved (not used: adding the bias requires float values in cublasLt for the normal order).

  • scale_qkv_gemm (heterogeneous) - S: scale of the output for the fused qkv gemm

  • mask (heterogeneous) - F: Attention mask with shape (batch_size, sequence_length)

  • global_weight (heterogeneous) - Q: 2D input tensor with shape (hidden_size, 3 * hidden_size)

  • scale_global_weight (heterogeneous) - S: scale of the global_weight

  • global_bias (heterogeneous) - S: 1D input tensor with shape (3 * hidden_size)

  • scale_global_gemm (heterogeneous) - S: scale of the output for the fused global qkv gemm

  • global (heterogeneous) - G: Global attention flags with shape (batch_size, sequence_length)

  • scale_output (heterogeneous) - S: scale of the output
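
The schema pairs each quantized ('Q', int8) tensor with a float scale ('S') input. As a hedged sketch of that pairing (the rounding mode and function names here are assumptions, not onnxruntime's exact implementation), per-tensor quantization and dequantization look like:

```python
import numpy as np

def quantize_per_tensor(x: np.ndarray, scale: float) -> np.ndarray:
    """Sketch of per-tensor int8 quantization: q = clip(round(x / scale), -128, 127).

    Illustrates how a 'Q' tensor relates to its 'S' scale input; the
    operator's actual rounding/saturation behavior may differ.
    """
    q = np.clip(np.rint(x / scale), -128, 127)
    return q.astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values and a scale."""
    return q.astype(np.float32) * scale

x = np.array([0.05, -0.1, 0.2], dtype=np.float32)
scale = 0.01
roundtrip = dequantize(quantize_per_tensor(x, scale), scale)
```

The quantization error of a single value is bounded by about scale / 2, which is why each GEMM stage in the operator carries its own scale input.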

Outputs

  • output (heterogeneous) - Q: 3D output tensor with shape (batch_size, sequence_length, hidden_size)

Examples