com.microsoft - QOrderedMatMul

QOrderedMatMul - 1 (com.microsoft)

Version

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Quantized (int8) MatMul with cublasLt memory orders. It implements Y = alpha * A * B + bias + beta * C, where A, B, C, and Y are all int8 matrices. Two types of order combination are supported:

*) When order_B is ORDER_COL, order_A must be ORDER_ROW.

bias is a float32 vector of length {#cols of Y}. C should have batch 1 or batch_A, and B can have batch 1 or batch_A. Note that B is reordered to ORDER_COL, which is equivalent to being transposed; it is not transposed first and then reordered.

*) When order_B is ORDER_COL4_4R2_8C or ORDER_COL32_2R_4R4, order_A must be ORDER_COL32.

MatMul is computed as Y = alpha * (A * B) + beta * C. bias is not supported in this case. Here B is transposed first and then reordered into ORDER_COL4_4R2_8C or ORDER_COL32_2R_4R4.

order_Y and order_C are the same as order_A. Per-column quantized weights are supported, i.e., scale_B may be a 1-D vector of size [#cols of matrix B].
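The arithmetic above can be sketched in NumPy as a row-major reference. This is only an illustration under stated assumptions, not the CUDA kernel: memory orders are ignored, B is taken in its logical (K, N) form, quantized values are assumed to map to real values as real = scale * q, and the result is assumed to be rounded and saturated to the int8 range. The function name `qordered_matmul_reference` is hypothetical. Under these assumptions alpha corresponds to scale_A * scale_B / scale_Y and beta to scale_C / scale_Y.

```python
import numpy as np

def qordered_matmul_reference(A, scale_A, B, scale_B, scale_Y,
                              bias=None, C=None, scale_C=None):
    """Row-major reference for the arithmetic only (hypothetical helper).

    Assumptions not stated in the schema: real = scale * q, and the
    output is rounded to nearest and saturated to [-128, 127].
    """
    A = np.asarray(A, dtype=np.int8)   # (batch, M, K)
    B = np.asarray(B, dtype=np.int8)   # logical (K, N)
    # Accumulate in int32 so int8 products cannot overflow.
    acc = np.matmul(A.astype(np.int32), B.astype(np.int32))
    # scale_B is a scalar or a per-column vector of length N;
    # either form broadcasts over the last axis of the product.
    real = (acc.astype(np.float32) * np.float32(scale_A)
            * np.asarray(scale_B, dtype=np.float32))
    if bias is not None:
        # bias is float32 and is used as a real-valued (unscaled) term.
        real = real + np.asarray(bias, dtype=np.float32)
    if C is not None:
        if scale_C is None:
            raise ValueError("scale_C is required when C is given")
        C = np.asarray(C, dtype=np.int8)
        if C.ndim == 2:
            C = C[np.newaxis, ...]     # expand 2-D C to 3-D first
        real = real + np.float32(scale_C) * C.astype(np.float32)
    q = np.rint(real / np.float32(scale_Y))
    return np.clip(q, -128, 127).astype(np.int8)

# Per-column scale_B: column 0 uses scale 1.0, column 1 uses scale 2.0.
A = np.array([[[1, 2], [3, 4]]], dtype=np.int8)
B = np.array([[1, 0], [0, 1]], dtype=np.int8)
Y = qordered_matmul_reference(A, 0.5, B, np.array([1.0, 2.0]), 0.5)
```

In the usage lines, the second output column is doubled relative to the first because of the per-column scale_B.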

Attributes

  • order_A (required): cublasLt order of matrix A. See the schema of QuantizeWithOrder for the order definitions.

  • order_B (required): cublasLt order of matrix B.

  • order_Y (required): cublasLt order of matrix Y and of the optional matrix C.
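The order constraints from the Summary can be checked before a node is built. The sketch below is a hypothetical helper, not part of ONNX Runtime; the integer values are assumed to mirror the cublasLtOrder_t enum and should be verified against the cuBLASLt headers and the QuantizeWithOrder schema.

```python
from enum import IntEnum

class Order(IntEnum):
    # Assumed to mirror cublasLtOrder_t; verify against the cuBLASLt headers.
    ORDER_COL = 0
    ORDER_ROW = 1
    ORDER_COL32 = 2
    ORDER_COL4_4R2_8C = 3
    ORDER_COL32_2R_4R4 = 4

def check_orders(order_A, order_B, order_Y):
    """Raise ValueError unless the triple is a documented combination."""
    a, b, y = Order(order_A), Order(order_B), Order(order_Y)
    if b == Order.ORDER_COL:
        required_a = Order.ORDER_ROW
    elif b in (Order.ORDER_COL4_4R2_8C, Order.ORDER_COL32_2R_4R4):
        required_a = Order.ORDER_COL32
    else:
        raise ValueError(f"unsupported order_B: {b.name}")
    if a != required_a:
        raise ValueError(
            f"order_B={b.name} requires order_A={required_a.name}, got {a.name}")
    if y != a:
        # order_Y (and order_C) must be the same as order_A.
        raise ValueError(f"order_Y must equal order_A ({a.name}), got {y.name}")
    return a, b, y

# Row-major A with column-major B is one of the two supported pairings.
orders = check_orders(Order.ORDER_ROW, Order.ORDER_COL, Order.ORDER_ROW)
```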

Inputs

Between 5 and 8 inputs.

  • A (heterogeneous) - Q: 3-dimensional matrix A

  • scale_A (heterogeneous) - S: scale of the input A.

  • B (heterogeneous) - Q: 2-dimensional matrix B. Transposed if order_B is ORDER_COL.

  • scale_B (heterogeneous) - S: scale of the input B. Scalar or 1-D float32.

  • scale_Y (heterogeneous) - S: scale of the output Y.

  • bias (optional, heterogeneous) - S: 1-D bias, not scaled by scale_Y.

  • C (optional, heterogeneous) - Q: 3-D or 2-D matrix C. A 2-D C is expanded to 3-D first. C.shape[0] must be 1 or equal to A.shape[0].

  • scale_C (optional, heterogeneous) - S: scale of the input C.
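The shape rules for these inputs can be summarized in a small pure-Python check. This is a sketch with a hypothetical function name; it works on logical shapes only, assumes B is given as (K, N) regardless of its memory order, and assumes Y has the logical shape (batch, M, N).

```python
def infer_output_shape(a_shape, b_shape, scale_b_shape=(),
                       bias_shape=None, c_shape=None):
    """Return the assumed logical shape of Y, or raise ValueError."""
    if len(a_shape) != 3:
        raise ValueError("A must be 3-dimensional")
    batch, m, k = a_shape
    if len(b_shape) != 2 or b_shape[0] != k:
        raise ValueError("B must be 2-dimensional with K matching A")
    n = b_shape[1]
    # scale_B is a scalar or a 1-D vector with one entry per column of B.
    if tuple(scale_b_shape) not in ((), (n,)):
        raise ValueError("scale_B must be a scalar or have shape (N,)")
    if bias_shape is not None and tuple(bias_shape) != (n,):
        raise ValueError("bias must have shape (N,)")
    if c_shape is not None:
        c_shape = tuple(c_shape)
        if len(c_shape) == 2:
            c_shape = (1,) + c_shape   # expand 2-D C to 3-D first
        if len(c_shape) != 3 or c_shape[1:] != (m, n):
            raise ValueError("C must have trailing shape (M, N)")
        if c_shape[0] not in (1, batch):
            raise ValueError("C.shape[0] must be 1 or A.shape[0]")
    return (batch, m, n)

# A batch of two 3x4 matrices times a 4x5 weight, with a 2-D C broadcast
# over the batch and a per-column scale_B.
y_shape = infer_output_shape((2, 3, 4), (4, 5), scale_b_shape=(5,),
                             bias_shape=(5,), c_shape=(3, 5))
```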

Outputs

  • Y (heterogeneous) - Q: Matrix multiply results from A * B

Examples