AutoGEMM

Installation and user guide

Dec 15, 2022

Description

AutoGEMM is a software package that offers a family of Python code generators for high-performance GEMM (General Matrix-Matrix Multiplication) implementations based on Apache TVM.

This guide complements the submitted paper Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM and includes basic setup and execution steps for the package.

The provided software package includes all the generators described in the paper (listed in Sections 5, 6, and 7 of the manuscript) and used to extract the performance plots in Section 8, together with a simplified high-level driver program to easily use them.

Package structure

The software bundle consists of two main directories. The first one contains the documentation:

File name        Description
userManual.pdf   This documentation file (PDF format)
userManual.html  This documentation file (HTML format)

The second one, src/, contains the driver, its configuration files, and the code generators:

File name    Description
driver.py    Simplified driver to conduct specific performance experiments using different configuration parameters and code generators
exec.cfg     Configuration file to define the general experimental parameters
machine.cfg  Configuration file to define the specific features of the target architecture
generators/  Directory containing code generators, described below

Within the src/ directory, the Python scripts for the generators described in the paper can be found in the generators subdirectory:

File name                            Listing  Description
basic_GEMM_B3A2C0.py                 5        TVM generator for the basic GEMM
blocked_GEMM_B3A2C0.py               6        TVM generator for GEMM mimicking the blocking scheme of the baseline algorithm
packed_GEMM_B3A2C0.py                         TVM generator for GEMM mimicking the blocking scheme and packing of the baseline algorithm
packed_GEMM_B3A2C0_ukernel.py        8        TVM generator for GEMM mimicking the blocking and packing of the baseline algorithm, integrating the optimized C-resident micro-kernel
opt_GEMM_B3A2C0_ukernel.py           9        TVM generator for GEMM mimicking the blocking and packing of the baseline algorithm, integrating both the optimized C-resident micro-kernel and fine-grain optimizations
opt_GEMM_B3A2C0_ukernel_parallel.py  9        TVM generator for GEMM mimicking the blocking and packing of the baseline algorithm, integrating the optimized C-resident micro-kernel, fine-grain optimizations, and a loop-level parallelization of loop ic
opt_GEMM_A3C2B0_ukernel.py           11       TVM generator for GEMM mimicking the blocking and packing schemes of the A3C2B0 algorithm
opt_GEMM_C3A2B0_ukernel.py           12       TVM generator for GEMM mimicking the blocking and packing schemes of the C3A2B0 algorithm

Installation steps

Prerequisites and supported platforms

AutoGEMM relies on two software prerequisites: any installation of Python 3, and Apache TVM. The prerequisites for the latter are not considered in this documentation, but will be automatically managed and installed by the following installation steps. To ease the installation process, the following instructions are based on the creation of a virtual environment via Python.

Creation of a virtual environment

# Installation of the Python 3 venv module (assuming a Debian-based Linux distribution)
user@machine:~$ sudo apt-get install python3-venv
# Creation of a virtual Python environment with name .tvm
user@machine:~$ python3 -m venv .tvm
# Activation of the virtual environment
user@machine:~$ source .tvm/bin/activate
(.tvm) user@machine:~$

Additional software prerequisites

# Installation of additional software requisites via pip
(.tvm) user@machine:~$ pip install typing_extensions pytest

Installation of Apache TVM

# Installation of Apache TVM and its dependencies
(.tvm) user@machine:~$ pip install apache-tvm
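
Once the steps above have completed, a quick sanity check can confirm that the virtual environment is active and that the packages can be located. The following is a minimal sketch, not part of the AutoGEMM package; the helper names in_virtualenv and is_installed are hypothetical. It only looks the modules up and does not import them.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    # "tvm" is the module name provided by the apache-tvm package.
    for name in ("tvm", "typing_extensions", "pytest"):
        print(f"{name}: {'found' if is_installed(name) else 'MISSING'}")
```

Saved to a file and run with python3 inside the activated environment, the script prints one line per package.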

Driver configuration and usage

Configuration files

Execution configuration file (exec.cfg)

This configuration file specifies the conditions for the execution of one experiment. The file is structured in sections, each one including a number of mandatory parameters. In what follows, we consider the operation C := A * B, with input matrices A and B and output matrix C.

Example of configuration file exec.cfg:

#########################
[GENERAL]
repeats = 1
assembly = 1

[EXPERIMENT]
variant = B3A2C0
opt_level = opt_ukernel
dtype = float32

[PROBLEM]
M = 4096
N = 4096
K = 4096

[BLOCKSIZES]
mr = 4
nr = 16
kr = 4

MC = 256
NC = 1280
KC = 128

ls = 16
#########################
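
To illustrate how the problem and block sizes relate, the following sketch reads the [PROBLEM] and [BLOCKSIZES] sections with Python's standard configparser module and counts how many blocks each dimension is split into. It is an illustration only: how driver.py itself reads the file is not shown in this guide, and the helper name block_counts is hypothetical.

```python
import configparser
import math

# Shortened copy of the example file; only the two sections used below.
EXAMPLE = """
[PROBLEM]
M = 4096
N = 4096
K = 4096

[BLOCKSIZES]
mr = 4
nr = 16
kr = 4
MC = 256
NC = 1280
KC = 128
ls = 16
"""


def block_counts(cfg_text: str) -> dict:
    """Count the blocks each problem dimension is split into (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    problem, blocks = cfg["PROBLEM"], cfg["BLOCKSIZES"]
    # configparser lower-cases option names, so "MC" and "mc" are equivalent.
    return {
        "m_blocks": math.ceil(problem.getint("M") / blocks.getint("MC")),
        "n_blocks": math.ceil(problem.getint("N") / blocks.getint("NC")),
        "k_blocks": math.ceil(problem.getint("K") / blocks.getint("KC")),
    }


print(block_counts(EXAMPLE))  # {'m_blocks': 16, 'n_blocks': 4, 'k_blocks': 32}
```

With the example values, N = 4096 is not a multiple of NC = 1280, so the last of the four column blocks is narrower than the others.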

Machine configuration file (machine.cfg)

This configuration file describes the target architecture. In this version, it includes a label for the architecture and the LLVM target string used to generate code, as follows:

Example of configuration file machine.cfg (in this example, targeting an Intel Icelake architecture):

#########################
[MACHINE]
name = icelake

target = llvm -mcpu=icelake-server
#########################

Some examples of target values used in the manuscript:

Architecture                           Target
Basic (native without optimization)    llvm
ARMv8a (8.2) with NEON                 llvm -device=arm_cpu -mattr=+v8.2a,+fp-armv8,+neon
ARMv8a (8.2) with NEON and FP16        llvm -device=arm_cpu -mattr=+v8.2a,+fp-armv8,+neon,+fp16fml
AMD Zen2 with AVX2                     llvm -mcpu=znver2
Intel Icelake with AVX512              llvm -mcpu=icelake-server
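
A target string consists of a target kind (llvm in all the examples) followed by options. The sketch below reads a machine.cfg-style text with the standard configparser module and splits the target accordingly. It does not require TVM, and the helper name read_machine is hypothetical; how driver.py consumes the file is not shown in this guide.

```python
import configparser
import shlex

# Copy of the Icelake example, without the comment banner lines.
EXAMPLE = """
[MACHINE]
name = icelake
target = llvm -mcpu=icelake-server
"""


def read_machine(cfg_text: str) -> tuple:
    """Return (label, target kind, list of target options); hypothetical helper."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    machine = cfg["MACHINE"]
    # The first token names the target kind; the remaining ones are its options.
    kind, *options = shlex.split(machine["target"])
    return machine["name"], kind, options


print(read_machine(EXAMPLE))  # ('icelake', 'llvm', ['-mcpu=icelake-server'])
```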

Driver execution

The file driver.py accepts two mandatory arguments that indicate the paths to the configuration files. A detailed usage guide is printed when the driver is run with the -h argument:

(.tvm) user@machine:~$ python3 driver.py -h
## Test driver for BLAS generators
usage: Test driver. [-h] -c EXECCONFIG -m MACHINECONFIG

Description

optional arguments:
  -h, --help            show this help message and exit
  -c EXECCONFIG, --execconfig EXECCONFIG
                        Execution configuration file
  -m MACHINECONFIG, --machineconfig MACHINECONFIG
                        Machine configuration file

An example execution of the driver is illustrated next:

(.tvm) user@machine:~$ python3 driver.py -c exec.cfg -m machine.cfg
## Test driver for BLAS generators
icelake,B3A2C0_opt_ukernel_par,4096,4096,4096,256,1280,128,4,16,4,465.45

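Note that the output is a single comma-separated line with the execution information. Matching the example against exec.cfg and machine.cfg, the fields are, in order: the machine name; the variant and optimization level; the problem dimensions M, N, and K; the cache block sizes MC, NC, and KC; the micro-kernel parameters mr, nr, and kr; and the measured performance.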

If the assembly file has been requested in the execution configuration file, it will be placed in the assembly folder with name variant_MxNxK_mrxnr.s (e.g. B3A2C0_256x1280x128_4x16.s).

Parallel executions

For codes generated with support for multithreading, the environment variable TVM_NUM_THREADS controls the number of threads deployed for execution. As an example, the following execution:

(.tvm) user@machine:~$ TVM_NUM_THREADS=1 python3 driver.py -c exec.cfg -m machine.cfg
## Test driver for BLAS generators
icelake,B3A2C0_opt_ukernel_par,4096,4096,4096,256,1280,128,4,16,4,32.46

would execute a sequential version of the corresponding GEMM code, whereas

(.tvm) user@machine:~$ TVM_NUM_THREADS=64 python3 driver.py -c exec.cfg -m machine.cfg
## Test driver for BLAS generators
icelake,B3A2C0_opt_ukernel_par,4096,4096,4096,256,1280,128,4,16,4,464.63

would execute a parallel version deploying 64 threads (observe the difference in performance reported as the last field of the comma-separated output). By default, TVM deploys as many threads as cores are available in the system.
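
The two output lines above can be compared programmatically. The following minimal sketch relies only on the fact that the performance is the last comma-separated field; the helper name performance is hypothetical.

```python
def performance(csv_line: str) -> float:
    """Extract the performance figure, i.e. the last comma-separated field."""
    return float(csv_line.strip().split(",")[-1])


# Output lines copied from the two executions above.
sequential = "icelake,B3A2C0_opt_ukernel_par,4096,4096,4096,256,1280,128,4,16,4,32.46"
parallel = "icelake,B3A2C0_opt_ukernel_par,4096,4096,4096,256,1280,128,4,16,4,464.63"

speedup = performance(parallel) / performance(sequential)
print(f"speedup with 64 threads: {speedup:.1f}x")  # speedup with 64 threads: 14.3x
```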

Prototypes of the generators

      def basic_GEMM_B3A2C0(m, n, k, mc, nc, kc, mr, nr, kr, lanesize, dtype, target)
      def blocked_GEMM_B3A2C0(m, n, k, mc, nc, kc, mr, nr, kr, lanesize, dtype, target)
      def packed_GEMM_B3A2C0(m, n, k, mc, nc, kc, mr, nr, kr, lanesize, dtype, target)
      def opt_GEMM_B3A2C0_ukernel(m, n, k, mc, nc, kc, mr, nr, kr, lanesize, dtype, target)
      def opt_GEMM_B3A2C0_ukernel_parallel(m, n, k, mc, nc, kc, mr, nr, kr, lanesize, dtype, target)
      def opt_GEMM_C3A2B0_ukernel(m, n, k, mc, nc, kc, mr, nr, kr, lanesize, dtype, target)
      def opt_GEMM_A3C2B0_ukernel(m, n, k, mc, nc, kc, mr, nr, kr, lanesize, dtype, target)

Parameter  Type            Description
m          integer         Number of rows of C (rows of A)
n          integer         Number of columns of C (columns of B)
k          integer         Number of columns of A (rows of B)
mc         integer         Cache parameter. Number of rows of Ac
nc         integer         Cache parameter. Number of columns of Bc
kc         integer         Cache parameter. Number of columns of Ac (rows of Bc)
mr         integer         Micro-kernel parameter. Number of rows of Cr (rows of Ar)
nr         integer         Micro-kernel parameter. Number of columns of Cr (columns of Br)
kr         integer         Micro-kernel parameter. Number of rows of each micro-panel in Bc (not used in variant B3A2C0)
lanesize   integer         Number of elements (lanes) in a vector register
dtype      NumPy datatype  Datatype of the matrix elements
target     TVM target      Target description, following the convention in https://tvm.apache.org/docs/reference/api/python/target.html
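
To make the roles of m, n, k and of the cache parameters mc, nc, kc concrete, the following sketch performs a blocked matrix multiplication on plain Python lists. It is not one of the TVM generators: the function name blocked_gemm is hypothetical, the loop order jc, pc, ic follows the usual GotoBLAS/BLIS convention for a B3A2C0-style algorithm (an assumption here), and the micro-kernel level (mr, nr) is replaced by simple scalar loops.

```python
def blocked_gemm(A, B, m, n, k, mc, nc, kc):
    """Compute C = A * B on nested lists, blocking the loops by nc, kc, and mc."""
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):          # block of nc columns of B and C
        for pc in range(0, k, kc):      # block of kc columns of A / rows of B
            for ic in range(0, m, mc):  # block of mc rows of A and C
                # Ac is (at most) mc x kc and Bc is (at most) kc x nc; an
                # optimized code would hand this block to a micro-kernel.
                for i in range(ic, min(ic + mc, m)):
                    for j in range(jc, min(jc + nc, n)):
                        acc = 0.0
                        for p in range(pc, min(pc + kc, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C


# Tiny check: A is m x k, B is k x n, and C is m x n.
m, n, k = 3, 4, 5
A = [[float(i + p) for p in range(k)] for i in range(m)]
B = [[float(p * j + 1) for j in range(n)] for p in range(k)]
C = blocked_gemm(A, B, m, n, k, mc=2, nc=3, kc=2)
reference = [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
             for i in range(m)]
assert C == reference
```

The block sizes in the check deliberately do not divide the problem dimensions, so the min(...) bounds that handle the partial blocks at the edges are exercised.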