My Ramblings

Key Idea of BitNet (ternary edition). Matrix Multiply using Lookup Tables

This example is inspired by BitNet b1.58, which uses ternary weights: 1, 0, and +1. Actual packing structure and details aren't meant to be the same.

Lookup-based matrix multiplication is not a new idea, but it is useful here because it allows compact quantized weights to be used directly during CPU inference.

Key ideas:

In a follow-up post, we will show that the bigger tables shown here are actually feasible with AVX-512 and can get many times the performance of the AVX2 implementations in the official BitNet Repository.

Packing five ternary weights into one byte

It can be packed!

Five ternary weights have 35=243 possible combinations.

One byte has 28=256 possible values. This means one byte is enough to represent 5 trits.

Encoding a group of weights

Consider five weights:

[-1, 0, 1, 1, -1]

Treat them as the digits of a balanced ternary number. The place values are powers of three:

Position 0 1 2 3 4
Place value 81 27 9 3 1
Weight -1 0 1 1 -1

The packed value is

(1)(81)+(0)(27)+(1)(9)+(1)(3)+(1)(1)=70

Each weight is one of 1, 0, or 1. The packed code therefore ranges from 121 to 121, giving exactly 243 possible values.

Using the packed byte as a lookup index

Suppose the current group of five input values is

[1, 2, 3, 4, 5]

We build a table containing the dot product of these inputs with every possible five-weight pattern. A small section of that table looks like this:

Packed code Ternary weights Dot product
121 [-1, -1, -1, -1, -1] 15
120 [-1, -1, -1, -1, 0] 10
119 [-1, -1, -1, -1, 1] 5
...
1 [0, 0, 0, 0, -1] 5
0 [0, 0, 0, 0, 0] 0
1 [0, 0, 0, 0, 1] 5
...
119 [1, 1, 1, 1, -1] 5
120 [1, 1, 1, 1, 0] 10
121 [1, 1, 1, 1, 1] 15

Applying this to matrix multiplication

Let's walk through a small example: y=xW with 10 inputs and several output columns.

Setup

Inputs, in two groups of five:

x = [1 2 3 4 5 | 6 7 8 9 10]
      group 0      group 1

Weights, one column per output:

            col 0   col 1   col 2   col 3   col 4   col 5    ...
group 0:     -1       0       1       1       0      -1      ...
              0       0      -1       1       1       1
              1       0       1       1      -1       0
              1       0      -1       1       0       1
             -1       0       1       1       1       0
group 1:      1       1      -1      -1       1       0      ...
              1       1       0      -1       0      -1
              0       1       0      -1       1       1
             -1       1       0      -1       0       1
              0       1       1      -1      -1       0

Each five-weight block is packed into one byte, and these bytes are the only form of the weights kept in memory:

Packed codes col 0 col 1 col 2 col 3 col 4 col 5
group 0 70 0 61 121 19 51
group 1 105 121 80 121 89 15

Build one table per input group

When the input vector arrives, build the full 243-entry lookup table for each group of five inputs, exactly as in the previous section.

Table 0 contains the dot products with the group-0 inputs [1, 2, 3, 4, 5]:

Packed code Ternary weights Dot product
121 [-1, -1, -1, -1, -1] 15
...
70 [-1, 0, 1, 1, -1] 1
...
51 [-1, 1, 0, 1, 0] 5
...
0 [0, 0, 0, 0, 0] 0
...
19 [0, 1, -1, 0, 1] 4
...
61 [1, -1, 1, -1, 1] 3
...
121 [1, 1, 1, 1, 1] 15

Table 1 contains the dot products with the group-1 inputs [6, 7, 8, 9, 10]:

Packed code Ternary weights Dot product
121 [-1, -1, -1, -1, -1] 40
...
80 [-1, 0, 0, 0, 1] 4
...
15 [0, -1, 1, 1, 0] 10
...
89 [1, 0, 1, 0, -1] 4
...
105 [1, 1, 0, -1, 0] 4
...
121 [1, 1, 1, 1, 1] 40

Look up and accumulate

Now sweep across the packed bytes. Each byte costs one table lookup and one add:

group 0 codes: [-70    0   61  121   19  -51]  →  table 0 gives [ 1   0   3   15   4   5]
group 1 codes: [105  121  -80 -121   89  -15]  →  table 1 gives [ 4  40   4  -40   4  10]
                                                       sum  y = [ 5  40   7  -25   8  15]

Sanity check on column 0, the slow way:

1+0+3+45group 0+6+7+09+0group 1=1+4=5

The weights were never unpacked: each byte went straight from memory into a table index.

A few questions you might have

Isn't building a 243-entry table expensive? It's built once per input group and then reused by every output column. A real weight matrix has thousands of columns, so the build cost is negligible.

What about those negative indices? The table has a symmetry: flipping all five weights negates both the code and the dot product, so T[c]=T[c]. The implementation stores only the non-negative half and applies the sign separately.

How dense is the packing? 8 bits for 5 weights is 1.6 bits per weight, close to the theoretical floor of log231.585 — and denser than the 2.0 bits of a straightforward 2-bits-per-trit encoding.