Key Idea of BitNet (ternary edition). Matrix Multiply using Lookup Tables
This example is inspired by BitNet b1.58, which uses ternary weights: , , and . Actual packing structure and details aren't meant to be the same.
Lookup-based matrix multiplication is not a new idea, but it is useful here because it allows compact quantized weights to be used directly during CPU inference.
Key ideas:
- Multiple quantized weights can be packed into one byte, reducing storage and memory use.
- During matrix multiplication, the packed weights do not need to be unpacked. The byte can be used directly as an index into a lookup table.
In a follow-up post, we will show that the bigger tables shown here are actually feasible with AVX-512 and can get many times the performance of the AVX2 implementations in the official BitNet Repository.
Packing five ternary weights into one byte
It can be packed!
Five ternary weights have possible combinations.
One byte has possible values. This means one byte is enough to represent 5 trits.
Encoding a group of weights
Consider five weights:
[-1, 0, 1, 1, -1]
Treat them as the digits of a balanced ternary number. The place values are powers of three:
| Position | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Place value | 81 | 27 | 9 | 3 | 1 |
| Weight | -1 | 0 | 1 | 1 | -1 |
The packed value is
Each weight is one of , , or . The packed code therefore ranges from to , giving exactly 243 possible values.
Using the packed byte as a lookup index
Suppose the current group of five input values is
[1, 2, 3, 4, 5]
We build a table containing the dot product of these inputs with every possible five-weight pattern. A small section of that table looks like this:
| Packed code | Ternary weights | Dot product |
|---|---|---|
[-1, -1, -1, -1, -1] |
||
[-1, -1, -1, -1, 0] |
||
[-1, -1, -1, -1, 1] |
||
... |
||
[0, 0, 0, 0, -1] |
||
[0, 0, 0, 0, 0] |
||
[0, 0, 0, 0, 1] |
||
... |
||
[1, 1, 1, 1, -1] |
||
[1, 1, 1, 1, 0] |
||
[1, 1, 1, 1, 1] |
Applying this to matrix multiplication
Let's walk through a small example: with 10 inputs and several output columns.
Setup
Inputs, in two groups of five:
x = [1 2 3 4 5 | 6 7 8 9 10]
group 0 group 1
Weights, one column per output:
col 0 col 1 col 2 col 3 col 4 col 5 ...
group 0: -1 0 1 1 0 -1 ...
0 0 -1 1 1 1
1 0 1 1 -1 0
1 0 -1 1 0 1
-1 0 1 1 1 0
group 1: 1 1 -1 -1 1 0 ...
1 1 0 -1 0 -1
0 1 0 -1 1 1
-1 1 0 -1 0 1
0 1 1 -1 -1 0
Each five-weight block is packed into one byte, and these bytes are the only form of the weights kept in memory:
| Packed codes | col 0 | col 1 | col 2 | col 3 | col 4 | col 5 | |
|---|---|---|---|---|---|---|---|
| group 0 | |||||||
| group 1 |
Build one table per input group
When the input vector arrives, build the full 243-entry lookup table for each group of five inputs, exactly as in the previous section.
Table 0 contains the dot products with the group-0 inputs [1, 2, 3, 4, 5]:
| Packed code | Ternary weights | Dot product |
|---|---|---|
[-1, -1, -1, -1, -1] |
||
... |
||
[-1, 0, 1, 1, -1] |
||
... |
||
[-1, 1, 0, 1, 0] |
||
... |
||
[0, 0, 0, 0, 0] |
||
... |
||
[0, 1, -1, 0, 1] |
||
... |
||
[1, -1, 1, -1, 1] |
||
... |
||
[1, 1, 1, 1, 1] |
Table 1 contains the dot products with the group-1 inputs [6, 7, 8, 9, 10]:
| Packed code | Ternary weights | Dot product |
|---|---|---|
[-1, -1, -1, -1, -1] |
||
... |
||
[-1, 0, 0, 0, 1] |
||
... |
||
[0, -1, 1, 1, 0] |
||
... |
||
[1, 0, 1, 0, -1] |
||
... |
||
[1, 1, 0, -1, 0] |
||
... |
||
[1, 1, 1, 1, 1] |
Look up and accumulate
Now sweep across the packed bytes. Each byte costs one table lookup and one add:
group 0 codes: [-70 0 61 121 19 -51] → table 0 gives [ 1 0 3 15 4 5]
group 1 codes: [105 121 -80 -121 89 -15] → table 1 gives [ 4 40 4 -40 4 10]
sum y = [ 5 40 7 -25 8 15]
Sanity check on column 0, the slow way:
The weights were never unpacked: each byte went straight from memory into a table index.
A few questions you might have
Isn't building a 243-entry table expensive? It's built once per input group and then reused by every output column. A real weight matrix has thousands of columns, so the build cost is negligible.
What about those negative indices? The table has a symmetry: flipping all five weights negates both the code and the dot product, so . The implementation stores only the non-negative half and applies the sign separately.
How dense is the packing? 8 bits for 5 weights is 1.6 bits per weight, close to the theoretical floor of — and denser than the 2.0 bits of a straightforward 2-bits-per-trit encoding.