Compiler Internals

The SHDL compiler transforms Base SHDL into highly optimized C code. This page explains how the code generation works and the optimization techniques used.

Compilation Pipeline

Base SHDL
    │
    ▼
┌─────────────┐
│   Parser    │   Tokenize and parse Base SHDL
└─────────────┘
    │
    ▼
┌─────────────┐
│  Analyzer   │   Semantic analysis, build connection graph
└─────────────┘
    │
    ▼
┌─────────────┐
│  CodeGen    │   Generate optimized C code
└─────────────┘
    │
    ▼
  C Source
    │
    ▼
┌─────────────┐
│ C Compiler  │   clang/gcc compiles to native code
└─────────────┘
    │
    ▼
Shared Library

Bit-Packing Architecture

The key optimization in SHDL's compiler is bit-packed SIMD-style execution:

Gate Packing - All gates of the same type are packed into 64-bit integers
Lane Assignment - Each gate instance occupies one bit position (lane)
Parallel Evaluation - A single CPU operation evaluates up to 64 gates simultaneously
Chunking - If more than 64 gates of one type exist, multiple chunks are used

Why Bit-Packing?

Consider a 16-bit adder with 32 XOR gates. Without bit-packing, you'd need 32 separate operations. With bit-packing, all 32 XOR gates are evaluated with a single ^ operation.

// Without bit-packing (slow)
xor1_out = xor1_a ^ xor1_b;
xor2_out = xor2_a ^ xor2_b;
// ... 30 more lines

// With bit-packing (fast)
XOR_outputs = XOR_inputs_A ^ XOR_inputs_B;  // All 32 at once!

Generated Code Structure

State Structure

The compiler generates a State struct containing outputs of all gate families:

typedef struct {
    uint64_t XOR_O_0;   // Chunk 0 of XOR outputs (up to 64 gates)
    uint64_t XOR_O_1;   // Chunk 1 if > 64 XOR gates
    uint64_t AND_O_0;   // Chunk 0 of AND outputs
    uint64_t OR_O_0;    // Chunk 0 of OR outputs
    uint64_t NOT_O_0;   // Chunk 0 of NOT outputs
} State;

Rules:

Only gate types that appear in the circuit are included
Each chunk holds up to 64 gate outputs
Chunk naming: <GATE_TYPE>_O_<chunk_index>

The `tick()` Function

The core simulation function computes the next state:

static inline State tick(State s, uint64_t A, uint64_t B, uint64_t Cin) {
    State n = s;  // Copy current state
    
    // 1. Build input vectors for each gate family
    // 2. Evaluate all gates in parallel
    // 3. Store results in next state
    
    return n;
}

Input Vector Construction

For each gate family, build packed input vectors by gathering bits:

uint64_t XOR_0_A = 0ull;
// Gather from component input A, bit 0, into lane 0
XOR_0_A |= ((uint64_t)-( ((A >> 0) & 1u) )) & 0x0000000000000001ull;
// Gather from component input A, bit 1, into lane 2
XOR_0_A |= ((uint64_t)-( ((A >> 1) & 1u) )) & 0x0000000000000004ull;
// Gather from XOR output bit 0, into lane 1
XOR_0_A |= ((uint64_t)-( ((s.XOR_O_0 >> 0) & 1u) )) & 0x0000000000000002ull;

The Bit-Spreading Idiom

((uint64_t)-( bit_value ))

This branchless technique:

Extracts a single bit: (value >> bit_pos) & 1u
Converts to all-ones or all-zeros: (uint64_t)-(bit_value)
- If bit is 1: result is 0xFFFFFFFFFFFFFFFF
- If bit is 0: result is 0x0000000000000000
Masks to target lane: & lane_mask

This is highly efficient on modern CPUs—no branches, pure arithmetic.

Gate Evaluation

After building input vectors, evaluate all gates with a single operation:

// XOR gates: O = A ^ B
n.XOR_O_0 = (XOR_0_A ^ XOR_0_B) & active_lanes_mask;

// AND gates: O = A & B
n.AND_O_0 = (AND_0_A & AND_0_B) & active_lanes_mask;

// OR gates: O = A | B
n.OR_O_0 = (OR_0_A | OR_0_B) & active_lanes_mask;

// NOT gates: O = ~A
n.NOT_O_0 = (~NOT_0_A) & active_lanes_mask;

The active_lanes_mask ensures only valid lanes are set when there are fewer than 64 gates in a chunk.

Constant Gates

__VCC__ and __GND__ gates are special:

__VCC__: Output is always 1 in its lane
__GND__: Output is always 0 in its lane

These are handled by pre-computing constant masks at compile time.

Library API

The generated C library exposes these functions:

`void reset(void)`

Resets all internal state to zero:

void reset(void) {
    memset(&dut, 0, sizeof(dut));
}

`void poke(const char *signal_name, uint64_t value)`

Sets an input signal value:

void poke(const char *signal_name, uint64_t value) {
    if (strcmp(signal_name, "A") == 0) {
        dut.input_A = value & 0xFFull;  // Mask to port width
    } else if (strcmp(signal_name, "B") == 0) {
        dut.input_B = value & 0xFFull;
    }
    // ...
}

`uint64_t peek(const char *signal_name)`

Reads current value of any signal:

uint64_t peek(const char *signal_name) {
    if (strcmp(signal_name, "A") == 0) return dut.input_A;
    
    ensure_outputs();  // Compute if needed
    
    if (strcmp(signal_name, "Sum") == 0) return dut.sum;
    // ...
}

`void step(int cycles)`

Advances simulation by N cycles:

void step(int cycles) {
    for (int i = 0; i < cycles; ++i) {
        dut.current = tick(dut.current, dut.input_A, dut.input_B, dut.input_Cin);
    }
    // Update cached outputs
    dut.sum = extract_Sum(&dut.current);
    dut.cout = extract_Cout(&dut.current);
}

Lane Assignment Strategy

The compiler assigns each gate instance to a specific lane:

Group by type - Collect all AND gates, all XOR gates, etc.
Assign lanes sequentially - Gate 0 → lane 0, Gate 1 → lane 1, etc.
Create chunks - Every 64 gates starts a new chunk
Record mapping - Track which lane holds which gate for I/O extraction

Example for a 16-bit adder with 32 XOR gates:

fa1_x1 → XOR_O_0, lane 0
fa1_x2 → XOR_O_0, lane 1
fa2_x1 → XOR_O_0, lane 2
fa2_x2 → XOR_O_0, lane 3
...
fa16_x2 → XOR_O_0, lane 31

Output Extraction

Generated functions extract component outputs from packed state:

static inline uint64_t extract_Sum(const State *s) {
    return (((s->XOR_O_0 >> 1) & 1ull) << 0)   // Sum[1] from lane 1
         | (((s->XOR_O_0 >> 3) & 1ull) << 1)   // Sum[2] from lane 3
         | (((s->XOR_O_0 >> 5) & 1ull) << 2)   // Sum[3] from lane 5
         // ...
         ;
}

Two-Phase Update Model

The simulator uses a two-phase model for correct timing:

Compute Phase - tick() computes next state from current state + inputs
Commit Phase - step() commits the computed state

This ensures:

Combinational loops are handled correctly
Sequential logic with feedback works properly
Multiple poke() calls can be made before evaluation

Complete Example

For a simple buffer component:

Input (Base SHDL):

component Buffer(A) -> (B) { 
    n1: NOT; 
    n2: NOT; 
    
    connect { 
        A -> n1.A; 
        n1.O -> n2.A; 
        n2.O -> B; 
    } 
}

Output (C code):

#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t NOT_O_0;
} State;

static inline State tick(State s, uint64_t A) {
    State n = s;
    
    // NOT gate inputs
    uint64_t NOT_0_A = 0ull;
    NOT_0_A |= ((uint64_t)-( (A & 1u) )) & 0x1ull;
    NOT_0_A |= ((uint64_t)-( ((s.NOT_O_0 >> 0) & 1u) )) & 0x2ull;
    
    // Evaluate NOT gates
    n.NOT_O_0 = (~NOT_0_A) & 0x3ull;  // 2 active lanes
    
    return n;
}

static inline uint64_t extract_B(const State *s) {
    return (s->NOT_O_0 >> 1) & 1ull;  // B from lane 1
}

// ... rest of API implementation

Using the Compiler

From Python

from SHDL import compile_shdl_file

# Generate C code
c_code = compile_shdl_file("adder16.shdl")
print(c_code)

# Or compile directly to a loadable circuit
from SHDL import Circuit

circuit = Circuit("adder16.shdl")

From Command Line

# Generate C code
shdlc adder16.shdl -o adder16.c

# Compile to shared library
shdlc adder16.shdl -c -o libadder16.dylib

# Debug build with introspection support
shdlc -g adder16.shdl -c -o libadder16.dylib

Debug Builds

The compiler supports debug builds with the -g flag, which add:

Gate name table - Maps gate names to their bit positions
peek_gate() function - Read any internal gate output by name
Cycle counter - Track simulation time
.shdb file - JSON metadata with hierarchy and source mapping

See Debug Build Reference for complete documentation.

Compilation Pipeline​

Bit-Packing Architecture​

Why Bit-Packing?​

Generated Code Structure​

State Structure​

The tick() Function​

Input Vector Construction​

The Bit-Spreading Idiom​

Gate Evaluation​

Constant Gates​

Library API​

void reset(void)​

void poke(const char *signal_name, uint64_t value)​

uint64_t peek(const char *signal_name)​

void step(int cycles)​

Lane Assignment Strategy​

Output Extraction​

Two-Phase Update Model​

Complete Example​

Using the Compiler​

From Python​

From Command Line​

Debug Builds​