Performance

Part 1: The Cache — Your Best Friend

The Numbers That Matter

┌─────────────────────────────────────────────────────────────┐
│   SH-4 MEMORY HIERARCHY                                     │
│                                                             │
│   Registers:     0 cycles (instant)                         │
│   L1 Cache:      1-2 cycles (~10 ns)                        │
│   Main RAM:      10-20 cycles (~100 ns)                     │
│   CD-ROM:        millions of cycles (200+ ms)               │
│                                                             │
│   Cache miss = 10-20× SLOWER than cache hit!                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Cache Lines: The Free Lunch

When you read one byte from RAM, the CPU doesn’t fetch just that byte. It fetches a whole cache line — 32 bytes on SH-4.

You ask for array[0]:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │  ← All 32 bytes loaded!
└────┴────┴────┴────┴────┴────┴────┴────┘
  ▲
  You wanted this one

With 4-byte elements, one line holds 8 of them, so the next 7 accesses are FREE! They're already in cache.

Sequential Access: The Fast Path

// FAST: Sequential access — 125 elements
sum := 0
for i := 0; i < 125; i++ {
    sum += array[i]
}

What happens:

Access array[0] → Cache miss, load 32 bytes
Access array[1] → Cache HIT (free!)
Access array[2] → Cache HIT (free!)
...
Access array[7] → Cache HIT (free!)
Access array[8] → Cache miss, load next 32 bytes
...

Total cache misses: 125 / 8 = ~16

Strided Access: The Slow Path

// SLOW: Strided access (every 8th element) — also 125 elements
sum := 0
for i := 0; i < 1000; i += 8 {
    sum += array[i]
}

What happens:

Access array[0]   → Cache miss
Access array[8]   → Cache miss (different cache line!)
Access array[16]  → Cache miss
Access array[24]  → Cache miss
...
Access array[992] → Cache miss

Total cache misses: 125 (EVERY access misses!)

Same number of additions (125), but the strided loop takes 125 cache misses instead of ~16. That is roughly 8× as many trips to main RAM, and those trips dominate the running time.
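The miss counts above can be checked with a small model. This is a minimal sketch, not a cycle-accurate simulation: it assumes 4-byte elements, a cold cache, and an array that starts on a cache-line boundary, and the name countLineMisses is purely illustrative.

```go
package main

const (
	cacheLineBytes = 32 // SH-4 cache line size, as stated above
	elemBytes      = 4  // assumed element size (for example int32 or float32)
)

// countLineMisses visits indices 0, stride, 2*stride, ... below n and counts
// how many accesses land on a different cache line than the previous access.
// It models a cold cache and ignores evictions and alignment.
func countLineMisses(n, stride int) (accesses, misses int) {
	lastLine := -1
	for i := 0; i < n; i += stride {
		line := (i * elemBytes) / cacheLineBytes
		if line != lastLine {
			misses++
			lastLine = line
		}
		accesses++
	}
	return accesses, misses
}

func main() {
	a, m := countLineMisses(125, 1)
	println("sequential:", a, "accesses,", m, "misses")
	a, m = countLineMisses(1000, 8)
	println("strided:   ", a, "accesses,", m, "misses")
}
```

Under these assumptions the sequential walk reports 16 misses for 125 accesses and the strided walk reports 125 misses for 125 accesses, matching the totals in the two walkthroughs.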

The Practical Lesson

┌─────────────────────────────────────────────────────────────┐
│   CACHE-FRIENDLY PATTERNS                                   │
│                                                             │
│   ✓ Process arrays left-to-right                            │
│   ✓ Keep related data together (struct of arrays)           │
│   ✓ Avoid pointer-chasing (linked lists are slow!)          │
│   ✓ Small, tight loops                                      │
│                                                             │
│   ✗ Random access patterns                                  │
│   ✗ Large structs with rarely-used fields                   │
│   ✗ Jumping around memory                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
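To make the "struct of arrays" and "large structs" items concrete, here is a hedged sketch comparing two layouts. The particle types and the Label field are hypothetical and exist only for illustration; the point is that the hot loop over X touches far less memory in the second layout.

```go
package main

// ParticleAoS is a hypothetical array-of-structs element. It is 32 bytes
// (12 bytes of position plus a 20-byte label), the size of one SH-4 cache
// line, so a loop that only reads X still pulls the cold Label into cache.
type ParticleAoS struct {
	X, Y, Z float32
	Label   [20]byte // rarely used in the hot loop
}

// ParticlesSoA is the struct-of-arrays layout: every field gets its own
// densely packed slice, so a loop over X reads only X values.
type ParticlesSoA struct {
	X, Y, Z []float32
	Label   [][20]byte
}

// sumXAoS reads one X per 32-byte element.
func sumXAoS(ps []ParticleAoS) float32 {
	var sum float32
	for i := range ps {
		sum += ps[i].X
	}
	return sum
}

// sumXSoA reads eight X values per 32-byte cache line.
func sumXSoA(ps *ParticlesSoA) float32 {
	var sum float32
	for _, x := range ps.X {
		sum += x
	}
	return sum
}

func main() {
	const n = 64
	aos := make([]ParticleAoS, n)
	soa := ParticlesSoA{
		X:     make([]float32, n),
		Y:     make([]float32, n),
		Z:     make([]float32, n),
		Label: make([][20]byte, n),
	}
	for i := 0; i < n; i++ {
		aos[i].X = float32(i)
		soa.X[i] = float32(i)
	}
	println(sumXAoS(aos), sumXSoA(&soa))
}
```

Both functions compute the same sum. The difference is only in how many cache lines the loop has to load, so measure on hardware before restructuring real data.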

Part 2: The Float64 Trap

The Shocking Truth

Go defaults to float64 for floating-point numbers:

x := 3.14  // This is float64!

On a modern PC, float64 and float32 are about the same speed. On SH-4?

┌─────────────────────────────────────────────────────────────┐
│   FLOAT PERFORMANCE ON SH-4                                 │
│                                                             │
│   float32:  Hardware accelerated, FAST                      │
│             One instruction, one cycle                      │
│                                                             │
│   float64:  Software emulation, SLOW                        │
│             Multiple instructions, 10-20× slower!           │
│                                                             │
│   A physics simulation using float64 could run              │
│   at 6 FPS instead of 60 FPS. That's the difference.        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Fix

Be explicit about float32:

// SLOW
x := 3.14           // float64 by default!
y := x * 2.0        // float64 math

// FAST
var x float32 = 3.14  // Explicit float32
y := x * 2.0          // float32 math

For game physics, positions, velocities — always use float32.
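One way to keep float64 from sneaking in is to put float32 into the types themselves. The sketch below uses a hypothetical Vec2 type and integrate function; it relies only on the standard Go rule that an untyped constant takes the type of the float32 operand it is combined with.

```go
package main

// Vec2 stores its components as float32, so arithmetic on them stays in
// single precision.
type Vec2 struct{ X, Y float32 }

// integrate advances pos by vel*dt. Because dt is declared float32, callers
// cannot pass a float64 variable without an explicit conversion.
func integrate(pos, vel Vec2, dt float32) Vec2 {
	return Vec2{pos.X + vel.X*dt, pos.Y + vel.Y*dt}
}

func main() {
	// Untyped constant: it becomes float32 when passed to integrate.
	const dt = 1.0 / 60.0

	p := integrate(Vec2{0, 0}, Vec2{60, 120}, dt)
	println(p.X, p.Y)

	// A value that is already float64 must be converted explicitly;
	// passing d directly would be a compile error.
	d := 0.5 // float64 by default
	q := integrate(Vec2{1, 2}, Vec2{4, 8}, float32(d))
	println(q.X, q.Y)
}
```

The compile error on a mixed float32/float64 expression is useful here: it points at every place where a float64 value would otherwise enter the physics code.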


Part 3: What We Deliberately Left Out

“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” — Antoine de Saint-Exupéry

libgodc is not a complete Go implementation. That’s intentional. Here’s what we cut and why:

Omission 1: Full Reflection

Standard Go: Every type carries metadata — field names, method signatures, struct tags. This enables reflect and fancy JSON marshaling.

Cost: Binary size can double.

libgodc: Basic reflection only. Enough for println to work.

What you lose:

reflect.MakeFunc(...)     // NOT SUPPORTED
json.Marshal(myStruct)    // NOT SUPPORTED (would need full reflection)

What you do instead: Write explicit serialization. Use code generators.
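As an illustration of explicit serialization, the following sketch encodes a hypothetical SaveData record into a fixed 6-byte little-endian layout by hand. The type, its fields, and the layout are all assumptions made for the example; nothing here is part of libgodc.

```go
package main

// SaveData is a hypothetical save-game record.
type SaveData struct {
	Level uint8
	Score uint32
	Lives uint8
}

// saveDataSize is the encoded size: 1 + 4 + 1 bytes.
const saveDataSize = 6

// Encode writes the fields into buf in a fixed order, with Score stored
// little-endian. It reports false if buf is too small.
func (s SaveData) Encode(buf []byte) bool {
	if len(buf) < saveDataSize {
		return false
	}
	buf[0] = s.Level
	buf[1] = byte(s.Score)
	buf[2] = byte(s.Score >> 8)
	buf[3] = byte(s.Score >> 16)
	buf[4] = byte(s.Score >> 24)
	buf[5] = s.Lives
	return true
}

// DecodeSaveData is the inverse of Encode. It reports false if buf is
// too small to hold a record.
func DecodeSaveData(buf []byte) (SaveData, bool) {
	if len(buf) < saveDataSize {
		return SaveData{}, false
	}
	score := uint32(buf[1]) | uint32(buf[2])<<8 |
		uint32(buf[3])<<16 | uint32(buf[4])<<24
	return SaveData{Level: buf[0], Score: score, Lives: buf[5]}, true
}

func main() {
	var buf [saveDataSize]byte
	in := SaveData{Level: 3, Score: 123456, Lives: 2}
	in.Encode(buf[:])
	out, ok := DecodeSaveData(buf[:])
	println(ok, out.Level, out.Score, out.Lives)
}
```

Hand-written encoders like this need no type metadata at run time, and the byte layout is fixed by the code rather than by struct tags.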

Omission 2: Finalizers

Standard Go:

runtime.SetFinalizer(obj, func(o *MyType) {
    o.cleanup()  // Runs when GC collects obj
})

The problem: Finalizers are a nightmare for GC:

  • Objects can be resurrected
  • Run order is undefined
  • Timing is unpredictable
  • They complicate the GC significantly

libgodc: No finalizers.

What you do instead: Use defer for cleanup:

func process() {
    resource := acquire()
    defer resource.Release()  // Always runs!
    // ... use resource ...
}

Omission 3: Preemptive Scheduling

Standard Go: The runtime can interrupt a goroutine at almost any point.

libgodc: Goroutines must yield voluntarily.

// THIS FREEZES THE SYSTEM
for {
    // Infinite loop, never yields
    // No other goroutine will EVER run
}

// THIS IS FINE
for {
    doWork()
    runtime.Gosched()  // "Let others run"
}

Why we did this: Preemption requires safe points, stack inspection, and signal handling. That is a lot of complexity for little benefit on a single-CPU system.
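Calling runtime.Gosched() on every iteration can be wasteful in a tight loop. A common compromise, sketched below, is to yield once per batch of work. The batch size yieldEvery and the function sumWithYields are hypothetical; pick a batch size by measuring.

```go
package main

import "runtime"

// yieldEvery is a hypothetical batch size: how many elements to process
// between voluntary yields.
const yieldEvery = 256

// sumWithYields adds up data and calls runtime.Gosched() after every
// yieldEvery elements so other goroutines get a chance to run under a
// cooperative scheduler. It also returns how many times it yielded.
func sumWithYields(data []int32) (sum int32, yields int) {
	for i, v := range data {
		sum += v
		if (i+1)%yieldEvery == 0 {
			runtime.Gosched()
			yields++
		}
	}
	return sum, yields
}

func main() {
	data := make([]int32, 1000)
	for i := range data {
		data[i] = 1
	}
	sum, yields := sumWithYields(data)
	println("sum:", sum, "yields:", yields)
}
```

For 1000 elements and a batch of 256 the loop yields three times, after elements 256, 512, and 768.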

Omission 4: Concurrent GC

Standard Go:

Your code:    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
GC:                ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
              Both run in parallel!
              Pause: < 1ms

libgodc:

Your code:    ░░░░░░░░░░████████████░░░░░░░░
GC:                     ▓▓▓▓▓▓▓▓▓▓▓▓
              EVERYTHING STOPS during GC
              Pause: 5-20ms

Why we did this: Concurrent GC requires write barriers, atomic operations, and careful synchronization. Stop-the-world is simpler and predictable.

What you do: Keep live data small. Trigger GC between frames or during loading.
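One way to keep the collector's workload small is to allocate once and reuse. The sketch below is a fixed-size object pool for a hypothetical Bullet type; the names and the capacity of 64 are assumptions for illustration. Spawning and releasing bullets creates no garbage, so the per-frame code gives the stop-the-world collector nothing new to do.

```go
package main

// Bullet is a hypothetical game object.
type Bullet struct {
	X, Y, VX, VY float32
	Active       bool
}

// maxBullets is an assumed upper bound on simultaneously live bullets.
const maxBullets = 64

// BulletPool holds every bullet in one fixed array that is allocated once.
type BulletPool struct {
	items [maxBullets]Bullet
}

// Spawn reuses the first inactive slot. It returns nil when the pool is
// full instead of allocating a new object.
func (p *BulletPool) Spawn(x, y, vx, vy float32) *Bullet {
	for i := range p.items {
		b := &p.items[i]
		if !b.Active {
			*b = Bullet{X: x, Y: y, VX: vx, VY: vy, Active: true}
			return b
		}
	}
	return nil
}

// Release marks a slot as free so a later Spawn can reuse it.
func (p *BulletPool) Release(b *Bullet) {
	b.Active = false
}

// ActiveCount returns the number of slots currently in use.
func (p *BulletPool) ActiveCount() int {
	n := 0
	for i := range p.items {
		if p.items[i].Active {
			n++
		}
	}
	return n
}

func main() {
	var pool BulletPool
	b := pool.Spawn(0, 0, 1, 0)
	pool.Spawn(0, 10, 1, 0)
	println("active:", pool.ActiveCount())
	pool.Release(b)
	println("active:", pool.ActiveCount())
}
```

The trade-off is a hard cap on live objects and a linear scan in Spawn, which is usually acceptable for small pools.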

The Trade-off Table

Feature        What We Chose               Why
-------------  --------------------------  ------------------------
GC             Semi-space, stop-the-world  Simple, no fragmentation
Scheduling     Cooperative, M:1            No locks, predictable
Panic/Recover  setjmp/longjmp              No DWARF unwinding
Reflection     Minimal                     Binary size
Preemption     None                        Simplicity
C interop      Direct linking              No CGo complexity

Our philosophy: Predictability over throughput. Simplicity over features.


Part 4: When to Optimize

The Golden Question

Before optimizing anything, ask:

“Have I measured this?”

If the answer is no, stop. You’re guessing. And programmers are notoriously bad at guessing where time is spent.

The 90/10 Rule

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   90% of execution time is spent in 10% of the code         │
│                                                             │
│   That means:                                               │
│   • 90% of your code DOESN'T MATTER for performance         │
│   • Optimizing the wrong code = wasted effort               │
│   • Always measure first!                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

DO Optimize

  • Code that runs every frame (game loop, rendering)
  • Hot loops with thousands of iterations
  • Code that measurements show is slow

DON’T Optimize

  • Code that runs once (startup, level load)
  • Code that runs rarely (menu navigation)
  • Code you haven’t measured
  • At the cost of readability

How to Measure

//extern timer_us_gettime64
func timerUsGettime64() uint64

func measureGameLoop() {
    start := timerUsGettime64()
    
    updatePhysics()
    physicsTime := timerUsGettime64() - start
    
    renderStart := timerUsGettime64()
    renderFrame()
    renderTime := timerUsGettime64() - renderStart
    
    println("Physics:", physicsTime, "us")
    println("Render:", renderTime, "us")
}

Now you know where time actually goes!


Part 5: The Debug Build System

Production vs Debug

By default, libgodc is silent. Zero debug output, zero overhead.

# Production build (default)
make && make install

# Debug build - enables debug output and assertions
make DEBUG=3 && make install

The Performance Tax of Debug Output

┌─────────────────────────────────────────────────────────────┐
│   OPERATION          Production     DEBUG=3                 │
│                                                             │
│   Goroutine spawn    50 μs          188,000 μs (188 ms!)    │
│   Channel send       19 μs          ~50,000 μs              │
│   GC pause           21 ms          ~500 ms                 │
│                                                             │
│   Debug output is EXTREMELY EXPENSIVE!                      │
│   Never benchmark with DEBUG enabled.                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Debug Macros

Instead of raw printf, use these macros:

Macro               Use For             Example
------------------  ------------------  --------------------
LIBGODC_TRACE()     General tracing     Scheduler events
LIBGODC_WARNING()   Non-fatal issues    Large allocations
LIBGODC_ERROR()     Recoverable errors  Failed operations
LIBGODC_CRITICAL()  Fatal errors        Logged to crash dump
GC_TRACE()          GC-specific         Collection details

In production (DEBUG=0): All macros compile to nothing. Zero cost.

In debug (DEBUG=3): Output includes labels:

[godc:main] Scheduling G 42 (status=1)
[godc:main] WARNING: Large allocation 256 KB
[GC] #3: 1024->512 (50% survived) in 21045 us

Using Debug Macros

In C runtime code:

#include "runtime.h"

void my_function(void) {
    LIBGODC_TRACE("Entering my_function");
    
    if (error_condition) {
        LIBGODC_WARNING("Something unexpected: %d", value);
    }
    
    LIBGODC_TRACE("my_function complete");
}

In Go code, use println:

const DEBUG = false  // Set to true when debugging

func debugPrint(msg string) {
    if DEBUG {
        println(msg)
    }
}

Debug Functions Available

When investigating issues, you can call these:

gc_dump_stats();       // Print GC statistics
gc_verify_heap();      // Check heap integrity
gc_print_object(ptr);  // Print object details
gc_dump_heap(10);      // Dump first 10 heap objects

Real Benchmark Results

We ran these benchmarks on actual Dreamcast hardware. These numbers should guide your optimization decisions.

PVRMark: Go vs Native C

We ran the KOS pvrmark benchmark (flat-shaded triangles, no textures) on real Dreamcast hardware to measure Go runtime overhead:

Metric            C Native    Go (default)  Go (GODC_FAST)
----------------  ----------  ------------  --------------
Peak polys/frame  17,533      13,833        14,333
Peak pps          ~1,054,097  ~831,714      ~860,532
vs C performance  100%        79%           82%
Binary size       314 KB      614 KB        614 KB

┌─────────────────────────────────────────────────────────────┐
│   POLYGON THROUGHPUT (polys/frame @ 60fps)                  │
│                                                             │
│   C Native:      ████████████████████████████████████ 17,533│
│   Go Optimized:  ████████████████████████████        14,333 │
│   Go Default:    ██████████████████████████          13,833 │
│                                                             │
│   GODC_FAST=1 adds +500 polys/frame (+3.6%)                 │
│   Go achieves 82% of C polygon throughput                   │
└─────────────────────────────────────────────────────────────┘

Analysis:

  • The 18-21% overhead (18% with GODC_FAST=1, 21% by default) comes from bounds checking, slice header overhead, and gccgo code generation differences (not FFI — //extern compiles to direct jsr calls)
  • GODC_FAST=1 improves performance by ~3.6% via aggressive optimization
  • For real games with textures, lighting, and game logic, this difference is negligible
  • 14,333 flat-shaded triangles at 60fps is plenty for actual gameplay
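The percentages quoted above follow directly from the peak polys/frame figures. The short sketch below redoes that arithmetic with integer math; the helper names are illustrative only.

```go
package main

// roundedPercent returns part/whole as a percentage, rounded to the
// nearest whole number.
func roundedPercent(part, whole int) int {
	return (part*100 + whole/2) / whole
}

// gainPermille returns the relative gain from base to improved in tenths
// of a percent, truncated.
func gainPermille(base, improved int) int {
	return (improved - base) * 1000 / base
}

func main() {
	const (
		cNative   = 17533 // peak polys/frame, C
		goDefault = 13833 // peak polys/frame, Go default build
		goFast    = 14333 // peak polys/frame, Go with GODC_FAST=1
	)
	println("Go default vs C:", roundedPercent(goDefault, cNative), "%")
	println("Go fast vs C:   ", roundedPercent(goFast, cNative), "%")
	println("GODC_FAST gain: ", gainPermille(goDefault, goFast), "tenths of a percent")
}
```

This gives 79% and 82% of C throughput, and a GODC_FAST gain of 36 tenths of a percent, which is the ~3.6% quoted in the analysis.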

What the extra 300KB binary size buys you:

  • Garbage collection
  • Goroutines and channels
  • Defer/panic/recover
  • Type safety and bounds checking
  • Full Go standard library support

Compiler Optimization Flags

The godc build command uses these SH-4 specific optimizations:

Flag            Effect                                 Default
--------------  -------------------------------------  --------------
-O2             Standard optimization                  Yes
-m4-single      Single-precision FPU mode              Yes
-mfsrra         Hardware reciprocal sqrt (10× faster)  Yes
-mfsca          Hardware sin/cos (10× faster)          Yes
-O3             Aggressive optimization                GODC_FAST only
-ffast-math     Fast FP (breaks IEEE)                  GODC_FAST only
-funroll-loops  Loop unrolling                         GODC_FAST only

To enable aggressive optimizations:

GODC_FAST=1 godc build

Warning: -ffast-math breaks IEEE floating point compliance. NaN and infinity handling may not work correctly. Use only for games where FP precision isn’t critical.