
libgodc

Welcome to libgodc — a minimal Go runtime implementation for the Sega Dreamcast.

This project brings the Go programming language to a 1998 game console with 16MB of RAM, a 200MHz SH4 processor, and absolutely no operating system to speak of. It’s an exercise in constraints, a love letter to retro hardware, and a deep dive into how programming languages actually work under the hood.

What is libgodc?

libgodc replaces the standard Go runtime (libgo) with one designed for the Dreamcast’s unique constraints:

| Feature   | Desktop Go          | libgodc                  |
|-----------|---------------------|--------------------------|
| Memory    | Gigabytes           | 16 MB total              |
| CPU       | Multi-core, GHz     | Single-core, 200 MHz     |
| Scheduler | Preemptive          | Cooperative              |
| GC        | Concurrent tricolor | Stop-the-world semispace |
| Stacks    | Growable            | Fixed 64 KB              |

Despite these differences, you write normal Go code. Goroutines work. Channels work. Maps, slices, interfaces — they all work. The magic is in the runtime.

Who is this for?

  • Systems programmers curious about runtime implementation
  • Go developers who want to understand what happens below go run
  • Retro enthusiasts who think game consoles deserve modern languages
  • Anyone who enjoys the challenge of severe constraints

Prerequisites

Before diving in, you should be comfortable with:

| Skill        | Level        | Why You Need It                                      |
|--------------|--------------|------------------------------------------------------|
| Go           | Intermediate | Variables, functions, structs, goroutines, channels  |
| C            | Basic        | Pointers, memory layout, basic syntax                |
| Command line | Comfortable  | Building, running, navigating directories            |

You don’t need to know:

  • Assembly language (we’ll explain what you need)
  • Dreamcast hardware (KallistiOS handles the hard parts)
  • Garbage collection algorithms (we’ll build one together)
  • Operating system internals (we’ll cover what’s relevant)

If you can write a Go program that uses goroutines and channels, and you know what a pointer is in C, you’re ready.

What’s in this book?

Getting Started

Installation, toolchain setup, and your first Dreamcast Go program.

The Book

A complete walkthrough of building a Go runtime from scratch:

  • Memory allocation and garbage collection
  • Goroutine scheduling without threads
  • Channel implementation
  • Panic, defer, and recover
  • Building real games

Reference

Technical documentation for daily use:

  • API design
  • Best practices
  • Hardware integration
  • Known limitations

Quick Example

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    
    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()
        kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
        // draw stuff here
        kos.PvrListFinish()
        kos.PvrSceneFinish()
    }
}

This runs on a Dreamcast. Real hardware. 1998 technology. Go code.

Getting Started

Ready to begin? Head to the Installation page.

Or if you want to understand the full journey, start with Building From Nothing.

"Console development is the art of saying 'no' to malloc."

Installation

Requirements

  • A Unix-like system (Linux, macOS, WSL2)
  • 4GB disk space for the toolchain
  • An x86_64 or arm64 host
  • Go 1.25.3 or later — required to install the godc CLI tool
  • make — required for building projects
  • git — required for toolchain setup and updates

Quick Start

The godc tool automates everything:

go install github.com/drpaneas/godc@latest
godc setup

This downloads the prebuilt toolchain to ~/dreamcast and configures your environment. Run godc doctor to verify the installation.

godc Commands

| Command       | Description                                |
|---------------|--------------------------------------------|
| godc setup    | Install entire toolchain from scratch      |
| godc config   | Configure paths and settings               |
| godc init     | Create project files in current directory  |
| godc build    | Compile your game                          |
| godc run      | Build and run in emulator                  |
| godc run --ip | Build and run on real Dreamcast via BBA    |
| godc clean    | Remove build artifacts                     |
| godc doctor   | Check if everything is installed           |
| godc update   | Update libgodc to latest version           |
| godc env      | Show current paths                         |
| godc version  | Print godc version                         |

Configuration

godc stores its config in ~/.config/godc/config.toml:

Path = "/home/user/dreamcast"    # Toolchain location
Emu = "flycast"                  # Default emulator
IP = "192.168.2.203"             # Dreamcast IP for dc-tool

To update settings interactively:

godc config

Manual Installation

If the automated setup doesn’t work for your environment:

Step 1: Get the Toolchain

Download the prebuilt toolchain for your platform:

# Linux x86_64
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-linux-x86_64.tar.gz

# Linux arm64 (aarch64)
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-linux-arm64.tar.gz

# macOS arm64 (Apple Silicon)
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-darwin-arm64.tar.gz

Step 2: Extract

mkdir -p ~/dreamcast
tar -xf dreamcast-toolchain-*.tar.gz -C ~/dreamcast --strip-components=1

The toolchain contains:

~/dreamcast/
├── sh-elf/           # Cross-compiler (sh-elf-gccgo, binutils)
├── kos/              # KallistiOS (OS, drivers, headers)
├── libgodc/          # This library (Go runtime)
└── tools/            # Utilities (elf2bin, makeip, etc.)

Step 3: Environment

Add these to your shell configuration (~/.bashrc, ~/.zshrc, etc.):

export PATH="$HOME/dreamcast/sh-elf/bin:$PATH"
source ~/dreamcast/kos/environ.sh

environ.sh sets KOS_BASE, KOS_ARCH, and other build variables.

Step 4: Verify

sh-elf-gccgo --version
# Should print: sh-elf-gccgo (GCC) 15.1.0 ...

ls $KOS_BASE/lib/libgodc.a
# Should exist

Building libgodc from Source

If you need to modify the runtime, or if prebuilt libraries aren’t available:

git clone https://github.com/drpaneas/libgodc ~/dreamcast/libgodc
cd ~/dreamcast/libgodc
source ~/dreamcast/kos/environ.sh
make clean
make
make install

This builds libgodc.a (the runtime) and libgodcbegin.a (startup code), then installs them to $KOS_BASE/lib/.

Debug Build

For development, enable debug output:

make DEBUG=1

This adds -DLIBGODC_DEBUG=1 -g to the compiler flags, enabling trace output and symbols.

Running Code

Emulator

lxdream-nitro or flycast can run Dreamcast binaries.

cd examples/hello
make
flycast hello.elf

Real Hardware

With a Broadband Adapter or serial cable:

# Upload via IP (BBA)
dc-tool-ip -t 192.168.1.100 -x hello.elf

# Upload via serial
dc-tool-ser -t /dev/ttyUSB0 -x hello.elf

The godc run command automates this:

godc run              # Uses configured emulator
godc run --ip         # Uses dc-tool-ip with configured address

Project Structure

A minimal project:

myproject/
├── go.mod            # Module definition
├── main.go           # Your code
├── .Makefile         # Build rules (generated by godc)
└── romdisk/          # Optional: game assets
    ├── texture.png
    └── sound.wav

Example 1: Minimal (hello)

The simplest program — no graphics, just debug output:

main.go:

// Minimal Dreamcast program
package main

func main() {
    println("Hello, Dreamcast!")
}

go.mod (generated by godc init):

module hello

go 1.25.3

replace kos => ~/dreamcast/libgodc/kos

Example 2: Screen Output (hello_screen)

Display text on screen using the BIOS font:

main.go:

// Hello World on Dreamcast screen using BIOS font
package main

import "kos"

func main() {
    // center "Hello World" on 640x480 screen
    x := 640/2 - (11*kos.BFONT_THIN_WIDTH)/2
    y := 480/2 - kos.BFONT_HEIGHT/2
    offset := y*640 + x

    kos.BfontDrawStr(kos.VramSOffset(offset), 640, true, "Hello World")

    for {
        kos.TimerSpinSleep(100)
    }
}

go.mod (generated by godc init):

module hello_screen

go 1.25.3

replace kos => ~/dreamcast/libgodc/kos

require kos v0.0.0-00010101000000-000000000000

Build and Run

godc init             # Generate go.mod and .Makefile
godc build            # Compile to .elf
godc run              # Launch in emulator

Or manually:

sh-elf-gccgo -O2 -ml -m4-single -fno-split-stack -mfsrra -mfsca \
    -I$KOS_BASE/lib -L$KOS_BASE/lib \
    -c main.go -o main.o

kos-cc -o myproject.elf main.o \
    -L$KOS_BASE/lib -Wl,--whole-archive -lgodcbegin \
    -Wl,--no-whole-archive -lkos -lgodc

Romdisks — Packaging Assets

A romdisk is a read-only filesystem compiled into your executable. Put assets in the romdisk/ directory:

myproject/
├── main.go
└── romdisk/
    ├── player.png
    └── music.wav

The build system automatically:

  1. Creates romdisk.img using genromfs
  2. Converts it to romdisk.o using bin2o
  3. Links it into your executable

Access files in Go via /rd/:

texture := kos.PlxTxrLoad("/rd/player.png", true, 0)
sound := kos.SndSfxLoad("/rd/music.wav")

Compiler Flags

Default flags used by godc:

| Flag             | Purpose                         |
|------------------|---------------------------------|
| -O2              | Standard optimization           |
| -ml              | Little-endian mode              |
| -m4-single       | SH-4 with single-precision FPU  |
| -fno-split-stack | Fixed-size goroutine stacks     |
| -mfsrra          | Hardware reciprocal sqrt        |
| -mfsca           | Hardware sin/cos lookup         |

For maximum performance:

GODC_FAST=1 godc build

This enables -O3 -ffast-math -funroll-loops. Warning: -ffast-math breaks IEEE floating-point compliance.
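A concrete instance of that warning: under IEEE 754 semantics, NaN compares unequal to itself, which is exactly how code detects NaN; -ffast-math licenses the compiler to assume NaNs never occur, so it may fold such checks away. A minimal sketch (the function name is ours, and it only behaves as commented when compiled without -ffast-math):

```c
/* Sketch: checks that IEEE NaN semantics hold at runtime. Under
   -ffast-math the compiler may assume x != x is always false,
   silently breaking NaN checks like this one. */
static int nan_compares_unequal(void) {
    volatile double zero = 0.0;   /* volatile defeats constant folding */
    double x = zero / zero;       /* produces NaN under IEEE rules */
    return x != x;                /* true under IEEE semantics */
}
```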

Project Overrides

Create godc.mk for project-specific customizations:

# Reduce GC heap to free RAM for assets
CFLAGS += -DGC_SEMISPACE_SIZE_KB=1024

# Add extra libraries
LIBS += -lmy_custom_lib

# Custom romdisk location
ROMDISK_DIR = assets

Troubleshooting

“sh-elf-gccgo: command not found”

The compiler isn’t in your PATH. Check:

echo $PATH | tr ':' '\n' | grep dreamcast
which sh-elf-gccgo

“cannot find -lgodc”

The runtime library isn’t installed. Build and install it:

cd ~/dreamcast/libgodc
make install
ls $KOS_BASE/lib/libgodc.a

“undefined reference to `__go_runtime_init’”

You’re linking with the wrong library order. The correct order is:

-Wl,--whole-archive -lgodcbegin -Wl,--no-whole-archive -lkos -lgodc

-lgodcbegin must be wrapped in --whole-archive to ensure all its symbols are included.

Runtime crashes immediately

Check if your program uses double-precision floats. The SH-4 FPU is single-precision only. Compile with -m4-single and avoid float64 in hot paths.

Out of memory

The Dreamcast has 16MB. Check your allocations using the C API:

#include "gc_semispace.h"

size_t used, total;
uint32_t collections;
gc_stats(&used, &total, &collections);
printf("Heap: %zu / %zu bytes, %u collections\n", used, total, collections);

From Go, you can count goroutines:

println("Goroutines:", runtime.NumGoroutine())

Consider using KOS malloc directly for large buffers:

ptr := kos.PvrMemMalloc(size)  // PVR VRAM
ptr := kos.Malloc(size)        // KOS heap

Next Steps

Quick Start

Let’s create your first Dreamcast Go program.

Create a Project

mkdir myproject && cd myproject
godc init

Example output:

$ godc init
go: found kos in kos v0.0.0-00010101000000-000000000000

This creates go.mod and go.work files that configure your project to use the kos package from your libgodc installation.

Project Structure

A minimal project looks like this:

myproject/
├── go.mod            # Module definition with kos dependency
├── go.work           # Workspace configuration
└── main.go           # Your code

The go.mod file (paths will match your libgodc location):

module myproject

go 1.25.3

replace kos => /path/to/your/libgodc/kos

require kos v0.0.0-00010101000000-000000000000

The go.work file:

go 1.25.3

use (
        /path/to/your/libgodc
        .
)

Note: The paths in go.mod and go.work will automatically point to your libgodc installation location.

Hello, Dreamcast!

Create main.go:

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    println("Hello, Dreamcast!")
    for {}
}

Build and Run

Using godc:

godc build            # Compile to .elf
godc run              # Launch in emulator

Or manually with sh-elf-gccgo:

sh-elf-gccgo -O2 -ml -m4-single -fno-split-stack -mfsrra -mfsca \
    -I$KOS_BASE/lib -L$KOS_BASE/lib \
    -c main.go -o main.o

kos-cc -o myproject.elf main.o \
    -L$KOS_BASE/lib -Wl,--whole-archive -lgodcbegin \
    -Wl,--no-whole-archive -lkos -lgodc

Your First Graphics

Let’s draw something on screen:

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    
    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()
        
        // Draw opaque geometry
        kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
        drawTriangle()
        kos.PvrListFinish()
        
        kos.PvrSceneFinish()
    }
}

func drawTriangle() {
    // Create and submit polygon header
    var hdr kos.PvrPolyHdr
    var ctx kos.PvrPolyCxt
    kos.PvrPolyCxtCol(&ctx, kos.PVR_LIST_OP_POLY)
    kos.PvrPolyCompile(&hdr, &ctx)
    kos.PvrPrim(&hdr)  // Submit header
    
    // Submit vertices (use PvrPrimVertex for vertices)
    v := kos.PvrVertex{
        Flags: kos.PVR_CMD_VERTEX,
        X: 320, Y: 100, Z: 1,
        ARGB: 0xFFFF0000,  // Red
    }
    kos.PvrPrimVertex(&v)
    
    v.X, v.Y = 200, 400
    v.ARGB = 0xFF00FF00  // Green
    kos.PvrPrimVertex(&v)
    
    v.X, v.Y = 440, 400
    v.Flags = kos.PVR_CMD_VERTEX_EOL  // End of strip
    v.ARGB = 0xFF0000FF  // Blue
    kos.PvrPrimVertex(&v)
}

Using Goroutines

Goroutines work on Dreamcast:

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    
    // Start a background goroutine
    go func() {
        counter := 0
        for {
            counter++
            println("Background:", counter)
            select {}  // Yield to scheduler
        }
    }()
    
    // Main loop
    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()
        render()
        kos.PvrSceneFinish()
    }
}

Using Channels

Channels enable communication between goroutines:

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    
    // Create a buffered channel
    scores := make(chan int, 10)
    
    // Score counter goroutine
    go func() {
        total := 0
        for score := range scores {
            total += score
            println("Total score:", total)
        }
    }()
    
    // Main game loop
    for {
        // Game logic
        if playerScored() {
            scores <- 100  // Send score
        }
        render()
    }
}

Next Steps

Building From Nothing

The Real Starting Point

Most documentation starts after the hard part. “Here’s the GC” assumes you know you need one. “Here’s how goroutines work” assumes you figured out the symbol names.

Let’s go back to the real beginning:

DAY 0: THE SITUATION

You have:
• sh-elf-gccgo (Go compiler for SH-4)
• KallistiOS (Dreamcast SDK)
• A simple Go program: println("Hello, Dreamcast!")

You try to compile it. What happens?

$ sh-elf-gccgo -c hello.go
$ sh-elf-gcc hello.o -o hello.elf

LINKER ERRORS. Hundreds of them.

undefined reference to `runtime.printstring'
undefined reference to `runtime.printnl'
undefined reference to `__go_runtime_error'
undefined reference to `runtime.newobject'
...

Those undefined references are the holes we discussed in Chapter 2. The compiler generated calls to runtime functions that don’t exist.

Your job: Provide implementations for every one of them.


Part 1: The Discovery Process

How Do You Know What gccgo Expects?

This is the question nobody answers. Where is it documented? What’s the ABI?

Answer: It’s not well-documented. You have to investigate.

Here’s the process we used:

Method 1: Read the Linker Errors

The linker tells you exactly what’s missing:

sh-elf-gccgo -c myprogram.go -o myprogram.o
sh-elf-gcc myprogram.o -o myprogram.elf 2>&1 | grep "undefined reference"

You’ll see output like:

undefined reference to `runtime.printstring'
undefined reference to `runtime.printnl'
undefined reference to `__go_runtime_error'
undefined reference to `runtime.newobject'
undefined reference to `runtime.makeslice'

Start here. Each undefined symbol is a function you need to write.

Method 2: Read the gccgo Source

The gccgo frontend lives in the GCC source tree. The key directories:

gcc/go/gofrontend/      ← The Go parser and type checker
libgo/runtime/          ← The reference runtime (for Linux)
libgo/go/               ← Go standard library

When gccgo compiles make([]int, 10), it emits a call to runtime.makeslice. To find the expected signature:

# In the GCC source tree
grep -r "makeslice" libgo/runtime/

You’ll find the actual implementation. Study its parameters and return type.

Method 3: Use nm on Object Files

Compile your Go code and inspect what symbols it references:

sh-elf-gccgo -c test.go -o test.o
sh-elf-nm test.o | grep " U "   # "U" = undefined (needs linking)

This shows you every external symbol your code needs.

Method 4: Disassemble and Trace

When things don’t work, disassemble:

sh-elf-objdump -d test.o | less

Look at how functions are called. What registers hold arguments? What’s expected in return registers?

The Symbol Naming Convention

gccgo uses a specific naming scheme:

| Go Concept       | Symbol Name               |
|------------------|---------------------------|
| runtime.X        | runtime.X (literal dot)   |
| main.foo         | main.foo                  |
| Method on type T | T.MethodName              |
| Interface method | Complex mangling          |
Since C can’t have dots in identifiers, we use the __asm__ trick:

void runtime_printstring(String s) __asm__("runtime.printstring");

void runtime_printstring(String s) {
    // Implementation
}

Part 2: The Build Order

You can’t build everything at once. There are dependencies:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   DEPENDENCY GRAPH                                          │
│                                                             │
│                       ┌─────────┐                           │
│                       │ println │                           │
│                       └────┬────┘                           │
│                            │ needs                          │
│                       ┌────▼────┐                           │
│                       │ strings │                           │
│                       └────┬────┘                           │
│                            │ needs                          │
│                       ┌────▼────┐                           │
│                       │ memory  │                           │
│                       │ alloc   │                           │
│                       └────┬────┘                           │
│                            │ needs                          │
│                       ┌────▼────┐                           │
│                       │  heap   │                           │
│                       │  init   │                           │
│                       └─────────┘                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Milestone 1: Hello World

Goal: Print a string. No GC, no goroutines, nothing fancy.

What you need:

  1. Memory allocator — Even println allocates internally
  2. Print functions — runtime.printstring, runtime.printnl, runtime.printint
  3. String support — Go strings are {pointer, length} structs
  4. Entry point — Something to call main.main
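To make the string piece concrete, here is a hedged C sketch of a minimal runtime.printstring. The {pointer, length} layout follows gccgo's convention described above; the C-side identifier and the exact field names are our choices:

```c
#include <stdio.h>
#include <stdint.h>

/* Go string header as gccgo passes it: a byte pointer plus a
   length, with no NUL terminator. */
typedef struct {
    const unsigned char *str;
    intptr_t len;
} String;

/* Alias the C identifier onto the dotted Go symbol name. */
void runtime_printstring(String s) __asm__("runtime.printstring");

void runtime_printstring(String s) {
    /* Cannot use printf("%s") here: Go strings are not NUL-terminated. */
    fwrite(s.str, 1, (size_t)s.len, stdout);
}
```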

The minimal files:

runtime/
├── go-main.c           # Entry point, calls main.main
├── malloc_dreamcast.c  # Basic malloc wrapper
├── go-print.c          # Print functions
└── runtime.h           # Common definitions

Test:

package main

func main() {
    println("Hello, Dreamcast!")
}

If this prints, you have a foundation.

Milestone 2: Basic Types

Goal: Slices, arrays, basic type operations.

What you need:

  1. makeslice — Create slices
  2. growslice — Append to slices
  3. Type descriptors — Compiler generates these, you need to understand them
  4. Memory operations — memcpy, memset, memmove wrappers
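The slice piece can be sketched in a few lines of C. The {data, len, cap} header layout is the standard slice representation; the function name and the omission of overflow checks are simplifications of ours:

```c
#include <stdlib.h>
#include <stdint.h>

/* Slice header: pointer to backing array, length, capacity. */
typedef struct {
    void     *data;
    intptr_t  len;
    intptr_t  cap;
} Slice;

/* Minimal makeslice sketch: zeroed backing store, as Go requires.
   The real runtime also validates len <= cap and checks that
   cap * elem_size does not overflow. */
static Slice makeslice_sketch(intptr_t elem_size, intptr_t len, intptr_t cap) {
    Slice s;
    s.data = calloc((size_t)cap, (size_t)elem_size);  /* zeroed memory */
    s.len = len;
    s.cap = cap;
    return s;
}
```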

New files:

runtime/
├── slice_dreamcast.c   # Slice operations
├── string_dreamcast.c  # String operations
└── type_descriptors.h  # Type metadata structures

Test:

package main

func main() {
    s := make([]int, 5)
    s[0] = 42
    println(s[0])
}

Milestone 3: Panic and Defer

Goal: Error handling works.

Why before GC? Because GC needs defer for cleanup. And panic is simpler than GC.

What you need:

  1. Defer chain — Linked list of deferred calls per goroutine
  2. Panic mechanism — setjmp/longjmp based
  3. Recover — Check if in deferred function

Test:

package main

func main() {
    defer println("world")
    println("hello")
}
// Should print: hello, then world

Milestone 4: Maps

Goal: Hash tables work.

The problem: Go maps have complex semantics:

  • Iteration order is randomized
  • Growing rehashes everything
  • Keys can be any comparable type

What you need:

  1. Hash function — For each key type
  2. Bucket structure — Go uses a specific layout
  3. makemap, mapaccess, mapassign, mapdelete — Core operations
  4. Map iteration — Complex state machine
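For the hash function piece, a simple byte hash like FNV-1a is the kind of building block a map needs per key type. This sketch is illustrative, not libgodc's actual hash; the real runtime also mixes in a per-map seed, which is what randomizes iteration order:

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a, 32-bit: XOR each byte into the state, then multiply by
   the FNV prime. Simple, fast, and decent distribution for a map. */
static uint32_t fnv1a(const void *key, size_t len) {
    const unsigned char *p = key;
    uint32_t h = 2166136261u;        /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}
```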

Lesson learned: Map iteration state is stored in a hiter struct. If you get this wrong, range loops break mysteriously.

Milestone 5: Garbage Collection

Goal: Automatic memory management.

Design decision: We chose semi-space copying GC because:

  • No fragmentation
  • Simple implementation
  • Predictable pause times (though not short)

What you need:

  1. Root scanning — Find all pointers on stack and in globals
  2. Object copying — Move live objects to new space
  3. Pointer updating — Fix all references
  4. Type bitmaps — Know which words are pointers

The hard part: Knowing which stack slots are pointers. gccgo generates __gcdata bitmaps for types, but stack scanning is conservative.
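The allocation side of a semi-space collector is just a bump pointer into the active space; a collection copies live objects into the other space and the spaces swap roles. A sketch (the 64 KB size is a demo value, not libgodc's 2 MB default, and the copy/scan phase is elided):

```c
#include <stddef.h>

/* Two equally sized spaces; allocation bumps a pointer in the
   active (from) space. */
#define SPACE_SIZE (64 * 1024)

static unsigned char space0[SPACE_SIZE], space1[SPACE_SIZE];
static unsigned char *from_space = space0, *to_space = space1;
static size_t alloc_off;

static void *gc_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;          /* keep 8-byte alignment */
    if (alloc_off + n > SPACE_SIZE)
        return NULL;                    /* real runtime: trigger a GC */
    void *p = from_space + alloc_off;
    alloc_off += n;
    return p;
}

/* After copying live objects into to_space, the spaces swap roles. */
static void gc_flip(void) {
    unsigned char *t = from_space;
    from_space = to_space;
    to_space = t;
    alloc_off = 0;                      /* new from-space starts empty */
}
```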

Milestone 6: Goroutines

What you need:

  1. G struct — Goroutine state
  2. Stack allocation — Each goroutine needs its own stack
  3. Context switching — Save/restore CPU registers (assembly!)
  4. Scheduler — Pick which goroutine runs next
  5. Run queue — List of runnable goroutines

The assembly is unavoidable: you must write swapcontext in SH-4 assembly. A context switch saves and restores the actual CPU registers, and C gives you no direct access to them; the compiler manages the registers behind your back.

! Save current context
mov.l   r8, @-r4
mov.l   r9, @-r4
! ... save all callee-saved registers ...

! Load new context
mov.l   @r5+, r8
mov.l   @r5+, r9
! ... restore all registers ...

rts

Milestone 7: Channels

Goal: Goroutines can communicate.

Channels require:

  • Wait queues (goroutines blocked on send/receive)
  • Buffered storage (ring buffer)
  • Select statement (waiting on multiple channels)
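The buffered-storage piece is a fixed ring buffer. This sketch covers only the non-blocking paths; a full channel also keeps wait queues of blocked senders and receivers and wakes one on the opposite operation. Names and the int element type are our simplifications:

```c
#include <stddef.h>

#define CHAN_CAP 4

typedef struct {
    int    buf[CHAN_CAP];
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of elements stored */
} Chan;

/* Non-blocking send: returns 0 when full (a goroutine would block). */
static int chan_trysend(Chan *c, int v) {
    if (c->count == CHAN_CAP) return 0;
    c->buf[(c->head + c->count) % CHAN_CAP] = v;
    c->count++;
    return 1;
}

/* Non-blocking receive: returns 0 when empty. */
static int chan_tryrecv(Chan *c, int *out) {
    if (c->count == 0) return 0;
    *out = c->buf[c->head];
    c->head = (c->head + 1) % CHAN_CAP;
    c->count--;
    return 1;
}
```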

The “3 days of debugging” commit touched channels. The issue was usually:

  • Waking the wrong goroutine
  • Corrupting state during concurrent access
  • Stack misalignment after context switch

Part 3: Resources You’ll Need

Essential Reading

  1. gccgo source code — gcc/go/gofrontend/ and libgo/runtime/
  2. Go runtime source — $GOROOT/src/runtime/ (different ABI, but same concepts)
  3. SH-4 programming manual — For assembly and ABI
  4. KallistiOS documentation — For Dreamcast specifics

Tools

| Tool             | Purpose                                  |
|------------------|------------------------------------------|
| sh-elf-nm        | List symbols in object files             |
| sh-elf-objdump   | Disassemble code                         |
| sh-elf-addr2line | Convert addresses to line numbers        |
| dc-tool-ip       | Upload and run on Dreamcast              |
| lxdream          | Dreamcast emulator (for faster iteration)|

The Checklist Mentality

Before each phase, write down:

  1. What symbols must I implement?
  2. What’s the expected signature?
  3. How will I test it?

After each phase:

  1. Did all tests pass?
  2. What surprised me?
  3. What would I do differently?

The journey from nothing to a working Go runtime is not easy. But it is achievable. Every problem has a solution. Every bug can be found. Every undefined symbol can be implemented.

You now have the map. Go build it.

Introduction to libgodc

What Is This Book?

This book is about building a Go runtime for the Sega Dreamcast.

Wait, what?

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   THE CRAZY PROJECT                                         │
│                                                             │
│   Go:                                                       │
│   • Designed for servers and cloud computing                │
│   • Expects gigabytes of RAM                                │
│   • Has a sophisticated garbage collector                   │
│   • Written for modern multi-core CPUs                      │
│                                                             │
│   Dreamcast:                                                │
│   • A game console from 1998                                │
│   • Has 16 MB of RAM (megabytes, not giga)                  │
│   • Single CPU core at 200 MHz                              │
│   • Was designed for arcade games                           │
│                                                             │
│   These shouldn't work together. But they do.               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

We call this project libgodc, a library that implements Go’s runtime for the Dreamcast. By the end of this book, you’ll understand how we built the Dreamcast Go runtime from scratch: memory allocation, garbage collection, goroutine scheduling, channels, and more.


Who Is This Book For?

You should read this book if:

  • You’re curious how programming languages work “under the hood”
  • You want to understand what a runtime actually does
  • You enjoy systems programming and low-level details
  • You think retro game consoles are cool

You’ll need to know:

  • Basic Go (variables, functions, structs, goroutines)
  • Some C (pointers, memory, basic syntax)
  • What a compiler does (turns source code into machine code, duh!)

You don’t need to know:

  • Assembly language (we’ll explain what you need)
  • How to program the Dreamcast (KallistiOS handles the hard parts)
  • Anything about garbage collectors (we’ll build one together)

The Machine We’re Programming

Let’s meet our hardware. The Sega Dreamcast (1998) was ahead of its time—the first 128-bit console, they said! (Marketing math, but still impressive.)

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   THE SEGA DREAMCAST                                        │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │   CPU:     Hitachi SH-4 @ 200 MHz                   │   │
│   │                                                     │   │
│   │   RAM:     16 MB (yes, that's megabytes, not giga)  │   │
│   │                                                     │   │
│   │   VRAM:    8 MB (for the GPU)                       │   │
│   │                                                     │   │
│   │   GPU:     PowerVR2 CLX2                            │   │
│   │                                                     │   │
│   │   Sound:   Yamaha AICA (has its own ARM7 + 2 MB)    │   │
│   │                                                     │   │
│   │   Storage: GD-ROM (or SD card adapter)              │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

For comparison, your phone probably has:

  • 4-8 CPU cores at 2+ GHz
  • 4-8 GB of RAM
  • Virtual memory, memory protection, multiple privilege levels

The Dreamcast has:

  • 1 CPU core at 200 MHz
  • 16 MB of RAM
  • No virtual memory, no memory protection, no privilege levels

Different world.


Why Can’t We Just Use Standard Go?

Go has an official compiler called gc. It generates code for x86, ARM, and other modern architectures.

The Dreamcast uses a SuperH SH-4 processor. Adding SH-4 support to gc would require rewriting significant portions of the compiler backend—months of work, requiring deep expertise in both Go internals and the SH-4 architecture. That’s a project for a team of compiler engineers with sleepless nights, questionable caffeine consumption, and possibly mild insanity.

Instead, we use gccgo, an alternative Go compiler built on GCC. GCC already supports SH-4 (from decades of embedded development). So gccgo can compile Go to SH-4—we just need to provide the runtime.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   TWO PATHS TO GO ON DREAMCAST                              │
│                                                             │
│   Path A: Modify gc                                         │
│   ─────────────────────                                     │
│   - Write a new SH-4 backend                                │
│   - Write a new Dreamcast Operating System                  │
│   - Understand SSA, register allocation, etc.               │
│   - Result: "real" Go on Dreamcast                          │
│                                                             │
│   Path B: Use gccgo + write runtime (this book)             │
│   ────────────────────────────────────────────              │
│   - GCC already knows SH-4                                  │
│   - Write runtime in C                                      │
│   - Result: Go dialect for Dreamcast                        │
│                                                             │
│   We chose Path B. It's faster and teaches more.            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The 16 Megabyte Problem

Sixteen megabytes. That’s it. Everything must fit:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   16 MB = 16,777,216 bytes                                  │
│                                                             │
│   That's shared between:                                    │
│                                                             │
│   ┌─────────────────────────────────────────────────┐       │
│   │  Your program's code           (0.5 - 2 MB)     │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  KallistiOS overhead           (~0.5 MB)        │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  Go runtime heap               (??? MB)         │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  Goroutine stacks              (??? MB)         │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  Game assets (textures, etc.)  (??? MB)         │       │
│   └─────────────────────────────────────────────────┘       │
│                                                             │
│   Everything fights for space.                              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

This is why our garbage collector choice matters so much. We use a semi-space copying collector, which needs two equally-sized spaces. libgodc allocates 2 MB per space = 4 MB total = 2 MB usable heap.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Semi-space GC memory usage (libgodc default):             │
│                                                             │
│   ┌─────────────────────┬─────────────────────┐             │
│   │    FROM-SPACE       │     TO-SPACE        │             │
│   │      2 MB           │       2 MB          │             │
│   │                     │                     │             │
│   │  (active heap)      │  (empty, waiting    │             │
│   │                     │   for next GC)      │             │
│   └─────────────────────┴─────────────────────┘             │
│                                                             │
│   Total: 4 MB for a 2 MB usable heap. That's 50% overhead!  │
│                                                             │
│   But: no fragmentation, simple, predictable.               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Design decision: We chose simplicity (semi-space GC) over memory efficiency. On a 16 MB machine, this hurts. But a more memory-efficient collector would be much more complex to implement and debug. The 2 MB usable heap is sufficient for most Dreamcast games—large assets like textures should use external allocation anyway. For games that need more RAM for assets, compile with -DGC_SEMISPACE_SIZE_KB=1024 to shrink the heap to 1 MB usable (2 MB total).


Where Does Everything Live?

The Dreamcast has 16 MB of main RAM at addresses 0x8C000000 to 0x8CFFFFFF. Here’s how it’s organized:

    0x8C000000 ──────────────────────────────────────────────
                 │
                 │   KOS kernel + drivers (~1 MB)
                 │
                 ├──────────────────────────────────────────
                 │   .text (your compiled code)
                 │   .rodata (constants, type descriptors)
                 │   .data (initialized globals)
                 │   .bss (uninitialized globals)
                 ├──────────────────────────────────────────
                 │
                 │   KOS malloc heap (everything below):
                 │
                 │   ┌─────────────────────────────────────┐
                 │   │  GC semi-space 0 (2 MB)             │
                 │   ├─────────────────────────────────────┤
                 │   │  GC semi-space 1 (2 MB)             │
                 │   ├─────────────────────────────────────┤
                 │   │  Goroutine stacks (64 KB each)      │
                 │   ├─────────────────────────────────────┤
                 │   │  Textures, audio, game assets       │
                 │   └─────────────────────────────────────┘
                 │
                 ├──────────────────────────────────────────
                 │   Main thread stack (grows downward)
                 │
    0x8CFFFFFF ──────────────────────────────────────────────

                 Total: 16 MB (0x1000000 bytes)

KOS manages the heap via malloc. When you run out of memory, malloc returns NULL and your program crashes. There’s no virtual memory, no swap file, no second chance. Our implementation at least fails with friendly messages (lol):

// runtime/gc_heap.c
if (gc_heap.alloc_ptr + total_size > gc_heap.alloc_limit)
    runtime_throw("out of memory");

// runtime/stack.c  
void *base = memalign(8, size);
if (!base)
    runtime_throw("stack_alloc: out of memory");

// runtime/chan.c
c = (hchan *)gc_alloc(totalSize, &__hchan_type);
if (!c)
    runtime_throw("makechan: out of memory");

// runtime/tls_sh4.c
tls = (tls_block_t *)malloc(sizeof(tls_block_t));
if (!tls)
    runtime_throw("tls_alloc: out of memory");

The SH-4 Processor

Let’s get to know the CPU that runs our code.

The Alignment Rule

Here’s something that will bite you if you forget it:

The SH-4 requires natural alignment.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Type          Size     Must be aligned to                 │
│   ────          ────     ──────────────────                 │
│   uint8         1 byte   Any address is fine                │
│   uint16        2 bytes  Address must be divisible by 2     │
│   uint32        4 bytes  Address must be divisible by 4     │
│   uint64        8 bytes  Address must be divisible by 8     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

On x86 (your laptop), unaligned access is just slow. On SH-4, it crashes the CPU.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   x86 (your laptop):                                        │
│   Unaligned access?  → Works, but slower                    │
│                                                             │
│   SH-4 (Dreamcast):                                         │
│   Unaligned access?  → ADDRESS ERROR EXCEPTION              │
│                         System crashes. No recovery.        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Our allocator must always return properly aligned addresses.

The Floating Point Unit

The SH-4 has a powerful FPU with a twist:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Single-precision (float32):  FAST! ✓                      │
│   - Hardware accelerated                                    │
│   - Multiply-add in 1 cycle                                 │
│                                                             │
│   Double-precision (float64):  Slow ✗                       │
│   - Takes many more cycles                                  │
│   - Avoid in performance-critical code                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Go defaults to float64. For games, use float32 wherever precision isn’t critical. Making float32 the default in libgodc isn’t feasible: someone would have to recompile gccgo and rewrite every constant in the standard library to use float32, which is a massive job, especially around the math packages and everything that depends on them. So remember: reach for float32, and avoid float64.

A better long-term fix would be to create float32 wrappers around common math functions.


The Cache Problem

The SH-4 has a 16 KB data cache with “write-back” behavior. When you write data, it might only go to the cache, not to main memory.

THE PROBLEM:
════════════

  Your code writes to address 0x8C100000
          │
          ▼
  ┌───────────────┐
  │    CACHE      │  ← Data goes HERE
  │  (new value)  │
  └───────────────┘
          
  ┌───────────────┐
  │  MAIN MEMORY  │  ← But not HERE (yet)
  │  (old value)  │
  └───────────────┘
          │
          ▼
  GPU reads from 0x8C100000
  Gets the OLD value!  💥

We have to manually flush the cache before hardware reads from memory:

dcache_flush_range(addr, len);  // Push cache → memory

On your laptop, the OS handles this. On the Dreamcast, it’s our job.


KallistiOS: The Foundation

We’re not programming bare-metal. We build on KallistiOS (KOS), the standard SDK for Dreamcast homebrew.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌───────────────────────────────────────────────────┐     │
│   │              Your Go Program                      │     │
│   └───────────────────────────────────────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │                  libgodc                          │     │
│   │  (Go runtime: GC, scheduler, channels, etc.)      │     │
│   └───────────────────────────────────────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │               KallistiOS                          │     │
│   │  (hardware abstraction, malloc, timers)           │     │
│   └───────────────────────────────────────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │            Dreamcast Hardware                     │     │
│   └───────────────────────────────────────────────────┘     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

KOS is a minimal embedded operating system that gets statically linked into your program. There’s no user/kernel mode separation, no process isolation, and no memory protection. Your code runs with full hardware access, alongside the KOS kernel.


The Constraints That Shape Everything

These hardware limitations drive every decision in libgodc:

Constraint 1: No Memory Protection

On your laptop, accessing invalid memory gives: Segmentation fault (core dumped)

On the Dreamcast: memory gets silently corrupted, or the program crashes with no explanation at all.

Constraint 2: Real-Time Requirements

Games need consistent frame rates. At 60 FPS, you have 16.67 milliseconds per frame:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   One frame = 16.67 ms                                      │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   │
│   └─────────────────────────────────────────────────────┘   │
│   Game logic  Rendering             GC pause                │
│   ░░░░░░░░░░  ░░░░░░░░░░░░░░░      ░░░░                     │
│                                      ▲                      │
│                                      │                      │
│                        If GC takes 20ms, you miss frames!   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Constraint 3: Single Core

The SH-4 is a single-core CPU. Even if we wanted parallel GC, there’s no second core to run it on. So when GC runs, everything stops.

The Toolchain

In this chapter

  • You learn why we use gccgo instead of the standard Go compiler
  • You see how Go code becomes Dreamcast machine code
  • You understand the “holes” in compiled code and how we fill them
  • You discover the dark arts: making C pretend to be Go
  • You learn about calling conventions and type descriptors

Why gccgo?

A compiler is just a program that writes programs. Most Go developers use gc, the standard Go compiler. It’s fast, produces excellent code, and has a fantastic runtime.

But gc only speaks certain architectures:

┌─────────────────────────────────────────┐
│                                         │
│     gc compiler's architecture list     │
│                                         │
│     ✓ x86-64   laptops, desktops        │
│     ✓ ARM64    phones, Raspberry Pi     │
│     ✓ RISC-V   new trend                │
│                                         │
│     ✗ SH-4     "never heard of this"    │
│                                         │
└─────────────────────────────────────────┘

The Dreamcast uses a Hitachi SuperH SH-4 processor. Adding support to gc would require modifying the compiler backend—months of work, lots of caffeine, and at least three existential crises.

But here’s the thing: GCC has supported the SH-4 for over two decades.

┌─────────────────┐         ┌─────────────────┐
│   gc compiler   │         │   GCC compiler  │
│                 │         │                 │
│  Knows Go ✓     │         │  Knows Go ✗     │
│  Knows SH-4 ✗   │         │  Knows SH-4 ✓   │
└─────────────────┘         └─────────────────┘
        │                           │
        └─────── combine? ──────────┘
                    │
                    ▼
          ┌─────────────────┐
          │     gccgo       │
          │                 │
          │  Knows Go ✓     │
          │  Knows SH-4 ✓   │
          └─────────────────┘

gccgo is a Go frontend for GCC. It reads Go source code, performs type checking, then hands everything to GCC’s backend. GCC handles the hard part—generating SH-4 machine code.

We get Go compilation for the Dreamcast “for free.” Our job is to provide the runtime library.

What is a Runtime?

A runtime is a library of functions that a compiled program calls during execution. It handles things the compiler can’t (or shouldn’t) generate inline: memory allocation, garbage collection, goroutine scheduling, panic handling, and more.

Why do languages use this pattern? Portability. The compiler translates your source code into machine instructions, but those instructions need to interact with the operating system or hardware. By separating “language translation” from “platform interaction,” you can:

  1. Reuse the compiler — gccgo already knows Go. We don’t touch it.
  2. Swap the runtime — We write a Dreamcast-specific runtime. The same compiler now works on a new platform.

This is how Go supports Linux, Windows, macOS, and now Dreamcast—same language, same compiler frontend, different runtimes.

Other languages use similar patterns:

  • C has startup code (crt0) and libc for system calls
  • C++ adds exception handling (libgcc) and the standard library (libstdc++)
  • Rust has a minimal runtime embedded in libstd
  • Java has the JVM—a full runtime with GC, JIT, and class loading
  • Python has libpython—the interpreter itself

The difference is scope. C’s runtime is small—just system call wrappers. Go’s runtime is large—it includes a garbage collector, scheduler, and channel implementation. That’s why porting Go is harder than porting C, but the principle is identical.


Code with Holes

Here’s the key insight of this entire book. When you compile Go code, the compiler doesn’t include everything.

func main() {
    s := make([]int, 10)
    m := make(map[string]int)
    go doSomething()
}

What does make([]int, 10) actually do? It needs to allocate memory, initialize the slice header, and return it. Does the compiler generate all that code inline?

No. It generates function calls instead:

Your Go code              What the compiler emits
─────────────             ──────────────────────

make([]int, 10)       →   CALL runtime.makeslice
make(map[string]int)  →   CALL runtime.makemap  
go doSomething()      →   CALL runtime.newproc

The compiled object file is full of these calls. But the implementations aren’t there:

┌─────────────────────────────────────────────────────┐
│                                                     │
│                  main.o (your compiled code)        │
│                                                     │
│    ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐      │
│    │HOLE │  │HOLE │  │HOLE │  │HOLE │  │HOLE │      │
│    └─────┘  └─────┘  └─────┘  └─────┘  └─────┘      │
│    runtime  runtime  runtime  runtime  runtime      │
│    .make    .make    .make    .new     .defer       │
│    slice    map      chan     proc     proc         │
│                                                     │
└─────────────────────────────────────────────────────┘

These are unresolved symbols. The object file knows it needs to call runtime.makeslice, but doesn’t know where that function is.

Who fills in the holes? That’s us. That’s libgodc.


Filling the Holes

Our job is to provide implementations. When the linker combines your code with our library, every hole gets filled:

BEFORE LINKING:
═══════════════

┌──────────────────┐          ┌──────────────────┐
│    main.o        │          │   libgodc.a      │
│                  │          │                  │
│  HOLE: runtime.  │          │  runtime.        │
│        makeslice │          │  makeslice ──────┼──→ actual code!
│                  │          │                  │
│  HOLE: runtime.  │          │  runtime.        │
│        newproc   │          │  newproc ────────┼──→ actual code!
└──────────────────┘          └──────────────────┘


AFTER LINKING:
══════════════

┌─────────────────────────────────────────────────────┐
│                    game.elf                         │
│                                                     │
│    call runtime.makeslice ───→ [makeslice code]     │
│    call runtime.newproc ─────→ [newproc code]       │
│                                                     │
│    No more holes! Ready to run.                     │
└─────────────────────────────────────────────────────┘

The Symbol Problem

There’s a wrinkle. Go uses dots in names: runtime.makeslice.

But dots are illegal in C identifiers:

void runtime.makeslice() { }  // SYNTAX ERROR!

How do we write a C function with a dot in its name?

The __asm__ Trick

GCC lets you specify the symbol name separately:

// C identifier uses underscore, but symbol has a dot
void *runtime_makeslice(void *type, int len, int cap)
    __asm__("runtime.makeslice");

void *runtime_makeslice(void *type, int len, int cap) {
    // implementation
}
┌────────────────────────────────────────────────────────┐
│                                                        │
│   In C code:         →    In object file:              │
│                                                        │
│   runtime_makeslice()     runtime.makeslice            │
│   (underscore)            (dot)                        │
│                                                        │
│   Go calls runtime.makeslice, linker finds it,         │
│   Go never knows it was written in C.                  │
│                                                        │
└────────────────────────────────────────────────────────┘

Every runtime function in libgodc uses this pattern.


Symbols vs. Signatures

Two things must match between caller and callee:

1. The Symbol (the name): Get it wrong, the linker complains loudly.

2. The Signature (the shape): What arguments, what order, what return values.

The compiler has already decided how to call runtime.makeslice:

Register r4:  pointer to type descriptor
Register r5:  length
Register r6:  capacity

Return value in r0

If our implementation expects arguments in different registers:

What compiler sends:        What our code expects:
────────────────────        ──────────────────────

  r4 = type pointer           r4 = length        ← WRONG!
  r5 = length                 r5 = capacity      ← WRONG!

The linker won’t catch this. Symbol names match, so it happily connects them. The mismatch only shows up at runtime as mysterious crashes.

┌─────────────────────────────────────────────────────┐
│                                                     │
│   Symbol mismatch:        Signature mismatch:       │
│   ───────────────         ──────────────────        │
│   Linker error            Linker succeeds           │
│   Clear message           Runtime crash             │
│   Easy to fix             Hard to debug             │
│                                                     │
└─────────────────────────────────────────────────────┘

The Calling Convention

When a function calls another function, they need to agree on how to pass data. This is the calling convention.

SH-4 Register Usage

┌─────────────────────────────────────────────────────────────┐
│   SH-4 Register Usage                                       │
│                                                             │
│   r0      Return value / scratch                            │
│   r1      Return value (64-bit) / scratch                   │
│   r2-r3   Scratch                                           │
│   ─────────────────────────────────────────────             │
│   r4      1st argument                                      │
│   r5      2nd argument                                      │
│   r6      3rd argument                                      │
│   r7      4th argument                                      │
│   ─────────────────────────────────────────────             │
│   r8-r13  Callee-saved (must preserve)                      │
│   r14     Frame pointer                                     │
│   r15     Stack pointer                                     │
└─────────────────────────────────────────────────────────────┘

Why does this matter? Most of the time, it doesn’t—the compiler handles it. But understanding the calling convention helps when:

  • Debugging crashes: Register dumps make sense when you know r4-r7 hold arguments
  • Writing //extern bindings: You need to match what C functions expect
  • Reading the runtime assembly: Context switching must save/restore the right registers (r8-r14 are callee-saved, so the callee must preserve them)

Multiple Return Values

Go functions can return multiple values. C can’t. gccgo handles this by returning a struct:

struct result {
    int quotient;
    int remainder;
};

struct result divmod(int a, int b) {
    return (struct result){ a / b, a % b };
}

Small structs fit in r0-r1. When implementing runtime functions that return multiple values, we must match exactly what gccgo expects.


Reading CPU Registers

Sometimes we need to know register values directly:

// This variable IS register r15
register uintptr_t sp asm("r15");

printf("Stack pointer: 0x%08x\n", sp);

This isn’t a copy—sp is the register. We use this for:

  • Stack bounds checking
  • Context switching (saving/restoring goroutine state)
  • Debugging (dump registers on crash)

Inline Assembly

Sometimes C can’t express what we need. Here are real examples from libgodc:

// Prefetch - hint CPU to load cache line (gc_copy.c)
#define GC_PREFETCH(addr) __asm__ volatile("pref @%0" : : "r"(addr))

// Read the stack pointer (gc_copy.c)
void *sp;
__asm__ volatile("mov r15, %0" : "=r"(sp));

// Read/write status register (scheduler.c)
__asm__ volatile("stc sr, %0" : "=r"(sr));  // read
__asm__ volatile("ldc %0, sr" : : "r"(sr)); // write

// Memory barrier - prevent compiler reordering (runtime.h)
#define CONTEXT_SWITCH_BARRIER() __asm__ volatile("" ::: "memory")

We use assembly for:

  • Prefetching (hint cache to load data we’ll need soon)
  • Context switching (save/restore all registers—see runtime_sh4_minimal.S)
  • Reading special registers (stack pointer, status register)
  • Memory barriers (ensure memory operations complete before continuing)

Don’t use it for anything you can do in C. KOS handles cache flush/invalidate via dcache_flush_range().


Type Descriptors

When you define a Go type, the compiler generates a type descriptor. Here are the key fields (the full struct has 12 fields, 36 bytes):

struct __go_type_descriptor {
    uintptr_t __size;        // Size of an instance
    uintptr_t __ptrdata;     // Bytes containing pointers
    uint32_t  __hash;        // Hash for type comparison
    uint8_t   __code;        // Kind (int, string, struct...)
    const uint8_t *__gcdata; // GC bitmap: which words are pointers
    // ... plus alignment, equality function, reflection string, etc.
};

For this Go type:

type Point struct {
    X, Y int
    Name *string
}

The compiler generates:

┌─────────────────────────────────────────────────────────────┐
│   Type descriptor for Point:                                │
│                                                             │
│   __size:    12 bytes  (int + int + pointer)                │
│   __ptrdata: 12 bytes  (all 3 words may contain pointers)   │
│   __code:    STRUCT                                         │
│   __gcdata:  bit-packed bitmap (1 bit per word)             │
│                                                             │
│   Word 0 (X):    int, not a pointer  → bit 0 = 0            │
│   Word 1 (Y):    int, not a pointer  → bit 1 = 0            │
│   Word 2 (Name): pointer             → bit 2 = 1            │
│                                                             │
│   __gcdata[0] = 0b00000100 = 0x04                           │
│                                                             │
│   GC reads: gcdata[word/8] & (1 << (word%8))                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The garbage collector uses __gcdata to know which fields to scan. The bitmap is bit-packed: one bit per pointer-sized word. Without it, the GC would have to guess which values are pointers.


The Build Process

══════════════════════════════════════════════════════════════
                    THE BUILD PIPELINE
══════════════════════════════════════════════════════════════

ONCE (building libgodc):
────────────────────────

  gc_runtime.c ─┐
  chan.c ───────┼──→ sh-elf-gcc ──→ *.o ──→ ar ──→ libgodc.a
  scheduler.c ──┤
  map.c ────────┘


EVERY TIME (building your game):
────────────────────────────────

  main.go ──→ sh-elf-gccgo ──→ main.o (with holes)
                                   │
                                   ▼
  main.o + libgodc.a + libkallisti.a ──→ sh-elf-ld ──→ game.elf

══════════════════════════════════════════════════════════════

The linker doesn’t care what language produced the code. It just matches symbol names.


Why C, Not Go?

libgodc is written in C (specifically, C11 with GNU extensions).

The Bootstrap Problem: To compile Go, you need a Go runtime. To get a Go runtime, you need to compile Go. Chicken, meet egg.

By writing the runtime in C, we sidestep the problem. The C compiler doesn’t need anything from Go.

Also, KallistiOS is written in C, so we can directly call its functions.


What Runs Before main()?

Your Go main() isn’t the first thing that runs. libgodcbegin.a provides the C main() (in go-main.c) that sets everything up:

Dreamcast powers on
        │
        ▼
KallistiOS boots
        │
        ▼
C main() [go-main.c]
        │
        ├──→ runtime_args()              Save argc/argv
        ├──→ runtime_init()
        │       ├──→ gc_init()           Set up garbage collector
        │       ├──→ map_init()          Initialize map subsystem
        │       ├──→ sudog_pool_init()   Pre-allocate channel waiters
        │       ├──→ stack_pool_preallocate()  Pre-allocate goroutine stacks
        │       ├──→ proc_init()         Set up scheduler (tls_init, g0)
        │       └──→ panic_init()        Set up panic/recover
        │
        ├──→ __go_go(main_wrapper)       Create goroutine for main.main
        │
        └──→ scheduler_run_loop()        Start scheduler
                    │
                    ▼
            YOUR CODE RUNS HERE

Memory Management

The Problem with Memory

In C, you’re the janitor:

char *name = malloc(100);
strcpy(name, "Mario");
free(name);  // Forget this? Memory leak.
             // Do it twice? Crash.

It’s like carrying a coffee mug to your desk every morning and never taking one back. Monday is fine; by Friday your desk is buried under empty mugs.

Go says: “I’ll clean up the mugs for you.”

player := &Player{name: "Mario"} // struct allocation (heap)
enemies := make([]Enemy, 10) // slice allocation (heap)
scores := make(map[string]int) // map allocation (heap)
// That's it. Go cleans up automatically when you're done with them.

Stack vs Heap: Where Does Memory Live?

If you’re coming from Python or JavaScript, you might never have thought about where your variables live. In those languages everything “just works”: you create objects, use them, and the runtime cleans up. But programs actually use two different regions of RAM: the stack and the heap. Both are in main memory, but they’re managed very differently.

func calculate() int {
    x := 42           // stack: lives only during this function call
    y := x * 2        // stack: same, gone when function returns
    return y          // value is copied out, then x and y disappear
}

func createPlayer() *Player {
    p := &Player{name: "Mario"}   // heap: we're returning a pointer
    return p                      // p (the pointer) disappears, but the
                                  // Player data survives on the heap
}

The stack is memory that belongs to the current function call. When the function returns, that memory is immediately reclaimed—no cleanup needed, no garbage collector involved. But the data is gone forever.

The heap is memory that persists beyond the function that created it. When you take the address of something (&Player{...}), return a pointer, or use make() for slices/maps, Go allocates on the heap. That memory sticks around until the garbage collector determines nothing references it anymore.

There’s also the data segment where global variables live. These are allocated once when the program starts and exist until the program exits—no cleanup, no GC, they just persist for the program’s entire lifetime.

var highScore int       // data segment - exists from start to end

func main() {
    x := 42             // stack - gone when main() returns
    p := &Player{}      // heap - GC cleans up when unreferenced
    highScore = 9999    // modifying global, not allocating
}

On Dreamcast, there are additional memory regions you’ll encounter:

Region      Size                   Contains
──────      ────                   ────────
Code        varies                 Your compiled program (read-only instructions)
Data/BSS    varies                 Global variables
Stack       64 KB per goroutine    Local variables, function calls
Heap        ~4 MB (2 MB usable)    GC-managed allocations
VRAM        8 MB total             Textures, framebuffer (via PVR functions)
Sound RAM   2 MB                   Audio samples (via sound functions)

VRAM and Sound RAM are physically separate chips—they can’t corrupt main RAM or each other. If you run out of VRAM, PvrMemMalloc() returns 0. If you don’t check and try to use that zero pointer, your program crashes. Use PvrMemAvailable() to check how much VRAM remains (the framebuffer takes some of the 8 MB, so you won’t have all of it for textures).

When your game ends (power off or reset), all memory is simply gone—the “cleanup” is turning off the console.

func example() {
    // STACK - temporary, fast, automatic cleanup:
    count := 10
    sum := 0.0
    flag := true

    // HEAP - persists, needs GC to clean up:
    player := &Player{}           // pointer escapes? heap
    enemies := make([]Enemy, 5)   // slices go to heap
    scores := make(map[string]int) // maps always heap
}

The compiler decides where each variable lives through escape analysis: if the data could be used after the function returns (passed around, stored somewhere, returned), it goes to the heap. Otherwise, it stays on the stack.

The garbage collector (GC) finds stuff you’re not using anymore and reclaims the memory. But here’s the catch—it takes time to run.


How Allocation Works

When you create something in Go, where does the memory come from?

We use bump allocation. Think of it like a notepad:

┌─────────────────────────────────────────────────────┐
│ Mario │ Luigi │ Peach │                             │
└─────────────────────────────────────────────────────┘
                        ↑
                     You are here
                   (next free spot)

To allocate: just write at the current spot and move the marker.

┌─────────────────────────────────────────────────────┐
│ Mario │ Luigi │ Peach │ Toad │                      │
└─────────────────────────────────────────────────────┘
                               ↑
                            Moved!

That’s it! Just move a pointer. Way faster than malloc.

Verifying Allocations: A Hands-On Example

Embedded developers are used to inspecting memory directly. Here’s how you can see these allocations in action:

package main

import "unsafe"

type Player struct {
    X, Y  float32
    Score int32
}

//go:noinline
func allocOnHeap() *Player {
    return &Player{X: 10, Y: 20, Score: 100}
}

func main() {
    // Stack allocation
    var local Player
    stackAddr := uintptr(unsafe.Pointer(&local))
    println("Stack allocation at:", stackAddr)

    // Heap allocation
    p := allocOnHeap()
    heapAddr := uintptr(unsafe.Pointer(p))
    println("Heap allocation at:", heapAddr)

    // Multiple heap allocations - watch the bump pointer move
    for i := 0; i < 5; i++ {
        obj := allocOnHeap()
        addr := uintptr(unsafe.Pointer(obj))
        println("  Player", i, "at:", addr)
    }
}

Actual output from Dreamcast hardware (from tests/test_alloc_inspect.elf):

Stack allocation:
  Address (hex):     0x8c494cc4

Heap allocation:
  Address (hex):     0x8c084b00

Allocating 5 Player structs consecutively:
  Player 0 at: 0x8c084b50
  Player 1 at: 0x8c084b68  (+ 24 bytes)
  Player 2 at: 0x8c084b80  (+ 24 bytes)
  Player 3 at: 0x8c084b98  (+ 24 bytes)
  Player 4 at: 0x8c084bb0  (+ 24 bytes)

Global variable at:  0x8c05ecc0
  → Data segment (matches .data section start)

Notice the heap addresses increment by 24 bytes each time—that’s the 12-byte Player struct plus the 8-byte GC header, rounded up to 8-byte alignment. The bump pointer just keeps moving forward.

Using GDB to inspect:

# Start dc-tool with GDB server enabled
$ dc-tool-ip -t 192.168.x.x -g -x your_game.elf

# In another terminal, connect GDB
$ sh-elf-gdb your_game.elf
(gdb) target remote :2159

# Set breakpoint and run
(gdb) break main.main
(gdb) continue

# Examine heap memory (address from test output)
(gdb) x/32x 0x8c084b00    # Dump heap region
(gdb) info registers r15  # Stack pointer (SP)

# View GC heap structure
(gdb) p gc_heap           # Print GC heap state
(gdb) p gc_heap.alloc_ptr # Current bump pointer

Memory layout from real hardware (16 MB RAM at 0x8c000000-0x8d000000):

0x8c000000 ┌─────────────────────────────────────┐
           │ KOS kernel and system data          │
0x8c010000 ├─────────────────────────────────────┤
           │ .text (your compiled code)          │ ← Binary starts here
0x8c052aa0 ├─────────────────────────────────────┤
           │ .rodata (read-only data, strings)   │
0x8c05ecc0 ├─────────────────────────────────────┤
           │ .data (global variables)            │ ← Global at 0x8c05ecc0
0x8c0622ac ├─────────────────────────────────────┤
           │ Heap (KOS malloc)                   │
           │   - GC semi-spaces                  │ ← Heap alloc at 0x8c084b00
           │   - KOS thread stacks               │ ← Stack var at 0x8c494cc4
           │   - Other malloc allocations        │
           │                                     │
0x8d000000 └─────────────────────────────────────┘

Note: KOS manages thread stacks via malloc, so both heap allocations and stack memory come from the same pool. The addresses above are from running test_alloc_inspect.elf on real hardware.

But wait…! We never erase anything. Eventually the space fills up. Yikes!


Why Two Spaces? (Semi-Space Collection)

The bump allocator has a problem: it can only allocate, never free individual objects. When the space fills up, we need a way to reclaim garbage.

Why not free objects in place? Because it creates fragmentation:

┌──────────────────────────────────────────────────────┐
│ Player │ FREE │ Enemy │ FREE │ FREE │ Bullet │ FREE  │
└──────────────────────────────────────────────────────┘
          ↑       can't fit a 3-slot object here

You end up with “free” holes everywhere. A 3-slot object might not fit even though there’s enough total free space.

The solution: copy to a second space. Instead of freeing in place:

  1. Allocate a second space of equal size
  2. When the first space fills, scan for live objects (objects still referenced)
  3. Copy only live objects to the second space
  4. The first space is now 100% garbage—reset the bump pointer to the start

BEFORE (Space A full):           AFTER (Space B active):
┌────────────────────────┐       ┌────────────────────────┐
│ Player │ xxx │ Enemy │ │  →    │ Player │ Enemy │ Bullet│
│ xxx │ Bullet │ xxx │   │       │                        │
└────────────────────────┘       └────────────────────────┘
 (xxx = garbage)                  (compacted, no gaps!)

This copying collection solves two problems at once:

  • Garbage is reclaimed: everything left in Space A is garbage
  • Memory is compacted: no fragmentation in Space B

How Copying Works: Cheney’s Algorithm

The copying process uses an elegant algorithm invented by C.J. Cheney in 1970. It needs only two pointers and no recursion:

TO-SPACE:
┌────────────────────────────────────────────────────────┐
│ Player │ Enemy │ Bullet │                              │
└────────────────────────────────────────────────────────┘
         ↑                 ↑
       SCAN              ALLOC

  1. Start with roots (global variables, stack references, CPU registers)

    Why roots? The GC needs to know which objects are still in use. It can’t ask the running program—the program is paused. The only way to determine if an object is “live” is to check: can any code reach it? Roots are the starting points—references the program definitely has access to. If an object isn’t reachable from any root (directly or through a chain of pointers), no code can ever access it again. It’s garbage.

  2. Copy each root object to to-space at the ALLOC position, then move ALLOC forward by the object’s size (this is the same bump allocation from earlier—just alloc_ptr += size)

  3. Scan copied objects (starting at SCAN pointer) for pointers to other objects

    “Scan” doesn’t mean checking every byte—that would be slow and error-prone. Each object has type information (the __gcdata bitmap from its type descriptor) that tells the GC exactly which fields are pointers. The GC only checks those fields.

  4. If a referenced object hasn’t been copied, copy it to to-space

  5. Update the pointer to point to the new location

  6. Repeat until SCAN catches up with ALLOC—all live objects are now copied

The clever part: when you copy an object, you leave a forwarding pointer in the old location. If another reference points to that same object, you find the forwarding pointer and update the reference without copying again.

// Simplified from runtime/gc_copy.c
void *gc_copy_object(void *old_ptr) {
    gc_header_t *header = gc_get_header(old_ptr);
    
    // Already copied? Return the forwarding address
    if (GC_HEADER_IS_FORWARDED(header))
        return GC_HEADER_GET_FORWARD(header);
    
    size_t obj_size = GC_HEADER_GET_SIZE(header);
    
    // Copy to to-space at current alloc_ptr
    gc_header_t *new_header = (gc_header_t *)gc_heap.alloc_ptr;
    memcpy(new_header, header, obj_size);
    gc_heap.alloc_ptr += obj_size;
    
    void *new_ptr = gc_get_user_ptr(new_header);
    
    // Leave forwarding pointer in old location
    GC_HEADER_SET_FORWARD(header, new_ptr);
    
    return new_ptr;
}

Why this algorithm is elegant:

  • O(live objects) time—dead objects aren’t even touched
  • No recursion—just two pointers chasing each other
  • Single pass—scan and copy happen together
  • Compaction is free—objects naturally pack together

The trade-off: 50% of heap is always reserved for the copy destination.
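The SCAN/ALLOC chase is easy to model. Here's a toy Go simulation: array indices stand in for pointers and a `forwarded` field stands in for the forwarding pointer. The real implementation is the C code above.

```go
package main

import "fmt"

// Toy model of Cheney's algorithm: "pointers" are indices into a
// from-space slice; forwarded records where a copy landed in to-space.
type obj struct {
	name      string
	refs      []int // indices of referenced objects
	forwarded int   // index in to-space, -1 if not yet copied
}

// collect copies everything reachable from roots into a fresh to-space.
func collect(from []obj, roots []int) []obj {
	var to []obj // the ALLOC position is len(to)

	copyObj := func(i int) int {
		if from[i].forwarded >= 0 {
			return from[i].forwarded // already copied: follow forwarding pointer
		}
		from[i].forwarded = len(to) // leave forwarding pointer in old location
		to = append(to, obj{
			name:      from[i].name,
			refs:      append([]int(nil), from[i].refs...),
			forwarded: -1,
		})
		return from[i].forwarded
	}

	for _, r := range roots { // step 1: copy the roots
		copyObj(r)
	}
	for scan := 0; scan < len(to); scan++ { // SCAN chases ALLOC
		for j, ref := range to[scan].refs {
			to[scan].refs[j] = copyObj(ref) // copy referent, fix pointer
		}
	}
	return to // everything left in from-space is garbage
}

func main() {
	// from-space: A→B→C live, D unreachable
	from := []obj{
		{name: "A", refs: []int{1}, forwarded: -1},
		{name: "B", refs: []int{2}, forwarded: -1},
		{name: "C", forwarded: -1},
		{name: "D", forwarded: -1},
	}
	for _, o := range collect(from, []int{0}) {
		fmt.Println(o.name) // A, B, C — compacted; D was never touched
	}
}
```

Note how dead object D costs nothing: the loop terminates when SCAN catches ALLOC, and D is simply never visited.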


The 50% Memory Cost

You may have noticed the trade-off mentioned earlier: one space is always reserved for copying. That means half your heap is “unusable” at any given time.

┌─────────────────────────────────────────────────────┐
│        4 MB total GC heap                           │
│  ┌──────────────────┬──────────────────┐            │
│  │   Space A        │   Space B        │            │
│  │   2 MB           │   2 MB           │            │
│  │   (active)       │   (copy target)  │            │
│  └──────────────────┴──────────────────┘            │
│                                                     │
│  Usable at any time: 2 MB                           │
└─────────────────────────────────────────────────────┘

Why accept this 50% cost? Because you get:

  • No fragmentation: Cheney’s algorithm compacts automatically
  • O(1) allocation: just bump alloc_ptr, no free-list search
  • O(live objects) collection: dead objects aren’t even touched
  • Simple implementation: fewer bugs in the runtime
  • Cache-friendly: live objects end up packed together

It’s a deliberate trade-off: memory for speed and simplicity. On a 16 MB system where you’re also using VRAM and Sound RAM for assets, 2 MB of usable GC heap is often sufficient.

Customizing heap size: The default is GC_SEMISPACE_SIZE_KB=2048 (2 MB per space, 4 MB total). To change it, edit runtime/godc_config.h or rebuild libgodc with make CFLAGS="-DGC_SEMISPACE_SIZE_KB=1024" for 1 MB usable, leaving more RAM for game assets.


The Freeze

Here’s the bad news. When the GC runs, your game stops.

Timeline:
────────────────────────────────────────────────────────
Game:   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓████████████▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
                        ↑            ↑
                      GC starts    GC ends
                        
                     "stop-the-world"

All Go code freezes: game logic, physics, input handling. No goroutines run during collection. (Music keeps playing though—the AICA sound processor runs independently of the SH-4 CPU.)

How long does this take? Let’s find out with real numbers.


Real Benchmark Results

Benchmarks from actual Dreamcast hardware (from tests/bench_architecture.elf), verified December 2025:

┌─────────────────────────────────────────────────────┐
│  SCENARIO                   GC PAUSE                │
├─────────────────────────────────────────────────────┤
│  Large objects (≥128 KB)    ~73 μs   (bypass GC)    │
│  64 KB live data            ~2.2 ms                 │
│  32 KB live data            ~6.2 ms                 │
└─────────────────────────────────────────────────────┘

GC pause scales with the number of objects, not just total size. Many small objects (32 KB scenario) require more traversal and copying than fewer large objects.

Key insight: Allocations ≥64 KB bypass the GC heap entirely (go straight to malloc), which is why the “large objects” scenario shows only ~73 μs—that’s just the baseline GC setup cost with nothing to copy.

See the Glossary for a complete reference of all benchmark numbers.


What This Means for Games

Let’s do the math with real data (assuming ~128 KB live data = ~6 ms pause):

┌─────────────────────────────────────────────────────┐
│  TARGET FPS    FRAME BUDGET    GC PAUSE (~6ms)      │
├─────────────────────────────────────────────────────┤
│  60 FPS        16.7 ms         ~1/3 frame stutter   │
│  30 FPS        33.3 ms         barely noticeable    │
│  20 FPS        50 ms           unnoticeable         │
└─────────────────────────────────────────────────────┘

At 60 FPS, a 6ms GC pause is noticeable but brief. Keep live data small, and pauses stay short.


Big Objects Get Special Treatment

Here’s a surprise: big allocations skip the GC entirely!

small := make([]byte, 1000)      // → GC heap
big := make([]byte, 100*1024)    // → malloc (bypasses GC!)

The threshold is 64 KB:

┌─────────────────────────────────────────────────────┐
│  SIZE           WHERE IT GOES     FREED BY          │
├─────────────────────────────────────────────────────┤
│  < 64 KB        GC heap           GC (automatic)    │
│  ≥ 64 KB        malloc            NEVER! (manual)   │
└─────────────────────────────────────────────────────┘

Wait, never? That’s right. Big objects are never automatically freed.

Why? Copying a 256 KB texture during GC would be too slow. So we skip it entirely. But that means you’re responsible for freeing it.

      ⚠️  WARNING  ⚠️
      
      Large objects (≥64 KB) are NEVER 
      automatically freed by the GC!
      
      This is a memory leak unless you
      call freeExternal() manually (see next section).

When Is This OK?

Fine: Loading a texture at game start. It lives forever anyway.

Problem: Loading new textures every level without freeing old ones.


Freeing Big Objects

Here’s how to clean up big allocations:

import "unsafe"

//extern _runtime.FreeExternal
func freeExternal(ptr unsafe.Pointer)

// Load a big texture
texture := make([]byte, 256*1024)  // 256KB, bypasses GC

// Later, when done with it:
freeExternal(unsafe.Pointer(&texture[0]))
texture = nil  // Don't use it anymore!

The best time to do this? Level transitions.

func LoadLevel(num int) {
    // Free old level's big stuff
    if oldTexture != nil {
        freeExternal(unsafe.Pointer(&oldTexture[0]))
        oldTexture = nil
    }
    
    // Load new level
    oldTexture = loadTexture(num)
    
    // Clean up small stuff too
    runtime.GC()
}

EXERCISE

3.3 You load a 128 KB texture each level. After 10 levels without calling freeExternal(), how much memory have you leaked?


Making GC Hurt Less

Techniques to reduce GC impact, validated by real benchmarks from tests/bench_gc_techniques.elf.

Technique 1: Pre-allocate Slices

Benchmark result: 78% faster!

Real numbers from Dreamcast:

  • Growing slice: 72,027 ns/iteration
  • Pre-allocated: 40,450 ns/iteration

// SLOW: Slice grows, triggers multiple allocations
var items []int
for i := 0; i < 100; i++ {
    items = append(items, i)
}

Why is this slow? A slice in Go is three things: a pointer to data, a length, and a capacity. When you append beyond capacity, Go must:

  1. Allocate a new, larger array (typically 2x the size)
  2. Copy all existing elements to the new array
  3. Abandon the old array (becomes garbage for GC to collect)

Here’s what happens in memory when appending 5 items to an empty slice:

append #1:  Allocate [_], write item         → 1 alloc, 0 copies
append #2:  Full! Allocate [_,_], copy 1     → 2 allocs, 1 copy
append #3:  Full! Allocate [_,_,_,_], copy 2 → 3 allocs, 3 copies total
append #4:  Space available, just write      → 3 allocs, 3 copies total
append #5:  Full! Allocate [_,_,_,_,_,_,_,_], copy 4 → 4 allocs, 7 copies total

For 100 items, doubling triggers ~7 reallocations and copies ~127 elements in total (1 + 2 + 4 + … + 64). Each abandoned array is garbage that fills the heap faster.

Memory timeline (growing slice):
┌─────────────────────────────────────────────────────┐
│ [1]  ← alloc #1 (abandoned)                         │
│ [1,2]  ← alloc #2 (abandoned)                       │
│ [1,2,3,_]  ← alloc #3 (abandoned)                   │
│ [1,2,3,4,5,_,_,_]  ← alloc #4 (abandoned)           │
│ [1,2,3,4,5,6,7,8,9,...]  ← alloc #5 (current)       │
│                                                     │
│ GC must eventually clean up allocs #1-#4!           │
└─────────────────────────────────────────────────────┘
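You can check that arithmetic with a quick simulation. This assumes capacity doubles on every grow, a simplification of Go's actual growth policy:

```go
package main

import "fmt"

// growthCost counts allocations and element copies when appending
// n items one at a time to an empty slice, with doubling growth.
func growthCost(n int) (allocs, copies int) {
	capacity, length := 0, 0
	for i := 0; i < n; i++ {
		if length == capacity { // full: grow
			if capacity == 0 {
				capacity = 1
			} else {
				capacity *= 2
			}
			allocs++
			copies += length // old elements copied to the new array
		}
		length++
	}
	return
}

func main() {
	allocs, copies := growthCost(100)
	fmt.Println(allocs, copies) // 8 127
}
```

Eight allocations (capacities 1, 2, 4, …, 128), seven of which copy the old contents first.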

The fix: If you know (or can estimate) how many items you’ll need, pre-allocate:

// FAST: Pre-allocate with known capacity
items := make([]int, 0, 100)  // length=0, capacity=100
for i := 0; i < 100; i++ {
    items = append(items, i)
}

Memory timeline (pre-allocated):
┌─────────────────────────────────────────────────────┐
│ [_,_,_,_,_,...100 slots...]  ← single allocation    │
│ [1,_,_,_,_,...] → [1,2,_,_,...] → [1,2,3,_,...]     │
│                                                     │
│ No copying. No garbage. Just fill in the blanks.    │
└─────────────────────────────────────────────────────┘

No growing. No copying. No garbage. 78% faster.

When to use: Loading enemy spawns from a level file? You know the count. Parsing a protocol with a length header? Pre-allocate. Even a rough estimate (round up to next power of 2) beats growing from zero.
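One way to round up to the next power of two (nextPow2 is a hypothetical helper, not part of libgodc):

```go
package main

import "fmt"

// nextPow2 rounds n up to the next power of two — handy when you
// can only estimate a slice's final size.
func nextPow2(n int) int {
	p := 1
	for p < n {
		p *= 2
	}
	return p
}

func main() {
	estimated := 100 // rough guess at, say, enemy count
	items := make([]int, 0, nextPow2(estimated))
	fmt.Println(cap(items)) // 128
}
```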


Technique 2: Object Pools

Important: Pools are NOT faster for allocation!

Real numbers from Dreamcast:

  • new() allocation: 201 ns/object
  • Pool get/return: 1,450 ns/object (7x slower!)

This is counter-intuitive if you’re coming from desktop Go or other languages. Let’s understand why.

Why is new() so fast? Our bump allocator is essentially one operation:

new(Bullet):
┌───────────────────────────────────────────────────────┐
│ alloc_ptr → [████████ used █████|▓▓▓▓ free ▓▓▓▓▓]     │
│                                 ↑                     │
│                            alloc_ptr += sizeof(Bullet)│
│                                                       │
│ Total: 1 pointer increment. Done.                     │
└───────────────────────────────────────────────────────┘

That’s it. No free lists to search. No size classes. No locking. Just bump the pointer forward. This is why 201 ns is achievable—it’s maybe 40-50 CPU cycles.

Why are pools slower? Pool operations involve slice manipulation:

GetFromPool():
┌─────────────────────────────────────────────────────┐
│ 1. Check if len(pool) > 0      ← bounds check       │
│ 2. Read pool[len-1]            ← memory access      │
│ 3. pool = pool[:len-1]         ← slice header write │
│ 4. Return pointer              ← done               │
│                                                     │
│ ReturnToPool():                                     │
│ 1. Reset object fields         ← memory writes      │
│ 2. pool = append(pool, obj)    ← may grow slice!    │
│                                                     │
│ Total: ~7x more work than bump allocation           │
└─────────────────────────────────────────────────────┘

So why use pools at all? The trade-off isn’t about allocation speed. It’s about when you pay the cost:

WITHOUT POOL (100 bullets/frame):
─────────────────────────────────────────────────────
Frame 1:  new new new new... (100x)  │ 20 μs │ smooth
Frame 2:  new new new new... (100x)  │ 20 μs │ smooth
Frame 3:  new new new new... (100x)  │ 20 μs │ smooth
  ...
Frame 50: GC TRIGGERED!              │ 6 ms  │ ← STUTTER!
─────────────────────────────────────────────────────
                                     └─ 60 FPS target = 16.6 ms
                                        6 ms pause = 1/3 frame drop


WITH POOL (100 bullets/frame):
─────────────────────────────────────────────────────
Frame 1:  get get get... return...   │ 145 μs │ smooth
Frame 2:  get get get... return...   │ 145 μs │ smooth
Frame 3:  get get get... return...   │ 145 μs │ smooth
  ...
Frame 50: (no GC needed)             │ 145 μs │ still smooth!
─────────────────────────────────────────────────────

You’re trading ~125 μs per frame for no GC pauses. For a bullet hell game, that’s worth it.

When to use pools:

  • High-frequency create/destroy (bullets, particles, audio events)
  • Objects with predictable lifetimes (spawned and despawned together)
  • When you need consistent frame times (no surprise stutters)

When NOT to use pools:

  • Objects created once and kept (player, level geometry)
  • Low churn rate (a few allocations per second)
  • Prototype/debugging (just use new(), it’s simpler)

Simple pool implementation:

var pool []*Bullet

func GetBullet() *Bullet {
    if len(pool) > 0 {
        b := pool[len(pool)-1]
        pool = pool[:len(pool)-1]
        return b
    }
    return new(Bullet)  // Pool empty? Allocate fresh
}

func ReturnBullet(b *Bullet) {
    b.X, b.Y, b.Active = 0, 0, false  // Reset state!
    pool = append(pool, b)
}

Pro tip: Pre-populate the pool at game start to avoid any new() calls during gameplay:

func InitBulletPool(size int) {
    pool = make([]*Bullet, size)
    for i := range pool {
        pool[i] = new(Bullet)
    }
}

Now GetBullet() never allocates during gameplay—predictable performance every frame.
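Putting the pool together with a concrete Bullet type, you can verify the reuse directly — the second Get hands back the very object you returned:

```go
package main

import "fmt"

type Bullet struct {
	X, Y   float32
	Active bool
}

var pool []*Bullet

func GetBullet() *Bullet {
	if len(pool) > 0 {
		b := pool[len(pool)-1]
		pool = pool[:len(pool)-1]
		return b
	}
	return new(Bullet) // pool empty? allocate fresh
}

func ReturnBullet(b *Bullet) {
	b.X, b.Y, b.Active = 0, 0, false // reset state!
	pool = append(pool, b)
}

func main() {
	a := GetBullet()    // pool empty → fresh allocation
	ReturnBullet(a)     // back into the pool
	b := GetBullet()    // no allocation this time
	fmt.Println(a == b) // true — the same object came back
}
```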


Technique 3: Trigger GC at Safe Times

Benchmark: Manual GC takes ~35 μs with minimal live data

The problem with automatic GC is unpredictability. You don’t control when it runs. It just happens when the heap fills up. That might be during a boss fight.

GC pause times from real benchmarks (from bench_gc_pause.elf):

┌─────────────────────────────────────────────────────┐
│  LIVE DATA     GC PAUSE     IMPACT AT 60 FPS        │
├─────────────────────────────────────────────────────┤
│  Minimal       ~100 μs      Unnoticeable            │
│  32 KB         ~2 ms        Minor stutter           │
│  128 KB        ~6 ms        1/3 frame drop          │
└─────────────────────────────────────────────────────┘

The key insight: GC pause scales with live data, not garbage. If you trigger GC when live data is minimal (between levels, during menus), the pause is tiny.

Uncontrolled vs Controlled GC:

UNCONTROLLED (GC surprises you):
─────────────────────────────────────────────────────────────
│ Gameplay ││ Gameplay ││ Gameplay ││ GC! ││ Gameplay       │
│  smooth  ││  smooth  ││  smooth  ││6 ms!││  smooth        │
─────────────────────────────────────────────────────────────
                                      ↑
                                 Player notices!
                                 "Why did it stutter
                                  when I jumped?"


CONTROLLED (you choose when):
─────────────────────────────────────────────────────────────
│ Gameplay ││ Menu Opens ││ Gameplay ││ Level End ││ Next   │
│  smooth  ││ GC (35 μs) ││  smooth  ││ GC (35 μs)││Level   │
─────────────────────────────────────────────────────────────
              ↑                         ↑
         Player is reading         Victory animation
         menu anyway               playing anyway

How to trigger GC manually:

import _ "unsafe" // go:linkname requires importing unsafe

//go:linkname forceGC runtime.GC
func forceGC()

Best times to trigger GC (player won’t notice):

func OnDialogueStart() {
    forceGC()  // Text appearing letter-by-letter anyway
}

func OnMenuOpen() {
    forceGC()  // Player is reading options
}

func OnLevelComplete() {
    forceGC()  // Victory fanfare playing, score tallying
}

func OnLoadingScreen() {
    forceGC()  // Already showing "Loading..."
}

func OnRoomTransition() {
    forceGC()  // Screen is fading to black
}

func OnCutsceneStart() {
    forceGC()  // Video/animation taking over
}

Important caveats:

  1. Don’t trigger too often. GC still takes time. Once per scene transition is reasonable. Once per frame defeats the purpose.

  2. This doesn’t reduce garbage. You’re just choosing when to pay the cost. Combine with pre-allocation and pools to reduce how much garbage you create.

  3. Live data still matters. If you have 128 KB of permanent game state, even manual GC takes ~6 ms. Keep live data lean.

Good: Trigger GC → level enemies/items are garbage → fast GC
Bad:  Trigger GC → 10,000 persistent objects → slow GC anyway
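If you want a guard against over-triggering (caveat 1), a simple frame-gated policy works. This is an illustrative sketch; gcPolicy and maybeCollect are made-up names, not libgodc API:

```go
package main

import "fmt"

// gcPolicy only allows a manual GC when at least minFrames have
// passed since the last one, so a burst of menu opens doesn't
// trigger back-to-back collections.
type gcPolicy struct {
	lastFrame int
	minFrames int
}

func (p *gcPolicy) maybeCollect(frame int) bool {
	if frame-p.lastFrame < p.minFrames {
		return false // too soon — skip
	}
	p.lastFrame = frame
	return true // caller would invoke forceGC() here
}

func main() {
	p := &gcPolicy{lastFrame: -600, minFrames: 600} // ≥10 s apart at 60 FPS
	fmt.Println(p.maybeCollect(100)) // true
	fmt.Println(p.maybeCollect(150)) // false (only 50 frames later)
}
```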

Technique 4: Reuse Slices

Benchmark: 5% faster (13,200 ns → 12,500 ns)

Small gain per-call, but the real win is less garbage over time. Reset with [:0] instead of allocating new:

// BAD: New allocation every frame
func ProcessFrame() {
    items := make([]int, 0, 100)  // ← garbage next frame
    // ...
}

// GOOD: Reuse backing array
var items = make([]int, 0, 100)  // Allocate once

func ProcessFrame() {
    items = items[:0]  // Reset length, keep capacity
    // ...
}

The [:0] trick keeps the backing array. Over 1000 frames: 1 allocation instead of 1000.
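You can verify that [:0] keeps the backing array by watching capacity across simulated frames — it never changes, so the appends never allocate:

```go
package main

import "fmt"

func main() {
	items := make([]int, 0, 100) // allocate once
	for frame := 0; frame < 3; frame++ {
		items = items[:0] // reset length, keep capacity
		for i := 0; i < 50; i++ {
			items = append(items, i) // fills the existing array
		}
	}
	fmt.Println(len(items), cap(items)) // 50 100
}
```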

Bonus pattern—shift without allocating:

// Creates new slice header:
queue = append(queue[1:], newItem)

// Reuses existing array:
copy(queue, queue[1:])
queue[len(queue)-1] = newItem

Technique 5: Compact In-Place

When entities die, don’t allocate a filtered slice. Compact the existing one:

// BAD: Allocates new slice
alive := make([]*Enemy, 0)
for _, e := range enemies {
    if e.Active {
        alive = append(alive, e)  // ← garbage
    }
}
enemies = alive

// GOOD: Compact in place
n := 0
for _, e := range enemies {
    if e.Active {
        enemies[n] = e
        n++
    }
}
enemies = enemies[:n]  // Shrink, no allocation

Visual:

Before: [A, _, B, _, _, C]  (3 active, 3 dead)
         ↓ compact
After:  [A, B, C]           (same backing array, shorter length)

Classic game loop pattern: every frame, compact dead bullets/particles/enemies without touching the allocator.
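The same pattern wrapped as a reusable helper (a sketch — swap in your game's own Enemy type):

```go
package main

import "fmt"

type Enemy struct {
	Name   string
	Active bool
}

// compact moves live enemies to the front of the same backing
// array and shrinks the length. Zero allocations.
func compact(enemies []*Enemy) []*Enemy {
	n := 0
	for _, e := range enemies {
		if e.Active {
			enemies[n] = e
			n++
		}
	}
	return enemies[:n]
}

func main() {
	enemies := []*Enemy{
		{"A", true}, {"x", false}, {"B", true},
		{"x", false}, {"x", false}, {"C", true},
	}
	for _, e := range compact(enemies) {
		fmt.Println(e.Name) // A, B, C
	}
}
```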

Goroutines

The Trade-off

Let me set expectations: goroutines on Dreamcast work, but differently than on modern hardware.

You get zero parallelism (single CPU), but you get everything else: clean concurrency primitives, channels, and code that feels like Go.

Here’s the thing. Goroutines shine when you have multiple CPU cores:

Modern PC (8 cores):
────────────────────────────────────────────────────────────
Core 1: [──────goroutine A──────]
Core 2: [──────goroutine B──────]
Core 3: [──────goroutine C──────]
Core 4: [──────goroutine D──────]
...
        ↑
        All running SIMULTANEOUSLY
        4x faster than running them one-by-one!

But Dreamcast?

Dreamcast (1 core):
────────────────────────────────────────────────────────────
CPU:    [───A───][───B───][───A───][───C───][───B───]...
        ↑
        Only ONE runs at a time
        ZERO parallelism benefit

So why does libgodc implement them?


Why Bother?

Because Go without goroutines isn’t Go.

Imagine porting Python to a machine without lists. Or JavaScript without callbacks. You could do it, but would it feel like the same language?

I wanted Go on Dreamcast to feel like Go. You can write:

go processEnemies()
go playBackgroundMusic()
go handleInput()

It works. It’s correct. The code is cleaner. It’s just not faster than calling them directly:

processEnemies()
playBackgroundMusic()
handleInput()

There’s overhead—but less than you might expect. Let’s see the numbers.


What Happens Under the Hood

When you create a goroutine, here’s what actually happens:

┌─────────────────────────────────────────────────────────────┐
│   go doSomething()                                          │
│   ────────────────                                          │
│                                                             │
│   1. Allocate 64 KB stack (from pool or malloc)             │
│   2. Initialize G struct (~150 bytes)                       │
│   3. Save 16 CPU registers to context                       │
│   4. Set up context (sp, pc, pr)                            │
│   5. Add to run queue                                       │
│   6. Later: context switch to run (~6.6 μs)                 │
│   ─────────────────────────────────────────────────────     │
│   Total spawn + first run: ~32 μs                           │
│                                                             │
│   That's ~6,400 CPU cycles per goroutine spawn!             │
└─────────────────────────────────────────────────────────────┘

What do you get for this overhead? On a multi-core system: parallelism. On Dreamcast: proper Go semantics and working concurrency primitives. That’s actually worth something!

The Numbers

I ran benchmarks on real Dreamcast hardware (from bench_architecture.elf):

┌─────────────────────────────────────────────────────────────┐
│   OPERATION               TIME                              │
├─────────────────────────────────────────────────────────────┤
│   runtime.Gosched()       120 ns      ← very cheap!         │
│   Buffered channel op     ~1.5 μs                           │
│   Context switch          ~6.6 μs                           │
│   Channel round-trip      ~13 μs                            │
│   Goroutine spawn+run     ~34 μs                            │
└─────────────────────────────────────────────────────────────┘

At 200 MHz, you get about 200 million cycles per second. At 60 FPS you have 3.3 million cycles per frame. A 34 μs goroutine spawn is ~6,800 cycles—that’s only 0.2% of your frame budget. You can afford a few goroutines per frame, just don’t spawn hundreds!

See the Glossary for a complete reference of all benchmark numbers.


How It Works

The implementation is pretty elegant for a 200 MHz machine. Let’s see how we create the illusion of concurrency.

The G Struct

Every goroutine is a G structure (see runtime/goroutine.h):

┌─────────────────────────────────────────────────────────────┐
│   Goroutine (G)                                             │
│                                                             │
│   _panic:     nil         (current panic - offset 0)        │
│   _defer:     nil         (deferred functions - offset 4)   │
│   atomicstatus: Grunning  (or Gwaiting, Grunnable, etc.)    │
│   schedlink:  next G      (run queue linkage)               │
│   stack_lo:   0x8c100000  (bottom of stack)                 │
│   stack_hi:   0x8c110000  (top of stack, 64 KB above)       │
│   context:    saved CPU registers (64 bytes)                │
│                           ├── r8-r14 (callee-saved GPRs)    │
│                           ├── sp, pc, pr (special)          │
│                           └── fr12-fr15, fpscr, fpul (FPU)  │
│   goid:       42          (unique ID - 8 bytes)             │
│   waiting:    sudog*      (channel wait queue entry)        │
│   checkpoint: ptr         (for panic/recover)               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The key is context, aka the saved CPU registers. This lets us pause mid-function and resume later.

The Run Queue

Runnable goroutines wait in line:

     head                                    tail
       ↓                                       ↓
    ┌────┐   ┌────┐   ┌────┐   ┌────┐
    │ G3 │──▶│ G7 │──▶│ G2 │──▶│ G9 │──▶ NULL
    └────┘   └────┘   └────┘   └────┘
      ↑
   "I'm next!"

The scheduler is simple:

while (true) {
    G *gp = runq_get();       // Get next goroutine
    if (gp) {
        switch_to(gp);        // Run it
    }
    // When it yields, we come back here
}

Context Switching

This is where the magic happens. We’re running goroutine A, and we need to switch to B:

STEP 1: Save A's registers to A's context
────────────────────────────────────────────────────────
        CPU                         A's Context
    ┌─────────┐                   ┌─────────┐
    │ r8 = 42 │ ────────────────▶ │ r8 = 42 │
    │ r9 = 17 │                   │ r9 = 17 │
    │ sp = X  │                   │ sp = X  │
    │ pc = Y  │                   │ pc = Y  │
    └─────────┘                   └─────────┘


STEP 2: Load B's registers from B's context  
────────────────────────────────────────────────────────
    B's Context                       CPU
    ┌─────────┐                   ┌─────────┐
    │ r8 = 99 │ ────────────────▶ │ r8 = 99 │
    │ r9 = 55 │                   │ r9 = 55 │
    │ sp = P  │                   │ sp = P  │
    │ pc = Q  │                   │ pc = Q  │
    └─────────┘                   └─────────┘


STEP 3: Return (now running B!)
────────────────────────────────────────────────────────
CPU continues from B's saved PC with B's saved registers.
To B, it's like it never stopped running!

On SH-4, we save/restore 16 registers (64 bytes). The full context switch with FPU takes ~88 cycles. With lazy FPU optimization (skipping FPU for integer-only goroutines), it drops to ~38 cycles. At 200 MHz, that’s under 0.5 microseconds—the total yield path including scheduler overhead is ~6.6 μs as shown in the benchmarks.


Cooperative Scheduling: The Gotcha

Our scheduler is cooperative, not preemptive. This is different from official Go!

Preemptive (official Go since 1.14): The runtime can forcibly pause a goroutine at any time using timer interrupts or signals. Even an infinite loop gets interrupted so other goroutines can run.

Cooperative (libgodc): Goroutines must volunteer to give up the CPU. The runtime never forces a switch. If a goroutine doesn’t yield, nothing else runs.

Why the difference? Preemptive scheduling requires:

  • Signal handlers or timer interrupts to interrupt running code
  • Complex stack inspection to find safe preemption points
  • More saved state per context switch

On Dreamcast, we keep it simple. The cost is that you must be careful:

// This freezes your Dreamcast (but works fine in official Go!):
func badGoroutine() {
    for {
        x++  // Infinite loop, never yields
    }
}

Where Goroutines Yield

┌─────────────────────────────────────────────────────────────┐
│   YIELDS (lets others run)         DOESN'T YIELD            │
├─────────────────────────────────────────────────────────────┤
│   ✓ Channel send: ch <- x          ✗ Math: x + y * z        │
│   ✓ Channel receive: <-ch          ✗ Memory: array[i]       │
│   ✓ time.Sleep()                   ✗ Loops: for i := ...    │
│   ✓ runtime.Gosched()                                       │
│   ✓ select {}                                               │
└─────────────────────────────────────────────────────────────┘

The Fix for Long Computations

// Bad: No yields for 10 million iterations
for i := 0; i < 10000000; i++ {
    result += compute(i)
}

// Good: Yield periodically
for i := 0; i < 10000000; i++ {
    result += compute(i)
    if i % 10000 == 0 {
        runtime.Gosched()  // Let others run
    }
}

Note: if you have a single long computation with no natural yield points, a direct function call is simpler. Goroutines shine when you have multiple things that can interleave.


When Goroutines Shine

Goroutines work well for several patterns. Here’s real benchmark data from bench_goroutine_usecase.elf:

┌─────────────────────────────────────────────────────────────┐
│   USE CASE                    OVERHEAD    VERDICT           │
├─────────────────────────────────────────────────────────────┤
│   Multiple independent tasks  10-38%      ✓ Acceptable      │
│   Producer-consumer pattern   ~163%       ⚠ Use carefully   │
│   Channel ping-pong           ~13 μs/op   Know the cost     │
└─────────────────────────────────────────────────────────────┘

The key insight: independent tasks (each goroutine does its own work, minimal channel communication) have reasonable overhead (typically ~25%, varies with scheduling). Heavy channel use (producer-consumer with many sends) costs ~163%.
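As a sketch, the independent-task pattern looks like this (the three update functions are hypothetical stand-ins for real game work):

```go
package main

// Hypothetical stand-ins for independent chunks of game work.
func updatePhysics() {}
func updateAI()      {}
func updateAudio()   {}

func main() {
	// Each goroutine does its own work and signals completion exactly once.
	// Minimal channel traffic keeps overhead in the acceptable range.
	done := make(chan bool, 3)
	go func() { updatePhysics(); done <- true }()
	go func() { updateAI(); done <- true }()
	go func() { updateAudio(); done <- true }()
	for i := 0; i < 3; i++ {
		<-done // wait for all three to finish
	}
}
```

Each task touches a channel once at the end, so the per-operation channel cost is paid three times total, not once per unit of work.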

Porting Existing Go Code

If you’re porting Go code that uses goroutines, it works without modification:

// This Go code just works:
func fetch(urls []string) []Result {
    ch := make(chan Result, len(urls))
    for _, url := range urls {
        go func(u string) {
            ch <- download(u)
        }(url)
    }
    // Collect one result per URL
    results := make([]Result, 0, len(urls))
    for range urls {
        results = append(results, <-ch)
    }
    return results
}


Patterns to Avoid

Some patterns don’t make sense on a single-core system:

Don’t: Spawn Per-Item

// Inefficient: 1000 spawns = 32 ms overhead
for i := 0; i < 1000; i++ {
    go process(items[i])
}

// Better: Process directly, or use one goroutine
for i := 0; i < 1000; i++ {
    process(items[i])
}

Don’t: Force Sequential With Channels

// Overcomplicated: These are sequential anyway
go step1()
<-done1
go step2()
<-done2

// Simpler:
step1()
step2()

Be Careful: Heavy Channel Traffic

// Each channel op is ~13 μs
// High-volume producer-consumer shows ~163% overhead
for item := range items {
    workChan <- item
}

For high-throughput paths, batch items or use direct calls.
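Batching might look like this sketch (the Item type, the channel of slices, and the batch size of 64 are illustrative choices, not libgodc API):

```go
package main

// Sketch: amortize the ~13 μs channel cost by sending slices of items
// instead of one item per send.
type Item int

func main() {
	items := make([]Item, 200)
	workChan := make(chan []Item, 8)

	go func() {
		batch := make([]Item, 0, 64)
		for _, item := range items {
			batch = append(batch, item)
			if len(batch) == cap(batch) {
				workChan <- batch // one channel op per 64 items
				batch = make([]Item, 0, 64)
			}
		}
		if len(batch) > 0 {
			workChan <- batch // flush the final partial batch
		}
		close(workChan)
	}()

	total := 0
	for batch := range workChan {
		total += len(batch) // consumer handles a whole batch per receive
	}
	println(total) // 200 items moved in just 4 channel sends
}
```

With 200 items and batches of 64, the producer performs 4 sends instead of 200, cutting channel overhead by roughly 50×.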

Panic and Recover

Two Kinds of Errors

Most errors in Go are… boring. And that’s good! You handle them like this:

file, err := openFile("game.sav")
if err != nil {
    // No saved game? No problem.
    // Start a new game instead.
}

The function tells you something went wrong, and you decide what to do. Maybe you retry. Maybe you use a default. Maybe you tell the user. It’s your choice.

But some errors are different. They’re programmer mistakes:

enemies := []Enemy{orc, goblin, troll}
enemy := enemies[99]  // WAIT. There's only 3 enemies!

This isn’t “the file doesn’t exist.” This is “the code is broken.” There’s no sensible way to continue.

This is when Go panics.


What Happens When You Panic

Here’s the sequence, step by step:

                  Normal Execution
                        ↓
        ┌───────────────────────────────┐
        │  enemies := []Enemy{...}      │
        │  enemy := enemies[99]         │ ← PANIC!
        │  moveEnemy(enemy)             │ ← never runs
        └───────────────────────────────┘
                        ↓
              EXECUTION STOPS
                        ↓
        ┌───────────────────────────────┐
        │  Run all deferred functions   │
        │  (in reverse order!)          │
        └───────────────────────────────┘
                        ↓
          Did any defer call recover()?
                  /           \
                YES             NO
                 ↓               ↓
        Program continues   Program dies

The key insight: deferred functions always run, even during a panic. This is Go’s cleanup guarantee. Well… there are a few truly pathological cases (e.g. a panic before runtime init, or too many nested panics) where this guarantee breaks down.


Defer: The Cleanup Crew

Before we talk more about panic, let’s understand defer. It’s simple but powerful.

func processEnemy(e *Enemy) {
    file := openLog("combat.log")
    defer closeLog(file)  // "Remember to do this when I leave!"
    
    damage := calculateDamage(e)
    applyDamage(e, damage)
    
    // closeLog runs here, automatically
}

The defer keyword says: “Don’t run this now. Run it when the function exits.”

No matter how you exit—return, panic, whatever—the deferred function runs.

Multiple Defers: LIFO

If you have multiple defers, they run in reverse order. Last in, first out. Like a stack of plates:

func setup() {
    defer println("First defer")   // Runs 3rd
    defer println("Second defer")  // Runs 2nd
    defer println("Third defer")   // Runs 1st
    println("Normal code")
}

// Output:
// Normal code
// Third defer
// Second defer
// First defer

Why reverse order? Think about it: if you opened file A, then file B, you want to close B before A. The last thing you set up is the first thing you tear down.

Visualizing the Defer Chain

Each goroutine maintains a linked list of deferred functions:

G.defer → [cleanup3] → [cleanup2] → [cleanup1]
            newest                    oldest
             runs                      runs
             first                     last

When the function returns (or panics):

  1. Pop cleanup3, run it
  2. Pop cleanup2, run it
  3. Pop cleanup1, run it
  4. Done!

Recover: Catching the Fall

Here’s the safety net. recover() catches a panic mid-flight:

func safeGameLoop() {
    if runtime_checkpoint() != 0 {
        // We land here after recovering from a panic
        // libgodc needs this, if you are going to use "recover" mechanisms
        println("Recovered! Returning to main menu...")
        return
    }
    
    defer func() {
        if r := recover(); r != nil {
            println("Caught panic:", r)
        }
    }()
    
    runGame()  // If this panics, we catch it!
}

func main() {
    safeGameLoop()
    println("Program continues!")  // This runs even after panic!
}

Note: libgodc requires runtime_checkpoint() for recover to work properly. Without it, even a successful recover() will terminate the program. Standard Go handles this automatically via DWARF unwinding, but we use setjmp/longjmp instead (explained later in this chapter).

Let’s trace what happens:

1. safeGameLoop() starts
2. runtime_checkpoint() saves recovery point, returns 0
3. defer registers our recovery function
4. runGame() starts
5. ... something bad happens ...
6. PANIC!
7. Deferred function runs
8. recover() catches the panic, marks it recovered
9. longjmp back to checkpoint, runtime_checkpoint() returns 1
10. "Recovered!" prints, function returns normally
11. "Program continues!" prints

The panic was caught. The program lives.


The Golden Rule

Here’s the catch: recover only works inside a deferred function.

// THIS WORKS ✓
defer func() {
    recover()  // Called directly in defer
}()

// THIS DOESN'T WORK ✗
recover()  // Not in a defer—does nothing!

Why? Because recover needs to intercept the panic during the cleanup phase. If you’re not in a defer, you’re not in cleanup mode.

libgodc note: Standard Go is even stricter—recover must be called directly in the defer, not in a helper function. We relaxed this rule because it’s complex to implement and the behavior difference is benign for games. More panics get caught, which is fine.


How We Implement It

Standard Go uses something called DWARF unwinding. It’s sophisticated: the compiler generates detailed metadata about every function’s stack layout, and a runtime library uses this to carefully walk back up the stack.

That’s a lot of complexity, and we don’t have DWARF unwinding support on Dreamcast (at least not yet).

Instead, we use an old C trick: setjmp/longjmp.

The Teleportation Trick

Imagine setjmp as dropping a bookmark:

jmp_buf bookmark;

if (setjmp(bookmark) == 0) {
    // First time through: setjmp returns 0
    printf("Starting...\n");
    doRiskyThing();
    printf("Made it!\n");
} else {
    // After longjmp: setjmp returns 1
    printf("Something went wrong!\n");
}

And longjmp teleports you back to that bookmark:

void doRiskyThing() {
    // ...
    if (disaster) {
        longjmp(bookmark, 1);  // TELEPORT!
    }
    // ...
}

When longjmp is called, execution jumps back to setjmp, which now returns 1 instead of 0. All the function calls in between? Gone. Skipped. Like they never happened.

The Recovery Path

┌─────────────────────────────────────────────────────────────┐
│   PANIC WITH CHECKPOINT                                     │
│                                                             │
│   func risky() {                                            │
│       if runtime_checkpoint() != 0 {                        │
│           return  // Recovered! Continue here.              │
│       }                                                     │
│       defer func() {                                        │
│           recover()                                         │
│       }()                                                   │
│       panic("oops")  // longjmp to checkpoint               │
│   }                                                         │
│                                                             │
│   → Clean, predictable                                      │
│   → Required for recover() to work in libgodc               │
└─────────────────────────────────────────────────────────────┘

Important: Without runtime_checkpoint(), calling recover() will still mark the panic as recovered, but the program will terminate with “FATAL: recover without checkpoint”. The checkpoint is required for proper recovery in libgodc.


When Nobody Catches the Panic

If no recover catches the panic, the program dies. On Dreamcast, you’ll see:

panic: index out of range [99] with length 3

goroutine 1 [running]:
  0x8c010234
  0x8c010456
  0x8c010678

Memory: arena=4194304 used=1258291 free=2936013

The console halts. The user has to manually reset. This is intentional. A crash is better than continuing with corrupted state and zombies.


When Should You Panic?

Here’s the decision tree:

Is this a programmer mistake?
        │
        ├── YES → Maybe panic is okay
        │           ├── nil pointer dereference
        │           ├── index out of bounds
        │           └── calling method on nil
        │
        └── NO → DON'T PANIC. Return an error.
                    ├── File not found
                    ├── Network timeout
                    ├── Invalid user input
                    └── Resource unavailable

When Recover Makes Sense

Use recover at boundaries—places where you want to contain failures. In libgodc, remember to use runtime_checkpoint():

func handleEventSafely(event Event) {
    if runtime_checkpoint() != 0 {
        println("Event handler crashed, continuing...")
        return
    }
    
    defer func() {
        if r := recover(); r != nil {
            println("Caught:", r)
        }
    }()
    
    handleEvent(event)  // If this panics, we catch it
}

One bad event handler shouldn’t kill the entire game.

For general Go error handling best practices (when to panic vs return errors), see Effective Go.


Data Structures

Part 1: Strings

The Million-Dollar Question

How long is this string?

"Hello, Dreamcast!"

In C, you have to count:

char *msg = "Hello, Dreamcast!";
int len = 0;
while (msg[len] != '\0') {  // Keep going until null byte
    len++;
}

// H  e  l  l  o  ,     D  r  e  a  m  c  a  s  t  !  \0
// 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
// len is now 17... but we checked 18 characters!

C strings end with a special “null byte” (\0). To find the length, you walk through every character until you hit it. For a 10,000-character string, that’s 10,000 checks.

Go strings are smarter. They remember their length:

┌────────────────┐       ┌───┬───┬───┬───┬───┐
│  str: ─────────┼──────▶│ h │ e │ l │ l │ o │
│  len: 5        │       └───┴───┴───┴───┴───┘
└────────────────┘

In libgodc, this is an 8-byte structure (on 32-bit Dreamcast):

// From runtime/runtime.h, see GoString C struct
typedef struct {
    const uint8_t *str;  // 4 bytes: pointer to character data
    intptr_t len;        // 4 bytes: length in bytes
} GoString;

Unlike C strings (null-terminated), Go strings store their length explicitly. This means:

  • O(1) length lookup: just read the len field
  • Can contain null bytes: no special terminator
  • Bounds checked: we know exactly where the string ends
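For example, a Go string can carry a null byte in the middle, and its length is still a single field read:

```go
package main

func main() {
	s := "ab\x00cd"  // an embedded null byte is perfectly legal
	println(len(s))  // 5 — the length field already knows, no scanning
}
```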

String Allocation

Strings are immutable. Every concatenation allocates new memory:

s := "foo" + "bar"  // Allocates 6 bytes, copies both strings

Repeated concatenation in a loop is O(n²), where each iteration copies all previous data. This is a common Go performance pitfall; see Effective Go for solutions.
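The standard fix is to build into a []byte and convert to a string once at the end:

```go
package main

func main() {
	// O(n²): every += allocates and copies the whole string built so far
	s := ""
	for i := 0; i < 1000; i++ {
		s += "x"
	}

	// O(n): append into a pre-sized []byte, convert once
	buf := make([]byte, 0, 1000)
	for i := 0; i < 1000; i++ {
		buf = append(buf, 'x')
	}
	fast := string(buf)

	println(len(s) == len(fast)) // true — same result, far fewer copies
}
```

The benchmark section later in this chapter measures exactly this gap on real hardware.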

The tmpBuf Optimization

Here’s a secret: libgodc cheats for short strings.

When you concatenate strings that total ≤32 bytes, we use a stack buffer instead of allocating from the heap:

"a" + "b" = "ab"

Stack (fast):  ┌────────────────────────────────┐
               │ a │ b │   │   │ ... │   │   │  │  32 bytes
               └────────────────────────────────┘

No GC allocation needed!

This happens automatically. You don’t have to do anything—the compiler passes a stack buffer to the runtime, and we use it when we can.


Part 2: Slices

The Three-Part Header

A slice is not just a pointer. It’s a header (a small struct) with three fields:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Slice: []int with values [10, 20, 30]                     │
│                                                             │
│   ┌────────────────┐        ┌─────┬─────┬─────┬─────┬─────┐ │
│   │  array: ───────────────▶│ 10  │ 20  │ 30  │  ?  │  ?  │ │
│   │  len:   3      │        └─────┴─────┴─────┴─────┴─────┘ │
│   │  cap:   5      │             ▲           ▲              │
│   └────────────────┘          length      capacity          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
  • array: pointer to the underlying data
  • len: how many elements are currently in use
  • cap: how many elements could fit before reallocation

Think of it like a notebook. You have 100 pages (capacity), but you’ve only written on 30 (length).

The Magic of Slicing

Here’s the trick that makes Go slices amazing. When you “slice” a slice, no data is copied:

a := []int{10, 20, 30, 40, 50}
b := a[1:4]  // b is [20, 30, 40]

What actually happens:

Underlying array:
┌─────┬─────┬─────┬─────┬─────┐
│ 10  │ 20  │ 30  │ 40  │ 50  │
└─────┴─────┴─────┴─────┴─────┘
  ▲     ▲
  │     │
  │     └── b.array points here
  │         b.len = 3
  │         b.cap = 4
  │
  └── a.array points here
      a.len = 5
      a.cap = 5

Both a and b point to the same memory. Slicing is O(1) — just create a new 12-byte header.

The Sharing Trap

But wait. If they share memory…

a := []int{10, 20, 30, 40, 50}
b := a[1:4]

b[0] = 999  // What happens to a?

After b[0] = 999:
┌─────┬─────┬─────┬─────┬─────┐
│ 10  │ 999 │ 30  │ 40  │ 50  │
└─────┴─────┴─────┴─────┴─────┘
  ▲     ▲
  │     │
  a     b

a is now [10, 999, 30, 40, 50]!

Both slices see the change! This is usually a bug waiting to happen.

If you need independent data, use copy:

b := make([]int, 3)
copy(b, a[1:4])  // b has its own data now

How libgodc Implements copy

When you write copy(dst, src), what actually happens?

Step 1: Figure out how many elements to copy
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   dst has room for 3       src has 5 elements               │
│   ┌───┬───┬───┐            ┌───┬───┬───┬───┬───┐            │
│   │   │   │   │            │ A │ B │ C │ D │ E │            │
│   └───┴───┴───┘            └───┴───┴───┴───┴───┘            │
│                                                             │
│   Copy min(3, 5) = 3 elements                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Step 2: Calculate byte size
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   3 elements × 4 bytes each (int) = 12 bytes                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Step 3: copy the bytes safely (aka memmove in C)
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   src:  ████████████░░░░░░░░  (copy first 12 bytes)         │
│              │                                              │
│              ▼                                              │
│   dst:  ████████████                                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Step 4: Return 3 (number of elements copied)

Why memmove instead of memcpy? Because slices can overlap:

s := []int{1, 2, 3, 4, 5}
copy(s[1:], s[:4])  // Shift elements right — overlapping!

memmove handles this safely. memcpy would corrupt the data.
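Tracing that overlapping copy shows the move coming out right:

```go
package main

func main() {
	s := []int{1, 2, 3, 4, 5}
	copy(s[1:], s[:4]) // overlapping ranges — memmove semantics
	// s is now [1 1 2 3 4]: every element shifted right by one
	for _, v := range s {
		print(v, " ")
	}
	println()
}
```

A naive memcpy walking left to right would write the copied 1 into slot 1 and then copy that 1 again into slot 2, corrupting the rest of the shift.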

Growing Slices: The append Dance

What happens when you append beyond capacity?

s := make([]int, 3, 4)  // len=3, cap=4
s = append(s, 10)       // len=4, cap=4 — fits!
s = append(s, 20)       // len=5, cap=??? — doesn't fit!

libgodc allocates a new, bigger array:

Before:
┌─────┬─────┬─────┬─────┐
│  0  │  0  │  0  │ 10  │  cap=4, FULL
└─────┴─────┴─────┴─────┘

After append(s, 20):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│  0  │  0  │  0  │ 10  │ 20  │     │     │     │  cap=8, NEW ARRAY
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Old array becomes garbage (GC will clean it up).

libgodc’s Growth Strategy

Standard Go doubles capacity for small slices and grows by 25% for large ones. But Dreamcast only has 16MB RAM, so libgodc is more conservative by design:

┌─────────────────────────────────────────────────────────────┐
│   libgodc growth algorithm (runtime_growslice)              │
│                                                             │
│   if capacity < 64:                                         │
│       new_cap = capacity × 2      ← Double (same as std Go) │
│   else:                                                     │
│       new_cap = capacity × 1.125  ← Only 12.5% growth!      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

| Slice size | Standard Go | libgodc |
|---|---|---|
| Small (< 64 elements) | Double | Double |
| Large (≥ 64 elements) | +25% | +12.5% |

Why the difference? On a 16MB system, aggressive doubling wastes precious memory. A 10,000-element slice growing by 25% allocates 2,500 extra slots. At 12.5%, that’s only 1,250, so half the waste.

Pro tip: If you know how big the slice needs to be, pre-allocate!

// Bad: many reallocations
enemies := []Enemy{}
for i := 0; i < 100; i++ {
    enemies = append(enemies, loadEnemy(i))
}

// Good: one allocation
enemies := make([]Enemy, 0, 100)
for i := 0; i < 100; i++ {
    enemies = append(enemies, loadEnemy(i))
}

Part 3: Maps

The Problem: Finding Things Fast

Suppose you’re building an item shop for your game. You have a price list:

type Item struct {
    Name  string
    Price int
}

items := []Item{
    {"Potion", 50},
    {"Sword", 300},
    {"Shield", 250},
    {"Bow", 200},
    // ... 100 more items
}

A customer asks: “How much is the Bow?”

You have to search through every item:

for _, item := range items {
    if item.Name == "Bow" {
        return item.Price
    }
}

If the item list has 100 items, you might check up to 100 items. That’s O(n) time.

Now imagine you have a friend named Maggie who has memorized every item and its price. You ask “How much is the Bow?” and she instantly says “200 gold!”

Maggie gives you the answer in O(1) time — constant time. It doesn’t matter if there are 10 items or 10,000. She just knows.

How do you get a “Maggie”?

You use a hash table. In Go, that’s a map.

Building Your Own Maggie

A hash table combines two things:

  1. A hash function that turns keys into numbers
  2. An array to store the values

Let’s build one step by step. Start with an empty array of 5 slots:

┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│       │       │       │       │       │
└───────┴───────┴───────┴───────┴───────┘

Now we need a hash function. A hash function takes a string and returns a number. Here’s the important part:

  • It must be consistent: “Potion” always returns the same number.
  • It should spread things out: different strings should (usually) give different numbers.

Let’s add the price of a Potion. We feed “Potion” into the hash function:

hash("Potion") → 7392
7392 % 5 = 2  ← slot 2!

We store the price (50) at index 2:

┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│       │       │ 50    │       │       │
│       │       │Potion │       │       │
└───────┴───────┴───────┴───────┴───────┘

Now add the Sword (300 gold):

hash("Sword") → 4281
4281 % 5 = 1  ← slot 1!
┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│       │ 300   │ 50    │       │       │
│       │ Sword │Potion │       │       │
└───────┴───────┴───────┴───────┴───────┘

Add the Shield and Bow:

hash("Shield") % 5 = 0
hash("Bow") % 5 = 4
┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│ 250   │ 300   │ 50    │       │ 200   │
│Shield │ Sword │Potion │       │ Bow   │
└───────┴───────┴───────┴───────┴───────┘

Now when someone asks “How much is the Bow?”:

  1. hash("Bow") % 5 = 4
  2. Look at slot 4
  3. It’s 200 gold!

No searching! The hash function tells you exactly where to look. This is O(1) — constant time.

You just built a “Maggie”!

Collisions: When Two Keys Want the Same Slot

Here’s a problem. What if two items hash to the same slot?

hash("Potion") % 5 = 2
hash("Scroll") % 5 = 2  ← Same slot!

Oh no! Potions are already in slot 2. If we put Scrolls there, we’ll overwrite Potions!

This is called a collision. There are different ways to handle it. Go uses a simple approach: store both items in the same slot using a small list.

┌───────┬───────┬────────────────────┬───────┬───────┐
│   0   │   1   │         2          │   3   │   4   │
├───────┼───────┼────────────────────┼───────┼───────┤
│ 250   │ 300   │ Potion→50          │       │ 200   │
│Shield │ Sword │ Scroll→75          │       │ Bow   │
└───────┴───────┴────────────────────┴───────┴───────┘

Now when you look up “Scroll”:

  1. hash("Scroll") % 5 = 2
  2. Look at slot 2
  3. Check if “Potion” matches — no
  4. Check if “Scroll” matches — yes! Return 75.

It takes a tiny bit longer, but it works.
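The whole idea fits in a toy Go hash table: a fixed 5-slot array with chaining, using an FNV-style string hash. This is an illustration of the diagrams above, not libgodc’s actual implementation:

```go
package main

type entry struct {
	key   string
	value int
}

// toyMap: 5 slots, each holding a small list of entries (chaining).
type toyMap struct {
	slots [5][]entry
}

// fnv32 is a simple, consistent string hash (FNV-1a).
func fnv32(s string) uint32 {
	h := uint32(2166136261)
	for i := 0; i < len(s); i++ {
		h ^= uint32(s[i])
		h *= 16777619
	}
	return h
}

func (m *toyMap) set(key string, value int) {
	i := fnv32(key) % 5
	for j := range m.slots[i] {
		if m.slots[i][j].key == key {
			m.slots[i][j].value = value // key exists: update in place
			return
		}
	}
	m.slots[i] = append(m.slots[i], entry{key, value}) // collision? chain it
}

func (m *toyMap) get(key string) (int, bool) {
	i := fnv32(key) % 5
	for _, e := range m.slots[i] { // walk the (short) chain in this slot
		if e.key == key {
			return e.value, true
		}
	}
	return 0, false
}

func main() {
	var m toyMap
	m.set("Potion", 50)
	m.set("Sword", 300)
	m.set("Bow", 200)
	if price, ok := m.get("Bow"); ok {
		println(price) // 200
	}
}
```

The real runtime adds buckets, tophash bytes, and incremental resizing on top of this, but the lookup path is the same two steps: hash to a slot, then scan a short list.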

The Worst Case: Everyone in One Slot

What if you’re really unlucky and every item hashes to the same slot?

┌───────┬───────┬──────────────────────────┬───────┬───────┐
│   0   │   1   │            2             │   3   │   4   │
├───────┼───────┼──────────────────────────┼───────┼───────┤
│       │       │ Potion→50                │       │       │
│       │       │ Sword→300                │       │       │
│       │       │ Shield→250               │       │       │
│       │       │ Bow→200                  │       │       │
│       │       │ Scroll→75                │       │       │
└───────┴───────┴──────────────────────────┴───────┴───────┘

Now looking up “Scroll” requires checking 5 items. That’s just as slow as a regular list!

This is the worst case: O(n) instead of O(1).

Two things prevent this:

  1. Good hash functions spread keys evenly
  2. Resizing — when the table gets too full, Go makes it bigger

The Tophash Optimization

Each bucket stores a “tophash” — the top 8 bits of the hash — for quick rejection:

Bucket 2:
┌─────────────────────────────────────────────────┐
│ tophash: [a3] [7f] [  ] [  ] [  ] [  ] [  ] [  ]│
│ keys:    [Potion] [Scroll] [  ] [  ] [  ] [  ]  │
│ values:  [  50  ] [  75  ] [  ] [  ] [  ] [  ]  │
└─────────────────────────────────────────────────┘

When looking up “Sword” (tophash = 0xb2):

  1. Check if 0xb2 == 0xa3? No. Skip.
  2. Check if 0xb2 == 0x7f? No. Skip.
  3. Not found!

We didn’t even compare the full strings. The tophash check is super fast.

Performance Comparison

┌─────────────────────────────────────────────────────────────┐
│   Hash Table vs Array: Searching 100 elements               │
│                                                             │
│   Array (linear search):                                    │
│   ┌───────────────────────────────────────────────────────┐ │
│   │ Average: check 50 elements                            │ │
│   │ Worst:   check 100 elements                           │ │
│   │ Time:    O(n)                                         │ │
│   └───────────────────────────────────────────────────────┘ │
│                                                             │
│   Hash Table (map):                                         │
│   ┌───────────────────────────────────────────────────────┐ │
│   │ Average: check 1 element                              │ │
│   │ Worst:   check all elements (very rare!)              │ │
│   │ Time:    O(1) average                                 │ │
│   └───────────────────────────────────────────────────────┘ │
│                                                             │
│   With 1,000,000 elements:                                  │
│   • Array: up to 1,000,000 checks                           │
│   • Map:   still just ~1 check!                             │
└─────────────────────────────────────────────────────────────┘

How libgodc Implements Maps

libgodc’s map implementation is tuned for the Dreamcast’s SH-4 CPU and 16MB memory limit.

The GoMap header (28 bytes):

┌─────────────────────────────────────────────────────────────┐
│   GoMap Structure                                           │
│                                                             │
│   ┌──────────────┬──────────────────────────────────────┐   │
│   │ count        │ Number of entries                    │   │
│   │ flags + B    │ State flags + log2(bucket count)     │   │
│   │ hash0        │ Random seed (different per map!)     │   │
│   │ buckets ─────────▶ Current bucket array             │   │
│   │ oldbuckets ──────▶ Old buckets (during resize)      │   │
│   │ nevacuate    │ Resize progress counter              │   │
│   └──────────────┴──────────────────────────────────────┘   │
│                                                             │
│   Total: 28 bytes (compact for Dreamcast's limited RAM)     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

SH-4 optimized hashing:

The hash function uses wyhash, a fast 32-bit algorithm that takes advantage of SH-4’s dmuls.l instruction (32×32→64 multiply):

┌─────────────────────────────────────────────────────────────┐
│   Hash("Potion", seed=0x12345678)                           │
│                                                             │
│   Step 1: Mix 4 bytes at a time                             │
│           wymix32(h ^ "Poti", 0x9E3779B9)                   │
│                                                             │
│   Step 2: Handle remaining bytes                            │
│           wymix32(h ^ "on\0\0", 0x85EBCA6B)                 │
│                                                             │
│   Step 3: Final mix with length                             │
│           wymix32(h, 6)  →  0x7A3B2C1D                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Dreamcast-specific limits:

| Setting | libgodc | Standard Go |
|---|---|---|
| Max bucket shift | 15 (32K buckets) | ~24 (16M buckets) |
| Hash seed source | Dreamcast timer | OS random |
| Prefetch hint | SH-4 pref @Rn | Platform-specific |

Lazy allocation for small maps:

items := make(map[string]int)  // No buckets yet!
items["key"] = 1               // NOW buckets are allocated

This saves memory when you create maps that might stay empty.

The Nil Map Trap

This is the #1 map bug for Go beginners:

var inventory map[string]int  // nil map

// Reading: works! Returns zero value.
count := inventory["sword"]  // count is 0

// Writing: PANIC!
inventory["sword"] = 1  // "assignment to entry in nil map"

A nil map is like a locked filing cabinet. You can look through the glass (read), but you can’t put anything in (write).

Always initialize:

inventory := make(map[string]int)
// or
inventory := map[string]int{}

Map Iteration is Random

scores := map[string]int{
    "Mario": 100,
    "Luigi": 85,
    "Peach": 95,
}

for name, score := range scores {
    println(name, score)
}

Run this twice. You might get:

Run 1:          Run 2:
Luigi 85        Peach 95
Peach 95        Mario 100
Mario 100       Luigi 85

This is intentional. Go randomizes iteration order to prevent you from depending on it. If you need sorted keys, sort them yourself.


Choosing the Right Tool

┌─────────────────────────────────────────────────────────────┐
│   DECISION TREE: What Data Structure Should I Use?          │
│                                                             │
│   Need to look up by name/key?                              │
│           │                                                 │
│           ├── YES → Use a map (O(1) lookup!)                │
│           │                                                 │
│           └── NO → Is the data ordered/sequential?          │
│                       │                                     │
│                       ├── YES → Use a slice                 │
│                       │                                     │
│                       └── NO → Still probably use a slice   │
│                                (maps have memory overhead)  │
│                                                             │
│   Is it text? → Use a string (immutable)                    │
│   Need to build text? → Use []byte, convert at the end      │
└─────────────────────────────────────────────────────────────┘

Summary Table

| Operation | String | Slice | Map |
|---|---|---|---|
| Get length | O(1) | O(1) | O(1) |
| Access by index | O(1) | O(1) | |
| Access by key | | | O(1) avg |
| Append | N/A | O(1)* | O(1) avg |
| Concatenate | O(n) | O(n) | |

* Amortized — occasional reallocations

Memory Overhead

String header:  8 bytes  (pointer + length)
Slice header:  12 bytes  (pointer + length + capacity)
Map header:    28 bytes  (+ bucket overhead per entry)

Maps have the most overhead. For small, dense integer keys (0 to N), a slice is often better:

// If enemy IDs are 0-999, use a slice!
enemies := make([]*Enemy, 1000)
enemies[42] = &orc  // O(1), less memory than map

Real Benchmark Results

We ran these benchmarks on actual Dreamcast hardware. The numbers don’t lie!

Map vs Slice: The “Maggie” Effect

Looking up an item by ID, searching near the end of the collection:

| Elements | Slice (linear search) | Map lookup | Map is… |
|---|---|---|---|
| 100 | 17 μs | 1.3 μs | 13× faster |
| 500 | 92 μs | 0.9 μs | 97× faster |
| 1,000 | 187 μs | 0.9 μs | 203× faster |
| 2,000 | 443 μs | 1.2 μs | 376× faster |

Notice how slice time grows linearly (O(n)) while map time stays constant (O(1)). With 2,000 enemies, map lookup is 376× faster!

String Concatenation: The Hidden Cost

Building a string character by character:

| Characters | s += "x" in loop | append to []byte | Speedup |
|---|---|---|---|
| 50 | 122 μs | 23 μs | 5× faster |
| 200 | 665 μs | 69 μs | 9× faster |
| 500 | 2,725 μs | 161 μs | 16× faster |
| 1,000 | 8,973 μs | 314 μs | 28× faster |

The loop method is O(n²) — time explodes as strings get longer. For 1,000 characters, pre-allocation is 28× faster!

Slice Pre-allocation: One Allocation vs Many

Appending items to a slice:

Items   Growing []int{}   Pre-alloc make(0, n)   Time saved
50      35 μs             24 μs                  32% faster
100     76 μs             41 μs                  46% faster
200     178 μs            76 μs                  57% faster

Pre-allocation eliminates the repeated reallocations as the slice grows.
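In code, the difference is one argument to make. A sketch (fillGrowing and fillPrealloc are illustrative names):

```go
package main

import "fmt"

// fillGrowing starts from zero capacity; append reallocates and copies
// every time the backing array fills up.
func fillGrowing(n int) []int {
	s := []int{}
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// fillPrealloc reserves all n slots up front; appends never reallocate.
func fillPrealloc(n int) []int {
	s := make([]int, 0, n) // length 0, capacity n
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

func main() {
	a, b := fillGrowing(200), fillPrealloc(200)
	fmt.Println(len(a), len(b), cap(b)) // cap(b) is still exactly 200
}
```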


The right data structure is like having the right superpower. A map turns an O(n) search into O(1). That’s not just faster… it’s magic.

Channels

This chapter explains how libgodc implements Go channels for the Dreamcast. The implementation differs significantly from the standard Go runtime due to our M:1 cooperative scheduling model.


The hchan Structure

Every channel is an hchan structure allocated on the GC heap:

typedef struct hchan {
    uint32_t qcount;      // Items currently in buffer
    uint32_t dataqsiz;    // Buffer capacity (0 = unbuffered)
    void *buf;            // Ring buffer (follows hchan in memory)
    uint16_t elemsize;    // Size of each element
    uint8_t closed;       // Channel closed flag
    uint8_t buf_mask_valid; // Power-of-2 optimization flag
    
    struct __go_type_descriptor *elemtype;
    
    uint32_t sendx;       // Send index into ring buffer
    uint32_t recvx;       // Receive index into ring buffer
    
    waitq recvq;          // Goroutines waiting to receive
    waitq sendq;          // Goroutines waiting to send
    
    uint8_t locked;       // Simple lock (no contention in M:1)
} hchan;

When you write make(chan int, 3), libgodc allocates a single block containing both the hchan header and the buffer:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Memory Layout for make(chan int, 3)                       │
│                                                             │
│   ┌─────────────────────┬─────────────────────────────────┐ │
│   │      hchan (48B)    │     buffer (3 × 4B = 12B)       │ │
│   ├─────────────────────┼───────┬───────┬───────┬─────────┤ │
│   │ qcount, dataqsiz,   │ [0]   │ [1]   │ [2]   │         │ │
│   │ sendx, recvx,       │ int   │ int   │ int   │         │ │
│   │ waitqueues, ...     │       │       │       │         │ │
│   └─────────────────────┴───────┴───────┴───────┴─────────┘ │
│                                                             │
│   Total allocation: sizeof(hchan) + (cap × elemsize)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Ring Buffer Indexing

The buffer is a circular queue. To find where to read/write:

static inline void *chanbuf(hchan *c, uint32_t i) {
    uint32_t index = chan_index(c, i);
    return (void *)((uintptr_t)c->buf + (uintptr_t)index * c->elemsize);
}

For power-of-2 capacities, we use bitwise AND instead of modulo:

static inline uint32_t chan_index(hchan *c, uint32_t i) {
    if (c->buf_mask_valid)
        return i & (c->dataqsiz - 1);  // Fast: i & 3 for cap=4
    return i % c->dataqsiz;            // Slow: division
}

Tip: Use power-of-2 buffer sizes (2, 4, 8, 16…) for faster indexing.
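The fast path relies on the identity i % n == i & (n-1) whenever n is a power of two, which is easy to verify from Go (chanIndex here is a hypothetical Go mirror of the C helper, not a libgodc API):

```go
package main

import "fmt"

// chanIndex mirrors the runtime's fast path: for a power-of-2 capacity,
// a bitwise AND replaces the modulo. Assumes capacity > 0.
func chanIndex(i, capacity uint32) uint32 {
	if capacity&(capacity-1) == 0 { // power of two?
		return i & (capacity - 1) // single AND instruction
	}
	return i % capacity // integer division: slow on SH4
}

func main() {
	for i := uint32(0); i < 20; i++ {
		if chanIndex(i, 8) != i%8 {
			panic("mask identity violated")
		}
	}
	fmt.Println("i & 7 == i % 8 for all i")
}
```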


The Send Algorithm

When you write ch <- value, this is chansend():

┌─────────────────────────────────────────────────────────────┐
│   chansend(c, elem, block)                                  │
│                                                             │
│   1. nil channel?                                           │
│      └── block=true: gopark forever (deadlock)              │
│      └── block=false: return false                          │
│                                                             │
│   2. Channel closed?                                        │
│      └── runtime_throw("send on closed channel")            │
│                                                             │
│   3. Receiver waiting in recvq?                             │
│      └── YES: Copy data DIRECTLY to receiver's elem         │
│               Wake receiver with goready()                  │
│               Return true                                   │
│                                                             │
│   4. Buffer has space? (qcount < dataqsiz)                  │
│      └── YES: Copy to buf[sendx], increment sendx           │
│               Return true                                   │
│                                                             │
│   5. Non-blocking? (block=false)                            │
│      └── Return false                                       │
│                                                             │
│   6. Must block:                                            │
│      └── Create sudog, enqueue in sendq                     │
│          gopark() - yield to scheduler                      │
│          When woken: return success flag                    │
└─────────────────────────────────────────────────────────────┘

The key insight: direct transfer. If a receiver is already waiting, we copy data straight to its memory location, bypassing the buffer entirely. This is why unbuffered channels need no buffer at all: every transfer is a direct handoff.


The Receive Algorithm

When you write value := <-ch, this is chanrecv():

┌─────────────────────────────────────────────────────────────┐
│   chanrecv(c, elem, block)                                  │
│                                                             │
│   1. nil channel?                                           │
│      └── block=true: gopark forever                         │
│      └── block=false: return false                          │
│                                                             │
│   2. Closed AND empty?                                      │
│      └── Zero out elem, return (true, received=false)       │
│                                                             │
│   3. Sender waiting in sendq?                               │
│      └── Unbuffered: Copy directly from sender's elem       │
│      └── Buffered: Take from buffer, move sender's data in  │
│          Wake sender with goready()                         │
│          Return (true, received=true)                       │
│                                                             │
│   4. Buffer has data? (qcount > 0)                          │
│      └── Copy from buf[recvx], zero slot, decrement qcount  │
│          Return (true, received=true)                       │
│                                                             │
│   5. Non-blocking?                                          │
│      └── Return false                                       │
│                                                             │
│   6. Must block:                                            │
│      └── Create sudog, enqueue in recvq                     │
│          gopark()                                           │
│          When woken: return success                         │
└─────────────────────────────────────────────────────────────┘

The Buffered Receive with Waiting Sender

This case is subtle. When the buffer is full and a sender is waiting:

if (c->dataqsiz > 0) {  // Buffered channel
    // 1. Take oldest item from buffer for receiver
    src = chanbuf(c, c->recvx);
    chan_copy(c, elem, src);
    
    // 2. Put sender's NEW item into the freed slot
    chan_copy(c, src, sg->elem);
    
    // 3. Advance indices (sendx follows recvx)
    c->recvx = chan_index(c, c->recvx + 1);
    c->sendx = c->recvx;
}

This maintains FIFO order: the receiver gets the oldest buffered value, not the sender’s new value.
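You can observe this FIFO guarantee from plain Go code. The sketch below (fifoOrder is an illustrative helper) fills a two-slot buffer, blocks a third sender, and then receives everything:

```go
package main

import "fmt"

// fifoOrder fills a 2-slot buffer, blocks a third sender, then receives,
// recording the order in which values arrive.
func fifoOrder() [3]int {
	ch := make(chan int, 2)
	ch <- 1 // oldest buffered value
	ch <- 2 // buffer now full

	done := make(chan struct{})
	go func() {
		ch <- 3 // blocks until the receiver frees a slot
		close(done)
	}()

	var got [3]int
	got[0] = <-ch // 1: the oldest value, not the blocked sender's 3
	<-done        // sender's 3 has now moved into the freed slot
	got[1] = <-ch // 2
	got[2] = <-ch // 3
	return got
}

func main() {
	fmt.Println(fifoOrder()) // [1 2 3]: FIFO preserved
}
```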


Wait Queues and Sudogs

When a goroutine blocks on a channel, it creates a sudog (sender/receiver descriptor):

typedef struct sudog {
    G *g;                // The blocked goroutine
    struct sudog *next;  // Next in wait queue
    struct sudog *prev;  // Previous in wait queue
    void *elem;          // Pointer to data being sent/received
    uint64_t ticket;     // Used by select for case index
    bool isSelect;       // Part of a select statement?
    bool success;        // Did operation succeed?
    struct sudog *waitlink;   // For select: links all sudogs
    struct sudog *releasetime; // Unused (Go runtime compat)
    struct hchan *c;     // Channel we're waiting on
} sudog;

The Sudog Pool

Creating sudogs during gameplay would trigger malloc(). libgodc pre-allocates a pool at startup:

void sudog_pool_init(void) {
    for (int i = 0; i < 16; i++) {
        sudog *s = (sudog *)malloc(sizeof(sudog));
        s->next = global_pool;
        global_pool = s;
    }
}

acquireSudog() pulls from the pool; releaseSudog() returns to it. If the pool is exhausted, we fall back to malloc().

Wait Queues

Each channel has two wait queues (doubly-linked lists):

typedef struct waitq {
    struct sudog *first;
    struct sudog *last;
} waitq;

Operations:

  • waitq_enqueue() - add blocked goroutine to end
  • waitq_dequeue() - remove and return first goroutine
  • waitq_remove() - remove specific sudog (for select cancellation)

Blocking and Waking: gopark/goready

This is where libgodc’s M:1 model shines.

gopark() - Block Current Goroutine

void gopark(bool (*unlockf)(void *), void *lock, WaitReason reason) {
    G *gp = getg();
    if (!gp || gp == g0)
        runtime_throw("gopark on g0 or nil");

    gp->atomicstatus = Gwaiting;
    gp->waitreason = reason;

    // Call unlock function - if it returns false, abort parking
    if (unlockf && !unlockf(lock)) {
        gp->atomicstatus = Grunnable;
        runq_put(gp);
        return;
    }

    // Context switch to scheduler
    __go_swapcontext(&gp->context, &sched_context);
}

The goroutine saves its context and swaps to the scheduler. The unlockf callback releases the channel lock atomically with parking - if it returns false, we abort and re-enqueue instead.

goready() - Wake a Goroutine

void goready(G *gp) {
    if (!gp) return;

    // Don't wake dead/already-runnable/running goroutines
    Gstatus status = gp->atomicstatus;
    if (status == Gdead || status == Grunnable || status == Grunning)
        return;

    gp->atomicstatus = Grunnable;
    gp->waitreason = waitReasonZero;
    runq_put(gp);
}

The woken goroutine becomes runnable and will be scheduled on the next schedule() call.

Why M:1 Simplifies Things

In standard Go, channels need atomic operations and memory barriers because multiple OS threads access them. libgodc runs all goroutines on one KOS thread:

  • No atomics needed for locked flag (simple bool)
  • No memory barriers
  • No contention on wait queues
  • Context switches are explicit (cooperative)

The chan_lock()/chan_unlock() functions just set a flag:

void chan_lock(hchan *c) {
    if (!c)
        runtime_throw("chan: nil channel");
    if (c->locked)
        runtime_throw("chan: recursive lock");
    c->locked = 1;
}

void chan_unlock(hchan *c) {
    if (c) c->locked = 0;
}

This is safe because we never preempt a goroutine in the middle of a channel operation.


Select Implementation

Select is the most complex part. Here’s how selectgo() works:

Phase 1: Setup

SelectGoResult selectgo(scase *cas0, uint16_t *order0, 
                        int nsends, int nrecvs, bool block) {
    int ncases = nsends + nrecvs;
    
    // order0 provides space for two arrays:
    uint16_t *pollorder = order0;           // Random order to check cases
    uint16_t *lockorder = order0 + ncases;  // Order to lock channels

Phase 2: Randomize Poll Order (Fairness)

// Fisher-Yates shuffle
for (int i = ncases - 1; i > 0; i--) {
    int j = fastrand() % (i + 1);
    uint16_t tmp = pollorder[i];
    pollorder[i] = pollorder[j];
    pollorder[j] = tmp;
}

Why random? If we always checked cases in order, the first case would always win when multiple are ready. Randomization ensures fairness.
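That fairness is visible from Go: with both cases ready on every iteration, neither starves (selectCounts is an illustrative helper; exact counts vary run to run):

```go
package main

import "fmt"

// selectCounts runs n selects in which both cases are ready, and counts
// how often each one wins.
func selectCounts(n int) (aWins, bWins int) {
	a := make(chan int, 1)
	b := make(chan int, 1)
	for i := 0; i < n; i++ {
		a <- 1 // make both cases ready
		b <- 1
		select { // both ready: the runtime picks pseudo-randomly
		case <-a:
			aWins++
			<-b // drain the loser so the next iteration starts clean
		case <-b:
			bWins++
			<-a
		}
	}
	return
}

func main() {
	a, b := selectCounts(1000)
	fmt.Println(a, b) // both nonzero on virtually every run: no starvation
}
```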

Phase 3: Lock Channels (Deadlock Prevention)

// Sort by channel address using heap sort
heapsort_lockorder(cas0, lockorder, ncases);

// Lock in address order
sellock(cas0, lockorder, ncases);

If goroutine A does select { case <-ch1: case <-ch2: } and goroutine B does select { case <-ch2: case <-ch1: }, they could deadlock if they lock in different orders. Sorting by address ensures everyone locks in the same global order.

Phase 4: Check for Ready Cases

for (int i = 0; i < ncases; i++) {
    int casi = pollorder[i];  // Check in random order
    scase *cas = &cas0[casi];
    hchan *c = cas->c;
    
    if (c == NULL)
        continue;
    
    if (casi < nsends) {
        // Send: closed channel will panic - select it
        if (c->closed) {
            selected = casi;
            break;
        }
        // Check for waiting receiver or buffer space
        if (!waitq_empty(&c->recvq) || c->qcount < c->dataqsiz) {
            selected = casi;
            break;
        }
    } else {
        // Receive: check for waiting sender, buffer data, or closed
        if (!waitq_empty(&c->sendq) || c->qcount > 0 || c->closed) {
            selected = casi;
            break;
        }
    }
}

If any case is ready, execute it immediately and return.

Phase 5: Block on All Channels

If nothing is ready and block=true, we enqueue on ALL channels:

sudog *sglist = NULL;

for (int i = 0; i < ncases; i++) {
    int casi = pollorder[i];
    scase *cas = &cas0[casi];
    hchan *c = cas->c;
    
    if (c == NULL)
        continue;
    
    sudog *sg = acquireSudog();
    sg->g = gp;
    sg->c = c;
    sg->elem = cas->elem;
    sg->isSelect = true;
    sg->success = false;
    sg->ticket = casi;  // Remember which case this is
    
    // Link for later cleanup
    sg->waitlink = sglist;
    sglist = sg;
    
    if (casi < nsends)
        waitq_enqueue(&c->sendq, sg);
    else
        waitq_enqueue(&c->recvq, sg);
}

gp->waiting = sglist;
gopark(selparkcommit, &unlock_arg, waitReasonSelect);

Phase 6: Woken - Find Winner

When woken, one sudog has success=true. Find it and dequeue from all other channels:

// Find the winner; dequeue the losers from their wait queues
for (sudog *sg = sglist; sg != NULL; sg = sgnext) {
    sgnext = sg->waitlink;  // Save before we might release
    int casi = (int)sg->ticket;
    
    if (sg->success) {
        selected = casi;
        if (casi >= nsends)
            recvOK = true;  // Received actual data
    } else {
        // Remove from wait queue (we won't use this case)
        if (casi < nsends)
            waitq_remove(&sg->c->sendq, sg);
        else
            waitq_remove(&sg->c->recvq, sg);
    }
}

// Release all sudogs in separate pass
for (sudog *sg = sglist; sg != NULL; sg = sgnext) {
    sgnext = sg->waitlink;
    releaseSudog(sg);
}

The Default Case

When block=false and nothing is ready, selectgo() returns selected=-1:

if (!block) {
    selunlock(cas0, lockorder, ncases);
    go_yield();  // Give other goroutines a chance
    return (SelectGoResult){-1, false};
}

The go_yield() prevents tight polling loops from starving other goroutines.
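From the Go side, this is the non-blocking pattern the default case enables (poll is an illustrative helper, typical of a game loop checking for input):

```go
package main

import "fmt"

// poll does a non-blocking receive: the default case means selectgo is
// entered with block=false and returns immediately if nothing is ready.
func poll(input chan string) (string, bool) {
	select {
	case cmd := <-input:
		return cmd, true
	default:
		return "", false // nothing pending; don't park the goroutine
	}
}

func main() {
	input := make(chan string, 4)
	input <- "jump"

	if cmd, ok := poll(input); ok {
		fmt.Println("got:", cmd) // got: jump
	}
	if _, ok := poll(input); !ok {
		fmt.Println("no input this frame") // channel empty: default fires
	}
}
```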


Closing Channels

closechan() marks the channel closed and wakes ALL waiting goroutines:

void closechan(hchan *c) {
    G *wake_list = NULL;
    G *wake_tail = NULL;  // tail of wake_list (appended via schedlink)
    sudog *sg;
    G *gp;
    
    chan_lock(c);
    
    if (c->closed) {
        chan_unlock(c);
        runtime_throw("close of closed channel");
    }
    
    c->closed = 1;
    
    // Collect all receivers (they'll get zero values)
    while ((sg = waitq_dequeue(&c->recvq)) != NULL) {
        sg->success = false;  // Indicates closed, not real data
        gp = sg->g;
        if (!gp || gp->atomicstatus == Gdead)
            continue;
        if (sg->elem && c->elemsize > 0)
            memset(sg->elem, 0, c->elemsize);
        // Add gp to wake_list via schedlink...
    }
    
    // Collect all senders (they'll panic when they wake)
    while ((sg = waitq_dequeue(&c->sendq)) != NULL) {
        sg->success = false;
        gp = sg->g;
        if (!gp || gp->atomicstatus == Gdead)
            continue;
        // Add gp to wake_list via schedlink...
    }
    
    chan_unlock(c);
    
    // Wake everyone outside the lock
    while (wake_list) {
        gp = wake_list;
        wake_list = gp->schedlink;
        goready(gp);
    }
}

Senders check success when they wake and throw “send on closed channel” if false.
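The user-visible contract is checkable from plain Go (demoClose is an illustrative helper, not a libgodc API):

```go
package main

import "fmt"

// demoClose exercises each close-related path and reports what every
// operation observed.
func demoClose() (v1 int, ok1 bool, v2 int, ok2 bool, sendPanicked bool) {
	ch := make(chan int, 2)
	ch <- 7
	close(ch)

	v1, ok1 = <-ch // buffered data survives close
	v2, ok2 = <-ch // closed and empty: zero value, ok=false

	func() {
		defer func() { sendPanicked = recover() != nil }()
		ch <- 1 // panics: "send on closed channel"
	}()
	return
}

func main() {
	v1, ok1, v2, ok2, p := demoClose()
	fmt.Println(v1, ok1)             // 7 true
	fmt.Println(v2, ok2)             // 0 false
	fmt.Println("send panicked:", p) // send panicked: true
}
```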


Performance

For benchmark numbers, see the Performance section in Design. You can run the benchmarks yourself with tests/bench_architecture.elf on hardware.

Why Unbuffered is Slower

Unbuffered channels always require a context switch:

Sender                          Receiver
──────                          ────────
ch <- 42                        
  │                             
  └── gopark() ─────────────────► scheduler picks receiver
                                       │
                                  x := <-ch
                                       │
  ◄── goready() ────────────────── wakes sender
  │
continues

Buffered channels avoid this when buffer has space/data.

Optimization Tips

  1. Use buffered channels for producer/consumer patterns
  2. Power-of-2 buffer sizes for faster indexing (uses bitwise AND instead of modulo)
  3. Batch data - send structs with multiple values instead of multiple sends
  4. select with default for non-blocking checks in game loops
  5. Pre-warm channels - send/receive once during init to allocate sudogs
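Tip 3 in practice: one struct send replaces several individual sends, each of which could park and wake a goroutine (SpawnEvent and its fields are illustrative, not part of libgodc):

```go
package main

import "fmt"

// SpawnEvent batches everything the consumer needs into one message.
type SpawnEvent struct {
	X, Y int
	Kind string
}

func main() {
	// Three separate sends could mean three park/wake round trips;
	// one struct send is a single channel operation.
	events := make(chan SpawnEvent, 8) // power-of-2 capacity: fast indexing
	events <- SpawnEvent{X: 320, Y: 240, Kind: "orc"}

	e := <-events
	fmt.Println(e.Kind, "spawned at", e.X, e.Y)
}
```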

Limitations

libgodc channels have some constraints:

Limit              Value                       Reason
Max buffer size    65,536 elements             Sanity check in makechan()
Max element size   65,536 bytes                16-bit elemsize field in hchan
Sudog pool         16 pre-allocated, 128 max   Defined in godc_config.h

For game code, these limits are rarely hit. If you need larger queues, consider using slices with your own synchronization.

System Integration

The Layer Cake

Imagine your game as an office building. You’re on the top floor, writing Go code. But when you need something done — read a file, play a sound, draw a sprite — you don’t do it yourself. There’s no “cloud” on a 1998 console; someone on a lower floor does the actual work.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Floor 4:  Your Go Program                                 │
│             "I want to play a sound!"                       │
│                    ↓                                        │
│   Floor 3:  libgodc (Go runtime)                            │
│             "Let me translate that..."                      │
│                    ↓                                        │
│   Floor 2:  KallistiOS                                      │
│             "I know how to talk to hardware."               │
│                    ↓                                        │
│   Floor 1:  Dreamcast Hardware                              │
│             *beep boop*                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Each floor speaks a different language. libgodc translates Go into something KallistiOS understands. KallistiOS translates that into hardware register writes.

You don’t need to know all the details, but understanding the stack helps you debug problems.


Part 1: Timers and Sleep

How Does Sleep Work?

When you write:

time.Sleep(100 * time.Millisecond)

What actually happens? Let’s trace it:

┌─────────────────────────────────────────────────────────────┐
│   WHAT HAPPENS WHEN YOU SLEEP                               │
│                                                             │
│   Step 1: "I want to sleep for 100ms"                       │
│           ↓                                                 │
│   Step 2: Calculate wake time: now + 100ms = 4:00:00.100    │
│           ↓                                                 │
│   Step 3: Add timer to the timer heap                       │
│           ┌─────────────────────────────┐                   │
│           │ wake_time: 4:00:00.100      │                   │
│           │ goroutine: G7               │                   │
│           └─────────────────────────────┘                   │
│           ↓                                                 │
│   Step 4: Park the goroutine (it's now sleeping)            │
│           ↓                                                 │
│   Step 5: Scheduler runs OTHER goroutines                   │
│           ...100ms pass...                                  │
│           ↓                                                 │
│   Step 6: Scheduler checks timer heap                       │
│           "Hey, it's 4:00:00.100! Wake G7!"                 │
│           ↓                                                 │
│   Step 7: G7 wakes up, continues executing                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key insight: Your goroutine isn’t actually sleeping on a couch somewhere. It’s parked in a queue, and the scheduler knows when to wake it.

Where Does Time Come From?

The SH-4 CPU has hardware timers. KallistiOS reads them:

//extern timer_us_gettime64
func TimerUsGettime64() uint64

This returns microseconds since boot. Accurate to about 1 μs. Fast to read.

In your Go code, you can use this for precise timing:

//extern timer_us_gettime64
func timerUsGettime64() uint64

func measureSomething() {
    start := timerUsGettime64()
    doExpensiveWork()
    elapsed := timerUsGettime64() - start
    println("Took", elapsed, "microseconds")
}

The Timer Heap

Multiple goroutines can sleep at once. Go keeps them in a heap (priority queue) sorted by wake time:

Timer Heap:
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   [G3: wake at 100ms]    ← Earliest, checked first        │
│           /\                                              │
│          /  \                                             │
│ [G7: 200ms]  [G2: 150ms]                                  │
│       /                                                   │
│  [G5: 500ms]                                              │
│                                                           │
└───────────────────────────────────────────────────────────┘

The scheduler only needs to check the top of the heap. If the earliest timer hasn’t fired, none of them have.


Part 2: File I/O (The Danger Zone)

The Problem

You want to load a texture:

data := loadFile("/cd/textures/enemy.pvr")

Seems innocent, right? Here’s what actually happens:

┌─────────────────────────────────────────────────────────────┐
│   GD-ROM READ: THE SILENT KILLER                            │
│                                                             │
│   Time: 0ms    → loadFile() called                          │
│   Time: 0ms    → KOS asks GD-ROM to seek                    │
│   Time: 50ms   → Drive head moves (mechanical!)             │
│   Time: 100ms  → Data starts streaming                      │
│   Time: 150ms  → Still reading...                           │
│   Time: 200ms  → loadFile() returns                         │
│                                                             │
│   DURING THOSE 200ms:                                       │
│   • No other goroutines run                                 │
│   • Game loop frozen                                        │
│   • Audio buffer might run dry → glitch!                    │
│   • Player sees: lag, stutter, freeze                       │
│                                                             │
│   At 60 FPS, you have 16.6ms per frame.                     │
│   A 200ms file read = 12 FROZEN FRAMES!                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Why does this happen? KOS file operations are synchronous. The CPU sits in a loop waiting for the CD drive. No scheduler runs. Nothing else happens.

The Solutions

Solution 1: Loading Screens

Load everything at startup or level transitions:

func main() {
    showLoadingScreen()
    
    // All the slow stuff happens here
    textures = loadAllTextures()
    sounds = loadAllSounds()
    levelData = loadLevel(1)
    
    hideLoadingScreen()
    
    // Now game loop is safe
    for {
        gameLoop()
    }
}

Solution 2: Streaming in Chunks

If you must load during gameplay, do it in small pieces:

func streamTexture(path string) {
    file := openFile(path)
    defer closeFile(file)
    
    for !file.EOF() {
        chunk := file.Read(4096)  // Read 4KB
        processChunk(chunk)
        runtime.Gosched()  // Let other goroutines run!
    }
}

Solution 3: Pre-load into RAM

The Dreamcast has 16 MB of RAM. Use it!

// At startup, load everything you might need
var textureCache = make(map[string][]byte)

func preloadTexture(name string) {
    textureCache[name] = loadFile("/cd/textures/" + name)
}

// During gameplay, instant access
func getTexture(name string) []byte {
    return textureCache[name]  // Already in RAM!
}

Part 3: Calling C Functions

The //extern Magic

Go code can call C functions directly:

//extern pvr_wait_ready
func PvrWaitReady() int32

//extern maple_enum_dev
func mapleEnumDev(port, unit int32) uintptr

func main() {
    PvrWaitReady()  // Calls the C function!
}

No CGo. No runtime overhead. Just a direct function call.

The Danger

Here’s the catch: C functions run on your goroutine’s stack. Goroutines have fixed stacks (64 KB by default). If the C function is stack-hungry:

┌─────────────────────────────────────────────────────────────┐
│   STACK OVERFLOW SCENARIO                                   │
│                                                             │
│   Goroutine stack: 64 KB                                    │
│                                                             │
│   ┌────────────────────┐ ← Stack top                        │
│   │ Your Go function   │ 1 KB used                          │
│   ├────────────────────┤                                    │
│   │ C function called  │                                    │
│   │   local arrays...  │ 6 KB used                          │
│   │   more locals...   │                                    │
│   ├────────────────────┤                                    │
│   │ C calls another C  │                                    │
│   │   BOOM!            │ OVERFLOW!                          │
│   └────────────────────┘ ← Stack bottom (guard page)        │
│                                                             │
│   Result: Memory corruption, crash, mysterious bugs         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 4: Debugging Without Fancy Tools

The Detective’s Toolkit

Tool 1: Print Statements

The oldest debugging technique is still the best:

func suspiciousFunction(x int) {
    println(">>> suspiciousFunction start, x =", x)
    
    result := doSomething(x)
    println("    after doSomething, result =", result)
    
    processResult(result)
    println("<<< suspiciousFunction end")
}

Tool 2: Binary Search Debugging

Program crashes somewhere. Where?

1. Add print at function start and end
2. If it prints START but not END, crash is inside
3. Add print in the middle
4. Repeat until you find the exact line

Tool 3: The Assumptions Checklist

When something “can’t possibly be wrong,” check it:

func processEnemy(e *Enemy) {
    // CHECK YOUR ASSUMPTIONS
    if e == nil {
        println("BUG: e is nil!")
        return
    }
    if e.Health < 0 {
        println("BUG: negative health:", e.Health)
    }
    if e.X < 0 || e.X > 640 {
        println("BUG: X out of bounds:", e.X)
    }
    
    // Now do the actual work
    // ...
}

Reading Crash Information

When your game crashes, you might see:

panic: index out of range [99] with length 3

Registers:
  PC=8c015678  PR=8c015432

Stack trace:
  0x8c015678
  0x8c015432
  0x8c014000

What does this mean?

  • PC (Program Counter) — Where the crash happened
  • PR (Procedure Register) — Who called us (return address)
  • Stack trace — Chain of function calls

Finding the Function Name

You have an address: 0x8c015678. Where is it?

Method 1: addr2line

sh-elf-addr2line -e game.elf 0x8c015678
# Output: /path/to/main.go:42

This tells you the exact line number!

Method 2: Symbol Table

sh-elf-nm game.elf | sort > symbols.txt
# Then search for addresses near 0x8c015678

Method 3: With Function Names

sh-elf-addr2line -f -C -i -e game.elf 0x8c015678
# Output: functionName
#         main.go:42

Common Bugs and Fixes

Symptom                      Likely Cause                  Fix
Hangs, no output             Infinite loop without yield   Add runtime.Gosched() in loops
Garbage on screen            Memory corruption             Check array bounds
Random crashes               Stack overflow                Check deep recursion, big C calls
GC panic                     Too much live data            Reduce heap usage, trigger GC earlier
Works in emu, fails on hw    Timing differences            Test on real hardware earlier!

Troubleshooting Flowchart

Use this decision tree when things go wrong:

┌──────────────────────────────────────────────────────────────┐
│   TROUBLESHOOTING FLOWCHART                                  │
│                                                              │
│   What's happening?                                          │
│         │                                                    │
│         ├─► CRASH (program terminates)                       │
│         │         │                                          │
│         │         ├─► Panic message visible?                 │
│         │         │         │                                │
│         │         │         ├─► YES: Read the message!       │
│         │         │         │   • "index out of range"       │
│         │         │         │     → Check slice bounds       │
│         │         │         │   • "nil pointer"              │
│         │         │         │     → Check for nil before use │
│         │         │         │   • "out of memory"            │
│         │         │         │     → Reduce allocations       │
│         │         │         │                                │
│         │         │         └─► NO: Stack overflow likely    │
│         │         │             → Reduce local variables     │
│         │         │             → Convert recursion to loop  │
│         │         │                                          │
│         ├─► FREEZE (no crash, no progress)                   │
│         │         │                                          │
│         │         ├─► Any goroutines running?                │
│         │         │         │                                │
│         │         │         ├─► Only one: Infinite loop      │
│         │         │         │   → Add runtime.Gosched()      │
│         │         │         │                                │
│         │         │         └─► Multiple: Deadlock           │
│         │         │             → Check channel usage        │
│         │         │             → Ensure sends have receivers│
│         │         │                                          │
│         ├─► STUTTER (periodic lag)                           │
│         │         │                                          │
│         │         └─► GC pauses likely                       │
│         │             → Reduce live heap size                │
│         │             → Trigger GC during loading            │
│         │             → Use object pools                     │
│         │                                                    │
│         └─► WRONG OUTPUT (runs but incorrect)                │
│                   │                                          │
│                   └─► Add println() everywhere               │
│                       → Check variable values                │
│                       → Verify assumptions                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

The 5-Step Debug Process

┌─────────────────────────────────────────────────────────────┐
│   THE DEBUGGING ALGORITHM                                   │
│                                                             │
│   1. REPRODUCE                                              │
│      Can you make it happen consistently?                   │
│      If not, add logging until you can.                     │
│                                                             │
│   2. NARROW DOWN                                            │
│      Binary search with prints.                             │
│      "Does it crash before this line or after?"             │
│                                                             │
│   3. CHECK ASSUMPTIONS                                      │
│      Print everything. That variable you're SURE is         │
│      correct? Print it anyway.                              │
│                                                             │
│   4. SIMPLIFY                                               │
│      Create the smallest program that shows the bug.        │
│      Often, you'll find the bug while simplifying.          │
│                                                             │
│   5. TAKE A BREAK                                           │
│      Seriously. Walk away. Fresh eyes find bugs faster      │
│      than tired eyes.                                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 5: Testing on a Game Console

The Test Structure

Our tests are simple: standalone executables that print PASS or FAIL.

tests/
├── test_types.go      → test_types.elf      (maps, interfaces, structs)
├── test_goroutines.go → test_goroutines.elf (goroutines, channels)
├── test_memory.go     → test_memory.elf     (allocation, GC)
└── test_control.go    → test_control.elf    (defer, panic, recover)

No fancy test framework. No JUnit. Just:

  1. Do something
  2. Check if it worked
  3. Print the result

A Minimal Test

package main

func TestMaps() {
    println("maps:")
    passed := 0
    total := 0

    total++
    m := make(map[string]int)
    m["score"] = 100
    if m["score"] == 100 {
        passed++
        println("  PASS: read after write")
    } else {
        println("  FAIL: read after write")
    }

    total++
    if m["missing"] == 0 {
        passed++
        println("  PASS: missing key returns zero")
    } else {
        println("  FAIL: missing key returns zero")
    }

    total++
    delete(m, "score")
    _, ok := m["score"]
    if !ok {
        passed++
        println("  PASS: delete removes key")
    } else {
        println("  FAIL: delete removes key")
    }

    println("  ", passed, "/", total)
}

func main() {
    TestMaps()
}

Running Tests

# Build the test
make test_types

# Run on Dreamcast
dc-tool-ip -t 192.168.2.205 -x test_types.elf

# Output:
# maps:
#   PASS: read after write
#   PASS: missing key returns zero
#   PASS: delete removes key
#   3 / 3

Emulator vs Hardware

| Aspect    | Emulator            | Real Hardware  |
|-----------|---------------------|----------------|
| Speed     | Fast iteration      | Slower uploads |
| Debugging | Can use host tools  | printf only    |
| Accuracy  | Close but not exact | The truth      |
| Timing    | May differ          | Definitive     |

The Strategy:

┌─────────────────────────────────────────────────────────────┐
│   DEVELOPMENT WORKFLOW                                      │
│                                                             │
│   80% of time: Emulator                                     │
│   ├── Fast compile-run cycle                                │
│   ├── Quick iteration                                       │
│   └── Good for logic bugs                                   │
│                                                             │
│   20% of time: Real Hardware                                │
│   ├── Catches timing issues                                 │
│   ├── Finds memory/stack problems                           │
│   └── Final validation before release                       │
│                                                             │
│   RULE: Never release without testing on real hardware!     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Dreamcast is a 25-year-old console with 16 MB of RAM, no debugger, and a CD-ROM that takes 200ms to seek. And yet, people made incredible games for it. You can too. You just need patience, println, and the knowledge in this chapter.

Performance

Part 1: The Cache — Your Best Friend

The Numbers That Matter

┌─────────────────────────────────────────────────────────────┐
│   SH-4 MEMORY HIERARCHY                                     │
│                                                             │
│   Registers:     0 cycles (instant)                         │
│   L1 Cache:      1-2 cycles (~10 ns)                        │
│   Main RAM:      10-20 cycles (~100 ns)                     │
│   CD-ROM:        millions of cycles (200+ ms)               │
│                                                             │
│   Cache miss = 10-20× SLOWER than cache hit!                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Cache Lines: The Free Lunch

When you read one byte from RAM, the CPU doesn’t fetch just that byte. It fetches a whole cache line — 32 bytes on SH-4.

You ask for array[0]:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │  ← All 32 bytes loaded!
└────┴────┴────┴────┴────┴────┴────┴────┘
  ▲
  You wanted this one

Next 7 accesses are FREE! They're already in cache.

Sequential Access: The Fast Path

// FAST: Sequential access — 125 elements
sum := 0
for i := 0; i < 125; i++ {
    sum += array[i]
}

What happens:

Access array[0] → Cache miss, load 32 bytes
Access array[1] → Cache HIT (free!)
Access array[2] → Cache HIT (free!)
...
Access array[7] → Cache HIT (free!)
Access array[8] → Cache miss, load next 32 bytes
...

Total cache misses: 125 / 8 = ~16

Strided Access: The Slow Path

// SLOW: Strided access (every 8th element) — also 125 elements
sum := 0
for i := 0; i < 1000; i += 8 {
    sum += array[i]
}

What happens:

Access array[0]   → Cache miss
Access array[8]   → Cache miss (different cache line!)
Access array[16]  → Cache miss
Access array[24]  → Cache miss
...
Access array[992] → Cache miss

Total cache misses: 125 (EVERY access misses!)

Same number of additions (125), but strided is ~8× slower because every access misses the cache.

The Practical Lesson

┌─────────────────────────────────────────────────────────────┐
│   CACHE-FRIENDLY PATTERNS                                   │
│                                                             │
│   ✓ Process arrays left-to-right                            │
│   ✓ Keep related data together (struct of arrays)           │
│   ✓ Avoid pointer-chasing (linked lists are slow!)          │
│   ✓ Small, tight loops                                      │
│                                                             │
│   ✗ Random access patterns                                  │
│   ✗ Large structs with rarely-used fields                   │
│   ✗ Jumping around memory                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 2: The Float64 Trap

The Shocking Truth

Go defaults to float64 for floating-point numbers:

x := 3.14  // This is float64!

On a modern PC, float64 and float32 are about the same speed. On SH-4?

┌─────────────────────────────────────────────────────────────┐
│   FLOAT PERFORMANCE ON SH-4                                 │
│                                                             │
│   float32:  Hardware accelerated, FAST                      │
│             One instruction, one cycle                      │
│                                                             │
│   float64:  Software emulation, SLOW                        │
│             Multiple instructions, 10-20× slower!           │
│                                                             │
│   A physics simulation using float64 could run              │
│   at 6 FPS instead of 60 FPS. That's the difference.        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Fix

Be explicit about float32:

// SLOW
x := 3.14           // float64 by default!
y := x * 2.0        // float64 math

// FAST
var x float32 = 3.14  // Explicit float32
y := x * 2.0          // float32 math

For game physics, positions, velocities — always use float32.


Part 3: What We Deliberately Left Out

“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” — Antoine de Saint-Exupéry

libgodc is not a complete Go implementation. That’s intentional. Here’s what we cut and why:

Omission 1: Full Reflection

Standard Go: Every type carries metadata — field names, method signatures, struct tags. This enables reflect and fancy JSON marshaling.

Cost: Binary size can double.

libgodc: Basic reflection only. Enough for println to work.

What you lose:

reflect.MakeFunc(...)     // NOT SUPPORTED
json.Marshal(myStruct)    // NOT SUPPORTED (would need full reflection)

What you do instead: Write explicit serialization. Use code generators.

Omission 2: Finalizers

Standard Go:

runtime.SetFinalizer(obj, func(o *MyType) {
    o.cleanup()  // Runs when GC collects obj
})

The problem: Finalizers are a nightmare for GC:

  • Objects can be resurrected
  • Run order is undefined
  • Timing is unpredictable
  • They complicate the GC significantly

libgodc: No finalizers.

What you do instead: Use defer for cleanup:

func process() {
    resource := acquire()
    defer resource.Release()  // Always runs!
    // ... use resource ...
}

Omission 3: Preemptive Scheduling

Standard Go: The runtime can interrupt a goroutine at almost any point.

libgodc: Goroutines must yield voluntarily.

// THIS FREEZES THE SYSTEM
for {
    // Infinite loop, never yields
    // No other goroutine will EVER run
}

// THIS IS FINE
for {
    doWork()
    runtime.Gosched()  // "Let others run"
}

Why we did this: Preemption requires safe points, stack inspection, and signal handling. That's a lot of complexity for little benefit on a single CPU.

Omission 4: Concurrent GC

Standard Go:

Your code:    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
GC:                ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
              Both run in parallel!
              Pause: < 1ms

libgodc:

Your code:    ░░░░░░░░░░████████████░░░░░░░░
GC:                     ▓▓▓▓▓▓▓▓▓▓▓▓
              EVERYTHING STOPS during GC
              Pause: 5-20ms

Why we did this: Concurrent GC requires write barriers, atomic operations, and careful synchronization. Stop-the-world is simpler and predictable.

What you do: Keep live data small. Trigger GC between frames or during loading.

The Trade-off Table

| Feature       | What We Chose              | Why                      |
|---------------|----------------------------|--------------------------|
| GC            | Semi-space, stop-the-world | Simple, no fragmentation |
| Scheduling    | Cooperative, M:1           | No locks, predictable    |
| Panic/Recover | setjmp/longjmp             | No DWARF unwinding       |
| Reflection    | Minimal                    | Binary size              |
| Preemption    | None                       | Simplicity               |
| C interop     | Direct linking             | No CGo complexity        |

Our philosophy: Predictability over throughput. Simplicity over features.


Part 4: When to Optimize

The Golden Question

Before optimizing anything, ask:

“Have I measured this?”

If the answer is no, stop. You’re guessing. And programmers are notoriously bad at guessing where time is spent.

The 90/10 Rule

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   90% of execution time is spent in 10% of the code         │
│                                                             │
│   That means:                                               │
│   • 90% of your code DOESN'T MATTER for performance         │
│   • Optimizing the wrong code = wasted effort               │
│   • Always measure first!                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

DO Optimize

  • Code that runs every frame (game loop, rendering)
  • Hot loops with thousands of iterations
  • Code that measurements show is slow

DON’T Optimize

  • Code that runs once (startup, level load)
  • Code that runs rarely (menu navigation)
  • Code you haven’t measured
  • At the cost of readability

How to Measure

//extern timer_us_gettime64
func timerUsGettime64() uint64

func measureGameLoop() {
    start := timerUsGettime64()
    
    updatePhysics()
    physicsTime := timerUsGettime64() - start
    
    renderStart := timerUsGettime64()
    renderFrame()
    renderTime := timerUsGettime64() - renderStart
    
    println("Physics:", physicsTime, "us")
    println("Render:", renderTime, "us")
}

Now you know where time actually goes!


Part 5: The Debug Build System

Production vs Debug

By default, libgodc is silent. Zero debug output, zero overhead.

# Production build (default)
make && make install

# Debug build - enables debug output and assertions
make DEBUG=3 && make install

The Performance Tax of Debug Output

┌─────────────────────────────────────────────────────────────┐
│   OPERATION          Production     DEBUG=3                 │
│                                                             │
│   Goroutine spawn    50 μs          188,000 μs (188 ms!)    │
│   Channel send       19 μs          ~50,000 μs              │
│   GC pause           21 ms          ~500 ms                 │
│                                                             │
│   Debug output is EXTREMELY EXPENSIVE!                      │
│   Never benchmark with DEBUG enabled.                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Debug Macros

Instead of raw printf, use these macros:

| Macro              | Use For            | Example              |
|--------------------|--------------------|----------------------|
| LIBGODC_TRACE()    | General tracing    | Scheduler events     |
| LIBGODC_WARNING()  | Non-fatal issues   | Large allocations    |
| LIBGODC_ERROR()    | Recoverable errors | Failed operations    |
| LIBGODC_CRITICAL() | Fatal errors       | Logged to crash dump |
| GC_TRACE()         | GC-specific        | Collection details   |

In production (DEBUG=0): All macros compile to nothing. Zero cost.

In debug (DEBUG=3): Output includes labels:

[godc:main] Scheduling G 42 (status=1)
[godc:main] WARNING: Large allocation 256 KB
[GC] #3: 1024->512 (50% survived) in 21045 us

Using Debug Macros

In C runtime code:

#include "runtime.h"

void my_function(void) {
    LIBGODC_TRACE("Entering my_function");
    
    if (error_condition) {
        LIBGODC_WARNING("Something unexpected: %d", value);
    }
    
    LIBGODC_TRACE("my_function complete");
}

In Go code, use println:

const DEBUG = false  // Set to true when debugging

func debugPrint(msg string) {
    if DEBUG {
        println(msg)
    }
}

Debug Functions Available

When investigating issues, you can call these:

gc_dump_stats();       // Print GC statistics
gc_verify_heap();      // Check heap integrity
gc_print_object(ptr);  // Print object details
gc_dump_heap(10);      // Dump first 10 heap objects

Real Benchmark Results

We ran these benchmarks on actual Dreamcast hardware. These numbers should guide your optimization decisions.

PVRMark: Go vs Native C

We ran the KOS pvrmark benchmark (flat-shaded triangles, no textures) on real Dreamcast hardware to measure Go runtime overhead:

| Metric           | C Native   | Go (default) | Go (GODC_FAST) |
|------------------|------------|--------------|----------------|
| Peak polys/frame | 17,533     | 13,833       | 14,333         |
| Peak pps         | ~1,054,097 | ~831,714     | ~860,532       |
| vs C performance | 100%       | 79%          | 82%            |
| Binary size      | 314 KB     | 614 KB       | 614 KB         |
┌─────────────────────────────────────────────────────────────┐
│   POLYGON THROUGHPUT (polys/frame @ 60fps)                  │
│                                                             │
│   C Native:      ████████████████████████████████████ 17,533│
│   Go Optimized:  ████████████████████████████        14,333 │
│   Go Default:    ██████████████████████████          13,833 │
│                                                             │
│   GODC_FAST=1 adds +500 polys/frame (+3.6%)                 │
│   Go achieves 82% of C polygon throughput                   │
└─────────────────────────────────────────────────────────────┘

Analysis:

  • The 18% overhead comes from bounds checking, slice header overhead, and gccgo code generation differences (not FFI — //extern compiles to direct jsr calls)
  • GODC_FAST=1 improves performance by ~3.6% via aggressive optimization
  • For real games with textures, lighting, and game logic, this difference is negligible
  • 14,333 flat-shaded triangles at 60fps is plenty for actual gameplay

What the extra 300KB binary size buys you:

  • Garbage collection
  • Goroutines and channels
  • Defer/panic/recover
  • Type safety and bounds checking
  • Full Go standard library support

Compiler Optimization Flags

The godc build command uses these SH-4 specific optimizations:

| Flag           | Effect                                 | Default        |
|----------------|----------------------------------------|----------------|
| -O2            | Standard optimization                  | Yes            |
| -m4-single     | Single-precision FPU mode              | Yes            |
| -mfsrra        | Hardware reciprocal sqrt (10× faster)  | Yes            |
| -mfsca         | Hardware sin/cos (10× faster)          | Yes            |
| -O3            | Aggressive optimization                | GODC_FAST only |
| -ffast-math    | Fast FP (breaks IEEE)                  | GODC_FAST only |
| -funroll-loops | Loop unrolling                         | GODC_FAST only |

To enable aggressive optimizations:

GODC_FAST=1 godc build

Warning: -ffast-math breaks IEEE floating point compliance. NaN and infinity handling may not work correctly. Use only for games where FP precision isn’t critical.

Conclusion

What We Built

We started with a simple question: Can Go run on a 1998 game console?

The answer is yes. Not perfectly, not completely, but yes.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   libgodc: A Go Runtime for the Sega Dreamcast              │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  ✓ Memory allocation with bump allocator            │   │
│   │  ✓ Garbage collection (semi-space copying)          │   │
│   │  ✓ Goroutines (cooperative M:1 scheduling)          │   │
│   │  ✓ Channels (buffered and unbuffered)               │   │
│   │  ✓ Select statement                                 │   │
│   │  ✓ Defer, panic, and recover                        │   │
│   │  ✓ Maps, slices, strings, interfaces                │   │
│   │  ✓ Direct C interop via //extern                    │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   All running on 16MB RAM and a 200MHz CPU.                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Trade-offs We Made

Every design decision was a trade-off. Here’s what we chose and why:

| Decision               | What We Gave Up      | What We Gained                |
|------------------------|----------------------|-------------------------------|
| Semi-space GC          | 50% of heap unusable | No fragmentation, simple code |
| Cooperative scheduling | Preemption           | No locks, predictable timing  |
| Fixed 64KB stacks      | Stack growth         | Simplicity, no stack probes   |
| M:1 model              | Parallelism          | No thread synchronization     |
| setjmp/longjmp panic   | DWARF unwinding      | Works without debug info      |
| No finalizers          | Destructor patterns  | Simpler GC, predictable cleanup |

These aren’t the “right” choices for every platform. They’re our choices for this platform.


What We Didn’t Build

libgodc is not a complete Go implementation. We deliberately left out:

  • Race detector — No parallelism means no data races
  • CPU/memory profiling — Use println and timers
  • Debugger support — no Go debugger is available; println is the debugger
  • Full reflection — Binary size matters
  • Preemptive scheduling — Complexity for no benefit
  • Concurrent GC — Single core, stop-the-world is fine

Lessons for Runtime Implementers

If you’re building a runtime for another constrained platform, here’s what we learned:

  • Don’t plan everything upfront. Get println("Hello") working first. The linker errors will guide you to the next step.
  • When documentation fails, read the code. gccgo’s libgo/runtime/ directory answered questions no documentation could.
  • Our first GC was embarrassingly slow. It didn’t matter. Once it worked, we could measure and optimize. Premature optimization would have wasted months.
  • Emulators lie. Timing is different. Memory layout is different. Test on hardware as soon as you can run anything.
  • Fighting the hardware is futile. The SH-4 has 16MB RAM and a 200MHz CPU. Accept it. Design for it. Work with it.

The Bigger Picture

Building this project helped me understand what Go actually does.

When you write go func() {}, something has to:

  • Allocate a stack
  • Save the entry point
  • Add it to a run queue
  • Eventually switch contexts to run it

When you write x := make([]int, 10), something has to:

  • Calculate the size
  • Find free memory
  • Initialize the slice header
  • Eventually clean up when it’s garbage

That “something” is the runtime. Every high-level language has one. Understanding how it works makes you a better programmer in any language.


What’s Next?

libgodc is open source. You can:

  1. Use it — Build games for the Dreamcast in Go
  2. Extend it — Add features you need
  3. Learn from it — Apply these patterns to other platforms
  4. Contribute — Fix bugs, improve performance, write examples

The Dreamcast community is small but passionate. Join us at:


Final Words

The Sega Dreamcast was released on November 27, 1998, in Japan. It was discontinued on March 31, 2001—a commercial failure that outlived its corporate support by decades.

Twenty-five years later, people are still writing code for it. Still pushing its limits. Still finding joy in its constraints.

That’s the magic of retro computing. It’s not about nostalgia. It’s about craft. Modern development gives us infinite resources and infinite complexity. Old hardware gives us finite resources and forces elegant solutions.

libgodc exists because someone asked: “Can Go run on a Dreamcast?”

The answer is yes. And now you know how.


Thank you for reading, Panos

libgodc Design

libgodc is a Go runtime for the Sega Dreamcast. This document explains how it works under the hood.

The Problem

The Dreamcast is a fixed platform: 200MHz SH-4, 16MB RAM, no MMU, no swap. The standard Go runtime assumes infinite memory, preemptive scheduling, operating system threads, and virtual memory. None of these exist here.

libgodc replaces the Go runtime with one designed for this environment.

Architecture

┌────────────────────────────────────────────────────────────────┐
│  Your Go Code                                                  │
│     compiles with sh-elf-gccgo                                 │
│     produces .o files with Go runtime calls                    │
├────────────────────────────────────────────────────────────────┤
│  libgodc (this library)                                        │
│     implements Go runtime functions                            │
│     memory allocation, goroutines, channels, GC                │
├────────────────────────────────────────────────────────────────┤
│  KallistiOS (KOS)                                              │
│     baremetal OS for Dreamcast                                 │
│     provides malloc, threads, drivers                          │
├────────────────────────────────────────────────────────────────┤
│  Dreamcast Hardware                                            │
│     SH4 CPU, PowerVR2 GPU, AICA sound                          │
│     16MB main RAM, 8MB VRAM                                    │
└────────────────────────────────────────────────────────────────┘

We don’t need the full Go runtime. We need enough to run games. Games have different requirements than servers—short sessions, realtime deadlines, no network services. This simplifies everything.

Memory Model

The Budget

16MB total RAM:
 KOS kernel + drivers:     ~1MB
 Your program text/data:   ~13MB
 GC heap (two semispaces): 4MB (2MB active at any time)
 Goroutine stacks:         ~640KB (10 goroutines × 64KB)
 Channel buffers:          Variable
 Available for KOS malloc: ~6-9MB (textures, audio, meshes)

These numbers come from the source-code configuration:

  • GC heap: GC_SEMISPACE_SIZE_KB in godc_config.h (default 2048 KB = 2MB × 2)
  • Stack size: GOROUTINE_STACK_SIZE in godc_config.h (default 64KB)
  • Run bench_architecture.elf to verify: it prints the actual config values

The 16MB limit is absolute. There is no virtual memory, no swap, no second chance. Every byte matters.

Allocation Strategy

libgodc uses three allocation paths:

1. GC Heap (for Go objects)

Small, frequently allocated objects go here. The semispace collector manages them automatically. Implementation: gc_heap.c, gc_copy.c.

Implementation of the allocation in simple pseudocode:

// Bump allocator: O(1) allocation (simplified)
void *gc_alloc(size_t size, type_descriptor *type) {
    size = ALIGN(size + HEADER_SIZE, 8);
    if (alloc_ptr + size > alloc_limit) {
        gc_collect();  // Cheney's algorithm
    }
    void *obj = alloc_ptr;
    alloc_ptr += size;
    return obj;
}
This is simplified. The real code in `gc_heap.c` also handles large objects
(>64KB bypass the GC heap and go straight to malloc), alignment edge cases,
and gc_percent threshold checks. But the core is exactly this: bump a pointer.

The bump allocator is the fastest possible allocation strategy. Deallocation
happens during collection: live objects are copied, dead objects are forgotten.

Usage example:
// Go: allocate freely, GC handles cleanup
func spawnEnemy() *Enemy {
    return &Enemy{bullets: make([]Bullet, 100)}
}
// No kill function needed - when nothing references it, it's collected

2. KOS Heap (for large objects)

Objects larger than 64KB bypass the GC entirely. This is correct for game assets: textures, audio buffers, and mesh data are typically loaded once and never freed during gameplay.

// This goes to KOS malloc, not GC:
texture := make([]byte, 256*256*2)  // 128KB texture
Large objects use `malloc()` internally and are not tracked by the GC.
To free them, use `runtime.FreeExternal`:
//go:linkname freeExternal runtime.FreeExternal
func freeExternal(ptr unsafe.Pointer)

// Allocate large texture
texture := make([]byte, 256*256*2)  // 128KB, bypasses GC

// When done with it:
freeExternal(unsafe.Pointer(&texture[0]))
texture = nil  // Don't use after freeing!

See gc_external_free in gc_heap.c. Run test_free_external.elf to verify.

Typical pattern swap textures between levels:

// Load level 1
bgTexture := make([]byte, 512*512*2)  // 512KB

// ... play level 1 ...

// Unload before level 2
freeExternal(unsafe.Pointer(&bgTexture[0]))
bgTexture = make([]byte, 512*512*2)  // reuses memory

// or you could use a helper function, like that:
func freeSlice(s []byte) {
    if len(s) > 0 {
        freeExternal(unsafe.Pointer(&s[0]))
    }
}

// Then just:
freeSlice(bgTexture)

3. Stack (for goroutine execution)

Each goroutine gets a fixed 64KB stack. No stack growth, no split-stack. This is simpler and faster than growable stacks, but requires discipline.

Stack frames are freed automatically when functions return. Use the stack for temporary buffers:

func processAudio() {
    buffer := [4096]int16{}  // 8KB on stack, automatically freed
    // ...
}

Object Header

Every GC object has an 8-byte header. The GC needs to know each object’s size (to copy it) and whether it contains pointers (to scan them). Storing this inline costs 8 bytes per object but makes lookup instant (the header lives at ptr - 8).

┌──────────────────────────────────────────────────────────┐
│  Bit 31:  Forwarded (1 = copied during GC)               │
│  Bit 30:  NoScan (1 = no pointers)                       │
│  Bits 29-24: Type tag (6 bits, Go type kind)             │
│  Bits 23-0: Size (24 bits, max 16MB)                     │
├──────────────────────────────────────────────────────────┤
│  Type pointer (32 bits, full type descriptor)            │
└──────────────────────────────────────────────────────────┘

To put numbers on it: a [4]byte array actually uses 12 bytes, not 4 (4 data + 8 header). This is why many small allocations hurt more than fewer large ones.

The NoScan bit is critical for performance. Objects containing only integers, floats, or other non-pointer types skip GC scanning entirely: the collector just copies them without inspecting their contents.

The practical takeaway: prefer value types over pointer types when possible.

// Faster GC (NoScan), just a copy:
type Vertex struct { X, Y, Z float32 }
mesh := make([]Vertex, 1000)

// Slower GC (must scan), has pointers:
mesh := make([]*Vertex, 1000)

Garbage Collection

Algorithm: Cheney’s Semi-Space Collector

The heap is divided into two semispaces of equal size. Only one is active at any time. When the active space fills up:

  1. Stop all goroutines (stop-the-world)
  2. Copy all live objects to the other space
  3. Update all pointers to point to new locations
  4. Switch active space
  5. Resume execution

// Two semispaces
gc_heap.space[0] = memalign(32, GC_SEMISPACE_SIZE);
gc_heap.space[1] = memalign(32, GC_SEMISPACE_SIZE);

// Collection switches active space
int old_space = gc_heap.active_space;
int new_space = 1 - old_space;
gc_heap.active_space = new_space;

// Copy to new space, scan roots, update pointers
gc_scan_roots();
// ... Cheney's forwarding loop ...

This algorithm is simple, has no fragmentation, and handles cycles naturally. The cost is that only half the heap is usable at any time.

Collection Trigger

GC runs when:

  • The active space exceeds the threshold (default: 75% when gc_percent=100)
  • An allocation would exceed the remaining space
  • runtime.GC() is called explicitly

The threshold is controlled by gc_percent:

  • gc_percent = 100 (default): threshold = 75% of heap space
  • gc_percent = 50: threshold = 50% of heap space
  • gc_percent = -1: disable automatic GC (only explicit runtime.GC() triggers collection)

To control GC from Go:

//go:linkname setGCPercent debug.SetGCPercent
func setGCPercent(percent int32) int32

//go:linkname gc runtime.GC
func gc()

func init() {
    setGCPercent(50)   // Trigger at 50% instead of 75%
    setGCPercent(-1)   // Or: disable automatic GC entirely
    gc()               // Force a collection now
}

Run test_gc_percent.elf to verify this works.

Pause Times

GC pause time depends on live object count and layout. Run tests/bench_architecture.elf on hardware to measure actual pauses.

For 60fps (16.6ms frames), disable automatic GC during gameplay:

import _ "unsafe"

//go:linkname setGCPercent debug.SetGCPercent
func setGCPercent(percent int32) int32

//go:linkname forceGC runtime.GC
func forceGC()

func main() {
    setGCPercent(-1)  // Disable automatic GC
    
    // ... game runs with no GC pauses ...
    
    // GC during loading screens only:
    showLoadingScreen()
    forceGC()
    startGameplay()
}

Root Scanning

The GC finds live objects by tracing from roots:

static void gc_scan_roots(void)
{
    // Scan explicit roots (gc_add_root)
    for (int i = 0; i < gc_root_table.count; i++) { ... }

    // Scan compiler-registered roots (registerGCRoots)
    gc_scan_compiler_roots();

    // Scan current stack
    gc_scan_stack();

    // Scan all goroutine stacks
    gc_scan_all_goroutine_stacks();
}

  1. Global variables: registered by gccgo-generated code via registerGCRoots(). Each package contributes a root list.

  2. Goroutine stacks: scanned conservatively. Every aligned pointer-sized value that points into the heap is treated as a potential pointer.

  3. Explicit roots: optional. If you write C code that holds pointers to Go objects, call gc_add_root(&ptr) so the GC doesn’t collect them.

DMA Hazard

The GC moves objects. Any pointer held by hardware (PVR DMA, AICA) will become stale after collection. Safe patterns:

// DANGEROUS: GC might move buffer during DMA:
data := make([]byte, 4096)     // Small, in GC heap
startDMA(data)                  // Hardware holds pointer
runtime.Gosched()               // GC might run here!

// SAFE: Large allocations bypass GC:
data := make([]byte, 100*1024)  // >64KB, uses malloc
startDMA(data)                  // Won't move

// SAFE: VRAM for textures:
tex := kos.PvrMemMalloc(size)   // Allocates in VRAM

Scheduler

M:1 Cooperative Model

All goroutines run on a single KOS thread. One goroutine executes at a time. Context switches happen only at explicit yield points:

  • Channel operations (send, receive, select)
  • runtime.Gosched()
  • time.Sleep() and timer waits
  • Blocking I/O

A goroutine in a tight CPU loop will monopolize the processor. There is no preemption.

Why M:1?

The Dreamcast has one CPU core. Preemptive scheduling adds complexity and overhead for no parallelism benefit. Cooperative scheduling is simpler, faster, and sufficient for games.

Run Queue Structure

The scheduler maintains a simple FIFO run queue. Goroutines are added to the tail and removed from the head. This is simpler than priority-based scheduling and sufficient for game workloads where you control when each goroutine yields.

// Goroutines execute in the order they become runnable
runq_put(gp);   // Add to tail
gp = runq_get(); // Remove from head

For real-time requirements, structure your code so time-sensitive work runs on the main goroutine or yields frequently.

Context Switching

Each goroutine saves 64 bytes of CPU state when it yields:

typedef struct sh4_context {
    uint32_t r8, r9, r10, r11, r12, r13, r14;  // Callee-saved
    uint32_t sp, pr, pc;                        // Special registers
    uint32_t fr12, fr13, fr14, fr15;           // FPU callee-saved
    uint32_t fpscr, fpul;                       // FPU control
} sh4_context_t;

Context switch is implemented in runtime_sh4_minimal.S (simplified for brevity):

__go_swapcontext:
    ! Save current context
    mov.l   r8, @r4         ! r4 = old_ctx
    mov.l   r9, @(4, r4)
    ...
    ! Restore new context
    mov.l   @r5, r8         ! r5 = new_ctx
    mov.l   @(4, r5), r9
    ...
    rts

FPU Context

Every context switch saves floating-point registers, even if your goroutine only uses integers. This costs ~50 extra cycles per switch.

// Both goroutines pay FPU overhead, even though neither uses floats
go audioDecoder()   // Integer PCM math
go networkHandler() // Packet parsing

This is a tradeoff: always saving FPU is slower but correct. A goroutine that unexpectedly uses a float won’t corrupt another’s FPU state.

Goroutine Structure

typedef struct G {
    // ABI-CRITICAL: gccgo expects these at specific offsets
    PanicRecord *_panic;      // Offset 0: innermost panic
    GccgoDefer *_defer;       // Offset 4: innermost defer

    // Scheduling
    Gstatus atomicstatus;
    G *schedlink;
    void *param;

    // Stack
    void *stack_lo;
    void *stack_hi;
    stack_segment_t *stack;
    void *stack_guard;
    tls_block_t *tls;

    // CPU context (64 bytes)
    sh4_context_t context;

    // Metadata
    int64_t goid;
    WaitReason waitreason;
    int32_t allgs_index;
    uint32_t death_generation;
    G *dead_link;
    uint8_t gflags2;

    // Channel wait
    sudog *waiting;

    // Defer/panic
    Checkpoint *checkpoint;
    int defer_depth;

    // Entry point
    uintptr_t startpc;
    G *freeLink;
} G;

See goroutine.h for the authoritative definition.

Goroutine Lifecycle

  1. Creation __go_go() allocates G struct, stack, and TLS block
  2. Runnable Added to run queue
  3. Running Scheduler switches context to it
  4. Waiting Parked on channel, timer, or I/O
  5. Dead Function returned, queued for cleanup

Dead goroutines are reclaimed after a grace period (epoch-based reclamation) to ensure no dangling sudog references from channel wait queues.

Channels

Channels are the primary synchronization primitive. Implementation follows the Go runtime closely.

Structure

typedef struct hchan {
    uint32_t qcount;        // Current element count
    uint32_t dataqsiz;      // Buffer size (0 = unbuffered)
    void *buf;              // Circular buffer
    uint16_t elemsize;      // Element size
    uint8_t closed;         // Channel closed flag
    uint8_t buf_mask_valid; // Optimization: can use & instead of %
    struct __go_type_descriptor *elemtype;
    uint32_t sendx, recvx;  // Buffer indices
    waitq recvq, sendq;     // Wait queues (sudog linked lists)
    uint8_t locked;         // Simple lock flag
} hchan;

Unbuffered Channels

Send blocks until a receiver arrives. Receive blocks until a sender arrives. When both are ready, data transfers directly, with no buffering.

This is the fundamental synchronization primitive: rendezvous.

Buffered Channels

Send blocks only when buffer is full. Receive blocks only when buffer is empty. The buffer is a simple circular array.

Select

Select uses randomized ordering to prevent starvation:

select {
case x := <-ch1:  // These are checked in random order
case ch2 <- y:
case <-time.After(timeout):
}

Implementation: shuffle cases, check each for readiness, park on all if none ready.
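The shuffle-then-poll step can be sketched in Go itself. This is an illustrative model, not the runtime's C implementation: pollSelect is a hypothetical helper that reports "no case ready" instead of parking the goroutine.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pollSelect visits the cases in a random order and fires the first
// one that is ready. Randomizing the visit order is what prevents an
// always-ready early case from starving the later ones.
func pollSelect(cases []chan int) (index, value int, ok bool) {
	order := rand.Perm(len(cases)) // shuffled case indices
	for _, i := range order {
		select {
		case v := <-cases[i]:
			return i, v, true
		default: // this case is not ready; try the next one
		}
	}
	return -1, 0, false // real runtime would park on all cases here
}

func main() {
	a, b := make(chan int, 1), make(chan int, 1)
	b <- 7 // only case 1 is ready
	fmt.Println(pollSelect([]chan int{a, b}))
}
```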

Defer, Panic, Recover

Defer

Defer uses a linked list per goroutine. Each defer statement pushes a record; function exit pops and executes them in LIFO order.

typedef struct GccgoDefer {
    struct GccgoDefer *link;    // Next entry in defer stack
    bool *frame;                // Pointer to caller's frame bool
    PanicRecord *panicStack;    // Panic stack when deferred
    PanicRecord *_panic;        // Panic that caused defer to run
    uintptr_t pfn;              // Function pointer to call
    void *arg;                  // Argument to pass to function
    uintptr_t retaddr;          // Return address for recover matching
    bool makefunccanrecover;    // MakeFunc recover permission
    bool heap;                  // Whether heap allocated
} GccgoDefer;  // 32 bytes total

Panic and Recover

User-initiated panic (panic()) is recoverable via recover() in a deferred function. The implementation uses setjmp/longjmp with checkpoints.

Runtime panics (nil dereference, bounds check, divide by zero) are not recoverable; they crash immediately with a diagnostic.

Why? Recovering from a bounds check failure would leave the program in an undefined state. It's better to crash clearly than corrupt silently.

Type System

Type Descriptors

gccgo generates type descriptors for every Go type. libgodc uses these for:

  • GC pointer scanning (which fields contain pointers?)
  • Interface method dispatch (which methods does this type implement?)
  • Reflection (what is this type's name and structure?)

typedef struct __go_type_descriptor {
    uint8_t __code;              // Kind (bool, int, slice, etc.)
    uint8_t __align, __field_align;
    uintptr_t __size;
    uint32_t __hash;
    uintptr_t __ptrdata;         // Bytes containing pointers
    const void *__gcdata;        // Pointer bitmap
    // ...
} __go_type_descriptor;

Interface Tables

Interface dispatch uses precomputed method tables. When you write:

var w io.Writer = os.Stdout
w.Write(data)

The compiler generates an itab linking *os.File to io.Writer, containing function pointers for all interface methods.

SH4 Specifics

Register Allocation

  • r0-r7: Caller-saved (arguments, scratch)
  • r8-r14: Callee-saved (preserved across calls)
  • r15: Stack pointer
  • pr: Procedure return (return address)
  • GBR: Reserved for KOS _Thread_local

We do not use GBR for goroutine TLS. Instead, we use a global current_g pointer. This avoids conflicts with KOS and simplifies context switching.

FPU Mode

libgodc uses single-precision mode (-m4-single). The SH4 FPU is fast in single precision but slow in double precision. All float64 operations generate software emulation calls; avoid them in hot paths.

Cache Considerations

The SH4 has 32-byte cache lines. Context switching saves/restores 64 bytes of CPU state (2 cache lines).

DMA operations require explicit cache management. The GC handles this for its semispace flip, but user code doing DMA must use KOS cache functions:

#include <arch/cache.h>

dcache_flush_range((uintptr_t)ptr, size);  // Flush before DMA write
dcache_inval_range((uintptr_t)ptr, size);  // Invalidate after DMA read

File Organization

runtime/
├── gc_heap.c           # Heap initialization, allocation
├── gc_copy.c           # Cheney's copying collector
├── gc_runtime.c        # Go runtime interface (newobject, etc.)
├── scheduler.c         # Run queue, schedule(), goready()
├── proc.c              # Goroutine creation, lifecycle
├── chan.c              # Channel implementation
├── select.c            # Select statement
├── sudog.c             # Wait queue entries
├── defer_dreamcast.c   # Defer/panic/recover
├── timer.c             # Time.Sleep, timers
├── tls_sh4.c           # TLS management
├── runtime_sh4_minimal.S  # Context switching assembly
├── interface_dreamcast.c  # Interface dispatch
├── map_dreamcast.c     # Map implementation
├── goroutine.h         # Core data structures
├── gen-offsets.c       # Generates struct offset definitions
└── asm-offsets.h       # Auto-generated struct offsets for assembly

Assembly/C ABI Synchronization

The Problem

Context switching is implemented in assembly (runtime_sh4_minimal.S). The assembly code accesses G struct fields by hardcoded byte offsets:

mov.l   @(32, r4), r0    ! Load G->context at offset 32

If someone changes the G struct in C (adds/removes/reorders fields), the assembly breaks silently: it reads garbage from wrong offsets. This is a classic embedded systems bug: C struct layout changes invisibly break hand-written assembly.

The Solution

We use a three-layer defense:

1. Generated Header (asm-offsets.h)

gen-offsets.c uses offsetof() to emit the actual struct offsets:

// gen-offsets.c
OFFSET(G_CONTEXT, G, context);  // Emits: #define G_CONTEXT 32

The Makefile compiles this to assembly, extracts the #define lines, and writes asm-offsets.h. This header is committed to git.

2. Build-Time Verification (make check-offsets)

Before release, run:

make check-offsets

This regenerates the offsets from the current struct and diffs against the committed header. If they don’t match, the build fails with a clear error.

3. Runtime Verification (scheduler.c)

At startup, the scheduler verifies critical offsets:

if (offsetof(G, context) != G_CONTEXT) {
    runtime_throw("G struct layout mismatch - update asm-offsets.h");
}

If somehow a mismatched binary runs, it crashes immediately with a diagnostic instead of silently corrupting goroutine state.

Workflow for Changing G Struct

  1. Modify runtime/goroutine.h (the authoritative definition)
  2. Update runtime/gen-offsets.c to match
  3. Run make check-offsets — it will fail if out of sync
  4. Run make runtime/asm-offsets.h to regenerate
  5. Update runtime/runtime_sh4_minimal.S if G_CONTEXT changed
  6. Run make check-offsets again — should pass now
  7. Commit all changed files together

Why This Matters

In games, struct layout bugs cause symptoms like:

  • Goroutines resume with corrupted registers
  • Context switches overwrite random memory
  • FPU state leaks between goroutines
  • Panics with nonsensical stack traces

These are nearly impossible to debug. The offset verification catches them at build time (or worst case, at startup) instead of during the final boss fight.

Performance

Measured on real Dreamcast hardware (SH4 @ 200MHz), verified December 2025:

| Operation | Time | Notes |
|---|---|---|
| Gosched yield | 120 ns | Minimal scheduler round-trip |
| Direct call | 140 ns | Baseline comparison |
| Buffered channel op | ~1.5 μs | Send to ready receiver |
| Context switch | ~6.6 μs | Full goroutine switch |
| Unbuffered channel | ~13 μs | Send + receive round-trip |
| Goroutine spawn | ~34 μs | Create + schedule + run |
| GC pause (bypass) | ~73 μs | Objects ≥64KB bypass GC |
| GC pause (64KB live) | ~2.2 ms | Medium live set |
| GC pause (32KB live) | ~6.2 ms | Many small objects |

Run tests/bench_architecture.elf to measure on your hardware.

Note: For a complete reference of performance numbers, see the Glossary.

Design Decisions

Why gccgo instead of gc?

The standard Go compiler (gc) generates code for a completely different runtime. gccgo uses GCC’s backend, which already supports SH4 targets. We replace libgo with libgodc; the compiler doesn’t need modification.

Why semispace instead of marksweep?

Semispace has no fragmentation. In a 16MB system, fragmentation would eventually make large allocations impossible even with free memory. The 50% space overhead is acceptable for games.

Why cooperative instead of preemptive?

Preemptive scheduling requires timer interrupts, signal handling, and safepoint insertion. All of this complexity gains nothing on a single-core CPU. Cooperative scheduling is simpler, faster, and sufficient.

Why fixed stacks instead of growable?

Growable stacks require compiler support (stack probes) and runtime support (morestack). Fixed stacks work with any compiler flags and simplify the runtime. 64KB is enough for typical game code.

References

  • Cheney, C.J. "A Nonrecursive List Compacting Algorithm." CACM, 1970.
  • Jones & Lins. "Garbage Collection." Wiley, 1996.
  • The Go Programming Language Specification.
  • KallistiOS Documentation.
  • SH4 Software Manual, Renesas.

Effective Dreamcast Go

A practical guide to writing efficient Go code for the Sega Dreamcast.

These patterns come from real debugging sessions with the libgodc runtime. Follow them to write games that run smoothly at 60fps on the Dreamcast's 200MHz SH-4 processor with 16MB RAM.

Memory Model

| Resource | Limit | Notes |
|---|---|---|
| Total RAM | 16 MB | Shared with VRAM, sound, OS |
| GC Heap | 2 MB × 2 | Semispace collector, 4 MB total |
| Goroutine Stack | 64 KB | Fixed size, cannot grow |
| Large Object Threshold | 64 KB | Larger objects bypass GC |

1. Pre-allocate During Loading

The garbage collector can pause your game for several milliseconds. Allocate everything during load screens, not gameplay.

Bad: Allocating during gameplay

func UpdateParticles() {
    for i := 0; i < 100; i++ {
        p := new(Particle)  // GC pause risk every frame!
        particles = append(particles, p)
    }
}

Good: Object pooling

// Pre-allocated pool
var particlePool [1000]Particle
var activeCount int

func Init() {
    activeCount = 0
}

func SpawnParticle() *Particle {
    if activeCount >= len(particlePool) {
        return nil  // Pool exhausted
    }
    p := &particlePool[activeCount]
    activeCount++
    *p = Particle{}  // Reset to zero
    return p
}

func DespawnParticle(index int) {
    // Swap with last active
    activeCount--
    particlePool[index] = particlePool[activeCount]
}

2. Respect the 64KB Stack Limit

Each goroutine has a fixed 64KB stack. Unlike desktop Go, stacks cannot grow. Deep recursion or large local variables will crash your game.

Bad: Large local arrays

func ProcessFrame() {
    var buffer [16384]float32  // 64KB on stack - CRASH!
    // ...
}

Good: Use globals or heap for large data

var frameBuffer [8192]float32  // Global, not on stack

func ProcessFrame() {
    // Use frameBuffer safely
    for i := range frameBuffer {
        frameBuffer[i] = 0
    }
}

Bad: Deep recursion

func TraverseTree(node *Node) {
    if node == nil { return }
    TraverseTree(node.left)   // Stack grows each call
    TraverseTree(node.right)  // Can overflow on deep trees
}

Good: Iterative with explicit stack

func TraverseTree(root *Node) {
    stack := make([]*Node, 0, 64)  // Heap-allocated
    stack = append(stack, root)
    
    for len(stack) > 0 {
        node := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        
        if node == nil { continue }
        // Process node...
        stack = append(stack, node.left, node.right)
    }
}

3. Reuse Slices

Creating new slices allocates memory. Reuse existing slices by resetting their length.

Bad: New slice every frame

func GetVisibleEnemies() []Enemy {
    result := make([]Enemy, 0)  // Allocation every call!
    for _, e := range allEnemies {
        if e.visible {
            result = append(result, e)
        }
    }
    return result
}

Good: Reuse with length reset

var visibleEnemies []Enemy

func Init() {
    visibleEnemies = make([]Enemy, 0, 100)  // Once during init
}

func GetVisibleEnemies() []Enemy {
    visibleEnemies = visibleEnemies[:0]  // Reset length, keep capacity
    for _, e := range allEnemies {
        if e.visible {
            visibleEnemies = append(visibleEnemies, e)
        }
    }
    return visibleEnemies
}

4. Minimize Goroutines

Each goroutine consumes 64KB of stack space. 100 goroutines = 6.4MB RAM—40% of total Dreamcast memory!

Bad: Goroutine per entity

for _, enemy := range enemies {
    go enemy.Think()  // 100 enemies = 6.4MB just for stacks!
}

Good: Process on main goroutine

func UpdateAllEnemies() {
    for i := range enemies {
        enemies[i].Think()  // Sequential, predictable
    }
}

Acceptable: Few dedicated goroutines

func main() {
    go audioMixer()      // One for audio streaming
    go networkHandler()  // One for network (if needed)
    
    // Main loop handles game logic
    for {
        Update()
        Render()
    }
}

5. Use Value Types for Small Structs

Small structs passed by value stay on the stack. Pointers may escape to the heap.

Good: Pass small structs by value

type Vec3 struct {
    X, Y, Z float32  // 12 bytes
}

func Add(a, b Vec3) Vec3 {
    return Vec3{a.X + b.X, a.Y + b.Y, a.Z + b.Z}
}

// Usage - no heap allocation
pos := Add(velocity, acceleration)

Bad: Unnecessary pointer for small struct

func Add(a, b *Vec3) *Vec3 {
    return &Vec3{a.X + b.X, a.Y + b.Y, a.Z + b.Z}  // Escapes to heap!
}

Structs under ~64 bytes are fine to pass by value.

6. Avoid String Operations During Gameplay

Strings are immutable. Concatenation creates new strings (garbage).

Bad: String building in loop

var log string
for i := 0; i < 100; i++ {
    log = log + "entry"  // New allocation each iteration!
}

Bad: Formatted strings every frame

func DrawHUD() {
    scoreText := fmt.Sprintf("Score: %d", score)  // Allocates!
    DrawText(scoreText)
}

Good: Pre-render or avoid strings

// For HUD: use digit sprites
func DrawScore(score int) {
    x := 100
    for score > 0 {
        digit := score % 10
        DrawSprite(digitSprites[digit], x, 10)
        x -= 16
        score /= 10
    }
}

// For debug: print directly (still allocates, but debug only)
println("Debug:", value)

7. Large Assets Bypass GC

Allocations over 64KB use malloc directly and are not garbage collected.

// This 128KB texture is NOT managed by GC
texture := make([]byte, 256*256*2)

// It will live forever (or until program exit)
// This is usually fine - load assets once, keep forever

Implications:

  • Large slices don't pressure the GC
  • They also don't get freed automatically
  • Perfect for textures, sounds, level data

8. Escape Analysis Awareness

The Go compiler decides whether variables go on stack (fast) or heap (needs GC). Variables "escape" to heap when:

  • Returned from a function
  • Stored in a slice, map, or struct field
  • Passed to a goroutine
  • Address taken and stored somewhere

Stack allocated (good):

func Calculate() int {
    x := 42        // Stays on stack
    y := x * 2     // Stays on stack
    return y       // Value returned, not pointer
}

Heap allocated (be aware):

func MakeEnemy() *Enemy {
    e := Enemy{}   // Must escape - we return pointer
    return &e      // Heap allocation here
}

Force stack when possible:

// Instead of returning pointer...
func MakeEnemy() *Enemy {
    return &Enemy{HP: 100}  // Heap
}

// Return value and let caller decide:
func NewEnemy() Enemy {
    return Enemy{HP: 100}  // Caller's stack or their choice
}

9. Map Usage Patterns

Maps allocate internally. Pre-size them and avoid creating during gameplay.

Bad: Maps created during gameplay

func SpawnWave() {
    enemyTypes := make(map[string]int)  // Allocation!
    enemyTypes["goblin"] = 10
    // ...
}

Good: Pre-allocated maps

var enemyTypes map[string]int

func Init() {
    enemyTypes = make(map[string]int, 10)  // Pre-size at init
}

func SpawnWave() {
    // Clear and reuse
    for k := range enemyTypes {
        delete(enemyTypes, k)
    }
    enemyTypes["goblin"] = 10
}

10. The Game Loop Pattern

A typical Dreamcast game structure:

package main

// === PRE-ALLOCATED RESOURCES ===
var (
    enemies     [100]Enemy
    particles   [500]Particle
    projectiles [200]Projectile
    
    activeEnemies     []*Enemy
    activeParticles   []*Particle
    activeProjectiles []*Projectile
)

func Init() {
    // Pre-allocate slice capacity
    activeEnemies = make([]*Enemy, 0, 100)
    activeParticles = make([]*Particle, 0, 500)
    activeProjectiles = make([]*Projectile, 0, 200)
    
    // Load assets (large allocations OK here)
    LoadTextures()
    LoadSounds()
    LoadLevel()
}

func Update() {
    // Reset working slices
    activeEnemies = activeEnemies[:0]
    
    // Process game logic (no allocations!)
    for i := range enemies {
        if enemies[i].active {
            enemies[i].Update()
            activeEnemies = append(activeEnemies, &enemies[i])
        }
    }
}

func Render() {
    // Draw using pre-allocated data
    for _, e := range activeEnemies {
        e.Draw()
    }
}

func main() {
    Init()
    
    for !shouldExit {
        Input()
        Update()
        Render()
        // VSync handled by PVR
    }
}

Quick Reference Card

DO

var pool [N]Object             // Pre-allocated pools
slice = slice[:0]              // Reset slice, keep capacity
for i := range arr { }         // Index iteration
small := Vec3{1, 2, 3}         // Value types
make([]T, 0, capacity)         // Pre-sized slices (at init)
val, ok := m[key]              // Safe map access
select { default: }            // Yield in loops
runtime_checkpoint()           // For panic recovery

AVOID (during gameplay)

make([]T, n)                   // New slices
append(s, x)                   // When at capacity  
new(T)                         // For small types
go func() {}()                 // Excessive goroutines
string + string                // String concatenation
fmt.Sprintf()                  // Formatted strings
recover()                      // Use runtime_checkpoint instead
for { busyWork() }             // Loops without yielding

11. Panic/Recover Limitation

Standard Go’s recover() does not work on Dreamcast due to ABI differences. Use the runtime_checkpoint() pattern instead:

Bad: Standard recover (won’t work)

func SafeCall() {
    defer func() {
        if r := recover(); r != nil {  // NEVER catches panics!
            println("recovered")
        }
    }()
    panic("oops")
}

Good: Use runtime_checkpoint

import _ "unsafe"

//go:linkname runtime_checkpoint runtime.runtime_checkpoint
func runtime_checkpoint() int

func SafeCall() (recovered bool) {
    defer func() {
        if runtime_checkpoint() != 0 {
            recovered = true
            return
        }
        // Normal cleanup here
    }()
    panic("oops")
}

Most game code shouldn't need recover. Design to avoid panics:

  • Check bounds before indexing
  • Validate inputs at entry points
  • Use the ok form for map access: val, ok := m[key]
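The last two points can be combined into one defensive helper; a minimal sketch (safeLookup is an illustrative name):

```go
package main

import "fmt"

// safeLookup avoids two unrecoverable runtime panics: it bounds-checks
// the index before slicing, and uses the ok form for map access so a
// missing key reports false instead of yielding a zero silently.
func safeLookup(scores map[string]int, order []string, i int) (int, bool) {
	if i < 0 || i >= len(order) { // bounds check before indexing
		return 0, false
	}
	v, ok := scores[order[i]] // ok form: never panics
	return v, ok
}

func main() {
	scores := map[string]int{"rex": 9}
	fmt.Println(safeLookup(scores, []string{"rex"}, 0))
	fmt.Println(safeLookup(scores, []string{"rex"}, 3))
}
```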

12. Cooperative Scheduling

The Dreamcast scheduler is cooperative, not preemptive. Goroutines run until they yield.

Goroutines yield when they:

  • Send/receive on channels
  • Call select (including with default)
  • Call explicit yield functions
  • Block on I/O

Bad: Infinite loop without yielding

go func() {
    for {
        doWork()  // Never yields - blocks all other goroutines!
    }
}()

Good: Yield periodically

go func() {
    for {
        doWork()
        select {
        case <-done:
            return
        default:
            // Yields to scheduler, then continues
        }
    }
}()

Better: Use channels for work

go func() {
    for item := range workQueue {  // Yields while waiting
        process(item)
    }
}()

Timing is not guaranteed

Because of cooperative scheduling:

  • Don’t rely on precise goroutine ordering
  • Deadlines are “best effort”, not hard guarantees
  • For real-time needs, keep critical work on main goroutine

13. Select with Default

select with default is an efficient polling pattern that yields correctly:

func pollChannels() {
    for {
        select {
        case msg := <-inputChan:
            handleInput(msg)
        case result := <-resultChan:
            handleResult(result)
        default:
            // No message ready - yields to other goroutines
            // then returns immediately
        }
        
        // Can do other work here
        processFrame()
    }
}

This pattern works well for:

  • Non-blocking channel checks
  • Game loops that need to poll multiple sources
  • Background workers that shouldn’t block the main loop

Platform Constraints

Goroutine Leak

Dead goroutines retain ~160 bytes each (G struct only). The stack memory and TLS are properly reclaimed, and the G struct is kept in a free list for reuse by future goroutines. When you spawn a new goroutine, it reuses a G from the free list if available.

If you spawn 10,000 goroutines that all exit without spawning new ones, you’ll have ~1.6MB in the free list. This memory is reused when you spawn new goroutines. Monitor goroutine count with runtime.NumGoroutine().

Unrecoverable Runtime Panics

User panic() is recoverable. Runtime panics are not:

  • Nil pointer dereference
  • Array/slice bounds check
  • Integer divide by zero
  • Stack overflow

These crash immediately. A bounds check failure means program invariants are violated—continuing would corrupt data.

32-bit Pointers

All pointers are 4 bytes. Code assuming 64-bit pointers will break. unsafe.Sizeof(uintptr(0)) returns 4, not 8.

Single-Precision FPU

The SH-4 FPU operates in single precision. Double precision is software emulated—extremely slow. Avoid float64 in hot paths.

Cache Coherency

DMA operations require explicit cache management. Use KOS cache functions from C or via //extern:

#include <arch/cache.h>

dcache_flush_range((uintptr_t)ptr, size);  // Before DMA write (CPU -> HW)
dcache_inval_range((uintptr_t)ptr, size);  // After DMA read (HW -> CPU)

Not Implemented

  • Race detector
  • CPU/memory profiling
  • Debugger support (delve, gdb)
  • Plugin package
  • cgo (use //extern for C functions)
  • Signals (os.Signal, signal.Notify)
  • Networking (requires Broadband Adapter)

Limited Implementation

  • reflect: Basic type inspection only, no reflect.MakeFunc
  • unsafe: Works, but remember 4-byte pointers
  • sync: Mutexes work, but with M:1 scheduling no other goroutine runs while you hold a lock—deadlock is impossible but starvation is easy

Compatibility

  • gccgo only (not the standard gc compiler)
  • KallistiOS required
  • SH-4 architecture only

Debugging Tips

Available tools:

  • Serial output via println() (routed to dc-tool)
  • LIBGODC_ERROR / LIBGODC_CRITICAL macros (defined in runtime.h)
  • GC statistics via the C function gc_stats(&used, &total, &collections)
  • runtime.NumGoroutine() to count active goroutines
  • KOS debug console (dbglog())

Not available: stack traces, core dumps, breakpoints, variable inspection, heap profiling. When something goes wrong, you have println() and your brain.

If your game stutters:

  1. Check GC pauses: Add timing around forceGC() calls to measure
  2. Count allocations: Use pools and count activeCount
  3. Monitor goroutines: Keep count of active goroutines
  4. Profile stack usage: Deep call chains near 64KB will crash

If your game freezes (but doesn’t crash):

  1. Goroutine not yielding: A goroutine in a tight loop starves others
  2. Deadlock: Two goroutines waiting on each other’s channels
  3. Main blocked: Main goroutine waiting on a channel nobody sends to

If your game crashes:

  1. Stack overflow: Reduce recursion, shrink local arrays
  2. Nil pointer: Check slice bounds, map existence
  3. GC corruption: Ensure pointers are valid (not into freed memory)
  4. Panic without checkpoint: Use runtime_checkpoint() for recovery

Further Reading

  • docs/DESIGN.md - Runtime architecture
  • docs/KOS_WRAPPERS.md - Hardware access
  • examples/ - Working game examples

Console development is the art of saying ‘no’ to malloc.

KOS API Bindings

KOS is written in C. Your game is written in Go. gccgo’s //extern directive lets you call C functions directly with no wrapper overhead.

┌─────────────────────────────────────────────────────┐
│  Go Code                                            │
│      kos.PvrInitDefaults()                          │
│                │                                    │
│                ▼                                    │
│  //extern pvr_init_defaults                         │
│  func PvrInitDefaults() int32                       │
│                │                                    │
│                ▼                                    │
│  pvr_init_defaults() in libkallisti.a               │
│                │                                    │
│                ▼                                    │
│  Dreamcast Hardware                                 │
└─────────────────────────────────────────────────────┘

Basic Syntax

Function with No Arguments

//go:build gccgo

package kos

//extern pvr_scene_begin
func PvrSceneBegin()

The //extern comment must immediately precede the function declaration, with no blank lines between them. The function has no body; gccgo generates the call directly.

Function with Arguments

//extern pvr_list_begin
func PvrListBegin(list uint32) int32

//extern pvr_poly_compile
func pvrPolyCompile(header uintptr, context uintptr)

Arguments are passed according to the SH-4 ABI: first four in registers (r4-r7), remainder on the stack.

Function with Return Value

//extern pvr_mem_available
func PvrMemAvailable() uint32

//extern timer_us_gettime64
func TimerUsGettime64() uint64

Return values come back in r0 (32-bit) or r0:r1 (64-bit).

Type Mappings

The SH-4 is a 32-bit architecture with 4-byte alignment.

| C Type | Go Type | Size | Notes |
|---|---|---|---|
| void | (no return) | - | |
| int | int32 | 4 | SH-4 int is 32-bit |
| unsigned int | uint32 | 4 | |
| int8_t | int8 | 1 | |
| uint8_t | uint8 | 1 | |
| int16_t | int16 | 2 | |
| uint16_t | uint16 | 2 | |
| int32_t | int32 | 4 | |
| uint32_t | uint32 | 4 | |
| int64_t | int64 | 8 | |
| uint64_t | uint64 | 8 | |
| float | float32 | 4 | |
| double | float64 | 8 | Software emulated (slow) |
| void* | unsafe.Pointer | 4 | |
| char* | *byte | 4 | Or unsafe.Pointer |
| size_t | uint32 | 4 | uintptr also works |
| struct foo* | *Foo | 4 | Define matching Go struct |

Pointer Size

All pointers are 4 bytes. Code that assumes 64-bit pointers will break. unsafe.Sizeof(uintptr(0)) is 4, not 8.

Struct Mappings

When a KOS function takes a pointer to a struct, you have two options:

Option 1: unsafe.Pointer (Quick and Dirty)

//extern pvr_vertex_submit
func pvrVertexSubmit(data unsafe.Pointer, size int32)

// Usage:
func SubmitVertex(v *PvrVertex) {
    pvrVertexSubmit(unsafe.Pointer(v), int32(unsafe.Sizeof(*v)))
}

This works but provides no type safety. Fine for prototyping.

Option 2: Matching Go Struct (Correct)

Define a Go struct with identical layout to the C struct:

// From dc/pvr.h
typedef struct {
    uint32_t flags;
    float x, y, z;
    float u, v;
    uint32_t argb;
    uint32_t oargb;
} pvr_vertex_t;

// In Go
type PvrVertex struct {
    Flags      uint32
    X, Y, Z    float32
    U, V       float32
    ARGB       uint32
    OARGB      uint32
}

//extern pvr_prim
func pvrPrim(data unsafe.Pointer, size int32)

// PvrPrimVertex submits a vertex to the TA
func PvrPrimVertex(v *PvrVertex) {
    pvrPrim(unsafe.Pointer(v), 32)  // 32 bytes
}

Verify the struct size matches:

func init() {
    if unsafe.Sizeof(PvrVertex{}) != 32 {
        panic("PvrVertex size mismatch")
    }
}

Alignment Matters

C structs may have padding for alignment. Go structs follow Go’s alignment rules, which may differ. Always verify sizes match.

// C struct with padding:
// struct { char a; int b; }  // 8 bytes (3 bytes padding after a)

// Go equivalent:
type Example struct {
    A   byte
    _   [3]byte  // Explicit padding
    B   int32
}

Stub Files for Host Compilation

Go files using `//extern` only compile with gccgo. For IDE support and
host-side testing, create stub files:

pvr.go (Dreamcast build)

//go:build gccgo

package kos

//extern pvr_init_defaults
func PvrInitDefaults() int32

//extern pvr_scene_begin
func PvrSceneBegin()

pvr_stub.go (Host build)

//go:build !gccgo

package kos

func PvrInitDefaults() int32 { panic("kos: not on Dreamcast") }
func PvrSceneBegin()         { panic("kos: not on Dreamcast") }

The build tag ensures the right file is used:

- `gccgo` tag: compiles with sh-elf-gccgo (Dreamcast)
- `!gccgo` tag: compiles with standard go (host)

Common Patterns

Wrapper for Type Safety

Expose a safe public API, hide the unsafe internals:

// Private: direct C binding
//extern maple_dev_status
func mapleDevStatus(dev uintptr) uintptr

// Public: type-safe wrapper with method syntax
func (d *MapleDevice) ContState() *ContState {
    if d == nil {
        return nil
    }
    ptr := mapleDevStatus(uintptr(unsafe.Pointer(d)))
    if ptr == 0 {
        return nil
    }
    return (*ContState)(unsafe.Pointer(ptr))
}

Slice to C Array

C functions expect a pointer and length. Go slices have both:

//extern pvr_txr_load
func pvrTxrLoad(src unsafe.Pointer, dst unsafe.Pointer, count uint32)

func PvrTxrLoad(src []byte, dst unsafe.Pointer) {
    if len(src) == 0 {
        return
    }
    pvrTxrLoad(unsafe.Pointer(&src[0]), dst, uint32(len(src)))
}

Always check for empty slices—&src[0] panics on an empty slice.

String to C String

Go strings are not null-terminated. C functions expect null-terminated strings.

import "unsafe"

// Convert Go string to C string (allocates)
func cstring(s string) *byte {
    b := make([]byte, len(s)+1)
    copy(b, s)
    b[len(s)] = 0
    return &b[0]
}

// Usage:
//extern fs_open
func fsOpen(path *byte, mode int32) int32

func Open(path string) int32 {
    return fsOpen(cstring(path), O_RDONLY)
}

For hot paths, avoid allocation by using fixed buffers:

var pathBuf [256]byte

func OpenFast(path string) int32 {
    if len(path) >= 255 {
        panic("path too long")
    }
    copy(pathBuf[:], path)
    pathBuf[len(path)] = 0
    return fsOpen(&pathBuf[0], O_RDONLY)
}

Callback Functions

Some KOS functions take callbacks. This requires careful handling:

//extern pvr_set_bg_color
func PvrSetBgColor(r, g, b float32)

// For callbacks, you often need to use //export to make a Go function
// callable from C. However, this is complex with gccgo.
// Prefer polling over callbacks when possible.

Callbacks from C to Go are tricky because:

  1. The callback runs on whatever stack C chooses
  2. The Go scheduler may not be in a consistent state
  3. The GC may be running

Poll instead of using callbacks when you can.
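As a sketch of the polling style, here is an edge-detecting input poll. The readButtons parameter stands in for whatever //extern binding your project exposes; it is not a real KOS function:

```go
package main

import "fmt"

// pollButtons samples controller state once per frame instead of taking
// a C callback. readButtons stands in for a real //extern binding.
func pollButtons(readButtons func() uint32, frames int) []uint32 {
	var prev uint32
	var events []uint32
	for i := 0; i < frames; i++ {
		cur := readButtons()
		// Edge detection: report only bits that are newly pressed this frame.
		if pressed := cur &^ prev; pressed != 0 {
			events = append(events, pressed)
		}
		prev = cur
	}
	return events
}

func main() {
	// Simulated per-frame button bitmasks.
	states := []uint32{0, 1, 1, 3, 0}
	i := 0
	read := func() uint32 { s := states[i]; i++; return s }
	fmt.Println(pollButtons(read, len(states))) // [1 2]
}
```

Because the loop runs on a goroutine you control, the scheduler and GC are always in a consistent state when your handler code runs.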

Caveats

Stack Usage

KOS functions run on the calling goroutine’s stack. Deep C call chains can overflow the 64KB stack:

// DANGEROUS: Unknown stack depth
func LoadLevel(path string) {
    // fs_open -> iso9660_read -> g2_read -> ...
    // How deep does this go?
}

Solutions:

  1. Call from the main goroutine (larger stack)
  2. Limit recursion depth in your code
  3. Move heavy I/O to loading screens

Blocking Calls

Some KOS functions block (file I/O, CD reads). During blocking:

  • No other goroutines run (M:1 scheduler is blocked)
  • Timers don’t fire
  • The game freezes

// BAD: Blocks entire game for 200ms+
data := loadFile("/cd/level.dat")

// BETTER: Do during loading screen
showLoadingScreen()
data := loadFile("/cd/level.dat")
hideLoadingScreen()

// BEST: Stream data over multiple frames
go streamFile("/cd/level.dat", dataChan)

GBR Register

libgodc uses a global pointer for goroutine TLS, leaving GBR for KOS. This means KOS _Thread_local variables work correctly.

If you’re writing assembly or using inline asm, don’t touch GBR—it’s reserved for KOS.

Building the kos Package

The kos/ directory contains the official bindings. To rebuild:

cd kos/
make clean
make
make install  # Copies to $KOS_BASE/lib/

This produces:

  • kos.gox — Export data for the Go compiler
  • libkos.a — Compiled bindings for the linker

Adding New Bindings

Step 1: Find the C Declaration

grep -r "pvr_mem_reset" $KOS_BASE/include/
# Found in dc/pvr.h:
# void pvr_mem_reset(void);

Step 2: Write the Go Binding

//extern pvr_mem_reset
func PvrMemReset()

For functions with complex signatures, check the header carefully:

// From dc/pvr.h
int pvr_prim(void *data, size_t size);
//extern pvr_prim
func pvrPrim(data unsafe.Pointer, size uint32) int32

Step 3: Add Type-Safe Wrapper (Optional)

// For polygon headers (using helper function)
func PvrPrim(hdr *PvrPolyHdr) int32 {
    return goPvrPrimHdr(unsafe.Pointer(hdr))
}

// For vertices (using helper function)
func PvrPrimVertex(v *PvrVertex) int32 {
    return goPvrPrimVertex(unsafe.Pointer(v))
}

Note: For performance-critical paths like vertex submission, libgodc uses specialized C helper functions (__go_pvr_prim_hdr, __go_pvr_prim_vertex) that handle store queue operations efficiently.

Step 4: Add Stub

func PvrMemReset() {
    panic("kos: not on Dreamcast")
}

Step 5: Rebuild

make clean && make && make install

Reference: KOS Subsystems

| Subsystem | Header | Prefix | Description |
|---|---|---|---|
| PVR | dc/pvr.h | pvr_ | PowerVR graphics |
| Maple | dc/maple.h | maple_ | Controllers, VMU, etc. |
| Sound | dc/sound/ | snd_ | AICA sound chip |
| Streaming | dc/snd_stream.h | snd_stream_ | Audio streaming |
| Filesystem | kos/fs.h | fs_ | File operations |
| Timer | arch/timer.h | timer_ | High-resolution timing |
| Video | dc/video.h | vid_ | Video modes |
| G2 Bus | dc/g2bus.h | g2_ | Bus transfers |
| CD-ROM | dc/cdrom.h | cdrom_ | CD access |
| VMU | dc/vmu_*.h | vmu_ | Visual Memory Unit |
| BFont | dc/biosfont.h | bfont_ | BIOS font rendering |

Example: Complete PVR Bindings

pvr.go

//go:build gccgo

package kos

import "unsafe"

// PvrPtr is a pointer to PVR video memory (VRAM)
type PvrPtr uintptr

// PVR list types
const (
    PVR_LIST_OP_POLY uint32 = 0  // Opaque polygons
    PVR_LIST_OP_MOD  uint32 = 1  // Opaque modifiers
    PVR_LIST_TR_POLY uint32 = 2  // Translucent polygons
    PVR_LIST_TR_MOD  uint32 = 3  // Translucent modifiers
    PVR_LIST_PT_POLY uint32 = 4  // Punch-through polygons
)

// Initialization
//extern pvr_init_defaults
func PvrInitDefaults() int32

// Scene management
//extern pvr_scene_begin
func PvrSceneBegin()

//extern pvr_scene_finish
func PvrSceneFinish() int32

//extern pvr_wait_ready
func PvrWaitReady() int32

// List management
//extern pvr_list_begin
func PvrListBegin(list uint32) int32

//extern pvr_list_finish
func PvrListFinish() int32

// Primitive submission via helper functions
//extern __go_pvr_prim_hdr
func goPvrPrimHdr(data unsafe.Pointer) int32

//extern __go_pvr_prim_vertex
func goPvrPrimVertex(data unsafe.Pointer) int32

// PvrPolyHdr mirrors pvr_poly_hdr_t from dc/pvr.h (32 bytes)
type PvrPolyHdr struct {
	Cmd            uint32
	Mode1          uint32
	Mode2          uint32
	Mode3          uint32
	D1, D2, D3, D4 float32
}

type PvrVertex struct {
    Flags      uint32
    X, Y, Z    float32
    U, V       float32
    ARGB       uint32
    OARGB      uint32
}

// PvrPrim submits a polygon header
func PvrPrim(hdr *PvrPolyHdr) int32 {
    return goPvrPrimHdr(unsafe.Pointer(hdr))
}

// PvrPrimVertex submits a vertex
func PvrPrimVertex(v *PvrVertex) int32 {
    return goPvrPrimVertex(unsafe.Pointer(v))
}

// Memory management
//extern pvr_mem_malloc
func PvrMemMalloc(size uint32) PvrPtr

//extern pvr_mem_free
func PvrMemFree(ptr PvrPtr)

//extern pvr_mem_available
func PvrMemAvailable() uint32

pvr_stub.go

//go:build !gccgo

package kos

type PvrPtr uintptr

const (
    PVR_LIST_OP_POLY uint32 = 0
    PVR_LIST_OP_MOD  uint32 = 1
    PVR_LIST_TR_POLY uint32 = 2
    PVR_LIST_TR_MOD  uint32 = 3
    PVR_LIST_PT_POLY uint32 = 4
)

// PvrPolyHdr mirrors pvr_poly_hdr_t from dc/pvr.h
type PvrPolyHdr struct {
	Cmd            uint32
	Mode1          uint32
	Mode2          uint32
	Mode3          uint32
	D1, D2, D3, D4 float32
}

type PvrVertex struct {
    Flags      uint32
    X, Y, Z    float32
    U, V       float32
    ARGB       uint32
    OARGB      uint32
}

func PvrInitDefaults() int32           { panic("kos: not on Dreamcast") }
func PvrSceneBegin()                   { panic("kos: not on Dreamcast") }
func PvrSceneFinish() int32            { panic("kos: not on Dreamcast") }
func PvrWaitReady() int32              { panic("kos: not on Dreamcast") }
func PvrListBegin(list uint32) int32   { panic("kos: not on Dreamcast") }
func PvrListFinish() int32             { panic("kos: not on Dreamcast") }
func PvrPrim(hdr *PvrPolyHdr) int32    { panic("kos: not on Dreamcast") }
func PvrPrimVertex(v *PvrVertex) int32 { panic("kos: not on Dreamcast") }
func PvrMemMalloc(size uint32) PvrPtr  { panic("kos: not on Dreamcast") }
func PvrMemFree(ptr PvrPtr)            { panic("kos: not on Dreamcast") }
func PvrMemAvailable() uint32          { panic("kos: not on Dreamcast") }

Usage in Games

package main

import "kos"

func main() {
    kos.PvrInitDefaults()

    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()

        kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
        drawOpaqueGeometry()
        kos.PvrListFinish()

        kos.PvrListBegin(kos.PVR_LIST_TR_POLY)
        drawTranslucentGeometry()
        kos.PvrListFinish()

        kos.PvrSceneFinish()
    }
}

func drawOpaqueGeometry() {
    // First submit a polygon header
    var hdr kos.PvrPolyHdr
    var ctx kos.PvrPolyCxt
    kos.PvrPolyCxtCol(&ctx, kos.PVR_LIST_OP_POLY)
    kos.PvrPolyCompile(&hdr, &ctx)
    kos.PvrPrim(&hdr)
    
    // Then submit vertices
    v := kos.PvrVertex{
        Flags: kos.PVR_CMD_VERTEX_EOL,  // End of strip
        X: 320, Y: 240, Z: 1,
        ARGB: 0xffffffff,
    }
    kos.PvrPrimVertex(&v)
}

Limitations

This document describes the known limitations of libgodc. Understanding these is essential for writing reliable Dreamcast Go programs.

Memory

16MB Total

The Dreamcast has 16MB of RAM. No virtual memory, no swap, no second chance.

Budget your memory:

  • KOS + drivers: ~1MB
  • Your code: ~1-3MB
  • GC heap: 2MB active (4MB total, two semi-spaces)
  • Goroutine stacks: 64KB each
  • Everything else: KOS malloc

When you run out, you crash.

Goroutine Memory Overhead

Dead goroutines retain approximately 160 bytes each (G struct only). The stack memory and TLS are properly reclaimed, and the G struct is kept in a free list for reuse by future goroutines.

Why the free list? Reusing G structs avoids repeated malloc/free overhead. When you spawn a new goroutine, it reuses a G from the free list if available.

Impact: If you spawn 10,000 goroutines that all exit without spawning new ones, you’ll have ~1.6MB in the free list. This memory is reused when you spawn new goroutines. For a typical game session, this is rarely a problem if you design with long-lived goroutines.

Workaround: Prefer long-lived goroutines or let the free list grow to a stable size. If you spawn and exit many goroutines, the G structs accumulate in the free list but are reused:

// GOOD: Fixed set of long-lived goroutines
go audioHandler()      // Lives for entire game
go inputPoller()       // Lives for entire game
go gameLoop()          // Lives for entire game

// OK: Spawning goroutines per-event (G structs are reused)
for event := range events {
    go handleEvent(event)  // ~160B stays in free list for reuse
}

GC Pause Times

The garbage collector stops the world during collection. Pause times depend on live heap size:

| Live Heap | Pause |
|---|---|
| 100 KB | 1-2 ms |
| 500 KB | 5-10 ms |
| 1 MB | 10-20 ms |

At 60fps, you have 16.6ms per frame. A 10ms GC pause causes visible stutter.

Workarounds:

  1. Keep the live heap small (<500KB)
  2. Disable automatic GC for action sequences:
     debug.SetGCPercent(-1)  // Disable automatic GC
     runtime.GC()            // Manual GC during loading screens
  3. Use KOS malloc for large, long-lived data (textures, audio, levels)

Fixed 64KB Stacks

Goroutine stacks do not grow. Each goroutine gets exactly 64KB.

This limits recursion depth:

| Frame Size | Safe Depth |
|---|---|
| 50 bytes | ~300 |
| 100 bytes | ~150 |
| 250 bytes | ~60 |
| 500 bytes | ~30 |

Workarounds:

  1. Convert recursion to iteration
  2. Use smaller local variables
  3. Pass large data by pointer, not by value
  4. Avoid deep call chains

// BAD: Large local arrays
func processLevel(depth int) {
    var buffer [4096]byte  // 4KB per stack frame!
    // ... recursive call
}

// GOOD: Heap allocation for large buffers
func processLevel(depth int) {
    buffer := make([]byte, 4096)  // GC heap
    // ... recursive call
}
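The same idea extends to recursion itself: an explicit work list lives on the GC heap, so traversal depth no longer consumes stack. A minimal sketch with an illustrative tree type:

```go
package main

import "fmt"

type node struct {
	val      int
	children []*node
}

// sumIterative walks the tree with a heap-allocated work list instead of
// recursion, so stack use stays constant regardless of tree depth.
func sumIterative(root *node) int {
	total := 0
	work := []*node{root} // grows on the GC heap, not the 64 KB stack
	for len(work) > 0 {
		n := work[len(work)-1]
		work = work[:len(work)-1]
		total += n.val
		work = append(work, n.children...)
	}
	return total
}

func main() {
	// A deliberately deep chain that would strain a fixed stack if recursed.
	root := &node{val: 1}
	cur := root
	for i := 0; i < 10000; i++ {
		child := &node{val: 1}
		cur.children = []*node{child}
		cur = child
	}
	fmt.Println(sumIterative(root)) // 10001
}
```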

Scheduling

No Parallelism (M:1)

All goroutines run on a single thread. The go keyword provides concurrency (interleaved execution), not parallelism (simultaneous execution).

There is no benefit from GOMAXPROCS—the Dreamcast has one CPU core.

No Preemption

Goroutines yield only at explicit points:

  • Channel operations
  • runtime.Gosched()
  • time.Sleep()
  • Timer operations

A goroutine in a tight loop blocks all other goroutines:

// BAD: Blocks entire system
for {
    calculateNextFrame()  // Never yields!
}

// GOOD: Explicit yield
for {
    calculateNextFrame()
    runtime.Gosched()  // Let others run
}

Channel Lock Contention

Under high contention, channel locks use spin-yield loops. Many goroutines racing for the same channel wastes CPU.

Workaround: Use buffered channels to reduce contention:

// Unbuffered: every send/receive contends
events := make(chan Event)

// Buffered: reduced contention
events := make(chan Event, 16)

Language Features

Not Implemented

  • Race detector
  • CPU/memory profiling
  • Debugger support (delve, gdb)
  • Plugin package
  • cgo (use KOS C functions directly via //extern)

Limited Implementation

  • reflect: Basic type inspection only. No reflect.MakeFunc.
  • unsafe: Works, but remember pointers are 4 bytes.
  • sync: Mutexes work, but see M:1 scheduling caveat—no goroutine runs while you hold a lock, so deadlock is impossible but starvation is easy.

Unrecoverable Runtime Panics

User panic() is recoverable via recover(). Runtime panics are not:

  • Nil pointer dereference
  • Array/slice bounds check
  • Integer divide by zero
  • Stack overflow

These crash immediately. There is no recovery.

Why? A bounds check failure means your program’s invariants are violated. Continuing would corrupt data. It’s better to crash cleanly.
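In practice this means recover() remains your tool for user-level errors only. A sketch of the recoverable case:

```go
package main

import "fmt"

// safeCall demonstrates the recoverable case: an explicit panic() raised
// by user code can be intercepted with recover(). On libgodc, runtime
// faults (nil dereference, bounds check, etc.) never reach this handler.
func safeCall(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	f()
	return nil
}

func main() {
	err := safeCall(func() { panic("bad level data") })
	fmt.Println(err) // recovered: bad level data
}
```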

Platform Constraints

32-bit Pointers

All pointers are 4 bytes. Code assuming 64-bit pointers will break:

// BAD: Assumes 64-bit
type Header struct {
    flags uint32
    ptr   uintptr  // 4 bytes on Dreamcast, not 8!
    size  uint32
}

Single-Precision FPU

The SH-4 FPU operates in single precision (-m4-single). Double precision operations are emulated in software—extremely slow.

// FAST: Single precision
var x float32 = 3.14

// SLOW: Software emulation
var y float64 = 3.14159265358979

Avoid float64 in hot paths. The compiler flag -m4-single makes all FPU operations single precision, but libraries may still use doubles.
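One easy way doubles sneak in is the standard math package, which works entirely in float64. This sketch shows the silent promotion; on real hardware you would prefer a single-precision square root (e.g. a binding to KOS's float math helpers) over this round trip:

```go
package main

import (
	"fmt"
	"math"
)

// length32 computes a vector length. The math.Sqrt call takes and returns
// float64, so this line silently promotes to (software-emulated) doubles
// on the Dreamcast, even though every declared type is float32.
func length32(x, y, z float32) float32 {
	return float32(math.Sqrt(float64(x*x + y*y + z*z)))
}

func main() {
	fmt.Println(length32(3, 4, 0)) // 5
}
```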

Cache Coherency

The SH-4 has separate instruction and data caches. DMA operations require explicit cache management using KOS functions:

// Before DMA write (CPU -> hardware):
dcache_flush_range((uintptr_t)ptr, size);   // Flush data cache

// After DMA read (hardware -> CPU):
dcache_inval_range((uintptr_t)ptr, size);  // Invalidate data cache

The GC handles cache management for semi-space flips via incremental invalidation, but your DMA code must handle it explicitly using KOS cache functions.

No Signals

There are no Unix signals. os.Signal, signal.Notify, etc. don’t work. Use KOS’s interrupt handlers or polling instead.

No Networking (by default)

Networking requires a Broadband Adapter (BBA) or modem. Most Dreamcast units don’t have one. Design your game to work offline.

Debugging

Available

  • Serial output via println() (routed to dc-tool)
  • LIBGODC_ERROR / LIBGODC_CRITICAL macros (defined in runtime.h)
  • GC statistics via the C function gc_stats(&used, &total, &collections)
  • runtime.NumGoroutine() to count active goroutines
  • KOS debug console (dbglog())

Not Available

  • Stack traces on panic (limited)
  • Core dumps
  • Breakpoints
  • Variable inspection
  • Heap profiling

When something goes wrong, you have println() and your brain. Use them.

Compatibility

gccgo Only

This runtime is for gccgo (GCC’s Go frontend), not the standard gc compiler. Code compiled with go build will not work. Use sh-elf-gccgo.

KallistiOS Required

libgodc requires KallistiOS. It won’t work with other Dreamcast development libraries.

SH-4 Architecture Only

This code is specifically for the Hitachi SH-4 CPU. It won’t run on other architectures.

Summary

| Limitation | Impact | Workaround |
|---|---|---|
| G struct pooling | ~160B per dead goroutine | Long-lived goroutines |
| GC pauses | 1-20ms depending on heap | Small heap, manual GC timing |
| M:1 scheduling | No parallelism | Explicit yields |
| Fixed stacks | Limited recursion | Iteration, smaller frames |
| No preemption | Tight loops block all | runtime.Gosched() |
| Runtime panics | Unrecoverable | Defensive coding |
| 16MB RAM | Memory pressure | Monitor usage, plan carefully |

For typical Dreamcast games—15-60 minute sessions with a fixed goroutine architecture—these limitations are manageable. Design with constraints in mind from the start, and you’ll have a runtime that’s simple, fast, and reliable.

Glossary

Quick reference for terms used throughout this documentation.

Runtime Terms

Bump Allocator

An allocation strategy where memory is allocated by simply incrementing a pointer. O(1) allocation, but cannot free individual objects. libgodc uses this for the GC heap.
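In miniature (field names and arena size are illustrative, not libgodc's actual implementation):

```go
package main

import "fmt"

// bump is a toy bump allocator: allocation is one pointer increment.
// Individual frees are impossible; the whole arena is reset at once.
type bump struct {
	buf  []byte
	next int
}

// alloc returns the offset of an n-byte allocation rounded up to 4-byte
// alignment, or -1 when the arena is exhausted (on the real GC heap,
// exhaustion triggers a collection instead).
func (b *bump) alloc(n int) int {
	n = (n + 3) &^ 3 // SH-4 favors 4-byte alignment
	if b.next+n > len(b.buf) {
		return -1
	}
	off := b.next
	b.next += n
	return off
}

func main() {
	b := &bump{buf: make([]byte, 16)}
	fmt.Println(b.alloc(5), b.alloc(8), b.alloc(8)) // 0 8 -1
}
```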

Cheney’s Algorithm

A garbage collection algorithm that copies live objects from one semispace to another using two pointers (scan and alloc). Named after C.J. Cheney who invented it in 1970.

Context Switch

Saving one goroutine’s CPU registers and loading another’s, allowing multiple goroutines to share a single CPU. On SH4, this involves saving 64 bytes of state.

Cooperative Scheduling

A scheduling model where goroutines must voluntarily yield control. Contrast with preemptive scheduling where the runtime can interrupt goroutines at any time.

Forwarding Pointer

During garbage collection, a pointer left in an object’s old location that points to its new location. Prevents copying the same object twice.

G (Goroutine Struct)

The data structure representing a goroutine. Contains stack bounds, saved CPU context, defer chain, panic state, and scheduling information.

GC Heap

The memory region managed by the garbage collector. In libgodc, this is 4MB total (two 2MB semispaces), with 2MB usable at any time.

hchan

The internal structure representing a Go channel. Contains the buffer, send/receive indices, and wait queues.

M:1 Model

A threading model where many goroutines (M) run on one OS thread (1). All goroutines share a single CPU, providing concurrency but not parallelism.

Root

A starting point for garbage collection tracing. Roots include global variables, stack variables, and CPU registers that contain pointers.

Run Queue

A list of goroutines that are ready to execute. The scheduler picks goroutines from this queue.

SemiSpace Collector

A garbage collector that divides memory into two equal halves. Objects are allocated in one half; during collection, live objects are copied to the other half.

Stop the World

A GC phase where all program execution pauses while the collector runs. libgodc uses stop-the-world collection exclusively.

Sudog

“Sender/receiver descriptor”: a structure representing a goroutine waiting on a channel operation. Contains pointers to the goroutine, the channel, and the data being transferred.

TLS (Thread-Local Storage)

Per-goroutine storage. In libgodc, each goroutine has its own TLS block containing runtime state.

Type Descriptor

Compiler-generated metadata about a Go type, including size, alignment, hash, and a bitmap indicating which fields contain pointers.

Hardware Terms

AICA

The Dreamcast’s sound processor. An ARM7-based chip with 2MB of dedicated sound RAM. Runs independently of the SH4 CPU.

Cache Line

The unit of data transfer between cache and main memory. 32 bytes on SH4. Accessing one byte loads the entire cache line.

GBR (Global Base Register)

An SH4 register reserved for thread-local storage in KallistiOS. libgodc does not use GBR for goroutine TLS.

KallistiOS (KOS)

The standard open-source SDK for Dreamcast homebrew development. Provides hardware abstraction, memory management, and drivers. It’s pronounced “kay-oss”, like the word “chaos”.

PowerVR2

The Dreamcast’s GPU. A tile-based deferred renderer with 8MB of dedicated VRAM.

SH4

The Hitachi (now Renesas) SuperH-4 processor used in the Dreamcast. 200MHz, 32-bit, little-endian, with an FPU optimized for single-precision math.

VRAM

Video RAM. 8MB dedicated to the PowerVR2 GPU for textures and framebuffers. Allocated via PvrMemMalloc(), not the GC.

Go Terms

//extern

A gccgo directive that declares a function implemented in C. Allows Go code to call KOS functions directly.

Escape Analysis

Compiler analysis that determines whether a variable can stay on the stack or must be allocated on the heap.

gccgo

The GCC frontend for Go. Uses GCC’s backend for code generation, supporting architectures like SH4 that the standard Go compiler doesn’t support.

Interface

A Go type that specifies a set of methods. Variables of interface type can hold any value that implements those methods.

libgo

The standard gccgo runtime library. libgodc replaces this with a Dreamcast-specific implementation.

Slice Header

The 12-byte structure representing a Go slice: a pointer to the backing array, length, and capacity.

String Header

The 8-byte structure representing a Go string: a pointer to the character data and length.

Abbreviations

| Abbr | Full Form | Meaning |
|---|---|---|
| ABI | Application Binary Interface | How functions pass arguments and return values |
| BBA | Broadband Adapter | Dreamcast network adapter (10/100 Ethernet) |
| DMA | Direct Memory Access | Hardware-to-hardware memory transfer without CPU |
| FPU | Floating Point Unit | CPU component for floating-point math |
| GC | Garbage Collector | Automatic memory management system |
| KB | Kilobyte | 1,024 bytes |
| MB | Megabyte | 1,048,576 bytes |
| MMU | Memory Management Unit | Hardware for virtual memory (the Dreamcast doesn’t have one) |
| PC | Program Counter | CPU register pointing to the current instruction |
| PR | Procedure Register | SH4 register holding the return address |
| SP | Stack Pointer | CPU register pointing to the top of the stack |
| TA | Tile Accelerator | PowerVR2 component that processes geometry |
| TLS | Thread-Local Storage | Per-thread/goroutine private data |
| VMU | Visual Memory Unit | Dreamcast memory card with an LCD screen |

Performance Numbers

Reference benchmarks from real Dreamcast hardware (200MHz SH4).

Verified using tests/bench_architecture.elf:

| Operation | Time | Notes |
|---|---|---|
| runtime.Gosched() | 120 ns | Minimal yield |
| Direct function call | 140 ns | Baseline comparison |
| Buffered channel op | 1,459 ns | ~1.5 μs |
| Context switch | 6,634 ns | ~6.6 μs, full register save/restore |
| Unbuffered channel round-trip | 12,782 ns | ~13 μs, send + receive |
| Goroutine spawn + run | 33,659 ns | ~34 μs, 240× overhead vs direct call |

GC Pause Times

| Scenario | Pause | Notes |
|---|---|---|
| Minimal/bypass (≥128 KB objects) | 73 μs | Objects bypass GC heap |
| 64 KB live data | 2,199 μs | ~2.2 ms |
| 32 KB live data | 6,172 μs | ~6.2 ms |

Note: Objects ≥64 KB bypass the GC heap and go directly to malloc, hence the minimal pause. The 32 KB scenario with many small objects shows the highest pause because more objects must be scanned and copied.

Memory Configuration

| Parameter | Value |
|---|---|
| Goroutine stack | 64 KB |
| Context size | 64 bytes |
| GC header | 8 bytes |
| Large object threshold | 64 KB |

Run tests/bench_architecture.elf on your hardware to verify these numbers.

Acknowledgements

Kudos to:

  • Ian Lance Taylor for writing gccgo.
  • KallistiOS team for building and maintaining the Dreamcast SDK.
  • Dreamcast homebrew community for keeping the console alive.

Without you, there would be no libgodc project.