
libgodc

Welcome to libgodc — a minimal Go runtime implementation for the Sega Dreamcast.

This project brings the Go programming language to a 1998 game console with 16MB of RAM, a 200MHz SH4 processor, and absolutely no operating system to speak of. It’s an exercise in constraints, a love letter to retro hardware, and a deep dive into how programming languages actually work under the hood.

What is libgodc?

libgodc replaces the standard Go runtime (libgo) with one designed for the Dreamcast’s unique constraints:

| Feature   | Desktop Go          | libgodc                  |
|-----------|---------------------|--------------------------|
| Memory    | Gigabytes           | 16 MB total              |
| CPU       | Multi-core, GHz     | Single-core, 200 MHz     |
| Scheduler | Preemptive          | Cooperative              |
| GC        | Concurrent tricolor | Stop-the-world semispace |
| Stacks    | Growable            | Fixed 64 KB              |

Despite these differences, you write normal Go code. Goroutines work. Channels work. Maps, slices, interfaces — they all work. The magic is in the runtime.

Who is this for?

  • Systems programmers curious about runtime implementation
  • Go developers who want to understand what happens below go run
  • Retro enthusiasts who think game consoles deserve modern languages
  • Anyone who enjoys the challenge of severe constraints

Prerequisites

Before diving in, you should be comfortable with:

| Skill        | Level        | Why You Need It                                   |
|--------------|--------------|---------------------------------------------------|
| Go           | Intermediate | Variables, functions, structs, goroutines, channels |
| C            | Basic        | Pointers, memory layout, basic syntax             |
| Command line | Comfortable  | Building, running, navigating directories         |

You don’t need to know:

  • Assembly language (we’ll explain what you need)
  • Dreamcast hardware (KallistiOS handles the hard parts)
  • Garbage collection algorithms (we’ll build one together)
  • Operating system internals (we’ll cover what’s relevant)

If you can write a Go program that uses goroutines and channels, and you know what a pointer is in C, you’re ready.

What’s in this book?

Getting Started

Installation, toolchain setup, and your first Dreamcast Go program.

The Book

A complete walkthrough of building a Go runtime from scratch:

  • Memory allocation and garbage collection
  • Goroutine scheduling without threads
  • Channel implementation
  • Panic, defer, and recover
  • Building real games

Reference

Technical documentation for daily use:

  • API design
  • Best practices
  • Hardware integration
  • Known limitations

Quick Example

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    
    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()
        kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
        // draw stuff here
        kos.PvrListFinish()
        kos.PvrSceneFinish()
    }
}

This runs on a Dreamcast. Real hardware. 1998 technology. Go code.

Getting Started

Ready to begin? Head to the Installation page.

Or if you want to understand the full journey, start with Building From Nothing.

"Console development is the art of saying 'no' to malloc."

Installation

Requirements

  • A Unix-like system (Linux, macOS, WSL2)
  • 4GB disk space for the toolchain
  • An x86_64 or arm64 host
  • Go 1.25.3 or later — required to install the godc CLI tool
  • make — required for building projects
  • git — required for toolchain setup and updates

Quick Start

The godc tool automates everything:

go install github.com/drpaneas/godc@latest
godc setup

This downloads the prebuilt toolchain to ~/dreamcast and configures your environment. Run godc doctor to verify the installation.

godc Commands

| Command       | Description                                  |
|---------------|----------------------------------------------|
| godc setup    | Install the entire toolchain from scratch    |
| godc config   | Configure paths and settings                 |
| godc init     | Create project files in the current directory |
| godc build    | Compile your game                            |
| godc run      | Build and run in the emulator                |
| godc run --ip | Build and run on a real Dreamcast via BBA    |
| godc clean    | Remove build artifacts                       |
| godc doctor   | Check that everything is installed           |
| godc update   | Update libgodc to the latest version         |
| godc env      | Show current paths                           |
| godc version  | Print the godc version                       |

Configuration

godc stores its config in ~/.config/godc/config.toml:

Path = "/home/user/dreamcast"    # Toolchain location
Emu = "flycast"                  # Default emulator
IP = "192.168.2.203"             # Dreamcast IP for dc-tool

To update settings interactively:

godc config

Manual Installation

If the automated setup doesn’t work for your environment:

Step 1: Get the Toolchain

Download the prebuilt toolchain for your platform:

# Linux x86_64
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-linux-x86_64.tar.gz

# Linux arm64 (aarch64)
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-linux-arm64.tar.gz

# macOS arm64 (Apple Silicon)
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-darwin-arm64.tar.gz

Step 2: Extract

mkdir -p ~/dreamcast
tar -xf dreamcast-toolchain-*.tar.gz -C ~/dreamcast --strip-components=1

The toolchain contains:

~/dreamcast/
├── sh-elf/           # Cross-compiler (sh-elf-gccgo, binutils)
├── kos/              # KallistiOS (OS, drivers, headers)
├── libgodc/          # This library (Go runtime)
└── tools/            # Utilities (elf2bin, makeip, etc.)

Step 3: Environment

Add these to your shell configuration (~/.bashrc, ~/.zshrc, etc.):

export PATH="$HOME/dreamcast/sh-elf/bin:$PATH"
source ~/dreamcast/kos/environ.sh

environ.sh sets KOS_BASE, KOS_ARCH, and other build variables.

Step 4: Verify

sh-elf-gccgo --version
# Should print: sh-elf-gccgo (GCC) 15.x.x ...

ls $KOS_BASE/lib/libgodc.a
# Should exist

Building libgodc from Source

If you need to modify the runtime, or if prebuilt libraries aren’t available:

git clone https://github.com/drpaneas/libgodc ~/dreamcast/libgodc
cd ~/dreamcast/libgodc
source ~/dreamcast/kos/environ.sh
make clean
make
make install

This builds libgodc.a (the runtime) and libgodcbegin.a (startup code), then installs them to $KOS_BASE/lib/.

Debug Build

For development, enable debug output:

make DEBUG=1

This adds -DLIBGODC_DEBUG=1 -g to the compiler flags, enabling trace output and symbols.

Running Code

Emulator

lxdream-nitro or flycast can run Dreamcast binaries.

cd examples/hello
make
flycast hello.elf

Real Hardware

With a Broadband Adapter or serial cable:

# Upload via IP (BBA)
dc-tool-ip -t 192.168.1.100 -x hello.elf

# Upload via serial
dc-tool-ser -t /dev/ttyUSB0 -x hello.elf

The godc run command automates this:

godc run              # Uses configured emulator
godc run --ip         # Uses dc-tool-ip with configured address

Project Structure

A minimal project:

myproject/
├── go.mod            # Module definition
├── main.go           # Your code
├── .Makefile         # Build rules (generated by godc)
└── romdisk/          # Optional: game assets
    ├── texture.png
    └── sound.wav

Example 1: Minimal (hello)

The simplest program — no graphics, just debug output:

main.go:

// Minimal Dreamcast program
package main

func main() {
    println("Hello, Dreamcast!")
}

go.mod (generated by godc init):

module hello

go 1.25.3

replace kos => ~/dreamcast/libgodc/kos

Example 2: Screen Output (hello_screen)

Display text on screen using the BIOS font:

main.go:

// Hello World on Dreamcast screen using BIOS font
package main

import "kos"

func main() {
    // center "Hello World" on 640x480 screen
    x := 640/2 - (11*kos.BFONT_THIN_WIDTH)/2
    y := 480/2 - kos.BFONT_HEIGHT/2
    offset := y*640 + x

    kos.BfontDrawStr(kos.VramSOffset(offset), 640, true, "Hello World")

    for {
        kos.TimerSpinSleep(100)
    }
}

go.mod (generated by godc init):

module hello_screen

go 1.25.3

replace kos => ~/dreamcast/libgodc/kos

require kos v0.0.0-00010101000000-000000000000

Build and Run

godc init             # Generate go.mod and .Makefile
godc build            # Compile to .elf
godc run              # Launch in emulator

Or manually:

sh-elf-gccgo -O2 -ml -m4-single -fno-split-stack -mfsrra -mfsca \
    -I$KOS_BASE/lib -L$KOS_BASE/lib \
    -c main.go -o main.o

kos-cc -o myproject.elf main.o \
    -L$KOS_BASE/lib -Wl,--whole-archive -lgodcbegin \
    -Wl,--no-whole-archive -lkos -lgodc

Romdisks — Packaging Assets

A romdisk is a read-only filesystem compiled into your executable. Put assets in the romdisk/ directory:

myproject/
├── main.go
└── romdisk/
    ├── player.png
    └── music.wav

The build system automatically:

  1. Creates romdisk.img using genromfs
  2. Converts it to romdisk.o using bin2o
  3. Links it into your executable

Access files in Go via /rd/:

texture := kos.PlxTxrLoad("/rd/player.png", true, 0)
sound := kos.SndSfxLoad("/rd/music.wav")

Compiler Flags

Default flags used by godc:

| Flag             | Purpose                          |
|------------------|----------------------------------|
| -O2              | Standard optimization            |
| -ml              | Little-endian mode               |
| -m4-single       | SH-4 with single-precision FPU   |
| -fno-split-stack | Fixed-size goroutine stacks      |
| -mfsrra          | Hardware reciprocal square root  |
| -mfsca           | Hardware sin/cos lookup          |

For maximum performance:

GODC_FAST=1 godc build

This enables -O3 -ffast-math -funroll-loops. Warning: -ffast-math breaks IEEE floating-point compliance.

Project Overrides

Create godc.mk for project-specific customizations:

# Reduce GC heap to free RAM for assets
CFLAGS += -DGC_SEMISPACE_SIZE_KB=1024

# Add extra libraries
LIBS += -lmy_custom_lib

# Custom romdisk location
ROMDISK_DIR = assets

Troubleshooting

“sh-elf-gccgo: command not found”

The compiler isn’t in your PATH. Check:

echo $PATH | tr ':' '\n' | grep dreamcast
which sh-elf-gccgo

“cannot find -lgodc”

The runtime library isn’t installed. Build and install it:

cd ~/dreamcast/libgodc
make install
ls $KOS_BASE/lib/libgodc.a

“undefined reference to `__go_runtime_init’”

You’re linking with the wrong library order. The correct order is:

-Wl,--whole-archive -lgodcbegin -Wl,--no-whole-archive -lkos -lgodc

-lgodcbegin must be wrapped in --whole-archive to ensure all its symbols are included.

Runtime crashes immediately

Check if your program uses double-precision floats. The SH-4 FPU is single-precision only. Compile with -m4-single and avoid float64 in hot paths.

Out of memory

The Dreamcast has 16MB. Check your allocations using the C API:

#include "gc_semispace.h"

size_t used, total;
uint32_t collections;
gc_stats(&used, &total, &collections);
printf("Heap: %zu / %zu bytes, %u collections\n", used, total, collections);

From Go, you can count goroutines:

println("Goroutines:", runtime.NumGoroutine())

Consider using KOS malloc directly for large buffers:

ptr := kos.PvrMemMalloc(size)  // PVR VRAM
ptr := kos.Malloc(size)        // KOS heap

Next Steps

Quick Start

Let’s create your first Dreamcast Go program.

Create a Project

mkdir myproject && cd myproject
godc init

Example output:

$ godc init
go: found kos in kos v0.0.0-00010101000000-000000000000

This creates go.mod and go.work files that configure your project to use the kos package from your libgodc installation.

Project Structure

A minimal project looks like this:

myproject/
├── go.mod            # Module definition with kos dependency
├── go.work           # Workspace configuration
└── main.go           # Your code

The go.mod file (paths will match your libgodc location):

module myproject

go 1.25.3

replace kos => /path/to/your/libgodc/kos

require kos v0.0.0-00010101000000-000000000000

The go.work file:

go 1.25.3

use (
        /path/to/your/libgodc
        .
)

Note: The paths in go.mod and go.work will automatically point to your libgodc installation location.

Hello, Dreamcast!

Create main.go:

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    println("Hello, Dreamcast!")
    for {}
}

Build and Run

Using godc:

godc build            # Compile to .elf
godc run              # Launch in emulator

Or manually with sh-elf-gccgo:

sh-elf-gccgo -O2 -ml -m4-single -fno-split-stack -mfsrra -mfsca \
    -I$KOS_BASE/lib -L$KOS_BASE/lib \
    -c main.go -o main.o

kos-cc -o myproject.elf main.o \
    -L$KOS_BASE/lib -Wl,--whole-archive -lgodcbegin \
    -Wl,--no-whole-archive -lkos -lgodc

Your First Graphics

Let’s draw something on screen:

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    
    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()
        
        // Draw opaque geometry
        kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
        drawTriangle()
        kos.PvrListFinish()
        
        kos.PvrSceneFinish()
    }
}

func drawTriangle() {
    // Create and submit polygon header
    var hdr kos.PvrPolyHdr
    var ctx kos.PvrPolyCxt
    kos.PvrPolyCxtCol(&ctx, kos.PVR_LIST_OP_POLY)
    kos.PvrPolyCompile(&hdr, &ctx)
    kos.PvrPrim(&hdr)  // Submit header
    
    // Submit vertices (use PvrPrimVertex for vertices)
    v := kos.PvrVertex{
        Flags: kos.PVR_CMD_VERTEX,
        X: 320, Y: 100, Z: 1,
        ARGB: 0xFFFF0000,  // Red
    }
    kos.PvrPrimVertex(&v)
    
    v.X, v.Y = 200, 400
    v.ARGB = 0xFF00FF00  // Green
    kos.PvrPrimVertex(&v)
    
    v.X, v.Y = 440, 400
    v.Flags = kos.PVR_CMD_VERTEX_EOL  // End of strip
    v.ARGB = 0xFF0000FF  // Blue
    kos.PvrPrimVertex(&v)
}

Using Goroutines

Goroutines work on Dreamcast:

package main

import (
    "kos"
    "runtime"
)

func main() {
    kos.PvrInitDefaults()

    // Start a background goroutine
    go func() {
        counter := 0
        for {
            counter++
            println("Background:", counter)
            runtime.Gosched() // Yield to the cooperative scheduler
        }
    }()

    // Main loop
    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()
        render()
        kos.PvrSceneFinish()
    }
}

Using Channels

Channels enable communication between goroutines:

package main

import "kos"

func main() {
    kos.PvrInitDefaults()
    
    // Create a buffered channel
    scores := make(chan int, 10)
    
    // Score counter goroutine
    go func() {
        total := 0
        for score := range scores {
            total += score
            println("Total score:", total)
        }
    }()
    
    // Main game loop
    for {
        // Game logic
        if playerScored() {
            scores <- 100  // Send score
        }
        render()
    }
}

Next Steps

Building From Nothing

The Real Starting Point

Most documentation starts after the hard part. “Here’s the GC” assumes you know you need one. “Here’s how goroutines work” assumes you figured out the symbol names.

Let’s go back to the real beginning:

DAY 0: THE SITUATION

You have:
• sh-elf-gccgo (Go compiler for SH-4)
• KallistiOS (Dreamcast SDK)
• A simple Go program: println("Hello, Dreamcast!")

You try to compile it. What happens?

$ sh-elf-gccgo -c hello.go
$ sh-elf-gcc hello.o -o hello.elf

LINKER ERRORS. Hundreds of them.

undefined reference to `runtime.printstring'
undefined reference to `runtime.printnl'
undefined reference to `__go_runtime_error'
undefined reference to `runtime.newobject'
...

Those undefined references are the holes we discussed in Chapter 2. The compiler generated calls to runtime functions that don’t exist.

Your job: Provide implementations for every one of them.


Part 1: The Discovery Process

How Do You Know What gccgo Expects?

This is the question nobody answers. Where is it documented? What’s the ABI?

Answer: It’s not well-documented. You have to investigate.

Here’s the process we used:

Method 1: Read the Linker Errors

The linker tells you exactly what’s missing:

sh-elf-gccgo -c myprogram.go -o myprogram.o
sh-elf-gcc myprogram.o -o myprogram.elf 2>&1 | grep "undefined reference"

You’ll see output like:

undefined reference to `runtime.printstring'
undefined reference to `runtime.printnl'
undefined reference to `__go_runtime_error'
undefined reference to `runtime.newobject'
undefined reference to `runtime.makeslice'

Start here. Each undefined symbol is a function you need to write.

Method 2: Read the gccgo Source

The gccgo frontend lives in the GCC source tree. The key directories:

gcc/go/gofrontend/      ← The Go parser and type checker
libgo/runtime/          ← The reference runtime (for Linux)
libgo/go/               ← Go standard library

When gccgo compiles make([]int, 10), it emits a call to runtime.makeslice. To find the expected signature:

# In the GCC source tree
grep -r "makeslice" libgo/runtime/

You’ll find the actual implementation. Study its parameters and return type.

Method 3: Use nm on Object Files

Compile your Go code and inspect what symbols it references:

sh-elf-gccgo -c test.go -o test.o
sh-elf-nm test.o | grep " U "   # "U" = undefined (needs linking)

This shows you every external symbol your code needs.

Method 4: Disassemble and Trace

When things don’t work, disassemble:

sh-elf-objdump -d test.o | less

Look at how functions are called. What registers hold arguments? What’s expected in return registers?

The Symbol Naming Convention

gccgo uses a specific naming scheme:

| Go Concept        | Symbol Name             |
|-------------------|-------------------------|
| runtime.X         | runtime.X (literal dot) |
| main.foo          | main.foo                |
| Method on type T  | T.MethodName            |
| Interface method  | Complex mangling        |

Since C can’t have dots in identifiers, we use the __asm__ trick:

void runtime_printstring(String s) __asm__("runtime.printstring");

void runtime_printstring(String s) {
    // Implementation
}
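Putting the aliasing trick to work, a print function can be wired up end to end. This is a sketch under assumptions: the {pointer, length} String layout matches what gccgo passes, but the field names and the fwrite call are illustrative, not libgodc's actual code.

```c
#include <stdio.h>
#include <stdint.h>

/* Go string header as gccgo passes it: data pointer plus length,
   no NUL terminator. Field names here are illustrative. */
typedef struct {
    const unsigned char *str;
    intptr_t len;
} String;

/* Alias the C function to the dotted symbol the compiler emits. */
void runtime_printstring(String s) __asm__("runtime.printstring");

void runtime_printstring(String s) {
    /* Can't use printf("%s"): Go strings aren't NUL-terminated. */
    fwrite(s.str, 1, (size_t)s.len, stdout);
}
```

Every print primitive (printnl, printint, and friends) follows the same pattern: a C function with an ordinary name, aliased to the dotted symbol gccgo expects.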

Part 2: The Build Order

You can’t build everything at once. There are dependencies:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   DEPENDENCY GRAPH                                          │
│                                                             │
│                       ┌─────────┐                           │
│                       │ println │                           │
│                       └────┬────┘                           │
│                            │ needs                          │
│                       ┌────▼────┐                           │
│                       │ strings │                           │
│                       └────┬────┘                           │
│                            │ needs                          │
│                       ┌────▼────┐                           │
│                       │ memory  │                           │
│                       │ alloc   │                           │
│                       └────┬────┘                           │
│                            │ needs                          │
│                       ┌────▼────┐                           │
│                       │  heap   │                           │
│                       │  init   │                           │
│                       └─────────┘                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Milestone 1: Hello World

Goal: Print a string. No GC, no goroutines, nothing fancy.

What you need:

  1. Memory allocator — Even println allocates internally
  2. Print functions — runtime.printstring, runtime.printnl, runtime.printint
  3. String support — Go strings are {pointer, length} structs
  4. Entry point — Something to call main.main

The minimal files:

runtime/
├── go-main.c           # Entry point, calls main.main
├── malloc_dreamcast.c  # Basic malloc wrapper
├── go-print.c          # Print functions
└── runtime.h           # Common definitions

Test:

package main

func main() {
    println("Hello, Dreamcast!")
}

If this prints, you have a foundation.
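The entry point itself can be tiny. A hedged sketch of what go-main.c might look like: the symbol main.main is what gccgo emits for func main() in package main; the go_entry name and the stub Go main are our own, provided here only so the sketch stands alone (in a real build the Go object file defines main.main).

```c
#include <stdio.h>

/* The Go side: gccgo emits your func main() under the symbol "main.main".
   We provide a stub here only to make the sketch self-contained. */
void main_main(void) __asm__("main.main");
void main_main(void) { puts("Hello, Dreamcast!"); }

/* The C side: in the real runtime this logic lives in main(). It sets up
   whatever the runtime needs, then hands control to Go and never expects
   it back until the program ends. */
int go_entry(void) {
    /* heap init, main-goroutine setup, etc. would go here */
    main_main();   /* run the Go program */
    return 0;
}
```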

Milestone 2: Basic Types

Goal: Slices, arrays, basic type operations.

What you need:

  1. makeslice — Create slices
  2. growslice — Append to slices
  3. Type descriptors — Compiler generates these, you need to understand them
  4. Memory operations — memcpy, memset, memmove wrappers

New files:

runtime/
├── slice_dreamcast.c   # Slice operations
├── string_dreamcast.c  # String operations
└── type_descriptors.h  # Type metadata structures

Test:

package main

func main() {
    s := make([]int, 5)
    s[0] = 42
    println(s[0])
}
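The slice header behind that test is small. A simplified sketch, with assumptions called out: gccgo's slice layout is {data, len, cap}, but the real runtime.makeslice takes a type descriptor and allocates from the GC heap; here we take a plain element size and use calloc, since Go requires zeroed memory.

```c
#include <stdint.h>
#include <stdlib.h>

/* Go slice header as gccgo lays it out: data pointer, length, capacity.
   Field names are illustrative. */
typedef struct {
    void *data;
    intptr_t len;
    intptr_t cap;
} Slice;

/* Simplified stand-in for runtime.makeslice (the real one takes a type
   descriptor and uses the GC heap). calloc zeroes the memory, matching
   Go's guarantee that new slice elements start at their zero value. */
Slice makeslice_sketch(size_t elem_size, intptr_t len, intptr_t cap) {
    Slice s;
    s.data = calloc((size_t)cap, elem_size);
    s.len = len;
    s.cap = cap;
    return s;
}
```

growslice then becomes: allocate a bigger backing array (typically doubling cap), memcpy the old elements over, and return the updated header.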

Milestone 3: Panic and Defer

Goal: Error handling works.

Why before GC? Because GC needs defer for cleanup. And panic is simpler than GC.

What you need:

  1. Defer chain — Linked list of deferred calls per goroutine
  2. Panic mechanism — setjmp/longjmp based
  3. Recover — Check if in deferred function

Test:

package main

func main() {
    defer println("world")
    println("hello")
}
// Should print: hello, then world
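The defer chain itself is just a linked list run in LIFO order. A sketch under assumptions: structure and function names are ours, and the real runtime keeps one chain per goroutine rather than a single global.

```c
#include <stdlib.h>

/* Hypothetical defer record: one node per `defer` statement, pushed onto
   a per-goroutine linked list and run in LIFO order on function exit. */
typedef struct Defer {
    void (*fn)(void *arg);   /* deferred function */
    void *arg;               /* its argument, captured at defer time */
    struct Defer *next;
} Defer;

static Defer *defer_chain;   /* head = most recently deferred */

void defer_push(void (*fn)(void *), void *arg) {
    Defer *d = malloc(sizeof(Defer));
    d->fn = fn;
    d->arg = arg;
    d->next = defer_chain;
    defer_chain = d;
}

/* Run on function return (or while a panic unwinds through the frame). */
void defer_run_all(void) {
    while (defer_chain) {
        Defer *d = defer_chain;
        defer_chain = d->next;   /* pop before calling, as Go does */
        d->fn(d->arg);
        free(d);
    }
}

/* Tiny recorder, used only to demonstrate the LIFO ordering. */
static char order[8];
static int order_n;
void record_char(void *arg) { order[order_n++] = *(char *)arg; }
char *defer_order(void) { return order; }
int defer_count(void) { return order_n; }
```

Because the head is always the most recent defer, popping the list naturally replays deferred calls in reverse order, which is exactly Go's semantics.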

Milestone 4: Maps

Goal: Hash tables work.

The problem: Go maps have complex semantics:

  • Iteration order is randomized
  • Growing rehashes everything
  • Keys can be any comparable type

What you need:

  1. Hash function — For each key type
  2. Bucket structure — Go uses a specific layout
  3. makemap, mapaccess, mapassign, mapdelete — Core operations
  4. Map iteration — Complex state machine

Lesson learned: Map iteration state is stored in a hiter struct. If you get this wrong, range loops break mysteriously.
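To make the hash-and-bucket step concrete, here is a sketch of a byte-key hash. This is not libgodc's actual hash (the real runtime installs a hash function per key type); FNV-1a is just a reasonable stand-in for string keys on a 32-bit machine like the SH-4.

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a, a simple stand-in for the per-type hash functions the real
   runtime uses for map keys. */
uint32_t map_hash_bytes(const void *key, size_t len) {
    const unsigned char *p = key;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Bucket selection: keeping a power-of-two bucket count lets the modulo
   reduce to a single mask, which matters on a 200 MHz CPU. */
uint32_t map_bucket(uint32_t hash, uint32_t nbuckets_pow2) {
    return hash & (nbuckets_pow2 - 1);
}
```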

Milestone 5: Garbage Collection

Goal: Automatic memory management.

Design decision: We chose semi-space copying GC because:

  • No fragmentation
  • Simple implementation
  • Predictable pause times (though not short)

What you need:

  1. Root scanning — Find all pointers on stack and in globals
  2. Object copying — Move live objects to new space
  3. Pointer updating — Fix all references
  4. Type bitmaps — Know which words are pointers

The hard part: Knowing which stack slots are pointers. gccgo generates __gcdata bitmaps for types, but stack scanning is conservative.
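The allocation side of a semi-space collector is refreshingly simple. A sketch under assumptions: space size, names, and alignment are illustrative (the real spaces are 2 MB each), and the copying of live objects during a flip (Cheney's algorithm, driven by the root scan above) is deliberately omitted.

```c
#include <stdint.h>
#include <string.h>

/* Two equally sized semi-spaces; allocation bumps a pointer in from-space.
   Sizes are illustrative. */
#define SPACE_SIZE 1024

static unsigned char space_a[SPACE_SIZE], space_b[SPACE_SIZE];
static unsigned char *from_space = space_a, *to_space = space_b;
static size_t alloc_off;

/* Bump allocation: the fast path is just a pointer increment. */
void *gc_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;           /* 8-byte align */
    if (alloc_off + n > SPACE_SIZE)
        return 0;                        /* a real GC would collect here */
    void *p = from_space + alloc_off;
    alloc_off += n;
    memset(p, 0, n);                     /* Go requires zeroed memory */
    return p;
}

/* The "flip": swap the roles of the two spaces. A real collector first
   copies every live object into to_space and fixes up all pointers;
   this sketch shows only the bookkeeping. */
void gc_flip(void) {
    unsigned char *tmp = from_space;
    from_space = to_space;
    to_space = tmp;
    alloc_off = 0;
}

size_t gc_used(void) { return alloc_off; }
```

Everything not copied during the flip is reclaimed for free, which is why semi-space collectors never fragment.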

Milestone 6: Goroutines

What you need:

  1. G struct — Goroutine state
  2. Stack allocation — Each goroutine needs its own stack
  3. Context switching — Save/restore CPU registers (assembly!)
  4. Scheduler — Pick which goroutine runs next
  5. Run queue — List of runnable goroutines

The assembly is unavoidable: you must write swapcontext in SH-4 assembly. There is no way around it. Context switching means saving and restoring the actual CPU registers, and C gives you no direct access to them; the compiler manages the registers behind your back.

! Save current context
mov.l   r8, @-r4
mov.l   r9, @-r4
! ... save all callee-saved registers ...

! Load new context
mov.l   @r5+, r8
mov.l   @r5+, r9
! ... restore all registers ...

rts
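The C side only sees the struct that assembly fills in. A sketch under assumptions: the field set follows the SH-4 ABI's callee-saved registers (r8-r14, r15 as the stack pointer, and pr for the return address), but the layout, names, and the swapcontext_sh4 prototype are illustrative; FPU registers are omitted for brevity, and in a real runtime the layout must match the assembly exactly.

```c
#include <stdint.h>
#include <stddef.h>

/* Saved-register block a context switch fills in. SH-4 callee-saved
   registers per the ABI: r8-r14, r15 (stack pointer), pr (return address).
   Layout is illustrative and must match the assembly byte for byte. */
typedef struct {
    uint32_t r8, r9, r10, r11, r12, r13, r14;
    uint32_t sp;   /* r15 */
    uint32_t pr;   /* where execution resumes */
} GoContext;

/* The switch itself lives in a .S file; C only sees the prototype.
   Name is illustrative. */
extern void swapcontext_sh4(GoContext *save, GoContext *restore);
```

The assembly stores each register at a fixed offset in this struct, so the offsets are part of the ABI between your C scheduler and your .S file: change one without the other and goroutines resume with garbage registers.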

Milestone 7: Channels

Goal: Goroutines can communicate.

Channels require:

  • Wait queues (goroutines blocked on send/receive)
  • Buffered storage (ring buffer)
  • Select statement (waiting on multiple channels)

The “3 days of debugging” commit touched channels. The issue was usually:

  • Waking the wrong goroutine
  • Corrupting state during concurrent access
  • Stack misalignment after context switch
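The buffered-storage part of a channel is a plain ring buffer. A sketch under assumptions: names and the int element type are ours, and the hard parts (wait queues of parked goroutines, select) are deliberately left out, so both operations here are non-blocking.

```c
#include <stddef.h>

/* Buffered channel storage: a fixed ring buffer. The real channel struct
   also carries wait queues of blocked goroutines and an element type
   descriptor; only the buffering is shown here. */
#define CHAN_CAP 4

typedef struct {
    int buf[CHAN_CAP];
    size_t head;    /* next slot to receive from */
    size_t count;   /* elements currently buffered */
} Chan;

/* Non-blocking send: returns 0 when full (a real runtime would park the
   sending goroutine on the channel's wait queue instead). */
int chan_trysend(Chan *c, int v) {
    if (c->count == CHAN_CAP) return 0;
    c->buf[(c->head + c->count) % CHAN_CAP] = v;
    c->count++;
    return 1;
}

/* Non-blocking receive: returns 0 when empty. */
int chan_tryrecv(Chan *c, int *out) {
    if (c->count == 0) return 0;
    *out = c->buf[c->head];
    c->head = (c->head + 1) % CHAN_CAP;
    c->count--;
    return 1;
}
```

Blocking send/receive is this plus the scheduler: on a full or empty buffer, enqueue the current goroutine on the channel's wait queue and switch away; the peer operation dequeues and wakes it. Waking the wrong goroutine is exactly the class of bug mentioned above.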

Part 3: Resources You’ll Need

Essential Reading

  1. gccgo source code — gcc/go/gofrontend/ and libgo/runtime/
  2. Go runtime source — $GOROOT/src/runtime/ (different ABI, but same concepts)
  3. SH-4 programming manual — For assembly and ABI
  4. KallistiOS documentation — For Dreamcast specifics

Tools

| Tool             | Purpose                                 |
|------------------|-----------------------------------------|
| sh-elf-nm        | List symbols in object files            |
| sh-elf-objdump   | Disassemble code                        |
| sh-elf-addr2line | Convert addresses to line numbers       |
| dc-tool-ip       | Upload and run on Dreamcast             |
| lxdream          | Dreamcast emulator (for faster iteration) |

The Checklist Mentality

Before each phase, write down:

  1. What symbols must I implement?
  2. What’s the expected signature?
  3. How will I test it?

After each phase:

  1. Did all tests pass?
  2. What surprised me?
  3. What would I do differently?

The journey from nothing to a working Go runtime is not easy. But it is achievable. Every problem has a solution. Every bug can be found. Every undefined symbol can be implemented.

You now have the map. Go build it.

Introduction to libgodc

What Is This Book?

This book is about building a Go runtime for the Sega Dreamcast.

Wait, what?

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   THE CRAZY PROJECT                                         │
│                                                             │
│   Go:                                                       │
│   • Designed for servers and cloud computing                │
│   • Expects gigabytes of RAM                                │
│   • Has a sophisticated garbage collector                   │
│   • Written for modern multi-core CPUs                      │
│                                                             │
│   Dreamcast:                                                │
│   • A game console from 1998                                │
│   • Has 16 MB of RAM (megabytes, not giga)                  │
│   • Single CPU core at 200 MHz                              │
│   • Was designed for arcade games                           │
│                                                             │
│   These shouldn't work together. But they do.               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

We call this project libgodc, a library that implements Go’s runtime for the Dreamcast. By the end of this book, you’ll understand how we built the Dreamcast Go runtime from scratch: memory allocation, garbage collection, goroutine scheduling, channels, and more.


Who Is This Book For?

You should read this book if:

  • You’re curious how programming languages work “under the hood”
  • You want to understand what a runtime actually does
  • You enjoy systems programming and low-level details
  • You think retro game consoles are cool

You’ll need to know:

  • Basic Go (variables, functions, structs, goroutines)
  • Some C (pointers, memory, basic syntax)
  • What a compiler does (turns source code into machine code, duh!)

You don’t need to know:

  • Assembly language (we’ll explain what you need)
  • How to program the Dreamcast (KallistiOS handles the hard parts)
  • Anything about garbage collectors (we’ll build one together)

The Machine We’re Programming

Let’s meet our hardware. The Sega Dreamcast (1998) was ahead of its time—the first 128-bit console, they said! (Marketing math, but still impressive.)

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   THE SEGA DREAMCAST                                        │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │   CPU:     Hitachi SH-4 @ 200 MHz                   │   │
│   │                                                     │   │
│   │   RAM:     16 MB (yes, that's megabytes, not giga)  │   │
│   │                                                     │   │
│   │   VRAM:    8 MB (for the GPU)                       │   │
│   │                                                     │   │
│   │   GPU:     PowerVR2 CLX2                            │   │
│   │                                                     │   │
│   │   Sound:   Yamaha AICA (has its own ARM7 + 2 MB)    │   │
│   │                                                     │   │
│   │   Storage: GD-ROM (or SD card adapter)              │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

For comparison, your phone probably has:

  • 4-8 CPU cores at 2+ GHz
  • 4-8 GB of RAM
  • Virtual memory, memory protection, multiple privilege levels

The Dreamcast has:

  • 1 CPU core at 200 MHz
  • 16 MB of RAM
  • No virtual memory, no memory protection, no privilege levels

Different world.


Why Can’t We Just Use Standard Go?

Go has an official compiler called gc. It generates code for x86, ARM, and other modern architectures.

The Dreamcast uses a SuperH SH-4 processor. Adding SH-4 support to gc would require rewriting significant portions of the compiler backend—months of work, requiring deep expertise in both Go internals and the SH-4 architecture. That’s a project for a team of compiler engineers with sleepless nights, questionable caffeine consumption, and possibly mild insanity.

Instead, we use gccgo, an alternative Go compiler built on GCC. GCC already supports SH-4 (from decades of embedded development). So gccgo can compile Go to SH-4—we just need to provide the runtime.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   TWO PATHS TO GO ON DREAMCAST                              │
│                                                             │
│   Path A: Modify gc                                         │
│   ─────────────────────                                     │
│   - Write a new SH-4 backend                                │
│   - Write a new Dreamcast Operating System                  │
│   - Understand SSA, register allocation, etc.               │
│   - Result: "real" Go on Dreamcast                          │
│                                                             │
│   Path B: Use gccgo + write runtime (this book)             │
│   ────────────────────────────────────────────              │
│   - GCC already knows SH-4                                  │
│   - Write runtime in C                                      │
│   - Result: Go dialect for Dreamcast                        │
│                                                             │
│   We chose Path B. It's faster and teaches more.            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The 16 Megabyte Problem

Sixteen megabytes. That’s it. Everything must fit:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   16 MB = 16,777,216 bytes                                  │
│                                                             │
│   That's shared between:                                    │
│                                                             │
│   ┌─────────────────────────────────────────────────┐       │
│   │  Your program's code           (0.5 - 2 MB)     │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  KallistiOS overhead           (~0.5 MB)        │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  Go runtime heap               (??? MB)         │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  Goroutine stacks              (??? MB)         │       │
│   ├─────────────────────────────────────────────────┤       │
│   │  Game assets (textures, etc.)  (??? MB)         │       │
│   └─────────────────────────────────────────────────┘       │
│                                                             │
│   Everything fights for space.                              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

This is why our garbage collector choice matters so much. We use a semi-space copying collector, which needs two equally-sized spaces. libgodc allocates 2 MB per space = 4 MB total = 2 MB usable heap.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Semi-space GC memory usage (libgodc default):             │
│                                                             │
│   ┌─────────────────────┬─────────────────────┐             │
│   │    FROM-SPACE       │     TO-SPACE        │             │
│   │      2 MB           │       2 MB          │             │
│   │                     │                     │             │
│   │  (active heap)      │  (empty, waiting    │             │
│   │                     │   for next GC)      │             │
│   └─────────────────────┴─────────────────────┘             │
│                                                             │
│   Total: 4 MB for a 2 MB usable heap. That's 50% overhead!  │
│                                                             │
│   But: no fragmentation, simple, predictable.               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Design decision: We chose simplicity (semi-space GC) over memory efficiency. On a 16 MB machine, this hurts. But a more memory-efficient collector would be much more complex to implement and debug. The 2 MB usable heap is sufficient for most Dreamcast games—large assets like textures should use external allocation anyway. For games needing more RAM, compile with -DGC_SEMISPACE_SIZE_KB=1024 to shrink the heap to 1 MB usable (2 MB total).
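The 2 + 2 MB split is straightforward to picture in code. A minimal sketch with our own hypothetical names (the real setup lives in runtime/gc_heap.c and differs in detail):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of semi-space setup, not the actual libgodc code. */
#define SEMISPACE_SIZE (2 * 1024 * 1024)  /* 2 MB per space (the default) */

typedef struct {
    uint8_t *from_space;   /* active heap: allocations happen here       */
    uint8_t *to_space;     /* empty: live objects are copied here at GC  */
    uint8_t *alloc_ptr;    /* bump pointer into from_space               */
    uint8_t *alloc_limit;  /* end of from_space                          */
} semispace_heap;

static int heap_init(semispace_heap *h) {
    /* Both spaces come from the underlying allocator (KOS malloc on DC). */
    h->from_space = malloc(SEMISPACE_SIZE);
    h->to_space   = malloc(SEMISPACE_SIZE);
    if (!h->from_space || !h->to_space)
        return -1;                         /* no second chance on 16 MB */
    h->alloc_ptr   = h->from_space;
    h->alloc_limit = h->from_space + SEMISPACE_SIZE;
    return 0;
}
```

Note that 4 MB leaves the allocator before a single Go object exists: the 50% overhead is paid up front.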


Where Does Everything Live?

The Dreamcast has 16 MB of main RAM at addresses 0x8C000000 to 0x8CFFFFFF. Here’s how it’s organized:

    0x8C000000 ──────────────────────────────────────────────
                 │
                 │   KOS kernel + drivers (~1 MB)
                 │
                 ├──────────────────────────────────────────
                 │   .text (your compiled code)
                 │   .rodata (constants, type descriptors)
                 │   .data (initialized globals)
                 │   .bss (uninitialized globals)
                 ├──────────────────────────────────────────
                 │
                 │   KOS malloc heap (everything below):
                 │
                 │   ┌─────────────────────────────────────┐
                 │   │  GC semi-space 0 (2 MB)             │
                 │   ├─────────────────────────────────────┤
                 │   │  GC semi-space 1 (2 MB)             │
                 │   ├─────────────────────────────────────┤
                 │   │  Goroutine stacks (64 KB each)      │
                 │   ├─────────────────────────────────────┤
                 │   │  Textures, audio, game assets       │
                 │   └─────────────────────────────────────┘
                 │
                 ├──────────────────────────────────────────
                 │   Main thread stack (grows downward)
                 │
    0x8CFFFFFF ──────────────────────────────────────────────

                 Total: 16 MB (0x1000000 bytes)

KOS manages the heap via malloc. When you run out of memory, malloc returns NULL and your program crashes. There’s no virtual memory, no swap file, no second chance. Our implementation’s error messages are appropriately friendly (lol):

// runtime/gc_heap.c
if (gc_heap.alloc_ptr + total_size > gc_heap.alloc_limit)
    runtime_throw("out of memory");

// runtime/stack.c  
void *base = memalign(8, size);
if (!base)
    runtime_throw("stack_alloc: out of memory");

// runtime/chan.c
c = (hchan *)gc_alloc(totalSize, &__hchan_type);
if (!c)
    runtime_throw("makechan: out of memory");

// runtime/tls_sh4.c
tls = (tls_block_t *)malloc(sizeof(tls_block_t));
if (!tls)
    runtime_throw("tls_alloc: out of memory");

The SH-4 Processor

Let’s get to know the CPU that runs our code.

The Alignment Rule

Here’s something that will bite you if you forget it:

The SH-4 requires natural alignment.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Type          Size     Must be aligned to                 │
│   ────          ────     ──────────────────                 │
│   uint8         1 byte   Any address is fine                │
│   uint16        2 bytes  Address must be divisible by 2     │
│   uint32        4 bytes  Address must be divisible by 4     │
│   uint64        8 bytes  Address must be divisible by 8     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

On x86 (your laptop), unaligned access is just slow. On SH-4, it crashes the CPU.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   x86 (your laptop):                                        │
│   Unaligned access?  → Works, but slower                    │
│                                                             │
│   SH-4 (Dreamcast):                                         │
│   Unaligned access?  → ADDRESS ERROR EXCEPTION              │
│                         System crashes. No recovery.        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Our allocator must always return properly aligned addresses.
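Keeping every address aligned comes down to one rounding trick: round sizes and pointers up to the next multiple of 8, which satisfies every row in the table above. A minimal sketch (the macro name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~(uintptr_t)((a) - 1))

/* A 13-byte object occupies 16 bytes; an already-aligned size is
 * unchanged. Aligning to 8 covers uint64, the strictest case. */
```

Every allocation path in the runtime applies this rounding before handing an address back, so a uint64 field can never land on an odd boundary and trigger an address error.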

The Floating Point Unit

The SH-4 has a powerful FPU with a twist:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Single-precision (float32):  FAST! ✓                      │
│   - Hardware accelerated                                    │
│   - Multiply-add in 1 cycle                                 │
│                                                             │
│   Double-precision (float64):  Slow ✗                       │
│   - Takes many more cycles                                  │
│   - Avoid in performance-critical code                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Go defaults to float64. For games, use float32 wherever precision isn’t critical. Sadly, making float32 the default in libgodc isn’t possible: someone would have to recompile gccgo and rewrite every constant and the entire standard library to use float32, which is massive work, especially around the math packages and everything that depends on them. So just remember: use float32, never float64.

A better way to solve this, in the future, would be to create float32 wrappers around common math functions.


The Cache Problem

The SH-4 has a 16 KB data cache with “write-back” behavior. When you write data, it might only go to the cache, not to main memory.

THE PROBLEM:
════════════

  Your code writes to address 0x8C100000
          │
          ▼
  ┌───────────────┐
  │    CACHE      │  ← Data goes HERE
  │  (new value)  │
  └───────────────┘
          
  ┌───────────────┐
  │  MAIN MEMORY  │  ← But not HERE (yet)
  │  (old value)  │
  └───────────────┘
          │
          ▼
  GPU reads from 0x8C100000
  Gets the OLD value!  💥

We have to manually flush the cache before hardware reads from memory:

dcache_flush_range(addr, len);  // Push cache → memory

On your laptop, the OS handles this. On the Dreamcast, it’s our job.


KallistiOS: The Foundation

We’re not programming bare-metal. We build on KallistiOS (KOS), the standard SDK for Dreamcast homebrew.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌───────────────────────────────────────────────────┐     │
│   │              Your Go Program                      │     │
│   └───────────────────────────────────────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │                  libgodc                          │     │
│   │  (Go runtime: GC, scheduler, channels, etc.)      │     │
│   └───────────────────────────────────────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │               KallistiOS                          │     │
│   │  (hardware abstraction, malloc, timers)           │     │
│   └───────────────────────────────────────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │            Dreamcast Hardware                     │     │
│   └───────────────────────────────────────────────────┘     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

KOS is a minimal embedded operating system that gets statically linked into your program. There’s no user/kernel mode separation, no process isolation, and no memory protection. Your code runs with full hardware access, alongside the KOS kernel.


The Constraints That Shape Everything

These hardware limitations drive every decision in libgodc:

Constraint 1: No Memory Protection

On your laptop, accessing invalid memory gives: Segmentation fault (core dumped)

On the Dreamcast: memory gets silently corrupted, or the program crashes without explanation.

Constraint 2: Real-Time Requirements

Games need consistent frame rates. At 60 FPS, you have 16.67 milliseconds per frame:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   One frame = 16.67 ms                                      │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   │
│   └─────────────────────────────────────────────────────┘   │
│   Game logic  Rendering             GC pause                │
│   ░░░░░░░░░░  ░░░░░░░░░░░░░░░      ░░░░                     │
│                                      ▲                      │
│                                      │                      │
│                        If GC takes 20ms, you miss frames!   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Constraint 3: Single Core

The SH-4 is a single-core CPU. Even if we wanted a parallel GC, the hardware can’t run threads simultaneously. So when GC runs, everything stops.

The Toolchain

In this chapter

  • You learn why we use gccgo instead of the standard Go compiler
  • You see how Go code becomes Dreamcast machine code
  • You understand the “holes” in compiled code and how we fill them
  • You discover the dark arts: making C pretend to be Go
  • You learn about calling conventions and type descriptors

Why gccgo?

A compiler is just a program that writes programs. Most Go developers use gc, the standard Go compiler. It’s fast, produces excellent code, and has a fantastic runtime.

But gc only speaks certain architectures:

┌─────────────────────────────────────────┐
│                                         │
│     gc compiler's architecture list     │
│                                         │
│     ✓ x86-64   laptops, desktops        │
│     ✓ ARM64    phones, Raspberry Pi     │
│     ✓ RISC-V   new trend                │
│                                         │
│     ✗ SH-4     "never heard of this"    │
│                                         │
└─────────────────────────────────────────┘

The Dreamcast uses a Hitachi SuperH SH-4 processor. Adding support to gc would require modifying the compiler backend—months of work, lots of caffeine, and at least three existential crises.

But here’s the thing: GCC has supported the SH-4 for over two decades.

┌─────────────────┐         ┌─────────────────┐
│   gc compiler   │         │   GCC compiler  │
│                 │         │                 │
│  Knows Go ✓     │         │  Knows Go ✗     │
│  Knows SH-4 ✗   │         │  Knows SH-4 ✓   │
└─────────────────┘         └─────────────────┘
        │                           │
        └─────── combine? ──────────┘
                    │
                    ▼
          ┌─────────────────┐
          │     gccgo       │
          │                 │
          │  Knows Go ✓     │
          │  Knows SH-4 ✓   │
          └─────────────────┘

gccgo is a Go frontend for GCC. It reads Go source code, performs type checking, then hands everything to GCC’s backend. GCC handles the hard part—generating SH-4 machine code.

We get Go compilation for the Dreamcast “for free.” Our job is to provide the runtime library.

What is a Runtime?

A runtime is a library of functions that a compiled program calls during execution. It handles things the compiler can’t (or shouldn’t) generate inline: memory allocation, garbage collection, goroutine scheduling, panic handling, and more.

Why do languages use this pattern? Portability. The compiler translates your source code into machine instructions, but those instructions need to interact with the operating system or hardware. By separating “language translation” from “platform interaction,” you can:

  1. Reuse the compiler — gccgo already knows Go. We don’t touch it.
  2. Swap the runtime — We write a Dreamcast-specific runtime. The same compiler now works on a new platform.

This is how Go supports Linux, Windows, macOS, and now Dreamcast—same language, same compiler frontend, different runtimes.

Other languages use similar patterns:

  • C has startup code (crt0) and libc for system calls
  • C++ adds exception handling (libgcc) and the standard library (libstdc++)
  • Rust has a minimal runtime embedded in libstd
  • Java has the JVM—a full runtime with GC, JIT, and class loading
  • Python has libpython—the interpreter itself

The difference is scope. C’s runtime is small—just system call wrappers. Go’s runtime is large—it includes a garbage collector, scheduler, and channel implementation. That’s why porting Go is harder than porting C, but the principle is identical.


Code with Holes

Here’s the key insight of this entire book. When you compile Go code, the compiler doesn’t include everything.

func main() {
    s := make([]int, 10)
    m := make(map[string]int)
    go doSomething()
}

What does make([]int, 10) actually do? It needs to allocate memory, initialize the slice header, and return it. Does the compiler generate all that code inline?

No. It generates function calls instead:

Your Go code              What the compiler emits
─────────────             ──────────────────────

make([]int, 10)       →   CALL runtime.makeslice
make(map[string]int)  →   CALL runtime.makemap  
go doSomething()      →   CALL runtime.newproc

The compiled object file is full of these calls. But the implementations aren’t there:

┌─────────────────────────────────────────────────────┐
│                                                     │
│                  main.o (your compiled code)        │
│                                                     │
│    ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐      │
│    │HOLE │  │HOLE │  │HOLE │  │HOLE │  │HOLE │      │
│    └─────┘  └─────┘  └─────┘  └─────┘  └─────┘      │
│    runtime  runtime  runtime  runtime  runtime      │
│    .make    .make    .make    .new     .defer       │
│    slice    map      chan     proc     proc         │
│                                                     │
└─────────────────────────────────────────────────────┘

These are unresolved symbols. The object file knows it needs to call runtime.makeslice, but doesn’t know where that function is.

Who fills in the holes? That’s us. That’s libgodc.


Filling the Holes

Our job is to provide implementations. When the linker combines your code with our library, every hole gets filled:

BEFORE LINKING:
═══════════════

┌──────────────────┐          ┌──────────────────┐
│    main.o        │          │   libgodc.a      │
│                  │          │                  │
│  HOLE: runtime.  │          │  runtime.        │
│        makeslice │          │  makeslice ──────┼──→ actual code!
│                  │          │                  │
│  HOLE: runtime.  │          │  runtime.        │
│        newproc   │          │  newproc ────────┼──→ actual code!
└──────────────────┘          └──────────────────┘


AFTER LINKING:
══════════════

┌─────────────────────────────────────────────────────┐
│                    game.elf                         │
│                                                     │
│    call runtime.makeslice ───→ [makeslice code]     │
│    call runtime.newproc ─────→ [newproc code]       │
│                                                     │
│    No more holes! Ready to run.                     │
└─────────────────────────────────────────────────────┘

The Symbol Problem

There’s a wrinkle. Go uses dots in names: runtime.makeslice.

But dots are illegal in C identifiers:

void runtime.makeslice() { }  // SYNTAX ERROR!

How do we write a C function with a dot in its name?

The __asm__ Trick

GCC lets you specify the symbol name separately:

// C identifier uses underscore, but symbol has a dot
void *runtime_makeslice(void *type, int len, int cap)
    __asm__("runtime.makeslice");

void *runtime_makeslice(void *type, int len, int cap) {
    // implementation
}
┌────────────────────────────────────────────────────────┐
│                                                        │
│   In C code:         →    In object file:              │
│                                                        │
│   runtime_makeslice()     runtime.makeslice            │
│   (underscore)            (dot)                        │
│                                                        │
│   Go calls runtime.makeslice, linker finds it,         │
│   Go never knows it was written in C.                  │
│                                                        │
└────────────────────────────────────────────────────────┘

Every runtime function in libgodc uses this pattern.
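You can try the trick on any ELF toolchain, not just sh-elf-gcc. A toy example with a made-up symbol name (not a real runtime function):

```c
#include <assert.h>

/* The C identifier has an underscore; the emitted symbol has a dot.
 * Running `nm` on the object file would show "runtime.example_add". */
int runtime_example_add(int a, int b) __asm__("runtime.example_add");

int runtime_example_add(int a, int b) {
    return a + b;
}
```

The asm label goes on a declaration that precedes the definition; from C you still call the function by its underscore name.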


Symbols vs. Signatures

Two things must match between caller and callee:

1. The Symbol (the name): Get it wrong, the linker complains loudly.

2. The Signature (the shape): What arguments, what order, what return values.

The compiler has already decided how to call runtime.makeslice:

Register r4:  pointer to type descriptor
Register r5:  length
Register r6:  capacity

Return value in r0

If our implementation expects arguments in different registers:

What compiler sends:        What our code expects:
────────────────────        ──────────────────────

  r4 = type pointer           r4 = length        ← WRONG!
  r5 = length                 r5 = capacity      ← WRONG!

The linker won’t catch this. Symbol names match, so it happily connects them. The mismatch only shows up at runtime as mysterious crashes.

┌─────────────────────────────────────────────────────┐
│                                                     │
│   Symbol mismatch:        Signature mismatch:       │
│   ───────────────         ──────────────────        │
│   Linker error            Linker succeeds           │
│   Clear message           Runtime crash             │
│   Easy to fix             Hard to debug             │
│                                                     │
└─────────────────────────────────────────────────────┘

The Calling Convention

When a function calls another function, they need to agree on how to pass data. This is the calling convention.

SH-4 Register Usage

┌─────────────────────────────────────────────────────────────┐
│   SH-4 Register Usage                                       │
│                                                             │
│   r0      Return value / scratch                            │
│   r1      Return value (64-bit) / scratch                   │
│   r2-r3   Scratch                                           │
│   ─────────────────────────────────────────────             │
│   r4      1st argument                                      │
│   r5      2nd argument                                      │
│   r6      3rd argument                                      │
│   r7      4th argument                                      │
│   ─────────────────────────────────────────────             │
│   r8-r13  Callee-saved (must preserve)                      │
│   r14     Frame pointer                                     │
│   r15     Stack pointer                                     │
└─────────────────────────────────────────────────────────────┘

Why does this matter? Most of the time, it doesn’t—the compiler handles it. But understanding the calling convention helps when:

  • Debugging crashes: Register dumps make sense when you know r4-r7 hold arguments
  • Writing //extern bindings: You need to match what C functions expect
  • Reading the runtime assembly: Context switching must save/restore the right registers (r8-r14 are callee-saved, so the callee must preserve them)

Multiple Return Values

Go functions can return multiple values. C can’t. gccgo handles this by returning a struct:

struct result {
    int quotient;
    int remainder;
};

struct result divmod(int a, int b) {
    return (struct result){ a / b, a % b };
}

Small structs fit in r0-r1. When implementing runtime functions that return multiple values, we must match exactly what gccgo expects.


Reading CPU Registers

Sometimes we need to know register values directly:

// This variable IS register r15
register uintptr_t sp asm("r15");

printf("Stack pointer: 0x%08x\n", sp);

This isn’t a copy—sp is the register. We use this for:

  • Stack bounds checking
  • Context switching (saving/restoring goroutine state)
  • Debugging (dump registers on crash)

Inline Assembly

Sometimes C can’t express what we need. Here are real examples from libgodc:

// Prefetch - hint CPU to load cache line (gc_copy.c)
#define GC_PREFETCH(addr) __asm__ volatile("pref @%0" : : "r"(addr))

// Read the stack pointer (gc_copy.c)
void *sp;
__asm__ volatile("mov r15, %0" : "=r"(sp));

// Read/write status register (scheduler.c)
__asm__ volatile("stc sr, %0" : "=r"(sr));  // read
__asm__ volatile("ldc %0, sr" : : "r"(sr)); // write

// Memory barrier - prevent compiler reordering (runtime.h)
#define CONTEXT_SWITCH_BARRIER() __asm__ volatile("" ::: "memory")

We use assembly for:

  • Prefetching (hint cache to load data we’ll need soon)
  • Context switching (save/restore all registers—see runtime_sh4_minimal.S)
  • Reading special registers (stack pointer, status register)
  • Memory barriers (ensure memory operations complete before continuing)

Don’t use it for anything you can do in C. KOS handles cache flush/invalidate via dcache_flush_range().


Type Descriptors

When you define a Go type, the compiler generates a type descriptor. Here are the key fields (the full struct has 12 fields, 36 bytes):

struct __go_type_descriptor {
    uintptr_t __size;        // Size of an instance
    uintptr_t __ptrdata;     // Bytes containing pointers
    uint32_t  __hash;        // Hash for type comparison
    uint8_t   __code;        // Kind (int, string, struct...)
    const uint8_t *__gcdata; // GC bitmap: which words are pointers
    // ... plus alignment, equality function, reflection string, etc.
};

For this Go type:

type Point struct {
    X, Y int
    Name *string
}

The compiler generates:

┌─────────────────────────────────────────────────────────────┐
│   Type descriptor for Point:                                │
│                                                             │
│   __size:    12 bytes  (int + int + pointer)                │
│   __ptrdata: 12 bytes  (all 3 words may contain pointers)   │
│   __code:    STRUCT                                         │
│   __gcdata:  bit-packed bitmap (1 bit per word)             │
│                                                             │
│   Word 0 (X):    int, not a pointer  → bit 0 = 0            │
│   Word 1 (Y):    int, not a pointer  → bit 1 = 0            │
│   Word 2 (Name): pointer             → bit 2 = 1            │
│                                                             │
│   __gcdata[0] = 0b00000100 = 0x04                           │
│                                                             │
│   GC reads: gcdata[word/8] & (1 << (word%8))                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The garbage collector uses __gcdata to know which fields to scan. The bitmap is bit-packed: one bit per pointer-sized word. Without it, the GC would have to guess which values are pointers.
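The lookup in the diagram is one shift and one mask. A minimal sketch using the Point bitmap above (the helper name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* One bit per pointer-sized word: bit set => that word holds a pointer. */
static int word_is_pointer(const uint8_t *gcdata, uintptr_t word) {
    return (gcdata[word / 8] >> (word % 8)) & 1;
}

/* Point{X, Y int; Name *string}: only word 2 is a pointer. */
static const uint8_t point_gcdata[] = { 0x04 };  /* 0b00000100 */
```

During a collection, the GC walks each object word by word and consults this bitmap to decide whether the value is a pointer worth following or just an integer that happens to look like one.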


The Build Process

══════════════════════════════════════════════════════════════
                    THE BUILD PIPELINE
══════════════════════════════════════════════════════════════

ONCE (building libgodc):
────────────────────────

  gc_runtime.c ─┐
  chan.c ───────┼──→ sh-elf-gcc ──→ *.o ──→ ar ──→ libgodc.a
  scheduler.c ──┤
  map.c ────────┘


EVERY TIME (building your game):
────────────────────────────────

  main.go ──→ sh-elf-gccgo ──→ main.o (with holes)
                                   │
                                   ▼
  main.o + libgodc.a + libkallisti.a ──→ sh-elf-ld ──→ game.elf

══════════════════════════════════════════════════════════════

The linker doesn’t care what language produced the code. It just matches symbol names.


Why C, Not Go?

libgodc is written in C (specifically, C11 with GNU extensions).

The Bootstrap Problem: To compile Go, you need a Go runtime. To get a Go runtime, you need to compile Go. Chicken, meet egg.

By writing the runtime in C, we sidestep the problem. The C compiler doesn’t need anything from Go.

Also, KallistiOS is written in C, so we can directly call its functions.


What Runs Before main()?

Your Go main() isn’t the first thing that runs. libgodcbegin.a provides the C main() (in go-main.c) that sets everything up:

Dreamcast powers on
        │
        ▼
KallistiOS boots
        │
        ▼
C main() [go-main.c]
        │
        ├──→ runtime_args()              Save argc/argv
        ├──→ runtime_init()
        │       ├──→ gc_init()           Set up garbage collector
        │       ├──→ map_init()          Initialize map subsystem
        │       ├──→ sudog_pool_init()   Pre-allocate channel waiters
        │       ├──→ stack_pool_preallocate()  Pre-allocate goroutine stacks
        │       ├──→ proc_init()         Set up scheduler (tls_init, g0)
        │       └──→ panic_init()        Set up panic/recover
        │
        ├──→ __go_go(main_wrapper)       Create goroutine for main.main
        │
        └──→ scheduler_run_loop()        Start scheduler
                    │
                    ▼
            YOUR CODE RUNS HERE

Memory Management

The Problem with Memory

In C, you’re the janitor:

char *name = malloc(100);
strcpy(name, "Mario");
free(name);  // Forget this? Memory leak.
             // Do it twice? Crash.

It’s like bringing a coffee mug to your desk every morning and never taking it back. Monday is fine, but by Friday your desk is a pile of empty mugs.

Go says: “I’ll handle the cleanup, trash and coffee mugs included.”

player := &Player{name: "Mario"} // struct allocation (heap)
enemies := make([]Enemy, 10) // slice allocation (heap)
scores := make(map[string]int) // map allocation (heap)
// That's it. Go cleans up automatically when you're done with them.

Stack vs Heap: Where Does Memory Live?

If you’re coming from Python or JavaScript, you might never have thought about where your variables live. In those languages, everything “just works”: you create objects, use them, and the runtime cleans up. But programs actually use two different regions of RAM: the stack and the heap. Both are in main memory, but they’re managed very differently.

func calculate() int {
    x := 42           // stack: lives only during this function call
    y := x * 2        // stack: same, gone when function returns
    return y          // value is copied out, then x and y disappear
}

func createPlayer() *Player {
    p := &Player{name: "Mario"}   // heap: we're returning a pointer
    return p                      // p (the pointer) disappears, but the
                                  // Player data survives on the heap
}

The stack is memory that belongs to the current function call. When the function returns, that memory is immediately reclaimed—no cleanup needed, no garbage collector involved. But the data is gone forever.

The heap is memory that persists beyond the function that created it. When you take the address of something (&Player{...}), return a pointer, or use make() for slices/maps, Go allocates on the heap. That memory sticks around until the garbage collector determines nothing references it anymore.

There’s also the data segment where global variables live. These are allocated once when the program starts and exist until the program exits—no cleanup, no GC, they just persist for the program’s entire lifetime.

var highScore int       // data segment - exists from start to end

func main() {
    x := 42             // stack - gone when main() returns
    p := &Player{}      // heap - GC cleans up when unreferenced
    highScore = 9999    // modifying global, not allocating
}

On Dreamcast, there are additional memory regions you’ll encounter:

Region      Size                   Contains
Code        varies                 Your compiled program (read-only instructions)
Data/BSS    varies                 Global variables
Stack       64 KB per goroutine    Local variables, function calls
Heap        ~4 MB (2 MB usable)    GC-managed allocations
VRAM        8 MB total             Textures, framebuffer (via PVR functions)
Sound RAM   2 MB                   Audio samples (via sound functions)

VRAM and Sound RAM are physically separate chips—they can’t corrupt main RAM or each other. If you run out of VRAM, PvrMemMalloc() returns 0. If you don’t check and try to use that zero pointer, your program crashes. Use PvrMemAvailable() to check how much VRAM remains (the framebuffer takes some of the 8 MB, so you won’t have all of it for textures).

When your game ends (power off or reset), all memory is simply gone—the “cleanup” is turning off the console.

func example() {
    // STACK - temporary, fast, automatic cleanup:
    count := 10
    sum := 0.0
    flag := true

    // HEAP - persists, needs GC to clean up:
    player := &Player{}           // pointer escapes? heap
    enemies := make([]Enemy, 5)   // slices go to heap
    scores := make(map[string]int) // maps always heap
}

The compiler decides where each variable lives through escape analysis: if the data could be used after the function returns (passed around, stored somewhere, returned), it goes to the heap. Otherwise, it stays on the stack.

The garbage collector (GC) finds stuff you’re not using anymore and reclaims the memory. But here’s the catch—it takes time to run.


How Allocation Works

When you create something in Go, where does the memory come from?

We use bump allocation. Think of it like a notepad:

┌─────────────────────────────────────────────────────┐
│ Mario │ Luigi │ Peach │                             │
└─────────────────────────────────────────────────────┘
                        ↑
                     You are here
                   (next free spot)

To allocate: just write at the current spot and move the marker.

┌─────────────────────────────────────────────────────┐
│ Mario │ Luigi │ Peach │ Toad │                      │
└─────────────────────────────────────────────────────┘
                               ↑
                            Moved!

That’s it! Just move a pointer. Way faster than malloc.
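The whole allocator fits in a few lines. A minimal sketch with our own names; unlike the real one in runtime/gc_heap.c, it skips the per-object GC header and returns NULL at the limit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 1024
#define ALIGN8(n) (((n) + 7u) & ~(uintptr_t)7u)

static uint8_t  heap[HEAP_SIZE] __attribute__((aligned(8)));
static uint8_t *alloc_ptr   = heap;
static uint8_t *alloc_limit = heap + HEAP_SIZE;

/* Bump allocation: round the size up, check the limit, move the marker. */
static void *bump_alloc(size_t size) {
    size = ALIGN8(size);
    if (alloc_ptr + size > alloc_limit)
        return NULL;  /* runtime/gc_heap.c throws "out of memory" here */
    void *p = alloc_ptr;
    alloc_ptr += size;
    return p;
}
```

Two consecutive 12-byte allocations land 16 bytes apart in this sketch (12 rounded up to 16); the real runtime’s stride is larger because each object also carries a header.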

Verifying Allocations: A Hands-On Example

Embedded developers are used to inspecting memory directly. Here’s how you can see these allocations in action:

package main

import "unsafe"

type Player struct {
    X, Y  float32
    Score int32
}

//go:noinline
func allocOnHeap() *Player {
    return &Player{X: 10, Y: 20, Score: 100}
}

func main() {
    // Stack allocation
    var local Player
    stackAddr := uintptr(unsafe.Pointer(&local))
    println("Stack allocation at:", stackAddr)

    // Heap allocation
    p := allocOnHeap()
    heapAddr := uintptr(unsafe.Pointer(p))
    println("Heap allocation at:", heapAddr)

    // Multiple heap allocations - watch the bump pointer move
    for i := 0; i < 5; i++ {
        obj := allocOnHeap()
        addr := uintptr(unsafe.Pointer(obj))
        println("  Player", i, "at:", addr)
    }
}

Actual output from Dreamcast hardware (from tests/test_alloc_inspect.elf):

Stack allocation:
  Address (hex):     0x8c494cc4

Heap allocation:
  Address (hex):     0x8c084b00

Allocating 5 Player structs consecutively:
  Player 0 at: 0x8c084b50
  Player 1 at: 0x8c084b68  (+ 24 bytes)
  Player 2 at: 0x8c084b80  (+ 24 bytes)
  Player 3 at: 0x8c084b98  (+ 24 bytes)
  Player 4 at: 0x8c084bb0  (+ 24 bytes)

Global variable at:  0x8c05ecc0
  → Data segment (matches .data section start)

Notice the heap addresses increment by 24 bytes each time—that’s the 12-byte Player struct plus the 8-byte GC header, rounded up to 8-byte alignment. The bump pointer just keeps moving forward.

Using GDB to inspect:

# Start dc-tool with GDB server enabled
$ dc-tool-ip -t 192.168.x.x -g -x your_game.elf

# In another terminal, connect GDB
$ sh-elf-gdb your_game.elf
(gdb) target remote :2159

# Set breakpoint and run
(gdb) break main.main
(gdb) continue

# Examine heap memory (address from test output)
(gdb) x/32x 0x8c084b00    # Dump heap region
(gdb) info registers r15  # Stack pointer (SP)

# View GC heap structure
(gdb) p gc_heap           # Print GC heap state
(gdb) p gc_heap.alloc_ptr # Current bump pointer

Memory layout from real hardware (16 MB RAM at 0x8c000000-0x8d000000):

0x8c000000 ┌─────────────────────────────────────┐
           │ KOS kernel and system data          │
0x8c010000 ├─────────────────────────────────────┤
           │ .text (your compiled code)          │ ← Binary starts here
0x8c052aa0 ├─────────────────────────────────────┤
           │ .rodata (read-only data, strings)   │
0x8c05ecc0 ├─────────────────────────────────────┤
           │ .data (global variables)            │ ← Global at 0x8c05ecc0
0x8c0622ac ├─────────────────────────────────────┤
           │ Heap (KOS malloc)                   │
           │   - GC semi-spaces                  │ ← Heap alloc at 0x8c084b00
           │   - KOS thread stacks               │ ← Stack var at 0x8c494cc4
           │   - Other malloc allocations        │
           │                                     │
0x8d000000 └─────────────────────────────────────┘

Note: KOS manages thread stacks via malloc, so both heap allocations and stack memory come from the same pool. The addresses above are from running test_alloc_inspect.elf on real hardware.

But wait…! We never erase anything. Eventually we run out of space. Yikes!


Why Two Spaces? (Semi-Space Collection)

The bump allocator has a problem: it can only allocate, never free individual objects. When the space fills up, we need a way to reclaim garbage.

Why not free objects in place? Because it creates fragmentation:

┌──────────────────────────────────────────────────────┐
│ Player │ FREE │ Enemy │ FREE │ FREE │ Bullet │ FREE  │
└──────────────────────────────────────────────────────┘
          ↑       can't fit a 3-slot object here

You end up with “free” holes everywhere. A 3-slot object might not fit even though there’s enough total free space.

The solution: copy to a second space. Instead of freeing in place:

  1. Allocate a second space of equal size
  2. When the first space fills, scan for live objects (objects still referenced)
  3. Copy only live objects to the second space
  4. The first space is now 100% garbage—reset the bump pointer to the start

BEFORE (Space A full):           AFTER (Space B active):
┌────────────────────────┐       ┌────────────────────────┐
│ Player │ xxx │ Enemy │ │  →    │ Player │ Enemy │ Bullet│
│ xxx │ Bullet │ xxx │   │       │                        │
└────────────────────────┘       └────────────────────────┘
 (xxx = garbage)                  (compacted, no gaps!)

This copying collection solves two problems at once:

  • Garbage is reclaimed: everything left in Space A is garbage
  • Memory is compacted: no fragmentation in Space B

How Copying Works: Cheney’s Algorithm

The copying process uses an elegant algorithm invented by C.J. Cheney in 1970. It needs only two pointers and no recursion:

TO-SPACE:
┌────────────────────────────────────────────────────────┐
│ Player │ Enemy │ Bullet │                              │
└────────────────────────────────────────────────────────┘
         ↑                 ↑
       SCAN              ALLOC

  1. Start with roots (global variables, stack references, CPU registers)

    Why roots? The GC needs to know which objects are still in use. It can’t ask the running program—the program is paused. The only way to determine if an object is “live” is to check: can any code reach it? Roots are the starting points—references the program definitely has access to. If an object isn’t reachable from any root (directly or through a chain of pointers), no code can ever access it again. It’s garbage.

  2. Copy each root object to to-space at the ALLOC position, then move ALLOC forward by the object’s size (this is the same bump allocation from earlier—just alloc_ptr += size)

  3. Scan copied objects (starting at SCAN pointer) for pointers to other objects

    “Scan” doesn’t mean checking every byte—that would be slow and error-prone. Each object has type information (the __gcdata bitmap from its type descriptor) that tells the GC exactly which fields are pointers. The GC only checks those fields.

  4. If a referenced object hasn’t been copied, copy it to to-space

  5. Update the pointer to point to the new location

  6. Repeat until SCAN catches up with ALLOC—all live objects are now copied

The clever part: when you copy an object, you leave a forwarding pointer in the old location. If another reference points to that same object, you find the forwarding pointer and update the reference without copying again.

// Simplified from runtime/gc_copy.c
void *gc_copy_object(void *old_ptr) {
    gc_header_t *header = gc_get_header(old_ptr);
    
    // Already copied? Return the forwarding address
    if (GC_HEADER_IS_FORWARDED(header))
        return GC_HEADER_GET_FORWARD(header);
    
    size_t obj_size = GC_HEADER_GET_SIZE(header);
    
    // Copy to to-space at current alloc_ptr
    gc_header_t *new_header = (gc_header_t *)gc_heap.alloc_ptr;
    memcpy(new_header, header, obj_size);
    gc_heap.alloc_ptr += obj_size;
    
    void *new_ptr = gc_get_user_ptr(new_header);
    
    // Leave forwarding pointer in old location
    GC_HEADER_SET_FORWARD(header, new_ptr);
    
    return new_ptr;
}
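The C routine above copies a single object; what drives it is the SCAN/ALLOC loop from steps 1–6. Here's a toy Go model of the whole cycle — illustrative only: the real collector works on raw memory via headers and forwarding pointers, and these type names are ours, not the runtime's:

```go
package main

// Toy model of Cheney's two-pointer collection.

type obj struct {
	name string
	refs []int // indices of referenced objects (into from-space at first)
	fwd  int   // -1 = not yet copied; otherwise its index in to-space
}

// collect copies everything reachable from roots into a fresh to-space.
func collect(from []*obj, roots []int) []*obj {
	var to []*obj // the ALLOC position is len(to)

	copyObj := func(i int) int {
		if from[i].fwd >= 0 {
			return from[i].fwd // forwarding pointer: already copied
		}
		clone := &obj{name: from[i].name, refs: append([]int(nil), from[i].refs...), fwd: -1}
		from[i].fwd = len(to)  // leave a forwarding pointer in the old object
		to = append(to, clone) // bump ALLOC forward
		return from[i].fwd
	}

	for _, r := range roots { // steps 1-2: copy the roots
		copyObj(r)
	}
	for scan := 0; scan < len(to); scan++ { // steps 3-6: SCAN chases ALLOC
		for j, ref := range to[scan].refs {
			to[scan].refs[j] = copyObj(ref) // copy referent, update pointer
		}
	}
	return to
}

func main() {
	from := []*obj{
		{name: "Player", refs: []int{2}, fwd: -1},
		{name: "garbage", fwd: -1},
		{name: "Enemy", fwd: -1},
		{name: "Bullet", fwd: -1},
	}
	to := collect(from, []int{0, 3}) // roots: Player and Bullet
	for _, o := range to {
		println(o.name) // Player, Bullet, Enemy — garbage is never touched
	}
}
```

Note how the dead object costs nothing: it is simply never visited, which is where the O(live objects) bound comes from.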

Why this algorithm is elegant:

  • O(live objects) time—dead objects aren’t even touched
  • No recursion—just two pointers chasing each other
  • Single pass—scan and copy happen together
  • Compaction is free—objects naturally pack together

The trade-off: 50% of heap is always reserved for the copy destination.


The 50% Memory Cost

You may have noticed the trade-off mentioned earlier: one space is always reserved for copying. That means half your heap is “unusable” at any given time.

┌─────────────────────────────────────────────────────┐
│        4 MB total GC heap                           │
│  ┌──────────────────┬──────────────────┐            │
│  │   Space A        │   Space B        │            │
│  │   2 MB           │   2 MB           │            │
│  │   (active)       │   (copy target)  │            │
│  └──────────────────┴──────────────────┘            │
│                                                     │
│  Usable at any time: 2 MB                           │
└─────────────────────────────────────────────────────┘

Why accept this 50% cost? Because you get:

  • No fragmentation: Cheney’s algorithm compacts automatically
  • O(1) allocation: just bump alloc_ptr, no free-list search
  • O(live objects) collection: dead objects aren’t even touched
  • Simple implementation: fewer bugs in the runtime
  • Cache-friendly: live objects end up packed together

It’s a deliberate trade-off: memory for speed and simplicity. On a 16 MB system where you’re also using VRAM and Sound RAM for assets, 2 MB of usable GC heap is often sufficient.

Customizing heap size: The default is GC_SEMISPACE_SIZE_KB=2048 (2 MB per space, 4 MB total). To change it, edit runtime/godc_config.h or rebuild libgodc with make CFLAGS="-DGC_SEMISPACE_SIZE_KB=1024" for 1 MB usable, leaving more RAM for game assets.


The Freeze

Here’s the bad news. When the GC runs, your game stops.

Timeline:
────────────────────────────────────────────────────────
Game:   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓████████████▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
                        ↑            ↑
                      GC starts    GC ends
                        
                     "stop-the-world"

All Go code freezes: game logic, physics, input handling. No goroutines run during collection. (Music keeps playing though—the AICA sound processor runs independently of the SH-4 CPU.)

How long does this take? Let’s find out with real numbers.


Real Benchmark Results

Benchmarks from actual Dreamcast hardware (from tests/bench_architecture.elf), verified December 2025:

┌─────────────────────────────────────────────────────┐
│  SCENARIO                   GC PAUSE                │
├─────────────────────────────────────────────────────┤
│  Large objects (≥128 KB)    ~73 μs   (bypass GC)    │
│  64 KB live data            ~2.2 ms                 │
│  32 KB live data            ~6.2 ms                 │
└─────────────────────────────────────────────────────┘

GC pause scales with the number of objects, not just total size. Many small objects (32 KB scenario) require more traversal and copying than fewer large objects.

Key insight: Allocations ≥64 KB bypass the GC heap entirely (go straight to malloc), which is why the “large objects” scenario shows only ~73 μs—that’s just the baseline GC setup cost with nothing to copy.

See the Glossary for a complete reference of all benchmark numbers.


What This Means for Games

Let’s do the math with real data (assuming ~128 KB live data = ~6 ms pause):

┌─────────────────────────────────────────────────────┐
│  TARGET FPS    FRAME BUDGET    GC PAUSE (~6ms)      │
├─────────────────────────────────────────────────────┤
│  60 FPS        16.7 ms         ~1/3 frame stutter   │
│  30 FPS        33.3 ms         barely noticeable    │
│  20 FPS        50 ms           unnoticeable         │
└─────────────────────────────────────────────────────┘

At 60 FPS, a 6ms GC pause is noticeable but brief. Keep live data small, and pauses stay short.


Big Objects Get Special Treatment

Here’s a surprise: big allocations skip the GC entirely!

small := make([]byte, 1000)      // → GC heap
big := make([]byte, 100*1024)    // → malloc (bypasses GC!)

The threshold is 64 KB:

┌─────────────────────────────────────────────────────┐
│  SIZE           WHERE IT GOES     FREED BY          │
├─────────────────────────────────────────────────────┤
│  < 64 KB        GC heap           GC (automatic)    │
│  ≥ 64 KB        malloc            NEVER! (manual)   │
└─────────────────────────────────────────────────────┘

Wait, never? That’s right. Big objects are never automatically freed.

Why? Copying a 256 KB texture during GC would be too slow. So we skip it entirely. But that means you’re responsible for freeing it.

      ⚠️  WARNING  ⚠️
      
      Large objects (≥64 KB) are NEVER 
      automatically freed by the GC!
      
      This is a memory leak unless you
      call freeExternal() manually (see next section).

When Is This OK?

Fine: Loading a texture at game start. It lives forever anyway.

Problem: Loading new textures every level without freeing old ones.


Freeing Big Objects

Here’s how to clean up big allocations:

import "unsafe"

//extern _runtime.FreeExternal
func freeExternal(ptr unsafe.Pointer)

// Load a big texture
texture := make([]byte, 256*1024)  // 256KB, bypasses GC

// Later, when done with it:
freeExternal(unsafe.Pointer(&texture[0]))
texture = nil  // Don't use it anymore!

The best time to do this? Level transitions.

func LoadLevel(num int) {
    // Free old level's big stuff
    if oldTexture != nil {
        freeExternal(unsafe.Pointer(&oldTexture[0]))
        oldTexture = nil
    }
    
    // Load new level
    oldTexture = loadTexture(num)
    
    // Clean up small stuff too
    runtime.GC()
}

EXERCISE

3.3 You load a 128 KB texture each level. After 10 levels without calling freeExternal(), how much memory have you leaked?


Making GC Hurt Less

Techniques to reduce GC impact, validated by real benchmarks from tests/bench_gc_techniques.elf.

Technique 1: Pre-allocate Slices

Benchmark result: 78% faster!

Real numbers from Dreamcast:

  • Growing slice: 72,027 ns/iteration
  • Pre-allocated: 40,450 ns/iteration

// SLOW: Slice grows, triggers multiple allocations
var items []int
for i := 0; i < 100; i++ {
    items = append(items, i)
}

Why is this slow? A slice in Go is three things: a pointer to data, a length, and a capacity. When you append beyond capacity, Go must:

  1. Allocate a new, larger array (typically 2x the size)
  2. Copy all existing elements to the new array
  3. Abandon the old array (becomes garbage for GC to collect)

Here’s what happens in memory when appending 5 items to an empty slice:

append #1:  Allocate [_], write item         → 1 alloc, 0 copies
append #2:  Full! Allocate [_,_], copy 1     → 2 allocs, 1 copy
append #3:  Full! Allocate [_,_,_,_], copy 2 → 3 allocs, 3 copies total
append #4:  Space available, just write      → 3 allocs, 3 copies total
append #5:  Full! Allocate [_,_,_,_,_,_,_,_], copy 4 → 4 allocs, 7 copies total

For 100 items, this triggers ~7 reallocations and copies ~127 elements total (1 + 2 + 4 + … + 64 under capacity doubling). Each abandoned array is garbage that fills the heap faster.

Memory timeline (growing slice):
┌─────────────────────────────────────────────────────┐
│ [1]  ← alloc #1 (abandoned)                         │
│ [1,2]  ← alloc #2 (abandoned)                       │
│ [1,2,3,_]  ← alloc #3 (abandoned)                   │
│ [1,2,3,4,5,_,_,_]  ← alloc #4 (abandoned)           │
│ [1,2,3,4,5,6,7,8,9,...]  ← alloc #5 (current)       │
│                                                     │
│ GC must eventually clean up allocs #1-#4!           │
└─────────────────────────────────────────────────────┘
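You can watch these reallocations happen by printing cap() as you append. (The counts below reflect the usual doubling policy for small slices; exact growth can vary by Go version and runtime, so treat this as a probe, not a spec.)

```go
package main

func main() {
	var items []int
	oldCap := cap(items)
	reallocs := 0
	for i := 0; i < 100; i++ {
		items = append(items, i)
		if cap(items) != oldCap {
			// Capacity changed: a new backing array was allocated
			// and every existing element was copied into it.
			reallocs++
			println("append", i+1, "grew capacity to", cap(items))
			oldCap = cap(items)
		}
	}
	println("total allocations while growing:", reallocs)
}
```

Run the same loop with `make([]int, 0, 100)` and the count drops to zero.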

The fix: If you know (or can estimate) how many items you’ll need, pre-allocate:

// FAST: Pre-allocate with known capacity
items := make([]int, 0, 100)  // length=0, capacity=100
for i := 0; i < 100; i++ {
    items = append(items, i)
}

Memory timeline (pre-allocated):
┌─────────────────────────────────────────────────────┐
│ [_,_,_,_,_,...100 slots...]  ← single allocation    │
│ [1,_,_,_,_,...] → [1,2,_,_,...] → [1,2,3,_,...]     │
│                                                     │
│ No copying. No garbage. Just fill in the blanks.    │
└─────────────────────────────────────────────────────┘

No growing. No copying. No garbage. 78% faster.

When to use: Loading enemy spawns from a level file? You know the count. Parsing a protocol with a length header? Pre-allocate. Even a rough estimate (round up to next power of 2) beats growing from zero.


Technique 2: Object Pools

Important: Pools are NOT faster for allocation!

Real numbers from Dreamcast:

  • new() allocation: 201 ns/object
  • Pool get/return: 1,450 ns/object (7x slower!)

This is counter-intuitive if you’re coming from desktop Go or other languages. Let’s understand why.

Why is new() so fast? Our bump allocator is essentially one operation:

new(Bullet):
┌───────────────────────────────────────────────────────┐
│ alloc_ptr → [████████ used █████|▓▓▓▓ free ▓▓▓▓▓]     │
│                                 ↑                     │
│                            alloc_ptr += sizeof(Bullet)│
│                                                       │
│ Total: 1 pointer increment. Done.                     │
└───────────────────────────────────────────────────────┘

That’s it. No free lists to search. No size classes. No locking. Just bump the pointer forward. This is why 201 ns is achievable—it’s maybe 40-50 CPU cycles.

Why are pools slower? Pool operations involve slice manipulation:

GetFromPool():
┌─────────────────────────────────────────────────────┐
│ 1. Check if len(pool) > 0      ← bounds check       │
│ 2. Read pool[len-1]            ← memory access      │
│ 3. pool = pool[:len-1]         ← slice header write │
│ 4. Return pointer              ← done               │
│                                                     │
│ ReturnToPool():                                     │
│ 1. Reset object fields         ← memory writes      │
│ 2. pool = append(pool, obj)    ← may grow slice!    │
│                                                     │
│ Total: ~7x more work than bump allocation           │
└─────────────────────────────────────────────────────┘

So why use pools at all? The trade-off isn’t about allocation speed. It’s about when you pay the cost:

WITHOUT POOL (100 bullets/frame):
─────────────────────────────────────────────────────
Frame 1:  new new new new... (100x)  │ 20 μs │ smooth
Frame 2:  new new new new... (100x)  │ 20 μs │ smooth
Frame 3:  new new new new... (100x)  │ 20 μs │ smooth
  ...
Frame 50: GC TRIGGERED!              │ 6 ms  │ ← STUTTER!
─────────────────────────────────────────────────────
                                     └─ 60 FPS target = 16.6 ms
                                        6 ms pause = 1/3 frame drop


WITH POOL (100 bullets/frame):
─────────────────────────────────────────────────────
Frame 1:  get get get... return...   │ 145 μs │ smooth
Frame 2:  get get get... return...   │ 145 μs │ smooth
Frame 3:  get get get... return...   │ 145 μs │ smooth
  ...
Frame 50: (no GC needed)             │ 145 μs │ still smooth!
─────────────────────────────────────────────────────

You’re trading ~125 μs per frame for no GC pauses. For a bullet hell game, that’s worth it.

When to use pools:

  • High-frequency create/destroy (bullets, particles, audio events)
  • Objects with predictable lifetimes (spawned and despawned together)
  • When you need consistent frame times (no surprise stutters)

When NOT to use pools:

  • Objects created once and kept (player, level geometry)
  • Low churn rate (a few allocations per second)
  • Prototype/debugging (just use new(), it’s simpler)

Simple pool implementation:

type Bullet struct {
    X, Y   float32
    Active bool
}

var pool []*Bullet

func GetBullet() *Bullet {
    if len(pool) > 0 {
        b := pool[len(pool)-1]
        pool = pool[:len(pool)-1]
        return b
    }
    return new(Bullet)  // Pool empty? Allocate fresh
}

func ReturnBullet(b *Bullet) {
    b.X, b.Y, b.Active = 0, 0, false  // Reset state!
    pool = append(pool, b)
}

Pro tip: Pre-populate the pool at game start to avoid any new() calls during gameplay:

func InitBulletPool(size int) {
    pool = make([]*Bullet, size)
    for i := range pool {
        pool[i] = new(Bullet)
    }
}

Now GetBullet() never allocates during gameplay—predictable performance every frame.


Technique 3: Trigger GC at Safe Times

Benchmark: Manual GC takes ~35 μs with minimal live data

The problem with automatic GC is unpredictability. You don’t control when it runs. It just happens when the heap fills up. That might be during a boss fight.

GC pause times from real benchmarks (from bench_gc_pause.elf):

┌─────────────────────────────────────────────────────┐
│  LIVE DATA      GC PAUSE      IMPACT AT 60 FPS      │
├─────────────────────────────────────────────────────┤
│  Minimal        ~100 μs       Unnoticeable          │
│  32 KB          ~2 ms         Minor stutter         │
│  128 KB         ~6 ms         1/3 frame drop        │
└─────────────────────────────────────────────────────┘

The key insight: GC pause scales with live data, not garbage. If you trigger GC when live data is minimal (between levels, during menus), the pause is tiny.

Uncontrolled vs Controlled GC:

UNCONTROLLED (GC surprises you):
─────────────────────────────────────────────────────────────
│ Gameplay ││ Gameplay ││ Gameplay ││ GC! ││ Gameplay       │
│  smooth  ││  smooth  ││  smooth  ││6 ms!││  smooth        │
─────────────────────────────────────────────────────────────
                                      ↑
                                 Player notices!
                                 "Why did it stutter
                                  when I jumped?"


CONTROLLED (you choose when):
─────────────────────────────────────────────────────────────
│ Gameplay ││ Menu Opens ││ Gameplay ││ Level End ││ Next   │
│  smooth  ││ GC (35 μs) ││  smooth  ││ GC (35 μs)││Level   │
─────────────────────────────────────────────────────────────
              ↑                         ↑
         Player is reading         Victory animation
         menu anyway               playing anyway

How to trigger GC manually:

//go:linkname forceGC runtime.GC
func forceGC()

Best times to trigger GC (player won’t notice):

func OnDialogueStart() {
    forceGC()  // Text appearing letter-by-letter anyway
}

func OnMenuOpen() {
    forceGC()  // Player is reading options
}

func OnLevelComplete() {
    forceGC()  // Victory fanfare playing, score tallying
}

func OnLoadingScreen() {
    forceGC()  // Already showing "Loading..."
}

func OnRoomTransition() {
    forceGC()  // Screen is fading to black
}

func OnCutsceneStart() {
    forceGC()  // Video/animation taking over
}

Important caveats:

  1. Don’t trigger too often. GC still takes time. Once per scene transition is reasonable. Once per frame defeats the purpose.

  2. This doesn’t reduce garbage. You’re just choosing when to pay the cost. Combine with pre-allocation and pools to reduce how much garbage you create.

  3. Live data still matters. If you have 128 KB of permanent game state, even manual GC takes ~6 ms. Keep live data lean.

Good: Trigger GC → level enemies/items are garbage → fast GC
Bad:  Trigger GC → 10,000 persistent objects → slow GC anyway
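Putting caveat 3 into practice: drop references to per-level data before triggering the collection, so it counts as garbage rather than live data when the GC scans. A sketch — `levelEnemies`, `levelParticles`, and the stubbed `forceGC` are illustrative names, not libgodc APIs:

```go
package main

type Enemy struct{ X, Y float32 }
type Particle struct{ Life int }

var (
	levelEnemies   []*Enemy
	levelParticles []*Particle
)

// forceGC stands in for the runtime.GC linkname trick shown earlier.
func forceGC() { /* runtime.GC() */ }

func OnLevelComplete() {
	// Drop references FIRST, so per-level objects are garbage, not live data:
	levelEnemies = nil
	levelParticles = nil
	forceGC() // the pause now scales only with what survived
}

func main() {
	levelEnemies = []*Enemy{{10, 20}, {30, 40}}
	OnLevelComplete()
	println("enemies after level end:", len(levelEnemies))
}
```

Order matters: trigger the GC while the references still exist and every one of those objects gets copied to to-space for nothing.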

Technique 4: Reuse Slices

Benchmark: 5% faster (13,200 ns → 12,500 ns)

Small gain per-call, but the real win is less garbage over time. Reset with [:0] instead of allocating new:

// BAD: New allocation every frame
func ProcessFrame() {
    items := make([]int, 0, 100)  // ← garbage next frame
    // ...
}

// GOOD: Reuse backing array
var items = make([]int, 0, 100)  // Allocate once

func ProcessFrame() {
    items = items[:0]  // Reset length, keep capacity
    // ...
}

The [:0] trick keeps the backing array. Over 1000 frames: 1 allocation instead of 1000.
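You can verify that [:0] keeps the backing array: after a reset, the next append lands at the same address. (This follows from Go slice semantics — capacity is retained, so append reuses the existing array.)

```go
package main

func main() {
	items := make([]int, 0, 100)
	items = append(items, 1, 2, 3)
	first := &items[0] // remember where the data lives

	items = items[:0]        // reset length, keep capacity
	items = append(items, 9) // no allocation: reuses the array

	if first == &items[0] {
		println("same backing array reused")
	}
}
```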

Bonus pattern—shift without allocating:

// Creates new slice header:
queue = append(queue[1:], newItem)

// Reuses existing array:
copy(queue, queue[1:])
queue[len(queue)-1] = newItem
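Wrapped up as a runnable sketch of a fixed-capacity queue (`push` is our name for the pattern, not a libgodc API):

```go
package main

// push appends newItem, shifting out the oldest item once the
// queue is at capacity — the backing array is never reallocated.
func push(queue []int, newItem int) []int {
	if len(queue) < cap(queue) {
		return append(queue, newItem) // room left: no shift needed
	}
	copy(queue, queue[1:]) // drop oldest, reuse existing array
	queue[len(queue)-1] = newItem
	return queue
}

func main() {
	q := make([]int, 0, 3) // single allocation, ever
	for i := 1; i <= 5; i++ {
		q = push(q, i)
	}
	println(q[0], q[1], q[2]) // prints 3 4 5: the two oldest were shifted out
}
```

The copy is O(n) per push, but for short game-side queues (input buffers, recent events) that beats generating a fresh slice header and garbage every frame.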

Technique 5: Compact In-Place

When entities die, don’t allocate a filtered slice. Compact the existing one:

// BAD: Allocates new slice
alive := make([]*Enemy, 0)
for _, e := range enemies {
    if e.Active {
        alive = append(alive, e)  // ← garbage
    }
}
enemies = alive

// GOOD: Compact in place
n := 0
for _, e := range enemies {
    if e.Active {
        enemies[n] = e
        n++
    }
}
enemies = enemies[:n]  // Shrink, no allocation

Visual:

Before: [A, _, B, _, _, C]  (3 active, 3 dead)
         ↓ compact
After:  [A, B, C]           (same backing array, shorter length)

Classic game loop pattern: every frame, compact dead bullets/particles/enemies without touching the allocator.

Goroutines

The Trade-off

Let me set expectations: goroutines on Dreamcast work, but differently than on modern hardware.

You get zero parallelism (single CPU), but you get everything else: clean concurrency primitives, channels, and code that feels like Go.

Here’s the thing. Goroutines shine when you have multiple CPU cores:

Modern PC (8 cores):
────────────────────────────────────────────────────────────
Core 1: [──────goroutine A──────]
Core 2: [──────goroutine B──────]
Core 3: [──────goroutine C──────]
Core 4: [──────goroutine D──────]
...
        ↑
        All running SIMULTANEOUSLY
        4x faster than running them one-by-one!

But Dreamcast?

Dreamcast (1 core):
────────────────────────────────────────────────────────────
CPU:    [───A───][───B───][───A───][───C───][───B───]...
        ↑
        Only ONE runs at a time
        ZERO parallelism benefit

So why does libgodc implement them?


Why Bother?

Because Go without goroutines isn’t Go.

Imagine porting Python to a machine without lists. Or JavaScript without callbacks. You could do it, but would it feel like the same language?

I wanted Go on Dreamcast to feel like Go. You can write:

go processEnemies()
go playBackgroundMusic()
go handleInput()

It works. It’s correct. The code is cleaner. It’s just not faster than calling them directly:

processEnemies()
playBackgroundMusic()
handleInput()

There’s overhead—but less than you might expect. Let’s see the numbers.


What Happens Under the Hood

When you create a goroutine, here’s what actually happens:

┌─────────────────────────────────────────────────────────────┐
│   go doSomething()                                          │
│   ────────────────                                          │
│                                                             │
│   1. Allocate 64 KB stack (from pool or malloc)             │
│   2. Initialize G struct (~150 bytes)                       │
│   3. Save 16 CPU registers to context                       │
│   4. Set up context (sp, pc, pr)                            │
│   5. Add to run queue                                       │
│   6. Later: context switch to run (~6.6 μs)                 │
│   ─────────────────────────────────────────────────────     │
│   Total spawn + first run: ~32 μs                           │
│                                                             │
│   That's ~6,400 CPU cycles per goroutine spawn!             │
└─────────────────────────────────────────────────────────────┘

What do you get for this overhead? On a multi-core system: parallelism. On Dreamcast: proper Go semantics and working concurrency primitives. That’s actually worth something!

The Numbers

I ran benchmarks on real Dreamcast hardware (from bench_architecture.elf):

┌─────────────────────────────────────────────────────────────┐
│   OPERATION               TIME                              │
├─────────────────────────────────────────────────────────────┤
│   runtime.Gosched()       120 ns      ← very cheap!         │
│   Buffered channel op     ~1.5 μs                           │
│   Context switch          ~6.6 μs                           │
│   Channel round-trip      ~13 μs                            │
│   Goroutine spawn+run     ~34 μs                            │
└─────────────────────────────────────────────────────────────┘

At 200 MHz, you get about 200 million cycles per second. At 60 FPS you have 3.3 million cycles per frame. A 34 μs goroutine spawn is ~6,800 cycles—that’s only 0.2% of your frame budget. You can afford a few goroutines per frame, just don’t spawn hundreds!

See the Glossary for a complete reference of all benchmark numbers.


How It Works

The implementation is pretty elegant for a 200 MHz machine. Let’s see how we create the illusion of concurrency.

The G Struct

Every goroutine is a G structure (see runtime/goroutine.h):

┌─────────────────────────────────────────────────────────────┐
│   Goroutine (G)                                             │
│                                                             │
│   _panic:     nil         (current panic - offset 0)        │
│   _defer:     nil         (deferred functions - offset 4)   │
│   atomicstatus: Grunning  (or Gwaiting, Grunnable, etc.)    │
│   schedlink:  next G      (run queue linkage)               │
│   stack_lo:   0x8c100000  (bottom of stack)                 │
│   stack_hi:   0x8c110000  (top of stack, 64 KB above)       │
│   context:    saved CPU registers (64 bytes)                │
│                           ├── r8-r14 (callee-saved GPRs)    │
│                           ├── sp, pc, pr (special)          │
│                           └── fr12-fr15, fpscr, fpul (FPU)  │
│   goid:       42          (unique ID - 8 bytes)             │
│   waiting:    sudog*      (channel wait queue entry)        │
│   checkpoint: ptr         (for panic/recover)               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The key is context, aka the saved CPU registers. This lets us pause mid-function and resume later.

The Run Queue

Runnable goroutines wait in line:

     head                                    tail
       ↓                                       ↓
    ┌────┐   ┌────┐   ┌────┐   ┌────┐
    │ G3 │──▶│ G7 │──▶│ G2 │──▶│ G9 │──▶ NULL
    └────┘   └────┘   └────┘   └────┘
      ↑
   "I'm next!"

The scheduler is simple:

while (true) {
    G *gp = runq_get();       // Get next goroutine
    if (gp) {
        switch_to(gp);        // Run it
    }
    // When it yields, we come back here
}

Context Switching

This is where the magic happens. We’re running goroutine A, and we need to switch to B:

STEP 1: Save A's registers to A's context
────────────────────────────────────────────────────────
        CPU                         A's Context
    ┌─────────┐                   ┌─────────┐
    │ r8 = 42 │ ────────────────▶ │ r8 = 42 │
    │ r9 = 17 │                   │ r9 = 17 │
    │ sp = X  │                   │ sp = X  │
    │ pc = Y  │                   │ pc = Y  │
    └─────────┘                   └─────────┘


STEP 2: Load B's registers from B's context  
────────────────────────────────────────────────────────
    B's Context                       CPU
    ┌─────────┐                   ┌─────────┐
    │ r8 = 99 │ ────────────────▶ │ r8 = 99 │
    │ r9 = 55 │                   │ r9 = 55 │
    │ sp = P  │                   │ sp = P  │
    │ pc = Q  │                   │ pc = Q  │
    └─────────┘                   └─────────┘


STEP 3: Return (now running B!)
────────────────────────────────────────────────────────
CPU continues from B's saved PC with B's saved registers.
To B, it's like it never stopped running!

On SH-4, we save/restore 16 registers (64 bytes). The full context switch with FPU takes ~88 cycles. With lazy FPU optimization (skipping FPU for integer-only goroutines), it drops to ~38 cycles. At 200 MHz, that’s under 0.5 microseconds—the total yield path including scheduler overhead is ~6.6 μs as shown in the benchmarks.


Cooperative Scheduling: The Gotcha

Our scheduler is cooperative, not preemptive. This is different from official Go!

Preemptive (official Go since 1.14): The runtime can forcibly pause a goroutine at any time using timer interrupts or signals. Even an infinite loop gets interrupted so other goroutines can run.

Cooperative (libgodc): Goroutines must volunteer to give up the CPU. The runtime never forces a switch. If a goroutine doesn’t yield, nothing else runs.

Why the difference? Preemptive scheduling requires:

  • Signal handlers or timer interrupts to interrupt running code
  • Complex stack inspection to find safe preemption points
  • More saved state per context switch

On Dreamcast, we keep it simple. The cost is that you must be careful:

// This freezes your Dreamcast (but works fine in official Go!):
func badGoroutine() {
    x := 0
    for {
        x++  // Infinite loop, never yields
    }
}

Where Goroutines Yield

┌─────────────────────────────────────────────────────────────┐
│   YIELDS (lets others run)         DOESN'T YIELD            │
├─────────────────────────────────────────────────────────────┤
│   ✓ Channel send: ch <- x          ✗ Math: x + y * z        │
│   ✓ Channel receive: <-ch          ✗ Memory: array[i]       │
│   ✓ time.Sleep()                   ✗ Loops: for i := ...    │
│   ✓ runtime.Gosched()                                       │
│   ✓ select {}                                               │
└─────────────────────────────────────────────────────────────┘

The Fix for Long Computations

// Bad: No yields for 10 million iterations
for i := 0; i < 10000000; i++ {
    result += compute(i)
}

// Good: Yield periodically
for i := 0; i < 10000000; i++ {
    result += compute(i)
    if i % 10000 == 0 {
        runtime.Gosched()  // Let others run
    }
}

Note: if you have a single long computation with no natural yield points, a direct function call is simpler. Goroutines shine when you have multiple things that can interleave.


When Goroutines Shine

Goroutines work well for several patterns. Here’s real benchmark data from bench_goroutine_usecase.elf:

┌─────────────────────────────────────────────────────────────┐
│   USE CASE                    OVERHEAD    VERDICT           │
├─────────────────────────────────────────────────────────────┤
│   Multiple independent tasks  10-38%      ✓ Acceptable      │
│   Producer-consumer pattern   ~163%       ⚠ Use carefully   │
│   Channel ping-pong           ~13 μs/op   Know the cost     │
└─────────────────────────────────────────────────────────────┘

The key insight: independent tasks (each goroutine does its own work, minimal channel communication) have reasonable overhead (typically ~25%, varies with scheduling). Heavy channel use (producer-consumer with many sends) costs ~163%.

Porting Existing Go Code

If you’re porting Go code that uses goroutines, it works without modification:

// This Go code just works:
func fetch(urls []string) []Result {
    ch := make(chan Result, len(urls))
    for _, url := range urls {
        go func(u string) {
            ch <- download(u)
        }(url)
    }
    // Collect one result per URL
    results := make([]Result, 0, len(urls))
    for range urls {
        results = append(results, <-ch)
    }
    return results
}

Patterns to Avoid

Some patterns don’t make sense on a single-core system:

Don’t: Spawn Per-Item

// Inefficient: 1000 spawns = 32 ms overhead
for i := 0; i < 1000; i++ {
    go process(items[i])
}

// Better: Process directly, or use one goroutine
for i := 0; i < 1000; i++ {
    process(items[i])
}

Don’t: Force Sequential With Channels

// Overcomplicated: These are sequential anyway
go step1()
<-done1
go step2()
<-done2

// Simpler:
step1()
step2()

Be Careful: Heavy Channel Traffic

// Each channel op is ~13 μs
// High-volume producer-consumer shows ~163% overhead
for item := range items {
    workChan <- item
}

For high-throughput paths, batch items or use direct calls.

Panic and Recover

Two Kinds of Errors

Most errors in Go are… boring. And that’s good! You handle them like this:

file, err := openFile("game.sav")
if err != nil {
    // No saved game? No problem.
    // Start a new game instead.
}

The function tells you something went wrong, and you decide what to do. Maybe you retry. Maybe you use a default. Maybe you tell the user. It’s your choice.

But some errors are different. They’re programmer mistakes:

enemies := []Enemy{orc, goblin, troll}
enemy := enemies[99]  // WAIT. There's only 3 enemies!

This isn’t “the file doesn’t exist.” This is “the code is broken.” There’s no sensible way to continue.

This is when Go panics.


What Happens When You Panic

Here’s the sequence, step by step:

                  Normal Execution
                        ↓
        ┌───────────────────────────────┐
        │  enemies := []Enemy{...}      │
        │  enemy := enemies[99]         │ ← PANIC!
        │  moveEnemy(enemy)             │ ← never runs
        └───────────────────────────────┘
                        ↓
              EXECUTION STOPS
                        ↓
        ┌───────────────────────────────┐
        │  Run all deferred functions   │
        │  (in reverse order!)          │
        └───────────────────────────────┘
                        ↓
          Did any defer call recover()?
                  /           \
                YES             NO
                 ↓               ↓
        Program continues   Program dies

The key insight: deferred functions always run, even during a panic. This is Go's cleanup guarantee. (There are a few truly pathological cases where it breaks down, e.g. a panic before runtime init or too many nested panics.)


Defer: The Cleanup Crew

Before we talk more about panic, let’s understand defer. It’s simple but powerful.

func processEnemy(e *Enemy) {
    file := openLog("combat.log")
    defer closeLog(file)  // "Remember to do this when I leave!"
    
    damage := calculateDamage(e)
    applyDamage(e, damage)
    
    // closeLog runs here, automatically
}

The defer keyword says: “Don’t run this now. Run it when the function exits.”

No matter how you exit—return, panic, whatever—the deferred function runs.

Multiple Defers: LIFO

If you have multiple defers, they run in reverse order. Last in, first out. Like a stack of plates:

func setup() {
    defer println("First defer")   // Runs 3rd
    defer println("Second defer")  // Runs 2nd
    defer println("Third defer")   // Runs 1st
    println("Normal code")
}

// Output:
// Normal code
// Third defer
// Second defer
// First defer

Why reverse order? Think about it: if you opened file A, then file B, you want to close B before A. The last thing you set up is the first thing you tear down.

Visualizing the Defer Chain

Each goroutine maintains a linked list of deferred functions:

G.defer → [cleanup3] → [cleanup2] → [cleanup1]
            newest                    oldest
             runs                      runs
             first                     last

When the function returns (or panics):

  1. Pop cleanup3, run it
  2. Pop cleanup2, run it
  3. Pop cleanup1, run it
  4. Done!

Recover: Catching the Fall

Here’s the safety net. recover() catches a panic mid-flight:

func safeGameLoop() {
    if runtime_checkpoint() != 0 {
        // We land here after recovering from a panic
        // libgodc needs this, if you are going to use "recover" mechanisms
        println("Recovered! Returning to main menu...")
        return
    }
    
    defer func() {
        if r := recover(); r != nil {
            println("Caught panic:", r)
        }
    }()
    
    runGame()  // If this panics, we catch it!
}

func main() {
    safeGameLoop()
    println("Program continues!")  // This runs even after panic!
}

Note: libgodc requires runtime_checkpoint() for recover to work properly. Without it, even a successful recover() will terminate the program. Standard Go handles this automatically via DWARF unwinding, but we use setjmp/longjmp instead (explained later in this chapter).

Let’s trace what happens:

1. safeGameLoop() starts
2. runtime_checkpoint() saves recovery point, returns 0
3. defer registers our recovery function
4. runGame() starts
5. ... something bad happens ...
6. PANIC!
7. Deferred function runs
8. recover() catches the panic, marks it recovered
9. longjmp back to checkpoint, runtime_checkpoint() returns 1
10. "Recovered!" prints, function returns normally
11. "Program continues!" prints

The panic was caught. The program lives.


The Golden Rule

Here’s the catch: recover only works inside a deferred function.

// THIS WORKS ✓
defer func() {
    recover()  // Called directly in defer
}()

// THIS DOESN'T WORK ✗
recover()  // Not in a defer—does nothing!

Why? Because recover needs to intercept the panic during the cleanup phase. If you’re not in a defer, you’re not in cleanup mode.

libgodc note: Standard Go is even stricter—recover must be called directly in the defer, not in a helper function. We relaxed this rule because it’s complex to implement and the behavior difference is benign for games. More panics get caught, which is fine.


How We Implement It

Standard Go uses something called DWARF unwinding. It’s sophisticated: the compiler generates detailed metadata about every function’s stack layout, and a runtime library uses this to carefully walk back up the stack.

That’s a lot of complexity, and we don’t have DWARF unwinding support on Dreamcast yet.

Instead, we use an old C trick: setjmp/longjmp.

The Teleportation Trick

Imagine setjmp as dropping a bookmark:

#include <setjmp.h>
#include <stdio.h>

jmp_buf bookmark;

if (setjmp(bookmark) == 0) {
    // First time through: setjmp returns 0
    printf("Starting...\n");
    doRiskyThing();
    printf("Made it!\n");
} else {
    // After longjmp: setjmp returns 1
    printf("Something went wrong!\n");
}

And longjmp teleports you back to that bookmark:

void doRiskyThing() {
    // ...
    if (disaster) {
        longjmp(bookmark, 1);  // TELEPORT!
    }
    // ...
}

When longjmp is called, execution jumps back to setjmp, which now returns 1 instead of 0. All the function calls in between? Gone. Skipped. Like they never happened.

The Recovery Path

┌─────────────────────────────────────────────────────────────┐
│   PANIC WITH CHECKPOINT                                     │
│                                                             │
│   func risky() {                                            │
│       if runtime_checkpoint() != 0 {                        │
│           return  // Recovered! Continue here.              │
│       }                                                     │
│       defer func() {                                        │
│           recover()                                         │
│       }()                                                   │
│       panic("oops")  // longjmp to checkpoint               │
│   }                                                         │
│                                                             │
│   → Clean, predictable                                      │
│   → Required for recover() to work in libgodc               │
└─────────────────────────────────────────────────────────────┘

Important: Without runtime_checkpoint(), calling recover() will still mark the panic as recovered, but the program will terminate with “FATAL: recover without checkpoint”. The checkpoint is required for proper recovery in libgodc.


When Nobody Catches the Panic

If no recover catches the panic, the program dies. On Dreamcast, you’ll see:

panic: index out of range [99] with length 3

goroutine 1 [running]:
  0x8c010234
  0x8c010456
  0x8c010678

Memory: arena=4194304 used=1258291 free=2936013

The console halts. The user has to manually reset. This is intentional. A crash is better than continuing with corrupted state and zombies.


When Should You Panic?

Here’s the decision tree:

Is this a programmer mistake?
        │
        ├── YES → Maybe panic is okay
        │           ├── nil pointer dereference
        │           ├── index out of bounds
        │           └── calling method on nil
        │
        └── NO → DON'T PANIC. Return an error.
                    ├── File not found
                    ├── Network timeout
                    ├── Invalid user input
                    └── Resource unavailable

When Recover Makes Sense

Use recover at boundaries—places where you want to contain failures. In libgodc, remember to use runtime_checkpoint():

func handleEventSafely(event Event) {
    if runtime_checkpoint() != 0 {
        println("Event handler crashed, continuing...")
        return
    }
    
    defer func() {
        if r := recover(); r != nil {
            println("Caught:", r)
        }
    }()
    
    handleEvent(event)  // If this panics, we catch it
}

One bad event handler shouldn’t kill the entire game.

For general Go error handling best practices (when to panic vs return errors), see Effective Go.


Data Structures

Part 1: Strings

The Million-Dollar Question

How long is this string?

"Hello, Dreamcast!"

In C, you have to count:

char *msg = "Hello, Dreamcast!";
int len = 0;
while (msg[len] != '\0') {  // Keep going until null byte
    len++;
}

// H  e  l  l  o  ,     D  r  e  a  m  c  a  s  t  !  \0
// 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
// len is now 17... but we checked 18 characters!

C strings end with a special “null byte” (\0). To find the length, you walk through every character until you hit it. For a 10,000-character string, that’s 10,000 checks.

Go strings are smarter. They remember their length:

┌──────────────┐      ┌───┬───┬───┬───┬───┐
│  str: ───────┼─────▶│ h │ e │ l │ l │ o │
│  len: 5      │      └───┴───┴───┴───┴───┘
└──────────────┘

In libgodc, this is an 8-byte structure (on 32-bit Dreamcast):

// From runtime/runtime.h, see GoString C struct
typedef struct {
    const uint8_t *str;  // 4 bytes: pointer to character data
    intptr_t len;        // 4 bytes: length in bytes
} GoString;

Unlike C strings (null-terminated), Go strings store their length explicitly. This means:

  • O(1) length lookup: just read the len field
  • Can contain null bytes: no special terminator needed
  • Bounds checking: we know exactly where the string ends

String Allocation

Strings are immutable. Every concatenation allocates new memory:

s := "foo" + "bar"  // Allocates 6 bytes, copies both strings

Repeated concatenation in a loop is O(n²), where each iteration copies all previous data. This is a common Go performance pitfall; see Effective Go for solutions.

The tmpBuf Optimization

Here’s a secret: libgodc cheats for short strings.

When you concatenate strings that total ≤32 bytes, we use a stack buffer instead of allocating from the heap:

"a" + "b" = "ab"

Stack (fast):  ┌────────────────────────────────┐
               │ a │ b │   │   │ ... │   │   │  │  32 bytes
               └────────────────────────────────┘

No GC allocation needed!

This happens automatically. You don’t have to do anything—the compiler passes a stack buffer to the runtime, and we use it when we can.


Part 2: Slices

The Three-Part Header

A slice is not just a pointer. It’s a small header struct with three fields:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Slice: []int with values [10, 20, 30]                     │
│                                                             │
│   ┌────────────────┐        ┌─────┬─────┬─────┬─────┬─────┐ │
│   │  array: ───────────────▶│ 10  │ 20  │ 30  │  ?  │  ?  │ │
│   │  len:   3      │        └─────┴─────┴─────┴─────┴─────┘ │
│   │  cap:   5      │             ▲           ▲              │
│   └────────────────┘          length      capacity          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
  • array: pointer to the underlying data
  • len: how many elements are currently in use
  • cap: how many elements could fit before reallocation

Think of it like a notebook. You have 100 pages (capacity), but you’ve only written on 30 (length).

The Magic of Slicing

Here’s the trick that makes Go slices amazing. When you “slice” a slice, no data is copied:

a := []int{10, 20, 30, 40, 50}
b := a[1:4]  // b is [20, 30, 40]

What actually happens:

Underlying array:
┌─────┬─────┬─────┬─────┬─────┐
│ 10  │ 20  │ 30  │ 40  │ 50  │
└─────┴─────┴─────┴─────┴─────┘
  ▲     ▲
  │     │
  │     └── b.array points here
  │         b.len = 3
  │         b.cap = 4
  │
  └── a.array points here
      a.len = 5
      a.cap = 5

Both a and b point to the same memory. Slicing is O(1) — just create a new 12-byte header.

The Sharing Trap

But wait. If they share memory…

a := []int{10, 20, 30, 40, 50}
b := a[1:4]

b[0] = 999  // What happens to a?
After b[0] = 999:
┌─────┬─────┬─────┬─────┬─────┐
│ 10  │ 999 │ 30  │ 40  │ 50  │
└─────┴─────┴─────┴─────┴─────┘
  ▲     ▲
  │     │
  a     b

a is now [10, 999, 30, 40, 50]!

Both slices see the change! This is usually a bug waiting to happen.

If you need independent data, use copy:

b := make([]int, 3)
copy(b, a[1:4])  // b has its own data now

How libgodc Implements copy

When you write copy(dst, src), what actually happens?

Step 1: Figure out how many elements to copy
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   dst has room for 3       src has 5 elements               │
│   ┌───┬───┬───┐            ┌───┬───┬───┬───┬───┐            │
│   │   │   │   │            │ A │ B │ C │ D │ E │            │
│   └───┴───┴───┘            └───┴───┴───┴───┴───┘            │
│                                                             │
│   Copy min(3, 5) = 3 elements                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Step 2: Calculate byte size
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   3 elements × 4 bytes each (int) = 12 bytes                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Step 3: copy the bytes safely (aka memmove in C)
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   src:  ████████████░░░░░░░░  (copy first 12 bytes)         │
│              │                                              │
│              ▼                                              │
│   dst:  ████████████                                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Step 4: Return 3 (number of elements copied)

Why memmove instead of memcpy? Because slices can overlap:

s := []int{1, 2, 3, 4, 5}
copy(s[1:], s[:4])  // Shift elements right — overlapping!

memmove handles this safely. memcpy would corrupt the data.

Growing Slices: The append Dance

What happens when you append beyond capacity?

s := make([]int, 3, 4)  // len=3, cap=4
s = append(s, 10)       // len=4, cap=4 — fits!
s = append(s, 20)       // len=5, cap=??? — doesn't fit!

libgodc allocates a new, bigger array:

Before:
┌─────┬─────┬─────┬─────┐
│  0  │  0  │  0  │ 10  │  cap=4, FULL
└─────┴─────┴─────┴─────┘

After append(s, 20):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│  0  │  0  │  0  │ 10  │ 20  │     │     │     │  cap=8, NEW ARRAY
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Old array becomes garbage (GC will clean it up).

libgodc’s Growth Strategy

Standard Go doubles capacity for small slices and grows by 25% for large ones. But Dreamcast only has 16MB RAM, so libgodc is more conservative by design:

┌─────────────────────────────────────────────────────────────┐
│   libgodc growth algorithm (runtime_growslice)              │
│                                                             │
│   if capacity < 64:                                         │
│       new_cap = capacity × 2      ← Double (same as std Go) │
│   else:                                                     │
│       new_cap = capacity × 1.125  ← Only 12.5% growth!      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Slice size              Standard Go   libgodc
Small (< 64 elements)   Double        Double
Large (≥ 64 elements)   +25%          +12.5%

Why the difference? On a 16MB system, aggressive doubling wastes precious memory. A 10,000-element slice growing by 25% allocates 2,500 extra slots. At 12.5%, that’s only 1,250, so half the waste.

Pro tip: If you know how big you’ll need, then pre-allocate!

// Bad: many reallocations
enemies := []Enemy{}
for i := 0; i < 100; i++ {
    enemies = append(enemies, loadEnemy(i))
}

// Good: one allocation
enemies := make([]Enemy, 0, 100)
for i := 0; i < 100; i++ {
    enemies = append(enemies, loadEnemy(i))
}

Part 3: Maps

The Problem: Finding Things Fast

Suppose you’re building an item shop for your game. You have a price list:

type Item struct {
    Name  string
    Price int
}

items := []Item{
    {"Potion", 50},
    {"Sword", 300},
    {"Shield", 250},
    {"Bow", 200},
    // ... 100 more items
}

A customer asks: “How much is the Bow?”

You have to search through every item:

for _, item := range items {
    if item.Name == "Bow" {
        return item.Price
    }
}

If the item list has 100 items, you might check up to 100 items. That’s O(n) time.

Now imagine you have a friend named Maggie who has memorized every item and its price. You ask “How much is the Bow?” and she instantly says “200 gold!”

Maggie gives you the answer in O(1) time — constant time. It doesn’t matter if there are 10 items or 10,000. She just knows.

How do you get a “Maggie”?

You use a hash table. In Go, that’s a map.

Building Your Own Maggie

A hash table combines two things:

  1. A hash function that turns keys into numbers
  2. An array to store the values

Let’s build one step by step. Start with an empty array of 5 slots:

┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│       │       │       │       │       │
└───────┴───────┴───────┴───────┴───────┘

Now we need a hash function. A hash function takes a string and returns a number. Here’s the important part:

  • It must be consistent: “Potion” always returns the same number.
  • It should spread things out: different strings should (usually) give different numbers.

Let’s add the price of a Potion. We feed “Potion” into the hash function:

hash("Potion") → 7392
7392 % 5 = 2  ← slot 2!

We store the price (50) at index 2:

┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│       │       │ 50    │       │       │
│       │       │Potion │       │       │
└───────┴───────┴───────┴───────┴───────┘

Now add the Sword (300 gold):

hash("Sword") → 4281
4281 % 5 = 1  ← slot 1!
┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│       │ 300   │ 50    │       │       │
│       │ Sword │Potion │       │       │
└───────┴───────┴───────┴───────┴───────┘

Add the Shield and Bow:

hash("Shield") % 5 = 0
hash("Bow") % 5 = 4
┌───────┬───────┬───────┬───────┬───────┐
│   0   │   1   │   2   │   3   │   4   │
├───────┼───────┼───────┼───────┼───────┤
│ 250   │ 300   │ 50    │       │ 200   │
│Shield │ Sword │Potion │       │ Bow   │
└───────┴───────┴───────┴───────┴───────┘

Now when someone asks “How much is the Bow?”:

  1. hash("Bow") % 5 = 4
  2. Look at slot 4
  3. It’s 200 gold!

No searching! The hash function tells you exactly where to look. This is O(1) — constant time.

You just built a “Maggie”!

Collisions: When Two Keys Want the Same Slot

Here’s a problem. What if two items hash to the same slot?

hash("Potion") % 5 = 2
hash("Scroll") % 5 = 2  ← Same slot!

Oh no! Potions are already in slot 2. If we put Scrolls there, we’ll overwrite Potions!

This is called a collision. There are different ways to handle it. Go uses a simple approach: store both items in the same slot using a small list.

┌───────┬───────┬────────────────────┬───────┬───────┐
│   0   │   1   │         2          │   3   │   4   │
├───────┼───────┼────────────────────┼───────┼───────┤
│ 250   │ 300   │ Potion→50          │       │ 200   │
│Shield │ Sword │ Scroll→75          │       │ Bow   │
└───────┴───────┴────────────────────┴───────┴───────┘

Now when you look up “Scroll”:

  1. hash("Scroll") % 5 = 2
  2. Look at slot 2
  3. Check if “Potion” matches — no
  4. Check if “Scroll” matches — yes! Return 75.

It takes a tiny bit longer, but it works.

The Worst Case: Everyone in One Slot

What if you’re really unlucky and every item hashes to the same slot?

┌───────┬───────┬──────────────────────────┬───────┬───────┐
│   0   │   1   │            2             │   3   │   4   │
├───────┼───────┼──────────────────────────┼───────┼───────┤
│       │       │ Potion→50                │       │       │
│       │       │ Sword→300                │       │       │
│       │       │ Shield→250               │       │       │
│       │       │ Bow→200                  │       │       │
│       │       │ Scroll→75                │       │       │
└───────┴───────┴──────────────────────────┴───────┴───────┘

Now looking up “Scroll” requires checking 5 items. That’s just as slow as a regular list!

This is the worst case: O(n) instead of O(1).

Two things prevent this:

  1. Good hash functions spread keys evenly
  2. Resizing — when the table gets too full, Go makes it bigger

The Tophash Optimization

Each bucket stores a “tophash” — the top 8 bits of the hash — for quick rejection:

Bucket 2:
┌─────────────────────────────────────────────────┐
│ tophash: [a3] [7f] [  ] [  ] [  ] [  ] [  ] [  ]│
│ keys:    [Potion] [Scroll] [  ] [  ] [  ] [  ]  │
│ values:  [  50  ] [  75  ] [  ] [  ] [  ] [  ]  │
└─────────────────────────────────────────────────┘

When looking up “Sword” (tophash = 0xb2):

  1. Check if 0xb2 == 0xa3? No. Skip.
  2. Check if 0xb2 == 0x7f? No. Skip.
  3. Not found!

We didn’t even compare the full strings. The tophash check is super fast.

Performance Comparison

┌─────────────────────────────────────────────────────────────┐
│   Hash Table vs Array: Searching 100 elements               │
│                                                             │
│   Array (linear search):                                    │
│   ┌───────────────────────────────────────────────────────┐ │
│   │ Average: check 50 elements                            │ │
│   │ Worst:   check 100 elements                           │ │
│   │ Time:    O(n)                                         │ │
│   └───────────────────────────────────────────────────────┘ │
│                                                             │
│   Hash Table (map):                                         │
│   ┌───────────────────────────────────────────────────────┐ │
│   │ Average: check 1 element                              │ │
│   │ Worst:   check all elements (very rare!)              │ │
│   │ Time:    O(1) average                                 │ │
│   └───────────────────────────────────────────────────────┘ │
│                                                             │
│   With 1,000,000 elements:                                  │
│   • Array: up to 1,000,000 checks                           │
│   • Map:   still just ~1 check!                             │
└─────────────────────────────────────────────────────────────┘

How libgodc Implements Maps

libgodc’s map implementation is tuned for the Dreamcast’s SH-4 CPU and 16MB memory limit.

The GoMap header (28 bytes):

┌─────────────────────────────────────────────────────────────┐
│   GoMap Structure                                           │
│                                                             │
│   ┌──────────────┬──────────────────────────────────────┐   │
│   │ count        │ Number of entries                    │   │
│   │ flags + B    │ State flags + log2(bucket count)     │   │
│   │ hash0        │ Random seed (different per map!)     │   │
│   │ buckets ─────────▶ Current bucket array             │   │
│   │ oldbuckets ──────▶ Old buckets (during resize)      │   │
│   │ nevacuate    │ Resize progress counter              │   │
│   └──────────────┴──────────────────────────────────────┘   │
│                                                             │
│   Total: 28 bytes (compact for Dreamcast's limited RAM)     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

SH-4 optimized hashing:

The hash function uses wyhash, a fast 32-bit algorithm that takes advantage of SH-4’s dmuls.l instruction (32×32→64 multiply):

┌─────────────────────────────────────────────────────────────┐
│   Hash("Potion", seed=0x12345678)                           │
│                                                             │
│   Step 1: Mix 4 bytes at a time                             │
│           wymix32(h ^ "Poti", 0x9E3779B9)                   │
│                                                             │
│   Step 2: Handle remaining bytes                            │
│           wymix32(h ^ "on\0\0", 0x85EBCA6B)                 │
│                                                             │
│   Step 3: Final mix with length                             │
│           wymix32(h, 6)  →  0x7A3B2C1D                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Dreamcast-specific limits:

Setting            libgodc            Standard Go
Max bucket shift   15 (32K buckets)   ~24 (16M buckets)
Hash seed source   Dreamcast timer    OS random
Prefetch hint      SH-4 pref @Rn      Platform-specific

Lazy allocation for small maps:

items := make(map[string]int)  // No buckets yet!
items["key"] = 1               // NOW buckets are allocated

This saves memory when you create maps that might stay empty.

The Nil Map Trap

This is the #1 map bug for Go beginners:

var inventory map[string]int  // nil map

// Reading: works! Returns zero value.
count := inventory["sword"]  // count is 0

// Writing: PANIC!
inventory["sword"] = 1  // "assignment to entry in nil map"

A nil map is like a locked filing cabinet. You can look through the glass (read), but you can’t put anything in (write).

Always initialize:

inventory := make(map[string]int)
// or
inventory := map[string]int{}

Map Iteration is Random

scores := map[string]int{
    "Mario": 100,
    "Luigi": 85,
    "Peach": 95,
}

for name, score := range scores {
    println(name, score)
}

Run this twice. You might get:

Run 1:          Run 2:
Luigi 85        Peach 95
Peach 95        Mario 100
Mario 100       Luigi 85

This is intentional. Go randomizes iteration order to prevent you from depending on it. If you need sorted keys, sort them yourself.


Choosing the Right Tool

┌─────────────────────────────────────────────────────────────┐
│   DECISION TREE: What Data Structure Should I Use?          │
│                                                             │
│   Need to look up by name/key?                              │
│           │                                                 │
│           ├── YES → Use a map (O(1) lookup!)                │
│           │                                                 │
│           └── NO → Is the data ordered/sequential?          │
│                       │                                     │
│                       ├── YES → Use a slice                 │
│                       │                                     │
│                       └── NO → Still probably use a slice   │
│                                (maps have memory overhead)  │
│                                                             │
│   Is it text? → Use a string (immutable)                    │
│   Need to build text? → Use []byte, convert at the end      │
└─────────────────────────────────────────────────────────────┘

Summary Table

Operation         String   Slice   Map
Get length        O(1)     O(1)    O(1)
Access by index   O(1)     O(1)    -
Access by key     -        -       O(1) avg
Append            N/A      O(1)*   O(1) avg
Concatenate       O(n)     O(n)    -

* Amortized — occasional reallocations

Memory Overhead

String header:  8 bytes  (pointer + length)
Slice header:  12 bytes  (pointer + length + capacity)
Map header:    28 bytes  (+ bucket overhead per entry)

Maps have the most overhead. For small, dense integer keys (0 to N), a slice is often better:

// If enemy IDs are 0-999, use a slice!
enemies := make([]*Enemy, 1000)
enemies[42] = &orc  // O(1), less memory than map

Real Benchmark Results

We ran these benchmarks on actual Dreamcast hardware. The numbers don’t lie!

Map vs Slice: The “Maggie” Effect

Looking up an item by ID, searching near the end of the collection:

Elements   Slice (linear search)   Map lookup   Map is…
100        17 μs                   1.3 μs       13× faster
500        92 μs                   0.9 μs       97× faster
1,000      187 μs                  0.9 μs       203× faster
2,000      443 μs                  1.2 μs       376× faster

Notice how slice time grows linearly (O(n)) while map time stays constant (O(1)). With 2,000 enemies, map lookup is 376× faster!

String Concatenation: The Hidden Cost

Building a string character by character:

Characters   s += "x" in loop   append to []byte   Speedup
50           122 μs             23 μs              5× faster
200          665 μs             69 μs              9× faster
500          2,725 μs           161 μs             16× faster
1,000        8,973 μs           314 μs             28× faster

The loop method is O(n²) — time explodes as strings get longer. For 1,000 characters, pre-allocation is 28× faster!

Slice Pre-allocation: One Allocation vs Many

Appending items to a slice:

| Items | Growing []int{} | Pre-alloc make([]int, 0, n) | Time saved |
|-------|-----------------|-----------------------------|------------|
| 50    | 35 μs           | 24 μs                       | 32% faster |
| 100   | 76 μs           | 41 μs                       | 46% faster |
| 200   | 178 μs          | 76 μs                       | 57% faster |

Pre-allocation eliminates the repeated reallocations as the slice grows.
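In code, the whole difference is one argument to make (grow and prealloc are illustrative names):

```go
package main

import "fmt"

// grow appends into a nil slice: the runtime reallocates and copies
// every time capacity runs out.
func grow(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// prealloc reserves capacity up front: one allocation, zero copies.
func prealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

func main() {
	a, b := grow(200), prealloc(200)
	// Same contents; prealloc's capacity is exactly what we asked for.
	fmt.Println(len(a) == len(b), cap(b)) // true 200
}
```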


The right data structure is like having the right superpower. A map turns an O(n) search into O(1). That’s not just faster… it’s magic.

Channels

This chapter explains how libgodc implements Go channels for the Dreamcast. The implementation differs significantly from the standard Go runtime due to our M:1 cooperative scheduling model.


The hchan Structure

Every channel is an hchan structure allocated on the GC heap:

typedef struct hchan {
    uint32_t qcount;      // Items currently in buffer
    uint32_t dataqsiz;    // Buffer capacity (0 = unbuffered)
    void *buf;            // Ring buffer (follows hchan in memory)
    uint16_t elemsize;    // Size of each element
    uint8_t closed;       // Channel closed flag
    uint8_t buf_mask_valid; // Power-of-2 optimization flag
    
    struct __go_type_descriptor *elemtype;
    
    uint32_t sendx;       // Send index into ring buffer
    uint32_t recvx;       // Receive index into ring buffer
    
    waitq recvq;          // Goroutines waiting to receive
    waitq sendq;          // Goroutines waiting to send
    
    uint8_t locked;       // Simple lock (no contention in M:1)
} hchan;

When you write make(chan int, 3), libgodc allocates a single block containing both the hchan header and the buffer:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Memory Layout for make(chan int, 3)                       │
│                                                             │
│   ┌─────────────────────┬─────────────────────────────────┐ │
│   │      hchan (48B)    │     buffer (3 × 4B = 12B)       │ │
│   ├─────────────────────┼───────┬───────┬───────┬─────────┤ │
│   │ qcount, dataqsiz,   │ [0]   │ [1]   │ [2]   │         │ │
│   │ sendx, recvx,       │ int   │ int   │ int   │         │ │
│   │ waitqueues, ...     │       │       │       │         │ │
│   └─────────────────────┴───────┴───────┴───────┴─────────┘ │
│                                                             │
│   Total allocation: sizeof(hchan) + (cap × elemsize)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Ring Buffer Indexing

The buffer is a circular queue. To find where to read/write:

static inline void *chanbuf(hchan *c, uint32_t i) {
    uint32_t index = chan_index(c, i);
    return (void *)((uintptr_t)c->buf + (uintptr_t)index * c->elemsize);
}

For power-of-2 capacities, we use bitwise AND instead of modulo:

static inline uint32_t chan_index(hchan *c, uint32_t i) {
    if (c->buf_mask_valid)
        return i & (c->dataqsiz - 1);  // Fast: i & 3 for cap=4
    return i % c->dataqsiz;            // Slow: division
}

Tip: Use power-of-2 buffer sizes (2, 4, 8, 16…) for faster indexing.
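You can verify the mask/modulo equivalence yourself in plain Go (indexMask and indexMod mirror the C helpers above; the names are ours):

```go
package main

import "fmt"

// indexMask wraps i into [0, size) with a bitwise AND.
// Valid only when size is a power of two.
func indexMask(i, size uint32) uint32 {
	return i & (size - 1)
}

// indexMod is the general (slower) form using modulo.
func indexMod(i, size uint32) uint32 {
	return i % size
}

func main() {
	const size = 8 // power of two, like make(chan int, 8)
	for i := uint32(0); i < 32; i++ {
		if indexMask(i, size) != indexMod(i, size) {
			panic("mask and modulo disagree")
		}
	}
	fmt.Println("mask == modulo for every index")
}
```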


The Send Algorithm

When you write ch <- value, this is chansend():

┌─────────────────────────────────────────────────────────────┐
│   chansend(c, elem, block)                                  │
│                                                             │
│   1. nil channel?                                           │
│      └── block=true: gopark forever (deadlock)              │
│      └── block=false: return false                          │
│                                                             │
│   2. Channel closed?                                        │
│      └── runtime_throw("send on closed channel")            │
│                                                             │
│   3. Receiver waiting in recvq?                             │
│      └── YES: Copy data DIRECTLY to receiver's elem         │
│               Wake receiver with goready()                  │
│               Return true                                   │
│                                                             │
│   4. Buffer has space? (qcount < dataqsiz)                  │
│      └── YES: Copy to buf[sendx], increment sendx           │
│               Return true                                   │
│                                                             │
│   5. Non-blocking? (block=false)                            │
│      └── Return false                                       │
│                                                             │
│   6. Must block:                                            │
│      └── Create sudog, enqueue in sendq                     │
│          gopark() - yield to scheduler                      │
│          When woken: return success flag                    │
└─────────────────────────────────────────────────────────────┘

The key insight: direct transfer. If a receiver is already waiting, we copy data straight to their memory location, bypassing the buffer entirely. This is why unbuffered channels involve no buffer at all.
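Both paths are observable from ordinary Go. This sketch runs on any Go toolchain, not just libgodc; whichever side of the unbuffered channel arrives first parks, and the second copies directly to it:

```go
package main

import "fmt"

func main() {
	// Buffered: step 4 of chansend - space in the buffer, no blocking.
	ch := make(chan int, 1)
	ch <- 42 // returns immediately

	// Unbuffered: a send can only complete with a receiver present.
	un := make(chan int)
	done := make(chan bool)
	go func() {
		fmt.Println(<-un) // data is copied directly, no buffer involved
		done <- true
	}()
	un <- 7 // rendezvous with the receiver
	<-done
	fmt.Println(<-ch)
}
```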


The Receive Algorithm

When you write value := <-ch, this is chanrecv():

┌─────────────────────────────────────────────────────────────┐
│   chanrecv(c, elem, block)                                  │
│                                                             │
│   1. nil channel?                                           │
│      └── block=true: gopark forever                         │
│      └── block=false: return false                          │
│                                                             │
│   2. Closed AND empty?                                      │
│      └── Zero out elem, return (true, received=false)       │
│                                                             │
│   3. Sender waiting in sendq?                               │
│      └── Unbuffered: Copy directly from sender's elem       │
│      └── Buffered: Take from buffer, move sender's data in  │
│          Wake sender with goready()                         │
│          Return (true, received=true)                       │
│                                                             │
│   4. Buffer has data? (qcount > 0)                          │
│      └── Copy from buf[recvx], zero slot, decrement qcount  │
│          Return (true, received=true)                       │
│                                                             │
│   5. Non-blocking?                                          │
│      └── Return false                                       │
│                                                             │
│   6. Must block:                                            │
│      └── Create sudog, enqueue in recvq                     │
│          gopark()                                           │
│          When woken: return success                         │
└─────────────────────────────────────────────────────────────┘

The Buffered Receive with Waiting Sender

This case is subtle. When the buffer is full and a sender is waiting:

if (c->dataqsiz > 0) {  // Buffered channel
    // 1. Take oldest item from buffer for receiver
    src = chanbuf(c, c->recvx);
    chan_copy(c, elem, src);
    
    // 2. Put sender's NEW item into the freed slot
    chan_copy(c, src, sg->elem);
    
    // 3. Advance indices (sendx follows recvx)
    c->recvx = chan_index(c, c->recvx + 1);
    c->sendx = c->recvx;
}

This maintains FIFO order: the receiver gets the oldest buffered value, not the sender’s new value.


Wait Queues and Sudogs

When a goroutine blocks on a channel, it creates a sudog (sender/receiver descriptor):

typedef struct sudog {
    G *g;                // The blocked goroutine
    struct sudog *next;  // Next in wait queue
    struct sudog *prev;  // Previous in wait queue
    void *elem;          // Pointer to data being sent/received
    uint64_t ticket;     // Used by select for case index
    bool isSelect;       // Part of a select statement?
    bool success;        // Did operation succeed?
    struct sudog *waitlink;   // For select: links all sudogs
    struct sudog *releasetime; // Unused (Go runtime compat)
    struct hchan *c;     // Channel we're waiting on
} sudog;

The Sudog Pool

Creating sudogs during gameplay would trigger malloc(). libgodc pre-allocates a pool at startup:

void sudog_pool_init(void) {
    for (int i = 0; i < 16; i++) {
        sudog *s = (sudog *)malloc(sizeof(sudog));
        s->next = global_pool;
        global_pool = s;
    }
}

acquireSudog() pulls from the pool; releaseSudog() returns to it. If the pool is exhausted, we fall back to malloc().
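The free-list pattern itself is simple. Here is a Go sketch of the same idea (the real pool is C inside the runtime; these names and the pool size are ours):

```go
package main

import "fmt"

type sudog struct {
	next *sudog
	elem any
}

var pool *sudog

// initPool pre-allocates descriptors so that blocking on a channel
// never has to allocate later (mirrors sudog_pool_init).
func initPool(n int) {
	for i := 0; i < n; i++ {
		pool = &sudog{next: pool}
	}
}

// acquire pops from the free list, falling back to a fresh
// allocation when the pool is exhausted.
func acquire() *sudog {
	if pool == nil {
		return &sudog{} // fallback, like malloc()
	}
	s := pool
	pool = s.next
	s.next = nil
	return s
}

// release clears the descriptor and pushes it back on the list.
func release(s *sudog) {
	s.elem = nil
	s.next = pool
	pool = s
}

func main() {
	initPool(16)
	s := acquire()
	release(s)
	fmt.Println("pool round-trip ok")
}
```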

Wait Queues

Each channel has two wait queues (doubly-linked lists):

typedef struct waitq {
    struct sudog *first;
    struct sudog *last;
} waitq;

Operations:

  • waitq_enqueue() - add blocked goroutine to end
  • waitq_dequeue() - remove and return first goroutine
  • waitq_remove() - remove specific sudog (for select cancellation)

Blocking and Waking: gopark/goready

This is where libgodc’s M:1 model shines.

gopark() - Block Current Goroutine

void gopark(bool (*unlockf)(void *), void *lock, WaitReason reason) {
    G *gp = getg();
    if (!gp || gp == g0)
        runtime_throw("gopark on g0 or nil");

    gp->atomicstatus = Gwaiting;
    gp->waitreason = reason;

    // Call unlock function - if it returns false, abort parking
    if (unlockf && !unlockf(lock)) {
        gp->atomicstatus = Grunnable;
        runq_put(gp);
        return;
    }

    // Context switch to scheduler
    __go_swapcontext(&gp->context, &sched_context);
}

The goroutine saves its context and swaps to the scheduler. The unlockf callback releases the channel lock atomically with parking; if it returns false, parking is aborted and the goroutine is re-enqueued as runnable.

goready() - Wake a Goroutine

void goready(G *gp) {
    if (!gp) return;

    // Don't wake dead/already-runnable/running goroutines
    Gstatus status = gp->atomicstatus;
    if (status == Gdead || status == Grunnable || status == Grunning)
        return;

    gp->atomicstatus = Grunnable;
    gp->waitreason = waitReasonZero;
    runq_put(gp);
}

The woken goroutine becomes runnable and will be scheduled on the next schedule() call.

Why M:1 Simplifies Things

In standard Go, channels need atomic operations and memory barriers because multiple OS threads access them. libgodc runs all goroutines on one KOS thread:

  • No atomics needed for locked flag (simple bool)
  • No memory barriers
  • No contention on wait queues
  • Context switches are explicit (cooperative)

The chan_lock()/chan_unlock() functions just set a flag:

void chan_lock(hchan *c) {
    if (!c)
        runtime_throw("chan: nil channel");
    if (c->locked)
        runtime_throw("chan: recursive lock");
    c->locked = 1;
}

void chan_unlock(hchan *c) {
    if (c) c->locked = 0;
}

This is safe because we never preempt a goroutine in the middle of a channel operation.


Select Implementation

Select is the most complex part. Here’s how selectgo() works:

Phase 1: Setup

SelectGoResult selectgo(scase *cas0, uint16_t *order0, 
                        int nsends, int nrecvs, bool block) {
    int ncases = nsends + nrecvs;
    
    // order0 provides space for two arrays:
    uint16_t *pollorder = order0;           // Random order to check cases
    uint16_t *lockorder = order0 + ncases;  // Order to lock channels

Phase 2: Randomize Poll Order (Fairness)

// Fisher-Yates shuffle
for (int i = ncases - 1; i > 0; i--) {
    int j = fastrand() % (i + 1);
    uint16_t tmp = pollorder[i];
    pollorder[i] = pollorder[j];
    pollorder[j] = tmp;
}

Why random? If we always checked cases in order, the first case would always win when multiple are ready. Randomization ensures fairness.
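You can observe the fairness from Go itself. With both channels ready every iteration, each case should win some of the time (the exact counts vary run to run):

```go
package main

import "fmt"

func main() {
	a := make(chan int, 1)
	b := make(chan int, 1)
	seen := map[string]int{}

	for i := 0; i < 1000; i++ {
		a <- 1
		b <- 2
		// Both cases are ready; the shuffled poll order means
		// either one can win.
		select {
		case <-a:
			seen["a"]++
			<-b // drain the other channel
		case <-b:
			seen["b"]++
			<-a
		}
	}
	fmt.Println("a won:", seen["a"], "b won:", seen["b"])
}
```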

Phase 3: Lock Channels (Deadlock Prevention)

// Sort by channel address using heap sort
heapsort_lockorder(cas0, lockorder, ncases);

// Lock in address order
sellock(cas0, lockorder, ncases);

If goroutine A does select { case <-ch1: case <-ch2: } and goroutine B does select { case <-ch2: case <-ch1: }, they could deadlock if they lock in different orders. Sorting by address ensures everyone locks in the same global order.

Phase 4: Check for Ready Cases

for (int i = 0; i < ncases; i++) {
    int casi = pollorder[i];  // Check in random order
    scase *cas = &cas0[casi];
    hchan *c = cas->c;
    
    if (c == NULL)
        continue;
    
    if (casi < nsends) {
        // Send: closed channel will panic - select it
        if (c->closed) {
            selected = casi;
            break;
        }
        // Check for waiting receiver or buffer space
        if (!waitq_empty(&c->recvq) || c->qcount < c->dataqsiz) {
            selected = casi;
            break;
        }
    } else {
        // Receive: check for waiting sender, buffer data, or closed
        if (!waitq_empty(&c->sendq) || c->qcount > 0 || c->closed) {
            selected = casi;
            break;
        }
    }
}

If any case is ready, execute it immediately and return.

Phase 5: Block on All Channels

If nothing is ready and block=true, we enqueue on ALL channels:

sudog *sglist = NULL;

for (int i = 0; i < ncases; i++) {
    int casi = pollorder[i];
    scase *cas = &cas0[casi];
    hchan *c = cas->c;
    
    if (c == NULL)
        continue;
    
    sudog *sg = acquireSudog();
    sg->g = gp;
    sg->c = c;
    sg->elem = cas->elem;
    sg->isSelect = true;
    sg->success = false;
    sg->ticket = casi;  // Remember which case this is
    
    // Link for later cleanup
    sg->waitlink = sglist;
    sglist = sg;
    
    if (casi < nsends)
        waitq_enqueue(&c->sendq, sg);
    else
        waitq_enqueue(&c->recvq, sg);
}

gp->waiting = sglist;
gopark(selparkcommit, &unlock_arg, waitReasonSelect);

Phase 6: Woken - Find Winner

When woken, one sudog has success=true. Find it and dequeue from all other channels:

// Pass 3: Find winner and dequeue losers
for (sudog *sg = sglist; sg != NULL; sg = sgnext) {
    sgnext = sg->waitlink;  // Save before we might release
    int casi = (int)sg->ticket;
    
    if (sg->success) {
        selected = casi;
        if (casi >= nsends)
            recvOK = true;  // Received actual data
    } else {
        // Remove from wait queue (we won't use this case)
        if (casi < nsends)
            waitq_remove(&sg->c->sendq, sg);
        else
            waitq_remove(&sg->c->recvq, sg);
    }
}

// Release all sudogs in separate pass
for (sudog *sg = sglist; sg != NULL; sg = sgnext) {
    sgnext = sg->waitlink;
    releaseSudog(sg);
}

The Default Case

When block=false and nothing is ready, selectgo() returns selected=-1:

if (!block) {
    selunlock(cas0, lockorder, ncases);
    go_yield();  // Give other goroutines a chance
    return (SelectGoResult){-1, false};
}

The go_yield() prevents tight polling loops from starving other goroutines.
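From the Go side, this is the familiar non-blocking pattern. A sketch of how a game loop might poll for events without ever stalling a frame (the event channel and names are illustrative):

```go
package main

import "fmt"

func main() {
	events := make(chan string, 4)
	events <- "jump"

	// Game-loop pattern: handle whatever is pending, never block.
	for frame := 0; frame < 3; frame++ {
		select {
		case e := <-events:
			fmt.Println("frame", frame, "event:", e)
		default:
			// Nothing ready: selectgo returned -1, carry on.
			fmt.Println("frame", frame, "no event")
		}
	}
}
```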


Closing Channels

closechan() marks the channel closed and wakes ALL waiting goroutines:

void closechan(hchan *c) {
    G *wake_list = NULL;
    G *wake_tail = NULL;
    
    chan_lock(c);
    
    if (c->closed) {
        chan_unlock(c);
        runtime_throw("close of closed channel");
    }
    
    c->closed = 1;
    
    // Collect all receivers (they'll get zero values)
    while ((sg = waitq_dequeue(&c->recvq)) != NULL) {
        sg->success = false;  // Indicates closed, not real data
        gp = sg->g;
        if (!gp || gp->atomicstatus == Gdead)
            continue;
        if (sg->elem && c->elemsize > 0)
            memset(sg->elem, 0, c->elemsize);
        // Add gp to wake_list via schedlink...
    }
    
    // Collect all senders (they'll panic when they wake)
    while ((sg = waitq_dequeue(&c->sendq)) != NULL) {
        sg->success = false;
        gp = sg->g;
        if (!gp || gp->atomicstatus == Gdead)
            continue;
        // Add gp to wake_list via schedlink...
    }
    
    chan_unlock(c);
    
    // Wake everyone outside the lock
    while (wake_list) {
        gp = wake_list;
        wake_list = gp->schedlink;
        goready(gp);
    }
}

Senders check success when they wake and throw “send on closed channel” if false.
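The observable behavior from Go: buffered values drain first, and once the channel is closed and empty, receives return the zero value with ok=false (step 2 of chanrecv):

```go
package main

import "fmt"

func main() {
	ch := make(chan int, 2)
	ch <- 1
	ch <- 2
	close(ch)

	// Two real values, then zero value + ok=false forever after.
	for i := 0; i < 3; i++ {
		v, ok := <-ch
		fmt.Println(v, ok)
	}
	// Output:
	// 1 true
	// 2 true
	// 0 false
}
```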


Performance

For benchmark numbers, see the Performance section in Design. You can run the benchmarks yourself with tests/bench_architecture.elf on hardware.

Why Unbuffered is Slower

Unbuffered channels always require a context switch:

Sender                          Receiver
──────                          ────────
ch <- 42                        
  │                             
  └── gopark() ─────────────────► scheduler picks receiver
                                       │
                                  x := <-ch
                                       │
  ◄── goready() ────────────────── wakes sender
  │
continues

Buffered channels avoid this when buffer has space/data.

Optimization Tips

  1. Use buffered channels for producer/consumer patterns
  2. Power-of-2 buffer sizes for faster indexing (uses bitwise AND instead of modulo)
  3. Batch data - send structs with multiple values instead of multiple sends
  4. select with default for non-blocking checks in game loops
  5. Pre-warm channels - send/receive once during init to allocate sudogs

Limitations

libgodc channels have some constraints:

| Limit            | Value                     | Reason                         |
|------------------|---------------------------|--------------------------------|
| Max buffer size  | 65536 elements            | Sanity check in makechan()     |
| Max element size | 65536 bytes               | 16-bit elemsize field in hchan |
| Sudog pool       | 16 pre-allocated, 128 max | Defined in godc_config.h       |

For game code, these limits are rarely hit. If you need larger queues, consider using slices with your own synchronization.

System Integration

The Layer Cake

Imagine your game as an office building. You’re on the top floor, writing Go code. But when you need something done — read a file, play a sound, draw a sprite — you don’t do it yourself. Someone on a lower floor does the actual work.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Floor 4:  Your Go Program                                 │
│             "I want to play a sound!"                       │
│                    ↓                                        │
│   Floor 3:  libgodc (Go runtime)                            │
│             "Let me translate that..."                      │
│                    ↓                                        │
│   Floor 2:  KallistiOS                                      │
│             "I know how to talk to hardware."               │
│                    ↓                                        │
│   Floor 1:  Dreamcast Hardware                              │
│             *beep boop*                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Each floor speaks a different language. libgodc translates Go into something KallistiOS understands. KallistiOS translates that into hardware register writes.

You don’t need to know all the details, but understanding the stack helps you debug problems.


Part 1: Timers and Sleep

How Does Sleep Work?

When you write:

time.Sleep(100 * time.Millisecond)

What actually happens? Let’s trace it:

┌─────────────────────────────────────────────────────────────┐
│   WHAT HAPPENS WHEN YOU SLEEP                               │
│                                                             │
│   Step 1: "I want to sleep for 100ms"                       │
│           ↓                                                 │
│   Step 2: Calculate wake time: now + 100ms = 4:00:00.100    │
│           ↓                                                 │
│   Step 3: Add timer to the timer heap                       │
│           ┌─────────────────────────────┐                   │
│           │ wake_time: 4:00:00.100      │                   │
│           │ goroutine: G7               │                   │
│           └─────────────────────────────┘                   │
│           ↓                                                 │
│   Step 4: Park the goroutine (it's now sleeping)            │
│           ↓                                                 │
│   Step 5: Scheduler runs OTHER goroutines                   │
│           ...100ms pass...                                  │
│           ↓                                                 │
│   Step 6: Scheduler checks timer heap                       │
│           "Hey, it's 4:00:00.100! Wake G7!"                 │
│           ↓                                                 │
│   Step 7: G7 wakes up, continues executing                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key insight: Your goroutine isn’t actually sleeping on a couch somewhere. It’s parked in a queue, and the scheduler knows when to wake it.

Where Does Time Come From?

The SH-4 CPU has hardware timers. KallistiOS reads them:

//extern timer_us_gettime64
func TimerUsGettime64() uint64

This returns microseconds since boot. Accurate to about 1 μs. Fast to read.

In your Go code, you can use this for precise timing:

//extern timer_us_gettime64
func timerUsGettime64() uint64

func measureSomething() {
    start := timerUsGettime64()
    doExpensiveWork()
    elapsed := timerUsGettime64() - start
    println("Took", elapsed, "microseconds")
}

The Timer Heap

Multiple goroutines can sleep at once. Go keeps them in a heap (priority queue) sorted by wake time:

Timer Heap:
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   [G3: wake at 100ms]    ← Earliest, checked first        │
│           /\                                              │
│          /  \                                             │
│ [G7: 200ms]  [G2: 150ms]                                  │
│       /                                                   │
│  [G5: 500ms]                                              │
│                                                           │
└───────────────────────────────────────────────────────────┘

The scheduler only needs to check the top of the heap. If the earliest timer hasn’t fired, none of them have.


Part 2: File I/O (The Danger Zone)

The Problem

You want to load a texture:

data := loadFile("/cd/textures/enemy.pvr")

Seems innocent, right? Here’s what actually happens:

┌─────────────────────────────────────────────────────────────┐
│   GD-ROM READ: THE SILENT KILLER                            │
│                                                             │
│   Time: 0ms    → loadFile() called                          │
│   Time: 0ms    → KOS asks GD-ROM to seek                    │
│   Time: 50ms   → Drive head moves (mechanical!)             │
│   Time: 100ms  → Data starts streaming                      │
│   Time: 150ms  → Still reading...                           │
│   Time: 200ms  → loadFile() returns                         │
│                                                             │
│   DURING THOSE 200ms:                                       │
│   • No other goroutines run                                 │
│   • Game loop frozen                                        │
│   • Audio buffer might run dry → glitch!                    │
│   • Player sees: lag, stutter, freeze                       │
│                                                             │
│   At 60 FPS, you have 16.6ms per frame.                     │
│   A 200ms file read = 12 FROZEN FRAMES!                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Why does this happen? KOS file operations are synchronous. The CPU sits in a loop waiting for the CD drive. No scheduler runs. Nothing else happens.

The Solutions

Solution 1: Loading Screens

Load everything at startup or level transitions:

func main() {
    showLoadingScreen()
    
    // All the slow stuff happens here
    textures = loadAllTextures()
    sounds = loadAllSounds()
    levelData = loadLevel(1)
    
    hideLoadingScreen()
    
    // Now game loop is safe
    for {
        gameLoop()
    }
}

Solution 2: Streaming in Chunks

If you must load during gameplay, do it in small pieces:

func streamTexture(path string) {
    file := openFile(path)
    defer closeFile(file)
    
    for !file.EOF() {
        chunk := file.Read(4096)  // Read 4KB
        processChunk(chunk)
        runtime.Gosched()  // Let other goroutines run!
    }
}

Solution 3: Pre-load into RAM

The Dreamcast has 16 MB of RAM. Use it!

// At startup, load everything you might need
var textureCache = make(map[string][]byte)

func preloadTexture(name string) {
    textureCache[name] = loadFile("/cd/textures/" + name)
}

// During gameplay, instant access
func getTexture(name string) []byte {
    return textureCache[name]  // Already in RAM!
}

Part 3: Calling C Functions

The //extern Magic

Go code can call C functions directly:

//extern pvr_wait_ready
func PvrWaitReady() int32

//extern maple_enum_dev
func mapleEnumDev(port, unit int32) uintptr

func main() {
    PvrWaitReady()  // Calls the C function!
}

No CGo. No runtime overhead. Just a direct function call.

The Danger

Here’s the catch: C functions run on your goroutine’s stack. Goroutines have fixed stacks (64 KB by default). If the C function is stack-hungry:

┌─────────────────────────────────────────────────────────────┐
│   STACK OVERFLOW SCENARIO                                   │
│                                                             │
│   Goroutine stack: 64 KB                                    │
│                                                             │
│   ┌────────────────────┐ ← Stack top                        │
│   │ Your Go function   │ 1 KB used                          │
│   ├────────────────────┤                                    │
│   │ C function called  │                                    │
│   │   local arrays...  │ 6 KB used                          │
│   │   more locals...   │                                    │
│   ├────────────────────┤                                    │
│   │ C calls another C  │                                    │
│   │   BOOM!            │ OVERFLOW!                          │
│   └────────────────────┘ ← Stack bottom (guard page)        │
│                                                             │
│   Result: Memory corruption, crash, mysterious bugs         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 4: Debugging Without Fancy Tools

The Detective’s Toolkit

Tool 1: Print Statements

The oldest debugging technique is still the best:

func suspiciousFunction(x int) {
    println(">>> suspiciousFunction start, x =", x)
    
    result := doSomething(x)
    println("    after doSomething, result =", result)
    
    processResult(result)
    println("<<< suspiciousFunction end")
}

Tool 2: Binary Search Debugging

Program crashes somewhere. Where?

1. Add print at function start and end
2. If it prints START but not END, crash is inside
3. Add print in the middle
4. Repeat until you find the exact line

Tool 3: The Assumptions Checklist

When something “can’t possibly be wrong,” check it:

func processEnemy(e *Enemy) {
    // CHECK YOUR ASSUMPTIONS
    if e == nil {
        println("BUG: e is nil!")
        return
    }
    if e.Health < 0 {
        println("BUG: negative health:", e.Health)
    }
    if e.X < 0 || e.X > 640 {
        println("BUG: X out of bounds:", e.X)
    }
    
    // Now do the actual work
    // ...
}

Reading Crash Information

When your game crashes, you might see:

panic: index out of range [99] with length 3

Registers:
  PC=8c015678  PR=8c015432

Stack trace:
  0x8c015678
  0x8c015432
  0x8c014000

What does this mean?

  • PC (Program Counter) — Where the crash happened
  • PR (Procedure Register) — Who called us (return address)
  • Stack trace — Chain of function calls

Finding the Function Name

You have an address: 0x8c015678. Where is it?

Method 1: addr2line

sh-elf-addr2line -e game.elf 0x8c015678
# Output: /path/to/main.go:42

This tells you the exact line number!

Method 2: Symbol Table

sh-elf-nm game.elf | sort > symbols.txt
# Then search for addresses near 0x8c015678

Method 3: With Function Names

sh-elf-addr2line -f -C -i -e game.elf 0x8c015678
# Output: functionName
#         main.go:42

Common Bugs and Fixes

| Symptom                   | Likely Cause               | Fix                                   |
|---------------------------|----------------------------|---------------------------------------|
| Hangs, no output          | Infinite loop without yield | Add runtime.Gosched() in loops       |
| Garbage on screen         | Memory corruption          | Check array bounds                    |
| Random crashes            | Stack overflow             | Check deep recursion, big C calls     |
| GC panic                  | Too much live data         | Reduce heap usage, trigger GC earlier |
| Works in emu, fails on hw | Timing differences         | Test on real hardware earlier!        |

Troubleshooting Flowchart

Use this decision tree when things go wrong:

┌──────────────────────────────────────────────────────────────┐
│   TROUBLESHOOTING FLOWCHART                                  │
│                                                              │
│   What's happening?                                          │
│         │                                                    │
│         ├─► CRASH (program terminates)                       │
│         │         │                                          │
│         │         ├─► Panic message visible?                 │
│         │         │         │                                │
│         │         │         ├─► YES: Read the message!       │
│         │         │         │   • "index out of range"       │
│         │         │         │     → Check slice bounds       │
│         │         │         │   • "nil pointer"              │
│         │         │         │     → Check for nil before use │
│         │         │         │   • "out of memory"            │
│         │         │         │     → Reduce allocations       │
│         │         │         │                                │
│         │         │         └─► NO: Stack overflow likely    │
│         │         │             → Reduce local variables     │
│         │         │             → Convert recursion to loop  │
│         │         │                                          │
│         ├─► FREEZE (no crash, no progress)                   │
│         │         │                                          │
│         │         ├─► Any goroutines running?                │
│         │         │         │                                │
│         │         │         ├─► Only one: Infinite loop      │
│         │         │         │   → Add runtime.Gosched()      │
│         │         │         │                                │
│         │         │         └─► Multiple: Deadlock           │
│         │         │             → Check channel usage        │
│         │         │             → Ensure sends have receivers│
│         │         │                                          │
│         ├─► STUTTER (periodic lag)                           │
│         │         │                                          │
│         │         └─► GC pauses likely                       │
│         │             → Reduce live heap size                │
│         │             → Trigger GC during loading            │
│         │             → Use object pools                     │
│         │                                                    │
│         └─► WRONG OUTPUT (runs but incorrect)                │
│                   │                                          │
│                   └─► Add println() everywhere               │
│                       → Check variable values                │
│                       → Verify assumptions                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

The 5-Step Debug Process

┌─────────────────────────────────────────────────────────────┐
│   THE DEBUGGING ALGORITHM                                   │
│                                                             │
│   1. REPRODUCE                                              │
│      Can you make it happen consistently?                   │
│      If not, add logging until you can.                     │
│                                                             │
│   2. NARROW DOWN                                            │
│      Binary search with prints.                             │
│      "Does it crash before this line or after?"             │
│                                                             │
│   3. CHECK ASSUMPTIONS                                      │
│      Print everything. That variable you're SURE is         │
│      correct? Print it anyway.                              │
│                                                             │
│   4. SIMPLIFY                                               │
│      Create the smallest program that shows the bug.        │
│      Often, you'll find the bug while simplifying.          │
│                                                             │
│   5. TAKE A BREAK                                           │
│      Seriously. Walk away. Fresh eyes find bugs faster      │
│      than tired eyes.                                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 5: Testing on a Game Console

The Test Structure

Our tests are simple: standalone executables that print PASS or FAIL.

tests/
├── test_types.go      → test_types.elf      (maps, interfaces, structs)
├── test_goroutines.go → test_goroutines.elf (goroutines, channels)
├── test_memory.go     → test_memory.elf     (allocation, GC)
└── test_control.go    → test_control.elf    (defer, panic, recover)

No fancy test framework. No JUnit. Just:

  1. Do something
  2. Check if it worked
  3. Print the result

A Minimal Test

package main

func TestMaps() {
    println("maps:")
    passed := 0
    total := 0

    total++
    m := make(map[string]int)
    m["score"] = 100
    if m["score"] == 100 {
        passed++
        println("  PASS: read after write")
    } else {
        println("  FAIL: read after write")
    }

    total++
    if m["missing"] == 0 {
        passed++
        println("  PASS: missing key returns zero")
    } else {
        println("  FAIL: missing key returns zero")
    }

    total++
    delete(m, "score")
    _, ok := m["score"]
    if !ok {
        passed++
        println("  PASS: delete removes key")
    } else {
        println("  FAIL: delete removes key")
    }

    println("  ", passed, "/", total)
}

func main() {
    TestMaps()
}
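The PASS/FAIL boilerplate can be factored into a tiny helper. This is a sketch, not part of the actual libgodc test suite:

```go
package main

var passed, total int

// check records one test result and prints a PASS/FAIL line.
func check(name string, ok bool) {
	total++
	if ok {
		passed++
		println("  PASS:", name)
	} else {
		println("  FAIL:", name)
	}
}

func main() {
	println("maps:")
	m := make(map[string]int)
	m["score"] = 100
	check("read after write", m["score"] == 100)
	check("missing key returns zero", m["missing"] == 0)
	delete(m, "score")
	_, ok := m["score"]
	check("delete removes key", !ok)
	println("  ", passed, "/", total)
}
```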

Running Tests

# Build the test
make test_types

# Run on Dreamcast
dc-tool-ip -t 192.168.2.205 -x test_types.elf

# Output:
# maps:
#   PASS: read after write
#   PASS: missing key returns zero
#   PASS: delete removes key
#   3 / 3

Emulator vs Hardware

Aspect       Emulator              Real Hardware
Speed        Fast iteration        Slower uploads
Debugging    Can use host tools    printf only
Accuracy     Close but not exact   The truth
Timing       May differ            Definitive

The Strategy:

┌─────────────────────────────────────────────────────────────┐
│   DEVELOPMENT WORKFLOW                                      │
│                                                             │
│   80% of time: Emulator                                     │
│   ├── Fast compile-run cycle                                │
│   ├── Quick iteration                                       │
│   └── Good for logic bugs                                   │
│                                                             │
│   20% of time: Real Hardware                                │
│   ├── Catches timing issues                                 │
│   ├── Finds memory/stack problems                           │
│   └── Final validation before release                       │
│                                                             │
│   RULE: Never release without testing on real hardware!     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Dreamcast is a 25-year-old console with 16 MB of RAM, no debugger, and a CD-ROM that takes 200ms to seek. And yet, people made incredible games for it. You can too. You just need patience, println, and the knowledge in this chapter.

Performance

Part 1: The Cache — Your Best Friend

The Numbers That Matter

┌─────────────────────────────────────────────────────────────┐
│   SH-4 MEMORY HIERARCHY                                     │
│                                                             │
│   Registers:     0 cycles (instant)                         │
│   L1 Cache:      1-2 cycles (~10 ns)                        │
│   Main RAM:      10-20 cycles (~100 ns)                     │
│   CD-ROM:        millions of cycles (200+ ms)               │
│                                                             │
│   Cache miss = 10-20× SLOWER than cache hit!                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Cache Lines: The Free Lunch

When you read one byte from RAM, the CPU doesn’t fetch just that byte. It fetches a whole cache line — 32 bytes on SH-4.

You ask for array[0]:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │  ← All 32 bytes loaded!
└────┴────┴────┴────┴────┴────┴────┴────┘
  ▲
  You wanted this one

Next 7 accesses are FREE! They're already in cache.

Sequential Access: The Fast Path

// FAST: Sequential access — 125 elements
sum := 0
for i := 0; i < 125; i++ {
    sum += array[i]
}

What happens:

Access array[0] → Cache miss, load 32 bytes
Access array[1] → Cache HIT (free!)
Access array[2] → Cache HIT (free!)
...
Access array[7] → Cache HIT (free!)
Access array[8] → Cache miss, load next 32 bytes
...

Total cache misses: 125 / 8 = ~16

Strided Access: The Slow Path

// SLOW: Strided access (every 8th element) — also 125 elements
sum := 0
for i := 0; i < 1000; i += 8 {
    sum += array[i]
}

What happens:

Access array[0]   → Cache miss
Access array[8]   → Cache miss (different cache line!)
Access array[16]  → Cache miss
Access array[24]  → Cache miss
...
Access array[992] → Cache miss

Total cache misses: 125 (EVERY access misses!)

Same number of additions (125), but strided is ~8× slower because every access misses the cache.
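The miss counts above can be sanity-checked on any machine by counting how many distinct 32-byte cache lines each pattern touches. A host-side sketch, assuming 4-byte elements as in the diagrams:

```go
package main

// linesTouched counts the distinct cache lines hit by a sequence of element indices.
func linesTouched(indices []int, elemSize, lineSize int) int {
	seen := make(map[int]bool)
	for _, i := range indices {
		seen[i*elemSize/lineSize] = true
	}
	return len(seen)
}

func main() {
	var seq, strided []int
	for i := 0; i < 125; i++ { // sequential: array[0..124]
		seq = append(seq, i)
	}
	for i := 0; i < 1000; i += 8 { // strided: array[0], array[8], ...
		strided = append(strided, i)
	}
	println("sequential lines:", linesTouched(seq, 4, 32))     // 16 lines → ~16 misses
	println("strided lines:   ", linesTouched(strided, 4, 32)) // 125 lines → 125 misses
}
```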

The Practical Lesson

┌─────────────────────────────────────────────────────────────┐
│   CACHE-FRIENDLY PATTERNS                                   │
│                                                             │
│   ✓ Process arrays left-to-right                            │
│   ✓ Keep related data together (struct of arrays)           │
│   ✓ Avoid pointer-chasing (linked lists are slow!)          │
│   ✓ Small, tight loops                                      │
│                                                             │
│   ✗ Random access patterns                                  │
│   ✗ Large structs with rarely-used fields                   │
│   ✗ Jumping around memory                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 2: The Float64 Trap

The Shocking Truth

Go defaults to float64 for floating-point numbers:

x := 3.14  // This is float64!

On a modern PC, float64 and float32 are about the same speed. On SH-4?

┌─────────────────────────────────────────────────────────────┐
│   FLOAT PERFORMANCE ON SH-4                                 │
│                                                             │
│   float32:  Hardware accelerated, FAST                      │
│             One instruction, one cycle                      │
│                                                             │
│   float64:  Software emulation, SLOW                        │
│             Multiple instructions, 10-20× slower!           │
│                                                             │
│   A physics simulation using float64 could run              │
│   at 6 FPS instead of 60 FPS. That's the difference.        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Fix

Be explicit about float32:

// SLOW
x := 3.14           // float64 by default!
y := x * 2.0        // float64 math

// FAST
var x float32 = 3.14  // Explicit float32
y := x * 2.0          // float32 math

For game physics, positions, velocities — always use float32.
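One way to stay in float32 end-to-end is a small vector type plus typed constants, so untyped literals never promote to float64. A sketch — Vec2 is an illustrative type, not a libgodc API:

```go
package main

// Vec2 keeps all game math in float32, avoiding accidental float64 promotion.
type Vec2 struct{ X, Y float32 }

func (a Vec2) Add(b Vec2) Vec2      { return Vec2{a.X + b.X, a.Y + b.Y} }
func (a Vec2) Scale(s float32) Vec2 { return Vec2{a.X * s, a.Y * s} }

func main() {
	const dt float32 = 1.0 / 60.0 // typed constant: stays float32
	pos := Vec2{X: 0, Y: 0}
	vel := Vec2{X: 60, Y: 120}
	pos = pos.Add(vel.Scale(dt)) // all float32 math, hardware-accelerated on SH-4
	println(pos.X > 0.99 && pos.Y > 1.99)
}
```

Because every operand is already float32, the compiler never inserts a float64 conversion, and the SH-4 FPU runs the whole expression in hardware.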


Part 3: What We Deliberately Left Out

“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” — Antoine de Saint-Exupéry

libgodc is not a complete Go implementation. That’s intentional. Here’s what we cut and why:

Omission 1: Full Reflection

Standard Go: Every type carries metadata — field names, method signatures, struct tags. This enables reflect and fancy JSON marshaling.

Cost: Binary size can double.

libgodc: Basic reflection only. Enough for println to work.

What you lose:

reflect.MakeFunc(...)     // NOT SUPPORTED
json.Marshal(myStruct)    // NOT SUPPORTED (would need full reflection)

What you do instead: Write explicit serialization. Use code generators.
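For example, instead of json.Marshal, a save-game struct can be serialized by hand. A sketch with a hypothetical SaveState type, little-endian byte order assumed:

```go
package main

// SaveState is a hypothetical example type serialized without reflection.
type SaveState struct {
	Level uint16
	Score uint32
}

// Marshal encodes s into buf (little-endian) and returns the bytes written.
func (s SaveState) Marshal(buf []byte) int {
	buf[0] = byte(s.Level)
	buf[1] = byte(s.Level >> 8)
	buf[2] = byte(s.Score)
	buf[3] = byte(s.Score >> 8)
	buf[4] = byte(s.Score >> 16)
	buf[5] = byte(s.Score >> 24)
	return 6
}

// UnmarshalSaveState decodes the same fixed layout back into a struct.
func UnmarshalSaveState(buf []byte) SaveState {
	return SaveState{
		Level: uint16(buf[0]) | uint16(buf[1])<<8,
		Score: uint32(buf[2]) | uint32(buf[3])<<8 |
			uint32(buf[4])<<16 | uint32(buf[5])<<24,
	}
}

func main() {
	var buf [6]byte
	in := SaveState{Level: 3, Score: 123456}
	in.Marshal(buf[:])
	out := UnmarshalSaveState(buf[:])
	println(out.Level == in.Level && out.Score == in.Score) // true
}
```

Tedious, but the format is explicit, the binary stays small, and there is no runtime type metadata to carry around.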

Omission 2: Finalizers

Standard Go:

runtime.SetFinalizer(obj, func(o *MyType) {
    o.cleanup()  // Runs when GC collects obj
})

The problem: Finalizers are a nightmare for GC:

  • Objects can be resurrected
  • Run order is undefined
  • Timing is unpredictable
  • Complicate the GC significantly

libgodc: No finalizers.

What you do instead: Use defer for cleanup:

func process() {
    resource := acquire()
    defer resource.Release()  // Always runs!
    // ... use resource ...
}

Omission 3: Preemptive Scheduling

Standard Go: The runtime can interrupt a goroutine at almost any point.

libgodc: Goroutines must yield voluntarily.

// THIS FREEZES THE SYSTEM
for {
    // Infinite loop, never yields
    // No other goroutine will EVER run
}

// THIS IS FINE
for {
    doWork()
    runtime.Gosched()  // "Let others run"
}

Why we did this: Preemption requires safe points, stack inspection, and signal handling. Complex for little benefit on single-CPU.

Omission 4: Concurrent GC

Standard Go:

Your code:    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
GC:                ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
              Both run in parallel!
              Pause: < 1ms

libgodc:

Your code:    ░░░░░░░░░░████████████░░░░░░░░
GC:                     ▓▓▓▓▓▓▓▓▓▓▓▓
              EVERYTHING STOPS during GC
              Pause: 5-20ms

Why we did this: Concurrent GC requires write barriers, atomic operations, and careful synchronization. Stop-the-world is simpler and predictable.

What you do: Keep live data small. Trigger GC between frames or during loading.
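In practice that means forcing a collection at a moment the player won't notice. A sketch, assuming runtime.GC triggers libgodc's stop-the-world collector and loadLevel is a hypothetical loader:

```go
package main

import "runtime"

// loadLevel allocates level data, then pays the 5-20ms GC pause now,
// while the screen is black, instead of mid-gameplay.
func loadLevel(n int) {
	_ = make([]byte, 64*1024) // stand-in for real level allocations
	runtime.GC()              // stop-the-world pause happens here, off-screen
}

func main() {
	loadLevel(1)
	println("level loaded, heap compacted")
}
```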

The Trade-off Table

Feature         What We Chose                 Why
GC              Semi-space, stop-the-world    Simple, no fragmentation
Scheduling      Cooperative, M:1              No locks, predictable
Panic/Recover   setjmp/longjmp                No DWARF unwinding
Reflection      Minimal                       Binary size
Preemption      None                          Simplicity
C interop       Direct linking                No CGo complexity

Our philosophy: Predictability over throughput. Simplicity over features.


Part 4: When to Optimize

The Golden Question

Before optimizing anything, ask:

“Have I measured this?”

If the answer is no, stop. You’re guessing. And programmers are notoriously bad at guessing where time is spent.

The 90/10 Rule

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   90% of execution time is spent in 10% of the code         │
│                                                             │
│   That means:                                               │
│   • 90% of your code DOESN'T MATTER for performance         │
│   • Optimizing the wrong code = wasted effort               │
│   • Always measure first!                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

DO Optimize

  • Code that runs every frame (game loop, rendering)
  • Hot loops with thousands of iterations
  • Code that measurements show is slow

DON’T Optimize

  • Code that runs once (startup, level load)
  • Code that runs rarely (menu navigation)
  • Code you haven’t measured
  • At the cost of readability

How to Measure

//extern timer_us_gettime64
func timerUsGettime64() uint64

func measureGameLoop() {
    start := timerUsGettime64()
    
    updatePhysics()
    physicsTime := timerUsGettime64() - start
    
    renderStart := timerUsGettime64()
    renderFrame()
    renderTime := timerUsGettime64() - renderStart
    
    println("Physics:", physicsTime, "us")
    println("Render:", renderTime, "us")
}

Now you know where time actually goes!


Part 5: The Debug Build System

Production vs Debug

By default, libgodc is silent. Zero debug output, zero overhead.

# Production build (default)
make && make install

# Debug build - enables debug output and assertions
make DEBUG=3 && make install

The Performance Tax of Debug Output

┌─────────────────────────────────────────────────────────────┐
│   OPERATION          Production     DEBUG=3                 │
│                                                             │
│   Goroutine spawn    50 μs          188,000 μs (188 ms!)    │
│   Channel send       19 μs          ~50,000 μs              │
│   GC pause           21 ms          ~500 ms                 │
│                                                             │
│   Debug output is EXTREMELY EXPENSIVE!                      │
│   Never benchmark with DEBUG enabled.                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Debug Macros

Instead of raw printf, use these macros:

Macro                 Use For              Example
LIBGODC_TRACE()       General tracing      Scheduler events
LIBGODC_WARNING()     Non-fatal issues     Large allocations
LIBGODC_ERROR()       Recoverable errors   Failed operations
LIBGODC_CRITICAL()    Fatal errors         Logged to crash dump
GC_TRACE()            GC-specific          Collection details

In production (DEBUG=0): All macros compile to nothing. Zero cost.

In debug (DEBUG=3): Output includes labels:

[godc:main] Scheduling G 42 (status=1)
[godc:main] WARNING: Large allocation 256 KB
[GC] #3: 1024->512 (50% survived) in 21045 us

Using Debug Macros

In C runtime code:

#include "runtime.h"

void my_function(void) {
    LIBGODC_TRACE("Entering my_function");
    
    if (error_condition) {
        LIBGODC_WARNING("Something unexpected: %d", value);
    }
    
    LIBGODC_TRACE("my_function complete");
}

In Go code, use println:

const DEBUG = false  // Set to true when debugging

func debugPrint(msg string) {
    if DEBUG {
        println(msg)
    }
}

Debug Functions Available

When investigating issues, you can call these:

gc_dump_stats();       // Print GC statistics
gc_verify_heap();      // Check heap integrity
gc_print_object(ptr);  // Print object details
gc_dump_heap(10);      // Dump first 10 heap objects

Real Benchmark Results

We ran these benchmarks on actual Dreamcast hardware. These numbers should guide your optimization decisions.

PVRMark: Go vs Native C

We ran the KOS pvrmark benchmark (flat-shaded triangles, no textures) on real Dreamcast hardware to measure Go runtime overhead:

Metric             C Native      Go (default)   Go (GODC_FAST)
Peak polys/frame   17,533        13,833         14,333
Peak pps           ~1,054,097    ~831,714       ~860,532
vs C performance   100%          79%            82%
Binary size        314 KB        614 KB         614 KB

┌─────────────────────────────────────────────────────────────┐
│   POLYGON THROUGHPUT (polys/frame @ 60fps)                  │
│                                                             │
│   C Native:      ████████████████████████████████████ 17,533│
│   Go Optimized:  ████████████████████████████        14,333 │
│   Go Default:    ██████████████████████████          13,833 │
│                                                             │
│   GODC_FAST=1 adds +500 polys/frame (+3.6%)                 │
│   Go achieves 82% of C polygon throughput                   │
└─────────────────────────────────────────────────────────────┘

Analysis:

  • The 18% overhead comes from bounds checking, slice header overhead, and gccgo code generation differences (not FFI — //extern compiles to direct jsr calls)
  • GODC_FAST=1 improves performance by ~3.6% via aggressive optimization
  • For real games with textures, lighting, and game logic, this difference is negligible
  • 14,333 flat-shaded triangles at 60fps is plenty for actual gameplay

What the extra 300KB binary size buys you:

  • Garbage collection
  • Goroutines and channels
  • Defer/panic/recover
  • Type safety and bounds checking
  • Full Go standard library support

Compiler Optimization Flags

The godc build command uses these SH-4 specific optimizations:

Flag             Effect                                   Enabled
-O2              Standard optimization                    Always
-m4-single       Single-precision FPU mode                Always
-mfsrra          Hardware reciprocal sqrt (10× faster)    Always
-mfsca           Hardware sin/cos (10× faster)            Always
-O3              Aggressive optimization                  GODC_FAST only
-ffast-math      Fast FP (breaks IEEE)                    GODC_FAST only
-funroll-loops   Loop unrolling                           GODC_FAST only

To enable aggressive optimizations:

GODC_FAST=1 godc build

Warning: -ffast-math breaks IEEE floating point compliance. NaN and infinity handling may not work correctly. Use only for games where FP precision isn’t critical.

Conclusion

What We Built

We started with a simple question: Can Go run on a 1998 game console?

The answer is yes. Not perfectly, not completely, but yes.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   libgodc: A Go Runtime for the Sega Dreamcast              │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  ✓ Memory allocation with bump allocator            │   │
│   │  ✓ Garbage collection (semi-space copying)          │   │
│   │  ✓ Goroutines (cooperative M:1 scheduling)          │   │
│   │  ✓ Channels (buffered and unbuffered)               │   │
│   │  ✓ Select statement                                 │   │
│   │  ✓ Defer, panic, and recover                        │   │
│   │  ✓ Maps, slices, strings, interfaces                │   │
│   │  ✓ Direct C interop via //extern                    │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   All running on 16MB RAM and a 200MHz CPU.                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Trade-offs We Made

Every design decision was a trade-off. Here’s what we chose and why:

Decision                 What We Gave Up        What We Gained
Semi-space GC            50% of heap unusable   No fragmentation, simple code
Cooperative scheduling   Preemption             No locks, predictable timing
Fixed 64KB stacks        Stack growth           Simplicity, no stack probes
M:1 model                Parallelism            No thread synchronization
setjmp/longjmp panic     DWARF unwinding        Works without debug info
No finalizers            Destructor patterns    Simpler GC, predictable cleanup

These aren’t the “right” choices for every platform. They’re our choices for this platform.


What We Didn’t Build

libgodc is not a complete Go implementation. We deliberately left out:

  • Race detector — No parallelism means no data races
  • CPU/memory profiling — Use println and timers
  • Debugger support — No Go debugger available; use println
  • Full reflection — Binary size matters
  • Preemptive scheduling — Complexity for no benefit
  • Concurrent GC — Single core, stop-the-world is fine

Lessons for Runtime Implementers

If you’re building a runtime for another constrained platform, here’s what we learned:

  • Don’t plan everything upfront. Get println("Hello") working first. The linker errors will guide you to the next step.
  • When documentation fails, read the code. gccgo’s libgo/runtime/ directory answered questions no documentation could.
  • Our first GC was embarrassingly slow. It didn’t matter. Once it worked, we could measure and optimize. Premature optimization would have wasted months.
  • Emulators lie. Timing is different. Memory layout is different. Test on hardware as soon as you can run anything.
  • Fighting the hardware is futile. The SH-4 has 16MB RAM and a 200MHz CPU. Accept it. Design for it. Work with it.

The Bigger Picture

Building this project helped me better understand what Go actually does.

When you write go func() {}, something has to:

  • Allocate a stack
  • Save the entry point
  • Add it to a run queue
  • Eventually switch contexts to run it

When you write x := make([]int, 10), something has to:

  • Calculate the size
  • Find free memory
  • Initialize the slice header
  • Eventually clean up when it’s garbage

That “something” is the runtime. Every high-level language has one. Understanding how it works makes you a better programmer in any language.


What’s Next?

libgodc is open source. You can:

  1. Use it — Build games for the Dreamcast in Go
  2. Extend it — Add features you need
  3. Learn from it — Apply these patterns to other platforms
  4. Contribute — Fix bugs, improve performance, write examples

The Dreamcast community is small but passionate. Join us at:


Final Words

The Sega Dreamcast was released on November 27, 1998, in Japan. It was discontinued on March 31, 2001—a commercial failure that outlived its corporate support by decades.

Twenty-five years later, people are still writing code for it. Still pushing its limits. Still finding joy in its constraints.

That’s the magic of retro computing. It’s not about nostalgia. It’s about craft. Modern development gives us infinite resources and infinite complexity. Old hardware gives us finite resources and forces elegant solutions.

libgodc exists because someone asked: “Can Go run on a Dreamcast?”

The answer is yes. And now you know how.


Thank you for reading, Panos

libgodc Design

libgodc is a Go runtime for the Sega Dreamcast. This document explains how it works under the hood.

The Problem

The Dreamcast is a fixed platform: 200MHz SH-4, 16MB RAM, no MMU, no swap. The standard Go runtime assumes infinite memory, preemptive scheduling, operating system threads, and virtual memory. None of these exist here.

libgodc replaces the Go runtime with one designed for this environment.

Architecture

┌────────────────────────────────────────────────────────────────┐
│  Your Go Code                                                  │
│     compiles with sh-elf-gccgo                                 │
│     produces .o files with Go runtime calls                    │
├────────────────────────────────────────────────────────────────┤
│  libgodc (this library)                                        │
│     implements Go runtime functions                            │
│     memory allocation, goroutines, channels, GC                │
├────────────────────────────────────────────────────────────────┤
│  KallistiOS (KOS)                                              │
│     bare-metal OS for Dreamcast                                │
│     provides malloc, threads, drivers                          │
├────────────────────────────────────────────────────────────────┤
│  Dreamcast Hardware                                            │
│     SH4 CPU, PowerVR2 GPU, AICA sound                          │
│     16MB main RAM, 8MB VRAM                                    │
└────────────────────────────────────────────────────────────────┘

We don’t need the full Go runtime. We need enough to run games. Games have different requirements than servers, cloud services, or desktop systems. This simplifies everything.

Memory Model

The Budget

16MB total RAM:
 GC heap (two semispaces): 4MB total (2MB active at any time)
 Goroutine stack:          64KB per goroutine
 Large object threshold:   >64KB bypasses the GC heap
 Everything else:          KOS + program text/data + malloc-backed assets

These values come from the runtime configuration:

  • GC heap: GC_SEMISPACE_SIZE_KB in runtime/godc_config.h (default 2048, so 2MB per semispace)
  • Stack size: GOROUTINE_STACK_SIZE in runtime/godc_config.h (default 64KB)
  • Large-object threshold: GC_LARGE_OBJECT_THRESHOLD_KB in runtime/godc_config.h (default 64KB)

Address bounds and load base follow the same conventions documented in memory-map.md: main RAM is treated as the half-open interval [0x8C000000, 0x8D000000), and the program image is linked to start at 0x8C010000 (KOS linker LOAD_OFFSET). This leaves the first 64KB below 0x8C010000 for low-memory KOS/dcload use.

tests/bench_architecture.go reports the stack size and large-object threshold at runtime. The semispace size comes directly from the compile-time config.

The 16MB limit is absolute. There is no virtual memory, no swap, no second chance. Every byte matters.

Allocation Strategy

libgodc uses three allocation paths:

1. GC Heap (for Go objects)

Small, frequently allocated objects go here. The semispace collector manages them automatically. Implementation: gc_heap.c, gc_copy.c.

The allocation path, in simplified pseudocode:

// Bump allocator: O(1) allocation (simplified)
void *gc_alloc(size_t size, type_descriptor *type) {
    size_t aligned = ALIGN(size, 8);
    size_t total = HEADER_SIZE + aligned;
    if (alloc_ptr + total > alloc_limit) {
        gc_collect();  // Cheney's algorithm
    }
    void *obj = alloc_ptr;
    alloc_ptr += total;
    return obj + HEADER_SIZE;
}

This is simplified. The real code in gc_heap.c also handles large objects (size > 64KB bypasses the GC heap and goes straight to malloc() by default), alignment edge cases, and gc_percent threshold checks. But the core is exactly this: bump a pointer.

The bump allocator is the fastest possible allocation strategy. Deallocation happens during collection: live objects are copied, dead objects are forgotten.

Usage example:

// Go: allocate freely, GC handles cleanup
func spawnEnemy() *Enemy {
    return &Enemy{bullets: make([]Bullet, 100)}
}
// No kill function needed - when nothing references it, it's collected

2. KOS Heap (for large objects)

Objects larger than 64KB bypass the moving GC heap. With the default config, the threshold check is strict: a 64KB allocation still stays in the GC heap, while 64KB + 1 byte goes to malloc(). This fits common game-asset usage: textures, audio buffers, and mesh data are often large and long-lived.

// This goes to KOS malloc, not GC:
texture := make([]byte, 256*256*2)  // 128KB texture

Large objects use malloc() internally and are not tracked by the GC. To free them, use runtime.FreeExternal:

//go:linkname freeExternal runtime.FreeExternal
func freeExternal(ptr unsafe.Pointer)

// Allocate large texture
texture := make([]byte, 256*256*2)  // 128KB, bypasses GC

// When done with it:
freeExternal(unsafe.Pointer(&texture[0]))
texture = nil  // Don't use after freeing!

The unsafe.Pointer(&texture[0]) syntax is intentional. A slice in Go is a header (data pointer, length, capacity) - not the array itself. Passing &texture would give a pointer to the slice header on the stack, not the malloc()’d array. &texture[0] reaches through to the underlying data pointer, which is what free() needs.

A typed wrapper like FreeSlice(s *[]byte) would be cleaner for callers, but it would only work for []byte. Game code also allocates large []uint16 (pixel buffers), []float32 (vertex data), and others. Without generics support, you would need a separate wrapper per slice type. Using interface{} with reflect is not an option either - the reflect package is too heavy for a 16MB console, and interface boxing itself allocates memory, which defeats the purpose of a function meant to free it. The raw unsafe.Pointer version is ugly but it works: zero-cost, type-agnostic, and no heavy dependencies. Tracked in #2 for a future typed FreeSlice wrapper once generics support is available.

The texture = nil after freeing is optional but strongly recommended. After freeExternal, the slice header still holds the old data pointer, length, and capacity - it looks valid to Go code. If you accidentally access it, that’s a use-after-free. On a desktop OS, the MMU (Memory Management Unit) would catch this: the hardware marks freed pages as inaccessible, and the next access triggers a segfault that crashes the process immediately with a clear error. The Dreamcast’s SH4 has an MMU, but KallistiOS runs with it disabled for performance. All 16MB of RAM is flat and directly accessible. A use-after-free silently reads or writes whatever now lives at that address - another allocation, GC metadata, the stack. The bug might show up as a wrong pixel, a corrupted sound, or a crash hundreds of frames later in unrelated code. Setting the slice to nil turns that silent corruption into a Go panic, which is immediately visible. The function itself cannot nil the caller’s variable because it only receives a raw unsafe.Pointer with no knowledge of the slice it came from. Enabling the SH4 MMU in debug builds to catch these at the hardware level is explored in #3.

See gc_external_free in gc_heap.c. Run test_free_external.elf to verify.

Typical pattern: swap textures between levels:

// Load level 1
bgTexture := make([]byte, 512*512*2)  // 512KB

// ... play level 1 ...

// Unload before level 2
freeExternal(unsafe.Pointer(&bgTexture[0]))
bgTexture = make([]byte, 512*512*2)  // reuses memory

// Or use a helper function like this:
func freeSlice(s []byte) {
    if len(s) > 0 {
        freeExternal(unsafe.Pointer(&s[0]))
    }
}

// Then just:
freeSlice(bgTexture)

3. Stack (for program execution)

Every function call in libgodc uses a stack - local variables, return addresses, function arguments all live here. The stack is not specific to goroutines; even your main() function runs on one. What differs is where each stack comes from and how big it is.

Main goroutine (g0): Runs on the KOS main thread stack. KOS defaults to 32KB, but libgodc overrides this to 128KB in kos_startup.c (KOS_MAIN_STACK_SIZE). This larger size is needed for deep call chains during GC scanning, printf formatting with large buffers, and test harnesses. You can override it at compile time with -DKOS_MAIN_STACK_SIZE=N.

Spawned goroutines (go func()): Each gets a fixed 64KB stack allocated via goroutine_stack_init() in proc.c. The size comes from GOROUTINE_STACK_SIZE in godc_config.h.

In standard Go, goroutines start with a small stack (a few KB) that grows automatically when needed - the runtime detects when a function call would overflow the current stack, allocates a larger one, copies everything over, and updates all pointers. Earlier Go versions used “segmented stacks” (split-stack), where additional stack segments were chained together on demand instead of copying. Both approaches let goroutines use only as much stack as they actually need.

libgodc uses neither. Stacks are fixed-size, allocated once, never resized. This is simpler and faster (no growth checks, no copying, no pointer updates), but requires discipline - if a call chain goes deeper than the stack size, it overflows silently.

The infrastructure for overflow detection exists: every goroutine has a stack_guard field set to the bottom of its stack, and the TLS block stores it at offset 0 where the SH4 split-stack prologue would read it via @(0, GBR). However, the compiler flag -fno-split-stack disables the overflow-checking prologues, and __morestack has been removed from the assembly. The reason: the GBR register, which the split-stack prologue reads, conflicts with KOS’s _Thread_local storage. Without the prologues, no check happens, and overflow corrupts whatever sits below the stack in memory - with no fault, no panic, and no warning (the Dreamcast has no MMU protection, see #3). Alternative detection strategies (stack canaries, SP checks at yield points, MMU guard pages) are tracked in #4.

Stack frames are freed automatically when functions return. Use the stack for temporary buffers:

func processAudio() {
    buffer := [4096]int16{}  // 8KB on stack, automatically freed
    // ...
}

Be careful with fixed-size arrays. The compiler places non-escaping fixed-size arrays entirely on the stack. A var buf [100000]byte inside a goroutine would put 100KB on a 64KB stack, silently overflowing it. This does not apply to make() - make([]byte, 100000) always heap-allocates the backing array through gc_alloc, which routes large objects (>64KB) to malloc(). Only the 12-byte slice header (data pointer, length, capacity) stays on the stack. Rule of thumb: use make for large buffers, keep fixed-size arrays well under the stack size. Compile-time detection of oversized stack frames is tracked in #5.

Garbage Collection

Object Header

The GC sees raw memory, not a high-level Go object graph. For each GC-managed semispace object it encounters, it must answer two questions:

  1. “how many bytes do I copy to the other semispace?” and
  2. “does this object contain pointers I need to follow?”

Without answers, the GC cannot tell where one object ends and the next begins.

The solution: each GC-managed semispace object carries an 8-byte header immediately before its data. The GC reaches it with ptr - 8: a single subtract, no lookups, no hash tables. The cost is 8 extra bytes per semispace object. External malloc()-backed allocations do not use this header.

              8-byte header                    your data
        ┌──────────┬──────────┐          ┌─────────────────┐
        │  word 1  │  word 2  │          │                 │
        │ (4 bytes)│ (4 bytes)│          │  object payload │
        └──────────┴──────────┘          └─────────────────┘
             ▲           ▲                       ▲
             │           │                       │
          GC info    type pointer          what your Go code sees

For GC-managed semispace objects, the header has two 4-byte words:

Word 1 - GC info (packed into 32 bits):

  • Forwarded (1 bit): Has this object already been copied to the other semispace during the current GC cycle? Prevents copying the same object twice.
  • NoScan (1 bit): Does this object contain any pointers? If not, the GC can skip scanning its contents - it just copies the bytes without looking inside. This is the key performance flag.
  • Type tag (6 bits): Compact Go kind metadata copied into the header. The collector primarily relies on the full type descriptor pointer, not this small tag, when it needs object layout information.
  • Size (24 bits): The total object size in bytes, including the 8-byte header and any alignment padding. The GC needs this to know how many bytes to copy. 24 bits allows sizes up to 16MB, which covers the entire Dreamcast RAM.

Word 2 - Type pointer (32 bits): A pointer to the full type descriptor, which contains detailed information about the object’s layout (field offsets, which fields are pointers, etc.). The GC uses this when scanning objects that contain pointers.
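
The word-1 packing described above can be sketched in C. The exact bit positions and macro names here are assumptions for illustration; the authoritative definitions live in the runtime's GC headers.

```c
#include <stdint.h>

/* Sketch of the 32-bit GC-info word: 1 forwarded bit, 1 noscan bit,
 * 6 tag bits, 24 size bits. Bit positions are assumptions. */
#define HDR_FORWARDED  (1u << 31)
#define HDR_NOSCAN     (1u << 30)
#define HDR_TAG_SHIFT  24
#define HDR_TAG_MASK   0x3Fu          /* 6 bits */
#define HDR_SIZE_MASK  0x00FFFFFFu    /* 24 bits: sizes up to 16MB */

static uint32_t hdr_pack(int fwd, int noscan, uint32_t tag, uint32_t size)
{
    return (fwd ? HDR_FORWARDED : 0)
         | (noscan ? HDR_NOSCAN : 0)
         | ((tag & HDR_TAG_MASK) << HDR_TAG_SHIFT)
         | (size & HDR_SIZE_MASK);
}

static uint32_t hdr_size(uint32_t w)   { return w & HDR_SIZE_MASK; }
static int      hdr_noscan(uint32_t w) { return (w & HDR_NOSCAN) != 0; }
```

Packing everything into one word means the GC can test the hot NoScan flag and read the object size with a single aligned load.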

For example, a [4]byte array in memory looks like this:

        header (8 bytes)         data (4 bytes)    padding (4 bytes)
┌───────────────────────────┬──────────────────┬──────────────────┐
│ NoScan=1, Size=16, type=..│  0x41 0x42 ...   │   (alignment)    │
└───────────────────────────┴──────────────────┴──────────────────┘
                                                 total: 16 bytes

The [4]byte holds 4 bytes of useful data but costs 16 bytes in total (8 header + 4 data + 4 alignment padding). This is why many small allocations hurt more than fewer large ones.

The NoScan bit is critical for performance. Objects containing only integers, floats, or other non-pointer types skip GC scanning entirely: the collector just copies them without inspecting their contents.

The practical takeaway: prefer value types over pointer types when possible.

type Vertex struct { X, Y, Z float32 }

// No pointers inside, GC copies the bytes and moves on (NoScan):
mesh := make([]Vertex, 1000)

// Every element is a pointer, GC must inspect each one to find
// and copy the Vertex it points to (scan):
ptrMesh := make([]*Vertex, 1000)

Algorithm: Cheney’s Semispace Collector

The GC-managed heap is divided into two equally sized regions called semispaces. Think of them as “space A” and “space B”. Small and medium allocations happen in one of them (the active space) using the bump allocator. The other semispace is reserved as the destination for the next collection. Objects larger than GC_LARGE_OBJECT_THRESHOLD (64KB by default) currently bypass this moving heap and go through malloc() instead. That is convenient for large raw buffers, but it is also a current runtime limitation for large typed Go allocations that contain pointers; see #6.
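
Allocation in the active semispace is just a bump pointer. The following sketch uses illustrative names; the real allocator also writes the 8-byte object header, triggers gc_collect() on overflow, and routes large objects to malloc().

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of bump-pointer allocation in the active semispace. */
static uint8_t *alloc_ptr;    /* next free byte */
static uint8_t *alloc_limit;  /* end of the active semispace */

static void *bump_alloc(size_t size)
{
    size = (size + 7) & ~(size_t)7;      /* keep 8-byte alignment */
    if (alloc_ptr + size > alloc_limit)
        return NULL;                     /* real runtime: trigger GC here */
    void *p = alloc_ptr;
    alloc_ptr += size;                   /* the entire "allocation" */
    return p;
}
```

This is why allocation is so cheap here: one comparison and one addition, with no free lists or size classes to search.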

GC does not wait until the active semispace is literally full. By default it can trigger once usage crosses the configured threshold (75% when gc_percent=100), and an allocation that would overflow the active semispace also forces a collection.

When collection runs, this runtime does the following:

  1. Effectively stop-the-world for Go goroutines. libgodc runs one goroutine at a time on a single KOS thread, and goroutines switch only when they explicitly yield. The goroutine that triggers GC enters gc_collect() directly, while all other goroutines are already inactive as saved contexts. So for Go code this has the same effect as stop-the-world GC, but the runtime does not need a separate mechanism to pause parallel mutators.
  2. Flip to the other semispace. The collector chooses the inactive semispace as the new active space and resets the three pointers that drive collection: alloc_ptr becomes the next free byte for new allocations, alloc_limit marks the end of that semispace, and scan_ptr starts at the beginning of the copied-object queue that Cheney’s algorithm will walk.
  3. Scan the roots first. Roots are the references the collector knows how to find without already knowing which heap objects are live. In libgodc, those roots are explicit roots (gc_add_root), compiler-registered global roots (registerGCRoots), the current stack, all goroutine stacks, and the G structs that hold runtime metadata. Starting with roots is essential: they define where reachability begins. Only after scanning them can the collector follow pointers into GC-managed memory, copy the first reachable objects into to-space, and seed Cheney’s work queue. Global roots use compiler-provided pointer metadata when available; stacks and G structs are scanned conservatively. There is no separate explicit register scan.
  4. Copy reachable objects. When a root points into from-space, the object it references is copied to to-space and the old header is rewritten with a forwarding pointer. If another root or object reaches it again, the forwarding pointer is reused instead of copying a second time.
  5. Perform Cheney’s scan. Objects already copied into to-space are scanned in allocation order. Their pointer fields are updated, and any referenced objects still in from-space are copied. The new semispace acts as the work queue: scan_ptr advances through copied objects until it catches alloc_ptr.
  6. Finish the cycle. When scan_ptr == alloc_ptr, all reachable semispace objects have been moved. bytes_allocated becomes the live size, the old space is no longer used for allocation, and cache invalidation of that old space is deferred and processed incrementally after the GC pause.

Before GC:                          After GC:
  Space A (active, live + dead)       Space A (old from-space)
  ┌────────────────────┐              ┌────────────────────┐
  │ obj1 obj2 ... objN │              │ no longer used for │
  │ (reachable and     │              │ allocation; cache  │
  │  unreachable mixed)│              │ invalidated later  │
  └────────────────────┘              └────────────────────┘

  Space B (inactive)                  Space B (new to-space / active)
  ┌────────────────────┐              ┌────────────────────┐
  │                    │              │ obj1 obj3 obj7 ... │
  │                    │              │ (only reachable    │
  │                    │              │  objects)    ░░░░░ │
  └────────────────────┘              └────────────────────┘
                                             ▲ alloc_ptr after GC

Before GC, Space A is the active semispace holding all allocated objects - reachable and unreachable intermixed. Space B sits empty. During GC, the collector copies only reachable objects from A (now called “from-space”) into B (now called “to-space”). Unreachable objects like obj2 are never copied - they are reclaimed implicitly by not being moved. After GC, Space B becomes the active semispace with only live objects packed together, and Space A is abandoned. The next GC cycle flips them again: B becomes from-space, A becomes to-space.

// Two semispaces, allocated at startup
gc_heap.space[0] = memalign(32, GC_SEMISPACE_SIZE);
gc_heap.space[1] = memalign(32, GC_SEMISPACE_SIZE);

// Collection flips to the other semispace
int old_space = gc_heap.active_space;
int new_space = 1 - old_space;
gc_heap.active_space = new_space;
gc_heap.alloc_ptr = gc_heap.space[new_space];
gc_heap.alloc_limit = gc_heap.space[new_space] + gc_heap.space_size;
gc_heap.scan_ptr = gc_heap.space[new_space];

// Scan roots, copy reachable objects, then scan copied objects
gc_scan_roots();
while (gc_heap.scan_ptr < gc_heap.alloc_ptr) {
    gc_header_t *header = (gc_header_t *)gc_heap.scan_ptr;
    void *obj = gc_get_user_ptr(header);
    gc_scan_object(obj);
    gc_heap.scan_ptr += GC_HEADER_GET_SIZE(header);
}
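
The copy step (step 4 above) can be sketched as follows. The field names and the FWD_BIT encoding are illustrative; the real runtime packs the forwarded flag into the header's GC-info word as described in the Object Header section.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of copying with forwarding pointers. Illustrative layout:
 * once copied, the old header's second word holds the new address. */
typedef struct {
    uint32_t gcinfo;       /* forwarded bit + total size */
    void    *type_or_fwd;  /* type pointer, or forwarding address */
} gc_header_t;

#define FWD_BIT   0x80000000u
#define SIZE_MASK 0x00FFFFFFu

static uint8_t *to_alloc;  /* bump pointer in to-space */

static void *gc_copy(gc_header_t *h)
{
    if (h->gcinfo & FWD_BIT)          /* already moved this cycle: */
        return h->type_or_fwd;        /* reuse the forwarding address */

    uint32_t size = h->gcinfo & SIZE_MASK;
    void *dst = to_alloc;
    memcpy(dst, h, size);             /* copy header + payload */
    to_alloc += size;

    h->gcinfo |= FWD_BIT;             /* mark the old copy as forwarded */
    h->type_or_fwd = dst;             /* old header now names the new copy */
    return dst;
}
```

Reaching the same object through two different roots is safe: the second visit sees the forwarded bit and reuses the address, which is also how reference cycles terminate.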

Why this algorithm? For the semispace heap it is simple to implement, allocation is just a bump pointer, surviving objects are compacted automatically, and reference cycles are handled naturally through forwarding pointers. The tradeoff is that only one semispace is available for GC-managed allocation at a time, so the other half must stay reserved as the copy destination for the next collection.

Collection Trigger

GC runs when:

  • the active space exceeds the threshold (default: 75% when gc_percent=100)
  • an allocation would exceed the remaining space
  • an explicit GC call is made

The threshold is controlled by gc_percent:

  • gc_percent = 100 (default): threshold = 75% of heap space
  • gc_percent = 50: threshold = 50% of heap space
  • gc_percent = -1: disable threshold-triggered GC; allocations that would overflow the active semispace still force collection

To control GC from Go:

import _ "unsafe"

//go:linkname setGCPercent debug.SetGCPercent
func setGCPercent(percent int32) int32

//go:linkname gc runtime.GC
func gc()

func init() {
    setGCPercent(50)   // Trigger at 50% instead of 75%
    setGCPercent(-1)   // Disable threshold-triggered GC
    gc()               // Force collection now
}

Build and run tests/test_gc_percent.elf to verify this works.

Pause Times

GC pause time is the time spent inside a collection cycle while Go code is waiting for the collector to finish. In libgodc, this is the elapsed time inside gc_collect(): roots are scanned, reachable objects are copied, and Cheney’s scan completes before execution continues. More live data usually means a longer pause because there is more memory to copy and scan.

Pause time matters because it directly consumes your frame budget. At 60fps, one frame is only 16.6ms. A 2ms GC pause is noticeable but often acceptable; a 10ms pause consumes most of the frame; a pause longer than 16.6ms can cause a missed frame or visible hitch.

Run tests/bench_gc_pause.elf on hardware for focused pause measurements, or tests/bench_architecture.elf for a broader benchmark that also reports GC pauses.

For 60fps gameplay, disable threshold-triggered GC during hot gameplay sections:

import _ "unsafe"

//go:linkname setGCPercent debug.SetGCPercent
func setGCPercent(percent int32) int32

//go:linkname forceGC runtime.GC
func forceGC()

func main() {
    setGCPercent(-1)  // Disable threshold-triggered GC

    // ... gameplay avoids threshold-triggered GC pauses ...
    
    // GC during loading screens only:
    showLoadingScreen()
    forceGC()
    startGameplay()
}

Even with gc_percent = -1, an allocation that would overflow the active semispace still forces a collection. So this reduces surprise GC pauses, but it is not a hard guarantee of “no GC during gameplay” if you keep allocating or your live data no longer fits in one semispace.

Root Scanning

The GC finds live objects by tracing from roots:

static void gc_scan_roots(void)
{
    // Scan explicit roots (gc_add_root)
    for (int i = 0; i < gc_root_table.count; i++) { ... }

    // Scan compiler-registered roots (registerGCRoots)
    gc_scan_compiler_roots();

    // Scan current stack
    gc_scan_stack();

    // Scan all goroutine stacks (and their G structs)
    gc_scan_all_goroutine_stacks();
}

  1. Explicit roots Optional. If C code holds pointers to Go objects across a collection, it must register the pointer location with gc_add_root(&ptr). During GC, the collector updates that location if the object moves.

  2. Compiler-registered roots These are not “all globals” scanned blindly. gccgo registers root lists with registerGCRoots(), and the collector scans those roots using compiler-provided pointer metadata when available.

  3. Current stack The stack of the goroutine that triggered GC is scanned conservatively from the current stack pointer upward.

  4. Other goroutine stacks Each live goroutine’s saved stack is scanned conservatively using the saved stack pointer, so only the used portion of the stack is examined.

  5. Goroutine metadata (G structs) Each live goroutine’s G struct is also scanned conservatively for runtime-held pointers such as _panic, _defer, waiting, and checkpoint.

Conservative here does not mean “every aligned word is automatically treated as a pointer.” The scanner first filters for values that look like valid heap object pointers and only then treats them as references. There is no separate explicit register scan; stack scanning catches spilled registers.
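
A minimal sketch of that filter, assuming only alignment and bounds checks (the real scanner also validates that the candidate points at a plausible object header):

```c
#include <stdint.h>

/* Sketch of the "looks like a heap pointer" filter used during
 * conservative scanning. Bounds variables are illustrative. */
static uintptr_t heap_lo, heap_hi;   /* active semispace bounds */

static int looks_like_heap_ptr(uintptr_t word)
{
    if (word & 3)                    /* not 4-byte aligned: not a pointer */
        return 0;
    return word >= heap_lo && word < heap_hi;
}
```

Words that pass the filter are treated as references and pin or forward the objects they appear to point at; everything else is skipped.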

DMA Hazard

DMA (Direct Memory Access) is a hardware feature. The CPU sets up the transfer by telling a hardware block such as the graphics processor (PowerVR2) or sound processor (AICA): use this source address, this destination address, and this size, then start. After that, the hardware moves the bytes on its own while the CPU goes on doing other work instead of copying the data byte by byte.

The hazard is timing. Once DMA starts, the hardware keeps using the raw memory address the CPU programmed into it. But libgodc’s GC moves small heap objects. If the buffer lives in the GC heap and a collection happens before DMA completes, the GC may move that buffer to a new address. Go code can be updated to the new address, but the hardware is not aware of the move and keeps reading from or writing to the old, now stale, address.

There are only a few ways to avoid this:

  1. Use memory that the moving GC will never relocate. In libgodc, large allocations (size > 64KB) bypass the semispace GC and use malloc() instead. Textures and framebuffers can also live in VRAM via kos.PvrMemMalloc().
  2. Make sure no collection can happen while the DMA transfer is in flight. In practice that means: do not call runtime.GC(), avoid allocations that could fill the active semispace, and if needed disable threshold-triggered GC around that section with debug.SetGCPercent(-1). This reduces the risk, but it is not a perfect guarantee: if you overflow the active semispace, GC still runs.
  3. Handle cache coherency separately. A stable address is necessary, but not sufficient. Even if the buffer does not move, DMA code still needs cache flush/invalidate calls so hardware and CPU see the same bytes.

Longer-term API work to make these memory domains explicit and steer callers toward DMA-safe buffers is tracked in #11.

Movement-safe patterns:

// DANGEROUS: small buffer in moving GC heap
data := make([]byte, 4096)      // Small, in GC heap
startDMA(data)                  // Hardware keeps this address until DMA completes
runtime.Gosched()               // Another goroutine may run
// Any later allocation or explicit runtime.GC() before DMA completes can move `data`

// SAFE: large allocations bypass the moving GC heap
data := make([]byte, 100*1024)  // >64KB, uses malloc
startDMA(data)                  // Address will not move during GC

// SAFE: VRAM for textures
tex := kos.PvrMemMalloc(size)   // Allocates in VRAM

These patterns only avoid movement by the GC. DMA code still needs explicit cache flush/invalidate calls; see Cache Coherency below.

Scheduler

The scheduler is the part of the runtime that decides which goroutine runs next. In libgodc, it keeps a queue of goroutines that are ready to run. When the current goroutine cannot continue yet, such as waiting on a channel or timer, or when it voluntarily lets another goroutine run, the scheduler saves its state and restores another goroutine’s state so execution can continue there.

M:1 Cooperative Model

All goroutines run on top of a single underlying KallistiOS (KOS) OS thread. Only one goroutine executes at a time. The scheduler gets a chance to run only when the current goroutine voluntarily gives up control or blocks waiting for something:

  • Channel operations that need to wait, such as send, receive, or select
  • runtime.Gosched(), which means “let another goroutine run now”
  • time.Sleep() or timer waits, which park the goroutine until a timer expires

A goroutine in a tight CPU loop will monopolize the processor. The runtime cannot forcibly interrupt it and switch to another goroutine.

Why M:1?

M:1 means many goroutines (M) are multiplexed onto one underlying OS thread (1). On Dreamcast, that OS thread runs on the console’s single CPU core, so only one goroutine can execute at a time.

Preemptive scheduling means the runtime or OS can interrupt a running goroutine at arbitrary points, usually on a timer, and switch to another one. That improves fairness, but it also needs more machinery: timer-driven interrupts, more bookkeeping, and more forced context switches.

Cooperative scheduling means the current goroutine keeps running until it blocks or explicitly yields. That is simpler because switches happen only at known points, and it is usually faster on this target because the runtime avoids timer-driven forced switches and their bookkeeping cost.

Since the Dreamcast has one CPU core, this extra preemptive machinery would not let Go goroutines run in parallel. The tradeoff is fairness: a goroutine that never yields can starve others. For libgodc’s game-oriented workloads, the simpler M:1 cooperative design is a good fit.

Possible future work in this area is tracked in #12, which explores enforced safepoints and better scheduler fairness without abandoning the single-threaded design. The related question of making the single KallistiOS-thread contract explicit is tracked in #9.

Run Queue Structure

The run queue is the scheduler’s waiting room for goroutines that are ready to run right now. Some goroutines are blocked waiting on a channel or timer; those cannot run yet. Others are ready to run as soon as the CPU becomes available. The scheduler needs a place to remember that second group, and the run queue is that place.

Nothing maintains this queue in parallel or in the background. The scheduler updates it directly in the same single-threaded runtime code:

  • when a goroutine becomes runnable again, it is appended to the queue
  • when the current goroutine yields, it is appended to the queue
  • when the scheduler wants the next goroutine, it removes one from the front

libgodc uses a simple FIFO queue: goroutines are added to the tail and removed from the head. This is easy to implement, predictable, and sufficient for game-oriented workloads where the program mostly controls fairness by deciding when goroutines yield.

// Goroutines execute in the order they become runnable
runq_put(gp);   // Add to tail
gp = runq_get(); // Remove from head

For real-time requirements, structure your code so time-sensitive work runs on the main goroutine or yields frequently.

One subtlety: it is first ready, first run, not first created, first run forever. A goroutine can run, block, wake up later, and then re-enter the queue at a later time.
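
A FIFO run queue of this shape can be sketched as a singly linked list threaded through the goroutine records. The schedlink field and helper names here are illustrative, not the runtime's actual definitions.

```c
#include <stddef.h>

/* Sketch of the scheduler's FIFO run queue. */
typedef struct G {
    struct G *schedlink;   /* next goroutine in the queue */
    int id;
} G;

static G *runq_head, *runq_tail;

static void runq_put(G *gp)          /* append at the tail */
{
    gp->schedlink = NULL;
    if (runq_tail) runq_tail->schedlink = gp;
    else           runq_head = gp;
    runq_tail = gp;
}

static G *runq_get(void)             /* remove from the head */
{
    G *gp = runq_head;
    if (gp) {
        runq_head = gp->schedlink;
        if (!runq_head) runq_tail = NULL;
    }
    return gp;
}
```

Both operations are O(1) with no allocation, which matters in single-threaded runtime code that runs on every context switch.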

Context Switching

Context switching is the low-level mechanism the scheduler uses to pause one goroutine and resume another. A context is the CPU state needed to continue execution later as if nothing happened: register values, stack pointer, return address/program counter, and floating-point state.

This solves a basic problem: the Dreamcast has one CPU core, but libgodc wants many goroutines to take turns running on it. When goroutine A yields, the runtime must remember exactly where A stopped so it can later resume A and run goroutine B in the meantime. Without context switching, the runtime would lose the CPU state for the paused goroutine.

In libgodc, each goroutine saves 64 bytes of CPU state when it yields:

typedef struct sh4_context {
    uint32_t r8, r9, r10, r11, r12, r13, r14;  // Callee-saved
    uint32_t sp, pr, pc;                        // Special registers
    uint32_t fr12, fr13, fr14, fr15;           // FPU callee-saved
    uint32_t fpscr, fpul;                       // FPU control
} sh4_context_t;

The actual save/restore operation is implemented in runtime_sh4_minimal.S (simplified for brevity).

This code is written in assembly rather than C or Go because it must directly manipulate CPU registers and stack state with exact control. A normal C or Go function would already be using registers, following the calling convention, and touching the stack while it runs, which could overwrite the very machine state the runtime is trying to preserve. The SH-4 return register (pr) and stack pointer (sp) also need precise save/restore handling that is awkward or impossible to express safely in ordinary Go or C.

__go_swapcontext:
    ! Save current context
    mov.l   r8, @r4         ! r4 = old_ctx
    mov.l   r9, @(4, r4)
    ...
    ! Restore new context
    mov.l   @r5, r8         ! r5 = new_ctx
    mov.l   @(4, r5), r9
    ...
    rts

This snippet shows the core idea:

  • r4 points to the context structure where the current goroutine’s CPU state should be saved
  • r5 points to the context structure for the goroutine we want to resume
  • the first group of instructions copies the current register values into old_ctx
  • the second group loads previously saved register values from new_ctx
  • rts returns into the restored execution state, so the resumed goroutine continues as if it had never stopped

The omitted lines do the same for the stack pointer, return/program location, and floating-point state.

FPU Context

Every context switch saves floating-point registers, even if your goroutine only uses integers. Compared with the no-FPU path, this adds about 25 extra cycles inside each low-level __go_swapcontext call.

// Both goroutines pay FPU overhead, even though neither uses floats
go audioDecoder()   // Integer PCM math
go networkHandler() // Packet parsing

This is a tradeoff: always saving FPU is slower but correct. A goroutine that unexpectedly uses a float won’t corrupt another’s FPU state.

runtime_sh4_minimal.S also contains __go_swapcontext_lazy and __go_swapcontext_nofpu, but the scheduler currently calls only the full __go_swapcontext path. The current runtime does not track per-goroutine FPU usage in G or pass fpu_flags from the scheduler, so the conservative always-save path is the one that is actually used.

Possible future work to use the lazy/no-FPU paths safely is tracked in #13.

Goroutine Structure

A goroutine is not just a function call. The runtime needs a record that keeps track of everything required to pause it, resume it, and manage its state over time. In libgodc, that record is the G struct.

The important takeaway is not every field, but the categories of information the runtime keeps for each goroutine:

  • panic/defer state, with ABI-critical fields expected by gccgo
  • scheduling state, such as whether the goroutine is runnable or waiting
  • stack bounds and TLS
  • saved CPU context, so the goroutine can be resumed later
  • wait-state metadata for channels, timers, and cleanup

Here is a reduced view of the structure:

typedef struct G {
    // ABI-critical: gccgo expects these at fixed offsets
    PanicRecord *_panic;      // Offset 0: innermost panic
    GccgoDefer *_defer;       // Offset 4: innermost defer

    // Scheduling
    Gstatus atomicstatus;

    // Stack
    void *stack_lo;
    void *stack_hi;

    // Saved CPU state for context switching
    sh4_context_t context;

    // Metadata
    int64_t goid;
    WaitReason waitreason;

    // Waiting / panic bookkeeping
    sudog *waiting;
    Checkpoint *checkpoint;
} G;

goroutine.h contains the authoritative full definition. The reason this structure matters is simple: without it, the runtime would have nowhere to store the goroutine’s saved CPU state, stack information, wait status, and panic/defer bookkeeping between scheduler decisions.

Goroutine Lifecycle

A goroutine moves through a small runtime state machine:

  1. Created __go_go() allocates the G struct, a stack, and a TLS block.
  2. Ready to run The goroutine is placed on the run queue, meaning it could run as soon as the scheduler picks it.
  3. Running The scheduler context-switches into that goroutine, so it now owns the CPU.
  4. Waiting If it cannot continue yet, such as waiting on a channel, select, or timer, the runtime parks it and runs something else.
  5. Finished When the goroutine function returns, the runtime marks it dead and queues it for cleanup.

proc.c includes a grace-period dead queue using death_generation and dead_link, but in the current source global_generation is never advanced. So the cleanup path exists, but exited goroutines do not currently become eligible for actual cleanup and reuse.

Channels

Channels are the primary synchronization primitive. Implementation follows the Go runtime closely.

They matter to libgodc’s design because channel operations are one of the main ways goroutines block, wake up, and hand work to each other. So channels are not just a language feature here; they are tightly connected to the scheduler, wait queues, and overall runtime behavior.

Structure

A channel is represented at runtime by an hchan structure. The runtime needs this structure to solve three problems:

  • store buffered elements for buffered channels
  • remember which goroutines are waiting to send or receive
  • track channel state, such as element size and whether the channel is closed

Here is a reduced view of the structure:

typedef struct hchan {
    uint32_t qcount;       // How many elements are buffered right now
    uint32_t dataqsiz;     // Buffer capacity (0 means unbuffered)
    void *buf;             // Circular buffer storage
    uint16_t elemsize;     // Size of each element
    uint8_t closed;        // Whether close(ch) has happened
    uint32_t sendx, recvx; // Circular-buffer indices
    waitq recvq, sendq;    // Goroutines waiting to receive or send
    uint8_t locked;        // Internal lock for channel operations
} hchan;

The important fields fall into three groups:

  • qcount, dataqsiz, buf, sendx, and recvx describe the buffered data
  • recvq and sendq remember blocked receivers and senders
  • elemsize and closed describe the channel’s element layout and state

chan.h contains the authoritative full definition, including extra metadata such as elemtype and internal optimizations.

Unbuffered Channels

An unbuffered channel has no storage for queued elements. A send cannot finish until a receiver is ready, and a receive cannot finish until a sender is ready. When both are ready, the value transfers directly from sender to receiver with no intermediate buffer.

This makes an unbuffered channel more than a transport mechanism: it is also a synchronization point. The sender and receiver rendezvous at the channel operation, so neither side continues past that point until the other side is ready.

Buffered Channels

A buffered channel adds storage between sender and receiver. The sender can place values into the channel without waiting, as long as there is still room in the buffer. The receiver can remove values without waiting, as long as the buffer is not empty.

So a buffered channel trades strict rendezvous for decoupling: sender and receiver do not have to meet at exactly the same moment. The runtime implements the buffer as a simple circular array.
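
The circular-array index math can be sketched as below. This is a simplified model using the hchan field names from above; a real send or receive also handles blocked waiters, locking, direct handoff, and the closed flag.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the buffered-channel circular buffer. */
typedef struct {
    uint32_t qcount, dataqsiz, sendx, recvx;
    uint16_t elemsize;
    uint8_t *buf;
} chanbuf;

static int buf_send(chanbuf *c, const void *elem)
{
    if (c->qcount == c->dataqsiz) return 0;      /* full: sender would block */
    memcpy(c->buf + c->sendx * c->elemsize, elem, c->elemsize);
    c->sendx = (c->sendx + 1) % c->dataqsiz;     /* wrap around */
    c->qcount++;
    return 1;
}

static int buf_recv(chanbuf *c, void *elem)
{
    if (c->qcount == 0) return 0;                /* empty: receiver would block */
    memcpy(elem, c->buf + c->recvx * c->elemsize, c->elemsize);
    c->recvx = (c->recvx + 1) % c->dataqsiz;
    c->qcount--;
    return 1;
}
```

sendx and recvx chase each other around the array, so the buffer never needs to shift elements and both operations stay O(elemsize).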

Select

select lets one goroutine wait on several channel operations at once and continue with whichever one becomes possible first. Instead of committing to a single send or receive, the goroutine asks the runtime to choose among multiple cases.

When more than one case is ready, the runtime randomizes the polling order so the same case does not always win first:

select {
case x := <-ch1:  // These are checked in random order
case ch2 <- y:
case <-time.After(timeout):
}

Implementation summary: shuffle the case order, check for any ready case, and if none are ready, park the goroutine on all relevant wait queues until one case wakes it up.
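
The shuffle step can be sketched as a Fisher-Yates pass over the case indices. The xorshift PRNG here is a stand-in for whatever randomness source the runtime actually uses.

```c
#include <stdint.h>

/* Sketch of select's case-order randomization. */
static uint32_t rng_state = 0x12345678u;

static uint32_t xorshift32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return rng_state = x;
}

static void shuffle_cases(int *order, int n)
{
    for (int i = 0; i < n; i++) order[i] = i;
    for (int i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        int j = (int)(xorshift32() % (uint32_t)(i + 1));
        int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
}
```

The runtime then polls the cases in this shuffled order, so a perpetually ready case cannot permanently shadow the others.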

Defer, Panic, Recover

These features matter to libgodc’s design because they require per-goroutine runtime state and controlled stack unwinding. Deferred calls live on each goroutine, panic state lives on each goroutine, and recovery depends on checkpoint machinery that can jump execution back to a known safe point.

Defer

Each goroutine keeps a linked list of deferred calls. Every defer statement pushes a record onto that list. When the function returns normally, or when the runtime unwinds the stack during panic, those records are popped and executed in LIFO order.

The runtime needs each defer record to remember at least:

  • which deferred call comes next in the chain
  • which function to invoke
  • what argument to pass
  • whether the defer is currently being executed as part of panic unwinding

Here is a reduced view of the structure:

typedef struct GccgoDefer {
    struct GccgoDefer *link; // Next deferred call in the goroutine's chain
    PanicRecord *_panic;     // Panic currently being unwound, if any
    uintptr_t pfn;           // Function pointer to call
    void *arg;               // Argument to pass
    uintptr_t retaddr;       // Return address used for recover matching
} GccgoDefer;

panic_dreamcast.h contains the authoritative full definition, including the ABI-sensitive fields gccgo expects.
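
The chain-and-pop behavior can be sketched as follows. The push/run helpers and the recorder are illustrative; only the link/pfn/arg shape follows the reduced view above.

```c
#include <stddef.h>

/* Sketch of per-goroutine defer chaining and LIFO execution. */
typedef struct Defer {
    struct Defer *link;       /* next deferred call in the chain */
    void (*pfn)(void *);      /* function to call */
    void *arg;                /* argument to pass */
} Defer;

static Defer *defer_head;     /* innermost defer (g->_defer) */

static void defer_push(Defer *d, void (*fn)(void *), void *arg)
{
    d->pfn = fn; d->arg = arg;
    d->link = defer_head;     /* new record becomes the innermost */
    defer_head = d;
}

static void defer_run_all(void)   /* on normal return or panic unwinding */
{
    while (defer_head) {
        Defer *d = defer_head;
        defer_head = d->link;     /* pop before calling */
        d->pfn(d->arg);
    }
}

/* tiny recorder used by the usage example below */
static int defer_log[4], defer_log_n;
static void record(void *arg) { defer_log[defer_log_n++] = *(int *)arg; }
```

Because each push prepends to the list, popping from the head naturally yields the LIFO order Go requires.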

Panic and Recover

A panic is the runtime’s stack-unwinding path. When panic() starts, libgodc records panic state on the current goroutine and walks that goroutine’s defer chain. Deferred calls run one by one, and one of them may call recover().

To resume execution after a successful recover(), the runtime jumps back to a previous checkpoint using setjmp/longjmp. This is why panic recovery is more constrained here than the simple Go-level story suggests: a recovered panic without a checkpoint becomes fatal.

Current implementation detail: helper functions for nil dereference, bounds checks, and divide-by-zero also call runtime_panicstring(), so they enter the same panic/recover machinery instead of going straight to runtime_throw().

Fatal runtime failures still use runtime_throw() and abort immediately.
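
The checkpoint mechanism can be sketched with plain setjmp/longjmp. All names here are illustrative; the real runtime threads its Checkpoint records through the G struct.

```c
#include <setjmp.h>

/* Sketch of checkpoint-based recovery. */
static jmp_buf checkpoint;
static int have_checkpoint;

static int with_checkpoint(void (*body)(void))
{
    if (setjmp(checkpoint) != 0) {
        have_checkpoint = 0;
        return 1;                /* resumed here after a recovered panic */
    }
    have_checkpoint = 1;
    body();
    have_checkpoint = 0;
    return 0;                    /* body completed normally */
}

static void resume_after_recover(void)
{
    if (have_checkpoint)
        longjmp(checkpoint, 1);  /* jump back to the safe point */
    /* no checkpoint: a recovered panic here would be fatal */
}

/* demo bodies for the usage example below */
static void normal_body(void)    { /* returns without panicking */ }
static void panicking_body(void) { resume_after_recover(); }
```

This also makes the constraint above concrete: recovery only works when a checkpoint was established before the panic, because longjmp needs a live setjmp frame to return into.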

Type System

Type Descriptors

Type descriptors are the bridge between Go’s compile-time type system and the runtime’s behavior. gccgo generates a descriptor for every Go type, and libgodc uses that metadata for:

  • precise GC pointer scanning (which parts of an object may contain pointers?)
  • interface method dispatch (which methods does this type implement?)
  • limited reflection and type-name metadata

Here is a reduced view of the base descriptor:

typedef struct __go_type_descriptor {
    uintptr_t __size;                            // Size of values of this type
    uintptr_t __ptrdata;                         // Prefix of the value that may contain pointers
    uint8_t __code;                              // Kind (bool, int, slice, struct, etc.)
    const uint8_t *__gcdata;                     // GC bitmap or GC program
    const struct __go_string *__reflection;      // String form used by limited reflection/reporting
    const struct __go_uncommon_type *__uncommon; // Method metadata for named/method-bearing types
} __go_type_descriptor;

The full structure also contains equality, alignment, hashing, and pointer-type metadata. On the current 32-bit SH-4 build this base descriptor is 36 bytes. See runtime/type_descriptors.h for the authoritative layout and offset checks.

Interface Tables

Interface method calls need another runtime structure: an interface table (itab). When you write:

var w io.Writer = os.Stdout
w.Write(data)

the runtime must answer two questions before w.Write(data) can run:

  • what concrete type is actually stored inside the interface?
  • which concrete function implements the interface method slot being called?

libgodc answers that by building an itab on demand in get_itab(). The layout follows what gccgo expects:

  • Iface.itab points at the methods[0] entry of the table
  • methods[0] stores the concrete type descriptor
  • methods[1+] store function pointers for the interface methods

Once built, the table is cached and reused for later interface calls with the same interface type and concrete type pair.

SH4 Specifics

These hardware and ABI constraints shape how libgodc is implemented.

Calling Convention Basics

When one function calls another, the SH-4 ABI defines which registers may be overwritten and which must be preserved. This matters directly to libgodc’s assembly stubs and context-switching code, because the runtime must save and restore the right machine state.

  • r0-r3: argument, result, and scratch registers
  • r4-r7: additional argument registers
  • r8-r13: callee-saved general-purpose registers
  • r14: frame pointer
  • r15: stack pointer
  • pr: procedure return register (return address)
  • GBR: global base register, reserved by KOS for _Thread_local

The terms caller-saved and callee-saved simply describe who is responsible for preserving a register across a function call. Caller-saved registers may be overwritten by the callee, so the caller must save them first if it still needs their values. Callee-saved registers must be restored by the callee before it returns.

libgodc does not use GBR for goroutine TLS. Instead, it uses global current_g / current_tls pointers. This avoids conflicts with KOS and keeps context switching simpler.

FPU Mode

The SH-4 has a hardware floating-point unit (FPU), but libgodc is built with GCC’s -m4-single mode. That makes single-precision (float32) arithmetic the fast path. Double-precision (float64) operations are much slower because they fall back to software-emulation helpers instead of running efficiently in hardware.

This matters both for application code and for the runtime. In hot numeric paths, prefer float32. In the scheduler, FPU state also affects context-switch cost, because floating-point registers must be preserved when goroutines switch.

Cache Coherency

The SH-4 has 32-byte cache lines. A saved goroutine context is 64 bytes, so a context switch touches two cache lines of CPU state.

More importantly, DMA-capable hardware does not automatically see dirty CPU cache lines. The GC handles cache management for its own semispace flip, but user code doing DMA must use KOS cache functions explicitly:

#include <arch/cache.h>

dcache_flush_range((uintptr_t)ptr, size);  // Flush before DMA write
dcache_inval_range((uintptr_t)ptr, size);  // Invalidate after DMA read

Assembly/C ABI Synchronization

The Problem

Context switching is implemented partly in assembly (runtime_sh4_minimal.S). That assembly code must know the exact byte offsets of fields inside C structures such as G and sh4_context_t.

Unlike C, assembly has no type system and no offsetof() operator. It only sees numbers. A typical pattern looks like this:

.equ G_CONTEXT, ...
...
add     r1, r0           ! r0 = &G->context

If someone changes the G struct in C by adding, removing, or reordering fields, the assembly can silently start reading or writing the wrong memory. This is a classic low-level systems bug: the C layout changed, but the assembly constants did not.

Current State

The current protection story is more limited than a fully generated include workflow:

  1. runtime/runtime_sh4_minimal.S still hardcodes the offsets it uses via local .equ constants such as G_CONTEXT.
  2. runtime/gen-offsets.c, runtime/asm-offsets.h, and make check-offsets still exist, but the current header-generation pipeline greps #define lines that gen-offsets.c does not emit.
  3. As checked in today, runtime/asm-offsets.h is therefore just a placeholder header with include guards and no actual offset definitions.
  4. make check-offsets currently validates only that placeholder output, not the .equ constants in the assembly file.
  5. runtime/scheduler.c does not currently perform a runtime offsetof() assertion before scheduling starts.

In other words, the assembly file is still the effective source of truth for the offsets it consumes, even though that is exactly the thing we would like to avoid.

This gap is tracked in #14.

Practical Rule

Until that synchronization path is repaired, treat any layout change to runtime/goroutine.h or related context structures as an assembly change too: update the mirrored definitions, update the .equ constants in runtime/runtime_sh4_minimal.S, and verify the offset-check workflow.

Why This Matters

In games, struct layout bugs cause symptoms like:

  • Goroutines resume with corrupted registers
  • Context switches overwrite random memory
  • FPU state leaks between goroutines
  • Panics with nonsensical stack traces

These are nearly impossible to debug. Even with the current partial verification, keeping the C layout, placeholder generated header, and assembly constants in sync is critical.

Design Decisions

Why gccgo instead of gc?

The standard Go compiler (gc) targets a very different runtime and does not provide a practical SH-4 backend path for this project. gccgo already uses GCC’s backend, which supports SH-4 targets, so libgodc can replace libgo with a Dreamcast-specific runtime without having to build a new compiler backend first.

Why semispace instead of marksweep?

Semispace allocation is simple, fast, and has no fragmentation. On a 16MB console, fragmentation could eventually make large allocations fail even when total free memory still exists. The tradeoff is that only half of the GC-managed semispace heap is usable at once, which is acceptable for the intended game workloads. The current large-allocation bypass has additional tradeoffs and limitations; see #6.

Why cooperative instead of preemptive?

Preemptive scheduling can improve fairness and timer responsiveness, but it also needs more machinery: timer-driven interruption, safepoints, and extra bookkeeping. On a single-core Dreamcast, that machinery still would not make Go goroutines run in parallel, so libgodc currently favors a simpler cooperative model. Follow-up work on fairness without abandoning the single-threaded design is tracked in #12.

Why fixed stacks instead of growable?

Growable stacks require both compiler support and runtime machinery to detect overflow, allocate a larger stack, copy active frames, and resume execution on the new stack. Fixed stacks remove that machinery and fit the current -fno-split-stack model, which keeps the runtime much simpler. The tradeoff is that stack size becomes a real limit rather than something the runtime can expand dynamically. The current default is 64KB for spawned goroutines, but that is not a universal guarantee. Related follow-up work is tracked in #4, #5, and #10.

References

  • Cheney, C. J. “A Nonrecursive List Compacting Algorithm.” Communications of the ACM, 1970.
  • Jones, R., and R. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley, 1996.
  • The Go Programming Language Specification.
  • KallistiOS Documentation.
  • SH-4 Software Manual, Renesas.

Effective Dreamcast Go

A practical guide to writing efficient Go code for the Sega Dreamcast.

These patterns come from real debugging sessions with the libgodc runtime. Follow them to write games that run smoothly at 60fps on the Dreamcast’s 200MHz SH-4 processor with 16MB RAM.

Memory Model

| Resource | Default build config | Notes |
|---|---|---|
| Total RAM | 16 MB main RAM | Dreamcast system RAM budget |
| GC Heap | 2 MB × 2 | Default semispace size, configurable |
| Spawned Goroutine Stack | 64 KB | Default fixed size, cannot grow |
| Main Goroutine Stack | 128 KB | KOS main-thread stack by default |
| Large Object Threshold | 64 KB | Objects strictly larger bypass the GC heap |

1. Pre-allocate During Loading

The garbage collector can pause your game for several milliseconds. Allocate everything during load screens, not gameplay.

Bad: Allocating during gameplay

func UpdateParticles() {
    for i := 0; i < 100; i++ {
        p := new(Particle)  // GC pause risk every frame!
        particles = append(particles, p)
    }
}

Good: Object pooling

// Pre-allocated pool
var particlePool [1000]Particle
var activeCount int

func Init() {
    activeCount = 0
}

func SpawnParticle() *Particle {
    if activeCount >= len(particlePool) {
        return nil  // Pool exhausted
    }
    p := &particlePool[activeCount]
    activeCount++
    *p = Particle{}  // Reset to zero
    return p
}

func DespawnParticle(index int) {
    // Swap with last active
    activeCount--
    particlePool[index] = particlePool[activeCount]
}

2. Respect the Default Stack Limits

Spawned goroutines use a fixed 64KB stack by default. Unlike desktop Go, stacks cannot grow. The main goroutine uses the KOS main-thread stack instead (128KB by default), but deep recursion or large local variables are still a bad fit for this runtime.

Bad: Large local arrays

func ProcessFrame() {
    var buffer [16384]float32  // 64KB on stack - CRASH!
    // ...
}

Good: Use globals or heap for large data

var frameBuffer [8192]float32  // Global, not on stack

func ProcessFrame() {
    // Use frameBuffer safely
    for i := range frameBuffer {
        frameBuffer[i] = 0
    }
}

Bad: Deep recursion

func TraverseTree(node *Node) {
    if node == nil { return }
    TraverseTree(node.left)   // Stack grows each call
    TraverseTree(node.right)  // Can overflow on deep trees
}

Good: Iterative with explicit stack

func TraverseTree(root *Node) {
    stack := make([]*Node, 0, 64)  // Heap-allocated
    stack = append(stack, root)
    
    for len(stack) > 0 {
        node := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        
        if node == nil { continue }
        // Process node...
        stack = append(stack, node.left, node.right)
    }
}

3. Reuse Slices

Creating new slices allocates memory. Reuse existing slices by resetting their length.

Bad: New slice every frame

func GetVisibleEnemies() []Enemy {
    result := make([]Enemy, 0)  // Allocation every call!
    for _, e := range allEnemies {
        if e.visible {
            result = append(result, e)
        }
    }
    return result
}

Good: Reuse with length reset

var visibleEnemies []Enemy

func Init() {
    visibleEnemies = make([]Enemy, 0, 100)  // Once during init
}

func GetVisibleEnemies() []Enemy {
    visibleEnemies = visibleEnemies[:0]  // Reset length, keep capacity
    for _, e := range allEnemies {
        if e.visible {
            visibleEnemies = append(visibleEnemies, e)
        }
    }
    return visibleEnemies
}

4. Minimize Goroutines

Each spawned goroutine consumes 64KB of stack space by default. 100 spawned goroutines = 6.4MB RAM.

Bad: Goroutine per entity

for _, enemy := range enemies {
    go enemy.Think()  // 100 enemies = 6.4MB just for stacks!
}

Good: Process on main goroutine

func UpdateAllEnemies() {
    for i := range enemies {
        enemies[i].Think()  // Sequential, predictable
    }
}

Acceptable: Few dedicated goroutines

func main() {
    go audioMixer()      // One for audio streaming
    go networkHandler()  // One for network (if needed)
    
    // Main loop handles game logic
    for {
        Update()
        Render()
    }
}

5. Use Value Types for Small Structs

Small structs passed by value stay on the stack. Pointers may escape to the heap.

Good: Pass small structs by value

type Vec3 struct {
    X, Y, Z float32  // 12 bytes
}

func Add(a, b Vec3) Vec3 {
    return Vec3{a.X + b.X, a.Y + b.Y, a.Z + b.Z}
}

// Usage - no heap allocation
pos := Add(velocity, acceleration)

Bad: Unnecessary pointer for small struct

func Add(a, b *Vec3) *Vec3 {
    return &Vec3{a.X + b.X, a.Y + b.Y, a.Z + b.Z}  // Escapes to heap!
}

Structs under ~64 bytes are fine to pass by value.

6. Avoid String Operations During Gameplay

Strings are immutable. Concatenation creates new strings (garbage).

Bad: String building in loop

var log string
for i := 0; i < 100; i++ {
    log = log + "entry"  // New allocation each iteration!
}

Bad: Formatted strings every frame

func DrawHUD() {
    scoreText := fmt.Sprintf("Score: %d", score)  // Allocates!
    DrawText(scoreText)
}

Good: Pre-render or avoid strings

// For HUD: use digit sprites
func DrawScore(score int) {
    x := 100
    for score > 0 {
        digit := score % 10
        DrawSprite(digitSprites[digit], x, 10)
        x -= 16
        score /= 10
    }
}

// For debug: print directly (still allocates, but debug only)
println("Debug:", value)

7. Large Assets Bypass the GC Heap

Allocations larger than 64KB use malloc directly and are not garbage collected.

// This 128KB texture is NOT managed by the GC heap
texture := make([]byte, 256*256*2)

// It is not freed automatically.
// This is usually fine for load-once assets.

Implications:

  • Large slices don’t pressure the GC
  • They also don’t get freed automatically
  • A manual free path exists via runtime.FreeExternal
  • Perfect for textures, sounds, level data

8. Escape Analysis Awareness

The Go compiler decides whether variables go on stack (fast) or heap (needs GC). Variables “escape” to heap when:

  • Returned from a function
  • Stored in a slice, map, or struct field
  • Passed to a goroutine
  • Address taken and stored somewhere

Stack allocated (good):

func Calculate() int {
    x := 42        // Stays on stack
    y := x * 2     // Stays on stack
    return y       // Value returned, not pointer
}

Heap allocated (be aware):

func MakeEnemy() *Enemy {
    e := Enemy{}   // Must escape - we return pointer
    return &e      // Heap allocation here
}

Force stack when possible:

// Instead of returning pointer...
func MakeEnemy() *Enemy {
    return &Enemy{HP: 100}  // Heap
}

// Return value and let caller decide:
func NewEnemy() Enemy {
    return Enemy{HP: 100}  // Caller's stack or their choice
}

9. Map Usage Patterns

Maps allocate internally. Pre-size them and avoid creating during gameplay.

Bad: Maps created during gameplay

func SpawnWave() {
    enemyTypes := make(map[string]int)  // Allocation!
    enemyTypes["goblin"] = 10
    // ...
}

Good: Pre-allocated maps

var enemyTypes map[string]int

func Init() {
    enemyTypes = make(map[string]int, 10)  // Pre-size at init
}

func SpawnWave() {
    // Clear and reuse
    for k := range enemyTypes {
        delete(enemyTypes, k)
    }
    enemyTypes["goblin"] = 10
}

10. The Game Loop Pattern

A typical Dreamcast game structure:

package main

// === PRE-ALLOCATED RESOURCES ===
var (
    enemies     [100]Enemy
    particles   [500]Particle
    projectiles [200]Projectile
    
    activeEnemies     []*Enemy
    activeParticles   []*Particle
    activeProjectiles []*Projectile
)

func Init() {
    // Pre-allocate slice capacity
    activeEnemies = make([]*Enemy, 0, 100)
    activeParticles = make([]*Particle, 0, 500)
    activeProjectiles = make([]*Projectile, 0, 200)
    
    // Load assets (large allocations OK here)
    LoadTextures()
    LoadSounds()
    LoadLevel()
}

func Update() {
    // Reset working slices
    activeEnemies = activeEnemies[:0]
    
    // Process game logic (no allocations!)
    for i := range enemies {
        if enemies[i].active {
            enemies[i].Update()
            activeEnemies = append(activeEnemies, &enemies[i])
        }
    }
}

func Render() {
    // Draw using pre-allocated data
    for _, e := range activeEnemies {
        e.Draw()
    }
}

func main() {
    Init()
    
    for !shouldExit {
        Input()
        Update()
        Render()
        // VSync handled by PVR
    }
}

Quick Reference Card

DO

var pool [N]Object             // Pre-allocated pools
slice = slice[:0]              // Reset slice, keep capacity
for i := range arr { }         // Index iteration
small := Vec3{1, 2, 3}         // Value types
make([]T, 0, capacity)         // Pre-sized slices (at init)
val, ok := m[key]              // Safe map access
select { default: }            // Yield when no case is ready
runtime_checkpoint()           // Establish checkpoint before deferred recover

AVOID (during gameplay)

make([]T, n)                   // New slices
append(s, x)                   // When at capacity  
new(T)                         // For small types
go func() {}()                 // Excessive goroutines
string + string                // String concatenation
fmt.Sprintf()                  // Formatted strings
recover()                      // Not enough without a checkpoint
for { busyWork() }             // Loops without yielding

11. Panic Recovery Is Limited

libgodc implements recover(), but resumed execution currently depends on a checkpoint established before the code that may panic. A recovered panic longjmps back to that checkpoint.

Practical rules:

  • Plain recover() without a checkpoint is not enough.
  • Nil dereference, bounds, and divide-by-zero helpers currently go through the same panic machinery as panic().
  • Fatal runtime_throw() paths and interface type-assertion panic helpers still abort immediately.
  • For gameplay code, avoid panic-based control flow and validate inputs early.

Most game code shouldn’t need recovery. Design to avoid panics:

  • Check bounds before indexing
  • Validate inputs at entry points
  • Use ok form for map access: val, ok := m[key]

12. Cooperative Scheduling

The Dreamcast scheduler is cooperative, not preemptive. Goroutines run until they yield.

Goroutines yield when they:

  • Block on channel operations
  • Call select/default when no case is ready
  • Call explicit yield functions such as runtime.Gosched()
  • Sleep or wait on timers

Bad: Infinite loop without yielding

go func() {
    for {
        doWork()  // Never yields - blocks all other goroutines!
    }
}()

Good: Yield periodically

go func() {
    for {
        doWork()
        select {
        case <-done:
            return
        default:
            // Yields to scheduler, then continues
        }
    }
}()

Better: Use channels for work

go func() {
    for item := range workQueue {  // Yields while waiting
        process(item)
    }
}()

Timing is not guaranteed

Because of cooperative scheduling:

  • Don’t rely on precise goroutine ordering
  • Deadlines are “best effort”, not hard guarantees
  • For real-time needs, keep critical work on main goroutine

13. Select with Default

select with default is an efficient polling pattern that yields when no case is ready:

func pollChannels() {
    for {
        select {
        case msg := <-inputChan:
            handleInput(msg)
        case result := <-resultChan:
            handleResult(result)
        default:
            // No message ready - yields to other goroutines
            // and then returns immediately
        }
        
        // Can do other work here
        processFrame()
    }
}

This pattern works well for:

  • Non-blocking channel checks
  • Game loops that need to poll multiple sources
  • Background workers that shouldn’t block the main loop

Platform Constraints

Goroutine Leak

The runtime contains a dead-goroutine queue and a freegs reuse path, but in the current source exited goroutines do not age into reclaimable state because global_generation is never advanced.

In practice, high-churn spawn/exit patterns can retain goroutine state instead of recycling it promptly. Prefer long-lived goroutines and monitor goroutine count with runtime.NumGoroutine().

Panic Recovery Boundary

panic() participates in the panic/recover machinery, and nil/bounds/divide helpers currently do too.

The hard boundary is elsewhere:

  • recover() without an earlier checkpoint is fatal
  • runtime_throw() failures abort immediately
  • Interface type-assertion panic helpers abort immediately

Treat panic recovery as a specialized escape hatch, not a normal game-code tool.

32-bit Pointers

All pointers are 4 bytes. Code assuming 64-bit pointers will break. unsafe.Sizeof(uintptr(0)) returns 4, not 8.

Single-Precision FPU

The SH-4 FPU operates in single precision. Double precision is software emulated—extremely slow. Avoid float64 in hot paths.

Cache Coherency

DMA operations require explicit cache management. Use KOS cache functions from C or via //extern:

#include <arch/cache.h>

dcache_flush_range((uintptr_t)ptr, size);  // Before DMA write (CPU -> HW)
dcache_inval_range((uintptr_t)ptr, size);  // After DMA read (HW -> CPU)

Not Implemented

  • Race detector
  • CPU/memory profiling
  • Debugger support (delve, gdb)
  • Plugin package
  • cgo (use //extern for C functions)
  • Signals (os.Signal, signal.Notify)
  • Networking (requires Broadband Adapter)

Limited Implementation

  • reflect: Basic type inspection only, no reflect.MakeFunc
  • unsafe: Works, but remember 4-byte pointers
  • sync: Mutexes work, but deadlocks and starvation are still possible. Avoid blocking or sleeping while holding locks.

Compatibility

  • gccgo only (not the standard gc compiler)
  • KallistiOS required
  • SH-4 architecture only

Debugging Tips

Available tools:

  • Serial output via println() (routed to dc-tool)
  • LIBGODC_ERROR / LIBGODC_CRITICAL macros (defined in runtime.h)
  • GC statistics via the C function gc_stats(&used, &total, &collections)
  • runtime.NumGoroutine() to count active goroutines
  • KOS debug console (dbglog())

Not available: stack traces, core dumps, breakpoints, variable inspection, heap profiling. When something goes wrong, you have println() and your brain.

If your game stutters:

  1. Check GC pauses: Add timing around forceGC() calls to measure
  2. Count allocations: Use pools and count activeCount
  3. Monitor goroutines: Keep count of active goroutines
  4. Profile stack usage: Deep call chains near 64KB will crash

If your game freezes (but doesn’t crash):

  1. Goroutine not yielding: A goroutine in a tight loop starves others
  2. Deadlock: Two goroutines waiting on each other’s channels
  3. Main blocked: Main goroutine waiting on a channel nobody sends to

If your game crashes:

  1. Stack overflow: Reduce recursion, shrink local arrays
  2. Nil pointer: Check slice bounds, map existence
  3. GC corruption: Ensure pointers are valid (not into freed memory)
  4. Panic recovery mismatch: Recovery is checkpoint-based; plain recover() is not enough

Further Reading

  • docs/reference/design.md - Runtime architecture
  • docs/reference/kos-wrappers.md - Hardware access
  • examples/ - Working game examples

Console development is the art of saying ‘no’ to malloc.

KOS API Bindings

KOS is written in C. Your game is written in Go. gccgo’s //extern directive lets you call C functions directly with no wrapper overhead.

┌─────────────────────────────────────────────────────┐
│  Go Code                                            │
│      kos.PvrInitDefaults()                          │
│                │                                    │
│                ▼                                    │
│  //extern pvr_init_defaults                         │
│  func PvrInitDefaults() int32                       │
│                │                                    │
│                ▼                                    │
│  pvr_init_defaults() in libkallisti.a               │
│                │                                    │
│                ▼                                    │
│  Dreamcast Hardware                                 │
└─────────────────────────────────────────────────────┘

Basic Syntax

Function with No Arguments

//go:build gccgo

package kos

//extern pvr_scene_begin
func PvrSceneBegin()

The //extern comment must immediately precede the function declaration, with no blank lines between them. The function has no body—gccgo generates the call directly.

Function with Arguments

//extern pvr_list_begin
func PvrListBegin(list uint32) int32

//extern pvr_poly_compile
func pvrPolyCompile(header uintptr, context uintptr)

Arguments are passed according to the SH-4 ABI: first four in registers (r4-r7), remainder on the stack.

Function with Return Value

//extern pvr_mem_available
func PvrMemAvailable() uint32

//extern timer_us_gettime64
func TimerUsGettime64() uint64

Return values come back in r0 (32-bit) or r0:r1 (64-bit).

Type Mappings

The SH-4 is a 32-bit architecture with 4-byte alignment.

| C Type | Go Type | Size | Notes |
|---|---|---|---|
| void | (no return) | - | |
| int | int32 | 4 | SH-4 int is 32-bit |
| unsigned int | uint32 | 4 | |
| int8_t | int8 | 1 | |
| uint8_t | uint8 | 1 | |
| int16_t | int16 | 2 | |
| uint16_t | uint16 | 2 | |
| int32_t | int32 | 4 | |
| uint32_t | uint32 | 4 | |
| int64_t | int64 | 8 | |
| uint64_t | uint64 | 8 | |
| float | float32 | 4 | |
| double | float64 | 8 | Software emulated—slow |
| void* | unsafe.Pointer | 4 | |
| char* | *byte | 4 | Or unsafe.Pointer |
| size_t | uint32 | 4 | uintptr also works |
| struct foo* | *Foo | 4 | Define matching Go struct |

Pointer Size

All pointers are 4 bytes. Code that assumes 64-bit pointers will break. unsafe.Sizeof(uintptr(0)) is 4, not 8.

Struct Mappings

When a KOS function takes a pointer to a struct, you have two options:

Option 1: unsafe.Pointer (Quick and Dirty)

//extern pvr_vertex_submit
func pvrVertexSubmit(data unsafe.Pointer, size int32)

// Usage:
func SubmitVertex(v *PvrVertex) {
    pvrVertexSubmit(unsafe.Pointer(v), int32(unsafe.Sizeof(*v)))
}

Works but provides no type safety. Fine for prototyping.

Option 2: Matching Go Struct (Correct)

Define a Go struct with identical layout to the C struct:

// From dc/pvr.h
typedef struct {
    uint32_t flags;
    float x, y, z;
    float u, v;
    uint32_t argb;
    uint32_t oargb;
} pvr_vertex_t;

// In Go
type PvrVertex struct {
    Flags      uint32
    X, Y, Z    float32
    U, V       float32
    ARGB       uint32
    OARGB      uint32
}

//extern pvr_prim
func pvrPrim(data unsafe.Pointer, size int32)

// PvrPrimVertex submits a vertex to the TA
func PvrPrimVertex(v *PvrVertex) {
    pvrPrim(unsafe.Pointer(v), 32)  // 32 bytes
}

Verify the struct size matches:

func init() {
    if unsafe.Sizeof(PvrVertex{}) != 32 {
        panic("PvrVertex size mismatch")
    }
}

Alignment Matters

C structs may have padding for alignment. Go structs follow Go’s alignment rules, which may differ. Always verify sizes match.

// C struct with padding:
// struct { char a; int b; }  // 8 bytes (3 bytes padding after a)

// Go equivalent:
type Example struct {
    A   byte
    _   [3]byte  // Explicit padding
    B   int32
}

Stub Files for Host Compilation

Go files using //extern only compile with gccgo. For IDE support and host-side testing, create stub files:

pvr.go (Dreamcast build)

//go:build gccgo

package kos

//extern pvr_init_defaults
func PvrInitDefaults() int32

//extern pvr_scene_begin
func PvrSceneBegin()

pvr_stub.go (Host build)

//go:build !gccgo

package kos

func PvrInitDefaults() int32 { panic("kos: not on Dreamcast") }
func PvrSceneBegin()         { panic("kos: not on Dreamcast") }

The build tag ensures the right file is used:

  • gccgo tag: compiles with sh-elf-gccgo (Dreamcast)
  • !gccgo tag: compiles with standard go (host)

Common Patterns

Wrapper for Type Safety

Expose a safe public API, hide the unsafe internals:

// Private: direct C binding
//extern maple_dev_status
func mapleDevStatus(dev uintptr) uintptr

// Public: type-safe wrapper with method syntax
func (d *MapleDevice) ContState() *ContState {
    if d == nil {
        return nil
    }
    ptr := mapleDevStatus(uintptr(unsafe.Pointer(d)))
    if ptr == 0 {
        return nil
    }
    return (*ContState)(unsafe.Pointer(ptr))
}

Slice to C Array

C functions expect a pointer and length. Go slices have both:

//extern pvr_txr_load
func pvrTxrLoad(src unsafe.Pointer, dst unsafe.Pointer, count uint32)

func PvrTxrLoad(src []byte, dst unsafe.Pointer) {
    if len(src) == 0 {
        return
    }
    pvrTxrLoad(unsafe.Pointer(&src[0]), dst, uint32(len(src)))
}

Always check for empty slices—&src[0] panics on an empty slice.

String to C String

Go strings are not null-terminated. C functions expect null-terminated strings.

import "unsafe"

// Convert Go string to C string (allocates)
func cstring(s string) *byte {
    b := make([]byte, len(s)+1)
    copy(b, s)
    b[len(s)] = 0
    return &b[0]
}

// Usage:
//extern fs_open
func fsOpen(path *byte, mode int32) int32

func Open(path string) int32 {
    return fsOpen(cstring(path), O_RDONLY)
}

For hot paths, avoid allocation by using fixed buffers:

var pathBuf [256]byte

func OpenFast(path string) int32 {
    if len(path) >= 255 {
        panic("path too long")
    }
    copy(pathBuf[:], path)
    pathBuf[len(path)] = 0
    return fsOpen(&pathBuf[0], O_RDONLY)
}

Callback Functions

Some KOS functions take callbacks. This requires careful handling:

//extern pvr_set_bg_color
func PvrSetBgColor(r, g, b float32)

// For callbacks, you often need to use //export to make a Go function
// callable from C. However, this is complex with gccgo.
// Prefer polling over callbacks when possible.

Callbacks from C to Go are tricky because:

  1. The callback runs on whatever stack C chooses
  2. The Go scheduler may not be in a consistent state
  3. The GC may be running

Poll instead of using callbacks when you can.

Caveats

Stack Usage

KOS functions run on the calling goroutine’s stack. Deep C call chains can overflow the 64KB stack:

// DANGEROUS: Unknown stack depth
func LoadLevel(path string) {
    // fs_open -> iso9660_read -> g2_read -> ...
    // How deep does this go?
}

Solutions:

  1. Call from the main goroutine (larger stack)
  2. Limit recursion depth in your code
  3. Move heavy I/O to loading screens

Blocking Calls

Some KOS functions block (file I/O, CD reads). During blocking:

  • No other goroutines run (M:1 scheduler is blocked)
  • Timers don’t fire
  • The game freezes

// BAD: Blocks entire game for 200ms+
data := loadFile("/cd/level.dat")

// BETTER: Do during loading screen
showLoadingScreen()
data := loadFile("/cd/level.dat")
hideLoadingScreen()

// BEST: Stream data over multiple frames
go streamFile("/cd/level.dat", dataChan)

GBR Register

libgodc uses a global pointer for goroutine TLS, leaving GBR for KOS. This means KOS _Thread_local variables work correctly.

If you’re writing assembly or using inline asm, don’t touch GBR—it’s reserved for KOS.

Building the kos Package

The kos/ directory contains the official bindings. To rebuild:

cd kos/
make clean
make
make install  # Copies to $KOS_BASE/lib/

This produces:

  • kos.gox — Export data for the Go compiler
  • libkos.a — Compiled bindings for the linker

Adding New Bindings

Step 1: Find the C Declaration

grep -r "pvr_mem_reset" $KOS_BASE/include/
# Found in dc/pvr.h:
# void pvr_mem_reset(void);

Step 2: Write the Go Binding

//extern pvr_mem_reset
func PvrMemReset()

For functions with complex signatures, check the header carefully:

// From dc/pvr.h
int pvr_prim(void *data, size_t size);

//extern pvr_prim
func pvrPrim(data unsafe.Pointer, size uint32) int32

Step 3: Add Type-Safe Wrapper (Optional)

// For polygon headers (using helper function)
func PvrPrim(hdr *PvrPolyHdr) int32 {
    return goPvrPrimHdr(unsafe.Pointer(hdr))
}

// For vertices (using helper function)
func PvrPrimVertex(v *PvrVertex) int32 {
    return goPvrPrimVertex(unsafe.Pointer(v))
}

Note: For performance-critical paths like vertex submission, libgodc uses specialized C helper functions (__go_pvr_prim_hdr, __go_pvr_prim_vertex) that handle store queue operations efficiently.

Step 4: Add Stub

In the stub file (guarded by //go:build !gccgo), so host builds still compile:

func PvrMemReset() {
    panic("kos: not on Dreamcast")
}

Step 5: Rebuild

make clean && make && make install

Reference: KOS Subsystems

| Subsystem  | Header          | Prefix      | Description            |
|------------|-----------------|-------------|------------------------|
| PVR        | dc/pvr.h        | pvr_        | PowerVR graphics       |
| Maple      | dc/maple.h      | maple_      | Controllers, VMU, etc. |
| Sound      | dc/sound/       | snd_        | AICA sound chip        |
| Streaming  | dc/snd_stream.h | snd_stream_ | Audio streaming        |
| Filesystem | kos/fs.h        | fs_         | File operations        |
| Timer      | arch/timer.h    | timer_      | High-resolution timing |
| Video      | dc/video.h      | vid_        | Video modes            |
| G2 Bus     | dc/g2bus.h      | g2_         | Bus transfers          |
| CDROM      | dc/cdrom.h      | cdrom_      | CD access              |
| VMU        | dc/vmu_*.h      | vmu_        | Visual Memory Unit     |
| BFont      | dc/biosfont.h   | bfont_      | BIOS font rendering    |

Example: Complete PVR Bindings

pvr.go

//go:build gccgo

package kos

import "unsafe"

// PvrPtr is a pointer to PVR video memory (VRAM)
type PvrPtr uintptr

// PVR list types
const (
    PVR_LIST_OP_POLY uint32 = 0  // Opaque polygons
    PVR_LIST_OP_MOD  uint32 = 1  // Opaque modifiers
    PVR_LIST_TR_POLY uint32 = 2  // Translucent polygons
    PVR_LIST_TR_MOD  uint32 = 3  // Translucent modifiers
    PVR_LIST_PT_POLY uint32 = 4  // Punch-through polygons
)

// Initialization
//extern pvr_init_defaults
func PvrInitDefaults() int32

// Scene management
//extern pvr_scene_begin
func PvrSceneBegin()

//extern pvr_scene_finish
func PvrSceneFinish() int32

//extern pvr_wait_ready
func PvrWaitReady() int32

// List management
//extern pvr_list_begin
func PvrListBegin(list uint32) int32

//extern pvr_list_finish
func PvrListFinish() int32

// Primitive submission via helper functions
//extern __go_pvr_prim_hdr
func goPvrPrimHdr(data unsafe.Pointer) int32

//extern __go_pvr_prim_vertex
func goPvrPrimVertex(data unsafe.Pointer) int32

type PvrVertex struct {
    Flags      uint32
    X, Y, Z    float32
    U, V       float32
    ARGB       uint32
    OARGB      uint32
}

// PvrPrim submits a polygon header
func PvrPrim(hdr *PvrPolyHdr) int32 {
    return goPvrPrimHdr(unsafe.Pointer(hdr))
}

// PvrPrimVertex submits a vertex
func PvrPrimVertex(v *PvrVertex) int32 {
    return goPvrPrimVertex(unsafe.Pointer(v))
}

// Memory management
//extern pvr_mem_malloc
func PvrMemMalloc(size uint32) PvrPtr

//extern pvr_mem_free
func PvrMemFree(ptr PvrPtr)

//extern pvr_mem_available
func PvrMemAvailable() uint32

pvr_stub.go

//go:build !gccgo

package kos

type PvrPtr uintptr

const (
    PVR_LIST_OP_POLY uint32 = 0
    PVR_LIST_OP_MOD  uint32 = 1
    PVR_LIST_TR_POLY uint32 = 2
    PVR_LIST_TR_MOD  uint32 = 3
    PVR_LIST_PT_POLY uint32 = 4
)

type PvrVertex struct {
    Flags      uint32
    X, Y, Z    float32
    U, V       float32
    ARGB       uint32
    OARGB      uint32
}

func PvrInitDefaults() int32           { panic("kos: not on Dreamcast") }
func PvrSceneBegin()                   { panic("kos: not on Dreamcast") }
func PvrSceneFinish() int32            { panic("kos: not on Dreamcast") }
func PvrWaitReady() int32              { panic("kos: not on Dreamcast") }
func PvrListBegin(list uint32) int32   { panic("kos: not on Dreamcast") }
func PvrListFinish() int32             { panic("kos: not on Dreamcast") }
func PvrPrim(hdr *PvrPolyHdr) int32    { panic("kos: not on Dreamcast") }
func PvrPrimVertex(v *PvrVertex) int32 { panic("kos: not on Dreamcast") }
func PvrMemMalloc(size uint32) PvrPtr  { panic("kos: not on Dreamcast") }
func PvrMemFree(ptr PvrPtr)            { panic("kos: not on Dreamcast") }
func PvrMemAvailable() uint32          { panic("kos: not on Dreamcast") }

Usage in Games

package main

import "kos"

func main() {
    kos.PvrInitDefaults()

    for {
        kos.PvrWaitReady()
        kos.PvrSceneBegin()

        kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
        drawOpaqueGeometry()
        kos.PvrListFinish()

        kos.PvrListBegin(kos.PVR_LIST_TR_POLY)
        drawTranslucentGeometry()
        kos.PvrListFinish()

        kos.PvrSceneFinish()
    }
}

func drawOpaqueGeometry() {
    // First submit a polygon header
    var hdr kos.PvrPolyHdr
    var ctx kos.PvrPolyCxt
    kos.PvrPolyCxtCol(&ctx, kos.PVR_LIST_OP_POLY)
    kos.PvrPolyCompile(&hdr, &ctx)
    kos.PvrPrim(&hdr)
    
    // Then submit vertices
    v := kos.PvrVertex{
        Flags: kos.PVR_CMD_VERTEX_EOL,  // End of strip
        X: 320, Y: 240, Z: 1,
        ARGB: 0xffffffff,
    }
    kos.PvrPrimVertex(&v)
}

Further Reading

Limitations

This document describes the known limitations of libgodc. Understanding these is essential for writing reliable Dreamcast Go programs.

Memory

16MB Total

The Dreamcast has 16MB of RAM. No virtual memory, no swap, no second chance.

Budget your memory:

  • KOS + drivers: ~1MB
  • Your code: build-dependent
  • GC heap: 2MB active by default (4MB total, two semi-spaces)
  • Spawned goroutine stacks: 64KB each by default
  • Main goroutine stack: KOS main-thread stack (128KB by default)
  • Everything else: KOS malloc

When you run out, you crash.

Goroutine Memory Overhead

The runtime contains a dead-goroutine queue and a freegs reuse path for G structs, but in the current source exited goroutines do not age into reclaimable state because global_generation is never advanced.

That means exited goroutines do not currently reach the cleanup path that would reclaim their stack and TLS and then recycle the G struct.

Workaround: Prefer long-lived goroutines and avoid high-churn spawn/exit patterns:

// GOOD: Fixed set of long-lived goroutines
go audioHandler()      // Lives for entire game
go inputPoller()       // Lives for entire game
go gameLoop()          // Lives for entire game

// Risky today: spawning goroutines per-event can accumulate unreclaimed state
for event := range events {
    go handleEvent(event)
}
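One way to avoid per-event spawns is a single long-lived worker draining a channel. This is a hedged sketch; `Event` and the handler are hypothetical names matching the snippet above:

```go
package main

// Event is a hypothetical event type matching the snippet above.
type Event struct{ ID int }

// startEventWorker launches ONE long-lived goroutine that drains the
// channel, instead of spawning a goroutine per event. It returns a done
// channel that closes when the events channel is closed and drained.
func startEventWorker(events <-chan Event, handle func(Event)) chan struct{} {
	done := make(chan struct{})
	go func() {
		for ev := range events {
			handle(ev)
		}
		close(done)
	}()
	return done
}
```

The worker lives for the lifetime of the program, so no goroutine state accumulates regardless of how many events flow through.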

GC Pause Times

The garbage collector effectively stops the world for Go goroutines during collection. Pause times depend primarily on live heap size and object layout:

| Live Heap | Pause   |
|-----------|---------|
| 100KB     | 1-2ms   |
| 500KB     | 5-10ms  |
| 1MB       | 10-20ms |

At 60fps, you have 16.6ms per frame. A 10ms GC pause consumes most of that budget and can cause visible stutter.

Workarounds:

  1. Keep the live heap small (<500KB)
  2. Disable threshold-triggered GC for action sequences:
    debug.SetGCPercent(-1)  // Disable threshold-triggered GC
    runtime.GC()            // Manual GC during loading screens
    
    This reduces surprise GC pauses, but it is not a hard guarantee: if an allocation would overflow the active semispace, GC still runs.
  3. Use non-moving memory for large raw buffers (textures, audio, levels). In the current runtime, allocations larger than 64KB bypass the semispace heap and use malloc(). This is useful for raw buffers, but large typed Go allocations that contain pointers are a known limitation; see #6.

Fixed Spawned-Goroutine Stacks

Spawned goroutine stacks do not grow. By default each spawned goroutine gets 64KB, while the main goroutine uses the KOS main-thread stack.

This limits recursion depth:

| Frame Size | Safe Depth |
|------------|------------|
| 50 bytes   | ~300       |
| 100 bytes  | ~150       |
| 250 bytes  | ~60        |
| 500 bytes  | ~30        |

Workarounds:

  1. Convert recursion to iteration
  2. Use smaller local variables
  3. Pass large data by pointer, not by value
  4. Avoid deep call chains

// BAD: Large local arrays
func processLevel(depth int) {
    var buffer [4096]byte  // 4KB per stack frame!
    // ... recursive call
}

// GOOD: Heap allocation for large buffers
func processLevel(depth int) {
    buffer := make([]byte, 4096)  // GC heap
    // ... recursive call
}
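Workaround 1 in practice: a classically recursive flood fill converted to an explicit work stack on the heap, so depth no longer threatens the fixed stack. This is a generic sketch, not runtime code:

```go
package main

type point struct{ x, y int }

// floodFill marks every cell reachable from (x, y). The explicit slice
// replaces the call stack: stack usage is bounded to a single frame no
// matter how large the fill region is.
func floodFill(grid [][]bool, x, y int) {
	stack := []point{{x, y}}
	for len(stack) > 0 {
		p := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if p.y < 0 || p.y >= len(grid) || p.x < 0 || p.x >= len(grid[p.y]) || grid[p.y][p.x] {
			continue
		}
		grid[p.y][p.x] = true
		stack = append(stack,
			point{p.x + 1, p.y}, point{p.x - 1, p.y},
			point{p.x, p.y + 1}, point{p.x, p.y - 1})
	}
}
```

The slice grows on the GC heap, which is the trade this section recommends: heap pressure instead of stack overflow.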

Scheduling

No Parallelism (M:1)

All goroutines run on a single thread. The go keyword provides concurrency (interleaved execution), not parallelism (simultaneous execution).

There is no benefit from GOMAXPROCS—the Dreamcast has one CPU core.

No Preemption

Goroutines yield only at explicit points:

  • Blocking channel operations
  • runtime.Gosched()
  • time.Sleep()
  • Timer operations
  • Non-blocking select/default when no case is ready

A goroutine in a tight loop blocks all other goroutines:

// BAD: Blocks entire system
for {
    calculateNextFrame()  // Never yields!
}

// GOOD: Explicit yield
for {
    calculateNextFrame()
    runtime.Gosched()  // Let others run
}

Channel Lock Contention

Under high contention, many goroutines contending for the same channel spend time parking and waking through the wait queues. Channel locking is still a serialization point, but it is not implemented as a spin-yield loop.

Workaround: Use buffered channels to reduce contention:

// Unbuffered: every send/receive contends
events := make(chan Event)

// Buffered: reduced contention
events := make(chan Event, 16)
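A related pattern for game loops is a non-blocking per-frame drain using select with a default case (one of the yield points listed earlier; `Event` is a hypothetical type):

```go
package main

// Event is a hypothetical event type for illustration.
type Event struct{ ID int }

// drainEvents handles everything already queued, then returns without
// blocking, so the frame deadline is never held hostage by the channel.
// It reports how many events were handled.
func drainEvents(events <-chan Event, handle func(Event)) int {
	n := 0
	for {
		select {
		case ev := <-events:
			handle(ev)
			n++
		default:
			return n // queue empty; continue the frame
		}
	}
}
```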

Language Features

Not Implemented

  • Race detector
  • CPU/memory profiling
  • Debugger support (delve, gdb)
  • Plugin package
  • cgo (use KOS C functions directly via //extern)

Limited Implementation

  • reflect: Basic type inspection only. No reflect.MakeFunc.
  • unsafe: Works, but remember pointers are 4 bytes.
  • sync: Mutexes work, but M:1 scheduling does not make deadlocks impossible. Avoid blocking or sleeping while holding locks, and keep critical sections short.

Panic Recovery Is Limited

panic() enters the panic/recover machinery, and helper paths for nil dereference, bounds failures, and divide-by-zero currently go through runtime_panicstring() as well.

To resume execution after recovery, the runtime expects a checkpoint to have been established earlier. A recovered panic without a checkpoint becomes fatal (recover without checkpoint).

Some failures still abort immediately, including fatal runtime_throw() paths and interface type-assertion panic helpers that call abort() directly.

For gameplay code, the practical rule is still the same: do not rely on panic recovery for ordinary control flow.

Platform Constraints

32-bit Pointers

All pointers are 4 bytes. Code assuming 64-bit pointers will break:

// BAD: Assumes 64-bit
type Header struct {
    flags uint32
    ptr   uintptr  // 4 bytes on Dreamcast, not 8!
    size  uint32
}

Single-Precision FPU

The SH-4 FPU operates in single precision (-m4-single). Double precision operations are emulated in software—extremely slow.

// FAST: Single precision
var x float32 = 3.14

// SLOW: Software emulation
var y float64 = 3.14159265358979

Avoid float64 in hot paths. The compiler flag -m4-single makes all FPU operations single precision, but libraries may still use doubles.

Cache Coherency

The SH-4 has separate instruction and data caches. DMA operations require explicit cache management using KOS functions:

// Before DMA write (CPU -> hardware):
dcache_flush_range((uintptr_t)ptr, size);   // Flush data cache

// After DMA read (hardware -> CPU):
dcache_inval_range((uintptr_t)ptr, size);  // Invalidate data cache

The GC handles cache management for semispace flips via incremental invalidation, but your DMA code must handle cache coherency explicitly using KOS cache functions.

This is only part of DMA safety. Cache management makes CPU and hardware agree about the bytes at a given address, but it does not stop the GC from moving a small heap buffer to a different address mid-transfer. DMA code therefore needs both:

  1. Correct cache flush/invalidate calls.
  2. A stable, non-moving buffer for the lifetime of the transfer.

Longer-term API work to make DMA-safe memory explicit is tracked in #11.

No Signals

There are no Unix signals. os.Signal, signal.Notify, etc. don’t work. Use KOS’s interrupt handlers or polling instead.

No Networking (by default)

Networking requires a Broadband Adapter (BBA) or modem. Most Dreamcast units don’t have one. Design your game to work offline.

Debugging

Available

  • Serial output via println() (routed to dc-tool)
  • LIBGODC_ERROR / LIBGODC_CRITICAL macros (defined in runtime.h)
  • GC statistics via the C function gc_stats(&used, &total, &collections)
  • runtime.NumGoroutine() to count active goroutines
  • KOS debug console (dbglog())

Not Available

  • Stack traces on panic (limited)
  • Core dumps
  • Breakpoints
  • Variable inspection
  • Heap profiling

When something goes wrong, you have println() and your brain. Use them.

Compatibility

gccgo Only

This runtime is for gccgo (GCC’s Go frontend), not the standard gc compiler. Code compiled with go build will not work. Use sh-elf-gccgo.

KallistiOS Required

libgodc requires KallistiOS. It won’t work with other Dreamcast development libraries.

SH-4 Architecture Only

This code is specifically for the Hitachi SH-4 CPU. It won’t run on other architectures.

Summary

| Limitation               | Impact                                        | Workaround                      |
|--------------------------|-----------------------------------------------|---------------------------------|
| Exited goroutine cleanup | High spawn/exit churn retains stack/TLS state | Long-lived goroutines           |
| GC pauses                | 1-20ms depending on heap                      | Small heap, manual GC timing    |
| M:1 scheduling           | No parallelism                                | Explicit yields                 |
| Fixed stacks             | Limited recursion                             | Iteration, smaller frames       |
| No preemption            | Tight loops block all                         | runtime.Gosched()               |
| Panic recovery           | Checkpoint-based and limited                  | Avoid panic-driven control flow |
| 16MB RAM                 | Memory pressure                               | Monitor usage, plan carefully   |

For typical Dreamcast games—15-60 minute sessions with a fixed goroutine architecture—these limitations are manageable. Design with constraints in mind from the start, and you’ll have a runtime that’s simple, fast, and reliable.

Glossary

Quick reference for terms used throughout this documentation.

Runtime Terms

Bump Allocator

An allocation strategy where memory is allocated by simply incrementing a pointer. O(1) allocation, but cannot free individual objects. libgodc uses this for the GC heap.
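A toy sketch of the idea (illustration only; the real allocator works on the GC semispace and additionally handles alignment and object headers):

```go
package main

// bumpArena is a toy bump allocator over a fixed byte buffer.
type bumpArena struct {
	buf []byte
	off int
}

// alloc is O(1): it hands out the next n bytes by advancing an offset.
// Individual frees are impossible; the arena is reclaimed all at once.
func (a *bumpArena) alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		return nil // exhausted; the real runtime would trigger a collection
	}
	p := a.buf[a.off : a.off+n]
	a.off += n
	return p
}
```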

Cheney’s Algorithm

A garbage collection algorithm that copies live objects from one semispace to another using two pointers (scan and alloc). Named after C.J. Cheney who invented it in 1970.

Context Switch

Saving one goroutine’s CPU registers and loading another’s, allowing multiple goroutines to share a single CPU. On SH4, this involves saving 64 bytes of state.

Cooperative Scheduling

A scheduling model where goroutines must voluntarily yield control. Contrast with preemptive scheduling where the runtime can interrupt goroutines at any time.

Forwarding Pointer

During garbage collection, a pointer left in an object’s old location that points to its new location. Prevents copying the same object twice.

G (Goroutine Struct)

The data structure representing a goroutine. Contains stack bounds, saved CPU context, defer chain, panic state, and scheduling information.

GC Heap

The memory region managed by the garbage collector. In libgodc, this is 4MB total (two 2MB semispaces), with 2MB usable at any time.

hchan

The internal structure representing a Go channel. Contains the buffer, send/receive indices, and wait queues.

M:1 Model

A threading model where many goroutines (M) run on one OS thread (1). All goroutines share a single CPU, providing concurrency but not parallelism.

Root

A root is a place outside the GC heap where the collector looks first for pointers into GC-managed memory. The collector must start there because, at the beginning of GC, it does not yet know which heap objects are live. It first scans the roots, then follows any heap pointers it finds, then follows pointers inside those objects, and so on.

In libgodc, roots include compiler-registered globals, goroutine stacks, the G structs that hold per-goroutine runtime metadata (defer chain, panic state, etc.), and explicit C roots registered with gc_add_root(). There is no separate explicit register scan in the current implementation.

Go example:

// currentLevel itself is outside GC-managed memory.
// but it holds a pointer to the Level struct, on the GC heap.
var currentLevel *Level // global variable: root

func tick() {
    player := &Player{}   // local variable on the current stack: root
    weapon := &Weapon{}   // local variable on the current stack: root
    player.Weapon = weapon
    currentLevel.Player = player

    // During GC:
    // 1. The collector sees `currentLevel`, `player`, and `weapon` in the roots.
    // 2. It follows those pointers into the heap.
    // 3. It then follows `player.Weapon`.
    //
    // So `player` and `weapon` stay alive, even though `weapon` is not itself
    // a global variable. It is reachable from a root.
}

C example:

#include "gc_semispace.h"

void example(void) {
    void *player = gc_alloc(sizeof(void *), NULL);
    void *weapon = gc_alloc(32, NULL);

    *(void **)player = weapon; // player points to weapon
    weapon = NULL;             // now only player still reaches that object

    gc_add_root(&player);      // the variable `player` is an explicit root
    gc_collect();

    // During GC:
    // 1. The collector sees `player` in the explicit root table.
    // 2. It follows `player` into the GC heap, so that object stays alive.
    // 3. It then follows the pointer stored inside `player`, so the `weapon`
    //    object also stays alive even though it was not itself registered as a root.

    gc_remove_root(&player);
}

Run Queue

A list of goroutines that are ready to execute. The scheduler picks goroutines from this queue.

SemiSpace Collector

A garbage collector that divides memory into two equal halves. Objects are allocated in one half; during collection, live objects are copied to the other half.

Stop the World

A GC phase where all program execution pauses while the collector runs. libgodc uses stop-the-world collection exclusively.

Sudog

“Sender/receiver descriptor”: a structure representing a goroutine waiting on a channel operation. Contains pointers to the goroutine, the channel, and the data being transferred.

TLS (Thread-Local Storage)

Per-goroutine storage. In libgodc, each goroutine has its own TLS block containing runtime state.

Type Descriptor

Compiler-generated metadata about a Go type, including size, alignment, hash, and a bitmap indicating which fields contain pointers.

Hardware Terms

AICA

The Dreamcast’s sound processor. An ARM7-based chip with 2MB of dedicated sound RAM. Runs independently of the SH4 CPU.

Cache Line

The unit of data transfer between cache and main memory. 32 bytes on SH4. Accessing one byte loads the entire cache line.

GBR (Global Base Register)

An SH4 register reserved for thread-local storage in KallistiOS. libgodc does not use GBR for goroutine TLS.

KallistiOS (KOS)

The standard open-source SDK for Dreamcast homebrew development. Provides hardware abstraction, memory management, and drivers. It’s pronounced “kay-oss,” like “chaos.”

PowerVR2

The Dreamcast’s GPU. A tile-based deferred renderer with 8MB of dedicated VRAM.

SH4

The Hitachi (now Renesas) SuperH-4 processor used in the Dreamcast. 200MHz, 32-bit, little-endian, with an FPU optimized for single-precision math.

VRAM

Video RAM. 8MB dedicated to the PowerVR2 GPU for textures and framebuffers. Allocated via PvrMemMalloc(), not the GC.

Go Terms

//extern

A gccgo directive that declares a function implemented in C. Allows Go code to call KOS functions directly.

Escape Analysis

Compiler analysis that determines whether a variable can stay on the stack or must be allocated on the heap.
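A minimal illustration (actual escape decisions are compiler- and version-dependent):

```go
package main

// onStack returns a value; nothing outlives the call,
// so v can live on the stack.
func onStack() int {
	v := 42
	return v
}

// escapes returns an address; v outlives the call,
// so the compiler must heap-allocate it.
func escapes() *int {
	v := 42
	return &v
}
```

On a fixed-stack, GC-sensitive target like the Dreamcast, keeping values from escaping matters twice over: it avoids heap allocations and the GC pressure they create.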

gccgo

The GCC frontend for Go. Uses GCC’s backend for code generation, supporting architectures like SH4 that the standard Go compiler doesn’t support.

Interface

A Go type that specifies a set of methods. Variables of interface type can hold any value that implements those methods.

libgo

The standard gccgo runtime library. libgodc replaces this with a Dreamcast-specific implementation.

Slice Header

The 12-byte structure representing a Go slice: a pointer to the backing array, length, and capacity.

String Header

The 8-byte structure representing a Go string: a pointer to the character data and length.

Abbreviations

| Abbr | Full Form                    | Meaning                                                  |
|------|------------------------------|----------------------------------------------------------|
| ABI  | Application Binary Interface | How functions pass arguments and return values           |
| BBA  | Broadband Adapter            | Dreamcast network adapter (10/100 Ethernet)              |
| DMA  | Direct Memory Access         | Hardware-to-hardware memory transfer without CPU         |
| FPU  | Floating Point Unit          | CPU component for floating-point math                    |
| GC   | Garbage Collector            | Automatic memory management system                       |
| KB   | Kilobyte                     | 1,024 bytes                                              |
| MB   | Megabyte                     | 1,048,576 bytes                                          |
| MMU  | Memory Management Unit       | Hardware for virtual memory (Dreamcast doesn’t have one) |
| PC   | Program Counter              | CPU register pointing to current instruction             |
| PR   | Procedure Register           | SH4 register holding return address                      |
| SP   | Stack Pointer                | CPU register pointing to top of stack                    |
| TA   | Tile Accelerator             | PowerVR2 component that processes geometry               |
| TLS  | Thread-Local Storage         | Per-thread/goroutine private data                        |
| VMU  | Visual Memory Unit           | Dreamcast memory card with LCD screen                    |

Performance Numbers

The source tree includes tests/bench_architecture.go, which reports these metrics when run on Dreamcast hardware.

| Benchmark                    | Output                       | Notes                                          |
|------------------------------|------------------------------|------------------------------------------------|
| runtime.Gosched()            | ns per yield                 | Tight-loop yield benchmark                     |
| Baseline comparison          | ns per inline-loop iteration | Rough baseline only; not a direct function call |
| Buffered channel             | ns per operation             | Sends and receives on a buffered channel       |
| Context switch               | ns per switch                | Derived from ping-pong goroutines              |
| Unbuffered channel roundtrip | ns per roundtrip             | Send + receive over an unbuffered channel      |
| Goroutine spawn + run        | ns per spawn                 | Create, schedule, run, and receive             |

GC Pause Times

bench_architecture forces GC with retained allocations of 32, 64, 128, 256, 512, and 1024 KB, and reports pause time in microseconds for each case.

Note: With the default configuration, only allocations strictly larger than 64 KB bypass the GC heap and go directly to malloc.

Memory Configuration

| Parameter              | Value                                      |
|------------------------|--------------------------------------------|
| Goroutine stack        | 64 KB                                      |
| Context size           | 64 bytes                                   |
| GC header              | 8 bytes                                    |
| Large object threshold | 64 KB (size > 64 KB bypasses the GC heap)  |

Run tests/bench_architecture.elf on your hardware for current numbers.

Acknowledgements

Kudos to:

  • Ian Lance Taylor for writing gccgo.
  • KallistiOS team for building and maintaining the Dreamcast SDK.
  • Dreamcast homebrew community for keeping the console alive.

Without you, there would be no libgodc project.