libgodc
Welcome to libgodc — a minimal Go runtime implementation for the Sega Dreamcast.
This project brings the Go programming language to a 1998 game console with 16MB of RAM, a 200MHz SH4 processor, and absolutely no operating system to speak of. It’s an exercise in constraints, a love letter to retro hardware, and a deep dive into how programming languages actually work under the hood.
What is libgodc?
libgodc replaces the standard Go runtime (libgo) with one designed for the Dreamcast’s unique constraints:
| Feature | Desktop Go | libgodc |
|---|---|---|
| Memory | Gigabytes | 16 MB total |
| CPU | Multicore GHz | Singlecore 200 MHz |
| Scheduler | Preemptive | Cooperative |
| GC | Concurrent tricolor | Stop-the-world semispace |
| Stacks | Growable | Fixed 64 KB |
Despite these differences, you write normal Go code. Goroutines work. Channels work. Maps, slices, interfaces — they all work. The magic is in the runtime.
Who is this for?
- Systems programmers curious about runtime implementation
- Go developers who want to understand what happens below go run
- Retro enthusiasts who think game consoles deserve modern languages
- Anyone who enjoys the challenge of severe constraints
Prerequisites
Before diving in, you should be comfortable with:
| Skill | Level | Why You Need It |
|---|---|---|
| Go | Intermediate | Variables, functions, structs, goroutines, channels |
| C | Basic | Pointers, memory layout, basic syntax |
| Command line | Comfortable | Building, running, navigating directories |
You don’t need to know:
- Assembly language (we’ll explain what you need)
- Dreamcast hardware (KallistiOS handles the hard parts)
- Garbage collection algorithms (we’ll build one together)
- Operating system internals (we’ll cover what’s relevant)
If you can write a Go program that uses goroutines and channels, and you know what a pointer is in C, you’re ready.
What’s in this book?
Getting Started
Installation, toolchain setup, and your first Dreamcast Go program.
The Book
A complete walkthrough of building a Go runtime from scratch:
- Memory allocation and garbage collection
- Goroutine scheduling without threads
- Channel implementation
- Panic, defer, and recover
- Building real games
Reference
Technical documentation for daily use:
- API design
- Best practices
- Hardware integration
- Known limitations
Quick Example
package main
import "kos"
func main() {
kos.PvrInitDefaults()
for {
kos.PvrWaitReady()
kos.PvrSceneBegin()
kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
// draw stuff here
kos.PvrListFinish()
kos.PvrSceneFinish()
}
}
This runs on a Dreamcast. Real hardware. 1998 technology. Go code.
Getting Started
Ready to begin? Head to the Installation page.
Or if you want to understand the full journey, start with Building From Nothing.
Installation
Requirements
- A Unix-like system (Linux, macOS, WSL2)
- 4GB disk space for the toolchain
- An x86_64 or arm64 host
- Go 1.25.3 or later — required to install the godc CLI tool
- make — required for building projects
- git — required for toolchain setup and updates
Quick Start
The godc tool automates everything:
go install github.com/drpaneas/godc@latest
godc setup
This downloads the prebuilt toolchain to ~/dreamcast and configures your
environment. Run godc doctor to verify the installation.
godc Commands
| Command | Description |
|---|---|
| godc setup | Install entire toolchain from scratch |
| godc config | Configure paths and settings |
| godc init | Create project files in current directory |
| godc build | Compile your game |
| godc run | Build and run in emulator |
| godc run --ip | Build and run on real Dreamcast via BBA |
| godc clean | Remove build artifacts |
| godc doctor | Check if everything is installed |
| godc update | Update libgodc to latest version |
| godc env | Show current paths |
| godc version | Print godc version |
Configuration
godc stores its config in ~/.config/godc/config.toml:
Path = "/home/user/dreamcast" # Toolchain location
Emu = "flycast" # Default emulator
IP = "192.168.2.203" # Dreamcast IP for dc-tool
To update settings interactively:
godc config
Manual Installation
If the automated setup doesn’t work for your environment:
Step 1: Get the Toolchain
Download the prebuilt toolchain for your platform:
# Linux x86_64
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-linux-x86_64.tar.gz
# Linux arm64 (aarch64)
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-linux-arm64.tar.gz
# macOS arm64 (Apple Silicon)
curl -LO https://github.com/drpaneas/dreamcast-toolchain-builds/releases/download/gcc15.1.0-kos2.2.1/dreamcast-toolchain-gcc15.1.0-kos2.2.1-darwin-arm64.tar.gz
Step 2: Extract
mkdir -p ~/dreamcast
tar -xf dreamcast-toolchain-*.tar.gz -C ~/dreamcast --strip-components=1
The toolchain contains:
~/dreamcast/
├── sh-elf/ # Cross-compiler (sh-elf-gccgo, binutils)
├── kos/ # KallistiOS (OS, drivers, headers)
├── libgodc/ # This library (Go runtime)
└── tools/ # Utilities (elf2bin, makeip, etc.)
Step 3: Environment
Add these to your shell configuration (~/.bashrc, ~/.zshrc, etc.):
export PATH="$HOME/dreamcast/sh-elf/bin:$PATH"
source ~/dreamcast/kos/environ.sh
environ.sh sets KOS_BASE, KOS_ARCH, and other build variables.
Step 4: Verify
sh-elf-gccgo --version
# Should print: sh-elf-gccgo (GCC) 15.1.0 ...
ls $KOS_BASE/lib/libgodc.a
# Should exist
Building libgodc from Source
If you need to modify the runtime, or if prebuilt libraries aren’t available:
git clone https://github.com/drpaneas/libgodc ~/dreamcast/libgodc
cd ~/dreamcast/libgodc
source ~/dreamcast/kos/environ.sh
make clean
make
make install
This builds libgodc.a (the runtime) and libgodcbegin.a (startup code),
then installs them to $KOS_BASE/lib/.
Debug Build
For development, enable debug output:
make DEBUG=1
This adds -DLIBGODC_DEBUG=1 -g to the compiler flags, enabling trace
output and symbols.
Running Code
Emulator
lxdream-nitro or flycast can run Dreamcast binaries.
cd examples/hello
make
flycast hello.elf
Real Hardware
With a Broadband Adapter or serial cable:
# Upload via IP (BBA)
dc-tool-ip -t 192.168.1.100 -x hello.elf
# Upload via serial
dc-tool-ser -t /dev/ttyUSB0 -x hello.elf
The godc run command automates this:
godc run # Uses configured emulator
godc run --ip # Uses dc-tool-ip with configured address
Project Structure
A minimal project:
myproject/
├── go.mod # Module definition
├── main.go # Your code
├── .Makefile # Build rules (generated by godc)
└── romdisk/ # Optional: game assets
├── texture.png
└── sound.wav
Example 1: Minimal (hello)
The simplest program — no graphics, just debug output:
main.go:
// Minimal Dreamcast program
package main
func main() {
println("Hello, Dreamcast!")
}
go.mod (generated by godc init):
module hello
go 1.25.3
replace kos => ~/dreamcast/libgodc/kos
Example 2: Screen Output (hello_screen)
Display text on screen using the BIOS font:
main.go:
// Hello World on Dreamcast screen using BIOS font
package main
import "kos"
func main() {
// center "Hello World" on 640x480 screen
x := 640/2 - (11*kos.BFONT_THIN_WIDTH)/2
y := 480/2 - kos.BFONT_HEIGHT/2
offset := y*640 + x
kos.BfontDrawStr(kos.VramSOffset(offset), 640, true, "Hello World")
for {
kos.TimerSpinSleep(100)
}
}
go.mod (generated by godc init):
module hello_screen
go 1.25.3
replace kos => ~/dreamcast/libgodc/kos
require kos v0.0.0-00010101000000-000000000000
Build and Run
godc init # Generate go.mod and .Makefile
godc build # Compile to .elf
godc run # Launch in emulator
Or manually:
sh-elf-gccgo -O2 -ml -m4-single -fno-split-stack -mfsrra -mfsca \
-I$KOS_BASE/lib -L$KOS_BASE/lib \
-c main.go -o main.o
kos-cc -o myproject.elf main.o \
-L$KOS_BASE/lib -Wl,--whole-archive -lgodcbegin \
-Wl,--no-whole-archive -lkos -lgodc
Romdisks — Packaging Assets
A romdisk is a read-only filesystem compiled into your executable. Put assets
in the romdisk/ directory:
myproject/
├── main.go
└── romdisk/
├── player.png
└── music.wav
The build system automatically:
- Creates romdisk.img using genromfs
- Converts it to romdisk.o using bin2o
- Links it into your executable
Access files in Go via /rd/:
texture := kos.PlxTxrLoad("/rd/player.png", true, 0)
sound := kos.SndSfxLoad("/rd/music.wav")
Compiler Flags
Default flags used by godc:
| Flag | Purpose |
|---|---|
| -O2 | Standard optimization |
| -ml | Little-endian mode |
| -m4-single | SH-4 with single-precision FPU |
| -fno-split-stack | Fixed-size goroutine stacks |
| -mfsrra | Hardware reciprocal sqrt |
| -mfsca | Hardware sin/cos lookup |
For maximum performance:
GODC_FAST=1 godc build
This enables -O3 -ffast-math -funroll-loops. Warning: -ffast-math breaks
IEEE floating-point compliance.
Project Overrides
Create godc.mk for project-specific customizations:
# Reduce GC heap to free RAM for assets
CFLAGS += -DGC_SEMISPACE_SIZE_KB=1024
# Add extra libraries
LIBS += -lmy_custom_lib
# Custom romdisk location
ROMDISK_DIR = assets
Troubleshooting
“sh-elf-gccgo: command not found”
The compiler isn’t in your PATH. Check:
echo $PATH | tr ':' '\n' | grep dreamcast
which sh-elf-gccgo
“cannot find -lgodc”
The runtime library isn’t installed. Build and install it:
cd ~/dreamcast/libgodc
make install
ls $KOS_BASE/lib/libgodc.a
“undefined reference to `__go_runtime_init’”
You’re linking with the wrong library order. The correct order is:
-Wl,--whole-archive -lgodcbegin -Wl,--no-whole-archive -lkos -lgodc
-lgodcbegin must be wrapped in --whole-archive to ensure all its symbols are included.
Runtime crashes immediately
Check if your program uses double-precision floats. The SH-4 FPU is
single-precision only. Compile with -m4-single and avoid float64 in
hot paths.
Out of memory
The Dreamcast has 16MB. Check your allocations using the C API:
#include "gc_semispace.h"
size_t used, total;
uint32_t collections;
gc_stats(&used, &total, &collections);
printf("Heap: %zu / %zu bytes, %u collections\n", used, total, collections);
From Go, you can count goroutines:
println("Goroutines:", runtime.NumGoroutine())
Consider using KOS malloc directly for large buffers:
ptr := kos.PvrMemMalloc(size) // PVR VRAM
ptr := kos.Malloc(size) // KOS heap
Next Steps
- Read the Design Document to understand the runtime architecture
- Read KOS Wrappers to learn how to use KOS functions
- Look at the examples/ directory for working programs
- Read Effective Dreamcast Go for best practices and limitations
Quick Start
Let’s create your first Dreamcast Go program.
Create a Project
mkdir myproject && cd myproject
godc init
Example output:
$ godc init
go: found kos in kos v0.0.0-00010101000000-000000000000
This creates go.mod and go.work files that configure your project to use the kos package from your libgodc installation.
Project Structure
A minimal project looks like this:
myproject/
├── go.mod # Module definition with kos dependency
├── go.work # Workspace configuration
└── main.go # Your code
The go.mod file (paths will match your libgodc location):
module myproject
go 1.25.3
replace kos => /path/to/your/libgodc/kos
require kos v0.0.0-00010101000000-000000000000
The go.work file:
go 1.25.3
use (
/path/to/your/libgodc
.
)
Note: The paths in go.mod and go.work will automatically point to your libgodc installation location.
Hello, Dreamcast!
Create main.go:
package main
import "kos"
func main() {
kos.PvrInitDefaults()
println("Hello, Dreamcast!")
for {}
}
Build and Run
Using godc:
godc build # Compile to .elf
godc run # Launch in emulator
Or manually with sh-elf-gccgo:
sh-elf-gccgo -O2 -ml -m4-single -fno-split-stack -mfsrra -mfsca \
-I$KOS_BASE/lib -L$KOS_BASE/lib \
-c main.go -o main.o
kos-cc -o myproject.elf main.o \
-L$KOS_BASE/lib -Wl,--whole-archive -lgodcbegin \
-Wl,--no-whole-archive -lkos -lgodc
Your First Graphics
Let’s draw something on screen:
package main
import "kos"
func main() {
kos.PvrInitDefaults()
for {
kos.PvrWaitReady()
kos.PvrSceneBegin()
// Draw opaque geometry
kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
drawTriangle()
kos.PvrListFinish()
kos.PvrSceneFinish()
}
}
func drawTriangle() {
// Create and submit polygon header
var hdr kos.PvrPolyHdr
var ctx kos.PvrPolyCxt
kos.PvrPolyCxtCol(&ctx, kos.PVR_LIST_OP_POLY)
kos.PvrPolyCompile(&hdr, &ctx)
kos.PvrPrim(&hdr) // Submit header
// Submit vertices (use PvrPrimVertex for vertices)
v := kos.PvrVertex{
Flags: kos.PVR_CMD_VERTEX,
X: 320, Y: 100, Z: 1,
ARGB: 0xFFFF0000, // Red
}
kos.PvrPrimVertex(&v)
v.X, v.Y = 200, 400
v.ARGB = 0xFF00FF00 // Green
kos.PvrPrimVertex(&v)
v.X, v.Y = 440, 400
v.Flags = kos.PVR_CMD_VERTEX_EOL // End of strip
v.ARGB = 0xFF0000FF // Blue
kos.PvrPrimVertex(&v)
}
Using Goroutines
Goroutines work on Dreamcast:
package main
import "kos"
func main() {
kos.PvrInitDefaults()
// Start a background goroutine
go func() {
counter := 0
for {
counter++
println("Background:", counter)
select {} // Yield to scheduler
}
}()
// Main loop
for {
kos.PvrWaitReady()
kos.PvrSceneBegin()
render()
kos.PvrSceneFinish()
}
}
Using Channels
Channels enable communication between goroutines:
package main
import "kos"
func main() {
kos.PvrInitDefaults()
// Create a buffered channel
scores := make(chan int, 10)
// Score counter goroutine
go func() {
total := 0
for score := range scores {
total += score
println("Total score:", total)
}
}()
// Main game loop
for {
// Game logic
if playerScored() {
scores <- 100 // Send score
}
render()
}
}
Next Steps
- Read the Design Document to understand the runtime architecture
- Check out the examples/ directory for working programs
- Read Effective Dreamcast Go for best practices
Building From Nothing
The Real Starting Point
Most documentation starts after the hard part. “Here’s the GC” assumes you know you need one. “Here’s how goroutines work” assumes you figured out the symbol names.
Let’s go back to the real beginning:
DAY 0: THE SITUATION
You have:
• sh-elf-gccgo (Go compiler for SH-4)
• KallistiOS (Dreamcast SDK)
• A simple Go program: println("Hello, Dreamcast!")
You try to compile it. What happens?
$ sh-elf-gccgo -c hello.go
$ sh-elf-gcc hello.o -o hello.elf
LINKER ERRORS. Hundreds of them.
undefined reference to `runtime.printstring'
undefined reference to `runtime.printnl'
undefined reference to `__go_runtime_error'
undefined reference to `runtime.newobject'
...
Those undefined references are the holes we discussed in Chapter 2. The compiler generated calls to runtime functions that don’t exist.
Your job: Provide implementations for every one of them.
Part 1: The Discovery Process
How Do You Know What gccgo Expects?
This is the question nobody answers. Where is it documented? What’s the ABI?
Answer: It’s not well-documented. You have to investigate.
Here’s the process we used:
Method 1: Read the Linker Errors
The linker tells you exactly what’s missing:
sh-elf-gccgo -c myprogram.go -o myprogram.o
sh-elf-gcc myprogram.o -o myprogram.elf 2>&1 | grep "undefined reference"
You’ll see output like:
undefined reference to `runtime.printstring'
undefined reference to `runtime.printnl'
undefined reference to `__go_runtime_error'
undefined reference to `runtime.newobject'
undefined reference to `runtime.makeslice'
Start here. Each undefined symbol is a function you need to write.
Method 2: Read the gccgo Source
The gccgo frontend lives in the GCC source tree. The key directories:
gcc/go/gofrontend/ ← The Go parser and type checker
libgo/runtime/ ← The reference runtime (for Linux)
libgo/go/ ← Go standard library
When gccgo compiles make([]int, 10), it emits a call to runtime.makeslice. To find the expected signature:
# In the GCC source tree
grep -r "makeslice" libgo/runtime/
You’ll find the actual implementation. Study its parameters and return type.
Method 3: Use nm on Object Files
Compile your Go code and inspect what symbols it references:
sh-elf-gccgo -c test.go -o test.o
sh-elf-nm test.o | grep " U " # "U" = undefined (needs linking)
This shows you every external symbol your code needs.
Method 4: Disassemble and Trace
When things don’t work, disassemble:
sh-elf-objdump -d test.o | less
Look at how functions are called. What registers hold arguments? What’s expected in return registers?
The Symbol Naming Convention
gccgo uses a specific naming scheme:
| Go Concept | Symbol Name |
|---|---|
| runtime.X | runtime.X (literal dot) |
| main.foo | main.foo |
| Method on type T | T.MethodName |
| Interface method | Complex mangling |
Since C can’t have dots in identifiers, we use the __asm__ trick:
void runtime_printstring(String s) __asm__("runtime.printstring");
void runtime_printstring(String s) {
// Implementation
}
Part 2: The Build Order
You can’t build everything at once. There are dependencies:
┌─────────────────────────────────────────────────────────────┐
│ │
│ DEPENDENCY GRAPH │
│ │
│ ┌─────────┐ │
│ │ println │ │
│ └────┬────┘ │
│ │ needs │
│ ┌────▼────┐ │
│ │ strings │ │
│ └────┬────┘ │
│ │ needs │
│ ┌────▼────┐ │
│ │ memory │ │
│ │ alloc │ │
│ └────┬────┘ │
│ │ needs │
│ ┌────▼────┐ │
│ │ heap │ │
│ │ init │ │
│ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Milestone 1: Hello World
Goal: Print a string. No GC, no goroutines, nothing fancy.
What you need:
- Memory allocator — even println allocates internally
- Print functions — runtime.printstring, runtime.printnl, runtime.printint
- String support — Go strings are {pointer, length} structs
- Entry point — something to call main.main
The minimal files:
runtime/
├── go-main.c # Entry point, calls main.main
├── malloc_dreamcast.c # Basic malloc wrapper
├── go-print.c # Print functions
└── runtime.h # Common definitions
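To make the first milestone concrete, here is a minimal sketch of go-print.c. It assumes the String = {pointer, length} layout listed above and uses the __asm__ naming trick shown elsewhere in this book; it prints via stdio for clarity, where the real runtime goes through KOS debug output:
#include <stdio.h>
#include <stdint.h>
// Go string header: a pointer and a length. NOT NUL-terminated.
typedef struct {
    const unsigned char *str;
    intptr_t len;
} String;
void runtime_printstring(String s) __asm__("runtime.printstring");
void runtime_printnl(void) __asm__("runtime.printnl");
void runtime_printstring(String s) {
    fwrite(s.str, 1, (size_t)s.len, stdout); // length-delimited, no strlen
}
void runtime_printnl(void) {
    putchar('\n');
}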
Test:
package main
func main() {
println("Hello, Dreamcast!")
}
If this prints, you have a foundation.
Milestone 2: Basic Types
Goal: Slices, arrays, basic type operations.
What you need:
- makeslice — Create slices
- growslice — Append to slices
- Type descriptors — Compiler generates these, you need to understand them
- Memory operations — memcpy, memset, memmove wrappers
New files:
runtime/
├── slice_dreamcast.c # Slice operations
├── string_dreamcast.c # String operations
└── type_descriptors.h # Type metadata structures
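As a sketch of the kind of code this milestone needs, here is a hypothetical runtime.makeslice. The real parameter list and return convention must be recovered from libgo/runtime/ (Method 2 above); the gc_alloc_bytes and type_size helpers here are placeholders, not libgodc's actual API:
#include <stdint.h>
#include <string.h>
struct __go_type_descriptor; // from type_descriptors.h
extern uintptr_t type_size(const struct __go_type_descriptor *t); // hypothetical
extern void *gc_alloc_bytes(uintptr_t n); // hypothetical allocator hook
extern void runtime_throw(const char *msg);
void *runtime_makeslice(const struct __go_type_descriptor *elem,
                        intptr_t len, intptr_t cap)
    __asm__("runtime.makeslice");
void *runtime_makeslice(const struct __go_type_descriptor *elem,
                        intptr_t len, intptr_t cap) {
    if (len < 0 || len > cap)
        runtime_throw("makeslice: len out of range");
    uintptr_t bytes = (uintptr_t)cap * type_size(elem);
    void *p = gc_alloc_bytes(bytes);
    memset(p, 0, bytes); // Go memory starts zeroed
    return p;
}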
Test:
package main
func main() {
s := make([]int, 5)
s[0] = 42
println(s[0])
}
Milestone 3: Panic and Defer
Goal: Error handling works.
Why before GC? Because GC needs defer for cleanup. And panic is simpler than GC.
What you need:
- Defer chain — Linked list of deferred calls per goroutine
- Panic mechanism — setjmp/longjmp based
- Recover — Check if in deferred function
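Here is a minimal sketch of how these pieces can fit together. The names (defer_node, goroutine, panic_jmp) are illustrative, not libgodc's actual structures:
#include <setjmp.h>
#include <stdlib.h>
// One node per deferred call: pushed by defer, popped LIFO on return or panic.
typedef struct defer_node {
    void (*fn)(void *); // the deferred function
    void *arg; // its argument
    struct defer_node *next; // earlier defers
} defer_node;
typedef struct goroutine {
    defer_node *defers; // head = most recent defer
    jmp_buf panic_jmp; // where an unrecovered panic unwinds to
    int panicking; // set while running defers during a panic
} goroutine;
static void run_defers(goroutine *g) {
    while (g->defers) {
        defer_node *d = g->defers;
        g->defers = d->next; // pop before calling (LIFO order)
        d->fn(d->arg); // a deferred fn may call recover here
        free(d);
    }
}
static void do_panic(goroutine *g) {
    g->panicking = 1;
    run_defers(g); // give the defers a chance to recover
    longjmp(g->panic_jmp, 1); // nobody recovered: unwind and crash
}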
Test:
package main
func main() {
defer println("world")
println("hello")
}
// Should print: hello, then world
Milestone 4: Maps
Goal: Hash tables work.
The problem: Go maps have complex semantics:
- Iteration order is randomized
- Growing rehashes everything
- Keys can be any comparable type
What you need:
- Hash function — For each key type
- Bucket structure — Go uses a specific layout
- makemap, mapaccess, mapassign, mapdelete — Core operations
- Map iteration — Complex state machine
Lesson learned: Map iteration state is stored in a hiter struct. If you get this wrong, range loops break mysteriously.
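For orientation, here is a simplified bucket lookup in the spirit of Go's layout. It fixes the key and value types to integers; the real bucket format and the hiter state must come from libgo/runtime/:
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#define BUCKET_SLOTS 8 // Go also uses 8 slots per bucket
typedef struct bucket {
    uint8_t tophash[BUCKET_SLOTS]; // top byte of each key's hash (0 = empty slot)
    uint64_t keys[BUCKET_SLOTS]; // simplified: fixed-size integer keys
    uint64_t values[BUCKET_SLOTS];
    struct bucket *overflow; // chain used when a bucket fills up
} bucket;
static bool map_lookup(bucket *buckets, size_t nbuckets,
                       uint64_t key, uint64_t hash, uint64_t *out) {
    bucket *b = &buckets[hash & (nbuckets - 1)]; // nbuckets is a power of two
    uint8_t top = (uint8_t)(hash >> 56);
    if (top == 0) top = 1; // 0 is reserved to mean "empty"
    for (; b != NULL; b = b->overflow) {
        for (int i = 0; i < BUCKET_SLOTS; i++) {
            if (b->tophash[i] == top && b->keys[i] == key) {
                *out = b->values[i];
                return true;
            }
        }
    }
    return false;
}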
Milestone 5: Garbage Collection
Goal: Automatic memory management.
Design decision: We chose semi-space copying GC because:
- No fragmentation
- Simple implementation
- Predictable pause times (though not short)
What you need:
- Root scanning — Find all pointers on stack and in globals
- Object copying — Move live objects to new space
- Pointer updating — Fix all references
- Type bitmaps — Know which words are pointers
The hard part: Knowing which stack slots are pointers. gccgo generates __gcdata bitmaps for types, but stack scanning is conservative.
Milestone 6: Goroutines
What you need:
- G struct — Goroutine state
- Stack allocation — Each goroutine needs its own stack
- Context switching — Save/restore CPU registers (assembly!)
- Scheduler — Pick which goroutine runs next
- Run queue — List of runnable goroutines
The assembly is unavoidable. You must write swapcontext in SH-4 assembly. There's no way around it: context switching manipulates the actual CPU registers, and C doesn't give you direct access to them. The compiler manages the registers behind your back.
! Save current context
mov.l r8, @-r4
mov.l r9, @-r4
! ... save all callee-saved registers ...
! Load new context
mov.l @r5+, r8
mov.l @r5+, r9
! ... restore all registers ...
rts
Milestone 7: Channels
Goal: Goroutines can communicate.
Channels require:
- Wait queues (goroutines blocked on send/receive)
- Buffered storage (ring buffer)
- Select statement (waiting on multiple channels)
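Here is a sketch of the buffered-storage half of a channel. The field names echo Go's hchan but are illustrative; libgodc's real version lives in runtime/chan.c and also handles the wait queues and select:
#include <stddef.h>
#include <stdbool.h>
typedef struct hchan {
    size_t qcount; // elements currently buffered
    size_t dataqsiz; // ring buffer capacity
    size_t sendx; // next slot to write
    size_t recvx; // next slot to read
    long *buf; // simplified: a channel of long
    void *sendq; // goroutines blocked sending (wait queue)
    void *recvq; // goroutines blocked receiving (wait queue)
} hchan;
// Fast path for a buffered send: room in the ring, nobody to wake.
static bool chan_try_send(hchan *c, long v) {
    if (c->qcount == c->dataqsiz)
        return false; // full: the caller must park itself on sendq
    c->buf[c->sendx] = v;
    c->sendx = (c->sendx + 1) % c->dataqsiz;
    c->qcount++;
    return true;
}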
The “3 days of debugging” commit touched channels. The issue was usually:
- Waking the wrong goroutine
- Corrupting state during concurrent access
- Stack misalignment after context switch
Part 3: Resources You’ll Need
Essential Reading
- gccgo source code — gcc/go/gofrontend/ and libgo/runtime/
- Go runtime source — $GOROOT/src/runtime/ (different ABI, but same concepts)
- SH-4 programming manual — for assembly and ABI details
- KallistiOS documentation — For Dreamcast specifics
Tools
| Tool | Purpose |
|---|---|
| sh-elf-nm | List symbols in object files |
| sh-elf-objdump | Disassemble code |
| sh-elf-addr2line | Convert addresses to line numbers |
| dc-tool-ip | Upload and run on Dreamcast |
| lxdream | Dreamcast emulator (for faster iteration) |
The Checklist Mentality
Before each phase, write down:
- What symbols must I implement?
- What’s the expected signature?
- How will I test it?
After each phase:
- Did all tests pass?
- What surprised me?
- What would I do differently?
The journey from nothing to a working Go runtime is not easy. But it is achievable. Every problem has a solution. Every bug can be found. Every undefined symbol can be implemented.
You now have the map. Go build it.
Introduction to libgodc
What Is This Book?
This book is about building a Go runtime for the Sega Dreamcast.
Wait, what?
┌─────────────────────────────────────────────────────────────┐
│ │
│ THE CRAZY PROJECT │
│ │
│ Go: │
│ • Designed for servers and cloud computing │
│ • Expects gigabytes of RAM │
│ • Has a sophisticated garbage collector │
│ • Written for modern multi-core CPUs │
│ │
│ Dreamcast: │
│ • A game console from 1998 │
│ • Has 16 MB of RAM (megabytes, not giga) │
│ • Single CPU core at 200 MHz │
│ • Was designed for arcade games │
│ │
│ These shouldn't work together. But they do. │
│ │
└─────────────────────────────────────────────────────────────┘
We call this project libgodc, a library that implements Go’s runtime for the Dreamcast. By the end of this book, you’ll understand how we built the Dreamcast Go runtime from scratch: memory allocation, garbage collection, goroutine scheduling, channels, and more.
Who Is This Book For?
You should read this book if:
- You’re curious how programming languages work “under the hood”
- You want to understand what a runtime actually does
- You enjoy systems programming and low-level details
- You think retro game consoles are cool
You’ll need to know:
- Basic Go (variables, functions, structs, goroutines)
- Some C (pointers, memory, basic syntax)
- What a compiler does (turns source code into machine code, duh!)
You don’t need to know:
- Assembly language (we’ll explain what you need)
- How to program the Dreamcast (KallistiOS handles the hard parts)
- Anything about garbage collectors (we’ll build one together)
The Machine We’re Programming
Let’s meet our hardware. The Sega Dreamcast (1998) was ahead of its time—the first 128-bit console, they said! (Marketing math, but still impressive.)
┌─────────────────────────────────────────────────────────────┐
│ │
│ THE SEGA DREAMCAST │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ CPU: Hitachi SH-4 @ 200 MHz │ │
│ │ │ │
│ │ RAM: 16 MB (yes, that's megabytes, not giga) │ │
│ │ │ │
│ │ VRAM: 8 MB (for the GPU) │ │
│ │ │ │
│ │ GPU: PowerVR2 CLX2 │ │
│ │ │ │
│ │ Sound: Yamaha AICA (has its own ARM7 + 2 MB) │ │
│ │ │ │
│ │ Storage: GD-ROM (or SD card adapter) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
For comparison, your phone probably has:
- 4-8 CPU cores at 2+ GHz
- 4-8 GB of RAM
- Virtual memory, memory protection, multiple privilege levels
The Dreamcast has:
- 1 CPU core at 200 MHz
- 16 MB of RAM
- No virtual memory, no memory protection, no privilege levels
Different world.
Why Can’t We Just Use Standard Go?
Go has an official compiler called gc. It generates code for x86, ARM, and other modern architectures.
The Dreamcast uses a SuperH SH-4 processor. Adding SH-4 support to gc would require rewriting significant portions of the compiler backend—months of work, requiring deep expertise in both Go internals and the SH-4 architecture. That’s a project for a team of compiler engineers with sleepless nights, questionable caffeine consumption, and possibly mild insanity.
Instead, we use gccgo, an alternative Go compiler built on GCC. GCC already supports SH-4 (from decades of embedded development). So gccgo can compile Go to SH-4—we just need to provide the runtime.
┌─────────────────────────────────────────────────────────────┐
│ │
│ TWO PATHS TO GO ON DREAMCAST │
│ │
│ Path A: Modify gc │
│ ───────────────────── │
│ - Write a new SH-4 backend │
│ - Write a new Dreamcast Operating System │
│ - Understand SSA, register allocation, etc. │
│ - Result: "real" Go on Dreamcast │
│ │
│ Path B: Use gccgo + write runtime (this book) │
│ ──────────────────────────────────────────── │
│ - GCC already knows SH-4 │
│ - Write runtime in C │
│ - Result: Go dialect for Dreamcast │
│ │
│ We chose Path B. It's faster and teaches more. │
│ │
└─────────────────────────────────────────────────────────────┘
The 16 Megabyte Problem
Sixteen megabytes. That’s it. Everything must fit:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 16 MB = 16,777,216 bytes │
│ │
│ That's shared between: │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Your program's code (0.5 - 2 MB) │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ KallistiOS overhead (~0.5 MB) │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ Go runtime heap (??? MB) │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ Goroutine stacks (??? MB) │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ Game assets (textures, etc.) (??? MB) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Everything fights for space. │
│ │
└─────────────────────────────────────────────────────────────┘
This is why our garbage collector choice matters so much. We use a semi-space copying collector, which needs two equally-sized spaces. libgodc allocates 2 MB per space = 4 MB total = 2 MB usable heap.
┌─────────────────────────────────────────────────────────────┐
│ │
│ Semi-space GC memory usage (libgodc default): │
│ │
│ ┌─────────────────────┬─────────────────────┐ │
│ │ FROM-SPACE │ TO-SPACE │ │
│ │ 2 MB │ 2 MB │ │
│ │ │ │ │
│ │ (active heap) │ (empty, waiting │ │
│ │ │ for next GC) │ │
│ └─────────────────────┴─────────────────────┘ │
│ │
│ Total: 4 MB for a 2 MB usable heap. That's 50% overhead! │
│ │
│ But: no fragmentation, simple, predictable. │
│ │
└─────────────────────────────────────────────────────────────┘
Design decision: We chose simplicity (semi-space GC) over memory efficiency. On a 16 MB machine, this hurts. But a more memory-efficient collector would be much more complex to implement and debug. The 2 MB usable heap is sufficient for most Dreamcast games—large assets like textures should use external allocation anyway. For games needing more RAM, compile with -DGC_SEMISPACE_SIZE_KB=1024 to shrink the heap to 1 MB usable (2 MB total).
Where Does Everything Live?
The Dreamcast has 16 MB of main RAM at addresses 0x8C000000 to 0x8CFFFFFF. Here’s how it’s organized:
0x8C000000 ──────────────────────────────────────────────
│
│ KOS kernel + drivers (~1 MB)
│
├──────────────────────────────────────────
│ .text (your compiled code)
│ .rodata (constants, type descriptors)
│ .data (initialized globals)
│ .bss (uninitialized globals)
├──────────────────────────────────────────
│
│ KOS malloc heap (everything below):
│
│ ┌─────────────────────────────────────┐
│ │ GC semi-space 0 (2 MB) │
│ ├─────────────────────────────────────┤
│ │ GC semi-space 1 (2 MB) │
│ ├─────────────────────────────────────┤
│ │ Goroutine stacks (64 KB each) │
│ ├─────────────────────────────────────┤
│ │ Textures, audio, game assets │
│ └─────────────────────────────────────┘
│
├──────────────────────────────────────────
│ Main thread stack (grows downward)
│
0x8CFFFFFF ──────────────────────────────────────────────
Total: 16 MB (0x1000000 bytes)
KOS manages the heap via malloc. When you run out of memory, malloc returns NULL and your program crashes. There's no virtual memory, no swap file, no second chance. See our implementation's friendly error messages (lol):
// runtime/gc_heap.c
if (gc_heap.alloc_ptr + total_size > gc_heap.alloc_limit)
runtime_throw("out of memory");
// runtime/stack.c
void *base = memalign(8, size);
if (!base)
runtime_throw("stack_alloc: out of memory");
// runtime/chan.c
c = (hchan *)gc_alloc(totalSize, &__hchan_type);
if (!c)
runtime_throw("makechan: out of memory");
// runtime/tls_sh4.c
tls = (tls_block_t *)malloc(sizeof(tls_block_t));
if (!tls)
runtime_throw("tls_alloc: out of memory");
The SH-4 Processor
Let’s get to know the CPU that runs our code.
The Alignment Rule
Here’s something that will bite you if you forget it:
The SH-4 requires natural alignment.
┌─────────────────────────────────────────────────────────────┐
│ │
│ Type Size Must be aligned to │
│ ──── ──── ────────────────── │
│ uint8 1 byte Any address is fine │
│ uint16 2 bytes Address must be divisible by 2 │
│ uint32 4 bytes Address must be divisible by 4 │
│ uint64 8 bytes Address must be divisible by 8 │
│ │
└─────────────────────────────────────────────────────────────┘
On x86 (your laptop), unaligned access is just slow. On SH-4, it crashes the CPU.
┌─────────────────────────────────────────────────────────────┐
│ │
│ x86 (your laptop): │
│ Unaligned access? → Works, but slower │
│ │
│ SH-4 (Dreamcast): │
│ Unaligned access? → ADDRESS ERROR EXCEPTION │
│ System crashes. No recovery. │
│ │
└─────────────────────────────────────────────────────────────┘
Our allocator must always return properly aligned addresses.
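In code, this means every allocation size and address gets rounded up to the worst-case boundary, for example:
#include <stdint.h>
// Round up to the next 8-byte boundary (worst case: uint64 fields).
#define ALIGN8(x) (((uintptr_t)(x) + 7u) & ~(uintptr_t)7u)
// A heap address p is safe for any Go value only if ALIGN8(p) == p.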
The Floating Point Unit
The SH-4 has a powerful FPU with a twist:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Single-precision (float32): FAST! ✓ │
│ - Hardware accelerated │
│ - Multiply-add in 1 cycle │
│ │
│ Double-precision (float64): Slow ✗ │
│ - Takes many more cycles │
│ - Avoid in performance-critical code │
│ │
└─────────────────────────────────────────────────────────────┘
Go defaults to float64. For games, consider using float32 where precision isn’t critical.
Sadly, making float32 the default in libgodc is not possible. Someone would have to be crazy enough to recompile gccgo and change every constant and the entire standard library to use float32. That is massive work, especially around the math libraries and everything that depends on them. So just remember: use float32, never float64.
A better future fix would be to create float32 wrappers around common math functions.
The Cache Problem
The SH-4 has a 16 KB data cache with “write-back” behavior. When you write data, it might only go to the cache, not to main memory.
THE PROBLEM:
════════════
Your code writes to address 0x8C100000
│
▼
┌───────────────┐
│ CACHE │ ← Data goes HERE
│ (new value) │
└───────────────┘
┌───────────────┐
│ MAIN MEMORY │ ← But not HERE (yet)
│ (old value) │
└───────────────┘
│
▼
GPU reads from 0x8C100000
Gets the OLD value! 💥
We have to manually flush the cache before hardware reads from memory:
dcache_flush_range(addr, len); // Push cache → memory
On your laptop, the OS handles this. On the Dreamcast, it’s our job.
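A typical pattern looks like this (the header name follows KOS convention; check your KOS version for the exact declaration of dcache_flush_range):
#include <stdint.h>
#include <stddef.h>
#include <arch/cache.h> // KOS cache control: dcache_flush_range()
// Fill a buffer the GPU will read, then push it out of the data cache
// so the hardware sees the new values instead of stale memory.
static void upload_vertices(float *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        buf[i] = (float)i; // ... write vertex data ...
    dcache_flush_range((uintptr_t)buf, n * sizeof(float));
}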
KallistiOS: The Foundation
We’re not programming bare-metal. We build on KallistiOS (KOS), the standard SDK for Dreamcast homebrew.
┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Your Go Program │ │
│ └───────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ libgodc │ │
│ │ (Go runtime: GC, scheduler, channels, etc.) │ │
│ └───────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ KallistiOS │ │
│ │ (hardware abstraction, malloc, timers) │ │
│ └───────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Dreamcast Hardware │ │
│ └───────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
KOS is a minimal embedded operating system that gets statically linked into your program. There’s no user/kernel mode separation, no process isolation, and no memory protection. Your code runs with full hardware access, alongside the KOS kernel.
The Constraints That Shape Everything
These hardware limitations drive every decision in libgodc:
Constraint 1: No Memory Protection
On your laptop, accessing invalid memory gives: Segmentation fault (core dumped)
On the Dreamcast: the program silently corrupts memory or crashes without explanation.
Constraint 2: Real-Time Requirements
Games need consistent frame rates. At 60 FPS, you have 16.67 milliseconds per frame:
┌─────────────────────────────────────────────────────────────┐
│ │
│ One frame = 16.67 ms │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ │
│ └─────────────────────────────────────────────────────┘ │
│ Game logic Rendering GC pause │
│ ░░░░░░░░░░ ░░░░░░░░░░░░░░░ ░░░░ │
│ ▲ │
│ │ │
│ If GC takes 20ms, you miss frames! │
│ │
└─────────────────────────────────────────────────────────────┘
Constraint 3: Single Core
The SH-4 is a single-core CPU. Even if we wanted parallel GC, the SH-4 can't run threads simultaneously. So when the GC runs, everything stops.
The Toolchain
In this chapter
- You learn why we use gccgo instead of the standard Go compiler
- You see how Go code becomes Dreamcast machine code
- You understand the “holes” in compiled code and how we fill them
- You discover the dark arts: making C pretend to be Go
- You learn about calling conventions and type descriptors
Why gccgo?
A compiler is just a program that writes programs. Most Go developers use gc, the standard Go compiler. It’s fast, produces excellent code, and has a fantastic runtime.
But gc only speaks certain architectures:
┌─────────────────────────────────────────┐
│ │
│ gc compiler's architecture list │
│ │
│ ✓ x86-64 laptops, desktops │
│ ✓ ARM64 phones, Raspberry Pi │
│ ✓ RISC-V new trend │
│ │
│ ✗ SH-4 "never heard of this" │
│ │
└─────────────────────────────────────────┘
The Dreamcast uses a Hitachi SuperH SH-4 processor. Adding support to gc would require modifying the compiler backend—months of work, lots of caffeine, and at least three existential crises.
But here’s the thing: GCC has supported the SH-4 for over two decades.
┌─────────────────┐ ┌─────────────────┐
│ gc compiler │ │ GCC compiler │
│ │ │ │
│ Knows Go ✓ │ │ Knows Go ✗ │
│ Knows SH-4 ✗ │ │ Knows SH-4 ✓ │
└─────────────────┘ └─────────────────┘
│ │
└─────── combine? ──────────┘
│
▼
┌─────────────────┐
│ gccgo │
│ │
│ Knows Go ✓ │
│ Knows SH-4 ✓ │
└─────────────────┘
gccgo is a Go frontend for GCC. It reads Go source code, performs type checking, then hands everything to GCC’s backend. GCC handles the hard part—generating SH-4 machine code.
We get Go compilation for the Dreamcast “for free.” Our job is to provide the runtime library.
What is a Runtime?
A runtime is a library of functions that a compiled program calls during execution. It handles things the compiler can’t (or shouldn’t) generate inline: memory allocation, garbage collection, goroutine scheduling, panic handling, and more.
Why do languages use this pattern? Portability. The compiler translates your source code into machine instructions, but those instructions need to interact with the operating system or hardware. By separating “language translation” from “platform interaction,” you can:
- Reuse the compiler — gccgo already knows Go. We don’t touch it.
- Swap the runtime — We write a Dreamcast-specific runtime. The same compiler now works on a new platform.
This is how Go supports Linux, Windows, macOS, and now Dreamcast—same language, same compiler frontend, different runtimes.
Other languages use similar patterns:
- C has startup code (crt0) and libc for system calls
- C++ adds exception handling (libgcc) and the standard library (libstdc++)
- Rust has a minimal runtime embedded in libstd
- Java has the JVM—a full runtime with GC, JIT, and class loading
- Python has libpython—the interpreter itself
The difference is scope. C’s runtime is small—just system call wrappers. Go’s runtime is large—it includes a garbage collector, scheduler, and channel implementation. That’s why porting Go is harder than porting C, but the principle is identical.
Code with Holes
Here’s the key insight of this entire book. When you compile Go code, the compiler doesn’t include everything.
func main() {
s := make([]int, 10)
m := make(map[string]int)
go doSomething()
}
What does make([]int, 10) actually do? It needs to allocate memory, initialize the slice header, and return it. Does the compiler generate all that code inline?
No. It generates function calls instead:
Your Go code What the compiler emits
───────────── ──────────────────────
make([]int, 10) → CALL runtime.makeslice
make(map[string]int) → CALL runtime.makemap
go doSomething() → CALL runtime.newproc
The compiled object file is full of these calls. But the implementations aren’t there:
┌─────────────────────────────────────────────────────┐
│ │
│ main.o (your compiled code) │
│ │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │HOLE │ │HOLE │ │HOLE │ │HOLE │ │HOLE │ │
│ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │
│ runtime runtime runtime runtime runtime │
│ .make .make .make .new .defer │
│ slice map chan proc proc │
│ │
└─────────────────────────────────────────────────────┘
These are unresolved symbols. The object file knows it needs to call runtime.makeslice, but doesn’t know where that function is.
Who fills in the holes? That’s us. That’s libgodc.
Filling the Holes
Our job is to provide implementations. When the linker combines your code with our library, every hole gets filled:
BEFORE LINKING:
═══════════════
┌──────────────────┐ ┌──────────────────┐
│ main.o │ │ libgodc.a │
│ │ │ │
│ HOLE: runtime. │ │ runtime. │
│ makeslice │ │ makeslice ──────┼──→ actual code!
│ │ │ │
│ HOLE: runtime. │ │ runtime. │
│ newproc │ │ newproc ────────┼──→ actual code!
└──────────────────┘ └──────────────────┘
AFTER LINKING:
══════════════
┌─────────────────────────────────────────────────────┐
│ game.elf │
│ │
│ call runtime.makeslice ───→ [makeslice code] │
│ call runtime.newproc ─────→ [newproc code] │
│ │
│ No more holes! Ready to run. │
└─────────────────────────────────────────────────────┘
The Symbol Problem
There’s a wrinkle. Go uses dots in names: runtime.makeslice.
But dots are illegal in C identifiers:
void runtime.makeslice() { } // SYNTAX ERROR!
How do we write a C function with a dot in its name?
The __asm__ Trick
GCC lets you specify the symbol name separately:
// C identifier uses underscore, but symbol has a dot
void *runtime_makeslice(void *type, int len, int cap)
__asm__("runtime.makeslice");
void *runtime_makeslice(void *type, int len, int cap) {
// implementation
}
┌────────────────────────────────────────────────────────┐
│ │
│ In C code: → In object file: │
│ │
│ runtime_makeslice() runtime.makeslice │
│ (underscore) (dot) │
│ │
│ Go calls runtime.makeslice, linker finds it, │
│ Go never knows it was written in C. │
│ │
└────────────────────────────────────────────────────────┘
Every runtime function in libgodc uses this pattern.
Symbols vs. Signatures
Two things must match between caller and callee:
1. The Symbol (the name): Get it wrong, the linker complains loudly.
2. The Signature (the shape): What arguments, what order, what return values.
The compiler has already decided how to call runtime.makeslice:
Register r4: pointer to type descriptor
Register r5: length
Register r6: capacity
Return value in r0
If our implementation expects arguments in different registers:
What compiler sends: What our code expects:
──────────────────── ──────────────────────
r4 = type pointer r4 = length ← WRONG!
r5 = length r5 = capacity ← WRONG!
The linker won’t catch this. Symbol names match, so it happily connects them. The mismatch only shows up at runtime as mysterious crashes.
┌─────────────────────────────────────────────────────┐
│ │
│ Symbol mismatch: Signature mismatch: │
│ ─────────────── ────────────────── │
│ Linker error Linker succeeds │
│ Clear message Runtime crash │
│ Easy to fix Hard to debug │
│ │
└─────────────────────────────────────────────────────┘
The Calling Convention
When a function calls another function, they need to agree on how to pass data. This is the calling convention.
SH-4 Register Usage
┌─────────────────────────────────────────────────────────────┐
│ SH-4 Register Usage │
│ │
│ r0 Return value / scratch │
│ r1 Return value (64-bit) / scratch │
│ r2-r3 Scratch │
│ ───────────────────────────────────────────── │
│ r4 1st argument │
│ r5 2nd argument │
│ r6 3rd argument │
│ r7 4th argument │
│ ───────────────────────────────────────────── │
│ r8-r13 Callee-saved (must preserve) │
│ r14 Frame pointer │
│ r15 Stack pointer │
└─────────────────────────────────────────────────────────────┘
Why does this matter? Most of the time, it doesn’t—the compiler handles it. But understanding the calling convention helps when:
- Debugging crashes: Register dumps make sense when you know r4-r7 hold arguments
- Writing //extern bindings: You need to match what C functions expect
- Reading the runtime assembly: Context switching must save/restore the right registers (r8-r14 are callee-saved, so the callee must preserve them)
Multiple Return Values
Go functions can return multiple values. C can’t. gccgo handles this by returning a struct:
struct result {
int quotient;
int remainder;
};
struct result divmod(int a, int b) {
return (struct result){ a / b, a % b };
}
Small structs fit in r0-r1. When implementing runtime functions that return multiple values, we must match exactly what gccgo expects.
Reading CPU Registers
Sometimes we need to know register values directly:
// This variable IS register r15
register uintptr_t sp asm("r15");
printf("Stack pointer: 0x%08x\n", sp);
This isn’t a copy—sp is the register. We use this for:
- Stack bounds checking
- Context switching (saving/restoring goroutine state)
- Debugging (dump registers on crash)
Inline Assembly
Sometimes C can’t express what we need. Here are real examples from libgodc:
// Prefetch - hint CPU to load cache line (gc_copy.c)
#define GC_PREFETCH(addr) __asm__ volatile("pref @%0" : : "r"(addr))
// Read the stack pointer (gc_copy.c)
void *sp;
__asm__ volatile("mov r15, %0" : "=r"(sp));
// Read/write status register (scheduler.c)
__asm__ volatile("stc sr, %0" : "=r"(sr)); // read
__asm__ volatile("ldc %0, sr" : : "r"(sr)); // write
// Memory barrier - prevent compiler reordering (runtime.h)
#define CONTEXT_SWITCH_BARRIER() __asm__ volatile("" ::: "memory")
We use assembly for:
- Prefetching (hint cache to load data we’ll need soon)
- Context switching (save/restore all registers—see runtime_sh4_minimal.S)
- Reading special registers (stack pointer, status register)
- Memory barriers (ensure memory operations complete before continuing)
Don’t use it for anything you can do in C. KOS handles cache flush/invalidate via dcache_flush_range().
Type Descriptors
When you define a Go type, the compiler generates a type descriptor. Here are the key fields (the full struct has 12 fields, 36 bytes):
struct __go_type_descriptor {
uintptr_t __size; // Size of an instance
uintptr_t __ptrdata; // Bytes containing pointers
uint32_t __hash; // Hash for type comparison
uint8_t __code; // Kind (int, string, struct...)
const uint8_t *__gcdata; // GC bitmap: which words are pointers
// ... plus alignment, equality function, reflection string, etc.
};
For this Go type:
type Point struct {
X, Y int
Name *string
}
The compiler generates:
┌─────────────────────────────────────────────────────────────┐
│ Type descriptor for Point: │
│ │
│ __size: 12 bytes (int + int + pointer) │
│ __ptrdata: 12 bytes (all 3 words may contain pointers) │
│ __code: STRUCT │
│ __gcdata: bit-packed bitmap (1 bit per word) │
│ │
│ Word 0 (X): int, not a pointer → bit 0 = 0 │
│ Word 1 (Y): int, not a pointer → bit 1 = 0 │
│ Word 2 (Name): pointer → bit 2 = 1 │
│ │
│ __gcdata[0] = 0b00000100 = 0x04 │
│ │
│ GC reads: gcdata[word/8] & (1 << (word%8)) │
│ │
└─────────────────────────────────────────────────────────────┘
The garbage collector uses __gcdata to know which fields to scan. The bitmap is bit-packed: one bit per pointer-sized word. Without it, the GC would have to guess which values are pointers.
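That lookup is one line of C. Given any object's __gcdata and a word index, the GC asks:
#include <stdint.h>
#include <stddef.h>
// Is word `word` of this object a pointer? (bit-packed, 1 bit per word)
static int word_is_pointer(const uint8_t *gcdata, size_t word) {
    return gcdata[word / 8] & (1u << (word % 8));
}
// For Point's bitmap {0x04}: words 0 (X) and 1 (Y) are data,
// word 2 (Name) is a pointer.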
The Build Process
══════════════════════════════════════════════════════════════
THE BUILD PIPELINE
══════════════════════════════════════════════════════════════
ONCE (building libgodc):
────────────────────────
gc_runtime.c ─┐
chan.c ───────┼──→ sh-elf-gcc ──→ *.o ──→ ar ──→ libgodc.a
scheduler.c ──┤
map.c ────────┘
EVERY TIME (building your game):
────────────────────────────────
main.go ──→ sh-elf-gccgo ──→ main.o (with holes)
│
▼
main.o + libgodc.a + libkallisti.a ──→ sh-elf-ld ──→ game.elf
══════════════════════════════════════════════════════════════
The linker doesn’t care what language produced the code. It just matches symbol names.
Why C, Not Go?
libgodc is written in C (specifically, C11 with GNU extensions).
The Bootstrap Problem: To compile Go, you need a Go runtime. To get a Go runtime, you need to compile Go. Chicken, meet egg.
By writing the runtime in C, we sidestep the problem. The C compiler doesn’t need anything from Go.
Also, KallistiOS is written in C, so we can directly call its functions.
What Runs Before main()?
Your Go main() isn’t the first thing that runs. libgodcbegin.a provides the C main() (in go-main.c) that sets everything up:
Dreamcast powers on
│
▼
KallistiOS boots
│
▼
C main() [go-main.c]
│
├──→ runtime_args() Save argc/argv
├──→ runtime_init()
│ ├──→ gc_init() Set up garbage collector
│ ├──→ map_init() Initialize map subsystem
│ ├──→ sudog_pool_init() Pre-allocate channel waiters
│ ├──→ stack_pool_preallocate() Pre-allocate goroutine stacks
│ ├──→ proc_init() Set up scheduler (tls_init, g0)
│ └──→ panic_init() Set up panic/recover
│
├──→ __go_go(main_wrapper) Create goroutine for main.main
│
└──→ scheduler_run_loop() Start scheduler
│
▼
YOUR CODE RUNS HERE
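In C, that startup sequence boils down to a short main(). This sketch mirrors the diagram above; the real go-main.c differs in details (for instance, __go_go also takes an argument pointer):
extern void runtime_args(int argc, char **argv);
extern void runtime_init(void); // gc_init, map_init, proc_init, ...
extern void __go_go(void (*fn)(void)); // simplified signature
extern void scheduler_run_loop(void);
extern void main_wrapper(void); // trampoline that calls main.main
int main(int argc, char **argv) {
    runtime_args(argc, argv); // save argc/argv
    runtime_init(); // bring up GC, maps, stacks, scheduler, panic
    __go_go(main_wrapper); // main.main becomes the first goroutine
    scheduler_run_loop(); // hand the CPU to the scheduler
    return 0; // never reached in practice
}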
Memory Management
The Problem with Memory
In C, you’re the janitor:
char *name = malloc(100);
strcpy(name, "Mario");
free(name); // Forget this? Memory leak.
// Do it twice? Crash.
It's like bringing a cup of coffee to your desk every morning and never taking the mug back. Monday is fine, but by Friday your desk is a pile of empty coffee mugs.
Go says: "I'll handle the trash and the coffee mugs for you."
player := &Player{name: "Mario"} // struct allocation (heap)
enemies := make([]Enemy, 10) // slice allocation (heap)
scores := make(map[string]int) // map allocation (heap)
// That's it. Go cleans up automatically when you're done with them.
Stack vs Heap: Where Does Memory Live?
If you're coming from Python or JavaScript, you might never have thought about where your variables live. In those languages, everything "just works" in the sense that you create objects, use them, and the runtime cleans up. But programs actually use two different regions of RAM: the stack and the heap. Both are in main memory, but they're managed very differently.
func calculate() int {
x := 42 // stack: lives only during this function call
y := x * 2 // stack: same, gone when function returns
return y // value is copied out, then x and y disappear
}
func createPlayer() *Player {
p := &Player{name: "Mario"} // heap: we're returning a pointer
return p // p (the pointer) disappears, but the
// Player data survives on the heap
}
The stack is memory that belongs to the current function call. When the function returns, that memory is immediately reclaimed—no cleanup needed, no garbage collector involved. But the data is gone forever.
The heap is memory that persists beyond the function that created it. When you take the address of something (&Player{...}), return a pointer, or use make() for slices/maps, Go allocates on the heap. That memory sticks around until the garbage collector determines nothing references it anymore.
There’s also the data segment where global variables live. These are allocated once when the program starts and exist until the program exits—no cleanup, no GC, they just persist for the program’s entire lifetime.
var highScore int // data segment - exists from start to end
func main() {
x := 42 // stack - gone when main() returns
p := &Player{} // heap - GC cleans up when unreferenced
highScore = 9999 // modifying global, not allocating
}
On Dreamcast, there are additional memory regions you’ll encounter:
| Region | Size | Contains |
|---|---|---|
| Code | varies | Your compiled program (read-only instructions) |
| Data/BSS | varies | Global variables |
| Stack | 64 KB per goroutine | Local variables, function calls |
| Heap | ~4 MB (2 MB usable) | GC-managed allocations |
| VRAM | 8 MB total | Textures, framebuffer (via PVR functions) |
| Sound RAM | 2 MB | Audio samples (via sound functions) |
VRAM and Sound RAM are physically separate chips—they can’t corrupt main RAM or each other. If you run out of VRAM, PvrMemMalloc() returns 0. If you don’t check and try to use that zero pointer, your program crashes. Use PvrMemAvailable() to check how much VRAM remains (the framebuffer takes some of the 8 MB, so you won’t have all of it for textures).
When your game ends (power off or reset), all memory is simply gone—the “cleanup” is turning off the console.
func example() {
// STACK - temporary, fast, automatic cleanup:
count := 10
sum := 0.0
flag := true
// HEAP - persists, needs GC to clean up:
player := &Player{} // pointer escapes? heap
enemies := make([]Enemy, 5) // slices go to heap
scores := make(map[string]int) // maps always heap
}
The compiler decides where each variable lives through escape analysis: if the data could be used after the function returns (passed around, stored somewhere, returned), it goes to the heap. Otherwise, it stays on the stack.
The garbage collector (GC) finds stuff you’re not using anymore and reclaims the memory. But here’s the catch—it takes time to run.
How Allocation Works
When you create something in Go, where does the memory come from?
We use bump allocation. Think of it like a notepad:
┌─────────────────────────────────────────────────────┐
│ Mario │ Luigi │ Peach │ │
└─────────────────────────────────────────────────────┘
↑
You are here
(next free spot)
To allocate: just write at the current spot and move the marker.
┌─────────────────────────────────────────────────────┐
│ Mario │ Luigi │ Peach │ Toad │ │
└─────────────────────────────────────────────────────┘
↑
Moved!
That’s it! Just move a pointer. Way faster than malloc.
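In C, the whole allocator is a pointer comparison and an addition. This is a sketch with illustrative names (the real code in libgodc also triggers a collection instead of returning NULL):
#include <stdint.h>
#include <stddef.h>
typedef struct {
    uint8_t *alloc_ptr; // "you are here": next free byte
    uint8_t *alloc_limit; // end of the active semi-space
} bump_heap;
static void *bump_alloc(bump_heap *h, size_t size) {
    size = (size + 7) & ~(size_t)7; // 8-byte align, as the SH-4 requires
    if (h->alloc_ptr + size > h->alloc_limit)
        return NULL; // space is full: time to collect
    void *p = h->alloc_ptr;
    h->alloc_ptr += size; // move the marker forward
    return p;
}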
Verifying Allocations: A Hands-On Example
Embedded developers are used to inspecting memory directly. Here’s how you can see these allocations in action:
package main
import "unsafe"
type Player struct {
X, Y float32
Score int32
}
//go:noinline
func allocOnHeap() *Player {
return &Player{X: 10, Y: 20, Score: 100}
}
func main() {
// Stack allocation
var local Player
stackAddr := uintptr(unsafe.Pointer(&local))
println("Stack allocation at:", stackAddr)
// Heap allocation
p := allocOnHeap()
heapAddr := uintptr(unsafe.Pointer(p))
println("Heap allocation at:", heapAddr)
// Multiple heap allocations - watch the bump pointer move
for i := 0; i < 5; i++ {
obj := allocOnHeap()
addr := uintptr(unsafe.Pointer(obj))
println(" Player", i, "at:", addr)
}
}
Actual output from Dreamcast hardware (from tests/test_alloc_inspect.elf):
Stack allocation:
Address (hex): 0x8c494cc4
Heap allocation:
Address (hex): 0x8c084b00
Allocating 5 Player structs consecutively:
Player 0 at: 0x8c084b50
Player 1 at: 0x8c084b68 (+ 24 bytes)
Player 2 at: 0x8c084b80 (+ 24 bytes)
Player 3 at: 0x8c084b98 (+ 24 bytes)
Player 4 at: 0x8c084bb0 (+ 24 bytes)
Global variable at: 0x8c05ecc0
→ Data segment (matches .data section start)
Notice the heap addresses increment by 24 bytes each time—that’s the 12-byte Player struct plus the 8-byte GC header, rounded up to 8-byte alignment. The bump pointer just keeps moving forward.
Using GDB to inspect:
# Start dc-tool with GDB server enabled
$ dc-tool-ip -t 192.168.x.x -g -x your_game.elf
# In another terminal, connect GDB
$ sh-elf-gdb your_game.elf
(gdb) target remote :2159
# Set breakpoint and run
(gdb) break main.main
(gdb) continue
# Examine heap memory (address from test output)
(gdb) x/32x 0x8c084b00 # Dump heap region
(gdb) info registers r15 # Stack pointer (SP)
# View GC heap structure
(gdb) p gc_heap # Print GC heap state
(gdb) p gc_heap.alloc_ptr # Current bump pointer
Memory layout from real hardware (16 MB RAM at 0x8c000000-0x8d000000):
0x8c000000 ┌─────────────────────────────────────┐
│ KOS kernel and system data │
0x8c010000 ├─────────────────────────────────────┤
│ .text (your compiled code) │ ← Binary starts here
0x8c052aa0 ├─────────────────────────────────────┤
│ .rodata (read-only data, strings) │
0x8c05ecc0 ├─────────────────────────────────────┤
│ .data (global variables) │ ← Global at 0x8c05ecc0
0x8c0622ac ├─────────────────────────────────────┤
│ Heap (KOS malloc) │
│ - GC semi-spaces │ ← Heap alloc at 0x8c084b00
│ - KOS thread stacks │ ← Stack var at 0x8c494cc4
│ - Other malloc allocations │
│ │
0x8d000000 └─────────────────────────────────────┘
Note: KOS manages thread stacks via malloc, so both heap allocations and stack memory come from the same pool. The addresses above are from running test_alloc_inspect.elf on real hardware.
But wait…! We never erase anything. Eventually we run out of pages. Yikes!
Why Two Spaces? (Semi-Space Collection)
The bump allocator has a problem: it can only allocate, never free individual objects. When the space fills up, we need a way to reclaim garbage.
Why not free objects in place? Because it creates fragmentation:
┌──────────────────────────────────────────────────────┐
│ Player │ FREE │ Enemy │ FREE │ FREE │ Bullet │ FREE │
└──────────────────────────────────────────────────────┘
↑ can't fit a 3-slot object here
You end up with “free” holes everywhere. A 3-slot object might not fit even though there’s enough total free space.
The solution: copy to a second space. Instead of freeing in place:
- Allocate a second space of equal size
- When the first space fills, scan for live objects (objects still referenced)
- Copy only live objects to the second space
- The first space is now 100% garbage—reset the bump pointer to the start
BEFORE (Space A full): AFTER (Space B active):
┌────────────────────────┐ ┌────────────────────────┐
│ Player │ xxx │ Enemy │ │ → │ Player │ Enemy │ Bullet│
│ xxx │ Bullet │ xxx │ │ │ │
└────────────────────────┘ └────────────────────────┘
(xxx = garbage) (compacted, no gaps!)
This copying collection solves two problems at once:
- Garbage is reclaimed: everything left in Space A is garbage
- Memory is compacted: no fragmentation in Space B
How Copying Works: Cheney’s Algorithm
The copying process uses an elegant algorithm invented by C.J. Cheney in 1970. It needs only two pointers and no recursion:
TO-SPACE:
┌────────────────────────────────────────────────────────┐
│ Player │ Enemy │ Bullet │ │
└────────────────────────────────────────────────────────┘
↑ ↑
SCAN ALLOC
- Start with roots (global variables, stack references, CPU registers).

  Why roots? The GC needs to know which objects are still in use. It can’t ask the running program—the program is paused. The only way to determine if an object is “live” is to check: can any code reach it? Roots are the starting points—references the program definitely has access to. If an object isn’t reachable from any root (directly or through a chain of pointers), no code can ever access it again. It’s garbage.

- Copy each root object to to-space at the ALLOC position, then move ALLOC forward by the object’s size (this is the same bump allocation from earlier—just alloc_ptr += size).
- Scan copied objects (starting at the SCAN pointer) for pointers to other objects.

  “Scan” doesn’t mean checking every byte—that would be slow and error-prone. Each object has type information (the __gcdata bitmap from its type descriptor) that tells the GC exactly which fields are pointers. The GC only checks those fields.

- If a referenced object hasn’t been copied, copy it to to-space.
- Update the pointer to point to the new location.
- Repeat until SCAN catches up with ALLOC—all live objects are now copied.
The clever part: when you copy an object, you leave a forwarding pointer in the old location. If another reference points to that same object, you find the forwarding pointer and update the reference without copying again.
// Simplified from runtime/gc_copy.c
void *gc_copy_object(void *old_ptr) {
gc_header_t *header = gc_get_header(old_ptr);
// Already copied? Return the forwarding address
if (GC_HEADER_IS_FORWARDED(header))
return GC_HEADER_GET_FORWARD(header);
size_t obj_size = GC_HEADER_GET_SIZE(header);
// Copy to to-space at current alloc_ptr
gc_header_t *new_header = (gc_header_t *)gc_heap.alloc_ptr;
memcpy(new_header, header, obj_size);
gc_heap.alloc_ptr += obj_size;
void *new_ptr = gc_get_user_ptr(new_header);
// Leave forwarding pointer in old location
GC_HEADER_SET_FORWARD(header, new_ptr);
return new_ptr;
}
Why this algorithm is elegant:
- O(live objects) time—dead objects aren’t even touched
- No recursion—just two pointers chasing each other
- Single pass—scan and copy happen together
- Compaction is free—objects naturally pack together
The trade-off: 50% of heap is always reserved for the copy destination.
The 50% Memory Cost
You may have noticed the trade-off mentioned earlier: one space is always reserved for copying. That means half your heap is “unusable” at any given time.
┌─────────────────────────────────────────────────────┐
│ 4 MB total GC heap │
│ ┌──────────────────┬──────────────────┐ │
│ │ Space A │ Space B │ │
│ │ 2 MB │ 2 MB │ │
│ │ (active) │ (copy target) │ │
│ └──────────────────┴──────────────────┘ │
│ │
│ Usable at any time: 2 MB │
└─────────────────────────────────────────────────────┘
Why accept this 50% cost? Because you get:
- No fragmentation: Cheney’s algorithm compacts automatically
- O(1) allocation: just bump alloc_ptr, no free-list search
- O(live objects) collection: dead objects aren’t even touched
- Simple implementation: fewer bugs in the runtime
- Cache-friendly: live objects end up packed together
It’s a deliberate trade-off: memory for speed and simplicity. On a 16 MB system where you’re also using VRAM and Sound RAM for assets, 2 MB of usable GC heap is often sufficient.
Customizing heap size: The default is GC_SEMISPACE_SIZE_KB=2048 (2 MB per space, 4 MB total). To change it, edit runtime/godc_config.h or rebuild libgodc with make CFLAGS="-DGC_SEMISPACE_SIZE_KB=1024" for 1 MB usable, leaving more RAM for game assets.
The Freeze
Here’s the bad news. When the GC runs, your game stops.
Timeline:
────────────────────────────────────────────────────────
Game: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓████████████▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
↑ ↑
GC starts GC ends
"stop-the-world"
All Go code freezes: game logic, physics, input handling. No goroutines run during collection. (Music keeps playing though—the AICA sound processor runs independently of the SH-4 CPU.)
How long does this take? Let’s find out with real numbers.
Real Benchmark Results
Benchmarks from actual Dreamcast hardware (from tests/bench_architecture.elf), verified December 2025:
┌─────────────────────────────────────────────────────┐
│ SCENARIO GC PAUSE │
├─────────────────────────────────────────────────────┤
│ Large objects (≥128 KB) ~73 μs (bypass GC) │
│ 64 KB live data ~2.2 ms │
│ 32 KB live data ~6.2 ms │
└─────────────────────────────────────────────────────┘
GC pause scales with the number of objects, not just total size. Many small objects (32 KB scenario) require more traversal and copying than fewer large objects.
Key insight: Allocations ≥64 KB bypass the GC heap entirely (go straight to malloc), which is why the “large objects” scenario shows only ~73 μs—that’s just the baseline GC setup cost with nothing to copy.
See the Glossary for a complete reference of all benchmark numbers.
What This Means for Games
Let’s do the math with real data (assuming ~128KB live data = ~6ms pause):
┌─────────────────────────────────────────────────────┐
│ TARGET FPS FRAME BUDGET GC PAUSE (~6ms) │
├─────────────────────────────────────────────────────┤
│ 60 FPS 16.7 ms ~1/3 frame stutter │
│ 30 FPS 33.3 ms barely noticeable │
│ 20 FPS 50 ms unnoticeable │
└─────────────────────────────────────────────────────┘
At 60 FPS, a 6ms GC pause is noticeable but brief. Keep live data small, and pauses stay short.
Big Objects Get Special Treatment
Here’s a surprise: big allocations skip the GC entirely!
small := make([]byte, 1000) // → GC heap
big := make([]byte, 100*1024) // → malloc (bypasses GC!)
The threshold is 64 KB:
┌─────────────────────────────────────────────────────┐
│ SIZE WHERE IT GOES FREED BY │
├─────────────────────────────────────────────────────┤
│ < 64 KB GC heap GC (automatic) │
│ ≥ 64 KB malloc NEVER! (manual) │
└─────────────────────────────────────────────────────┘
Wait, never? That’s right. Big objects are never automatically freed.
Why? Copying a 256 KB texture during GC would be too slow. So we skip it entirely. But that means you’re responsible for freeing it.
⚠️ WARNING ⚠️
Large objects (≥64 KB) are NEVER
automatically freed by the GC!
This is a memory leak unless you
call freeExternal() manually (see next section).
When Is This OK?
Fine: Loading a texture at game start. It lives forever anyway.
Problem: Loading new textures every level without freeing old ones.
Freeing Big Objects
Here’s how to clean up big allocations:
import "unsafe"
//extern _runtime.FreeExternal
func freeExternal(ptr unsafe.Pointer)
// Load a big texture
texture := make([]byte, 256*1024) // 256KB, bypasses GC
// Later, when done with it:
freeExternal(unsafe.Pointer(&texture[0]))
texture = nil // Don't use it anymore!
The best time to do this? Level transitions.
func LoadLevel(num int) {
// Free old level's big stuff
if oldTexture != nil {
freeExternal(unsafe.Pointer(&oldTexture[0]))
oldTexture = nil
}
// Load new level
oldTexture = loadTexture(num)
// Clean up small stuff too
runtime.GC()
}
EXERCISE
3.3 You load a 128 KB texture each level. After 10 levels without calling freeExternal(), how much memory have you leaked?
Making GC Hurt Less
Techniques to reduce GC impact, validated by real benchmarks from tests/bench_gc_techniques.elf.
Technique 1: Pre-allocate Slices
Benchmark result: 78% faster!
Real numbers from Dreamcast:
- Growing slice: 72,027 ns/iteration
- Pre-allocated: 40,450 ns/iteration
// SLOW: Slice grows, triggers multiple allocations
var items []int
for i := 0; i < 100; i++ {
items = append(items, i)
}
Why is this slow? A slice in Go is three things: a pointer to data, a length, and a capacity. When you append beyond capacity, Go must:
- Allocate a new, larger array (typically 2x the size)
- Copy all existing elements to the new array
- Abandon the old array (becomes garbage for GC to collect)
Here’s what happens in memory when appending 5 items to an empty slice:
append #1: Allocate [_], write item → 1 alloc, 0 copies
append #2: Full! Allocate [_,_], copy 1 → 2 allocs, 1 copy
append #3: Full! Allocate [_,_,_,_], copy 2 → 3 allocs, 3 copies total
append #4: Space available, just write → 3 allocs, 3 copies total
append #5: Full! Allocate [_,_,_,_,_,_,_,_], copy 4 → 4 allocs, 7 copies total
For 100 items, this triggers ~7 reallocations and copies ~200 elements total. Each abandoned array is garbage that fills the heap faster.
Memory timeline (growing slice):
┌─────────────────────────────────────────────────────┐
│ [1] ← alloc #1 (abandoned) │
│ [1,2] ← alloc #2 (abandoned) │
│ [1,2,3,_] ← alloc #3 (abandoned) │
│ [1,2,3,4,5,_,_,_] ← alloc #4 (abandoned) │
│ [1,2,3,4,5,6,7,8,9,...] ← alloc #5 (current) │
│ │
│ GC must eventually clean up allocs #1-#4! │
└─────────────────────────────────────────────────────┘
The fix: If you know (or can estimate) how many items you’ll need, pre-allocate:
// FAST: Pre-allocate with known capacity
items := make([]int, 0, 100) // length=0, capacity=100
for i := 0; i < 100; i++ {
items = append(items, i)
}
Memory timeline (pre-allocated):
┌─────────────────────────────────────────────────────┐
│ [_,_,_,_,_,...100 slots...] ← single allocation │
│ [1,_,_,_,_,...] → [1,2,_,_,...] → [1,2,3,_,...] │
│ │
│ No copying. No garbage. Just fill in the blanks. │
└─────────────────────────────────────────────────────┘
No growing. No copying. No garbage. 78% faster.
When to use: Loading enemy spawns from a level file? You know the count. Parsing a protocol with a length header? Pre-allocate. Even a rough estimate (round up to next power of 2) beats growing from zero.
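For the rough-estimate case, a tiny helper does the job. This is a hypothetical sketch (nextPow2 is not part of libgodc):

// Hypothetical helper: round an estimate up to the next power of 2.
func nextPow2(n int) int {
	p := 1
	for p < n {
		p *= 2
	}
	return p
}

// Expecting roughly 300 spawns this level? Reserve 512 slots, once.
spawnIDs := make([]int, 0, nextPow2(300))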
Technique 2: Object Pools
Important: Pools are NOT faster for allocation!
Real numbers from Dreamcast:
- new() allocation: 201 ns/object
- Pool get/return: 1,450 ns/object (7x slower!)
This is counter-intuitive if you’re coming from desktop Go or other languages. Let’s understand why.
Why is new() so fast? Our bump allocator is essentially one operation:
new(Bullet):
┌───────────────────────────────────────────────────────┐
│ alloc_ptr → [████████ used █████|▓▓▓▓ free ▓▓▓▓▓] │
│ ↑ │
│ alloc_ptr += sizeof(Bullet)│
│ │
│ Total: 1 pointer increment. Done. │
└───────────────────────────────────────────────────────┘
That’s it. No free lists to search. No size classes. No locking. Just bump the pointer forward. This is why 201 ns is achievable—it’s maybe 40-50 CPU cycles.
Why are pools slower? Pool operations involve slice manipulation:
GetFromPool():
┌─────────────────────────────────────────────────────┐
│ 1. Check if len(pool) > 0 ← bounds check │
│ 2. Read pool[len-1] ← memory access │
│ 3. pool = pool[:len-1] ← slice header write │
│ 4. Return pointer ← done │
│ │
│ ReturnToPool(): │
│ 1. Reset object fields ← memory writes │
│ 2. pool = append(pool, obj) ← may grow slice! │
│ │
│ Total: ~7x more work than bump allocation │
└─────────────────────────────────────────────────────┘
So why use pools at all? The trade-off isn’t about allocation speed. It’s about when you pay the cost:
WITHOUT POOL (100 bullets/frame):
─────────────────────────────────────────────────────
Frame 1: new new new new... (100x) │ 20 μs │ smooth
Frame 2: new new new new... (100x) │ 20 μs │ smooth
Frame 3: new new new new... (100x) │ 20 μs │ smooth
...
Frame 50: GC TRIGGERED! │ 6 ms │ ← STUTTER!
─────────────────────────────────────────────────────
└─ 60 FPS target = 16.6 ms
6 ms pause = 1/3 frame drop
WITH POOL (100 bullets/frame):
─────────────────────────────────────────────────────
Frame 1: get get get... return... │ 145 μs │ smooth
Frame 2: get get get... return... │ 145 μs │ smooth
Frame 3: get get get... return... │ 145 μs │ smooth
...
Frame 50: (no GC needed) │ 145 μs │ still smooth!
─────────────────────────────────────────────────────
You’re trading ~125 μs per frame for no GC pauses. For a bullet hell game, that’s worth it.
When to use pools:
- High-frequency create/destroy (bullets, particles, audio events)
- Objects with predictable lifetimes (spawned and despawned together)
- When you need consistent frame times (no surprise stutters)
When NOT to use pools:
- Objects created once and kept (player, level geometry)
- Low churn rate (a few allocations per second)
- Prototype/debugging (just use new(), it’s simpler)
Simple pool implementation:
var pool []*Bullet
func GetBullet() *Bullet {
if len(pool) > 0 {
b := pool[len(pool)-1]
pool = pool[:len(pool)-1]
return b
}
return new(Bullet) // Pool empty? Allocate fresh
}
func ReturnBullet(b *Bullet) {
b.X, b.Y, b.Active = 0, 0, false // Reset state!
pool = append(pool, b)
}
Pro tip: Pre-populate the pool at game start to avoid any new() calls during gameplay:
func InitBulletPool(size int) {
pool = make([]*Bullet, size)
for i := range pool {
pool[i] = new(Bullet)
}
}
Now GetBullet() never allocates during gameplay—predictable performance every frame.
Technique 3: Trigger GC at Safe Times
Benchmark: Manual GC takes ~35 μs with minimal live data
The problem with automatic GC is unpredictability. You don’t control when it runs. It just happens when the heap fills up. That might be during a boss fight.
GC pause times from real benchmarks (from bench_gc_pause.elf):
| Live Data | GC Pause | Impact at 60 FPS |
|---|---|---|
| Minimal | ~100 μs | Unnoticeable |
| 32 KB | ~2 ms | Minor stutter |
| 128 KB | ~6 ms | 1/3 frame drop |
The key insight: GC pause scales with live data, not garbage. If you trigger GC when live data is minimal (between levels, during menus), the pause is tiny.
Uncontrolled vs Controlled GC:
UNCONTROLLED (GC surprises you):
─────────────────────────────────────────────────────────────
│ Gameplay ││ Gameplay ││ Gameplay ││ GC! ││ Gameplay │
│ smooth ││ smooth ││ smooth ││6 ms!││ smooth │
─────────────────────────────────────────────────────────────
↑
Player notices!
"Why did it stutter
when I jumped?"
CONTROLLED (you choose when):
─────────────────────────────────────────────────────────────
│ Gameplay ││ Menu Opens ││ Gameplay ││ Level End ││ Next │
│ smooth ││ GC (35 μs) ││ smooth ││ GC (35 μs)││Level │
─────────────────────────────────────────────────────────────
↑ ↑
Player is reading Victory animation
menu anyway playing anyway
How to trigger GC manually:
//go:linkname forceGC runtime.GC
func forceGC()
Best times to trigger GC (player won’t notice):
func OnDialogueStart() {
forceGC() // Text appearing letter-by-letter anyway
}
func OnMenuOpen() {
forceGC() // Player is reading options
}
func OnLevelComplete() {
forceGC() // Victory fanfare playing, score tallying
}
func OnLoadingScreen() {
forceGC() // Already showing "Loading..."
}
func OnRoomTransition() {
forceGC() // Screen is fading to black
}
func OnCutsceneStart() {
forceGC() // Video/animation taking over
}
Important caveats:
- Don’t trigger too often. GC still takes time. Once per scene transition is reasonable. Once per frame defeats the purpose.
- This doesn’t reduce garbage. You’re just choosing when to pay the cost. Combine with pre-allocation and pools to reduce how much garbage you create.
- Live data still matters. If you have 128 KB of permanent game state, even manual GC takes ~6 ms. Keep live data lean.
Good: Trigger GC → level enemies/items are garbage → fast GC
Bad: Trigger GC → 10,000 persistent objects → slow GC anyway
Technique 4: Reuse Slices
Benchmark: 5% faster (13,200 ns → 12,500 ns)
Small gain per-call, but the real win is less garbage over time. Reset with [:0] instead of allocating new:
// BAD: New allocation every frame
func ProcessFrame() {
items := make([]int, 0, 100) // ← garbage next frame
// ...
}
// GOOD: Reuse backing array
var items = make([]int, 0, 100) // Allocate once
func ProcessFrame() {
items = items[:0] // Reset length, keep capacity
// ...
}
The [:0] trick keeps the backing array. Over 1000 frames: 1 allocation instead of 1000.
Bonus pattern—shift without allocating:
// Creates new slice header:
queue = append(queue[1:], newItem)
// Reuses existing array:
copy(queue, queue[1:])
queue[len(queue)-1] = newItem
Technique 5: Compact In-Place
When entities die, don’t allocate a filtered slice. Compact the existing one:
// BAD: Allocates new slice
alive := make([]*Enemy, 0)
for _, e := range enemies {
if e.Active {
alive = append(alive, e) // ← garbage
}
}
enemies = alive
// GOOD: Compact in place
n := 0
for _, e := range enemies {
if e.Active {
enemies[n] = e
n++
}
}
enemies = enemies[:n] // Shrink, no allocation
Visual:
Before: [A, _, B, _, _, C] (3 active, 3 dead)
↓ compact
After: [A, B, C] (same backing array, shorter length)
Classic game loop pattern: every frame, compact dead bullets/particles/enemies without touching the allocator.
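Putting Techniques 2 and 5 together, a per-frame update might look like this sketch (it reuses the Bullet type and ReturnBullet from the pool example above):

// One pass per frame: keep live bullets, recycle dead ones.
// Nothing is allocated and nothing becomes garbage.
func updateBullets(bullets []*Bullet) []*Bullet {
	n := 0
	for _, b := range bullets {
		if b.Active {
			bullets[n] = b // compact live bullets to the front
			n++
		} else {
			ReturnBullet(b) // back to the pool, not to the GC
		}
	}
	return bullets[:n] // same backing array, shorter length
}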
Goroutines
The Trade-off
Let me set expectations: goroutines on Dreamcast work, but differently than on modern hardware.
You get zero parallelism (single CPU), but you get everything else: clean concurrency primitives, channels, and code that feels like Go.
Here’s the thing. Goroutines shine when you have multiple CPU cores:
Modern PC (8 cores):
────────────────────────────────────────────────────────────
Core 1: [──────goroutine A──────]
Core 2: [──────goroutine B──────]
Core 3: [──────goroutine C──────]
Core 4: [──────goroutine D──────]
...
↑
All running SIMULTANEOUSLY
4x faster than running them one-by-one!
But Dreamcast?
Dreamcast (1 core):
────────────────────────────────────────────────────────────
CPU: [───A───][───B───][───A───][───C───][───B───]...
↑
Only ONE runs at a time
ZERO parallelism benefit
So why does libgodc implement them?
Why Bother?
Because Go without goroutines isn’t Go.
Imagine porting Python to a machine without lists. Or JavaScript without callbacks. You could do it, but would it feel like the same language?
I wanted Go on Dreamcast to feel like Go. You can write:
go processEnemies()
go playBackgroundMusic()
go handleInput()
It works. It’s correct. The code is cleaner. It’s just not faster than calling them directly:
processEnemies()
playBackgroundMusic()
handleInput()
There’s overhead—but less than you might expect. Let’s see the numbers.
What Happens Under the Hood
When you create a goroutine, here’s what actually happens:
┌─────────────────────────────────────────────────────────────┐
│ go doSomething() │
│ ──────────────── │
│ │
│ 1. Allocate 64 KB stack (from pool or malloc) │
│ 2. Initialize G struct (~150 bytes) │
│ 3. Save 16 CPU registers to context │
│ 4. Set up context (sp, pc, pr) │
│ 5. Add to run queue │
│ 6. Later: context switch to run (~6.6 μs) │
│ ───────────────────────────────────────────────────── │
│ Total spawn + first run: ~32 μs │
│ │
│ That's ~6,400 CPU cycles per goroutine spawn! │
└─────────────────────────────────────────────────────────────┘
What do you get for this overhead? On a multi-core system: parallelism. On Dreamcast: proper Go semantics and working concurrency primitives. That’s actually worth something!
The Numbers
I ran benchmarks on real Dreamcast hardware (from bench_architecture.elf):
┌─────────────────────────────────────────────────────────────┐
│ OPERATION TIME │
├─────────────────────────────────────────────────────────────┤
│ runtime.Gosched() 120 ns ← very cheap! │
│ Buffered channel op ~1.5 μs │
│ Context switch ~6.6 μs │
│ Channel round-trip ~13 μs │
│ Goroutine spawn+run ~34 μs │
└─────────────────────────────────────────────────────────────┘
At 200 MHz, you get about 200 million cycles per second. At 60 FPS you have 3.3 million cycles per frame. A 34 μs goroutine spawn is ~6,800 cycles—that’s only 0.2% of your frame budget. You can afford a few goroutines per frame, just don’t spawn hundreds!
See the Glossary for a complete reference of all benchmark numbers.
How It Works
The implementation is pretty elegant for a 200 MHz machine. Let’s see how we create the illusion of concurrency.
The G Struct
Every goroutine is a G structure (see runtime/goroutine.h):
┌─────────────────────────────────────────────────────────────┐
│ Goroutine (G) │
│ │
│ _panic: nil (current panic - offset 0) │
│ _defer: nil (deferred functions - offset 4) │
│ atomicstatus: Grunning (or Gwaiting, Grunnable, etc.) │
│ schedlink: next G (run queue linkage) │
│ stack_lo: 0x8c100000 (bottom of stack) │
│ stack_hi: 0x8c110000 (top of stack, 64 KB above) │
│ context: saved CPU registers (64 bytes) │
│ ├── r8-r14 (callee-saved GPRs) │
│ ├── sp, pc, pr (special) │
│ └── fr12-fr15, fpscr, fpul (FPU) │
│ goid: 42 (unique ID - 8 bytes) │
│ waiting: sudog* (channel wait queue entry) │
│ checkpoint: ptr (for panic/recover) │
│ │
└─────────────────────────────────────────────────────────────┘
The key is context, aka the saved CPU registers. This lets us pause mid-function and resume later.
The Run Queue
Runnable goroutines wait in line:
head tail
↓ ↓
┌────┐ ┌────┐ ┌────┐ ┌────┐
│ G3 │──▶│ G7 │──▶│ G2 │──▶│ G9 │──▶ NULL
└────┘ └────┘ └────┘ └────┘
↑
"I'm next!"
The scheduler is simple:
while (true) {
G *gp = runq_get(); // Get next goroutine
if (gp) {
switch_to(gp); // Run it
}
// When it yields, we come back here
}
Context Switching
This is where the magic happens. We’re running goroutine A, and we need to switch to B:
STEP 1: Save A's registers to A's context
────────────────────────────────────────────────────────
CPU A's Context
┌─────────┐ ┌─────────┐
│ r8 = 42 │ ────────────────▶ │ r8 = 42 │
│ r9 = 17 │ │ r9 = 17 │
│ sp = X │ │ sp = X │
│ pc = Y │ │ pc = Y │
└─────────┘ └─────────┘
STEP 2: Load B's registers from B's context
────────────────────────────────────────────────────────
B's Context CPU
┌─────────┐ ┌─────────┐
│ r8 = 99 │ ────────────────▶ │ r8 = 99 │
│ r9 = 55 │ │ r9 = 55 │
│ sp = P │ │ sp = P │
│ pc = Q │ │ pc = Q │
└─────────┘ └─────────┘
STEP 3: Return (now running B!)
────────────────────────────────────────────────────────
CPU continues from B's saved PC with B's saved registers.
To B, it's like it never stopped running!
On SH-4, we save/restore 16 registers (64 bytes). The full context switch with FPU takes ~88 cycles. With lazy FPU optimization (skipping FPU for integer-only goroutines), it drops to ~38 cycles. At 200 MHz, that’s under 0.5 microseconds—the total yield path including scheduler overhead is ~6.6 μs as shown in the benchmarks.
Cooperative Scheduling: The Gotcha
Our scheduler is cooperative, not preemptive. This is different from official Go!
Preemptive (official Go since 1.14): The runtime can forcibly pause a goroutine at any time using timer interrupts or signals. Even an infinite loop gets interrupted so other goroutines can run.
Cooperative (libgodc): Goroutines must volunteer to give up the CPU. The runtime never forces a switch. If a goroutine doesn’t yield, nothing else runs.
Why the difference? Preemptive scheduling requires:
- Signal handlers or timer interrupts to interrupt running code
- Complex stack inspection to find safe preemption points
- More saved state per context switch
On Dreamcast, we keep it simple. The cost is that you must be careful:
// This freezes your Dreamcast (but works fine in official Go!):
func badGoroutine() {
for {
x++ // Infinite loop, never yields
}
}
Where Goroutines Yield
┌─────────────────────────────────────────────────────────────┐
│ YIELDS (lets others run) DOESN'T YIELD │
├─────────────────────────────────────────────────────────────┤
│ ✓ Channel send: ch <- x ✗ Math: x + y * z │
│ ✓ Channel receive: <-ch ✗ Memory: array[i] │
│ ✓ time.Sleep() ✗ Loops: for i := ... │
│ ✓ runtime.Gosched() │
│ ✓ select {} │
└─────────────────────────────────────────────────────────────┘
The Fix for Long Computations
// Bad: No yields for 10 million iterations
for i := 0; i < 10000000; i++ {
result += compute(i)
}
// Good: Yield periodically
for i := 0; i < 10000000; i++ {
result += compute(i)
if i % 10000 == 0 {
runtime.Gosched() // Let others run
}
}
Note: if you have a single long computation with no natural yield points, a direct function call is simpler. Goroutines shine when you have multiple things that can interleave.
When Goroutines Shine
Goroutines work well for several patterns. Here’s real benchmark data from bench_goroutine_usecase.elf:
┌─────────────────────────────────────────────────────────────┐
│ USE CASE OVERHEAD VERDICT │
├─────────────────────────────────────────────────────────────┤
│ Multiple independent tasks 10-38% ✓ Acceptable │
│ Producer-consumer pattern ~163% ⚠ Use carefully │
│ Channel ping-pong ~13 μs/op Know the cost │
└─────────────────────────────────────────────────────────────┘
The key insight: independent tasks (each goroutine does its own work, minimal channel communication) have reasonable overhead (typically ~25%, varies with scheduling). Heavy channel use (producer-consumer with many sends) costs ~163%.
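As a sketch of the cheap end of that table: each goroutine below does its own work and touches a channel only once, to signal completion (updatePhysics, updateAI, and mixAudio are hypothetical placeholders for your own functions):

func runFrameTasks() {
	done := make(chan struct{}, 3) // buffered: completion signals never block

	go func() { updatePhysics(); done <- struct{}{} }()
	go func() { updateAI(); done <- struct{}{} }()
	go func() { mixAudio(); done <- struct{}{} }()

	for i := 0; i < 3; i++ {
		<-done // wait for all three tasks to finish
	}
}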
Porting Existing Go Code
If you’re porting Go code that uses goroutines, it works without modification:
// This Go code just works:
func fetch(urls []string) []Result {
ch := make(chan Result, len(urls))
for _, url := range urls {
go func(u string) {
ch <- download(u)
}(url)
}
// ... collect results
}
Patterns to Avoid
Some patterns don’t make sense on a single-core system:
Don’t: Spawn Per-Item
// Inefficient: 1000 spawns = 32 ms overhead
for i := 0; i < 1000; i++ {
go process(items[i])
}
// Better: Process directly, or use one goroutine
for i := 0; i < 1000; i++ {
process(items[i])
}
Don’t: Force Sequential With Channels
// Overcomplicated: These are sequential anyway
go step1()
<-done1
go step2()
<-done2
// Simpler:
step1()
step2()
Be Careful: Heavy Channel Traffic
// Each channel op is ~13 μs
// High-volume producer-consumer shows ~163% overhead
for item := range items {
workChan <- item
}
For high-throughput paths, batch items or use direct calls.
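One hedged way to batch: send a slice per channel operation instead of a single item, so one ~13 μs op covers many items (Item and workChan here are placeholders for your own types):

type Item struct{ ID int }

var workChan = make(chan []Item, 4) // small power-of-2 buffer

func produceBatched(items []Item) {
	const batchSize = 32
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}
		workChan <- items[start:end] // one send per 32 items, not 32 sends
	}
}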
Panic and Recover
Two Kinds of Errors
Most errors in Go are… boring. And that’s good! You handle them like this:
file, err := openFile("game.sav")
if err != nil {
// No saved game? No problem.
// Start a new game instead.
}
The function tells you something went wrong, and you decide what to do. Maybe you retry. Maybe you use a default. Maybe you tell the user. It’s your choice.
But some errors are different. They’re programmer mistakes:
enemies := []Enemy{orc, goblin, troll}
enemy := enemies[99] // WAIT. There's only 3 enemies!
This isn’t “the file doesn’t exist.” This is “the code is broken.” There’s no sensible way to continue.
This is when Go panics.
What Happens When You Panic
Here’s the sequence, step by step:
Normal Execution
↓
┌───────────────────────────────┐
│ enemies := []Enemy{...} │
│ enemy := enemies[99] │ ← PANIC!
│ moveEnemy(enemy) │ ← never runs
└───────────────────────────────┘
↓
EXECUTION STOPS
↓
┌───────────────────────────────┐
│ Run all deferred functions │
│ (in reverse order!) │
└───────────────────────────────┘
↓
Did any defer call recover()?
/ \
YES NO
↓ ↓
Program continues Program dies
The key insight: deferred functions always run, even during a panic. This is Go’s cleanup guarantee. Well… almost: there are a few truly pathological cases (e.g. a panic before runtime init, or too many nested panics) where this guarantee doesn’t hold.
Defer: The Cleanup Crew
Before we talk more about panic, let’s understand defer. It’s simple but powerful.
func processEnemy(e *Enemy) {
file := openLog("combat.log")
defer closeLog(file) // "Remember to do this when I leave!"
damage := calculateDamage(e)
applyDamage(e, damage)
// closeLog runs here, automatically
}
The defer keyword says: “Don’t run this now. Run it when the function exits.”
No matter how you exit—return, panic, whatever—the deferred function runs.
Multiple Defers: LIFO
If you have multiple defers, they run in reverse order. Last in, first out. Like a stack of plates:
func setup() {
defer println("First defer") // Runs 3rd
defer println("Second defer") // Runs 2nd
defer println("Third defer") // Runs 1st
println("Normal code")
}
// Output:
// Normal code
// Third defer
// Second defer
// First defer
Why reverse order? Think about it: if you opened file A, then file B, you want to close B before A. The last thing you set up is the first thing you tear down.
Visualizing the Defer Chain
Each goroutine maintains a linked list of deferred functions:
G.defer → [cleanup3] → [cleanup2] → [cleanup1]
newest oldest
runs runs
first last
When the function returns (or panics):
- Pop cleanup3, run it
- Pop cleanup2, run it
- Pop cleanup1, run it
- Done!
Recover: Catching the Fall
Here’s the safety net. recover() catches a panic mid-flight:
func safeGameLoop() {
if runtime_checkpoint() != 0 {
// We land here after recovering from a panic
// libgodc needs this, if you are going to use "recover" mechanisms
println("Recovered! Returning to main menu...")
return
}
defer func() {
if r := recover(); r != nil {
println("Caught panic:", r)
}
}()
runGame() // If this panics, we catch it!
}
func main() {
safeGameLoop()
println("Program continues!") // This runs even after panic!
}
Note: libgodc requires runtime_checkpoint() for recover to work properly. Without it, even a successful recover() will terminate the program. Standard Go handles this automatically via DWARF unwinding, but we use setjmp/longjmp instead (explained later in this chapter).
Let’s trace what happens:
1. safeGameLoop() starts
2. runtime_checkpoint() saves recovery point, returns 0
3. defer registers our recovery function
4. runGame() starts
5. ... something bad happens ...
6. PANIC!
7. Deferred function runs
8. recover() catches the panic, marks it recovered
9. longjmp back to checkpoint, runtime_checkpoint() returns 1
10. "Recovered!" prints, function returns normally
11. "Program continues!" prints
The panic was caught. The program lives.
The Golden Rule
Here’s the catch: recover only works inside a deferred function.
// THIS WORKS ✓
defer func() {
recover() // Called directly in defer
}()
// THIS DOESN'T WORK ✗
recover() // Not in a defer—does nothing!
Why? Because recover needs to intercept the panic during the cleanup phase. If you’re not in a defer, you’re not in cleanup mode.
libgodc note: Standard Go is even stricter—recover must be called directly in the defer, not in a helper function. We relaxed this rule because it’s complex to implement and the behavior difference is benign for games. More panics get caught, which is fine.
How We Implement It
Standard Go uses something called DWARF unwinding. It’s sophisticated: the compiler generates detailed metadata about every function’s stack layout, and a runtime library uses this to carefully walk back up the stack.
That’s a lot of complexity, and we don’t have DWARF support on Dreamcast (not yet, anyway).
Instead, we use an old C trick: setjmp/longjmp.
The Teleportation Trick
Imagine setjmp as dropping a bookmark:
jmp_buf bookmark;
if (setjmp(bookmark) == 0) {
// First time through: setjmp returns 0
printf("Starting...\n");
doRiskyThing();
printf("Made it!\n");
} else {
// After longjmp: setjmp returns 1
printf("Something went wrong!\n");
}
And longjmp teleports you back to that bookmark:
void doRiskyThing() {
// ...
if (disaster) {
longjmp(bookmark, 1); // TELEPORT!
}
// ...
}
When longjmp is called, execution jumps back to setjmp, which now returns 1 instead of 0. All the function calls in between? Gone. Skipped. Like they never happened.
The Recovery Path
┌─────────────────────────────────────────────────────────────┐
│ PANIC WITH CHECKPOINT │
│ │
│ func risky() { │
│ if runtime_checkpoint() != 0 { │
│ return // Recovered! Continue here. │
│ } │
│ defer func() { │
│ recover() │
│ }() │
│ panic("oops") // longjmp to checkpoint │
│ } │
│ │
│ → Clean, predictable │
│ → Required for recover() to work in libgodc │
└─────────────────────────────────────────────────────────────┘
Important: Without runtime_checkpoint(), calling recover() will still mark the panic as recovered, but the program will terminate with “FATAL: recover without checkpoint”. The checkpoint is required for proper recovery in libgodc.
When Nobody Catches the Panic
If no recover catches the panic, the program dies. On Dreamcast, you’ll see:
panic: index out of range [99] with length 3
goroutine 1 [running]:
0x8c010234
0x8c010456
0x8c010678
Memory: arena=4194304 used=1258291 free=2936013
The console halts. The user has to manually reset. This is intentional. A crash is better than continuing with corrupted state and zombies.
When Should You Panic?
Here’s the decision tree:
Is this a programmer mistake?
│
├── YES → Maybe panic is okay
│ ├── nil pointer dereference
│ ├── index out of bounds
│ └── calling method on nil
│
└── NO → DON'T PANIC. Return an error.
├── File not found
├── Network timeout
├── Invalid user input
└── Resource unavailable
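To make the two branches concrete, here is a hedged sketch (loadSave and enemyAt are hypothetical; it assumes the errors package is available in your build):

import "errors"

// Expected failure: return an error and let the caller decide.
func loadSave(name string) (string, error) {
	if name == "" {
		return "", errors.New("save file not found")
	}
	return "data:" + name, nil
}

// Programmer mistake: panicking (or letting the runtime panic) is fair.
func enemyAt(enemies []string, i int) string {
	if i < 0 || i >= len(enemies) {
		panic("enemyAt: index out of bounds") // broken code, not bad input
	}
	return enemies[i]
}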
When Recover Makes Sense
Use recover at boundaries—places where you want to contain failures. In libgodc, remember to use runtime_checkpoint():
func handleEventSafely(event Event) {
if runtime_checkpoint() != 0 {
println("Event handler crashed, continuing...")
return
}
defer func() {
if r := recover(); r != nil {
println("Caught:", r)
}
}()
handleEvent(event) // If this panics, we catch it
}
One bad event handler shouldn’t kill the entire game.
For general Go error handling best practices (when to panic vs return errors), see Effective Go.
Data Structures
Part 1: Strings
The Million-Dollar Question
How long is this string?
"Hello, Dreamcast!"
In C, you have to count:
char *msg = "Hello, Dreamcast!";
int len = 0;
while (msg[len] != '\0') { // Keep going until null byte
len++;
}
// H e l l o , D r e a m c a s t ! \0
// 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
// len is now 17... but we checked 18 characters!
C strings end with a special “null byte” (\0). To find the length, you walk through every character until you hit it. For a 10,000-character string, that’s 10,000 checks.
Go strings are smarter. They remember their length:
┌────────────────┐
│ str: ─────────────────▶ h │ e │ l │ l │ o │
│ len: 5 │
└────────────────┘
In libgodc, this is an 8-byte structure (on 32-bit Dreamcast):
// From runtime/runtime.h, see GoString C struct
typedef struct {
const uint8_t *str; // 4 bytes: pointer to character data
intptr_t len; // 4 bytes: length in bytes
} GoString;
Unlike C strings (null-terminated), Go strings store their length explicitly. This means:
- O(1) length lookup: just read the len field
- Can contain null bytes: no special terminator
- Bounds checked: we know exactly where the string ends
String Allocation
Strings are immutable. Every concatenation allocates new memory:
s := "foo" + "bar" // Allocates 6 bytes, copies both strings
Repeated concatenation in a loop is O(n²), where each iteration copies all previous data. This is a common Go performance pitfall; see Effective Go for solutions.
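The usual fix, sketched here, is to build into a []byte and convert once at the end. This is the “append to []byte” pattern measured in the benchmarks later in this chapter:

func joinParts(parts []string) string {
	buf := make([]byte, 0, 64) // pre-size if you can estimate the total
	for _, p := range parts {
		buf = append(buf, p...) // amortized appends, no O(n²) copying
	}
	return string(buf) // one final allocation + copy
}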
The tmpBuf Optimization
Here’s a secret: libgodc cheats for short strings.
When you concatenate strings that total ≤32 bytes, we use a stack buffer instead of allocating from the heap:
"a" + "b" = "ab"
Stack (fast): ┌────────────────────────────────┐
│ a │ b │ │ │ ... │ │ │ │ 32 bytes
└────────────────────────────────┘
No GC allocation needed!
This happens automatically. You don’t have to do anything—the compiler passes a stack buffer to the runtime, and we use it when we can.
Part 2: Slices
The Three-Part Header
A slice is not just a pointer. It’s a header (a small struct) with three fields:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Slice: []int with values [10, 20, 30] │
│ │
│ ┌────────────────┐ ┌─────┬─────┬─────┬─────┬─────┐ │
│ │ array: ───────────────▶│ 10 │ 20 │ 30 │ ? │ ? │ │
│ │ len: 3 │ └─────┴─────┴─────┴─────┴─────┘ │
│ │ cap: 5 │ ▲ ▲ │
│ └────────────────┘ length capacity │
│ │
└─────────────────────────────────────────────────────────────┘
- array :: Pointer to the underlying data
- len :: How many elements are currently in use
- cap :: How many elements could fit before reallocation
Think of it like a notebook. You have 100 pages (capacity), but you’ve only written on 30 (length).
The Magic of Slicing
Here’s the trick that makes Go slices amazing. When you “slice” a slice, no data is copied:
a := []int{10, 20, 30, 40, 50}
b := a[1:4] // b is [20, 30, 40]
What actually happens:
Underlying array:
┌─────┬─────┬─────┬─────┬─────┐
│ 10 │ 20 │ 30 │ 40 │ 50 │
└─────┴─────┴─────┴─────┴─────┘
▲ ▲
│ │
│ └── b.array points here
│ b.len = 3
│ b.cap = 4
│
└── a.array points here
a.len = 5
a.cap = 5
Both a and b point to the same memory. Slicing is O(1) — just create a new 12-byte header.
The Sharing Trap
But wait. If they share memory…
a := []int{10, 20, 30, 40, 50}
b := a[1:4]
b[0] = 999 // What happens to a?
After b[0] = 999:
┌─────┬─────┬─────┬─────┬─────┐
│ 10 │ 999 │ 30 │ 40 │ 50 │
└─────┴─────┴─────┴─────┴─────┘
▲ ▲
│ │
a b
a is now [10, 999, 30, 40, 50]!
Both slices see the change! This is usually a bug waiting to happen.
If you need independent data, use copy:
b := make([]int, 3)
copy(b, a[1:4]) // b has its own data now
How libgodc Implements copy
When you write copy(dst, src), what actually happens?
Step 1: Figure out how many elements to copy
┌─────────────────────────────────────────────────────────────┐
│ │
│ dst has room for 3 src has 5 elements │
│ ┌───┬───┬───┐ ┌───┬───┬───┬───┬───┐ │
│ │ │ │ │ │ A │ B │ C │ D │ E │ │
│ └───┴───┴───┘ └───┴───┴───┴───┴───┘ │
│ │
│ Copy min(3, 5) = 3 elements │
│ │
└─────────────────────────────────────────────────────────────┘
Step 2: Calculate byte size
┌─────────────────────────────────────────────────────────────┐
│ │
│ 3 elements × 4 bytes each (int) = 12 bytes │
│ │
└─────────────────────────────────────────────────────────────┘
Step 3: Copy the bytes safely (aka memmove in C)
┌─────────────────────────────────────────────────────────────┐
│ │
│ src: ████████████░░░░░░░░ (copy first 12 bytes) │
│ │ │
│ ▼ │
│ dst: ████████████ │
│ │
└─────────────────────────────────────────────────────────────┘
Step 4: Return 3 (number of elements copied)
Why memmove instead of memcpy? Because slices can overlap:
s := []int{1, 2, 3, 4, 5}
copy(s[1:], s[:4]) // Shift elements right — overlapping!
memmove handles this safely. memcpy would corrupt the data.
Growing Slices: The append Dance
What happens when you append beyond capacity?
s := make([]int, 3, 4) // len=3, cap=4
s = append(s, 10) // len=4, cap=4 — fits!
s = append(s, 20) // len=5, cap=??? — doesn't fit!
libgodc allocates a new, bigger array:
Before:
┌─────┬─────┬─────┬─────┐
│ 0 │ 0 │ 0 │ 10 │ cap=4, FULL
└─────┴─────┴─────┴─────┘
After append(s, 20):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 0 │ 0 │ 0 │ 10 │ 20 │ │ │ │ cap=8, NEW ARRAY
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
Old array becomes garbage (GC will clean it up).
libgodc’s Growth Strategy
Standard Go doubles capacity for small slices and grows by 25% for large ones. But Dreamcast only has 16MB RAM, so libgodc is more conservative by design:
┌─────────────────────────────────────────────────────────────┐
│ libgodc growth algorithm (runtime_growslice) │
│ │
│ if capacity < 64: │
│ new_cap = capacity × 2 ← Double (same as std Go) │
│ else: │
│ new_cap = capacity × 1.125 ← Only 12.5% growth! │
│ │
└─────────────────────────────────────────────────────────────┘
| Slice size | Standard Go | libgodc |
|---|---|---|
| Small (< 64 elements) | Double | Double |
| Large (≥ 64 elements) | +25% | +12.5% |
Why the difference? On a 16MB system, aggressive doubling wastes precious memory. A 10,000-element slice growing by 25% allocates 2,500 extra slots. At 12.5%, that’s only 1,250, so half the waste.
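Expressed in Go for illustration (the real logic lives in C inside runtime_growslice, which also handles a zero capacity and the requested length), the growth rule is:

// New capacity when an append overflows the current capacity.
func nextSliceCap(cap int) int {
	if cap < 64 {
		return cap * 2 // small: double, same as standard Go
	}
	return cap + cap/8 // large: +12.5%, conserving the 16 MB
}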
Pro tip: If you know how big you’ll need, then pre-allocate!
// Bad: many reallocations
enemies := []Enemy{}
for i := 0; i < 100; i++ {
enemies = append(enemies, loadEnemy(i))
}
// Good: one allocation
enemies := make([]Enemy, 0, 100)
for i := 0; i < 100; i++ {
enemies = append(enemies, loadEnemy(i))
}
Part 3: Maps
The Problem: Finding Things Fast
Suppose you’re building an item shop for your game. You have a price list:
type Item struct {
Name string
Price int
}
items := []Item{
{"Potion", 50},
{"Sword", 300},
{"Shield", 250},
{"Bow", 200},
// ... 100 more items
}
A customer asks: “How much is the Bow?”
You have to search through every item:
for _, item := range items {
if item.Name == "Bow" {
return item.Price
}
}
If the item list has 100 items, you might check up to 100 items. That’s O(n) time.
Now imagine you have a friend named Maggie who has memorized every item and its price. You ask “How much is the Bow?” and she instantly says “200 gold!”
Maggie gives you the answer in O(1) time — constant time. It doesn’t matter if there are 10 items or 10,000. She just knows.
How do you get a “Maggie”?
You use a hash table. In Go, that’s a map.
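Here’s the same shop with a map doing Maggie’s job:

prices := map[string]int{
	"Potion": 50,
	"Sword":  300,
	"Shield": 250,
	"Bow":    200,
}
price := prices["Bow"] // O(1): hash the key, jump to the slot
println(price)         // 200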
Building Your Own Maggie
A hash table combines two things:
- A hash function that turns keys into numbers
- An array to store the values
Let’s build one step by step. Start with an empty array of 5 slots:
┌───────┬───────┬───────┬───────┬───────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │
├───────┼───────┼───────┼───────┼───────┤
│ │ │ │ │ │
└───────┴───────┴───────┴───────┴───────┘
Now we need a hash function. A hash function takes a string and returns a number. Here’s the important part:
- It must be consistent: “Potion” always returns the same number.
- It should spread things out: different strings should (usually) give different numbers.
Let’s add the price of a Potion. We feed “Potion” into the hash function:
hash("Potion") → 7392
7392 % 5 = 2 ← slot 2!
We store the price (50) at index 2:
┌───────┬───────┬───────┬───────┬───────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │
├───────┼───────┼───────┼───────┼───────┤
│ │ │ 50 │ │ │
│ │ │Potion │ │ │
└───────┴───────┴───────┴───────┴───────┘
Now add the Sword (300 gold):
hash("Sword") → 4281
4281 % 5 = 1 ← slot 1!
┌───────┬───────┬───────┬───────┬───────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │
├───────┼───────┼───────┼───────┼───────┤
│ │ 300 │ 50 │ │ │
│ │ Sword │Potion │ │ │
└───────┴───────┴───────┴───────┴───────┘
Add the Shield and Bow:
hash("Shield") % 5 = 0
hash("Bow") % 5 = 4
┌───────┬───────┬───────┬───────┬───────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │
├───────┼───────┼───────┼───────┼───────┤
│ 250 │ 300 │ 50 │ │ 200 │
│Shield │ Sword │Potion │ │ Bow │
└───────┴───────┴───────┴───────┴───────┘
Now when someone asks “How much is the Bow?”:
hash("Bow") % 5 = 4- Look at slot 4
- It’s 200 gold!
No searching! The hash function tells you exactly where to look. This is O(1) — constant time.
You just built a “Maggie”!
Collisions: When Two Keys Want the Same Slot
Here’s a problem. What if two items hash to the same slot?
hash("Potion") % 5 = 2
hash("Scroll") % 5 = 2 ← Same slot!
Oh no! Potions are already in slot 2. If we put Scrolls there, we’ll overwrite Potions!
This is called a collision. There are different ways to handle it. Go uses a simple approach: store both items in the same slot using a small list.
┌───────┬───────┬────────────────────┬───────┬───────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │
├───────┼───────┼────────────────────┼───────┼───────┤
│ 250 │ 300 │ Potion→50 │ │ 200 │
│Shield │ Sword │ Scroll→75 │ │ Bow │
└───────┴───────┴────────────────────┴───────┴───────┘
Now when you look up “Scroll”:
hash("Scroll") % 5 = 2- Look at slot 2
- Check if “Potion” matches — no
- Check if “Scroll” matches — yes! Return 75.
It takes a tiny bit longer, but it works.
The Worst Case: Everyone in One Slot
What if you’re really unlucky and every item hashes to the same slot?
┌───────┬───────┬──────────────────────────┬───────┬───────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │
├───────┼───────┼──────────────────────────┼───────┼───────┤
│ │ │ Potion→50 │ │ │
│ │ │ Sword→300 │ │ │
│ │ │ Shield→250 │ │ │
│ │ │ Bow→200 │ │ │
│ │ │ Scroll→75 │ │ │
└───────┴───────┴──────────────────────────┴───────┴───────┘
Now looking up “Scroll” requires checking 5 items. That’s just as slow as a regular list!
This is the worst case: O(n) instead of O(1).
Two things prevent this:
- Good hash functions spread keys evenly
- Resizing — when the table gets too full, Go makes it bigger
The Tophash Optimization
Each bucket stores a “tophash” — the top 8 bits of the hash — for quick rejection:
Bucket 2:
┌─────────────────────────────────────────────────┐
│ tophash: [a3] [7f] [ ] [ ] [ ] [ ] [ ] [ ]│
│ keys: [Potion] [Scroll] [ ] [ ] [ ] [ ] │
│ values: [ 50 ] [ 75 ] [ ] [ ] [ ] [ ] │
└─────────────────────────────────────────────────┘
When looking up “Sword” (tophash = 0xb2):
- Check if 0xb2 == 0xa3? No. Skip.
- Check if 0xb2 == 0x7f? No. Skip.
- Not found!
We didn’t even compare the full strings. The tophash check is super fast.
Performance Comparison
┌─────────────────────────────────────────────────────────────┐
│ Hash Table vs Array: Searching 100 elements │
│ │
│ Array (linear search): │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Average: check 50 elements │ │
│ │ Worst: check 100 elements │ │
│ │ Time: O(n) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Hash Table (map): │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Average: check 1 element │ │
│ │ Worst: check all elements (very rare!) │ │
│ │ Time: O(1) average │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ With 1,000,000 elements: │
│ • Array: up to 1,000,000 checks │
│ • Map: still just ~1 check! │
└─────────────────────────────────────────────────────────────┘
How libgodc Implements Maps
libgodc’s map implementation is tuned for the Dreamcast’s SH-4 CPU and 16MB memory limit.
The GoMap header (28 bytes):
┌─────────────────────────────────────────────────────────────┐
│ GoMap Structure │
│ │
│ ┌──────────────┬──────────────────────────────────────┐ │
│ │ count │ Number of entries │ │
│ │ flags + B │ State flags + log2(bucket count) │ │
│ │ hash0 │ Random seed (different per map!) │ │
│ │ buckets ─────────▶ Current bucket array │ │
│ │ oldbuckets ──────▶ Old buckets (during resize) │ │
│ │ nevacuate │ Resize progress counter │ │
│ └──────────────┴──────────────────────────────────────┘ │
│ │
│ Total: 28 bytes (compact for Dreamcast's limited RAM) │
│ │
└─────────────────────────────────────────────────────────────┘
SH-4 optimized hashing:
The hash function uses wyhash, a fast 32-bit algorithm that takes advantage of SH-4’s dmuls.l instruction (32×32→64 multiply):
┌─────────────────────────────────────────────────────────────┐
│ Hash("Potion", seed=0x12345678) │
│ │
│ Step 1: Mix 4 bytes at a time │
│ wymix32(h ^ "Poti", 0x9E3779B9) │
│ │
│ Step 2: Handle remaining bytes │
│ wymix32(h ^ "on\0\0", 0x85EBCA6B) │
│ │
│ Step 3: Final mix with length │
│ wymix32(h, 6) → 0x7A3B2C1D │
│ │
└─────────────────────────────────────────────────────────────┘
Dreamcast-specific limits:
| Setting | libgodc | Standard Go |
|---|---|---|
| Max bucket shift | 15 (32K buckets) | ~24 (16M buckets) |
| Hash seed source | Dreamcast timer | OS random |
| Prefetch hint | SH-4 pref @Rn | Platform-specific |
Lazy allocation for small maps:
items := make(map[string]int) // No buckets yet!
items["key"] = 1 // NOW buckets are allocated
This saves memory when you create maps that might stay empty.
The Nil Map Trap
This is the #1 map bug for Go beginners:
var inventory map[string]int // nil map
// Reading: works! Returns zero value.
count := inventory["sword"] // count is 0
// Writing: PANIC!
inventory["sword"] = 1 // "assignment to entry in nil map"
A nil map is like a locked filing cabinet. You can look through the glass (read), but you can’t put anything in (write).
Always initialize:
inventory := make(map[string]int)
// or
inventory := map[string]int{}
Map Iteration is Random
scores := map[string]int{
"Mario": 100,
"Luigi": 85,
"Peach": 95,
}
for name, score := range scores {
println(name, score)
}
Run this twice. You might get:
Run 1: Run 2:
Luigi 85 Peach 95
Peach 95 Mario 100
Mario 100 Luigi 85
This is intentional. Go randomizes iteration order to prevent you from depending on it. If you need sorted keys, sort them yourself.
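If you do need a stable order, collect the keys and sort them yourself. A small insertion sort keeps this sketch self-contained (your build may also provide the standard sort package):

keys := make([]string, 0, len(scores))
for name := range scores {
	keys = append(keys, name)
}
// Insertion sort: fine for small key sets.
for i := 1; i < len(keys); i++ {
	for j := i; j > 0 && keys[j] < keys[j-1]; j-- {
		keys[j], keys[j-1] = keys[j-1], keys[j]
	}
}
for _, name := range keys {
	println(name, scores[name]) // deterministic order every run
}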
Choosing the Right Tool
┌─────────────────────────────────────────────────────────────┐
│ DECISION TREE: What Data Structure Should I Use? │
│ │
│ Need to look up by name/key? │
│ │ │
│ ├── YES → Use a map (O(1) lookup!) │
│ │ │
│ └── NO → Is the data ordered/sequential? │
│ │ │
│ ├── YES → Use a slice │
│ │ │
│ └── NO → Still probably use a slice │
│ (maps have memory overhead) │
│ │
│ Is it text? → Use a string (immutable) │
│ Need to build text? → Use []byte, convert at the end │
└─────────────────────────────────────────────────────────────┘
Summary Table
| Operation | String | Slice | Map |
|---|---|---|---|
| Get length | O(1) | O(1) | O(1) |
| Access by index | O(1) | O(1) | — |
| Access by key | — | — | O(1) avg |
| Append | N/A | O(1)* | O(1) avg |
| Concatenate | O(n) | O(n) | — |
* Amortized — occasional reallocations
Memory Overhead
String header: 8 bytes (pointer + length)
Slice header: 12 bytes (pointer + length + capacity)
Map header: 28 bytes (+ bucket overhead per entry)
Maps have the most overhead. For small, dense integer keys (0 to N), a slice is often better:
// If enemy IDs are 0-999, use a slice!
enemies := make([]*Enemy, 1000)
enemies[42] = &orc // O(1), less memory than map
Real Benchmark Results
We ran these benchmarks on actual Dreamcast hardware. The numbers don’t lie!
Map vs Slice: The “Maggie” Effect
Looking up an item by ID, searching near the end of the collection:
| Elements | Slice (linear search) | Map lookup | Map is… |
|---|---|---|---|
| 100 | 17 μs | 1.3 μs | 13× faster |
| 500 | 92 μs | 0.9 μs | 97× faster |
| 1,000 | 187 μs | 0.9 μs | 203× faster |
| 2,000 | 443 μs | 1.2 μs | 376× faster |
Notice how slice time grows linearly (O(n)) while map time stays constant (O(1)). With 2,000 enemies, map lookup is 376× faster!
String Concatenation: The Hidden Cost
Building a string character by character:
| Characters | s += "x" in loop | append to []byte | Speedup |
|---|---|---|---|
| 50 | 122 μs | 23 μs | 5× faster |
| 200 | 665 μs | 69 μs | 9× faster |
| 500 | 2,725 μs | 161 μs | 16× faster |
| 1,000 | 8,973 μs | 314 μs | 28× faster |
The loop method is O(n²) — time explodes as strings get longer. For 1,000 characters, pre-allocation is 28× faster!
Slice Pre-allocation: One Allocation vs Many
Appending items to a slice:
| Items | Growing []int{} | Pre-alloc make(0,n) | Time saved |
|---|---|---|---|
| 50 | 35 μs | 24 μs | 32% faster |
| 100 | 76 μs | 41 μs | 46% faster |
| 200 | 178 μs | 76 μs | 57% faster |
Pre-allocation eliminates the repeated reallocations as the slice grows.
The right data structure is like having the right superpower. A map turns an O(n) search into O(1). That’s not just faster… it’s magic.
Channels
This chapter explains how libgodc implements Go channels for the Dreamcast. The implementation differs significantly from the standard Go runtime due to our M:1 cooperative scheduling model.
The hchan Structure
Every channel is an hchan structure allocated on the GC heap:
typedef struct hchan {
uint32_t qcount; // Items currently in buffer
uint32_t dataqsiz; // Buffer capacity (0 = unbuffered)
void *buf; // Ring buffer (follows hchan in memory)
uint16_t elemsize; // Size of each element
uint8_t closed; // Channel closed flag
uint8_t buf_mask_valid; // Power-of-2 optimization flag
struct __go_type_descriptor *elemtype;
uint32_t sendx; // Send index into ring buffer
uint32_t recvx; // Receive index into ring buffer
waitq recvq; // Goroutines waiting to receive
waitq sendq; // Goroutines waiting to send
uint8_t locked; // Simple lock (no contention in M:1)
} hchan;
When you write make(chan int, 3), libgodc allocates a single block containing both the hchan header and the buffer:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Memory Layout for make(chan int, 3) │
│ │
│ ┌─────────────────────┬─────────────────────────────────┐ │
│ │ hchan (48B) │ buffer (3 × 4B = 12B) │ │
│ ├─────────────────────┼───────┬───────┬───────┬─────────┤ │
│ │ qcount, dataqsiz, │ [0] │ [1] │ [2] │ │ │
│ │ sendx, recvx, │ int │ int │ int │ │ │
│ │ waitqueues, ... │ │ │ │ │ │
│ └─────────────────────┴───────┴───────┴───────┴─────────┘ │
│ │
│ Total allocation: sizeof(hchan) + (cap × elemsize) │
│ │
└─────────────────────────────────────────────────────────────┘
Ring Buffer Indexing
The buffer is a circular queue. To find where to read/write:
static inline void *chanbuf(hchan *c, uint32_t i) {
uint32_t index = chan_index(c, i);
return (void *)((uintptr_t)c->buf + (uintptr_t)index * c->elemsize);
}
For power-of-2 capacities, we use bitwise AND instead of modulo:
static inline uint32_t chan_index(hchan *c, uint32_t i) {
if (c->buf_mask_valid)
return i & (c->dataqsiz - 1); // Fast: i & 3 for cap=4
return i % c->dataqsiz; // Slow: division
}
Tip: Use power-of-2 buffer sizes (2, 4, 8, 16…) for faster indexing.
The Send Algorithm
When you write ch <- value, this is chansend():
┌─────────────────────────────────────────────────────────────┐
│ chansend(c, elem, block) │
│ │
│ 1. nil channel? │
│ └── block=true: gopark forever (deadlock) │
│ └── block=false: return false │
│ │
│ 2. Channel closed? │
│ └── runtime_throw("send on closed channel") │
│ │
│ 3. Receiver waiting in recvq? │
│ └── YES: Copy data DIRECTLY to receiver's elem │
│ Wake receiver with goready() │
│ Return true │
│ │
│ 4. Buffer has space? (qcount < dataqsiz) │
│ └── YES: Copy to buf[sendx], increment sendx │
│ Return true │
│ │
│ 5. Non-blocking? (block=false) │
│ └── Return false │
│ │
│ 6. Must block: │
│ └── Create sudog, enqueue in sendq │
│ gopark() - yield to scheduler │
│ When woken: return success flag │
└─────────────────────────────────────────────────────────────┘
The key insight: direct transfer. If a receiver is already waiting, we copy data straight to their memory location, bypassing the buffer entirely. This is why unbuffered channels involve no buffer at all.
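You can see the rendezvous from Go code. In this sketch the send and the receive meet directly; with dataqsiz == 0 there is no ring buffer for the value to pass through:

func handoff() {
	ch := make(chan int) // unbuffered

	go func() {
		v := <-ch // parks in recvq until a sender shows up
		println("got", v)
	}()

	ch <- 42 // pairs with a waiting receiver, or parks in sendq until
	         // one arrives; either way the value is copied once, directly
}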
The Receive Algorithm
When you write value := <-ch, this is chanrecv():
┌─────────────────────────────────────────────────────────────┐
│ chanrecv(c, elem, block) │
│ │
│ 1. nil channel? │
│ └── block=true: gopark forever │
│ └── block=false: return false │
│ │
│ 2. Closed AND empty? │
│ └── Zero out elem, return (true, received=false) │
│ │
│ 3. Sender waiting in sendq? │
│ └── Unbuffered: Copy directly from sender's elem │
│ └── Buffered: Take from buffer, move sender's data in │
│ Wake sender with goready() │
│ Return (true, received=true) │
│ │
│ 4. Buffer has data? (qcount > 0) │
│ └── Copy from buf[recvx], zero slot, decrement qcount │
│ Return (true, received=true) │
│ │
│ 5. Non-blocking? │
│ └── Return false │
│ │
│ 6. Must block: │
│ └── Create sudog, enqueue in recvq │
│ gopark() │
│ When woken: return success │
└─────────────────────────────────────────────────────────────┘
The Buffered Receive with Waiting Sender
This case is subtle. When the buffer is full and a sender is waiting:
if (c->dataqsiz > 0) { // Buffered channel
// 1. Take oldest item from buffer for receiver
src = chanbuf(c, c->recvx);
chan_copy(c, elem, src);
// 2. Put sender's NEW item into the freed slot
chan_copy(c, src, sg->elem);
// 3. Advance indices (sendx follows recvx)
c->recvx = chan_index(c, c->recvx + 1);
c->sendx = c->recvx;
}
This maintains FIFO order: the receiver gets the oldest buffered value, not the sender’s new value.
Wait Queues and Sudogs
When a goroutine blocks on a channel, it creates a sudog (sender/receiver descriptor):
typedef struct sudog {
G *g; // The blocked goroutine
struct sudog *next; // Next in wait queue
struct sudog *prev; // Previous in wait queue
void *elem; // Pointer to data being sent/received
uint64_t ticket; // Used by select for case index
bool isSelect; // Part of a select statement?
bool success; // Did operation succeed?
struct sudog *waitlink; // For select: links all sudogs
struct sudog *releasetime; // Unused (Go runtime compat)
struct hchan *c; // Channel we're waiting on
} sudog;
The Sudog Pool
Creating sudogs during gameplay would trigger malloc(). libgodc pre-allocates a pool at startup:
void sudog_pool_init(void) {
for (int i = 0; i < 16; i++) {
sudog *s = (sudog *)malloc(sizeof(sudog));
s->next = global_pool;
global_pool = s;
}
}
acquireSudog() pulls from the pool; releaseSudog() returns to it. If the pool is exhausted, we fall back to malloc().
Wait Queues
Each channel has two wait queues (doubly-linked lists):
typedef struct waitq {
struct sudog *first;
struct sudog *last;
} waitq;
Operations:
- waitq_enqueue() - add blocked goroutine to end
- waitq_dequeue() - remove and return first goroutine
- waitq_remove() - remove specific sudog (for select cancellation)
Blocking and Waking: gopark/goready
This is where libgodc’s M:1 model shines.
gopark() - Block Current Goroutine
void gopark(bool (*unlockf)(void *), void *lock, WaitReason reason) {
G *gp = getg();
if (!gp || gp == g0)
runtime_throw("gopark on g0 or nil");
gp->atomicstatus = Gwaiting;
gp->waitreason = reason;
// Call unlock function - if it returns false, abort parking
if (unlockf && !unlockf(lock)) {
gp->atomicstatus = Grunnable;
runq_put(gp);
return;
}
// Context switch to scheduler
__go_swapcontext(&gp->context, &sched_context);
}
The goroutine saves its context and swaps to the scheduler. The unlockf callback releases the channel lock atomically with parking - if it returns false, we abort and re-enqueue instead.
goready() - Wake a Goroutine
void goready(G *gp) {
if (!gp) return;
// Don't wake dead/already-runnable/running goroutines
Gstatus status = gp->atomicstatus;
if (status == Gdead || status == Grunnable || status == Grunning)
return;
gp->atomicstatus = Grunnable;
gp->waitreason = waitReasonZero;
runq_put(gp);
}
The woken goroutine becomes runnable and will be scheduled on the next schedule() call.
Why M:1 Simplifies Things
In standard Go, channels need atomic operations and memory barriers because multiple OS threads access them. libgodc runs all goroutines on one KOS thread:
- No atomics needed for the locked flag (simple bool)
- No memory barriers
- No contention on wait queues
- Context switches are explicit (cooperative)
The chan_lock()/chan_unlock() functions just set a flag:
void chan_lock(hchan *c) {
if (!c)
runtime_throw("chan: nil channel");
if (c->locked)
runtime_throw("chan: recursive lock");
c->locked = 1;
}
void chan_unlock(hchan *c) {
if (c) c->locked = 0;
}
This is safe because we never preempt a goroutine in the middle of a channel operation.
Select Implementation
Select is the most complex part. Here’s how selectgo() works:
Phase 1: Setup
SelectGoResult selectgo(scase *cas0, uint16_t *order0,
int nsends, int nrecvs, bool block) {
int ncases = nsends + nrecvs;
// order0 provides space for two arrays:
uint16_t *pollorder = order0; // Random order to check cases
uint16_t *lockorder = order0 + ncases; // Order to lock channels
Phase 2: Randomize Poll Order (Fairness)
// Fisher-Yates shuffle
for (int i = ncases - 1; i > 0; i--) {
int j = fastrand() % (i + 1);
uint16_t tmp = pollorder[i];
pollorder[i] = pollorder[j];
pollorder[j] = tmp;
}
Why random? If we always checked cases in order, the first case would always win when multiple are ready. Randomization ensures fairness.
Phase 3: Lock Channels (Deadlock Prevention)
// Sort by channel address using heap sort
heapsort_lockorder(cas0, lockorder, ncases);
// Lock in address order
sellock(cas0, lockorder, ncases);
If goroutine A does select { case <-ch1: case <-ch2: } and goroutine B does select { case <-ch2: case <-ch1: }, they could deadlock if they lock in different orders. Sorting by address ensures everyone locks in the same global order.
Phase 4: Check for Ready Cases
for (int i = 0; i < ncases; i++) {
int casi = pollorder[i]; // Check in random order
scase *cas = &cas0[casi];
hchan *c = cas->c;
if (c == NULL)
continue;
if (casi < nsends) {
// Send: closed channel will panic - select it
if (c->closed) {
selected = casi;
break;
}
// Check for waiting receiver or buffer space
if (!waitq_empty(&c->recvq) || c->qcount < c->dataqsiz) {
selected = casi;
break;
}
} else {
// Receive: check for waiting sender, buffer data, or closed
if (!waitq_empty(&c->sendq) || c->qcount > 0 || c->closed) {
selected = casi;
break;
}
}
}
If any case is ready, execute it immediately and return.
Phase 5: Block on All Channels
If nothing is ready and block=true, we enqueue on ALL channels:
sudog *sglist = NULL;
for (int i = 0; i < ncases; i++) {
int casi = pollorder[i];
scase *cas = &cas0[casi];
hchan *c = cas->c;
if (c == NULL)
continue;
sudog *sg = acquireSudog();
sg->g = gp;
sg->c = c;
sg->elem = cas->elem;
sg->isSelect = true;
sg->success = false;
sg->ticket = casi; // Remember which case this is
// Link for later cleanup
sg->waitlink = sglist;
sglist = sg;
if (casi < nsends)
waitq_enqueue(&c->sendq, sg);
else
waitq_enqueue(&c->recvq, sg);
}
gp->waiting = sglist;
gopark(selparkcommit, &unlock_arg, waitReasonSelect);
Phase 6: Woken - Find Winner
When woken, one sudog has success=true. Find it and dequeue from all other channels:
// Pass 3: Find winner and dequeue losers
for (sudog *sg = sglist; sg != NULL; sg = sgnext) {
sgnext = sg->waitlink; // Save before we might release
int casi = (int)sg->ticket;
if (sg->success) {
selected = casi;
if (casi >= nsends)
recvOK = true; // Received actual data
} else {
// Remove from wait queue (we won't use this case)
if (casi < nsends)
waitq_remove(&sg->c->sendq, sg);
else
waitq_remove(&sg->c->recvq, sg);
}
}
// Release all sudogs in separate pass
for (sudog *sg = sglist; sg != NULL; sg = sgnext) {
sgnext = sg->waitlink;
releaseSudog(sg);
}
The Default Case
When block=false and nothing is ready, selectgo() returns selected=-1:
if (!block) {
selunlock(cas0, lockorder, ncases);
go_yield(); // Give other goroutines a chance
return (SelectGoResult){-1, false};
}
The go_yield() prevents tight polling loops from starving other goroutines.
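From Go, this is the familiar non-blocking poll. A sketch, where `InputEvent` and `handleInput` are placeholder names:

```go
type InputEvent struct{ Button int } // placeholder type for illustration

func handleInput(ev InputEvent) { println("button:", ev.Button) }

// Poll once per frame without ever blocking the game loop.
func pollInput(events chan InputEvent) {
	select {
	case ev := <-events:
		handleInput(ev) // a case was ready: selectgo() picked it
	default:
		// nothing ready: selectgo() returned -1, keep rendering
	}
}
```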
Closing Channels
closechan() marks the channel closed and wakes ALL waiting goroutines:
void closechan(hchan *c) {
G *wake_list = NULL;
G *wake_tail = NULL;
sudog *sg;
G *gp;
chan_lock(c);
if (c->closed) {
chan_unlock(c);
runtime_throw("close of closed channel");
}
c->closed = 1;
// Collect all receivers (they'll get zero values)
while ((sg = waitq_dequeue(&c->recvq)) != NULL) {
sg->success = false; // Indicates closed, not real data
gp = sg->g;
if (!gp || gp->atomicstatus == Gdead)
continue;
if (sg->elem && c->elemsize > 0)
memset(sg->elem, 0, c->elemsize);
// Add gp to wake_list via schedlink...
}
// Collect all senders (they'll panic when they wake)
while ((sg = waitq_dequeue(&c->sendq)) != NULL) {
sg->success = false;
gp = sg->g;
if (!gp || gp->atomicstatus == Gdead)
continue;
// Add gp to wake_list via schedlink...
}
chan_unlock(c);
// Wake everyone outside the lock
while (wake_list) {
gp = wake_list;
wake_list = gp->schedlink;
goready(gp);
}
}
Senders check success when they wake and throw “send on closed channel” if false.
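The observable behavior from Go code is standard. A small sketch with an int channel:

```go
func demoClose() {
	ch := make(chan int, 1)
	close(ch)

	v, ok := <-ch  // returns immediately after close
	println(v, ok) // prints: 0 false (zero value, not real data)

	// ch <- 1     // would throw: "send on closed channel"
	// close(ch)   // would throw: "close of closed channel"
}
```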
Performance
For benchmark numbers, see the Performance section in Design. You can run the benchmarks yourself with tests/bench_architecture.elf on hardware.
Why Unbuffered is Slower
Unbuffered channels always require a context switch:
Sender Receiver
────── ────────
ch <- 42
│
└── gopark() ─────────────────► scheduler picks receiver
│
x := <-ch
│
◄── goready() ────────────────── wakes sender
│
continues
Buffered channels avoid this when buffer has space/data.
Optimization Tips
- Use buffered channels for producer/consumer patterns
- Power-of-2 buffer sizes for faster indexing (uses bitwise AND instead of modulo)
- Batch data - send structs with multiple values instead of multiple sends
- select with default for non-blocking checks in game loops
- Pre-warm channels - send/receive once during init to allocate sudogs (see the sketch below)
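A sketch combining several of these tips; `SoundCmd` and the channel size are invented for illustration:

```go
type SoundCmd struct{ ID, Vol int } // batched: one send carries both values

var sounds = make(chan SoundCmd, 16) // power-of-2 buffer: index math uses AND

// Pre-warm per the tip above: exercise the channel once during init
// so its slow paths run before gameplay starts.
func initAudio() {
	sounds <- SoundCmd{}
	<-sounds
}

// Non-blocking send from the game loop: drop the command rather than stall.
func playExplosion() {
	select {
	case sounds <- SoundCmd{ID: 3, Vol: 255}:
	default: // queue full this frame
	}
}
```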
Limitations
libgodc channels have some constraints:
| Limit | Value | Reason |
|---|---|---|
| Max buffer size | 65536 elements | Sanity check in makechan() |
| Max element size | 65536 bytes | 16-bit elemsize field in hchan |
| Sudog pool | 16 pre-allocated, 128 max | Defined in godc_config.h |
For game code, these limits are rarely hit. If you need larger queues, consider using slices with your own synchronization.
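As one alternative for large queues, here is a hedged sketch of a slice-backed ring queue. Because the scheduler is cooperative (see Blocking and Waking above), `Push` and `Pop` run to completion without preemption, so no lock is needed:

```go
// Fixed-capacity ring queue over a slice; capacity is not limited to 65536.
type Ring struct {
	buf  []int
	head int // index of the next element to pop
	n    int // current element count
}

func NewRing(capacity int) *Ring { return &Ring{buf: make([]int, capacity)} }

func (r *Ring) Push(v int) bool {
	if r.n == len(r.buf) {
		return false // full: caller decides whether to drop or retry
	}
	r.buf[(r.head+r.n)%len(r.buf)] = v
	r.n++
	return true
}

func (r *Ring) Pop() (int, bool) {
	if r.n == 0 {
		return 0, false // empty
	}
	v := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.n--
	return v, true
}
```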
System Integration
The Layer Cake
Imagine your game as an office building. You're on the top floor, writing Go code. But when you need something done (read a file, play a sound, draw a sprite), you don't do it yourself. Someone on a lower floor does the actual work.
┌─────────────────────────────────────────────────────────────┐
│ │
│ Floor 4: Your Go Program │
│ "I want to play a sound!" │
│ ↓ │
│ Floor 3: libgodc (Go runtime) │
│ "Let me translate that..." │
│ ↓ │
│ Floor 2: KallistiOS │
│ "I know how to talk to hardware." │
│ ↓ │
│ Floor 1: Dreamcast Hardware │
│ *beep boop* │
│ │
└─────────────────────────────────────────────────────────────┘
Each floor speaks a different language. libgodc translates Go into something KallistiOS understands. KallistiOS translates that into hardware register writes.
You don’t need to know all the details, but understanding the stack helps you debug problems.
Part 1: Timers and Sleep
How Does Sleep Work?
When you write:
time.Sleep(100 * time.Millisecond)
What actually happens? Let’s trace it:
┌─────────────────────────────────────────────────────────────┐
│ WHAT HAPPENS WHEN YOU SLEEP │
│ │
│ Step 1: "I want to sleep for 100ms" │
│ ↓ │
│ Step 2: Calculate wake time: now + 100ms = 4:00:00.100 │
│ ↓ │
│ Step 3: Add timer to the timer heap │
│ ┌─────────────────────────────┐ │
│ │ wake_time: 4:00:00.100 │ │
│ │ goroutine: G7 │ │
│ └─────────────────────────────┘ │
│ ↓ │
│ Step 4: Park the goroutine (it's now sleeping) │
│ ↓ │
│ Step 5: Scheduler runs OTHER goroutines │
│ ...100ms pass... │
│ ↓ │
│ Step 6: Scheduler checks timer heap │
│ "Hey, it's 4:00:00.100! Wake G7!" │
│ ↓ │
│ Step 7: G7 wakes up, continues executing │
│ │
└─────────────────────────────────────────────────────────────┘
Key insight: Your goroutine isn’t actually sleeping on a couch somewhere. It’s parked in a queue, and the scheduler knows when to wake it.
Where Does Time Come From?
The SH-4 CPU has hardware timers. KallistiOS reads them:
//extern timer_us_gettime64
func TimerUsGettime64() uint64
This returns microseconds since boot. Accurate to about 1 μs. Fast to read.
In your Go code, you can use this for precise timing:
//extern timer_us_gettime64
func timerUsGettime64() uint64
func measureSomething() {
start := timerUsGettime64()
doExpensiveWork()
elapsed := timerUsGettime64() - start
println("Took", elapsed, "microseconds")
}
The Timer Heap
Multiple goroutines can sleep at once. Go keeps them in a heap (priority queue) sorted by wake time:
Timer Heap:
┌───────────────────────────────────────────────────────────┐
│ │
│ [G3: wake at 100ms] ← Earliest, checked first │
│ /\ │
│ / \ │
│ [G7: 200ms] [G2: 150ms] │
│ / │
│ [G5: 500ms] │
│ │
└───────────────────────────────────────────────────────────┘
The scheduler only needs to check the top of the heap. If the earliest timer hasn’t fired, none of them have.
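Here is an illustrative Go version of that check (not the runtime's actual C code; the field names are invented):

```go
type timer struct {
	wake uint64 // wake time in microseconds since boot
	gid  int64  // goroutine to wake (placeholder)
}

// checkTimers wakes every timer whose deadline has passed. The slice is
// a min-heap: timers[0] always holds the earliest wake time.
func checkTimers(timers []timer, now uint64) []timer {
	for len(timers) > 0 && timers[0].wake <= now {
		// ...mark goroutine timers[0].gid runnable here...
		timers = popMin(timers)
	}
	return timers
}

// popMin removes the root and restores the heap by sifting down.
func popMin(h []timer) []timer {
	last := len(h) - 1
	h[0] = h[last]
	h = h[:last]
	i := 0
	for {
		l, r, smallest := 2*i+1, 2*i+2, i
		if l < len(h) && h[l].wake < h[smallest].wake {
			smallest = l
		}
		if r < len(h) && h[r].wake < h[smallest].wake {
			smallest = r
		}
		if smallest == i {
			break
		}
		h[i], h[smallest] = h[smallest], h[i]
		i = smallest
	}
	return h
}
```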
Part 2: File I/O (The Danger Zone)
The Problem
You want to load a texture:
data := loadFile("/cd/textures/enemy.pvr")
Seems innocent, right? Here’s what actually happens:
┌─────────────────────────────────────────────────────────────┐
│ GD-ROM READ: THE SILENT KILLER │
│ │
│ Time: 0ms → loadFile() called │
│ Time: 0ms → KOS asks GD-ROM to seek │
│ Time: 50ms → Drive head moves (mechanical!) │
│ Time: 100ms → Data starts streaming │
│ Time: 150ms → Still reading... │
│ Time: 200ms → loadFile() returns │
│ │
│ DURING THOSE 200ms: │
│ • No other goroutines run │
│ • Game loop frozen │
│ • Audio buffer might run dry → glitch! │
│ • Player sees: lag, stutter, freeze │
│ │
│ At 60 FPS, you have 16.6ms per frame. │
│ A 200ms file read = 12 FROZEN FRAMES! │
│ │
└─────────────────────────────────────────────────────────────┘
Why does this happen? KOS file operations are synchronous. The CPU sits in a loop waiting for the CD drive. No scheduler runs. Nothing else happens.
The Solutions
Solution 1: Loading Screens
Load everything at startup or level transitions:
func main() {
showLoadingScreen()
// All the slow stuff happens here
textures = loadAllTextures()
sounds = loadAllSounds()
levelData = loadLevel(1)
hideLoadingScreen()
// Now game loop is safe
for {
gameLoop()
}
}
Solution 2: Streaming in Chunks
If you must load during gameplay, do it in small pieces:
func streamTexture(path string) {
file := openFile(path)
defer closeFile(file)
for !file.EOF() {
chunk := file.Read(4096) // Read 4KB
processChunk(chunk)
runtime.Gosched() // Let other goroutines run!
}
}
Solution 3: Pre-load into RAM
The Dreamcast has 16 MB of RAM. Use it!
// At startup, load everything you might need
var textureCache = make(map[string][]byte)
func preloadTexture(name string) {
textureCache[name] = loadFile("/cd/textures/" + name)
}
// During gameplay, instant access
func getTexture(name string) []byte {
return textureCache[name] // Already in RAM!
}
Part 3: Calling C Functions
The //extern Magic
Go code can call C functions directly:
//extern pvr_wait_ready
func PvrWaitReady() int32
//extern maple_enum_dev
func mapleEnumDev(port, unit int32) uintptr
func main() {
PvrWaitReady() // Calls the C function!
}
No CGo. No runtime overhead. Just a direct function call.
The Danger
Here’s the catch: C functions run on your goroutine’s stack. Goroutines have fixed stacks (64 KB by default). If the C function is stack-hungry:
┌─────────────────────────────────────────────────────────────┐
│ STACK OVERFLOW SCENARIO │
│ │
│ Goroutine stack: 64 KB │
│ │
│ ┌────────────────────┐ ← Stack top │
│ │ Your Go function │ 1 KB used │
│ ├────────────────────┤ │
│ │ C function called │ │
│ │ local arrays... │ 6 KB used │
│ │ more locals... │ │
│ ├────────────────────┤ │
│ │ C calls another C │ │
│ │ BOOM! │ OVERFLOW! │
│ └────────────────────┘ ← Stack bottom (guard page) │
│ │
│ Result: Memory corruption, crash, mysterious bugs │
│ │
└─────────────────────────────────────────────────────────────┘
Part 4: Debugging Without Fancy Tools
The Detective’s Toolkit
Tool 1: Print Statements
The oldest debugging technique is still the best:
func suspiciousFunction(x int) {
println(">>> suspiciousFunction start, x =", x)
result := doSomething(x)
println(" after doSomething, result =", result)
processResult(result)
println("<<< suspiciousFunction end")
}
Tool 2: Binary Search Debugging
Program crashes somewhere. Where?
1. Add print at function start and end
2. If it prints START but not END, crash is inside
3. Add print in the middle
4. Repeat until you find the exact line
Tool 3: The Assumptions Checklist
When something “can’t possibly be wrong,” check it:
func processEnemy(e *Enemy) {
// CHECK YOUR ASSUMPTIONS
if e == nil {
println("BUG: e is nil!")
return
}
if e.Health < 0 {
println("BUG: negative health:", e.Health)
}
if e.X < 0 || e.X > 640 {
println("BUG: X out of bounds:", e.X)
}
// Now do the actual work
// ...
}
Reading Crash Information
When your game crashes, you might see:
panic: index out of range [99] with length 3
Registers:
PC=8c015678 PR=8c015432
Stack trace:
0x8c015678
0x8c015432
0x8c014000
What does this mean?
- PC (Program Counter) — Where the crash happened
- PR (Procedure Register) — Who called us (return address)
- Stack trace — Chain of function calls
Finding the Function Name
You have an address: 0x8c015678. Where is it?
Method 1: addr2line
sh-elf-addr2line -e game.elf 0x8c015678
# Output: /path/to/main.go:42
This tells you the exact line number!
Method 2: Symbol Table
sh-elf-nm game.elf | sort > symbols.txt
# Then search for addresses near 0x8c015678
Method 3: With Function Names
sh-elf-addr2line -f -C -i -e game.elf 0x8c015678
# Output: functionName
# main.go:42
Common Bugs and Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Hangs, no output | Infinite loop without yield | Add runtime.Gosched() in loops |
| Garbage on screen | Memory corruption | Check array bounds |
| Random crashes | Stack overflow | Check deep recursion, big C calls |
| GC panic | Too much live data | Reduce heap usage, trigger GC earlier |
| Works in emu, fails on hw | Timing differences | Test on real hardware earlier! |
Troubleshooting Flowchart
Use this decision tree when things go wrong:
┌──────────────────────────────────────────────────────────────┐
│ TROUBLESHOOTING FLOWCHART │
│ │
│ What's happening? │
│ │ │
│ ├─► CRASH (program terminates) │
│ │ │ │
│ │ ├─► Panic message visible? │
│ │ │ │ │
│ │ │ ├─► YES: Read the message! │
│ │ │ │ • "index out of range" │
│ │ │ │ → Check slice bounds │
│ │ │ │ • "nil pointer" │
│ │ │ │ → Check for nil before use │
│ │ │ │ • "out of memory" │
│ │ │ │ → Reduce allocations │
│ │ │ │ │
│ │ │ └─► NO: Stack overflow likely │
│ │ │ → Reduce local variables │
│ │ │ → Convert recursion to loop │
│ │ │ │
│ ├─► FREEZE (no crash, no progress) │
│ │ │ │
│ │ ├─► Any goroutines running? │
│ │ │ │ │
│ │ │ ├─► Only one: Infinite loop │
│ │ │ │ → Add runtime.Gosched() │
│ │ │ │ │
│ │ │ └─► Multiple: Deadlock │
│ │ │ → Check channel usage │
│ │ │ → Ensure sends have receivers│
│ │ │ │
│ ├─► STUTTER (periodic lag) │
│ │ │ │
│ │ └─► GC pauses likely │
│ │ → Reduce live heap size │
│ │ → Trigger GC during loading │
│ │ → Use object pools │
│ │ │
│ └─► WRONG OUTPUT (runs but incorrect) │
│ │ │
│ └─► Add println() everywhere │
│ → Check variable values │
│ → Verify assumptions │
│ │
└──────────────────────────────────────────────────────────────┘
The 5-Step Debug Process
┌─────────────────────────────────────────────────────────────┐
│ THE DEBUGGING ALGORITHM │
│ │
│ 1. REPRODUCE │
│ Can you make it happen consistently? │
│ If not, add logging until you can. │
│ │
│ 2. NARROW DOWN │
│ Binary search with prints. │
│ "Does it crash before this line or after?" │
│ │
│ 3. CHECK ASSUMPTIONS │
│ Print everything. That variable you're SURE is │
│ correct? Print it anyway. │
│ │
│ 4. SIMPLIFY │
│ Create the smallest program that shows the bug. │
│ Often, you'll find the bug while simplifying. │
│ │
│ 5. TAKE A BREAK │
│ Seriously. Walk away. Fresh eyes find bugs faster │
│ than tired eyes. │
│ │
└─────────────────────────────────────────────────────────────┘
Part 5: Testing on a Game Console
The Test Structure
Our tests are simple: standalone executables that print PASS or FAIL.
tests/
├── test_types.go → test_types.elf (maps, interfaces, structs)
├── test_goroutines.go → test_goroutines.elf (goroutines, channels)
├── test_memory.go → test_memory.elf (allocation, GC)
└── test_control.go → test_control.elf (defer, panic, recover)
No fancy test framework. No JUnit. Just:
- Do something
- Check if it worked
- Print the result
A Minimal Test
package main
func TestMaps() {
println("maps:")
passed := 0
total := 0
total++
m := make(map[string]int)
m["score"] = 100
if m["score"] == 100 {
passed++
println(" PASS: read after write")
} else {
println(" FAIL: read after write")
}
total++
if m["missing"] == 0 {
passed++
println(" PASS: missing key returns zero")
} else {
println(" FAIL: missing key returns zero")
}
total++
delete(m, "score")
_, ok := m["score"]
if !ok {
passed++
println(" PASS: delete removes key")
} else {
println(" FAIL: delete removes key")
}
println(" ", passed, "/", total)
}
func main() {
TestMaps()
}
Running Tests
# Build the test
make test_types
# Run on Dreamcast
dc-tool-ip -t 192.168.2.205 -x test_types.elf
# Output:
# maps:
# PASS: read after write
# PASS: missing key returns zero
# PASS: delete removes key
# 3 / 3
Emulator vs Hardware
| Aspect | Emulator | Real Hardware |
|---|---|---|
| Speed | Fast iteration | Slower uploads |
| Debugging | Can use host tools | printf only |
| Accuracy | Close but not exact | The truth |
| Timing | May differ | Definitive |
The Strategy:
┌─────────────────────────────────────────────────────────────┐
│ DEVELOPMENT WORKFLOW │
│ │
│ 80% of time: Emulator │
│ ├── Fast compile-run cycle │
│ ├── Quick iteration │
│ └── Good for logic bugs │
│ │
│ 20% of time: Real Hardware │
│ ├── Catches timing issues │
│ ├── Finds memory/stack problems │
│ └── Final validation before release │
│ │
│ RULE: Never release without testing on real hardware! │
│ │
└─────────────────────────────────────────────────────────────┘
The Dreamcast is a 25-year-old console with 16 MB of RAM, no debugger, and a CD-ROM that takes 200ms to seek. And yet, people made incredible games for it. You can too. You just need patience, println, and the knowledge in this chapter.
Performance
Part 1: The Cache — Your Best Friend
The Numbers That Matter
┌─────────────────────────────────────────────────────────────┐
│ SH-4 MEMORY HIERARCHY │
│ │
│ Registers: 0 cycles (instant) │
│ L1 Cache: 1-2 cycles (~10 ns) │
│ Main RAM: 10-20 cycles (~100 ns) │
│ CD-ROM: millions of cycles (200+ ms) │
│ │
│ Cache miss = 10-20× SLOWER than cache hit! │
│ │
└─────────────────────────────────────────────────────────────┘
Cache Lines: The Free Lunch
When you read one byte from RAM, the CPU doesn’t fetch just that byte. It fetches a whole cache line — 32 bytes on SH-4.
You ask for array[0]:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ ← All 32 bytes loaded!
└────┴────┴────┴────┴────┴────┴────┴────┘
▲
You wanted this one
Next 7 accesses are FREE! They're already in cache.
Sequential Access: The Fast Path
// FAST: Sequential access — 125 elements
sum := 0
for i := 0; i < 125; i++ {
sum += array[i]
}
What happens:
Access array[0] → Cache miss, load 32 bytes
Access array[1] → Cache HIT (free!)
Access array[2] → Cache HIT (free!)
...
Access array[7] → Cache HIT (free!)
Access array[8] → Cache miss, load next 32 bytes
...
Total cache misses: 125 / 8 = ~16
Strided Access: The Slow Path
// SLOW: Strided access (every 8th element) — also 125 elements
sum := 0
for i := 0; i < 1000; i += 8 {
sum += array[i]
}
What happens:
Access array[0] → Cache miss
Access array[8] → Cache miss (different cache line!)
Access array[16] → Cache miss
Access array[24] → Cache miss
...
Access array[992] → Cache miss
Total cache misses: 125 (EVERY access misses!)
Same number of additions (125), but strided is ~8× slower because every access misses the cache.
The Practical Lesson
┌─────────────────────────────────────────────────────────────┐
│ CACHE-FRIENDLY PATTERNS │
│ │
│ ✓ Process arrays left-to-right │
│ ✓ Keep related data together (struct of arrays) │
│ ✓ Avoid pointer-chasing (linked lists are slow!) │
│ ✓ Small, tight loops │
│ │
│ ✗ Random access patterns │
│ ✗ Large structs with rarely-used fields │
│ ✗ Jumping around memory │
│ │
└─────────────────────────────────────────────────────────────┘
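One way to apply "keep related data together" is a structure-of-arrays layout. A sketch (the particle fields are invented):

```go
// Array-of-structs: 32 bytes per particle, so a position-only update
// pays a full cache line per particle.
type ParticleAoS struct {
	X, Y, Z, VX, VY, VZ float32
	Life, Size          float32
}

// Struct-of-arrays: positions are packed together, so the loop below
// gets 8 useful float32 values out of every 32-byte cache line.
type ParticlesSoA struct {
	X, Y, Z    []float32
	VX, VY, VZ []float32
}

func (p *ParticlesSoA) Integrate(dt float32) {
	for i := range p.X { // sequential access: the fast path
		p.X[i] += p.VX[i] * dt
		p.Y[i] += p.VY[i] * dt
		p.Z[i] += p.VZ[i] * dt
	}
}
```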
Part 2: The Float64 Trap
The Shocking Truth
Go defaults to float64 for floating-point numbers:
x := 3.14 // This is float64!
On a modern PC, float64 and float32 are about the same speed. On SH-4?
┌─────────────────────────────────────────────────────────────┐
│ FLOAT PERFORMANCE ON SH-4 │
│ │
│ float32: Hardware accelerated, FAST │
│ One instruction, one cycle │
│ │
│ float64: Software emulation, SLOW │
│ Multiple instructions, 10-20× slower! │
│ │
│ A physics simulation using float64 could run │
│ at 6 FPS instead of 60 FPS. That's the difference. │
│ │
└─────────────────────────────────────────────────────────────┘
The Fix
Be explicit about float32:
// SLOW
x := 3.14 // float64 by default!
y := x * 2.0 // float64 math
// FAST
var x float32 = 3.14 // Explicit float32
y := x * 2.0 // float32 math
For game physics, positions, velocities — always use float32.
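A sketch of a hot loop kept entirely in float32; typed constants prevent accidental float64:

```go
// All-float32 physics step: no float64 ever enters the loop.
func integrate(pos, vel []float32, dt float32) {
	const friction float32 = 0.99 // typed constant: stays float32
	for i := range pos {
		vel[i] *= friction // float32 × float32: hardware FPU
		pos[i] += vel[i] * dt
	}
}
```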
Part 3: What We Deliberately Left Out
“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” — Antoine de Saint-Exupéry
libgodc is not a complete Go implementation. That’s intentional. Here’s what we cut and why:
Omission 1: Full Reflection
Standard Go: Every type carries metadata — field names, method signatures, struct tags. This enables reflect and fancy JSON marshaling.
Cost: Binary size can double.
libgodc: Basic reflection only. Enough for println to work.
What you lose:
reflect.MakeFunc(...) // NOT SUPPORTED
json.Marshal(myStruct) // NOT SUPPORTED (would need full reflection)
What you do instead: Write explicit serialization. Use code generators.
Omission 2: Finalizers
Standard Go:
runtime.SetFinalizer(obj, func(o *MyType) {
o.cleanup() // Runs when GC collects obj
})
The problem: Finalizers are a nightmare for GC:
- Objects can be resurrected
- Run order is undefined
- Timing is unpredictable
- They complicate the GC significantly
libgodc: No finalizers.
What you do instead: Use defer for cleanup:
func process() {
resource := acquire()
defer resource.Release() // Always runs!
// ... use resource ...
}
Omission 3: Preemptive Scheduling
Standard Go: The runtime can interrupt a goroutine at almost any point.
libgodc: Goroutines must yield voluntarily.
// THIS FREEZES THE SYSTEM
for {
// Infinite loop, never yields
// No other goroutine will EVER run
}
// THIS IS FINE
for {
doWork()
runtime.Gosched() // "Let others run"
}
Why we did this: Preemption requires safe points, stack inspection, and signal handling. That's a lot of complexity for little benefit on a single-core CPU.
Omission 4: Concurrent GC
Standard Go:
Your code: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
GC: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
Both run in parallel!
Pause: < 1ms
libgodc:
Your code: ░░░░░░░░░░████████████░░░░░░░░
GC: ▓▓▓▓▓▓▓▓▓▓▓▓
EVERYTHING STOPS during GC
Pause: 5-20ms
Why we did this: Concurrent GC requires write barriers, atomic operations, and careful synchronization. Stop-the-world is simpler and predictable.
What you do: Keep live data small. Trigger GC between frames or during loading.
The Trade-off Table
| Feature | What We Chose | Why |
|---|---|---|
| GC | Semi-space, stop-the-world | Simple, no fragmentation |
| Scheduling | Cooperative, M:1 | No locks, predictable |
| Panic/Recover | setjmp/longjmp | No DWARF unwinding |
| Reflection | Minimal | Binary size |
| Preemption | None | Simplicity |
| C interop | Direct linking | No CGo complexity |
Our philosophy: Predictability over throughput. Simplicity over features.
Part 4: When to Optimize
The Golden Question
Before optimizing anything, ask:
“Have I measured this?”
If the answer is no, stop. You’re guessing. And programmers are notoriously bad at guessing where time is spent.
The 90/10 Rule
┌─────────────────────────────────────────────────────────────┐
│ │
│ 90% of execution time is spent in 10% of the code │
│ │
│ That means: │
│ • 90% of your code DOESN'T MATTER for performance │
│ • Optimizing the wrong code = wasted effort │
│ • Always measure first! │
│ │
└─────────────────────────────────────────────────────────────┘
DO Optimize
- Code that runs every frame (game loop, rendering)
- Hot loops with thousands of iterations
- Code that measurements show is slow
DON’T Optimize
- Code that runs once (startup, level load)
- Code that runs rarely (menu navigation)
- Code you haven’t measured
- At the cost of readability
How to Measure
//extern timer_us_gettime64
func timerUsGettime64() uint64
func measureGameLoop() {
start := timerUsGettime64()
updatePhysics()
physicsTime := timerUsGettime64() - start
renderStart := timerUsGettime64()
renderFrame()
renderTime := timerUsGettime64() - renderStart
println("Physics:", physicsTime, "us")
println("Render:", renderTime, "us")
}
Now you know where time actually goes!
Part 5: The Debug Build System
Production vs Debug
By default, libgodc is silent. Zero debug output, zero overhead.
# Production build (default)
make && make install
# Debug build - enables debug output and assertions
make DEBUG=3 && make install
The Performance Tax of Debug Output
┌─────────────────────────────────────────────────────────────┐
│ OPERATION Production DEBUG=3 │
│ │
│ Goroutine spawn 50 μs 188,000 μs (188 ms!) │
│ Channel send 19 μs ~50,000 μs │
│ GC pause 21 ms ~500 ms │
│ │
│ Debug output is EXTREMELY EXPENSIVE! │
│ Never benchmark with DEBUG enabled. │
│ │
└─────────────────────────────────────────────────────────────┘
Debug Macros
Instead of raw printf, use these macros:
| Macro | Use For | Example |
|---|---|---|
| `LIBGODC_TRACE()` | General tracing | Scheduler events |
| `LIBGODC_WARNING()` | Non-fatal issues | Large allocations |
| `LIBGODC_ERROR()` | Recoverable errors | Failed operations |
| `LIBGODC_CRITICAL()` | Fatal errors | Logged to crash dump |
| `GC_TRACE()` | GC-specific | Collection details |
In production (DEBUG=0): All macros compile to nothing. Zero cost.
In debug (DEBUG=3): Output includes labels:
[godc:main] Scheduling G 42 (status=1)
[godc:main] WARNING: Large allocation 256 KB
[GC] #3: 1024->512 (50% survived) in 21045 us
Using Debug Macros
In C runtime code:
#include "runtime.h"
void my_function(void) {
LIBGODC_TRACE("Entering my_function");
if (error_condition) {
LIBGODC_WARNING("Something unexpected: %d", value);
}
LIBGODC_TRACE("my_function complete");
}
In Go code, use println:
const DEBUG = false // Set to true when debugging
func debugPrint(msg string) {
if DEBUG {
println(msg)
}
}
Debug Functions Available
When investigating issues, you can call these:
gc_dump_stats(); // Print GC statistics
gc_verify_heap(); // Check heap integrity
gc_print_object(ptr); // Print object details
gc_dump_heap(10); // Dump first 10 heap objects
Real Benchmark Results
We ran these benchmarks on actual Dreamcast hardware. These numbers should guide your optimization decisions.
PVRMark: Go vs Native C
We ran the KOS pvrmark benchmark (flat-shaded triangles, no textures) on real Dreamcast hardware to measure Go runtime overhead:
| Metric | C Native | Go (default) | Go (GODC_FAST) |
|---|---|---|---|
| Peak polys/frame | 17,533 | 13,833 | 14,333 |
| Peak pps | ~1,054,097 | ~831,714 | ~860,532 |
| vs C performance | 100% | 79% | 82% |
| Binary size | 314 KB | 614 KB | 614 KB |
┌─────────────────────────────────────────────────────────────┐
│ POLYGON THROUGHPUT (polys/frame @ 60fps) │
│ │
│ C Native: ████████████████████████████████████ 17,533│
│ Go Optimized: ████████████████████████████ 14,333 │
│ Go Default: ██████████████████████████ 13,833 │
│ │
│ GODC_FAST=1 adds +500 polys/frame (+3.6%) │
│ Go achieves 82% of C polygon throughput │
└─────────────────────────────────────────────────────────────┘
Analysis:
- The 18% overhead comes from bounds checking, slice header overhead, and gccgo code generation differences (not FFI: `//extern` compiles to direct `jsr` calls)
- `GODC_FAST=1` improves performance by ~3.6% via aggressive optimization
- For real games with textures, lighting, and game logic, this difference is negligible
- 14,333 flat-shaded triangles at 60fps is plenty for actual gameplay
What the extra 300KB binary size buys you:
- Garbage collection
- Goroutines and channels
- Defer/panic/recover
- Type safety and bounds checking
- Full Go standard library support
Compiler Optimization Flags
The godc build command uses these SH-4 specific optimizations:
| Flag | Effect | Default |
|---|---|---|
| `-O2` | Standard optimization | ✓ |
| `-m4-single` | Single-precision FPU mode | ✓ |
| `-mfsrra` | Hardware reciprocal sqrt (10× faster) | ✓ |
| `-mfsca` | Hardware sin/cos (10× faster) | ✓ |
| `-O3` | Aggressive optimization | GODC_FAST only |
| `-ffast-math` | Fast FP (breaks IEEE) | GODC_FAST only |
| `-funroll-loops` | Loop unrolling | GODC_FAST only |
To enable aggressive optimizations:
GODC_FAST=1 godc build
Warning: -ffast-math breaks IEEE floating point compliance. NaN and infinity handling may not work correctly. Use only for games where FP precision isn’t critical.
Conclusion
What We Built
We started with a simple question: Can Go run on a 1998 game console?
The answer is yes. Not perfectly, not completely, but yes.
┌─────────────────────────────────────────────────────────────┐
│ │
│ libgodc: A Go Runtime for the Sega Dreamcast │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ ✓ Memory allocation with bump allocator │ │
│ │ ✓ Garbage collection (semi-space copying) │ │
│ │ ✓ Goroutines (cooperative M:1 scheduling) │ │
│ │ ✓ Channels (buffered and unbuffered) │ │
│ │ ✓ Select statement │ │
│ │ ✓ Defer, panic, and recover │ │
│ │ ✓ Maps, slices, strings, interfaces │ │
│ │ ✓ Direct C interop via //extern │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ All running on 16MB RAM and a 200MHz CPU. │
│ │
└─────────────────────────────────────────────────────────────┘
The Trade-offs We Made
Every design decision was a trade-off. Here’s what we chose and why:
| Decision | What We Gave Up | What We Gained |
|---|---|---|
| Semi-space GC | 50% of heap unusable | No fragmentation, simple code |
| Cooperative scheduling | Preemption | No locks, predictable timing |
| Fixed 64KB stacks | Stack growth | Simplicity, no stack probes |
| M:1 model | Parallelism | No thread synchronization |
| setjmp/longjmp panic | DWARF unwinding | Works without debug info |
| No finalizers | Destructor patterns | Simpler GC, predictable cleanup |
These aren’t the “right” choices for every platform. They’re our choices for this platform.
What We Didn’t Build
libgodc is not a complete Go implementation. We deliberately left out:
- Race detector — No parallelism means no data races
- CPU/memory profiling — Use `println` and timers
- Debugger support — No Go debugger is available
- Full reflection — Binary size matters
- Preemptive scheduling — Complexity for no benefit
- Concurrent GC — Single core, stop-the-world is fine
Lessons for Runtime Implementers
If you’re building a runtime for another constrained platform, here’s what we learned:
- Don’t plan everything upfront. Get `println("Hello")` working first. The linker errors will guide you to the next step.
- When documentation fails, read the code. gccgo’s `libgo/runtime/` directory answered questions no documentation could.
- Our first GC was embarrassingly slow. It didn’t matter. Once it worked, we could measure and optimize. Premature optimization would have wasted months.
- Emulators lie. Timing is different. Memory layout is different. Test on hardware as soon as you can run anything.
- Fighting the hardware is futile. The Dreamcast has 16MB RAM and a 200MHz CPU. Accept it. Design for it. Work with it.
The Bigger Picture
Working on this project helped me better understand what Go actually does.
When you write go func() {}, something has to:
- Allocate a stack
- Save the entry point
- Add it to a run queue
- Eventually switch contexts to run it
When you write x := make([]int, 10), something has to:
- Calculate the size
- Find free memory
- Initialize the slice header
- Eventually clean up when it’s garbage
That “something” is the runtime. Every high-level language has one. Understanding how it works makes you a better programmer in any language.
What’s Next?
libgodc is open source. You can:
- Use it — Build games for the Dreamcast in Go
- Extend it — Add features you need
- Learn from it — Apply these patterns to other platforms
- Contribute — Fix bugs, improve performance, write examples
The Dreamcast community is small but passionate. Join us at:
Final Words
The Sega Dreamcast was released on November 27, 1998, in Japan. It was discontinued on March 31, 2001—a commercial failure that outlived its corporate support by decades.
Twenty-five years later, people are still writing code for it. Still pushing its limits. Still finding joy in its constraints.
That’s the magic of retro computing. It’s not about nostalgia. It’s about craft. Modern development gives us infinite resources and infinite complexity. Old hardware gives us finite resources and forces elegant solutions.
libgodc exists because someone asked: “Can Go run on a Dreamcast?”
The answer is yes. And now you know how.
Thank you for reading, Panos
libgodc Design
libgodc is a Go runtime for the Sega Dreamcast. This document explains how it works under the hood.
The Problem
The Dreamcast is a fixed platform: 200MHz SH-4, 16MB RAM, no MMU, no swap. The standard Go runtime assumes infinite memory, preemptive scheduling, operating system threads, and virtual memory. None of these exist here.
libgodc replaces the Go runtime with one designed for this environment.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ Your Go Code │
│ compiles with sh-elf-gccgo │
│ produces .o files with Go runtime calls │
├────────────────────────────────────────────────────────────────┤
│ libgodc (this library) │
│ implements Go runtime functions │
│ memory allocation, goroutines, channels, GC │
├────────────────────────────────────────────────────────────────┤
│ KallistiOS (KOS) │
│ baremetal OS for Dreamcast │
│ provides malloc, threads, drivers │
├────────────────────────────────────────────────────────────────┤
│ Dreamcast Hardware │
│ SH4 CPU, PowerVR2 GPU, AICA sound │
│ 16MB main RAM, 8MB VRAM │
└────────────────────────────────────────────────────────────────┘
We don’t need the full Go runtime. We need enough to run games. Games have different requirements than servers—short sessions, realtime deadlines, no network services. This simplifies everything.
Memory Model
The Budget
16MB total RAM:
KOS kernel + drivers: ~1MB
Your program text/data: ~13MB
GC heap (two semispaces): 4MB (2MB active at any time)
Goroutine stacks: ~640KB (10 goroutines × 64KB)
Channel buffers: Variable
Available for KOS malloc: ~6-9MB (textures, audio, meshes)
These numbers come from the source configuration:
- GC heap: `GC_SEMISPACE_SIZE_KB` in godc_config.h (default 2048 = 2MB × 2)
- Stack size: `GOROUTINE_STACK_SIZE` in godc_config.h (default 64KB)

Run `bench_architecture.elf` to verify: it prints the actual config values.
The 16MB limit is absolute. There is no virtual memory, no swap, no second chance. Every byte matters.
Allocation Strategy
libgodc uses three allocation paths:
1. GC Heap (for Go objects)
Small, frequently-allocated objects go here. The semispace collector manages them automatically. Implementation: gc_heap.c, gc_copy.c.
The allocation path, in simplified pseudocode:
// Bump allocator: O(1) allocation (simplified)
void *gc_alloc(size_t size, type_descriptor *type) {
size = ALIGN(size + HEADER_SIZE, 8);
if (alloc_ptr + size > alloc_limit) {
gc_collect(); // Cheney's algorithm
}
void *obj = alloc_ptr;
alloc_ptr += size;
return obj;
}
This is simplified. The real code in `gc_heap.c` also handles large objects (>64KB bypass the GC heap and go straight to malloc), alignment edge cases, and gc_percent threshold checks. But the core is exactly this: bump a pointer.
The bump allocator is the fastest possible allocation strategy. Deallocation happens during collection: live objects are copied, dead objects are forgotten.
Usage example:
// Go: allocate freely, GC handles cleanup
func spawnEnemy() *Enemy {
return &Enemy{bullets: make([]Bullet, 100)}
}
// No kill function needed: when nothing references it, it's collected
2. KOS Heap (for large objects)
Objects larger than 64KB bypass the GC entirely. This is the right behavior for game assets: textures, audio buffers, and mesh data are typically loaded once and never freed during gameplay.
// This goes to KOS malloc, not GC:
texture := make([]byte, 256*256*2) // 128KB texture
Large objects use `malloc()` internally and are not tracked by the GC. To free them, use `runtime.FreeExternal`:
//go:linkname freeExternal runtime.FreeExternal
func freeExternal(ptr unsafe.Pointer)
// Allocate large texture
texture := make([]byte, 256*256*2) // 128KB, bypasses GC
// When done with it:
freeExternal(unsafe.Pointer(&texture[0]))
texture = nil // Don't use after freeing!
See gc_external_free in gc_heap.c. Run test_free_external.elf to verify.
A typical pattern is swapping textures between levels:
// Load level 1
bgTexture := make([]byte, 512*512*2) // 512KB
// ... play level 1 ...
// Unload before level 2
freeExternal(unsafe.Pointer(&bgTexture[0]))
bgTexture = make([]byte, 512*512*2) // reuses memory
// Or use a helper function like this:
func freeSlice(s []byte) {
if len(s) > 0 {
freeExternal(unsafe.Pointer(&s[0]))
}
}
// Then just:
freeSlice(bgTexture)
3. Stack (for goroutine execution)
Each goroutine gets a fixed 64KB stack. No stack growth, no split stacks. This is simpler and faster than growable stacks, but requires discipline.
Stack frames are freed automatically when functions return. Use the stack for temporary buffers:
func processAudio() {
buffer := [4096]int16{} // 8KB on stack, automatically freed
// ...
}
Object Header
Every GC object has an 8-byte header. The GC needs to know each object’s size (to copy it) and whether it contains pointers (to scan them). Storing this inline costs 8 bytes per object but makes lookup instant (the header lives at ptr - 8).
┌──────────────────────────────────────────────────────────┐
│ Bits 31: Forwarded (1 = copied during GC) │
│ Bits 30: NoScan (1 = no pointers) │
│ Bits 29-24: Type tag (6 bits, Go type kind) │
│ Bits 23-0: Size (24 bits, max 16MB) │
├──────────────────────────────────────────────────────────┤
│ Type pointer (32 bits, full type descriptor) │
└──────────────────────────────────────────────────────────┘
Putting numbers on it: a [4]byte array actually uses 12 bytes, not 4 (4 data + 8 header). This is why many small allocations hurt more than a few large ones.
The NoScan bit is critical for performance. Objects containing only integers, floats, or other non-pointer types skip GC scanning entirely: the collector just copies them without inspecting their contents.
The practical takeaway: prefer value types over pointer types when possible.
// Faster GC (NoScan), just a copy:
type Vertex struct { X, Y, Z float32 }
mesh := make([]Vertex, 1000)
// Slower GC (must scan), has pointers:
mesh := make([]*Vertex, 1000)
Garbage Collection
Algorithm: Cheney’s Semi-Space Collector
The heap is divided into two semispaces of equal size. Only one is active at any time. When the active space fills up:
- Stop all goroutines (stop-the-world)
- Copy all live objects to the other space
- Update all pointers to point to new locations
- Switch active space
- Resume execution
// Two semispaces
gc_heap.space[0] = memalign(32, GC_SEMISPACE_SIZE);
gc_heap.space[1] = memalign(32, GC_SEMISPACE_SIZE);
// Collection switches active space
int old_space = gc_heap.active_space;
int new_space = 1 - old_space;
gc_heap.active_space = new_space;
// Copy to new space, scan roots, update pointers
gc_scan_roots();
// ... Cheney's forwarding loop ...
This algorithm is simple, has no fragmentation, and handles cycles naturally. The cost is that only half the heap is usable at any time.
Collection Trigger
GC runs when:
- The active space exceeds the threshold (default: 75% when gc_percent=100)
- An allocation would exceed the remaining space
- An explicit GC call happens

The threshold is controlled by gc_percent:
- `gc_percent = 100` (default): threshold = 75% of heap space
- `gc_percent = 50`: threshold = 50% of heap space
- `gc_percent = -1`: disable automatic GC (only an explicit `runtime.GC()` triggers collection)
To control GC from Go:
//go:linkname setGCPercent debug.SetGCPercent
func setGCPercent(percent int32) int32
//go:linkname gc runtime.GC
func gc()
func init() {
setGCPercent(50) // Trigger at 50% instead of 75%
setGCPercent(-1) // Or: disable automatic GC entirely
gc() // Or: force a collection right now
}
Run test_gc_percent.elf to verify this works.
Pause Times
GC pause time depends on live object count and layout. Run
tests/bench_architecture.elf on hardware to measure actual pauses.
For 60fps (16.6ms frames), disable automatic GC during gameplay:
import _ "unsafe"
//go:linkname setGCPercent debug.SetGCPercent
func setGCPercent(percent int32) int32
//go:linkname forceGC runtime.GC
func forceGC()
func main() {
setGCPercent(-1) // Disable automatic GC
// ... game runs with no GC pauses ...
// GC during loading screens only:
showLoadingScreen()
forceGC()
startGameplay()
}
Root Scanning
The GC finds live objects by tracing from roots:
static void gc_scan_roots(void)
{
// Scan explicit roots (gc_add_root)
for (int i = 0; i < gc_root_table.count; i++) { ... }
// Scan compiler-registered roots (registerGCRoots)
gc_scan_compiler_roots();
// Scan current stack
gc_scan_stack();
// Scan all goroutine stacks
gc_scan_all_goroutine_stacks();
}
- Global variables: registered by gccgo-generated code via `registerGCRoots()`. Each package contributes a root list.
- Goroutine stacks: scanned conservatively. Every aligned pointer-sized value that points into the heap is treated as a potential pointer.
- Explicit roots: optional. If you write C code that holds pointers to Go objects, call `gc_add_root(&ptr)` so the GC doesn’t collect them.
DMA Hazard
The GC moves objects. Any pointer held by hardware (PVR DMA, AICA) will become stale after collection. Safe patterns:
// DANGEROUS: GC might move buffer during DMA:
data := make([]byte, 4096) // Small, in GC heap
startDMA(data) // Hardware holds pointer
runtime.Gosched() // GC might run here!
// SAFE: Large allocations bypass GC:
data := make([]byte, 100*1024) // >64KB, uses malloc
startDMA(data) // Won't move
// SAFE: VRAM for textures:
tex := kos.PvrMemMalloc(size) // Allocates in VRAM
Scheduler
M:1 Cooperative Model
All goroutines run on a single KOS thread. One goroutine executes at a time. Context switches happen only at explicit yield points:
- Channel operations (send, receive, select)
- `runtime.Gosched()`
- `time.Sleep()` and timer waits
- Blocking I/O
A goroutine in a tight CPU loop will monopolize the processor. There is no preemption.
Why M:1?
The Dreamcast has one CPU core. Preemptive scheduling adds complexity and overhead for no parallelism benefit. Cooperative scheduling is simpler, faster, and sufficient for games.
Run Queue Structure
The scheduler maintains a simple FIFO run queue. Goroutines are added to the tail and removed from the head. This is simpler than priority-based scheduling and sufficient for game workloads where you control when each goroutine yields.
// Goroutines execute in the order they become runnable
runq_put(gp); // Add to tail
gp = runq_get(); // Remove from head
For real-time requirements, structure your code so time-sensitive work runs on the main goroutine or yields frequently, as in the sketch below.
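A sketch of that structure; `Update`, `Render`, and the worker body are placeholders:

```go
package main

import "runtime"

func doSmallWorkSlice() {} // placeholder: decompression, AI planning, ...
func Update()           {} // placeholder
func Render()           {} // placeholder

func main() {
	// Background worker: yields after each small unit, so it can never
	// hold the CPU for a whole frame.
	go func() {
		for {
			doSmallWorkSlice()
			runtime.Gosched() // back to the tail of the FIFO queue
		}
	}()

	// Frame-critical work stays on the main goroutine.
	for {
		Update()
		Render()
		runtime.Gosched() // give the worker its slice each frame
	}
}
```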
Context Switching
Each goroutine saves 64 bytes of CPU state when it yields:
typedef struct sh4_context {
uint32_t r8, r9, r10, r11, r12, r13, r14; // Callee-saved
uint32_t sp, pr, pc; // Special registers
uint32_t fr12, fr13, fr14, fr15; // FPU callee-saved
uint32_t fpscr, fpul; // FPU control
} sh4_context_t;
Context switch is implemented in runtime_sh4_minimal.S (simplified for brevity):
__go_swapcontext:
! Save current context
mov.l r8, @r4 ! r4 = old_ctx
mov.l r9, @(4, r4)
...
! Restore new context
mov.l @r5, r8 ! r5 = new_ctx
mov.l @(4, r5), r9
...
rts
FPU Context
Every context switch saves floating-point registers, even if your goroutine only uses integers. This costs ~50 extra cycles per switch.
// Both goroutines pay FPU overhead, even though neither uses floats
go audioDecoder() // Integer PCM math
go networkHandler() // Packet parsing
This is a tradeoff: always saving FPU is slower but correct. A goroutine that unexpectedly uses a float won’t corrupt another’s FPU state.
Goroutine Structure
typedef struct G {
// ABI-CRITICAL: gccgo expects these at specific offsets
PanicRecord *_panic; // Offset 0: innermost panic
GccgoDefer *_defer; // Offset 4: innermost defer
// Scheduling
Gstatus atomicstatus;
G *schedlink;
void *param;
// Stack
void *stack_lo;
void *stack_hi;
stack_segment_t *stack;
void *stack_guard;
tls_block_t *tls;
// CPU context (64 bytes)
sh4_context_t context;
// Metadata
int64_t goid;
WaitReason waitreason;
int32_t allgs_index;
uint32_t death_generation;
G *dead_link;
uint8_t gflags2;
// Channel wait
sudog *waiting;
// Defer/panic
Checkpoint *checkpoint;
int defer_depth;
// Entry point
uintptr_t startpc;
G *freeLink;
} G;
See goroutine.h for the authoritative definition.
Goroutine Lifecycle
1. Creation: `__go_go()` allocates the G struct, stack, and TLS block
2. Runnable: added to the run queue
3. Running: the scheduler switches context to it
4. Waiting: parked on a channel, timer, or I/O
5. Dead: function returned, queued for cleanup
Dead goroutines are reclaimed after a grace period (epoch-based reclamation) to ensure no dangling sudog references from channel wait queues.
Channels
Channels are the primary synchronization primitive. Implementation follows the Go runtime closely.
Structure
typedef struct hchan {
uint32_t qcount; // Current element count
uint32_t dataqsiz; // Buffer size (0 = unbuffered)
void *buf; // Circular buffer
uint16_t elemsize; // Element size
uint8_t closed; // Channel closed flag
uint8_t buf_mask_valid; // Optimization: can use & instead of %
struct __go_type_descriptor *elemtype;
uint32_t sendx, recvx; // Buffer indices
waitq recvq, sendq; // Wait queues (sudog linked lists)
uint8_t locked; // Simple lock flag
} hchan;
Unbuffered Channels
Send blocks until a receiver arrives. Receive blocks until a sender arrives. When both are ready, data transfers directly, with no buffering.
This is the fundamental synchronization primitive: rendezvous.
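The classic use is a completion handshake. A sketch, where `loadLevel` is a placeholder:

```go
func loadLevel() {} // placeholder for real work

func waitForLoad() {
	done := make(chan struct{}) // unbuffered: pure rendezvous
	go func() {
		loadLevel()
		done <- struct{}{} // parks until waitForLoad reaches the receive
	}()
	<-done // parks until the loader reaches the send
	println("level loaded")
}
```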
Buffered Channels
Send blocks only when buffer is full. Receive blocks only when buffer is empty. The buffer is a simple circular array.
Select
Select uses randomized ordering to prevent starvation:
select {
case x := <-ch1: // These are checked in random order
case ch2 <- y:
case <-time.After(timeout):
}
Implementation: shuffle cases, check each for readiness, park on all if none ready.
Defer, Panic, Recover
Defer
Defer uses a linked list per goroutine. Each defer statement pushes a
record; function exit pops and executes them in LIFO order.
typedef struct GccgoDefer {
struct GccgoDefer *link; // Next entry in defer stack
bool *frame; // Pointer to caller's frame bool
PanicRecord *panicStack; // Panic stack when deferred
PanicRecord *_panic; // Panic that caused defer to run
uintptr_t pfn; // Function pointer to call
void *arg; // Argument to pass to function
uintptr_t retaddr; // Return address for recover matching
bool makefunccanrecover; // MakeFunc recover permission
bool heap; // Whether heap allocated
} GccgoDefer; // 32 bytes total
Panic and Recover
User-initiated panic (`panic()`) is recoverable via `recover()` in a deferred function. The implementation uses `setjmp`/`longjmp` with checkpoints.
Runtime panics (nil dereference, bounds check, divide by zero) are not recoverable: they crash immediately with a diagnostic.
Why? Recovering from a bounds check failure would leave the program in an undefined state. It’s better to crash clearly than corrupt silently.
Type System
Type Descriptors
gccgo generates type descriptors for every Go type. libgodc uses these for:
- GC pointer scanning (which fields contain pointers?)
- Interface method dispatch (which methods does this type implement?)
- Reflection (what is this type’s name and structure?)
typedef struct __go_type_descriptor {
uint8_t __code; // Kind (bool, int, slice, etc.)
uint8_t __align, __field_align;
uintptr_t __size;
uint32_t __hash;
uintptr_t __ptrdata; // Bytes containing pointers
const void *__gcdata; // Pointer bitmap
// ...
} __go_type_descriptor;
Interface Tables
Interface dispatch uses precomputed method tables. When you write:
var w io.Writer = os.Stdout
w.Write(data)
The compiler generates an itab linking *os.File to io.Writer, containing
function pointers for all interface methods.
SH4 Specifics
Register Allocation
- r0-r7: Caller-saved (arguments, scratch)
- r8-r14: Callee-saved (preserved across calls)
- r15: Stack pointer
- pr: Procedure return (return address)
- GBR: Reserved for KOS `_Thread_local`
We do not use GBR for goroutine TLS. Instead, we use a global current_g
pointer. This avoids conflicts with KOS and simplifies context switching.
FPU Mode
libgodc uses single-precision mode (-m4-single). The SH4 FPU is fast in single-precision but slow in double-precision. All float64 operations generate software emulation calls; avoid them in hot paths.
Cache Considerations
The SH4 has 32-byte cache lines. Context switching saves/restores 64 bytes of CPU state (2 cache lines).
DMA operations require explicit cache management. The GC handles this for its semispace flip, but user code doing DMA must use KOS cache functions:
#include <arch/cache.h>
dcache_flush_range((uintptr_t)ptr, size); // Flush before DMA write
dcache_inval_range((uintptr_t)ptr, size); // Invalidate after DMA read
File Organization
runtime/
├── gc_heap.c # Heap initialization, allocation
├── gc_copy.c # Cheney's copying collector
├── gc_runtime.c # Go runtime interface (newobject, etc.)
├── scheduler.c # Run queue, schedule(), goready()
├── proc.c # Goroutine creation, lifecycle
├── chan.c # Channel implementation
├── select.c # Select statement
├── sudog.c # Wait queue entries
├── defer_dreamcast.c # Defer/panic/recover
├── timer.c # Time.Sleep, timers
├── tls_sh4.c # TLS management
├── runtime_sh4_minimal.S # Context switching assembly
├── interface_dreamcast.c # Interface dispatch
├── map_dreamcast.c # Map implementation
├── goroutine.h # Core data structures
├── gen-offsets.c # Generates struct offset definitions
└── asm-offsets.h # Auto-generated struct offsets for assembly
Assembly/C ABI Synchronization
The Problem
Context switching is implemented in assembly (runtime_sh4_minimal.S). The assembly
code accesses G struct fields by hardcoded byte offsets:
mov.l @(32, r4), r0 ! Load G->context at offset 32
If someone changes the G struct in C (adds/removes/reorders fields), the assembly breaks silently: it reads garbage from the wrong offsets. This is a classic embedded-systems bug: C struct layout changes invisibly break handwritten assembly.
The Solution
We use a three-layer defense:
1. Generated Header (asm-offsets.h)
gen-offsets.c uses offsetof() to emit the actual struct offsets:
// gen-offsets.c
OFFSET(G_CONTEXT, G, context); // Emits: #define G_CONTEXT 32
The Makefile compiles this to assembly, extracts the #define lines, and writes
asm-offsets.h. This header is committed to git.
2. Build-Time Verification (make check-offsets)
Before release, run:
make check-offsets
This regenerates the offsets from the current struct and diffs against the committed header. If they don’t match, the build fails with a clear error.
3. Runtime Verification (scheduler.c)
At startup, the scheduler verifies critical offsets:
if (offsetof(G, context) != G_CONTEXT) {
runtime_throw("G struct layout mismatch - update asm-offsets.h");
}
If somehow a mismatched binary runs, it crashes immediately with a diagnostic instead of silently corrupting goroutine state.
Workflow for Changing G Struct
1. Modify `runtime/goroutine.h` (the authoritative definition)
2. Update `runtime/gen-offsets.c` to match
3. Run `make check-offsets` — it will fail if out of sync
4. Run `make runtime/asm-offsets.h` to regenerate
5. Update `runtime/runtime_sh4_minimal.S` if `G_CONTEXT` changed
6. Run `make check-offsets` again — should pass now
7. Commit all changed files together
Why This Matters
In games, struct layout bugs cause symptoms like:
- Goroutines resume with corrupted registers
- Context switches overwrite random memory
- FPU state leaks between goroutines
- Panics with nonsensical stack traces
These are nearly impossible to debug. The offset verification catches them at build time (or worst case, at startup) instead of during the final boss fight.
Performance
Measured on real Dreamcast hardware (SH4 @ 200MHz), verified December 2025:
| Operation | Time | Notes |
|---|---|---|
| Gosched yield | 120 ns | Minimal scheduler round-trip |
| Direct call | 140 ns | Baseline comparison |
| Buffered channel op | ~1.5 μs | Send to ready receiver |
| Context switch | ~6.6 μs | Full goroutine switch |
| Unbuffered channel | ~13 μs | Send + receive round-trip |
| Goroutine spawn | ~34 μs | Create + schedule + run |
| GC pause (bypass) | ~73 μs | Objects ≥64KB bypass GC |
| GC pause (64KB live) | ~2.2 ms | Medium live set |
| GC pause (32KB live) | ~6.2 ms | Many small objects |
Run tests/bench_architecture.elf to measure on your hardware.
Note: For a complete reference of performance numbers, see the Glossary.
Design Decisions
Why gccgo instead of gc?
The standard Go compiler (gc) generates code for a completely different runtime. gccgo uses GCC’s backend, which already supports SH4 targets. We replace libgo with libgodc; the compiler doesn’t need modification.
Why semispace instead of mark-sweep?
Semispace has no fragmentation. In a 16MB system, fragmentation would eventually make large allocations impossible even with free memory. The 50% space overhead is acceptable for games.
Why cooperative instead of preemptive?
Preemptive scheduling requires timer interrupts, signal handling, and safepoint insertion. All of this complexity gains nothing on a single-core CPU. Cooperative scheduling is simpler, faster, and sufficient.
Why fixed stacks instead of growable?
Growable stacks require compiler support (stack probes) and runtime support (morestack). Fixed stacks work with any compiler flags and simplify the runtime. 64KB is enough for typical game code.
References
- Cheney, C.J. “A Nonrecursive List Compacting Algorithm.” CACM, 1970.
- Jones & Lins. “Garbage Collection.” Wiley, 1996.
- The Go Programming Language Specification.
- KallistiOS Documentation.
- SH-4 Software Manual, Renesas.
Effective Dreamcast Go
A practical guide to writing efficient Go code for the Sega Dreamcast.
These patterns come from real debugging sessions with the libgodc runtime. Follow them to write games that run smoothly at 60 fps on the Dreamcast’s 200MHz SH-4 processor with 16MB RAM.
Memory Model
| Resource | Limit | Notes |
|---|---|---|
| Total RAM | 16 MB | Shared with VRAM, sound, OS |
| GC Heap | 2 MB × 2 | Semispace collector, 4MB total |
| Goroutine Stack | 64 KB | Fixed size, cannot grow |
| Large Object Threshold | 64 KB | Objects larger bypass GC |
1. Pre-allocate During Loading
The garbage collector can pause your game for several milliseconds. Allocate everything during load screens, not gameplay.
Bad: Allocating during gameplay
func UpdateParticles() {
for i := 0; i < 100; i++ {
p := new(Particle) // GC pause risk every frame!
particles = append(particles, p)
}
}
Good: Object pooling
// Pre-allocated pool
var particlePool [1000]Particle
var activeCount int
func Init() {
activeCount = 0
}
func SpawnParticle() *Particle {
if activeCount >= len(particlePool) {
return nil // Pool exhausted
}
p := &particlePool[activeCount]
activeCount++
*p = Particle{} // Reset to zero
return p
}
func DespawnParticle(index int) {
// Swap with last active
activeCount--
particlePool[index] = particlePool[activeCount]
}
2. Respect the 64KB Stack Limit
Each goroutine has a fixed 64KB stack. Unlike desktop Go, stacks cannot grow. Deep recursion or large local variables will crash your game.
Bad: Large local arrays
func ProcessFrame() {
var buffer [16384]float32 // 64KB on stack - CRASH!
// ...
}
Good: Use globals or heap for large data
var frameBuffer [8192]float32 // Global, not on stack
func ProcessFrame() {
// Use frameBuffer safely
for i := range frameBuffer {
frameBuffer[i] = 0
}
}
Bad: Deep recursion
func TraverseTree(node *Node) {
if node == nil { return }
TraverseTree(node.left) // Stack grows each call
TraverseTree(node.right) // Can overflow on deep trees
}
Good: Iterative with explicit stack
func TraverseTree(root *Node) {
stack := make([]*Node, 0, 64) // Heap-allocated
stack = append(stack, root)
for len(stack) > 0 {
node := stack[len(stack)-1]
stack = stack[:len(stack)-1]
if node == nil { continue }
// Process node...
stack = append(stack, node.left, node.right)
}
}
3. Reuse Slices
Creating new slices allocates memory. Reuse existing slices by resetting their length.
Bad: New slice every frame
func GetVisibleEnemies() []Enemy {
result := make([]Enemy, 0) // Allocation every call!
for _, e := range allEnemies {
if e.visible {
result = append(result, e)
}
}
return result
}
Good: Reuse with length reset
var visibleEnemies []Enemy
func Init() {
visibleEnemies = make([]Enemy, 0, 100) // Once during init
}
func GetVisibleEnemies() []Enemy {
visibleEnemies = visibleEnemies[:0] // Reset length, keep capacity
for _, e := range allEnemies {
if e.visible {
visibleEnemies = append(visibleEnemies, e)
}
}
return visibleEnemies
}
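Note that the returned slice aliases the shared backing array: consume it before the next call to GetVisibleEnemies, and don’t store it across frames.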
4. Minimize Goroutines
Each goroutine consumes 64KB of stack space. 100 goroutines = 6.4MB RAM—40% of total Dreamcast memory!
Bad: Goroutine per entity
for _, enemy := range enemies {
go enemy.Think() // 100 enemies = 6.4MB just for stacks!
}
Good: Process on main goroutine
func UpdateAllEnemies() {
for i := range enemies {
enemies[i].Think() // Sequential, predictable
}
}
Acceptable: Few dedicated goroutines
func main() {
go audioMixer() // One for audio streaming
go networkHandler() // One for network (if needed)
// Main loop handles game logic
for {
Update()
Render()
}
}
5. Use Value Types for Small Structs
Small structs passed by value stay on the stack. Pointers may escape to the heap.
Good: Pass small structs by value
type Vec3 struct {
X, Y, Z float32 // 12 bytes
}
func Add(a, b Vec3) Vec3 {
return Vec3{a.X + b.X, a.Y + b.Y, a.Z + b.Z}
}
// Usage - no heap allocation
pos := Add(velocity, acceleration)
Bad: Unnecessary pointer for small struct
func Add(a, b *Vec3) *Vec3 {
return &Vec3{a.X + b.X, a.Y + b.Y, a.Z + b.Z} // Escapes to heap!
}
Structs under ~64 bytes are fine to pass by value.
6. Avoid String Operations During Gameplay
Strings are immutable. Concatenation creates new strings (garbage).
Bad: String building in loop
var log string
for i := 0; i < 100; i++ {
log = log + "entry" // New allocation each iteration!
}
Bad: Formatted strings every frame
func DrawHUD() {
scoreText := fmt.Sprintf("Score: %d", score) // Allocates!
DrawText(scoreText)
}
Good: Pre-render or avoid strings
// For HUD: use digit sprites
func DrawScore(score int) {
x := 100
for score > 0 {
digit := score % 10
DrawSprite(digitSprites[digit], x, 10)
x -= 16
score /= 10
}
}
// For debug: print directly (still allocates, but debug only)
println("Debug:", value)
7. Large Assets Bypass GC
Allocations of 64KB or larger use malloc directly and are not garbage collected.
// This 128KB texture is NOT managed by GC
texture := make([]byte, 256*256*2)
// It will live forever (or until program exit)
// This is usually fine - load assets once, keep forever
Implications:
- Large slices don't pressure the GC
- They also don't get freed automatically
- Perfect for textures, sounds, level data
8. Escape Analysis Awareness
The Go compiler decides whether variables go on stack (fast) or heap (needs GC). Variables "escape" to heap when:
- Returned from a function
- Stored in a slice, map, or struct field
- Passed to a goroutine
- Address taken and stored somewhere
Stack allocated (good):
func Calculate() int {
x := 42 // Stays on stack
y := x * 2 // Stays on stack
return y // Value returned, not pointer
}
Heap allocated (be aware):
func MakeEnemy() *Enemy {
e := Enemy{} // Must escape - we return pointer
return &e // Heap allocation here
}
Force stack when possible:
// Instead of returning pointer...
func MakeEnemy() *Enemy {
return &Enemy{HP: 100} // Heap
}
// Return value and let caller decide:
func NewEnemy() Enemy {
return Enemy{HP: 100} // Caller's stack or their choice
}
9. Map Usage Patterns
Maps allocate internally. Pre-size them and avoid creating during gameplay.
Bad: Maps created during gameplay
func SpawnWave() {
enemyTypes := make(map[string]int) // Allocation!
enemyTypes["goblin"] = 10
// ...
}
Good: Pre-allocated maps
var enemyTypes map[string]int
func Init() {
enemyTypes = make(map[string]int, 10) // Pre-size at init
}
func SpawnWave() {
// Clear and reuse
for k := range enemyTypes {
delete(enemyTypes, k)
}
enemyTypes["goblin"] = 10
}
10. The Game Loop Pattern
A typical Dreamcast game structure:
package main
// === PRE-ALLOCATED RESOURCES ===
var (
enemies [100]Enemy
particles [500]Particle
projectiles [200]Projectile
activeEnemies []*Enemy
activeParticles []*Particle
activeProjectiles []*Projectile
)
func Init() {
// Pre-allocate slice capacity
activeEnemies = make([]*Enemy, 0, 100)
activeParticles = make([]*Particle, 0, 500)
activeProjectiles = make([]*Projectile, 0, 200)
// Load assets (large allocations OK here)
LoadTextures()
LoadSounds()
LoadLevel()
}
func Update() {
// Reset working slices
activeEnemies = activeEnemies[:0]
// Process game logic (no allocations!)
for i := range enemies {
if enemies[i].active {
enemies[i].Update()
activeEnemies = append(activeEnemies, &enemies[i])
}
}
}
func Render() {
// Draw using pre-allocated data
for _, e := range activeEnemies {
e.Draw()
}
}
func main() {
Init()
for !shouldExit {
Input()
Update()
Render()
// VSync handled by PVR
}
}
Quick Reference Card
DO
var pool [N]Object // Pre-allocated pools
slice = slice[:0] // Reset slice, keep capacity
for i := range arr { } // Index iteration
small := Vec3{1, 2, 3} // Value types
make([]T, 0, capacity) // Pre-sized slices (at init)
val, ok := m[key] // Safe map access
select { default: } // Yield in loops
runtime_checkpoint() // For panic recovery
AVOID (during gameplay)
make([]T, n) // New slices
append(s, x) // When at capacity
new(T) // For small types
go func() {}() // Excessive goroutines
string + string // String concatenation
fmt.Sprintf() // Formatted strings
recover() // Use runtime_checkpoint instead
for { busyWork() } // Loops without yielding
11. Panic/Recover Limitation
Standard Go’s recover() does not work on Dreamcast due to ABI differences. Use the runtime_checkpoint() pattern instead:
Bad: Standard recover (won’t work)
func SafeCall() {
defer func() {
if r := recover(); r != nil { // NEVER catches panics!
println("recovered")
}
}()
panic("oops")
}
Good: Use runtime_checkpoint
import _ "unsafe"
//go:linkname runtime_checkpoint runtime.runtime_checkpoint
func runtime_checkpoint() int
func SafeCall() (recovered bool) {
defer func() {
if runtime_checkpoint() != 0 {
recovered = true
return
}
// Normal cleanup here
}()
panic("oops")
return false
}
Most game code shouldn’t need recover. Design to avoid panics:
- Check bounds before indexing
- Validate inputs at entry points
- Use the ok form for map access: val, ok := m[key]
12. Cooperative Scheduling
The Dreamcast scheduler is cooperative, not preemptive. Goroutines run until they yield.
Goroutines yield when they:
- Send/receive on channels
- Call select (including with default)
- Call explicit yield functions
- Block on I/O
Bad: Infinite loop without yielding
go func() {
for {
doWork() // Never yields - blocks all other goroutines!
}
}()
Good: Yield periodically
go func() {
for {
doWork()
select {
case <-done:
return
default:
// Yields to scheduler, then continues
}
}
}()
Better: Use channels for work
go func() {
for item := range workQueue { // Yields while waiting
process(item)
}
}()
Timing is not guaranteed
Because of cooperative scheduling:
- Don’t rely on precise goroutine ordering
- Deadlines are “best effort”, not hard guarantees
- For real-time needs, keep critical work on main goroutine
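For example, frame-critical work can stay on the main goroutine while best-effort work drains a channel in the background. In this sketch, saveRequests and writeVMU are hypothetical:
// Best-effort background work: runs whenever the scheduler gets to it.
go func() {
    for range saveRequests {
        writeVMU()
    }
}()
// Frame-critical work: runs deterministically, every frame, on main.
for {
    Update()
    Render()
}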
13. Select with Default
select with default is an efficient polling pattern that yields correctly:
func pollChannels() {
for {
select {
case msg := <-inputChan:
handleInput(msg)
case result := <-resultChan:
handleResult(result)
default:
// No message ready - yields to other goroutines
// then returns immediately
}
// Can do other work here
processFrame()
}
}
This pattern works well for:
- Non-blocking channel checks
- Game loops that need to poll multiple sources
- Background workers that shouldn’t block the main loop
Platform Constraints
Goroutine Leak
Dead goroutines retain ~160 bytes each (G struct only). The stack memory and TLS are properly reclaimed, and the G struct is kept in a free list for reuse by future goroutines. When you spawn a new goroutine, it reuses a G from the free list if available.
If you spawn 10,000 goroutines that all exit without spawning new ones, you’ll
have ~1.6MB in the free list. This memory is reused when you spawn new
goroutines. Monitor goroutine count with runtime.NumGoroutine().
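A cheap way to watch this during development is a per-frame counter that logs the goroutine count once per second. A sketch, assuming a 60fps main loop:
import "runtime"

var watchFrames int

// GoroutineWatch logs the goroutine count every 60 frames (~1s at 60fps)
// so unexpected growth shows up on the serial console.
func GoroutineWatch() {
    watchFrames++
    if watchFrames%60 == 0 {
        println("goroutines:", runtime.NumGoroutine())
    }
}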
Unrecoverable Runtime Panics
User panic() is recoverable (via the runtime_checkpoint() pattern). Runtime panics are not:
- Nil pointer dereference
- Array/slice bounds check
- Integer divide by zero
- Stack overflow
These crash immediately. A bounds check failure means program invariants are violated—continuing would corrupt data.
32-bit Pointers
All pointers are 4 bytes. Code assuming 64-bit pointers will break.
unsafe.Sizeof(uintptr(0)) returns 4, not 8.
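A defensive sketch: fail fast at startup if the code was built with 64-bit pointer assumptions (for example, an accidental host build of Dreamcast-only code):
import "unsafe"

func init() {
    // On the Dreamcast build this is always 4; panicking here catches
    // code compiled for the wrong target.
    if unsafe.Sizeof(uintptr(0)) != 4 {
        panic("expected 32-bit pointers")
    }
}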
Single-Precision FPU
The SH-4 FPU operates in single precision. Double precision is software
emulated—extremely slow. Avoid float64 in hot paths.
Cache Coherency
DMA operations require explicit cache management. Use KOS cache functions
from C or via //extern:
#include <arch/cache.h>
dcache_flush_range((uintptr_t)ptr, size); // Before DMA write (CPU -> HW)
dcache_inval_range((uintptr_t)ptr, size); // After DMA read (HW -> CPU)
Not Implemented
- Race detector
- CPU/memory profiling
- Debugger support (delve, gdb)
- Plugin package
- cgo (use //extern for C functions)
- Signals (os.Signal, signal.Notify)
- Networking (requires Broadband Adapter)
Limited Implementation
- reflect: Basic type inspection only, no reflect.MakeFunc
- unsafe: Works, but remember 4-byte pointers
- sync: Mutexes work, but with M:1 scheduling no other goroutine runs while you hold a lock—deadlock is impossible but starvation is easy
Compatibility
- gccgo only (not the standard gc compiler)
- KallistiOS required
- SH-4 architecture only
Debugging Tips
Available tools:
- Serial output via println() (routed to dc-tool)
- LIBGODC_ERROR / LIBGODC_CRITICAL macros (defined in runtime.h)
- GC statistics via the C function gc_stats(&used, &total, &collections)
- runtime.NumGoroutine() to count active goroutines
- KOS debug console (dbglog())
Not available: stack traces, core dumps, breakpoints, variable inspection, heap profiling. When something goes wrong, you have println() and your brain.
If your game stutters:
- Check GC pauses: Add timing around forceGC() calls to measure
- Count allocations: Use pools and count activeCount
- Monitor goroutines: Keep count of active goroutines
- Profile stack usage: Deep call chains near 64KB will crash
If your game freezes (but doesn’t crash):
- Goroutine not yielding: A goroutine in a tight loop starves others
- Deadlock: Two goroutines waiting on each other’s channels
- Main blocked: Main goroutine waiting on a channel nobody sends to
If your game crashes:
- Stack overflow: Reduce recursion, shrink local arrays
- Nil pointer: Check slice bounds, map existence
- GC corruption: Ensure pointers are valid (not into freed memory)
- Panic without checkpoint: Use runtime_checkpoint() for recovery
Further Reading
- docs/DESIGN.md — Runtime architecture
- docs/KOS_WRAPPERS.md — Hardware access
- examples/ — Working game examples
Console development is the art of saying ‘no’ to malloc.
KOS API Bindings
KOS is written in C. Your game is written in Go. gccgo’s //extern directive
lets you call C functions directly with no wrapper overhead.
┌─────────────────────────────────────────────────────┐
│ Go Code │
│ kos.PvrInitDefaults() │
│ │ │
│ ▼ │
│ //extern pvr_init_defaults │
│ func PvrInitDefaults() int32 │
│ │ │
│ ▼ │
│ pvr_init_defaults() in libkallisti.a │
│ │ │
│ ▼ │
│ Dreamcast Hardware │
└─────────────────────────────────────────────────────┘
Basic Syntax
Function with No Arguments
//go:build gccgo
package kos
//extern pvr_scene_begin
func PvrSceneBegin()
The //extern comment must immediately precede the function declaration, with no blank lines between them. The function has no body—gccgo generates the call directly.
Function with Arguments
//extern pvr_list_begin
func PvrListBegin(list uint32) int32
//extern pvr_poly_compile
func pvrPolyCompile(header uintptr, context uintptr)
Arguments are passed according to the SH-4 ABI: first four in registers (r4-r7), remainder on the stack.
Function with Return Value
//extern pvr_mem_available
func PvrMemAvailable() uint32
//extern timer_us_gettime64
func TimerUsGettime64() uint64
Return values come back in r0 (32-bit) or r0:r1 (64-bit).
Type Mappings
The SH-4 is a 32-bit architecture with 4-byte alignment.
| C Type | Go Type | Size | Notes |
|---|---|---|---|
| void | (no return) | - | |
| int | int32 | 4 | SH-4 int is 32-bit |
| unsigned int | uint32 | 4 | |
| int8_t | int8 | 1 | |
| uint8_t | uint8 | 1 | |
| int16_t | int16 | 2 | |
| uint16_t | uint16 | 2 | |
| int32_t | int32 | 4 | |
| uint32_t | uint32 | 4 | |
| int64_t | int64 | 8 | |
| uint64_t | uint64 | 8 | |
| float | float32 | 4 | |
| double | float64 | 8 | Software emulated—slow |
| void* | unsafe.Pointer | 4 | |
| char* | *byte | 4 | Or unsafe.Pointer |
| size_t | uint32 | 4 | uintptr also works |
| struct foo* | *Foo | 4 | Define matching Go struct |
Pointer Size
All pointers are 4 bytes. Code that assumes 64-bit pointers will break.
unsafe.Sizeof(uintptr(0)) is 4, not 8.
Struct Mappings
When a KOS function takes a pointer to a struct, you have two options:
Option 1: unsafe.Pointer (Quick and Dirty)
//extern pvr_vertex_submit
func pvrVertexSubmit(data unsafe.Pointer, size int32)
// Usage:
func SubmitVertex(v *PvrVertex) {
pvrVertexSubmit(unsafe.Pointer(v), int32(unsafe.Sizeof(*v)))
}
Works but provides no type safety. Fine for prototyping.
Option 2: Matching Go Struct (Correct)
Define a Go struct with identical layout to the C struct:
// From dc/pvr.h
typedef struct {
uint32_t flags;
float x, y, z;
float u, v;
uint32_t argb;
uint32_t oargb;
} pvr_vertex_t;
// In Go
type PvrVertex struct {
Flags uint32
X, Y, Z float32
U, V float32
ARGB uint32
OARGB uint32
}
//extern pvr_prim
func pvrPrim(data unsafe.Pointer, size int32)
// PvrPrimVertex submits a vertex to the TA
func PvrPrimVertex(v *PvrVertex) {
pvrPrim(unsafe.Pointer(v), 32) // 32 bytes
}
Verify the struct size matches:
func init() {
if unsafe.Sizeof(PvrVertex{}) != 32 {
panic("PvrVertex size mismatch")
}
}
Alignment Matters
C structs may have padding for alignment. Go structs follow Go’s alignment rules, which may differ. Always verify sizes match.
// C struct with padding:
// struct { char a; int b; } // 8 bytes (3 bytes padding after a)
// Go equivalent:
type Example struct {
A byte
_ [3]byte // Explicit padding
B int32
}
Stub Files for Host Compilation
Go files using //extern only compile with gccgo. For IDE support and host-side testing, create stub files:
pvr.go (Dreamcast build)
//go:build gccgo
package kos
//extern pvr_init_defaults
func PvrInitDefaults() int32
//extern pvr_scene_begin
func PvrSceneBegin()
pvr_stub.go (Host build)
//go:build !gccgo
package kos
func PvrInitDefaults() int32 { panic("kos: not on Dreamcast") }
func PvrSceneBegin() { panic("kos: not on Dreamcast") }
The build tag ensures the right file is used:
- gccgo tag: compiles with sh-elf-gccgo (Dreamcast)
- !gccgo tag: compiles with standard go (host)
Common Patterns
Wrapper for Type Safety
Expose a safe public API, hide the unsafe internals:
// Private: direct C binding
//extern maple_dev_status
func mapleDevStatus(dev uintptr) uintptr
// Public: type-safe wrapper with method syntax
func (d *MapleDevice) ContState() *ContState {
if d == nil {
return nil
}
ptr := mapleDevStatus(uintptr(unsafe.Pointer(d)))
if ptr == 0 {
return nil
}
return (*ContState)(unsafe.Pointer(ptr))
}
Slice to C Array
C functions expect a pointer and length. Go slices have both:
//extern pvr_txr_load
func pvrTxrLoad(src unsafe.Pointer, dst unsafe.Pointer, count uint32)
func PvrTxrLoad(src []byte, dst unsafe.Pointer) {
if len(src) == 0 {
return
}
pvrTxrLoad(unsafe.Pointer(&src[0]), dst, uint32(len(src)))
}
Always check for empty slices—&src[0] panics on an empty slice.
String to C String
Go strings are not null-terminated. C functions expect null-terminated strings.
import "unsafe"
// Convert Go string to C string (allocates)
func cstring(s string) *byte {
b := make([]byte, len(s)+1)
copy(b, s)
b[len(s)] = 0
return &b[0]
}
// Usage:
//extern fs_open
func fsOpen(path *byte, mode int32) int32
func Open(path string) int32 {
return fsOpen(cstring(path), O_RDONLY)
}
For hot paths, avoid allocation by using fixed buffers:
var pathBuf [256]byte
func OpenFast(path string) int32 {
if len(path) >= 255 {
panic("path too long")
}
copy(pathBuf[:], path)
pathBuf[len(path)] = 0
return fsOpen(&pathBuf[0], O_RDONLY)
}
Callback Functions
Some KOS functions take callbacks. This requires careful handling:
//extern pvr_set_bg_color
func PvrSetBgColor(r, g, b float32)
// For callbacks, you often need to use //export to make a Go function
// callable from C. However, this is complex with gccgo.
// Prefer polling over callbacks when possible.
Callbacks from C to Go are tricky because:
- The callback runs on whatever stack C chooses
- The Go scheduler may not be in a consistent state
- The GC may be running
Poll instead of using callbacks when you can.
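A polling sketch (checkDone is a hypothetical wrapper around whatever KOS status query applies to your transfer):
import "runtime"

// waitForTransfer polls a completion flag each frame instead of asking C
// to call back into Go.
func waitForTransfer(checkDone func() bool) {
    for !checkDone() {
        runtime.Gosched() // yield so other goroutines keep running
    }
}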
Caveats
Stack Usage
KOS functions run on the calling goroutine’s stack. Deep C call chains can overflow the 64KB stack:
// DANGEROUS: Unknown stack depth
func LoadLevel(path string) {
// fs_open -> iso9660_read -> g2_read -> ...
// How deep does this go?
}
Solutions:
- Call from the main goroutine (larger stack)
- Limit recursion depth in your code
- Move heavy I/O to loading screens
Blocking Calls
Some KOS functions block (file I/O, CD reads). During blocking:
- No other goroutines run (M:1 scheduler is blocked)
- Timers don’t fire
- The game freezes
// BAD: Blocks entire game for 200ms+
data := loadFile("/cd/level.dat")
// BETTER: Do during loading screen
showLoadingScreen()
data := loadFile("/cd/level.dat")
hideLoadingScreen()
// BEST: Stream data over multiple frames
go streamFile("/cd/level.dat", dataChan)
GBR Register
libgodc uses a global pointer for goroutine TLS, leaving GBR for KOS.
This means KOS _Thread_local variables work correctly.
If you’re writing assembly or using inline asm, don’t touch GBR—it’s reserved for KOS.
Building the kos Package
The kos/ directory contains the official bindings. To rebuild:
cd kos/
make clean
make
make install # Copies to $KOS_BASE/lib/
This produces:
- kos.gox — Export data for the Go compiler
- libkos.a — Compiled bindings for the linker
Adding New Bindings
Step 1: Find the C Declaration
grep -r "pvr_mem_reset" $KOS_BASE/include/
# Found in dc/pvr.h:
# void pvr_mem_reset(void);
Step 2: Write the Go Binding
//extern pvr_mem_reset
func PvrMemReset()
For functions with complex signatures, check the header carefully:
// From dc/pvr.h
int pvr_prim(void *data, size_t size);
//extern pvr_prim
func pvrPrim(data unsafe.Pointer, size uint32) int32
Step 3: Add Type-Safe Wrapper (Optional)
// For polygon headers (using helper function)
func PvrPrim(hdr *PvrPolyHdr) int32 {
return goPvrPrimHdr(unsafe.Pointer(hdr))
}
// For vertices (using helper function)
func PvrPrimVertex(v *PvrVertex) int32 {
return goPvrPrimVertex(unsafe.Pointer(v))
}
Note: For performance-critical paths like vertex submission, libgodc uses
specialized C helper functions (__go_pvr_prim_hdr, __go_pvr_prim_vertex)
that handle store queue operations efficiently.
Step 4: Add Stub
func PvrMemReset() {
panic("kos: not on Dreamcast")
}
Step 5: Rebuild
make clean && make && make install
Reference: KOS Subsystems
| Subsystem | Header | Prefix | Description |
|---|---|---|---|
| PVR | dc/pvr.h | pvr_ | PowerVR graphics |
| Maple | dc/maple.h | maple_ | Controllers, VMU, etc. |
| Sound | dc/sound/ | snd_ | AICA sound chip |
| Streaming | dc/snd_stream.h | snd_stream_ | Audio streaming |
| Filesystem | kos/fs.h | fs_ | File operations |
| Timer | arch/timer.h | timer_ | High-resolution timing |
| Video | dc/video.h | vid_ | Video modes |
| G2 Bus | dc/g2bus.h | g2_ | Bus transfers |
| CDROM | dc/cdrom.h | cdrom_ | CD access |
| VMU | dc/vmu_*.h | vmu_ | Visual Memory Unit |
| BFont | dc/biosfont.h | bfont_ | BIOS font rendering |
Example: Complete PVR Bindings
pvr.go
//go:build gccgo
package kos
import "unsafe"
// PvrPtr is a pointer to PVR video memory (VRAM)
type PvrPtr uintptr
// PVR list types
const (
PVR_LIST_OP_POLY uint32 = 0 // Opaque polygons
PVR_LIST_OP_MOD uint32 = 1 // Opaque modifiers
PVR_LIST_TR_POLY uint32 = 2 // Translucent polygons
PVR_LIST_TR_MOD uint32 = 3 // Translucent modifiers
PVR_LIST_PT_POLY uint32 = 4 // Punch-through polygons
)
// Initialization
//extern pvr_init_defaults
func PvrInitDefaults() int32
// Scene management
//extern pvr_scene_begin
func PvrSceneBegin()
//extern pvr_scene_finish
func PvrSceneFinish() int32
//extern pvr_wait_ready
func PvrWaitReady() int32
// List management
//extern pvr_list_begin
func PvrListBegin(list uint32) int32
//extern pvr_list_finish
func PvrListFinish() int32
// Primitive submission via helper functions
//extern __go_pvr_prim_hdr
func goPvrPrimHdr(data unsafe.Pointer) int32
//extern __go_pvr_prim_vertex
func goPvrPrimVertex(data unsafe.Pointer) int32
type PvrVertex struct {
Flags uint32
X, Y, Z float32
U, V float32
ARGB uint32
OARGB uint32
}
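// PvrPolyHdr is assumed to mirror KOS's pvr_poly_hdr_t from dc/pvr.h
// (eight 32-bit words, 32 bytes); verify against your KOS version.
type PvrPolyHdr struct {
    Cmd                 uint32
    Mode1, Mode2, Mode3 uint32
    D1, D2, D3, D4      uint32
}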
// PvrPrim submits a polygon header
func PvrPrim(hdr *PvrPolyHdr) int32 {
return goPvrPrimHdr(unsafe.Pointer(hdr))
}
// PvrPrimVertex submits a vertex
func PvrPrimVertex(v *PvrVertex) int32 {
return goPvrPrimVertex(unsafe.Pointer(v))
}
// Memory management
//extern pvr_mem_malloc
func PvrMemMalloc(size uint32) PvrPtr
//extern pvr_mem_free
func PvrMemFree(ptr PvrPtr)
//extern pvr_mem_available
func PvrMemAvailable() uint32
pvr_stub.go
//go:build !gccgo
package kos
type PvrPtr uintptr
const (
PVR_LIST_OP_POLY uint32 = 0
PVR_LIST_OP_MOD uint32 = 1
PVR_LIST_TR_POLY uint32 = 2
PVR_LIST_TR_MOD uint32 = 3
PVR_LIST_PT_POLY uint32 = 4
)
type PvrVertex struct {
Flags uint32
X, Y, Z float32
U, V float32
ARGB uint32
OARGB uint32
}
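// Host-side mirror of PvrPolyHdr (assumed layout; see pvr.go above).
type PvrPolyHdr struct {
    Cmd                 uint32
    Mode1, Mode2, Mode3 uint32
    D1, D2, D3, D4      uint32
}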
func PvrInitDefaults() int32 { panic("kos: not on Dreamcast") }
func PvrSceneBegin() { panic("kos: not on Dreamcast") }
func PvrSceneFinish() int32 { panic("kos: not on Dreamcast") }
func PvrWaitReady() int32 { panic("kos: not on Dreamcast") }
func PvrListBegin(list uint32) int32 { panic("kos: not on Dreamcast") }
func PvrListFinish() int32 { panic("kos: not on Dreamcast") }
func PvrPrim(hdr *PvrPolyHdr) int32 { panic("kos: not on Dreamcast") }
func PvrPrimVertex(v *PvrVertex) int32 { panic("kos: not on Dreamcast") }
func PvrMemMalloc(size uint32) PvrPtr { panic("kos: not on Dreamcast") }
func PvrMemFree(ptr PvrPtr) { panic("kos: not on Dreamcast") }
func PvrMemAvailable() uint32 { panic("kos: not on Dreamcast") }
Usage in Games
package main
import "kos"
func main() {
kos.PvrInitDefaults()
for {
kos.PvrWaitReady()
kos.PvrSceneBegin()
kos.PvrListBegin(kos.PVR_LIST_OP_POLY)
drawOpaqueGeometry()
kos.PvrListFinish()
kos.PvrListBegin(kos.PVR_LIST_TR_POLY)
drawTranslucentGeometry()
kos.PvrListFinish()
kos.PvrSceneFinish()
}
}
func drawOpaqueGeometry() {
// First submit a polygon header
var hdr kos.PvrPolyHdr
var ctx kos.PvrPolyCxt
kos.PvrPolyCxtCol(&ctx, kos.PVR_LIST_OP_POLY)
kos.PvrPolyCompile(&hdr, &ctx)
kos.PvrPrim(&hdr)
// Then submit vertices
v := kos.PvrVertex{
Flags: kos.PVR_CMD_VERTEX_EOL, // End of strip
X: 320, Y: 240, Z: 1,
ARGB: 0xffffffff,
}
kos.PvrPrimVertex(&v)
}
Further Reading
- Design — Runtime architecture
- KOS Documentation — Full API reference
- PVR Tutorial — Graphics programming
- examples/ — Working code samples
Limitations
This document describes the known limitations of libgodc. Understanding these is essential for writing reliable Dreamcast Go programs.
Memory
16MB Total
The Dreamcast has 16MB of RAM. No virtual memory, no swap, no second chance.
Budget your memory:
- KOS + drivers: ~1MB
- Your code: ~1-3MB
- GC heap: 2MB active (4MB total, two semispaces)
- Goroutine stacks: 64KB each
- Everything else: KOS malloc
When you run out, you crash.
Goroutine Memory Overhead
Dead goroutines retain approximately 160 bytes each (G struct only). The stack memory and TLS are properly reclaimed, and the G struct is kept in a free list for reuse by future goroutines.
Why the free list? Reusing G structs avoids repeated malloc/free overhead. When you spawn a new goroutine, it reuses a G from the free list if available.
Impact: If you spawn 10,000 goroutines that all exit without spawning new ones, you’ll have ~1.6MB in the free list. This memory is reused when you spawn new goroutines. For a typical game session, this is rarely a problem if you design with long-lived goroutines.
Workaround: Prefer long-lived goroutines or let the free list grow to a stable size. If you spawn and exit many goroutines, the G structs accumulate in the free list but are reused:
// GOOD: Fixed set of long-lived goroutines
go audioHandler() // Lives for entire game
go inputPoller() // Lives for entire game
go gameLoop() // Lives for entire game
// OK: Spawning goroutines per-event (G structs are reused)
for event := range events {
go handleEvent(event) // ~160B stays in free list for reuse
}
GC Pause Times
The garbage collector stops the world during collection. Pause times depend on live heap size:
| Live Heap | Pause |
|---|---|
| 100KB | 1-2ms |
| 500KB | 5-10ms |
| 1MB | 10-20ms |
At 60fps, you have 16.6ms per frame. A 10ms GC pause causes visible stutter.
Workarounds:
- Keep the live heap small (<500KB)
- Disable automatic GC for action sequences:
debug.SetGCPercent(-1) // Disable automatic GC
runtime.GC()           // Manual GC during loading screens
- Use KOS malloc for large, long-lived data (textures, audio, levels)
Fixed 64KB Stacks
Goroutine stacks do not grow. Each goroutine gets exactly 64KB.
This limits recursion depth:
| Frame Size | Safe Depth |
|---|---|
| 50 bytes | ~300 |
| 100 bytes | ~150 |
| 250 bytes | ~60 |
| 500 bytes | ~30 |
Workarounds:
- Convert recursion to iteration
- Use smaller local variables
- Pass large data by pointer, not by value
- Avoid deep call chains
// BAD: Large local arrays
func processLevel(depth int) {
var buffer [4096]byte // 4KB per stack frame!
// ... recursive call
}
// GOOD: Heap allocation for large buffers
func processLevel(depth int) {
buffer := make([]byte, 4096) // GC heap
// ... recursive call
}
Scheduling
No Parallelism (M:1)
All goroutines run on a single thread. The go keyword provides concurrency
(interleaved execution), not parallelism (simultaneous execution).
There is no benefit from GOMAXPROCS—the Dreamcast has one CPU core.
No Preemption
Goroutines yield only at explicit points:
- Channel operations
- runtime.Gosched()
- time.Sleep()
- Timer operations
A goroutine in a tight loop blocks all other goroutines:
// BAD: Blocks entire system
for {
calculateNextFrame() // Never yields!
}
// GOOD: Explicit yield
for {
calculateNextFrame()
runtime.Gosched() // Let others run
}
Channel Lock Contention
Under high contention, channel locks use spin-yield loops, so many goroutines racing for the same channel waste CPU.
Workaround: Use buffered channels to reduce contention:
// Unbuffered: every send/receive contends
events := make(chan Event)
// Buffered: reduced contention
events := make(chan Event, 16)
Language Features
Not Implemented
- Race detector
- CPU/memory profiling
- Debugger support (delve, gdb)
- Plugin package
- cgo (use KOS C functions directly via //extern)
Limited Implementation
- reflect: Basic type inspection only. No reflect.MakeFunc.
- unsafe: Works, but remember pointers are 4 bytes.
- sync: Mutexes work, but see M:1 scheduling caveat—no goroutine runs while you hold a lock, so deadlock is impossible but starvation is easy.
Unrecoverable Runtime Panics
User panic() is recoverable via the runtime_checkpoint() pattern (standard recover() does not work—see Effective Dreamcast Go). Runtime panics are not:
- Nil pointer dereference
- Array/slice bounds check
- Integer divide by zero
- Stack overflow
These crash immediately. There is no recovery.
Why? A bounds check failure means your program’s invariants are violated. Continuing would corrupt data. It’s better to crash cleanly.
Platform Constraints
32-bit Pointers
All pointers are 4 bytes. Code assuming 64-bit pointers will break:
// BAD: Assumes 64-bit
type Header struct {
flags uint32
ptr uintptr // 4 bytes on Dreamcast, not 8!
size uint32
}
Single-Precision FPU
The SH-4 FPU operates in single precision (-m4-single). Double precision
operations are emulated in software—extremely slow.
// FAST: Single precision
var x float32 = 3.14
// SLOW: Software emulation
var y float64 = 3.14159265358979
Avoid float64 in hot paths. The compiler flag -m4-single makes all FPU
operations single precision, but libraries may still use doubles.
Cache Coherency
The SH-4 has separate instruction and data caches. DMA operations require explicit cache management using KOS functions:
// Before DMA write (CPU -> hardware):
dcache_flush_range((uintptr_t)ptr, size); // Flush data cache
// After DMA read (hardware -> CPU):
dcache_inval_range((uintptr_t)ptr, size); // Invalidate data cache
The GC handles cache management for semi-space flips via incremental invalidation, but your DMA code must handle it explicitly using KOS cache functions.
No Signals
There are no Unix signals. os.Signal, signal.Notify, etc. don’t work.
Use KOS’s interrupt handlers or polling instead.
No Networking (by default)
Networking requires a Broadband Adapter (BBA) or modem. Most Dreamcast units don’t have one. Design your game to work offline.
Debugging
Available
- Serial output via println() (routed to dc-tool)
- LIBGODC_ERROR / LIBGODC_CRITICAL macros (defined in runtime.h)
- GC statistics via the C function gc_stats(&used, &total, &collections)
- runtime.NumGoroutine() to count active goroutines
- KOS debug console (dbglog())
Not Available
- Stack traces on panic (limited)
- Core dumps
- Breakpoints
- Variable inspection
- Heap profiling
When something goes wrong, you have println() and your brain. Use them.
Compatibility
gccgo Only
This runtime is for gccgo (GCC’s Go frontend), not the standard gc compiler.
Code compiled with go build will not work. Use sh-elf-gccgo.
KallistiOS Required
libgodc requires KallistiOS. It won’t work with other Dreamcast development libraries.
SH-4 Architecture Only
This code is specifically for the Hitachi SH-4 CPU. It won’t run on other architectures.
Summary
| Limitation | Impact | Workaround |
|---|---|---|
| G struct pooling | ~160B per dead goroutine | Long-lived goroutines |
| GC pauses | 1-20ms depending on heap | Small heap, manual GC timing |
| M:1 scheduling | No parallelism | Explicit yields |
| Fixed stacks | Limited recursion | Iteration, smaller frames |
| No preemption | Tight loops block all | runtime.Gosched() |
| Runtime panics | Unrecoverable | Defensive coding |
| 16MB RAM | Memory pressure | Monitor usage, plan carefully |
For typical Dreamcast games—15-60 minute sessions with a fixed goroutine architecture—these limitations are manageable. Design with constraints in mind from the start, and you’ll have a runtime that’s simple, fast, and reliable.
Glossary
Quick reference for terms used throughout this documentation.
Runtime Terms
Bump Allocator
An allocation strategy where memory is allocated by simply incrementing a pointer. O(1) allocation, but cannot free individual objects. libgodc uses this for the GC heap.
Cheney’s Algorithm
A garbage collection algorithm that copies live objects from one semispace to another using two pointers (scan and alloc). Named after C.J. Cheney who invented it in 1970.
Context Switch
Saving one goroutine’s CPU registers and loading another’s, allowing multiple goroutines to share a single CPU. On SH4, this involves saving 64 bytes of state.
Cooperative Scheduling
A scheduling model where goroutines must voluntarily yield control. Contrast with preemptive scheduling where the runtime can interrupt goroutines at any time.
Forwarding Pointer
During garbage collection, a pointer left in an object’s old location that points to its new location. Prevents copying the same object twice.
G (Goroutine Struct)
The data structure representing a goroutine. Contains stack bounds, saved CPU context, defer chain, panic state, and scheduling information.
GC Heap
The memory region managed by the garbage collector. In libgodc, this is 4MB total (two 2MB semispaces), with 2MB usable at any time.
hchan
The internal structure representing a Go channel. Contains the buffer, send/receive indices, and wait queues.
M:1 Model
A threading model where many goroutines (M) run on one OS thread (1). All goroutines share a single CPU, providing concurrency but not parallelism.
Root
A starting point for garbage collection tracing. Roots include global variables, stack variables, and CPU registers that contain pointers.
Run Queue
A list of goroutines that are ready to execute. The scheduler picks goroutines from this queue.
Semispace Collector
A garbage collector that divides memory into two equal halves. Objects are allocated in one half; during collection, live objects are copied to the other half.
Stop the World
A GC phase where all program execution pauses while the collector runs. libgodc uses stop-the-world collection exclusively.
Sudog
“Sender/receiver descriptor”: a structure representing a goroutine waiting on a channel operation. Contains pointers to the goroutine, the channel, and the data being transferred.
TLS (Thread-Local Storage)
Per-goroutine storage. In libgodc, each goroutine has its own TLS block containing runtime state.
Type Descriptor
Compiler-generated metadata about a Go type, including size, alignment, hash, and a bitmap indicating which fields contain pointers.
Hardware Terms
AICA
The Dreamcast’s sound processor. An ARM7-based chip with 2MB of dedicated sound RAM. Runs independently of the SH4 CPU.
Cache Line
The unit of data transfer between cache and main memory. 32 bytes on SH4. Accessing one byte loads the entire cache line.
GBR (Global Base Register)
An SH4 register reserved for thread-local storage in KallistiOS. libgodc does not use GBR for goroutine TLS.
KallistiOS (KOS)
The standard open-source SDK for Dreamcast homebrew development. Provides hardware abstraction, memory management, and drivers. It’s pronounced “kay-oss”, like the word “chaos”.
PowerVR2
The Dreamcast’s GPU. A tile-based deferred renderer with 8MB of dedicated VRAM.
SH4
The Hitachi (now Renesas) SuperH-4 (SH4) processor used in the Dreamcast. 200MHz, 32-bit, little-endian, with an FPU optimized for single-precision math.
VRAM
Video RAM. 8MB dedicated to the PowerVR2 GPU for textures and framebuffers. Allocated via PvrMemMalloc(), not the GC.
Go Terms
//extern
A gccgo directive that declares a function implemented in C. Allows Go code to call KOS functions directly.
Escape Analysis
Compiler analysis that determines whether a variable can stay on the stack or must be allocated on the heap.
gccgo
The GCC frontend for Go. Uses GCC’s backend for code generation, supporting architectures like SH4 that the standard Go compiler doesn’t support.
Interface
A Go type that specifies a set of methods. Variables of interface type can hold any value that implements those methods.
libgo
The standard gccgo runtime library. libgodc replaces this with a Dreamcast-specific implementation.
Slice Header
The 12-byte structure representing a Go slice: a pointer to the backing array, length, and capacity.
String Header
The 8-byte structure representing a Go string: a pointer to the character data and length.
Abbreviations
| Abbr | Full Form | Meaning |
|---|---|---|
| ABI | Application Binary Interface | How functions pass arguments and return values |
| BBA | Broadband Adapter | Dreamcast network adapter (10/100 Ethernet) |
| DMA | Direct Memory Access | Hardware-to-hardware memory transfer without CPU |
| FPU | Floating Point Unit | CPU component for floating-point math |
| GC | Garbage Collector | Automatic memory management system |
| KB | Kilobyte | 1,024 bytes |
| MB | Megabyte | 1,048,576 bytes |
| MMU | Memory Management Unit | Hardware for virtual memory (Dreamcast doesn’t have one) |
| PC | Program Counter | CPU register pointing to current instruction |
| PR | Procedure Register | SH4 register holding return address |
| SP | Stack Pointer | CPU register pointing to top of stack |
| TA | Tile Accelerator | PowerVR2 component that processes geometry |
| TLS | Thread-Local Storage | Per-thread/goroutine private data |
| VMU | Visual Memory Unit | Dreamcast memory card with LCD screen |
Performance Numbers
Reference benchmarks from real Dreamcast hardware (200MHz SH4).
Verified using tests/bench_architecture.elf:
| Operation | Time | Notes |
|---|---|---|
| runtime.Gosched() | 120 ns | Minimal yield |
| Direct function call | 140 ns | Baseline comparison |
| Buffered channel op | 1,459 ns | ~1.5 μs |
| Context switch | 6,634 ns | ~6.6 μs, full register save/restore |
| Unbuffered channel roundtrip | 12,782 ns | ~13 μs, send + receive |
| Goroutine spawn + run | 33,659 ns | ~34 μs, 240× overhead vs direct call |
GC Pause Times
| Scenario | Pause | Notes |
|---|---|---|
| Minimal/bypass (≥128 KB objects) | 73 μs | Objects bypass GC heap |
| 64 KB live data | 2,199 μs | ~2.2 ms |
| 32 KB live data | 6,172 μs | ~6.2 ms |
Note: Objects ≥64 KB bypass the GC heap and go directly to
malloc, hence the minimal pause. The 32 KB scenario with many small objects shows the highest pause because more objects must be scanned and copied.
Memory Configuration
| Parameter | Value |
|---|---|
| Goroutine stack | 64 KB |
| Context size | 64 bytes |
| GC header | 8 bytes |
| Large object threshold | 64 KB |
Run tests/bench_architecture.elf on your hardware to verify these numbers.
Acknowledgements
Kudos to:
- Ian Lance Taylor for writing gccgo.
- KallistiOS team for building and maintaining the Dreamcast SDK.
- Dreamcast homebrew community for keeping the console alive.
Without you, there would be no libgodc project.