Goroutines
The Trade-off
Let me set expectations: goroutines on Dreamcast work, but differently than on modern hardware.
You get zero parallelism (single CPU), but you get everything else: clean concurrency primitives, channels, and code that feels like Go.
Here’s the thing. Goroutines shine when you have multiple CPU cores:
Modern PC (8 cores):
────────────────────────────────────────────────────────────
Core 1: [──────goroutine A──────]
Core 2: [──────goroutine B──────]
Core 3: [──────goroutine C──────]
Core 4: [──────goroutine D──────]
...
            ↑
            All running SIMULTANEOUSLY
            4x faster than running them one-by-one!
But Dreamcast?
Dreamcast (1 core):
────────────────────────────────────────────────────────────
CPU: [───A───][───B───][───A───][───C───][───B───]...
              ↑
              Only ONE runs at a time
              ZERO parallelism benefit
So why does libgodc implement them?
Why Bother?
Because Go without goroutines isn’t Go.
Imagine porting Python to a machine without lists. Or JavaScript without callbacks. You could do it, but would it feel like the same language?
I wanted Go on Dreamcast to feel like Go. You can write:
go processEnemies()
go playBackgroundMusic()
go handleInput()
It works. It’s correct. The code is cleaner. It’s just not faster than calling them directly:
processEnemies()
playBackgroundMusic()
handleInput()
There’s overhead—but less than you might expect. Let’s see the numbers.
What Happens Under the Hood
When you create a goroutine, here’s what actually happens:
┌─────────────────────────────────────────────────────────────┐
│  go doSomething()                                           │
│  ────────────────                                           │
│                                                             │
│  1. Allocate 64 KB stack (from pool or malloc)              │
│  2. Initialize G struct (~150 bytes)                        │
│  3. Save 16 CPU registers to context                        │
│  4. Set up context (sp, pc, pr)                             │
│  5. Add to run queue                                        │
│  6. Later: context switch to run (~6.6 μs)                  │
│  ─────────────────────────────────────────────────          │
│  Total spawn + first run: ~32 μs                            │
│                                                             │
│  That's ~6,400 CPU cycles per goroutine spawn!              │
└─────────────────────────────────────────────────────────────┘
What do you get for this overhead? On a multi-core system: parallelism. On Dreamcast: proper Go semantics and working concurrency primitives. That’s actually worth something!
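If it helps to see those six steps as code, here is a minimal sketch of a spawn path. It is an illustration, not libgodc's actual source: stack_alloc, context_init, runq_put, and the exact field names are assumptions, and the G struct is the one described in runtime/goroutine.h.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of the spawn steps above; the real names
   live in the libgodc runtime and may differ. */
G *goroutine_create(void (*fn)(void *), void *arg) {
    G *gp = calloc(1, sizeof(G));             // step 2: ~150-byte G struct
    gp->stack_lo = (uintptr_t)stack_alloc();  // step 1: 64 KB, pool or malloc
    gp->stack_hi = gp->stack_lo + 64 * 1024;
    context_init(&gp->context, fn, arg,       // steps 3-4: pc = fn,
                 gp->stack_hi);               //            sp = stack top
    gp->atomicstatus = Grunnable;
    runq_put(gp);                             // step 5: join the run queue
    return gp;                                // step 6: scheduler runs it later
}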
The Numbers
I ran benchmarks on real Dreamcast hardware (from bench_architecture.elf):
┌─────────────────────────────────────────────────────────────┐
│  OPERATION                       TIME                       │
├─────────────────────────────────────────────────────────────┤
│  runtime.Gosched()               120 ns  ← very cheap!      │
│  Buffered channel op             ~1.5 μs                    │
│  Context switch                  ~6.6 μs                    │
│  Channel round-trip              ~13 μs                     │
│  Goroutine spawn+run             ~34 μs                     │
└─────────────────────────────────────────────────────────────┘
At 200 MHz, you get about 200 million cycles per second. At 60 FPS you have 3.3 million cycles per frame. A 34 μs goroutine spawn is ~6,800 cycles—that’s only 0.2% of your frame budget. You can afford a few goroutines per frame, just don’t spawn hundreds!
See the Glossary for a complete reference of all benchmark numbers.
How It Works
The implementation is pretty elegant for a 200 MHz machine. Let’s see how we create the illusion of concurrency.
The G Struct
Every goroutine is a G structure (see runtime/goroutine.h):
┌─────────────────────────────────────────────────────────────┐
│ Goroutine (G) │
│ │
│ _panic: nil (current panic - offset 0) │
│ _defer: nil (deferred functions - offset 4) │
│ atomicstatus: Grunning (or Gwaiting, Grunnable, etc.) │
│ schedlink: next G (run queue linkage) │
│ stack_lo: 0x8c100000 (bottom of stack) │
│ stack_hi: 0x8c110000 (top of stack, 64 KB above) │
│ context: saved CPU registers (64 bytes) │
│ ├── r8-r14 (callee-saved GPRs) │
│ ├── sp, pc, pr (special) │
│ └── fr12-fr15, fpscr, fpul (FPU) │
│ goid: 42 (unique ID - 8 bytes) │
│ waiting: sudog* (channel wait queue entry) │
│ checkpoint: ptr (for panic/recover) │
│ │
└─────────────────────────────────────────────────────────────┘
The key is context, aka the saved CPU registers. This lets us pause mid-function and resume later.
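In C, the diagram above corresponds to something like the following sketch. The field names follow the diagram, but the authoritative layout is in runtime/goroutine.h, so treat the types and ordering here as assumptions:

#include <stdint.h>

typedef struct Context {
    uint32_t gpr[7];            /* r8-r14, callee-saved GPRs       */
    uint32_t sp, pc, pr;        /* stack pointer, resume PC, link  */
    uint32_t fpr[4];            /* fr12-fr15                       */
    uint32_t fpscr, fpul;       /* FPU control/transfer registers  */
} Context;                      /* 16 registers x 4 bytes = 64     */

typedef struct G G;
struct G {
    void     *_panic;           /* offset 0: current panic          */
    void     *_defer;           /* offset 4: deferred functions     */
    uint32_t  atomicstatus;     /* Grunning, Grunnable, Gwaiting... */
    G        *schedlink;        /* next G in the run queue          */
    uintptr_t stack_lo;         /* bottom of the 64 KB stack        */
    uintptr_t stack_hi;         /* top of the stack                 */
    Context   context;          /* the 64 bytes of saved registers  */
    uint64_t  goid;             /* unique ID, 8 bytes               */
    struct sudog *waiting;      /* channel wait queue entry         */
    void     *checkpoint;       /* for panic/recover                */
};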
The Run Queue
Runnable goroutines wait in line:
 head                             tail
  ↓                                ↓
┌────┐     ┌────┐     ┌────┐     ┌────┐
│ G3 │ ──▶ │ G7 │ ──▶ │ G2 │ ──▶ │ G9 │ ──▶ NULL
└────┘     └────┘     └────┘     └────┘
  ↑
  "I'm next!"
The scheduler is simple:
while (true) {
    G *gp = runq_get();    // Get next runnable goroutine
    if (gp) {
        switch_to(gp);     // Run it until it yields
    }
    // When it yields, we come back here
}
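The run queue itself can be as simple as a singly linked list threaded through the schedlink field. Here is one plausible sketch of runq_put/runq_get; the real implementation may differ:

static G *runq_head, *runq_tail;   /* linked via G.schedlink */

static void runq_put(G *gp) {      /* append at the tail */
    gp->schedlink = NULL;
    if (runq_tail)
        runq_tail->schedlink = gp;
    else
        runq_head = gp;
    runq_tail = gp;
}

static G *runq_get(void) {         /* pop from the head */
    G *gp = runq_head;
    if (gp) {
        runq_head = gp->schedlink;
        if (!runq_head)
            runq_tail = NULL;
    }
    return gp;
}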
Context Switching
This is where the magic happens. We’re running goroutine A, and we need to switch to B:
STEP 1: Save A's registers to A's context
────────────────────────────────────────────────────────
     CPU                              A's Context
┌──────────┐                      ┌──────────┐
│ r8 = 42  │ ───────────────────▶ │ r8 = 42  │
│ r9 = 17  │                      │ r9 = 17  │
│ sp = X   │                      │ sp = X   │
│ pc = Y   │                      │ pc = Y   │
└──────────┘                      └──────────┘

STEP 2: Load B's registers from B's context
────────────────────────────────────────────────────────
 B's Context                          CPU
┌──────────┐                      ┌──────────┐
│ r8 = 99  │ ───────────────────▶ │ r8 = 99  │
│ r9 = 55  │                      │ r9 = 55  │
│ sp = P   │                      │ sp = P   │
│ pc = Q   │                      │ pc = Q   │
└──────────┘                      └──────────┘

STEP 3: Return (now running B!)
────────────────────────────────────────────────────────
CPU continues from B's saved PC with B's saved registers.
To B, it's like it never stopped running!
On SH-4, we save/restore 16 registers (64 bytes). The full context switch with FPU takes ~88 cycles. With lazy FPU optimization (skipping FPU for integer-only goroutines), it drops to ~38 cycles. At 200 MHz, that’s under 0.5 microseconds—the total yield path including scheduler overhead is ~6.6 μs as shown in the benchmarks.
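Putting the pieces together, a cooperative yield might look like the sketch below. Here current_g, context_save, and context_load are assumed, setjmp-style names standing in for the SH-4 assembly that saves and restores the registers shown above; this is not libgodc's actual API.

/* Sketch of a cooperative yield, assuming the runq helpers above. */
void yield(void) {
    G *old = current_g;
    runq_put(old);                         /* back of the line        */
    G *next = runq_get();
    if (next == old)
        return;                            /* nothing else to run     */
    current_g = next;
    if (context_save(&old->context) == 0)  /* 0 = first return        */
        context_load(&next->context);      /* resumes next; no return */
    /* When another goroutine switches back to old, context_save
       "returns" a second time and execution continues here. */
}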
Cooperative Scheduling: The Gotcha
Our scheduler is cooperative, not preemptive. This is different from official Go!
Preemptive (official Go since 1.14): The runtime can forcibly pause a goroutine at any time using timer interrupts or signals. Even an infinite loop gets interrupted so other goroutines can run.
Cooperative (libgodc): Goroutines must volunteer to give up the CPU. The runtime never forces a switch. If a goroutine doesn’t yield, nothing else runs.
Why the difference? Preemptive scheduling requires:
- Signal handlers or timer interrupts to interrupt running code
- Complex stack inspection to find safe preemption points
- More saved state per context switch
On Dreamcast, we keep it simple. The cost is that you must be careful:
// This freezes your Dreamcast (but works fine in official Go!):
func badGoroutine() {
    x := 0
    for {
        x++ // Infinite loop, never yields
    }
}
Where Goroutines Yield
┌─────────────────────────────────────────────────────────────┐
│  YIELDS (lets others run)          DOESN'T YIELD            │
├─────────────────────────────────────────────────────────────┤
│  ✓ Channel send: ch <- x           ✗ Math: x + y * z        │
│  ✓ Channel receive: <-ch           ✗ Memory: array[i]       │
│  ✓ time.Sleep()                    ✗ Loops: for i := ...    │
│  ✓ runtime.Gosched()                                        │
│  ✓ select {}                                                │
└─────────────────────────────────────────────────────────────┘
The Fix for Long Computations
// Bad: No yields for 10 million iterations
for i := 0; i < 10000000; i++ {
    result += compute(i)
}

// Good: Yield periodically
for i := 0; i < 10000000; i++ {
    result += compute(i)
    if i%10000 == 0 {
        runtime.Gosched() // Let others run
    }
}
Note: if you have a single long computation with no natural yield points, a direct function call is simpler. Goroutines shine when you have multiple things that can interleave.
When Goroutines Shine
Goroutines work well for several patterns. Here’s real benchmark data from bench_goroutine_usecase.elf:
┌─────────────────────────────────────────────────────────────┐
│  USE CASE                    OVERHEAD     VERDICT           │
├─────────────────────────────────────────────────────────────┤
│  Multiple independent tasks  10-38%       ✓ Acceptable      │
│  Producer-consumer pattern   ~163%        ⚠ Use carefully   │
│  Channel ping-pong           ~13 μs/op    Know the cost     │
└─────────────────────────────────────────────────────────────┘
The key insight: independent tasks (each goroutine does its own work, minimal channel communication) have reasonable overhead (typically ~25%, varies with scheduling). Heavy channel use (producer-consumer with many sends) costs ~163%.
Porting Existing Go Code
If you’re porting Go code that uses goroutines, it works without modification:
// This Go code just works:
func fetch(urls []string) []Result {
    ch := make(chan Result, len(urls))
    for _, url := range urls {
        go func(u string) {
            ch <- download(u)
        }(url)
    }
    // Collect one result per URL (each receive yields)
    results := make([]Result, 0, len(urls))
    for range urls {
        results = append(results, <-ch)
    }
    return results
}
Patterns to Avoid
Some patterns don’t make sense on a single-core system:
Don’t: Spawn Per-Item
// Inefficient: 1000 spawns = 32 ms overhead
for i := 0; i < 1000; i++ {
    go process(items[i])
}

// Better: Process directly, or use one goroutine
for i := 0; i < 1000; i++ {
    process(items[i])
}
Don’t: Force Sequential With Channels
// Overcomplicated: These are sequential anyway
go step1() // signals done1 when finished
<-done1
go step2() // signals done2 when finished
<-done2

// Simpler:
step1()
step2()
Be Careful: Heavy Channel Traffic
// A channel round-trip costs ~13 μs
// High-volume producer-consumer shows ~163% overhead
for item := range items {
    workChan <- item
}
For high-throughput paths, batch items or use direct calls.