This post will show you how to use D to write a bare-metal “Hello world” program that targets the RISC-V QEMU simulator. In a future blog post (now available) we’ll build on this to target actual hardware: the VisionFive 2 SBC. See blog-code for the final code from this post. For a more complex example, see Multiplix, an operating system I am developing that runs on the VisionFive 2 (and Raspberry Pis).
Why D?
Recently I’ve been writing bare-metal code in C, and I’ve become a bit frustrated with the lack of features that C provides. I started searching for a good replacement, and revisited D (a language I used for a project a few years ago). It turns out D has introduced a mode called betterC [1] (sounds exactly like what I want), which essentially disables all language features that require the D runtime. This makes it roughly as easy to use D for bare-metal programming as C. You don’t get all the features of D, but you get enough to cover everything I want (in fact, for systems programming I prefer the betterC subset of D over full D). D in betterC mode retains the feel of C, so going forward I think I will be using it in all situations where I would otherwise have used C (even non-bare-metal ones).
Here are the positives about D I value most:
- A decent import system (no more header files and #include).
- Automatic bounds checking, and bounded strings and arrays.
- Methods in structs.
- Compile-time code evaluation (run D code at compile-time!).
- Powerful templating and generics.
- Iterators.
- Default support for thread-local storage.
- Scope guards and RAII.
- Some memory safety protections with @safe.
- A fairly comprehensive and readable online specification.
- An active Discord channel with people that answer my questions in minutes.
- Both an LLVM-based compiler (LDC) and a GNU compiler (GDC), which is officially part of the GCC project.
- And these compilers both export roughly the same flags and intrinsics as Clang and GCC respectively.
These features, combined with the lack of a runtime and the C-like feel of the language (making it easy to port previous code), make it a no-brainer for me to have D as the base choice for any project where I would otherwise use C.
Installing the toolchain
Now that I’ve told you about my reasons for choosing D, let’s try using it to write a bare-metal application that targets RISC-V. If you want to follow along, the first step is to download the toolchain (the following tools should work on Linux or macOS). You’ll need three different components:
- LDC 1.30 (the LLVM-based D compiler). Can be downloaded from GitHub. Make sure to use version 1.30 [2].
- A riscv64-unknown-elf GNU toolchain. Can be downloaded from SiFive’s Freedom Tools repository.
- The QEMU RISC-V simulator: qemu-system-riscv64. Can be downloaded from SiFive’s Freedom Tools repository, and is also usually available as part of your system’s QEMU package.
We’ll be using LDC since it ships with the ability to target riscv64. I have used GDC for bare-metal development as well, but it requires building a toolchain from source since nobody ships pre-built riscv64-unknown-elf-gdc binaries [3]. We’ll use the GNU toolchain for assembling, linking, and for other tools like objcopy and objdump, and QEMU for simulating the hardware.
With these installed you should be able to run:
$ ldc2 --version
LDC - the LLVM D compiler (1.30.0):
...
$ riscv64-unknown-elf-ld
riscv64-unknown-elf-ld: no input files
$ qemu-system-riscv64 -h
...
CPU entrypoint
We’re writing bare-metal code, so there’s no operating system, no console, no files – nothing. The CPU just starts executing instructions at a pre-specified address [4] after performing some initial setup. We’ll figure out what that address is later when we set up the linkerscript. For now we can just define the _start symbol as our entrypoint, and assume the linker will place the code at this label at the CPU entrypoint.
A D function requires a valid stack pointer, so before we can execute any D code we need to load the stack pointer register sp with a valid address.
Let’s make a file called start.s and put the following in it:
.section ".text.boot"
.globl _start
_start:
    la sp, _stack_start   # point sp at the top of the reserved stack
    call dstart           # jump into D code
_hlt:
    j _hlt                # spin forever if dstart ever returns
For now let’s assume _stack_start is a symbol with the address of a valid stack, and in the linkerscript we’ll set this up properly. After loading sp, we call a D function called dstart, defined in the next part.
D entrypoint
Now we can define our dstart function in dstart.d. For now we’ll just cause an infinite loop.
module dstart;
extern (C) void dstart() {
    while (1) {}
}
Linkerscript
Before we can compile this program we need a bit of linkerscript to tell the linker how our code should be laid out. We’ll need to specify the address where the text section should start (the entry address), and reserve space for all the data sections (.rodata, .data, .bss), and the stack.
Entry address
Today we’ll be targeting the QEMU virt RISC-V machine, so we have to figure out what its entrypoint is.
We can ask QEMU for a list of all devices in the virt machine by telling it to dump its device tree:
$ qemu-system-riscv64 -machine virt,dumpdtb=virt.dtb
$ dtc virt.dtb > virt.dts
In virt.dts you’ll find the following entry:
memory@80000000 {
    device_type = "memory";
    reg = <0x00 0x80000000 0x00 0x8000000>;
};
This means that RAM starts at address 0x80000000 (everything below is special memory or inaccessible). The CPU entrypoint for the virt machine is the first instruction in RAM, stored at 0x80000000.
In the linkerscript, we need to tell the linker that it should place the _start function at 0x80000000. We do this by telling it to put the .text.boot section first in the .text section, located at 0x80000000. Then we include the rest of the .text sections, followed by read-only data, writable data, and the BSS.
In link.ld:
ENTRY(_start)
SECTIONS
{
    .text 0x80000000 : {
        KEEP(*(.text.boot))
        *(.text*)
    }
    .rodata : {
        . = ALIGN(8);
        *(.rodata*)
        *(.srodata*)
        . = ALIGN(8);
    }
    .data : {
        . = ALIGN(8);
        *(.sdata*)
        *(.data*)
        . = ALIGN(8);
    }
    .bss : {
        . = ALIGN(8);
        _bss_start = .;
        *(.sbss*)
        *(.bss*)
        *(COMMON)
        . = ALIGN(8);
        _bss_end = .;
    }
    .kstack : {
        . = ALIGN(16);
        . += 4K;
        _stack_start = .;
    }
    /DISCARD/ : { *(.comment .note .eh_frame) }
}
What is the BSS?
The BSS is a region of memory that the compiler assumes is initialized to all zeroes. Usually the static data for a program is directly copied into the ELF executable – if you have a string "hello world" in your program, those exact bytes will live somewhere in the binary (in the read-only data section). However, a lot of static data is initialized to zero, so instead of putting those zero bytes directly into the ELF file, the linker lets us save space by making a special section (the BSS) that must be initialized to all zeroes at runtime, but won’t actually contain that data in the ELF file itself. So even if you have a giant 1MB array of zeroes, your ELF binary will be small because that section will be expanded into RAM only when the application starts.
Usually the OS sets up the BSS before it launches a program, but we’re running bare-metal, so we have to do that manually in the dstart function (in the next section). To make this initialization possible, we define the _bss_start and _bss_end symbols in the linkerscript. These are symbols whose addresses will be the start and end of the BSS section respectively.
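To make the distinction concrete, here is a minimal sketch of how a few D globals would typically be split between sections. The variable names are made up for illustration, and the section each one lands in is the usual expectation rather than something verified here; the disassembly listing is the way to confirm it. One D-specific wrinkle worth knowing is that not every type defaults to zero, so a default-initialized char array is not BSS material.

```d
// Illustrative globals (hypothetical names). __gshared keeps them out of
// thread-local storage, so they behave like ordinary C globals.
__gshared ubyte[1024 * 1024] scratch;    // all zeroes: expected in .bss, costs no file space
__gshared int bootCount;                 // int.init is 0: expected in .bss (or .sbss)
__gshared int magic = 0x1234;            // non-zero initializer: bytes stored in .data
immutable string banner = "hello world"; // string bytes stored in read-only data

// Unlike C, not every D type defaults to zero: char.init is 0xFF, so a
// default-initialized char array carries non-zero bytes and cannot live in the BSS.
__gshared char[16] name;
static assert(char.init == 0xFF);
static assert(int.init == 0);
```

On bare metal, reading scratch or bootCount before the BSS has been zeroed would return whatever happened to be in RAM, which is why the zeroing loop has to run before any other D code.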
Reserving space for the stack
We also reserve one page for the .kstack section and mark the _stack_start symbol to be located at the end of it (remember the stack grows down). The stack must be 16-byte aligned.
Compile!
Now we have everything we need to compile a basic bare-metal program.
$ ldc2 -Oz -betterC -mtriple=riscv64-unknown-elf -mattr=+m,+a,+c --code-model=medium -c dstart.d
$ riscv64-unknown-elf-as -mno-relax -march=rv64imac start.s -c -o start.o
$ riscv64-unknown-elf-ld -Tlink.ld start.o dstart.o -o prog.elf
Let’s look at some of these flags:
- Oz: optimize aggressively for size.
- betterC: enable betterC mode (disable the built-in D runtime).
- mtriple=riscv64-unknown-elf: build for the riscv64 bare-metal ELF target.
- mattr=+m,+a,+c: enable the following RISC-V extensions: m (multiply/divide), a (atomics), and c (compressed instructions).
- code-model=medium: code models in RISC-V control how pointers to far away locations are constructed. The medium code model (also called medany) allows us to address any symbol located within 2 GiB of the current address, and is recommended for 64-bit programs. See the SiFive post for more information.
- mno-relax: disables linker relaxation in the assembler (it is already disabled by default in LDC). Linker relaxation is a RISC-V-specific optimization that allows the linker to make use of the gp (global pointer) register. I explain it in more detail in the linker relaxation section.
It’s going to get tedious to type out these commands repeatedly, so let’s create a Makefile [5] (or a Knitfile if you’re cool):
SRC=$(wildcard *.d)
OBJ=$(SRC:.d=.o)
all: prog.bin
%.o: %.d
	ldc2 -Oz -betterC -mtriple=riscv64-unknown-elf -mattr=+m,+a,+c --code-model=medium --makedeps=$*.dep $< -c -of $@
%.o: %.s
	riscv64-unknown-elf-as -mno-relax -march=rv64imac $< -c -o $@
prog.elf: start.o $(OBJ)
	riscv64-unknown-elf-ld -Tlink.ld $^ -o $@
%.bin: %.elf
	riscv64-unknown-elf-objcopy $< -O binary $@
%.list: %.elf
	riscv64-unknown-elf-objdump -D $< > $@
run: prog.bin
	qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
clean:
	rm -f *.bin *.list *.o *.elf *.dep
-include *.dep
and compile with
$ make prog.bin
This file is a raw dump of our program. At this point it clocks in at a whopping 22 bytes.
To see the disassembled program, run
$ make prog.list
...
$ cat prog.list
prog.elf: file format elf64-littleriscv
Disassembly of section .text:
0000000080000000 <_start>:
80000000: 00001117 auipc sp,0x1
80000004: 02010113 addi sp,sp,32 # 80001020 <_stack_start>
80000008: 00000097 auipc ra,0x0
8000000c: 00c080e7 jalr 12(ra) # 80000014 <dstart>
0000000080000010 <_hlt>:
80000010: a001 j 80000010 <_hlt>
...
0000000080000014 <dstart>:
80000014: a001 j 80000014 <dstart>
Looks like our _start function is being linked properly at 0x80000000 and has the expected assembly!
If you try to run with
$ make run
qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
it will just enter an infinite loop (press Ctrl-A, then X, to quit QEMU). We still have a bit more work to do before we get output.
More setup: initializing the BSS
Now let’s modify dstart to initialize the BSS. We need to declare some extern variables so that the linker symbols _bss_start and _bss_end are available to our D code. Then we can just loop from _bss_start to _bss_end and assign all the bytes in that range to zero. Once complete, our BSS is initialized and we can run arbitrary [6] D code (using globals that may be initialized to zero).
extern (C) {
    // Linker-defined symbols: only their addresses are meaningful.
    extern __gshared uint _bss_start, _bss_end;
    void dstart() {
        // Zero the BSS one 32-bit word at a time (the linkerscript
        // aligns both ends of the section to 8 bytes).
        uint* bss = &_bss_start;
        uint* bss_end = &_bss_end;
        while (bss < bss_end) {
            *bss++ = 0;
        }
        import main;
        kmain();
    }
}
And in main.d we have our bare-metal main entrypoint:
module main;
void kmain() {}
Creating a minimal D runtime
Several D language features are unavailable because of our lack of runtime. For example, types such as string and size_t are undefined, and we can’t use assertions (we’ll get to those later). The first step to creating a minimal runtime is to create an object.d file. The D compiler will search for this special file and import it automatically everywhere. So we can create definitions for types like string and size_t here. Here is the minimal definition I like to use, which also defines ptrdiff_t, noreturn, and uintptr.
module object;
alias string = immutable(char)[];
alias size_t = typeof(int.sizeof);
alias ptrdiff_t = typeof(cast(void*) 0 - cast(void*) 0);
alias noreturn = typeof(*null);
static if ((void*).sizeof == 8) {
    alias uintptr = ulong;
} else static if ((void*).sizeof == 4) {
    alias uintptr = uint;
} else {
    static assert(0, "pointer size must be 4 or 8 bytes");
}
Writing to the UART device
Most systems have a UART device. Generally how this works is that you write a byte to a special place in memory [7], and that byte will be transmitted using the UART protocol over some pins on the board. In order to read the bytes with your host computer you need a UART to USB adapter plugged into your host, and then you can read from the corresponding device file (usually /dev/ttyUSB0) on the host computer. Today we’ll just be simulating our bare-metal code in QEMU, so you don’t need to have a special adapter. QEMU will emulate a UART device and print out the bytes written to its transmit register.
Enabling volatile loads/stores
When writing to device memory it is important to ensure that the compiler does not remove our loads/stores. For example, if a device is located at 0x10000000, we might write directly to that address by casting the integer to a pointer. To the compiler, it just looks like we are writing to random addresses, which might be undefined behavior or result in dead code (e.g., if we never read the value back, the compiler may determine that it can eliminate the write). We need to inform the compiler that these reads/writes of device memory must be preserved and cannot be optimized out. D uses the volatileStore and volatileLoad intrinsics for this.
We can define these in our object.d:
pragma(LDC_intrinsic, "ldc.bitop.vld") ubyte volatileLoad(ubyte* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") ushort volatileLoad(ushort* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") uint volatileLoad(uint* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") ulong volatileLoad(ulong* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ubyte* ptr, ubyte value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ushort* ptr, ushort value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(uint* ptr, uint value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ulong* ptr, ulong value);
Controlling the UART
With that set up, let’s figure out where QEMU’s UART device is located in memory so we can write to it.
The QEMU virt machine defines a number of virtual devices, one of which is a UART device. Looking through the QEMU device tree again in virt.dts, you’ll see the following:
uart@10000000 {
    interrupts = <0x0a>;
    interrupt-parent = <0x03>;
    clock-frequency = <0x384000>;
    reg = <0x00 0x10000000 0x00 0x100>;
    compatible = "ns16550a";
};
This says that an ns16550a UART device exists at address 0x10000000.
On real hardware the UART would need to be properly initialized by writing some memory-mapped configuration registers (for setting up the baud rate and other options). However, the QEMU device does not require initialization. It emulates an ns16550a device, and writing to its transmit register is enough to cause a byte to be written over the UART (which appears on the console when simulating with QEMU). The transmit register for the ns16550a is the first mapped register, so it is located at 0x10000000.
In uart.d:
module uart;
struct Ns16550a(ubyte* base) {
    static void tx(ubyte b) {
        volatileStore(base, b);
    }
}
alias Uart = Ns16550a!(cast(ubyte*) 0x10000000);
Now in kmain, we can test the UART.
module main;
import uart;
void kmain() {
    Uart.tx('h');
    Uart.tx('i');
    Uart.tx('\n');
}
$ make prog.bin
$ qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
hi
Press Ctrl-A, then X, to quit QEMU (the program will enter an infinite loop after returning from kmain).
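The direct write in tx works because QEMU’s emulated UART is always ready to accept a byte. On real hardware a driver would usually wait for the transmitter to have room first. The sketch below is not code from this post and rests on a few assumptions: the standard 16550 register layout (line status register at offset 5, with bit 5 set when the transmit holding register is empty) and byte-spaced registers, as on the virt machine. The name txBlocking is made up, and the core.volatile import is only there so the snippet builds on a host with the regular D runtime; in the bare-metal build the intrinsics declared in object.d take its place.

```d
// Host-side stand-in for the intrinsics declared in object.d.
import core.volatile : volatileLoad, volatileStore;

enum size_t lsrOffset = 5;      // line status register (standard 16550 layout)
enum ubyte lsrThrEmpty = 0x20;  // LSR bit 5: transmit holding register empty

// Hypothetical helper: spin until the UART can accept a byte, then send it.
void txBlocking(ubyte* base, ubyte b) {
    while ((volatileLoad(base + lsrOffset) & lsrThrEmpty) == 0) {
        // busy-wait; the volatile load is re-issued on every iteration
    }
    volatileStore(base, b);
}
```

Taking the base address as a runtime parameter (rather than a template argument, as Ns16550a does) is purely so the function can be exercised against an ordinary array on the host.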
Making a simple print function
Now we can just wrap the Uart.tx function up with a println function and we’ll have a bare-metal Hello world! in no time.
In object.d:
import uart;
void printElem(char c) {
    Uart.tx(c);
}
void printElem(string s) {
    foreach (c; s) {
        printElem(c);
    }
}
void print(Args...)(Args args) {
    foreach (arg; args) {
        printElem(arg);
    }
}
void println(Args...)(Args args) {
    print(args, '\n');
}
And in main.d:
void kmain() {
    println("Hello world!");
}
$ make prog.bin
$ qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
Hello world!
There you have it, (simulated) bare-metal hello world!
Some of the initialization we’ve done hasn’t been strictly necessary (we didn’t end up using any variables in the BSS), but it should set you up properly for writing more complex bare-metal programs. The next sections discuss some further steps.
Bonus content
Adding support for assertions and bounds-checking
If you try to use a D assert expression, you might notice that the linking step fails:
riscv64-unknown-elf-ld: dstart.o: in function `_D6dstart5kmainFZv':
dstart.d:(.text+0x3c): undefined reference to `__assert'
It is looking for an __assert [8] function, so let’s create one in the object.d file:
size_t strlen(const(char)* s) {
    size_t n;
    for (n = 0; *s != '\0'; ++s) {
        ++n;
    }
    return n;
}
extern (C) noreturn __assert(const(char)* msg, const(char)* file, int line) {
    // convert a char pointer into a bounded string with the [0 .. length] syntax
    string smsg = cast(string) msg[0 .. strlen(msg)];
    string sfile = cast(string) file[0 .. strlen(file)];
    println("fatal error: ", sfile, ": ", smsg);
    while (1) {}
}
Now you can use assert statements!
D also supports bounds-checking, and internally the compiler will also call __assert when a bounds check fails. This means we also have working bounds checks now.
Try this in main.d:
void kmain() {
    char[10] array;
    int x = 12;
    println(array[x]);
}
Running it gives
fatal error: main.d: array index out of bounds
Bounds-checked arrays!
This code doesn’t print the line number because that requires converting an int to a string – something left as an exercise to the reader.
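For readers who would rather not do the exercise, here is one possible approach, offered as a minimal sketch rather than the code used in the repository. The helper name formatInt is hypothetical. It needs no runtime support: the digits are written from the end of a caller-provided buffer and the used part is returned as a slice.

```d
// Hypothetical helper, not part of the post's code: writes the decimal form of
// `value` into the tail of `buf` and returns the slice holding the digits.
// `buf` must have room for at least 20 chars (19 digits plus a sign).
char[] formatInt(long value, char[] buf) {
    // Take the magnitude as an unsigned value so long.min does not overflow.
    ulong mag = value < 0 ? 0 - cast(ulong) value : cast(ulong) value;
    size_t i = buf.length;
    do {
        buf[--i] = cast(char) ('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);
    if (value < 0) {
        buf[--i] = '-';
    }
    return buf[i .. $];
}
```

With a char[20] buffer on the stack, __assert could then pass formatInt(line, buf[]) on to the print routines, either through a cast to string (as is already done for msg and file) or through an extra printElem overload that accepts const(char)[].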
Enabling linker relaxation
Linker relaxation is an optimization in the RISC-V toolchain that allows globals to be accessed through the global pointer (stored in the gp register). This value is a pointer to somewhere in the data section, which allows instructions to load globals by directly offsetting from gp, instead of constructing the address of the global from scratch (which may require multiple instructions on RISC-V).
To enable linker relaxation we have to do three things:
- Modify the linkerscript so that it defines a symbol for the global pointer.
- Load the gp register with this value in the _start function.
- Enable linker relaxation in our compiler.
To modify the linkerscript we just add the following at the beginning of the .rodata section definition:
__global_pointer$ = . + 0x800;
This sets up the __global_pointer$ symbol (a special symbol that the linker assumes is stored in gp) to point 0x800 bytes into the data segment (RISC-V instructions can load/store values offset up to 0x800 bytes from the gp register in either direction in one instruction). This allows offsets from gp to cover most/all of static data.
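The 0x800 figure comes from the 12-bit signed immediate in RISC-V load/store instructions, which strictly spans -0x800 to +0x7FF. As a worked check, the compile-time sketch below uses a made-up section start address (the real value depends on how large .text ends up) to show that placing gp 0x800 bytes in makes exactly the first 4 KiB after that point reachable in one instruction.

```d
// 12-bit signed immediate range of RISC-V loads and stores.
enum long immMin = -0x800; // -2048
enum long immMax = 0x7FF;  // +2047

// Hypothetical address of the start of .rodata; chosen only for illustration.
enum long sectionStart = 0x8000_1000;
enum long gp = sectionStart + 0x800; // mirrors `__global_pointer$ = . + 0x800`

// Lowest and highest addresses reachable as a single gp-relative access.
enum long lowest = gp + immMin;
enum long highest = gp + immMax;

static assert(lowest == sectionStart);          // nothing wasted below the section
static assert(highest == sectionStart + 0xFFF); // a 4 KiB window in total
```

Anything placed beyond that window is still accessible, just through the usual multi-instruction address construction.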
Next we add to _start:
    .option push
    .option norelax
    la gp, __global_pointer$
    .option pop
We need to temporarily enable the norelax option, otherwise this load is itself relaxed by the linker into a gp-relative form, which turns it into mv gp, gp.
Finally, we can remove the -mno-relax flag from the riscv64-unknown-elf-as invocation, and add -mattr=+m,+a,+c,+relax to the ldc2 invocation to enable linker relaxation in the compiler.
Removing unused functions
If you take a look at the disassembly of the program (make prog.list), you might notice there are definitions for functions that are never called. This is because those functions have been inlined, but the definitions were not removed. Functions/globals in D are always exported in the object file, even if they are marked private (I’m not really sure why). Luckily modern linkers can be pretty smart and it’s easy to have the linker remove these unused functions. Pass --function-sections and --data-sections to LDC to have it put each function/global in its own section (still within .text, .data etc.). Now if you pass the --gc-sections flag to the linker, it will remove any unreferenced sections (hence removing any unused functions/globals). With these flags I got the final “hello world” binary down to 160 bytes.
This is a basic form of optimization performed by the linker. There are more advanced forms of link-time optimization (LTO), which I won’t discuss in much detail. If you pass -flto=thin or -flto=full to LDC, the object files that it generates will be LLVM bitcode. Then you will need to invoke the linker with the LLVMgold linker plugin (or use LLD) so that it can read these files. With this method, the linker will apply full compiler optimizations across object files.
Thread-local storage and globals
Globals are thread-local by default in D. That means if you declare a global as int x; then whenever you access x, the compiler will do so through the system’s thread pointer (on RISC-V this is stored in the tp register). That means if you use a thread-local variable, you had better make sure tp points to a block of memory where x is located, and if you have multiple threads each thread’s tp should point to a distinct thread-local block (each thread will have its own private copy of x). I won’t explain in detail how to set that up here, but briefly, you’ll need to initialize the .tdata and .tbss sections for each thread in dstart, and load tp with a pointer to the current thread’s local .tdata.
To make a global shared across all threads, you need to mark it as immutable or shared. A variable marked as shared imposes some limits, and basically forces you to mark everything it touches as shared. You can still read/write it without checks, but at least you should be able to easily know if you are accessing a shared variable (and manually verify you have the appropriate synchronization). In a future version of D it is likely that directly accessing a shared variable will be disallowed, except through atomic intrinsics. If you have a lock to protect the variable, then you will need to cast away the shared qualifier manually, which isn’t perfect but forces the programmer to acknowledge the possible unsafety of accessing the shared global. You can always use the __gshared attribute as an escape hatch, which makes the global shared but does not make any changes to the type (no limitations). A global marked as __gshared is equivalent to a C global.
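As a compact illustration of the four options, here is a minimal sketch. All names are made up, and the function assumes that some lock (not shown) is already held when it touches the shared variable.

```d
int perThread;                // thread-local: reached through tp, one copy per thread
shared int sharedCounter;     // one copy for all threads; the type is shared(int)
__gshared int plainGlobal;    // one copy for all threads; the type stays int, as in C
immutable int answer = 42;    // never written after initialization, so one copy suffices

static assert(is(typeof(sharedCounter) == shared(int)));
static assert(is(typeof(plainGlobal) == int));

// Hypothetical function. Assumption: the caller already holds whatever lock
// protects sharedCounter, which is why the qualifier is cast away explicitly.
void bump() {
    perThread++;      // on bare metal this needs tp and the TLS block set up first
    plainGlobal++;
    int* counter = cast(int*) &sharedCounter;
    (*counter)++;
}
```

The two static asserts show the difference in practice: shared changes the type and therefore spreads through every signature that touches the variable, while __gshared only changes where the variable is stored.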
Final remarks
I hope this provided a simple introduction to D for bare-metal programming, and that you might consider using D instead of C in some future project as a result. This post has only covered running in a simulated environment. In a future post I’ll show how to write bare-metal code for the VisionFive 2 [9], a recently released RISC-V SBC produced by StarFive. Stay tuned! (now available)
If you want to see a larger example, I am developing an operating system called Multiplix in D. It has support for RISC-V and AArch64, and targets the VisionFive, VisionFive 2, Raspberry Pi 3, and Raspberry Pi 4 (and likely more boards in the future). Check it out! It is currently heavily in-progress, but I plan to make a post about it when it is further along.
The code from this post is available in my blog-code repository.
[1] Also sometimes called DasBetterC. Sehr gut!
[2] Version 1.31 has a problem with the default RISC-V ABI. This is fixed on master and the problem will be gone in 1.32.
[3] I actually distribute some here. These are builds from the master branch of GCC 13. See the gdc branch in blog-code for a version of this post’s code that works with GDC instead of LDC.
[4] This varies by board and is documented in the technical reference manual (if you’re lucky). On the VisionFive it is 0x80000000 and on the VisionFive 2 it is 0x40000000.
[5] We have an additional flag, --makedeps, which asks LDC to output a dependency file that captures dependencies between D source files that import each other. This is similar to the -MMD flag in C compilers.
[6] Actually we haven’t set up thread-local storage, so we can’t use D’s TLS globals yet. We’ll also need additional runtime callbacks for things like assertions.
[7] This special memory address is sometimes called a device register, or memory-mapped register. Accessing devices in this way is usually called memory-mapped I/O (MMIO).
[8] If you are using GDC, you’ll need to create a variety of assertion failure functions named _d_assert_msg, _d_assert, _d_arraybounds, …
[9] Specs: quad-core SiFive U74 at 1.5 GHz, with an additional S7 monitor core and 8 GB of RAM.