Reading List
The most recent articles from a list of feeds I subscribe to.
Behind "Hello World" on Linux
Today I was thinking about – what happens when you run a simple “Hello World” Python program on Linux, like this one?
print("hello world")
Here’s what it looks like at the command line:
$ python3 hello.py
hello world
But behind the scenes, there’s a lot more going on. I’ll
describe some of what happens, and (much much more importantly!) explain some tools you can use to
see what’s going on behind the scenes yourself. We’ll use readelf, strace,
ldd, debugfs, /proc, ltrace, dd, and stat. I won’t talk about the Python-specific parts at all – just what happens when you run any dynamically linked executable.
Here’s a table of contents:
- parse “python3 hello.py”
- figure out the full path to python3
- stat, under the hood
- time to fork
- the shell calls execve
- get the binary’s contents
- find the interpreter
- dynamic linking
- go to _start
- write a string
before execve
Before we even start the Python interpreter, there are a lot of things that have to happen. What executable are we even running? Where is it?
1: The shell parses the string python3 hello.py into a command to run and a list of arguments: python3, and ['hello.py']
A bunch of things like glob expansion could happen here. For example if you run python3 *.py, the shell will expand that into python3 hello.py
2: The shell figures out the full path to python3
Now we know we need to run python3. But what’s the full path to that binary? The way this works is that there’s a special environment variable named PATH.
See for yourself: Run echo $PATH in your shell. For me it looks like this.
$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When you run a command, the shell will search every directory in that list (in order) to try to find a match.
In fish (my shell), you can see the path resolution logic here.
It uses the stat system call to check if files exist.
See for yourself: Run strace -e stat bash, and then run a command like python3. You should see output like this:
stat("/usr/local/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/bin/python3", {st_mode=S_IFREG|0755, st_size=5479736, ...}) = 0
You can see that it finds the binary at /usr/bin/python3 and stops: it
doesn’t continue searching /sbin or /bin.
(if this doesn’t work for you, instead try strace -o out bash, and then grep
stat out. One reader mentioned that their version of libc uses a different
system call instead of stat)
2.1: A note on execvp
If you want to run the same PATH searching logic as the shell does without
reimplementing it yourself, you can use the libc function execvp (or one of
the other exec* functions with p in the name).
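Here's a minimal sketch of that in Python, using os.execvp (CPython actually reimplements the PATH search itself rather than calling libc's execvp, but the behavior matches):
import os

# Search every directory in $PATH for "python3" (like the shell does)
# and replace the current process with it. Nothing after this line runs
# if the exec succeeds.
os.execvp("python3", ["python3", "hello.py"])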
3: stat, under the hood
Now you might be wondering – Julia, what is stat doing? Well, when your OS opens a file, it’s split into 2 steps.
- It maps the filename to an inode, which contains metadata about the file
- It uses the inode to get the file’s contents
The stat system call just returns the contents of the file’s inode – it
doesn’t read the contents at all. The advantage of this is that it’s a lot
faster. Let’s go on a short adventure into inodes. (this great post “A disk is a bunch of bits” by Dmitry Mazin has more details)
$ stat /usr/bin/python3
File: /usr/bin/python3 -> python3.9
Size: 9 Blocks: 0 IO Block: 4096 symbolic link
Device: fe01h/65025d Inode: 6206 Links: 1
Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-08-03 14:17:28.890364214 +0000
Modify: 2021-04-05 12:00:48.000000000 +0000
Change: 2021-06-22 04:22:50.936969560 +0000
Birth: 2021-06-22 04:22:50.924969237 +0000
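You can poke at the same inode metadata from Python with os.stat / os.lstat, which wrap those system calls – a quick sketch using the path from above:
import os, stat

# lstat is like stat but doesn't follow symlinks, so we see the
# metadata of the /usr/bin/python3 symlink itself.
info = os.lstat("/usr/bin/python3")
print("inode:", info.st_ino)       # 6206 on my system
print("size:", info.st_size)       # 9: the length of "python3.9"
print("mode:", oct(info.st_mode))  # 0o120777
print("symlink?", stat.S_ISLNK(info.st_mode))  # True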
See for yourself: Let’s go see where exactly that inode is on our hard drive.
First, we have to find our hard drive’s device name
$ df
...
tmpfs 100016 604 99412 1% /run
/dev/vda1 25630792 14488736 10062712 60% /
...
Looks like it’s /dev/vda1. Next, let’s find out where the inode for /usr/bin/python3 is on our hard drive:
$ sudo debugfs /dev/vda1
debugfs 1.46.2 (28-Feb-2021)
debugfs: imap /usr/bin/python3
Inode 6206 is part of block group 0
located at block 658, offset 0x0d00
I have no idea how debugfs is figuring out the location of the inode for that filename, but we’re going to leave that alone.
Now, we need to calculate how far into the hard drive “block 658, offset 0x0d00” is, treating the whole disk as one big array of bytes. Each block is 4096 bytes, so we need to go 4096 * 658 + 0x0d00 bytes in. A calculator tells me that’s 2698496
$ sudo dd if=/dev/vda1 bs=1 skip=2698496 count=256 2>/dev/null | hexdump -C
00000000 ff a1 00 00 09 00 00 00 f8 b6 cb 64 9a 65 d1 60 |...........d.e.`|
00000010 f0 fb 6a 60 00 00 00 00 00 00 01 00 00 00 00 00 |..j`............|
00000020 00 00 00 00 01 00 00 00 70 79 74 68 6f 6e 33 2e |........python3.|
00000030 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |9...............|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000060 00 00 00 00 12 4a 95 8c 00 00 00 00 00 00 00 00 |.....J..........|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 2d cb 00 00 |............-...|
00000080 20 00 bd e7 60 15 64 df 00 00 00 00 d8 84 47 d4 | ...`.d.......G.|
00000090 9a 65 d1 60 54 a4 87 dc 00 00 00 00 00 00 00 00 |.e.`T...........|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
Neat! There’s our inode! You can see it says python3 in it, which is a really
good sign. We’re not going to go through all of this, but the ext4 inode struct from the Linux kernel
says that the first 16 bits are the “mode”, or permissions. So let’s work out how ff a1 corresponds to file permissions.
- The bytes ff a1 correspond to the number 0xa1ff, or 41471 (because x86 is little endian)
- 41471 in octal is 0120777
- This is a bit weird – that file’s permissions could definitely be 777, but what are the first 3 digits? I’m not used to seeing those! You can find out what the 012 means in man inode (scroll down to “The file type and mode”). There’s a little table that says 012 means “symbolic link”.
Let’s list the file and see if it is in fact a symbolic link with permissions 777:
$ ls -l /usr/bin/python3
lrwxrwxrwx 1 root root 9 Apr 5 2021 /usr/bin/python3 -> python3.9
It is! Hooray, we decoded it correctly.
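If you want to double-check the decoding in code, here's a sketch using Python's stat module (0xa1ff is the value we read out of the inode):
import stat

mode = 0xA1FF  # the bytes ff a1, read little-endian
print(oct(mode))                # 0o120777
print(stat.S_ISLNK(mode))       # True: the 012 file type means "symbolic link"
print(oct(stat.S_IMODE(mode)))  # 0o777: just the permission bits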
4: Time to fork
We’re still not ready to start python3. First, the shell needs to create a
new child process to run it in. The way new processes start on Unix is a little weird
– first the process clones itself, and then runs execve, which replaces the
cloned process with a new process.
See for yourself: Run strace -e clone bash, then run python3. You should see something like this:
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f03788f1a10) = 3708100
3708100 is the PID of the new process, which is a child of the shell process.
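Here's a rough sketch of that clone-then-exec dance in Python (os.fork and os.execv wrap the underlying system calls; the paths are the ones from above):
import os

pid = os.fork()  # clone the current process (the clone/fork syscall)
if pid == 0:
    # Child: replace ourselves with the python3 binary (execve).
    os.execv("/usr/bin/python3", ["python3", "hello.py"])
else:
    # Parent (the "shell"): wait for the child to exit.
    os.waitpid(pid, 0)
    print("child", pid, "exited")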
Some more tools to look at what’s going on with processes:
- pstree will show you a tree of all the processes on your system
- cat /proc/PID/stat shows you some information about the process. The contents of that file are documented in man proc. For example the 4th field is the parent PID.
4.1: What the new process inherits
The new process (which will become python3) has inherited a bunch of things from the shell. For example, it’s inherited:
- environment variables: you can look at them with cat /proc/PID/environ | tr '\0' '\n'
- file descriptors for stdout and stderr: look at them with ls -l /proc/PID/fd
- a working directory (whatever the current directory is)
- namespaces and cgroups (if it’s in a container)
- the user and group that’s running it
- probably more things I’m not thinking of right now
5: The shell calls execve
Now we’re ready to start the Python interpreter!
See for yourself: Run strace -f -e execve bash, then run python3. The -f is important because we want to follow any forked child subprocesses. You should see something like this:
[pid 3708381] execve("/usr/bin/python3", ["python3"], 0x560397748300 /* 21 vars */) = 0
The first argument is the binary, and the second argument is the list of command line arguments. The command line arguments get placed in a special location in the program’s memory so that it can access them when it runs.
Now, what’s going on inside execve?
6: get the binary’s contents
The first thing that has to happen is that we need to open the python3
binary file and read its contents. So far we’ve only used the stat system call to access its metadata,
but now we need its contents.
Let’s look at the output of stat again:
$ stat /usr/bin/python3
File: /usr/bin/python3 -> python3.9
Size: 9 Blocks: 0 IO Block: 4096 symbolic link
Device: fe01h/65025d Inode: 6206 Links: 1
...
This takes up 0 blocks of space on the disk. This is because the contents of
the symbolic link (python3.9) are actually in the inode itself: you can see
them here (from the binary contents of the inode above, it’s split across 2
lines in the hexdump output):
00000020 00 00 00 00 01 00 00 00 70 79 74 68 6f 6e 33 2e |........python3.|
00000030 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |9...............|
So we’ll need to open /usr/bin/python3.9 instead. All of this is happening
inside the kernel, so you won’t see another system call for it.
Every file is made up of a bunch of blocks on the hard drive. I think each of these blocks on my system is 4096 bytes, so the minimum amount of space a file takes up on disk is 4096 bytes – even if the file is only 5 bytes, it still takes up 4KB on disk.
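Here's a quick sketch you can run to see that minimum allocation (note that st_blocks is counted in 512-byte units, so one 4096-byte block shows up as 8):
import os

# Write a 5-byte file and check how much disk space it actually uses.
with open("/tmp/tiny.txt", "w") as f:
    f.write("hello")

info = os.stat("/tmp/tiny.txt")
print(info.st_size)          # 5: the file's contents
print(info.st_blocks * 512)  # typically 4096: one whole block on ext4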
See for yourself: We can find the block numbers using debugfs like this: (again, I got these instructions from Dmitry Mazin’s “A disk is a bunch of bits” post)
$ debugfs /dev/vda1
debugfs: blocks /usr/bin/python3.9
145408 145409 145410 145411 145412 145413 145414 145415 145416 145417 145418 145419 145420 145421 145422 145423 145424 145425 145426 145427 145428 145429 145430 145431 145432 145433 145434 145435 145436 145437
Now we can use dd to read the first block of the file. We’ll set the block size to 4096 bytes, skip 145408 blocks, and read 1 block.
$ dd if=/dev/vda1 bs=4096 skip=145408 count=1 2>/dev/null | hexdump -C | head
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 c0 a5 5e 00 00 00 00 00 |..>.......^.....|
00000020 40 00 00 00 00 00 00 00 b8 95 53 00 00 00 00 00 |@.........S.....|
00000030 00 00 00 00 40 00 38 00 0b 00 40 00 1e 00 1d 00 |....@.8...@.....|
00000040 06 00 00 00 04 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000050 40 00 40 00 00 00 00 00 40 00 40 00 00 00 00 00 |@.@.....@.@.....|
00000060 68 02 00 00 00 00 00 00 68 02 00 00 00 00 00 00 |h.......h.......|
00000070 08 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 |................|
00000080 a8 02 00 00 00 00 00 00 a8 02 40 00 00 00 00 00 |..........@.....|
00000090 a8 02 40 00 00 00 00 00 1c 00 00 00 00 00 00 00 |..@.............|
You can see that we get the exact same output as if we read the file with cat, like this:
$ cat /usr/bin/python3.9 | hexdump -C | head
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 c0 a5 5e 00 00 00 00 00 |..>.......^.....|
00000020 40 00 00 00 00 00 00 00 b8 95 53 00 00 00 00 00 |@.........S.....|
00000030 00 00 00 00 40 00 38 00 0b 00 40 00 1e 00 1d 00 |....@.8...@.....|
00000040 06 00 00 00 04 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000050 40 00 40 00 00 00 00 00 40 00 40 00 00 00 00 00 |@.@.....@.@.....|
00000060 68 02 00 00 00 00 00 00 68 02 00 00 00 00 00 00 |h.......h.......|
00000070 08 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 |................|
00000080 a8 02 00 00 00 00 00 00 a8 02 40 00 00 00 00 00 |..........@.....|
00000090 a8 02 40 00 00 00 00 00 1c 00 00 00 00 00 00 00 |..@.............|
an aside on magic numbers
This file starts with ELF, which is a “magic number”, or a byte sequence that
tells us that this is an ELF file. ELF is the binary file format on Linux.
Different file formats have different magic numbers, for example the magic
number for gzip is 1f8b. The magic number at the beginning is how file blah.gz knows that it’s a gzip file.
I think file has a variety of heuristics for figuring out the file type of a
file, not just magic numbers, but the magic number is an important one.
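Here's a sketch of a very crude version of what file does with magic numbers – just peeking at the first few bytes (the magic byte values are the real ones; the path is just an example):
MAGICS = {
    b"\x7fELF": "ELF executable",
    b"\x1f\x8b": "gzip data",
    b"%PDF": "PDF document",
}

def guess_type(path):
    with open(path, "rb") as f:
        header = f.read(4)
    for magic, name in MAGICS.items():
        if header.startswith(magic):
            return name
    return "unknown"

print(guess_type("/usr/bin/python3.9"))  # ELF executable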
7: find the interpreter
Let’s parse the ELF file to see what’s in there.
See for yourself: Run readelf -a /usr/bin/python3.9. Here’s what I get (though I’ve redacted a LOT of stuff):
$ readelf -a /usr/bin/python3.9
ELF Header:
Class: ELF64
Machine: Advanced Micro Devices X86-64
...
-> Entry point address: 0x5ea5c0
...
Program Headers:
Type Offset VirtAddr PhysAddr
INTERP 0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
0x000000000000001c 0x000000000000001c R 0x1
-> [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
...
-> 1238: 00000000005ea5c0 43 FUNC GLOBAL DEFAULT 13 _start
Here’s what I understand of what’s going on here:
- it’s telling the kernel to run /lib64/ld-linux-x86-64.so.2 to start this program. This is called the dynamic linker and we’ll talk about it next
- it’s specifying an entry point (at 0x5ea5c0, which is where this program’s code starts)
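If you want to pull those two fields out without readelf, here's a sketch that parses them by hand with Python's struct module (it assumes a 64-bit little-endian ELF, like the one above):
import struct

with open("/usr/bin/python3.9", "rb") as f:
    data = f.read()

assert data[:4] == b"\x7fELF"
# e_entry is at offset 0x18 in the ELF64 header, e_phoff right after it.
e_entry, e_phoff = struct.unpack_from("<QQ", data, 0x18)
e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
print(hex(e_entry))  # 0x5ea5c0 on my system

# Walk the program headers looking for PT_INTERP (type 3).
for i in range(e_phnum):
    off = e_phoff + i * e_phentsize
    p_type, p_flags, p_offset = struct.unpack_from("<IIQ", data, off)
    p_filesz = struct.unpack_from("<Q", data, off + 32)[0]
    if p_type == 3:
        print(data[p_offset : p_offset + p_filesz])  # b'/lib64/ld-linux-x86-64.so.2\x00'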
Now let’s talk about the dynamic linker.
8: dynamic linking
Okay! We’ve read the bytes from disk and we’ve started this “interpreter” thing. What next? Well, if you run strace -o out.strace python3, you’ll see a bunch of stuff like this right after the execve system call:
execve("/usr/bin/python3", ["python3"], 0x560af13472f0 /* 21 vars */) = 0
brk(NULL) = 0xfcc000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=32091, ...}) = 0
mmap(NULL, 32091, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f718a1e3000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 l\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=149520, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f718a1e1000
...
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
This all looks a bit intimidating at first, but the part I want you to pay
attention to is openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0".
This is opening a C threading library called pthread that the Python
interpreter needs to run.
See for yourself: If you want to know which libraries a binary needs to load at runtime, you can use ldd. Here’s what that looks like for me:
$ ldd /usr/bin/python3.9
linux-vdso.so.1 (0x00007ffc2aad7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2fd6554000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2fd654e000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)
You can see that the first library listed is /lib/x86_64-linux-gnu/libpthread.so.0, which is why it was loaded first.
on LD_LIBRARY_PATH
I’m honestly still a little confused about dynamic linking. Some things I know:
- Dynamic linking happens in userspace, and the dynamic linker on my system is at /lib64/ld-linux-x86-64.so.2. If you’re missing the dynamic linker, you can end up with weird bugs like this weird “file not found” error
- The dynamic linker uses the LD_LIBRARY_PATH environment variable to find libraries
- The dynamic linker will also use the LD_PRELOAD environment variable to override any dynamically linked function you want (you can use this for fun hacks, or to replace your default memory allocator with an alternative one like jemalloc)
- there are some mprotects in the strace output which are marking the library code as read-only, for security reasons
- on Mac, it’s DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH
You might be wondering – if dynamic linking happens in userspace, why don’t we
see a bunch of stat system calls where it’s searching through
LD_LIBRARY_PATH for the libraries, the way we did when bash was searching the
PATH?
That’s because ld has a cache in /etc/ld.so.cache, and all of those
libraries have already been found in the past. You can see it opening the cache
in the strace output – openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3.
There are still a bunch of system calls after dynamic linking in the full strace output that I
still don’t really understand (what’s prlimit64 doing? where does the locale
stuff come in? what’s gconv-modules.cache? what’s rt_sigaction doing?
what’s arch_prctl? what’s set_tid_address and set_robust_list?). But this feels like a good start.
aside: ldd is actually a simple shell script!
Someone on mastodon pointed out that ldd is actually a shell script
that just sets the LD_TRACE_LOADED_OBJECTS=1 environment variable and
starts the program. So you can do exactly the same thing like this:
$ LD_TRACE_LOADED_OBJECTS=1 python3
linux-vdso.so.1 (0x00007ffe13b0a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f01a5a47000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f01a5a41000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)
Apparently ld is also a binary you can just run, so /lib64/ld-linux-x86-64.so.2 --list /usr/bin/python3.9 also does the same thing.
on init and fini
Let’s talk about this line in the strace output:
set_tid_address(0x7f58880dca10) = 3709103
This seems to have something to do with threading, and I think this might be
happening because the pthread library (and every other dynamically loaded library)
gets to run initialization code when it’s loaded. The code that runs when the
library is loaded is in the init section (or maybe also the .ctors section).
See for yourself: Let’s take a look at that using readelf:
$ readelf -a /lib/x86_64-linux-gnu/libpthread.so.0
...
[10] .rela.plt RELA 00000000000051f0 000051f0
00000000000007f8 0000000000000018 AI 4 26 8
[11] .init PROGBITS 0000000000006000 00006000
000000000000000e 0000000000000000 AX 0 0 4
[12] .plt PROGBITS 0000000000006010 00006010
0000000000000560 0000000000000010 AX 0 0 16
...
This library doesn’t have a .ctors section, just an .init. But what’s in
that .init section? We can use objdump to disassemble the code:
$ objdump -d /lib/x86_64-linux-gnu/libpthread.so.0
Disassembly of section .init:
0000000000006000 <_init>:
6000: 48 83 ec 08 sub $0x8,%rsp
6004: e8 57 08 00 00 callq 6860 <__pthread_initialize_minimal>
6009: 48 83 c4 08 add $0x8,%rsp
600d: c3 retq
So it’s calling __pthread_initialize_minimal. I found the code for that function in glibc,
though I had to find an older version of glibc because it looks like in more
recent versions libpthread is no longer a separate library.
I’m not sure whether this set_tid_address system call actually comes from
__pthread_initialize_minimal, but at least we’ve learned that libraries can
run code on startup through the .init section.
Here’s a note from man elf on the .init section:
$ man elf
.init This section holds executable instructions that contribute to the process initialization code. When a program starts to run,
the system arranges to execute the code in this section before calling the main program entry point.
There’s also a .fini section in the ELF file that runs at the end, and
.ctors / .dtors (constructors and destructors) are other sections that
could exist.
Okay, that’s enough about dynamic linking.
9: go to _start
After dynamic linking is done, we go to _start in the Python interpreter.
Then it does all the normal Python interpreter things you’d expect.
I’m not going to talk about this because here I’m interested in general facts about how binaries are run on Linux, not the Python interpreter specifically.
10: write a string
We still need to print out “hello world” though. Under the hood, the Python print function calls some function from libc. But which one? Let’s find out!
See for yourself: Run ltrace -o out python3 hello.py.
$ ltrace -o out python3 hello.py
$ grep hello out
write(1, "hello world\n", 12) = 12
So it looks like it’s calling write.
I honestly am always a little suspicious of ltrace – unlike strace (which I
would trust with my life), I’m never totally sure that ltrace is actually
reporting library calls accurately. But in this case it seems to be working. And
if we look at the cpython source code, it does seem to be calling write() in some places. So I’m willing to believe that.
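You can even make that libc write call yourself from Python with ctypes – a little sketch (libc.so.6 is the glibc name on Linux):
import ctypes

# Load libc and call its write() directly: fd 1 is stdout, and this is
# the same libc function that ltrace reported Python calling.
libc = ctypes.CDLL("libc.so.6")
msg = b"hello world\n"
libc.write(1, msg, len(msg))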
what’s libc?
We just said that Python calls the write function from libc. What’s libc?
It’s the C standard library, and it’s responsible for a lot of basic things
like:
- allocating memory with malloc
- file I/O (opening/closing/reading/writing files)
- executing programs (with execvp, like we mentioned before)
- looking up DNS records with getaddrinfo
- managing threads with pthread
Programs don’t have to use libc (on Linux, Go famously doesn’t use it and calls Linux system calls directly instead), but most other programming languages I use (node, Python, Ruby, Rust) all use libc. I’m not sure about Java.
You can find out if you’re using libc by running ldd on your binary: if you
see something like libc.so.6, that’s libc.
why does libc matter?
You might be wondering – why does it matter that Python calls the libc write
and then libc calls the write system call? Why am I making a point of saying
that libc is in the middle?
I think in this case it doesn’t really matter (AFAIK the write libc function
maps pretty directly to the write system call)
But there are different libc implementations, and sometimes they behave differently. The two main ones are glibc (GNU libc) and musl libc.
For example, until recently musl’s getaddrinfo didn’t support TCP DNS, here’s a blog post talking about a bug that that caused.
a little detour into stdout and terminals
In this program, stdout (file descriptor 1) is a terminal. And you can do
funny things with terminals! Here’s one:
- In a terminal, run ls -l /proc/self/fd/1. I get /dev/pts/2
- In another terminal window, write echo hello > /dev/pts/2
- Go back to the original terminal window. You should see hello printed there!
that’s all for now!
Hopefully you have a better idea of how hello world gets printed! I’m going to stop
adding more details for now because this is already pretty long, but obviously there’s
more to say and I might add more if folks chip in with extra details. I’d
especially love suggestions for other tools you could use to inspect parts of
the process that I haven’t explained here.
Thanks to everyone who suggested corrections / additions – I’ve edited this blog post a lot to incorporate more things :)
Some things I’d like to add if I can figure out how to spy on them:
- the kernel loader and ASLR (I haven’t figured out yet how to use bpftrace + kprobes to trace the kernel loader’s actions)
- TTYs (I haven’t figured out how to trace the way write(1, "hello world", 11) gets sent to the TTY that I’m looking at)
I’d love to see a Mac version of this
One of my frustrations with Mac OS is that I don’t know how to introspect my
system on this level – when I print hello world, I can’t figure out how to
spy on what’s going on behind the scenes the way I can on Linux. I’d love to
see a really in depth explainer.
Some Mac equivalents I know about:
- ldd -> otool -L
- readelf -> otool
- supposedly you can use dtruss or dtrace on Mac instead of strace, but I’ve never been brave enough to turn off system integrity protection to get it to work
- strace -> sc_usage seems to be able to collect stats about syscall usage, and fs_usage about file usage
more reading
Some more links:
- A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux
- an exploration of “hello world” on FreeBSD
- hello world under the microscope for Windows
- From LWN: how programs get run (and part two) have a bunch more details on the internals of execve
- Putting the “You” in CPU by Lexi Mattick
- “Hello, world” from scratch on a 6502 (video from Ben Eater)
Why is DNS still hard to learn?
I write a lot about technologies that I found hard to learn about. A while back my friend Sumana asked me an interesting question – why are these things so hard to learn about? Why do they seem so mysterious?
For example, take DNS. We’ve been using DNS since the 80s (for more than 35 years!). It’s used in every website on the internet. And it’s pretty stable – in a lot of ways, it works the exact same way it did 30 years ago.
But it took me YEARS to figure out how to confidently debug DNS issues, and I’ve seen a lot of other programmers struggle with debugging DNS problems as well. So what’s going on?
Here are a couple of thoughts about why learning to troubleshoot DNS problems is hard.
(I’m not going to explain DNS very much in this post, see Implement DNS in a Weekend or my DNS blog posts for more about how DNS works)
it’s not because DNS is super hard
When I finally learned how to troubleshoot DNS problems, my reaction was “what, that was it???? that’s not that hard!”. I felt a little bit cheated! I could explain to you everything that I found confusing about DNS in a few hours.
So – if DNS is not all that complicated, why did it take me so many years to
figure out how to troubleshoot pretty basic DNS issues (like “my domain doesn’t
resolve even though I’ve set it up correctly” or “dig and my browser have
different DNS results, why?”)?
And I wasn’t alone in finding DNS hard to learn! I’ve talked to a lot of smart friends who are very experienced programmers about DNS over the years, and many of them either:
- didn’t feel comfortable making simple DNS changes to their websites
- or were confused about basic facts about how DNS works (like that records are pulled and not pushed)
- or did understand DNS basics pretty well, but had some of the same knowledge gaps that I’d struggled with (negative caching and the details of how dig and your browser do DNS queries differently)
So if we’re all struggling with the same things about DNS, what’s going on? Why is it so hard to learn for so many people?
Here are some ideas.
a lot of the system is hidden
When you make a DNS request on your computer, the basic story is:
- your computer makes a request to a server called a resolver
- the resolver checks its cache, and makes requests to some other servers called authoritative nameservers
Here are some things you don’t see:
- the resolver’s cache. What’s in there?
- which library code on your computer is making the DNS request (is it libc’s getaddrinfo? if so, is it the getaddrinfo from glibc, or musl, or Apple? is it your browser’s DNS code? is it a different custom DNS implementation?). All of these options behave slightly differently and have different configuration, approaches to caching, available features, etc. For example musl DNS didn’t support TCP until early 2023. (there’s a small getaddrinfo sketch after this list)
- the conversation between the resolver and the authoritative nameservers. I think a lot of DNS issues would be SO simple to understand if you could magically get a trace of exactly which authoritative nameservers were queried downstream during your request, and what they said. (like, what if you could run dig +debug google.com and it gave you a bunch of extra debugging information?)
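For reference, here's the getaddrinfo call that most programs make – Python's socket.getaddrinfo wraps the libc one, and all of the resolver and caching machinery above is hidden behind it:
import socket

# Resolve a name the way most programs do: via libc's getaddrinfo.
for family, type_, proto, canonname, sockaddr in socket.getaddrinfo(
    "example.com", 443, proto=socket.IPPROTO_TCP
):
    print(family, sockaddr)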
dealing with hidden systems
A couple of ideas for how to deal with hidden systems:
- just teaching people what the hidden systems are makes a huge difference. For a long time I had no idea that my computer had many different DNS libraries that were used in different situations and I was confused about this for literally years. This is a big part of my approach.
- with Mess With DNS we tried out this “fishbowl” approach where it shows you some parts of the system (the conversation with the resolver and the authoritative nameserver) that are normally hidden
- I feel like it would be extremely cool to extend DNS to include a “debugging information” section. (edit: it looks like this already exists! It’s called Extended DNS Errors, or EDE, and tools are slowly adding support for it.)
Extended DNS Errors seem cool
Extended DNS Errors are a new way for DNS servers to provide extra debugging information in DNS responses. Here’s an example of what that looks like:
$ dig @8.8.8.8 xjwudh.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 39830
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; EDE: 12 (NSEC Missing): (Invalid denial of existence of xjwudh.com/a)
;; QUESTION SECTION:
;xjwudh.com. IN A
;; AUTHORITY SECTION:
com. 900 IN SOA a.gtld-servers.net. nstld.verisign-grs.com. 1690634120 1800 900 604800 86400
;; Query time: 92 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)
;; WHEN: Sat Jul 29 08:35:45 EDT 2023
;; MSG SIZE rcvd: 161
Here I’ve requested a nonexistent domain, and I got the extended error EDE:
12 (NSEC Missing): (Invalid denial of existence of xjwudh.com/a). I’m not
sure what that means (it’s some DNSSEC thing), but it’s cool to see an extra
debug message like that.
I did have to install a newer version of dig to get the above to work.
confusing tools
Even though a lot of DNS stuff is hidden, there are a lot of ways to figure out
what’s going on by using dig.
For example, you can use dig +norecurse to figure out if a given DNS resolver
has a particular record in its cache. 8.8.8.8 seems to return a SERVFAIL
response if the response isn’t cached.
here’s what that looks like for google.com
$ dig +norecurse @8.8.8.8 google.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11653
;; flags: qr ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 21 IN A 172.217.4.206
;; Query time: 57 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Jul 28 10:50:45 EDT 2023
;; MSG SIZE rcvd: 55
and for homestarrunner.com:
$ dig +norecurse @8.8.8.8 homestarrunner.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 55777
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;homestarrunner.com. IN A
;; Query time: 52 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Jul 28 10:51:01 EDT 2023
;; MSG SIZE rcvd: 47
Here you can see we got a normal NOERROR response for google.com (which is
in 8.8.8.8’s cache) but a SERVFAIL for homestarrunner.com (which isn’t).
This doesn’t mean there’s no DNS record for homestarrunner.com (there is!), it’s
just not cached.
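If you want to see the recursion-desired bit at work without dig, here's a sketch that builds a raw DNS query with RD turned off (what +norecurse does) and checks the response code – no timeouts or error handling, just the happy path:
import socket
import struct

def make_query(name):
    # Header: id, flags=0 (so the "recursion desired" bit is off),
    # 1 question, 0 answer/authority/additional records.
    header = struct.pack(">HHHHHH", 0x1234, 0, 1, 0, 0, 0)
    question = b"".join(
        bytes([len(label)]) + label.encode() for label in name.split(".")
    ) + b"\x00" + struct.pack(">HH", 1, 1)  # qtype A, qclass IN
    return header + question

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(make_query("homestarrunner.com"), ("8.8.8.8", 53))
response, _ = sock.recvfrom(1024)
rcode = struct.unpack(">H", response[2:4])[0] & 0xF  # low 4 bits of flags
print({0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}.get(rcode, rcode))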
But dig’s output is really confusing to read if you’re not used to it! Here are a few things that I think are weird about it:
- the headings are weird (there’s ->>HEADER<<-, flags:, OPT PSEUDOSECTION:, QUESTION SECTION:, ANSWER SECTION:)
- the spacing is weird (why is there no newline between OPT PSEUDOSECTION and QUESTION SECTION?)
- MSG SIZE rcvd: 47 is weird (are there other fields in MSG SIZE other than rcvd? what are they?)
- it says that there’s 1 record in the ADDITIONAL section but doesn’t show it, you have to somehow magically know that the “OPT PSEUDOSECTION” record is actually in the additional section
In general dig’s output has the feeling of a script someone wrote in an ad hoc way that grew organically over time, not something that was intentionally designed.
dealing with confusing tools
some ideas for improving on confusing tools:
- explain the output. For example I wrote how to use dig explaining how dig’s output works and how to configure it to give you a shorter output by default
- make new, more friendly tools. For example for DNS there’s dog and doggo and my dns lookup tool. I think these are really cool, but personally I don’t use them because sometimes I want to do something a little more advanced (like using +norecurse), and as far as I can tell neither dog nor doggo support +norecurse. I’d rather use 1 tool for everything, so I stick to dig. Replacing the breadth of functionality of dig is a huge undertaking.
- make dig’s output a little more friendly. If I were better at C programming, I might try to write a dig pull request that adds a +human flag to dig that formats the long form output in a more structured and readable way, maybe something like this:
$ dig +human +norecurse @8.8.8.8 google.com
HEADER:
opcode: QUERY
status: NOERROR
id: 11653
flags: qr ra
records: QUESTION: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
QUESTION SECTION:
google.com. IN A
ANSWER SECTION:
google.com. 21 IN A 172.217.4.206
ADDITIONAL SECTION:
EDNS: version: 0, flags:; udp: 512
EXTRA INFO:
Time: Fri Jul 28 10:51:01 EDT 2023
Elapsed: 52 msec
Server: 8.8.8.8:53
Protocol: UDP
Response size: 47 bytes
This makes the structure of the DNS response more clear – there’s the header, the question, the answer, and the additional section.
And it’s not “dumbed down” or anything! It’s the exact same information, just formatted in a more structured way. My biggest frustration with alternative DNS tools is that they often remove information in the name of clarity. And though there’s definitely a place for those tools, I want to see all the information! I just want it to be presented clearly.
We’ve learned a lot about how to design more user friendly command line tools in the last 40 years and I think it would be cool to apply some of that knowledge to some of our older crustier tools.
dig +yaml
One quick note on dig: newer versions of dig do have a +yaml output format
which feels a little clearer to me, though it’s too verbose for my taste (a
pretty simple DNS response doesn’t fit on my screen).
weird gotchas
DNS has some weird stuff that’s relatively common to run into, but pretty hard to learn about if nobody tells you what’s going on. A few examples (there are more in “some ways DNS can break”):
- negative caching! (which I talk about in this talk) It took me probably 5 years to realize that I shouldn’t visit a domain that doesn’t have a DNS record yet, because then the nonexistence of that record will be cached, and it gets cached for HOURS, and it’s really annoying.
- differences in getaddrinfo implementations: until early 2023, musl didn’t support TCP DNS
- if you configure nginx wrong (like this), it’ll cache DNS records forever.
- how ndots can make your Kubernetes DNS slow
dealing with weird gotchas
I don’t have as good answers here as I would like to, but knowledge about weird gotchas is extremely hard won (again, it took me years to figure out negative caching!) and it feels very silly to me that people have to rediscover them for themselves over and over and over again.
A few ideas:
- It’s incredibly helpful when people call out gotchas when explaining a topic. For example (leaving DNS for a moment), Josh Comeau’s Flexbox intro explains this minimum size gotcha which I ran into SO MANY times for several years before finally finding an explanation of what was going on.
- I’d love to see more community collections of common gotchas. For bash, shellcheck is an incredible collection of bash gotchas.
One tricky thing about documenting DNS gotchas is that different people are going to run into different gotchas – if you’re just configuring DNS for your personal domain once every 3 years, you’re probably going to run into different gotchas than someone who administrates DNS for a domain with heavy traffic.
A couple more quick reasons:
infrequent exposure
A lot of people only deal with DNS extremely infrequently. And of course if you only touch DNS every 3 years it’s going to be harder to learn!
I think cheat sheets (like “here are the steps to changing your nameservers”) can really help with this.
it’s hard to experiment with
DNS can be scary to experiment with – you don’t want to mess up your domain. We built Mess With DNS to make this one a little easier.
that’s all for now
I’d love to hear other thoughts about what makes DNS (or your favourite mysterious technology) hard to learn.
Lima: a nice way to run Linux VMs on Mac
Hello! Here’s a new entry in the “cool software julia likes” section.
A little while ago I started using a Mac, and one of my biggest
frustrations with it is that often I need to run Linux-specific software. For
example, the nginx playground I
posted about the other day only works on Linux because it uses Linux namespaces (via bubblewrap)
to sandbox nginx. And I’m working on another playground right now that uses bubblewrap too.
This post is very short, it’s just to say that Lima seems nice and much simpler to get started with than Vagrant.
enter Lima!
I was complaining about this to a friend, and they mentioned Lima, which stands for Linux on Mac. I’d heard of colima (another way to run Linux containers on Mac), but I hadn’t realized that Lima also just lets you run VMs.
It was surprisingly simple to set up. I just had to:
- Install Lima (I did nix-env -iA nixpkgs.lima but you can also install it with brew install lima)
- Run limactl start default to start the VM
- Run lima to get a shell
That’s it! By default it mounts your home directory as read-only inside the VM.
There’s a config file in ~/.lima/default/lima.yaml, but I haven’t needed to change it yet.
some nice things about Lima
Some things I appreciate about Lima (as opposed to Vagrant which I’ve used in the past and found kind of frustrating) are:
- it provides a default config
- it automatically downloads an Ubuntu 22.04 image to use in the VM (which is what I would have probably picked anyway)
- it mounts my entire home directory inside the VM, which I really like as a default choice (it feels very seamless)
I think the paradigm of “I have a single chaotic global Linux VM which I use
for all my projects” might work better for me than super carefully configured
per-project VMs. Though I’m sure that you can have carefully configured
per-project VMs with Lima too if you want, I’m just only using the default VM.
problem 1: I don’t know how to mount directories read-write
I wanted to have my entire home directory mounted read-only, but have some
subdirectories (like ~/work/nginx-playground) mounted read-write. I did some
research and here’s what I found:
- a comment on this github issue says that you can use mountType: “virtiofs” and vmType: “vz” to mount subdirectories of your home directory read-write
- the Lima version packaged in nix 23.05 doesn’t seem to support vmType: vz (though I could be wrong about this)
Maybe I’ll figure out how to mount directories read-write later, I’m not too bothered by working around it for now.
problem 2: networking
I’m trying to set up some weird networking stuff (this tun/tap setup)
in Lima and while it appeared to work at first, actually the tun network
device seems to be unreliable in a weird way for reasons I don’t understand.
Another weird Lima networking thing: here’s what gets printed out when I ping a machine:
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
ping: Warning: time of day goes back (-7148662230695168869us), taking countermeasures
ping: Warning: time of day goes back (-7148662230695168680us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms
wrong data byte #16 should be 0x10 but was 0x0
#16 0 6 0 1 6c 55 ad 64 0 0 0 0 72 95 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48 1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-6518721232815721329us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16 0 6 0 2 6d 55 ad 64 0 0 0 0 2f 9d 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48 1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-4844789546316441458us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16 0 6 0 3 6e 55 ad 64 0 0 0 0 69 b3 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48 1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-3834857329877608539us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16 0 6 0 4 6f 55 ad 64 0 0 0 0 6c c0 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48 1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-2395394298978302982us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16 0 6 0 5 70 55 ad 64 0 0 0 0 65 d3 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48 1c 1d 1e 1f 20 21 22 23
This seems to be a known issue with ICMP.
why not use containers?
I wanted a VM and not a Linux container because:
- the playground runs on a VM in production, not in a container, and generally it’s easier to develop in a similar environment to production
- all of my playgrounds use Linux namespaces, and I don’t know how to create a namespace inside a container. Probably you can but I don’t feel like figuring it out and it seems like an unnecessary distraction.
- on Mac you need to run containers inside a Linux VM anyway, so I’d rather use a VM directly and not introduce another unnecessary layer
OrbStack seems nice too
After I wrote this, a bunch of people commented to say that OrbStack is great. I was struggling with the networking in Lima (like I mentioned above) so I tried out OrbStack and the network does seem to be better.
ping acts normally, unlike in Lima:
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=113 time=19.8 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=113 time=15.9 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=113 time=23.1 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=113 time=22.7 ms
The setup steps for OrbStack are:
- Download OrbStack from the website
- In the GUI, create a VM
- Run orb
- That’s it
So it seems equally simple to set up.
that’s all!
Some other notes:
- It looks like Lima works on Linux too
- a bunch of people on Mastodon also said colima (built on top of Lima) is a nice Docker alternative on Mac for running Linux containers
Open sourcing the nginx playground
Hello! In 2021 I released a small playground for testing nginx configurations called nginx playground. There’s a blog post about it here.
This is an extremely short post to say that at the time I didn’t make it open source, but I am making it open source now. It’s not a lot of code but maybe it’ll be interesting to someone, and maybe someone will even build on it to make more playgrounds! I’d love to see an HAProxy playground or something in a similar vein.
Here’s the github repo. The
frontend is in static/ and the backend is in api/. The README is mostly an
extended apology for the developer experience and note that the project is
unmaintained. But I did test that the build instructions work!
why didn’t I open source this before?
I’m not very good at open source. Some of the problems I have with open sourcing things are:
- I dislike (and am very bad at) maintaining open source projects – I usually ignore basically all feature requests and most bug reports and then feel bad about it. I handed off maintainership to both of the open source projects that I started (rbspy and rust-bcc) to other people who are doing a MUCH better job than I ever did.
- Sometimes the developer experience for the project is pretty bad
- Sometimes there’s configuration in the project (like the fly.toml or the analytics I have set up) which doesn’t really make sense for other people to copy
new approach: don’t pretend I’m going to improve it
In the past I’ve had some kind of belief that I’m going to improve the problems with my code later. But I haven’t touched this project in more than a year and I think it’s unlikely I’m going to go back to it unless it breaks in some dramatic way.
So instead of pretending I’m going to improve things, I decided to just:
- tell people in the README that the project is unmaintained
- write down all the security caveats I know about
- test the build instructions I wrote to make sure that they work (on a fresh machine, even!)
- explain (but do not fix!!) some of the messy parts of the project
that’s all!
Maybe I will open source more of my tiny projects in the future, we’ll see! Thanks to Sumana Harihareswara for helping me think through this.
New zine: How Integers and Floats Work
Hello! On Wednesday, we released a new zine: How Integers and Floats Work!
You can get it for $12 here: https://wizardzines.com/zines/integers-floats, or get a 13-pack of all my zines here.
Here’s the cover:
the table of contents
Here’s the table of contents!
Now let’s talk about some of the motivations for writing this zine!
motivation 1: demystify binary
I wrote this zine because I used to find binary data really impenetrable. There are all these 0s and 1s! What does it mean?
But if you look at any binary file format, most of it is integers! For example, if you look at the DNS parsing in Implement DNS in a Weekend, it’s all about encoding and decoding a bunch of integers (plus some ASCII strings, which arguably are also arrays of integers).
So I think that learning how integers work in depth is a really nice way to get started with understanding binary file formats. The zine also talks about some other tricks for encoding binary data into integers with binary operations and bit flags.
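For a taste of what that looks like in practice, here's a little sketch of packing integers into bytes and using bit flags (the specific numbers are just examples):
import struct

# Pack a 2-byte and a 4-byte unsigned integer, big-endian (like the
# integers in a DNS header), then decode them again.
packed = struct.pack(">HI", 513, 70000)
print(packed.hex())                  # 020100011170
print(struct.unpack(">HI", packed))  # (513, 70000)

# Bit flags: several booleans packed into one integer.
READ, WRITE, EXEC = 0b100, 0b010, 0b001
perms = READ | EXEC
print(bool(perms & WRITE))           # False: the WRITE bit isn't set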
motivation 2: explain floating point
The second motivation was to explain floating point. Floating point is pretty weird! (see examples of floating point problems for a very long list)
And almost all explanations of floating point I’ve read have been really math and notation heavy in a way that I find pretty unpleasant and confusing, even though I love math more than most people (I did a pure math degree) and am pretty good at it.
We spent weeks working on a clearer explanation of floating point with minimal math jargon and lots of pictures and I think we got there. Here’s one example page, on the floating point number line:

it comes with a playground: memory spy!
One of my favourite ways to learn about how my computer represents things in memory has been to use a debugger to look at the memory of a real program.
But C debuggers like gdb are pretty hard to use at first! So Marie and I made a playground called Memory Spy. It runs a C debugger behind the scenes, but it provides a much simpler interface – there are a bunch of very simple example C programs, and you can just click on each line to view how the variable on that line is represented in memory.
Here’s a screenshot:
Memory Spy is inspired by Philip Guo’s great Python Tutor.
float.exposed is great
When doing demos and research for this zine, I found myself reaching for float.exposed a lot to show how numbers are encoded in floating point. It’s by Bartosz Ciechanowski, who has tons of other great visualizations on his site.
I loved it so much that I made a clone called integer.exposed for integers (with permission), so that people could look at integers in a similar way.
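If you want to do what float.exposed does in code, here's a sketch that reinterprets a 64-bit float's bytes as an integer and splits out the sign, exponent, and mantissa:
import struct

# Reinterpret the 8 bytes of a double as a 64-bit unsigned integer.
bits = struct.unpack(">Q", struct.pack(">d", 0.1))[0]
print(f"{bits:064b}")

sign = bits >> 63                  # 1 bit
exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
mantissa = bits & ((1 << 52) - 1)  # 52 bits
print(sign, exponent - 1023, hex(mantissa))  # 0 -4 0x999999999999a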
some blog posts I wrote along the way
Here are a few blog posts I wrote while thinking about how to write this zine:
- examples of floating point problems
- examples of problems with integers
- some possible reasons for 8-bit bytes
you can get a print copy shipped to you!
There’s always been the option to print the zines yourself on your home printer.
But this time there’s a new option too: you can get a print copy shipped to you! (just click on the “print version” link on this page)
The only caveat is print orders will ship in August – I need to wait for orders to come in to get an idea of how many I should print before sending it to the printer.
people who helped with this zine
I don’t make these zines by myself!
I worked with Marie LeBlanc Flanagan every morning for 5 months to clarify explanations and build memory spy.
The cover is by Vladimir Kašiković, Gersande La Flèche did copy editing, Dolly Lanuza did editing, another friend did technical review.
Stefan Karpinski gave a talk 10 years ago at the Recurse Center (I even blogged about it at the time) which was the first explanation of floating point that ever made any sense to me. He also explained how signed integers work to me in a Mastodon post a few months ago, when I was in the middle of writing the zine.
And finally, I want to thank all the beta readers – 60 of you read the zine and left comments about what was confusing, what was working, and ideas for how to make it better. It made the end product so much better.
thank you
As always: if you’ve bought zines in the past, thank you for all your support over the years. I couldn’t do this without you.

