Reading List

The most recent articles from a list of feeds I subscribe to.

Behind "Hello World" on Linux

Today I was thinking about – what happens when you run a simple “Hello World” Python program on Linux, like this one?

print("hello world")

Here’s what it looks like at the command line:

$ python3 hello.py
hello world

But behind the scenes, there’s a lot more going on. I’ll describe some of what happens, and (much much more importantly!) explain some tools you can use to see what’s going on behind the scenes yourself. We’ll use readelf, strace, ldd, debugfs, /proc, ltrace, dd, and stat. I won’t talk about the Python-specific parts at all – just what happens when you run any dynamically linked executable.

Here’s a table of contents:

  1. parse “python3 hello.py”
  2. figure out the full path to python3
  3. stat, under the hood
  4. time to fork
  5. the shell calls execve
  6. get the binary’s contents
  7. find the interpreter
  8. dynamic linking
  9. go to _start
  10. write a string

before execve

Before we even start the Python interpreter, there are a lot of things that have to happen. What executable are we even running? Where is it?

1: The shell parses the string python3 hello.py into a command to run and a list of arguments: python3, and ['hello.py']

A bunch of things like glob expansion could happen here. For example, if you run python3 *.py, the shell will expand that into python3 hello.py.
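Here's a rough Python sketch of that parsing step. It's only an approximation: real shells have their own grammar, and shlex and glob are stand-ins from the Python standard library, not what bash or fish actually run. The parse_command name is made up for this example.

```python
import glob
import shlex


def parse_command(line):
    """Split a command line into words, then expand glob patterns.

    A rough imitation of what a shell does, not a real shell parser.
    """
    expanded = []
    for word in shlex.split(line):
        matches = sorted(glob.glob(word))
        # If nothing matches, keep the word as it was typed
        # (this is what bash does by default)
        expanded.extend(matches or [word])
    return expanded[0], expanded[1:]


print(parse_command("python3 hello.py"))  # ('python3', ['hello.py'])
```

One thing this sketch gets wrong on purpose: shlex strips quotes before the glob step, so a quoted "*.py" would still be expanded here, while a real shell would leave it alone.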

2: The shell figures out the full path to python3

Now we know we need to run python3. But what’s the full path to that binary? The way this works is that there’s a special environment variable named PATH.

See for yourself: Run echo $PATH in your shell. For me it looks like this.

$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

When you run a command, the shell will search every directory in that list (in order) to try to find a match.

In fish (my shell), you can see the path resolution logic here. It uses the stat system call to check if files exist.

See for yourself: Run strace -e stat bash, and then run a command like python3. You should see output like this:

stat("/usr/local/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/bin/python3", {st_mode=S_IFREG|0755, st_size=5479736, ...}) = 0

You can see that it finds the binary at /usr/bin/python3 and stops: it doesn’t continue searching /sbin or /bin.

(if this doesn’t work for you, instead try strace -o out bash, and then grep stat out. One reader mentioned that their version of libc uses a different system call instead of stat)
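The search loop is short enough to sketch in Python. This is my approximation of the general idea, not fish's or bash's actual code, and find_in_path is a made-up name. It walks the PATH in order, calls stat on each candidate, and stops at the first executable regular file.

```python
import os
import stat


def find_in_path(command, path=None):
    """Return the first executable regular file named `command` on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(":"):
        # An empty PATH entry means "the current directory"
        candidate = os.path.join(directory or ".", command)
        try:
            info = os.stat(candidate)  # metadata only, the contents are not read
        except OSError:
            continue  # e.g. ENOENT: try the next directory
        if stat.S_ISREG(info.st_mode) and os.access(candidate, os.X_OK):
            return candidate  # stop at the first match
    return None


print(find_in_path("python3"))
```

If you just want the answer and not the loop, shutil.which in the standard library does the same kind of lookup.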

2.1: A note on execvp

If you want to run the same PATH searching logic as the shell does without reimplementing it yourself, you can use the libc function execvp (or one of the other exec* functions with p in the name).

3: stat, under the hood

Now you might be wondering – Julia, what is stat doing? Well, when your OS opens a file, it’s split into 2 steps.

  1. It maps the filename to an inode, which contains metadata about the file
  2. It uses the inode to get the file’s contents

The stat system call just returns the contents of the file’s inode – it doesn’t read the file’s contents at all. The advantage of this is that it’s a lot faster. Let’s go on a short adventure into inodes. (this great post “A disk is a bunch of bits” by Dmitry Mazin has more details)

$ stat /usr/bin/python3
  File: /usr/bin/python3 -> python3.9
  Size: 9         	Blocks: 0          IO Block: 4096   symbolic link
Device: fe01h/65025d	Inode: 6206        Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-08-03 14:17:28.890364214 +0000
Modify: 2021-04-05 12:00:48.000000000 +0000
Change: 2021-06-22 04:22:50.936969560 +0000
 Birth: 2021-06-22 04:22:50.924969237 +0000

See for yourself: Let’s go see where exactly that inode is on our hard drive.

First, we have to find our hard drive’s device name

$ df
...
tmpfs             100016      604     99412   1% /run
/dev/vda1       25630792 14488736  10062712  60% /
...

Looks like it’s /dev/vda1. Next, let’s find out where the inode for /usr/bin/python3 is on our hard drive:

$ sudo debugfs /dev/vda1
debugfs 1.46.2 (28-Feb-2021)
debugfs:  imap /usr/bin/python3
Inode 6206 is part of block group 0
	located at block 658, offset 0x0d00

I have no idea how debugfs is figuring out the location of the inode for that filename, but we’re going to leave that alone.

Now we need to calculate how many bytes into our hard drive “block 658, offset 0x0d00” is, treating the hard drive as one big array of bytes. Each block is 4096 bytes, so we need to go 4096 * 658 + 0x0d00 bytes. A calculator tells me that’s 2698496.
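If you don't have a calculator handy, the same arithmetic in Python looks like this (the variable names are mine, and the 4096-byte block size is the one assumed above):

```python
BLOCK_SIZE = 4096        # bytes per block, as assumed above
block = 658              # from debugfs: "located at block 658"
offset_in_block = 0x0D00 # from debugfs: "offset 0x0d00"

byte_offset = BLOCK_SIZE * block + offset_in_block
print(byte_offset)  # 2698496
```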

$ sudo dd if=/dev/vda1 bs=1 skip=2698496 count=256 2>/dev/null | hexdump -C
00000000  ff a1 00 00 09 00 00 00  f8 b6 cb 64 9a 65 d1 60  |...........d.e.`|
00000010  f0 fb 6a 60 00 00 00 00  00 00 01 00 00 00 00 00  |..j`............|
00000020  00 00 00 00 01 00 00 00  70 79 74 68 6f 6e 33 2e  |........python3.|
00000030  39 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |9...............|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 12 4a 95 8c  00 00 00 00 00 00 00 00  |.....J..........|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 2d cb 00 00  |............-...|
00000080  20 00 bd e7 60 15 64 df  00 00 00 00 d8 84 47 d4  | ...`.d.......G.|
00000090  9a 65 d1 60 54 a4 87 dc  00 00 00 00 00 00 00 00  |.e.`T...........|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Neat! There’s our inode! You can see it says python3 in it, which is a really good sign. We’re not going to go through all of this, but the ext4 inode struct from the Linux kernel says that the first 16 bits are the “mode”, or permissions. So let’s work out how ffa1 corresponds to file permissions.

  • The bytes ffa1 correspond to the number 0xa1ff, or 41471 (because x86 is little endian)
  • 41471 in octal is 0120777
  • This is a bit weird – that file’s permissions could definitely be 777, but what are the first 3 digits? I’m not used to seeing those! You can find out what the 012 means in man inode (scroll down to “The file type and mode”). There’s a little table that says 012 means “symbolic link”.
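The decoding steps above can be checked with Python's stat module, which knows the same file type bits that man inode describes. This is just a sketch to double check the arithmetic; the only input is the two bytes ffa1 from the hexdump.

```python
import stat

raw = bytes.fromhex("ffa1")           # first two bytes of the inode dump
mode = int.from_bytes(raw, "little")  # x86 is little endian

print(hex(mode))                # 0xa1ff
print(mode)                     # 41471
print(oct(mode))                # 0o120777
print(stat.S_ISLNK(mode))       # True: the 012 prefix means symbolic link
print(oct(stat.S_IMODE(mode)))  # 0o777: just the permission bits
print(stat.filemode(mode))      # lrwxrwxrwx
```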

Let’s list the file and see if it is in fact a symbolic link with permissions 777:

$ ls -l /usr/bin/python3
lrwxrwxrwx 1 root root 9 Apr  5  2021 /usr/bin/python3 -> python3.9

It is! Hooray, we decoded it correctly.

4: Time to fork

We’re still not ready to start python3. First, the shell needs to create a new child process to run. The way new processes start on Unix is a little weird – first the process clones itself, and then runs execve, which replaces the cloned process with a new process.

See for yourself: Run strace -e clone bash, then run python3. You should see something like this:

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f03788f1a10) = 3708100

3708100 is the PID of the new process, which is a child of the shell process.
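Here's a minimal Python sketch of the clone-then-exec pattern, using os.fork and os.execvp (the same execvp mentioned earlier, so it also does the PATH search). The run_in_child name is made up, and this is a simplified picture of what a shell does: a real shell also deals with job control, signals, redirections, and so on.

```python
import os
import sys


def run_in_child(argv):
    """Fork, exec `argv` in the child, and return the child's exit code."""
    pid = os.fork()  # the process clones itself
    if pid == 0:
        # Child: replace this copy of the process with the new program
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if the exec failed
    # Parent: `pid` is the child's PID, like clone's return value above
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # the child was killed by a signal


if __name__ == "__main__":
    print(run_in_child([sys.executable, "-c", 'print("hello world")']))
```

The exit code 127 for a failed exec is a convention borrowed from shells, not something the kernel requires.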

Some more tools to look at what’s going on with processes:

  • pstree will show you a tree of all the processes on your system
  • cat /proc/PID/stat shows you some information about the process. The contents of that file are documented in man proc. For example the 4th field is the parent PID.
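As a small example of reading /proc/PID/stat, here's a Python sketch that pulls out the parent PID. The parent_pid function is my own name, and this only works on Linux. The one tricky part is that the 2nd field (the command name) is wrapped in parentheses and can contain spaces, so a plain split on whitespace can miscount the fields.

```python
import os


def parent_pid(pid):
    """Read the parent PID (4th field) from /proc/PID/stat. Linux only."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Skip past the command name by splitting only what follows the last ")"
    after_comm = data[data.rindex(")") + 1:].split()
    # after_comm[0] is field 3 (state), after_comm[1] is field 4 (ppid)
    return int(after_comm[1])


print(parent_pid(os.getpid()), os.getppid())
```

Both numbers printed should be the same, since os.getppid asks the kernel the same question directly.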

4.1: What the new process inherits.

The new process (which will become python3) has inherited a bunch of things from the shell. For example, it’s inherited:

  1. environment variables: you can look at them with cat /proc/PID/environ | tr '\0' '\n'
  2. file descriptors for stdout and stderr: look at them with ls -l /proc/PID/fd
  3. a working directory (whatever the current directory is)
  4. namespaces and cgroups (if it’s in a container)
  5. the user and group that’s running it
  6. probably more things I’m not thinking of right now
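Here's a sketch of reading the first two items of that list out of /proc from Python instead of with cat and ls. The function names are made up for this example and it's Linux only. One caveat: /proc/PID/environ shows the environment the process was started with, so later changes made inside the process may not show up there.

```python
import os


def initial_environ(pid="self"):
    """Parse /proc/PID/environ (NUL-separated KEY=VALUE pairs). Linux only."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        raw = f.read()
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[os.fsdecode(key)] = os.fsdecode(value)
    return env


def open_fds(pid="self"):
    """Map each open file descriptor number to what it points at."""
    fd_dir = f"/proc/{pid}/fd"
    fds = {}
    for name in os.listdir(fd_dir):
        try:
            fds[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            pass  # the descriptor was closed while we were looking
    return fds


print(open_fds())
print(os.readlink("/proc/self/cwd"))  # the working directory
```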

5: The shell calls execve

Now we’re ready to start the Python interpreter!

See for yourself: Run strace -f -e execve bash, then run python3. The -f is important because we want to follow any forked child subprocesses. You should see something like this:

[pid 3708381] execve("/usr/bin/python3", ["python3"], 0x560397748300 /* 21 vars */) = 0

The first argument is the binary, and the second argument is the list of command line arguments. The command line arguments get placed in a special location in the program’s memory so that it can access them when it runs.

Now, what’s going on inside execve?

6: get the binary’s contents

The first thing that has to happen is that we need to open the python3 binary file and read its contents. So far we’ve only used the stat system call to access its metadata, but now we need its contents.

Let’s look at the output of stat again:

$ stat /usr/bin/python3
  File: /usr/bin/python3 -> python3.9
  Size: 9         	Blocks: 0          IO Block: 4096   symbolic link
Device: fe01h/65025d	Inode: 6206        Links: 1
...

This takes up 0 blocks of space on the disk. This is because the contents of the symbolic link (python3.9) are actually in the inode itself: you can see them here (from the binary contents of the inode above, it’s split across 2 lines in the hexdump output):

00000020  00 00 00 00 01 00 00 00  70 79 74 68 6f 6e 33 2e  |........python3.|
00000030  39 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |9...............|

So we’ll need to open /usr/bin/python3.9 instead. All of this is happening inside the kernel so you won’t see another system call for that.

Every file is made up of a bunch of blocks on the hard drive. I think each of these blocks on my system is 4096 bytes, so the minimum size of a file is 4096 bytes – even if the file is only 5 bytes, it still takes up 4KB on disk.
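You can check this on your own system with a quick Python sketch: write a 5-byte file and compare its size to the space allocated for it. I'm not showing expected numbers because they depend on the filesystem: on many setups you'd see 4096 bytes allocated, but some filesystems store very small files inline and report less.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello")
    f.flush()
    os.fsync(f.fileno())  # make sure the data has actually been allocated
    info = os.stat(f.name)

size_in_bytes = info.st_size
# st_blocks is counted in 512-byte units, whatever the filesystem block size is
allocated_bytes = info.st_blocks * 512

print("size:", size_in_bytes)
print("allocated:", allocated_bytes)
print("preferred I/O block size:", info.st_blksize)
```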

See for yourself: We can find the block numbers using debugfs like this: (again, I got these instructions from Dmitry Mazin’s “A disk is a bunch of bits” post)

$ debugfs /dev/vda1
debugfs:  blocks /usr/bin/python3.9
145408 145409 145410 145411 145412 145413 145414 145415 145416 145417 145418 145419 145420 145421 145422 145423 145424 145425 145426 145427 145428 145429 145430 145431 145432 145433 145434 145435 145436 145437

Now we can use dd to read the first block of the file. We’ll set the block size to 4096 bytes, skip 145408 blocks, and read 1 block.

$ dd if=/dev/vda1 bs=4096 skip=145408 count=1 2>/dev/null | hexdump -C | head
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 3e 00 01 00 00 00  c0 a5 5e 00 00 00 00 00  |..>.......^.....|
00000020  40 00 00 00 00 00 00 00  b8 95 53 00 00 00 00 00  |@.........S.....|
00000030  00 00 00 00 40 00 38 00  0b 00 40 00 1e 00 1d 00  |....@.8...@.....|
00000040  06 00 00 00 04 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000050  40 00 40 00 00 00 00 00  40 00 40 00 00 00 00 00  |@.@.....@.@.....|
00000060  68 02 00 00 00 00 00 00  68 02 00 00 00 00 00 00  |h.......h.......|
00000070  08 00 00 00 00 00 00 00  03 00 00 00 04 00 00 00  |................|
00000080  a8 02 00 00 00 00 00 00  a8 02 40 00 00 00 00 00  |..........@.....|
00000090  a8 02 40 00 00 00 00 00  1c 00 00 00 00 00 00 00  |..@.............|

You can see that we get the exact same output as if we read the file with cat, like this:

$ cat /usr/bin/python3.9 | hexdump -C | head
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 3e 00 01 00 00 00  c0 a5 5e 00 00 00 00 00  |..>.......^.....|
00000020  40 00 00 00 00 00 00 00  b8 95 53 00 00 00 00 00  |@.........S.....|
00000030  00 00 00 00 40 00 38 00  0b 00 40 00 1e 00 1d 00  |....@.8...@.....|
00000040  06 00 00 00 04 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000050  40 00 40 00 00 00 00 00  40 00 40 00 00 00 00 00  |@.@.....@.@.....|
00000060  68 02 00 00 00 00 00 00  68 02 00 00 00 00 00 00  |h.......h.......|
00000070  08 00 00 00 00 00 00 00  03 00 00 00 04 00 00 00  |................|
00000080  a8 02 00 00 00 00 00 00  a8 02 40 00 00 00 00 00  |..........@.....|
00000090  a8 02 40 00 00 00 00 00  1c 00 00 00 00 00 00 00  |..@.............|

an aside on magic numbers

This file starts with the bytes 7f 45 4c 46 (.ELF in the hexdump), which is a “magic number”, or a byte sequence that tells us that this is an ELF file. ELF is the binary file format on Linux.

Different file formats have different magic numbers, for example the magic number for gzip is 1f8b. The magic number at the beginning is how file blah.gz knows that it’s a gzip file.

I think file has a variety of heuristics for figuring out the file type of a file, not just magic numbers, but the magic number is an important one.
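Checking a magic number is simple enough to sketch in a few lines of Python. This toy version only knows the two formats mentioned here, and sniff is a made-up name. The real file command has a much bigger database and more heuristics.

```python
import gzip

MAGIC_NUMBERS = {
    b"\x7fELF": "ELF",
    b"\x1f\x8b": "gzip",
}


def sniff(data):
    """Guess a file type from its leading bytes, or return None."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None


print(sniff(bytes.fromhex("7f454c4602010100")))  # ELF
print(sniff(gzip.compress(b"hello world")))      # gzip
print(sniff(b"hello world"))                     # None
```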

7: find the interpreter

Let’s parse the ELF file to see what’s in there.

See for yourself: Run readelf -a /usr/bin/python3.9. Here’s what I get (though I’ve redacted a LOT of stuff):

$ readelf -a /usr/bin/python3.9
ELF Header:
    Class:                             ELF64
    Machine:                           Advanced Micro Devices X86-64
...
->  Entry point address:               0x5ea5c0
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
  INTERP         0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
                 0x000000000000001c 0x000000000000001c  R      0x1
->      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
        ...
->        1238: 00000000005ea5c0    43 FUNC    GLOBAL DEFAULT   13 _start

Here’s what I understand of what’s going on here:

  1. it’s telling the kernel to run /lib64/ld-linux-x86-64.so.2 to start this program. This is called the dynamic linker and we’ll talk about it next
  2. it’s specifying an entry point (at 0x5ea5c0, which is where this program’s code starts)
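You can also find the entry point by hand, without readelf. Here's a Python sketch that unpacks the start of the ELF header, using the first 32 bytes of /usr/bin/python3.9 copied from the hexdump earlier. It assumes a 64-bit little-endian ELF file, which is what this one is.

```python
import struct

# The first 32 bytes of the file, 4 bytes per string, from the hexdump
header = bytes.fromhex(
    "7f454c46" "02010100" "00000000" "00000000"
    "02003e00" "01000000" "c0a55e00" "00000000"
)

# After the 16-byte e_ident come e_type, e_machine, e_version and e_entry.
# "<" means little endian with no padding: two u16, one u32, one u64.
e_type, e_machine, e_version, e_entry = struct.unpack_from("<HHIQ", header, 16)

print(header[:4])    # b'\x7fELF'
print(header[4])     # 2, which means ELF64
print(e_machine)     # 62, which is x86-64
print(hex(e_entry))  # 0x5ea5c0
```

That last number matches the "Entry point address" line in the readelf output.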

Now let’s talk about the dynamic linker.

8: dynamic linking

Okay! We’ve read the bytes from disk and we’ve started this “interpreter” thing. What next? Well, if you run strace -o out.strace python3, you’ll see a bunch of stuff like this right after the execve system call:

execve("/usr/bin/python3", ["python3"], 0x560af13472f0 /* 21 vars */) = 0
brk(NULL)                       = 0xfcc000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=32091, ...}) = 0
mmap(NULL, 32091, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f718a1e3000
close(3)                        = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 l\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=149520, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f718a1e1000
...
close(3)                        = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3

This all looks a bit intimidating at first, but the part I want you to pay attention to is openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0". This is opening a C threading library called pthread that the Python interpreter needs to run.

See for yourself: If you want to know which libraries a binary needs to load at runtime, you can use ldd. Here’s what that looks like for me:

$ ldd /usr/bin/python3.9
	linux-vdso.so.1 (0x00007ffc2aad7000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2fd6554000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2fd654e000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
	libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)

You can see that the first library listed is /lib/x86_64-linux-gnu/libpthread.so.0, which is why it was loaded first.

on LD_LIBRARY_PATH

I’m honestly still a little confused about dynamic linking. Some things I know:

  • Dynamic linking happens in userspace and the dynamic linker on my system is at /lib64/ld-linux-x86-64.so.2. If you’re missing the dynamic linker, you can end up with weird bugs like this weird “file not found” error
  • The dynamic linker uses the LD_LIBRARY_PATH environment variable to find libraries
  • The dynamic linker will also use the LD_PRELOAD environment variable to override any dynamically linked function you want (you can use this for fun hacks, or to replace your default memory allocator with an alternative one like jemalloc)
  • there are some mprotects in the strace output which are marking the library code as read-only, for security reasons
  • on Mac, it’s DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH

You might be wondering – if dynamic linking happens in userspace, why don’t we see a bunch of stat system calls where it’s searching through LD_LIBRARY_PATH for the libraries, the way we did when bash was searching the PATH?

That’s because ld has a cache in /etc/ld.so.cache, and all of those libraries have already been found in the past. You can see it opening the cache in the strace output – openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3.

There are still a bunch of system calls after dynamic linking in the full strace output that I still don’t really understand (what’s prlimit64 doing? where does the locale stuff come in? what’s gconv-modules.cache? what’s rt_sigaction doing? what’s arch_prctl? what’s set_tid_address and set_robust_list?). But this feels like a good start.

aside: ldd is actually a simple shell script!

Someone on mastodon pointed out that ldd is actually a shell script that just sets the LD_TRACE_LOADED_OBJECTS=1 environment variable and starts the program. So you can do exactly the same thing like this:

$ LD_TRACE_LOADED_OBJECTS=1 python3
	linux-vdso.so.1 (0x00007ffe13b0a000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f01a5a47000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f01a5a41000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
	libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)

Apparently ld is also a binary you can just run, so /lib64/ld-linux-x86-64.so.2 --list /usr/bin/python3.9 also does the same thing.
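Here's the same environment variable trick from Python, as a sketch. The trace_loaded_objects name is made up. It assumes a glibc-style dynamic linker that understands LD_TRACE_LOADED_OBJECTS. If the binary is statically linked or uses a different linker, the program may simply run normally, which is why stdin is closed and there's a timeout. For the same reason, don't point this at a binary you don't trust.

```python
import os
import subprocess
import sys


def trace_loaded_objects(binary):
    """Ask the dynamic linker to list a binary's libraries instead of running it."""
    env = dict(os.environ, LD_TRACE_LOADED_OBJECTS="1")
    result = subprocess.run(
        [binary],
        env=env,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout


print(trace_loaded_objects(sys.executable))
```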

on init and fini

Let’s talk about this line in the strace output:

set_tid_address(0x7f58880dca10)         = 3709103

This seems to have something to do with threading, and I think this might be happening because the pthread library (and every other dynamically loaded library) gets to run initialization code when it’s loaded. The code that runs when the library is loaded is in the .init section (or maybe also the .ctors section).

See for yourself: Let’s take a look at that using readelf:

$ readelf -a /lib/x86_64-linux-gnu/libpthread.so.0
...
  [10] .rela.plt         RELA             00000000000051f0  000051f0
       00000000000007f8  0000000000000018  AI       4    26     8
  [11] .init             PROGBITS         0000000000006000  00006000
       000000000000000e  0000000000000000  AX       0     0     4
  [12] .plt              PROGBITS         0000000000006010  00006010
       0000000000000560  0000000000000010  AX       0     0     16
...

This library doesn’t have a .ctors section, just an .init. But what’s in that .init section? We can use objdump to disassemble the code:

$ objdump -d /lib/x86_64-linux-gnu/libpthread.so.0
Disassembly of section .init:

0000000000006000 <_init>:
    6000:       48 83 ec 08             sub    $0x8,%rsp
    6004:       e8 57 08 00 00          callq  6860 <__pthread_initialize_minimal>
    6009:       48 83 c4 08             add    $0x8,%rsp
    600d:       c3

So it’s calling __pthread_initialize_minimal. I found the code for that function in glibc, though I had to find an older version of glibc because it looks like in more recent versions libpthread is no longer a separate library.

I’m not sure whether this set_tid_address system call actually comes from __pthread_initialize_minimal, but at least we’ve learned that libraries can run code on startup through the .init section.

Here’s a note from man elf on the .init section:

$ man elf
 .init  This section holds executable instructions that contribute to the process initialization code.  When a program starts to run
              the system arranges to execute the code in this section before calling the main program entry point.

There’s also a .fini section in the ELF file that runs at the end, and .ctors / .dtors (constructors and destructors) are other sections that could exist.

Okay, that’s enough about dynamic linking.

9: go to _start

After dynamic linking is done, we go to _start in the Python interpreter. Then it does all the normal Python interpreter things you’d expect.

I’m not going to talk about this because here I’m interested in general facts about how binaries are run on Linux, not the Python interpreter specifically.

10: write a string

We still need to print out “hello world” though. Under the hood, the Python print function calls some function from libc. But which one? Let’s find out!

See for yourself: Run ltrace -o out python3 hello.py.

$ ltrace -o out python3 hello.py
$ grep hello out
write(1, "hello world\n", 12) = 12

So it looks like it’s calling write

I honestly am always a little suspicious of ltrace – unlike strace (which I would trust with my life), I’m never totally sure that ltrace is actually reporting library calls accurately. But in this case it seems to be working. And if we look at the cpython source code, it does seem to be calling write() in some places. So I’m willing to believe that.
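You can make the same call yourself from Python with os.write, which skips print and Python's output buffering. This is a small sketch, not what print does internally. The return value is the number of bytes written, which lines up with the 12 in the ltrace output.

```python
import os

message = b"hello world\n"
written = os.write(1, message)  # 1 is the file descriptor for stdout
print(written)  # 12
```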

what’s libc?

We just said that Python calls the write function from libc. What’s libc? It’s the C standard library, and it’s responsible for a lot of basic things like:

  • allocating memory with malloc
  • file I/O (opening/closing files)
  • executing programs (with execvp, like we mentioned before)
  • looking up DNS records with getaddrinfo
  • managing threads with pthread

Programs don’t have to use libc (on Linux, Go famously doesn’t use it and calls Linux system calls directly instead), but most other programming languages I use (node, Python, Ruby, Rust) all use libc. I’m not sure about Java.

You can find out if you’re using libc by running ldd on your binary: if you see something like libc.so.6, that’s libc.

why does libc matter?

You might be wondering – why does it matter that Python calls the libc write and then libc calls the write system call? Why am I making a point of saying that libc is in the middle?

I think in this case it doesn’t really matter (AFAIK the write libc function maps pretty directly to the write system call).
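To see that libc's write is just a regular function you can call, here's a sketch using ctypes. It assumes a Linux system where the Python interpreter is dynamically linked against libc, so that ctypes.CDLL(None) can find the write symbol among the libraries already loaded into the process.

```python
import ctypes

# Passing None gives a handle to the symbols already loaded in this process
libc = ctypes.CDLL(None, use_errno=True)

# Describe the C signature: ssize_t write(int fd, const void *buf, size_t count)
libc.write.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t]
libc.write.restype = ctypes.c_ssize_t

message = b"hello world\n"
written = libc.write(1, message, len(message))
print(written)  # 12
```

If you run this under strace, you should see a write system call with the same arguments that libc's write was given.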

But there are different libc implementations, and sometimes they behave differently. The two main ones are glibc (GNU libc) and musl libc.

For example, until recently musl’s getaddrinfo didn’t support TCP DNS. Here’s a blog post talking about a bug that that caused.

a little detour into stdout and terminals

In this program, stdout (the 1 file descriptor) is a terminal. And you can do funny things with terminals! Here’s one:

  1. In a terminal, run ls -l /proc/self/fd/1. I get /dev/pts/2
  2. In another terminal window, write echo hello > /dev/pts/2
  3. Go back to the original terminal window. You should see hello printed there!
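A program can check what its own stdout is attached to. Here's a Python sketch of that (describe_fd is a made-up name, and the /proc fallback is Linux only). In a terminal it should print something like /dev/pts/2. If you pipe the output into another program, you'll see a pipe instead.

```python
import os


def describe_fd(fd):
    """Return the terminal device for `fd`, or whatever else it points at."""
    if os.isatty(fd):
        return os.ttyname(fd)  # e.g. /dev/pts/2
    return os.readlink(f"/proc/self/fd/{fd}")  # Linux only


print(describe_fd(1))
```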

that’s all for now!

Hopefully you have a better idea of how hello world gets printed! I’m going to stop adding more details for now because this is already pretty long, but obviously there’s more to say and I might add more if folks chip in with extra details. I’d especially love suggestions for other tools you could use to inspect parts of the process that I haven’t explained here.

Thanks to everyone who suggested corrections / additions – I’ve edited this blog post a lot to incorporate more things :)

Some things I’d like to add if I can figure out how to spy on them:

  • the kernel loader and ASLR (I haven’t figured out yet how to use bpftrace + kprobes to trace the kernel loader’s actions)
  • TTYs (I haven’t figured out how to trace the way write(1, "hello world\n", 12) gets sent to the TTY that I’m looking at)

I’d love to see a Mac version of this

One of my frustrations with Mac OS is that I don’t know how to introspect my system on this level – when I print hello world, I can’t figure out how to spy on what’s going on behind the scenes the way I can on Linux. I’d love to see a really in depth explainer.

Some Mac equivalents I know about:

  • ldd -> otool -L
  • readelf -> otool
  • supposedly you can use dtruss or dtrace on mac instead of strace but I’ve never been brave enough to turn off system integrity protection to get it to work
  • strace -> sc_usage seems to be able to collect stats about syscall usage, and fs_usage about file usage

Why is DNS still hard to learn?

I write a lot about technologies that I found hard to learn about. A while back my friend Sumana asked me an interesting question – why are these things so hard to learn about? Why do they seem so mysterious?

For example, take DNS. We’ve been using DNS since the 80s (for more than 35 years!). It’s used in every website on the internet. And it’s pretty stable – in a lot of ways, it works the exact same way it did 30 years ago.

But it took me YEARS to figure out how to confidently debug DNS issues, and I’ve seen a lot of other programmers struggle with debugging DNS problems as well. So what’s going on?

Here are a couple of thoughts about why learning to troubleshoot DNS problems is hard.

(I’m not going to explain DNS very much in this post, see Implement DNS in a Weekend or my DNS blog posts for more about how DNS works)

it’s not because DNS is super hard

When I finally learned how to troubleshoot DNS problems, my reaction was “what, that was it???? that’s not that hard!“. I felt a little bit cheated! I could explain to you everything that I found confusing about DNS in a few hours.

So – if DNS is not all that complicated, why did it take me so many years to figure out how to troubleshoot pretty basic DNS issues (like “my domain doesn’t resolve even though I’ve set it up correctly” or “dig and my browser have different DNS results, why?“)?

And I wasn’t alone in finding DNS hard to learn! I’ve talked to a lot of smart friends who are very experienced programmers about DNS over the years, and many of them either:

  • didn’t feel comfortable making simple DNS changes to their websites
  • or were confused about basic facts about how DNS works (like that records are pulled and not pushed)
  • or did understand DNS basics pretty well, but had some of the same knowledge gaps that I’d struggled with (negative caching and the details of how dig and your browser do DNS queries differently)

So if we’re all struggling with the same things about DNS, what’s going on? Why is it so hard to learn for so many people?

Here are some ideas.

a lot of the system is hidden

When you make a DNS request on your computer, the basic story is:

  1. your computer makes a request to a server called a resolver
  2. the resolver checks its cache, and makes requests to some other servers called authoritative nameservers

Here are some things you don’t see:

  • the resolver’s cache. What’s in there?
  • which library code on your computer is making the DNS request (is it libc getaddrinfo? if so, is it the getaddrinfo from glibc, or musl, or apple? is it your browser’s DNS code? is it a different custom DNS implementation?). All of these options behave slightly differently and have different configuration, approaches to caching, available features, etc. For example musl DNS didn’t support TCP until early 2023.
  • the conversation between the resolver and the authoritative nameservers. I think a lot of DNS issues would be SO simple to understand if you could magically get a trace of exactly which authoritative nameservers were queried downstream during your request, and what they said. (like, what if you could run dig +debug google.com and it gave you a bunch of extra debugging information?)
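As one example of a "hidden" DNS library: when a Python program looks up a hostname, it normally goes through socket.getaddrinfo, which on Linux usually ends up in libc's getaddrinfo. Here's a sketch (lookup is a made-up name). Notice how little you get back: just addresses, with nothing about which resolver was asked or whether the answer was cached.

```python
import socket


def lookup(hostname):
    """Resolve a hostname via getaddrinfo and return the unique addresses."""
    results = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first item of sockaddr.
    return sorted({sockaddr[0] for *_, sockaddr in results})


print(lookup("localhost"))
```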

dealing with hidden systems

A couple of ideas for how to deal with hidden systems

  • just teaching people what the hidden systems are makes a huge difference. For a long time I had no idea that my computer had many different DNS libraries that were used in different situations and I was confused about this for literally years. This is a big part of my approach.
  • with Mess With DNS we tried out this “fishbowl” approach where it shows you some parts of the system (the conversation with the resolver and the authoritative nameserver) that are normally hidden
  • I feel like it would be extremely cool to extend DNS to include a “debugging information” section. (edit: it looks like this already exists! It’s called Extended DNS Errors, or EDE, and tools are slowly adding support for it.)

Extended DNS Errors seem cool

Extended DNS Errors are a new way for DNS servers to provide extra debugging information in a DNS response. Here’s an example of what that looks like:

$ dig @8.8.8.8 xjwudh.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 39830
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; EDE: 12 (NSEC Missing): (Invalid denial of existence of xjwudh.com/a)
;; QUESTION SECTION:
;xjwudh.com.			IN	A

;; AUTHORITY SECTION:
com.			900	IN	SOA	a.gtld-servers.net. nstld.verisign-grs.com. 1690634120 1800 900 604800 86400

;; Query time: 92 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)
;; WHEN: Sat Jul 29 08:35:45 EDT 2023
;; MSG SIZE  rcvd: 161

Here I’ve requested a nonexistent domain, and I got the extended error EDE: 12 (NSEC Missing): (Invalid denial of existence of xjwudh.com/a). I’m not sure what that means (it’s some DNSSEC Thing), but it’s cool to see an extra debug message like that.

I did have to install a newer version of dig to get the above to work.

confusing tools

Even though a lot of DNS stuff is hidden, there are a lot of ways to figure out what’s going on by using dig.

For example, you can use dig +norecurse to figure out if a given DNS resolver has a particular record in its cache. 8.8.8.8 seems to return a SERVFAIL response if the response isn’t cached.

here’s what that looks like for google.com

$ dig +norecurse  @8.8.8.8 google.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11653
;; flags: qr ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		21	IN	A	172.217.4.206

;; Query time: 57 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Jul 28 10:50:45 EDT 2023
;; MSG SIZE  rcvd: 55

and for homestarrunner.com:

$ dig +norecurse  @8.8.8.8 homestarrunner.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 55777
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;homestarrunner.com.		IN	A

;; Query time: 52 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Jul 28 10:51:01 EDT 2023
;; MSG SIZE  rcvd: 47

Here you can see we got a normal NOERROR response for google.com (which is in 8.8.8.8’s cache) but a SERVFAIL for homestarrunner.com (which isn’t). This doesn’t mean there’s no DNS record for homestarrunner.com (there is!), it’s just not cached.
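What +norecurse changes in the query itself is a single bit: the "recursion desired" (rd) flag in the DNS header, which is why the flags in these responses say qr ra and not qr rd ra. Here's a Python sketch that builds the bytes of a query with and without that bit. It's a simplified query: the function names are mine, and it leaves out the EDNS OPT record that dig also sends.

```python
import struct


def encode_name(domain):
    """Encode "google.com" as length-prefixed labels ending in a zero byte."""
    encoded = b""
    for label in domain.rstrip(".").split("."):
        encoded += bytes([len(label)]) + label.encode("ascii")
    return encoded + b"\x00"


def build_query(domain, query_id, recursion_desired=True):
    """Build the bytes of a DNS query for an A record."""
    flags = 0x0100 if recursion_desired else 0x0000  # rd is the 0x0100 bit
    # Header: id, flags, 1 question, then 0 answer/authority/additional records
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    question = encode_name(domain) + struct.pack("!HH", 1, 1)  # type A, class IN
    return header + question


normal = build_query("google.com", query_id=11653)
norecurse = build_query("google.com", query_id=11653, recursion_desired=False)
print(normal.hex())
print(norecurse.hex())
```

The two outputs differ only in the flags field (bytes 2 and 3): 0100 with recursion desired and 0000 without.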

But this output is really confusing to read if you’re not used to it! Here are a few things that I think are weird about it:

  1. the headings are weird (there’s ->>HEADER<<-, flags:, OPT PSEUDOSECTION:, QUESTION SECTION:, ANSWER SECTION:)
  2. the spacing is weird (why is there no newline between OPT PSEUDOSECTION and QUESTION SECTION?)
  3. MSG SIZE rcvd: 47 is weird (are there other fields in MSG SIZE other than rcvd? what are they?)
  4. it says that there’s 1 record in the ADDITIONAL section but doesn’t show it, you have to somehow magically know that the “OPT PSEUDOSECTION” record is actually in the additional section

In general dig’s output has the feeling of a script someone wrote in an ad hoc way that grew organically over time and not something that was intentionally designed.

dealing with confusing tools

some ideas for improving on confusing tools:

  • explain the output. For example I wrote how to use dig explaining how dig’s output works and how to configure it to give you a shorter output by default
  • make new, more friendly tools. For example for DNS there’s dog and doggo and my dns lookup tool. I think these are really cool but personally I don’t use them because sometimes I want to do something a little more advanced (like using +norecurse) and as far as I can tell neither dog nor doggo supports +norecurse. I’d rather use 1 tool for everything, so I stick to dig. Replacing the breadth of functionality of dig is a huge undertaking.
  • make dig’s output a little more friendly. If I were better at C programming, I might try to write a dig pull request that adds a +human flag to dig that formats the long form output in a more structured and readable way, maybe something like this:
$ dig +human +norecurse  @8.8.8.8 google.com 
HEADER:
  opcode: QUERY
  status: NOERROR
  id: 11653
  flags: qr ra
  records: QUESTION: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

QUESTION SECTION:
  google.com.			IN	A

ANSWER SECTION:
  google.com.		21	IN	A	172.217.4.206
  
ADDITIONAL SECTION:
  EDNS: version: 0, flags:; udp: 512

EXTRA INFO:
  Time: Fri Jul 28 10:51:01 EDT 2023
  Elapsed: 52 msec
  Server: 8.8.8.8:53
  Protocol: UDP
  Response size: 47 bytes

This makes the structure of the DNS response more clear – there’s the header, the question, the answer, and the additional section.
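As a rough sketch of what the HEADER part of that mockup involves, here's some Python that decodes the 12-byte DNS header and prints it in the same layout. The function name format_header and the lookup tables are my own for illustration (this isn't dig's code), and it only handles the header, not the other sections.

```python
import struct

# Names for the most common response codes (the "status" dig prints)
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}
OPCODES = {0: "QUERY"}
# (name, bit position) for each one-bit flag in the 16-bit flags field
FLAG_BITS = [("qr", 15), ("aa", 10), ("tc", 9), ("rd", 8), ("ra", 7), ("ad", 5), ("cd", 4)]

def format_header(message: bytes) -> str:
    """Format the 12-byte DNS header in the style of the +human mockup."""
    msg_id, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    opcode = (flags >> 11) & 0xF   # 4 bits
    rcode = flags & 0xF            # low 4 bits
    set_flags = [name for name, bit in FLAG_BITS if flags & (1 << bit)]
    return "\n".join([
        "HEADER:",
        f"  opcode: {OPCODES.get(opcode, opcode)}",
        f"  status: {RCODES.get(rcode, rcode)}",
        f"  id: {msg_id}",
        f"  flags: {' '.join(set_flags)}",
        f"  records: QUESTION: {qd}, ANSWER: {an}, AUTHORITY: {ns}, ADDITIONAL: {ar}",
    ])

# A hand-built header with the same id, flags and counts as the google.com response
example = struct.pack("!HHHHHH", 11653, 0x8080, 1, 1, 0, 1)
print(format_header(example))
```

The point is that the information is all in a fixed, well-defined structure already, so presenting it in a structured way is mostly a formatting choice.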

And it’s not “dumbed down” or anything! It’s the exact same information, just formatted in a more structured way. My biggest frustration with alternative DNS tools is that they often remove information in the name of clarity. And though there’s definitely a place for those tools, I want to see all the information! I just want it to be presented clearly.

We’ve learned a lot about how to design more user friendly command line tools in the last 40 years and I think it would be cool to apply some of that knowledge to some of our older crustier tools.

dig +yaml

One quick note on dig: newer versions of dig do have a +yaml output format which feels a little clearer to me, though it’s too verbose for my taste (a pretty simple DNS response doesn’t fit on my screen)

weird gotchas

DNS has some weird stuff that’s relatively common to run into, but pretty hard to learn about if nobody tells you what’s going on. A few examples (there are more in some ways DNS can break):

  • negative caching! (which I talk about in this talk) It took me probably 5 years to realize that I shouldn’t visit a domain that doesn’t have a DNS record yet, because then the nonexistence of that record will be cached, and it gets cached for HOURS, and it’s really annoying.
  • differences in getaddrinfo implementations: until early 2023, musl didn’t support TCP DNS
  • resolvers that ignore TTLs: if you set a TTL on your DNS records (like “5 minutes”), some resolvers will ignore those TTLs completely and cache the records for longer, like maybe 24 hours instead
  • if you configure nginx wrong (like this), it’ll cache DNS records forever.
  • how ndots can make your Kubernetes DNS slow
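To illustrate the negative caching gotcha from the list above, here's a toy Python model of a resolver cache. It's a sketch under big assumptions: ToyResolverCache is a made-up class, the 3 hour TTL is just an illustrative number (real resolvers derive the negative TTL from the zone's SOA record), and real caches are much more involved.

```python
import time

class ToyResolverCache:
    """A toy cache that remembers both answers and 'no such record' results."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (expires_at, address or None)

    def store(self, name, address, ttl):
        # address=None means "this record does not exist" (a negative answer)
        self._entries[name] = (self._clock() + ttl, address)

    def lookup(self, name):
        """Return ('hit', address), ('negative-hit', None) or ('miss', None)."""
        entry = self._entries.get(name)
        if entry is None or entry[0] <= self._clock():
            # Nothing cached (or it expired): a real resolver would ask upstream
            self._entries.pop(name, None)
            return ("miss", None)
        _, address = entry
        return ("hit", address) if address is not None else ("negative-hit", None)

# Use a fake clock so the example doesn't need to wait for hours
now = [0.0]
cache = ToyResolverCache(clock=lambda: now[0])

# Visiting the domain before the record exists caches the nonexistence
cache.store("new.example.com", None, ttl=3 * 3600)

now[0] += 600  # 10 minutes later the record has been created upstream...
print(cache.lookup("new.example.com"))  # ('negative-hit', None)

now[0] += 3 * 3600  # only after the negative TTL runs out...
print(cache.lookup("new.example.com"))  # ('miss', None)
```

Once the negative entry is in the cache, creating the record upstream doesn't help until the entry expires, which is why it's better not to look up a name before its record exists.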

dealing with weird gotchas

I don’t have as good answers here as I would like to, but knowledge about weird gotchas is extremely hard won (again, it took me years to figure out negative caching!) and it feels very silly to me that people have to rediscover them for themselves over and over and over again.

A few ideas:

  • It’s incredibly helpful when people call out gotchas when explaining a topic. For example (leaving DNS for a moment), Josh Comeau’s Flexbox intro explains this minimum size gotcha which I ran into SO MANY times for several years before finally finding an explanation of what was going on.
  • I’d love to see more community collections of common gotchas. For bash, shellcheck is an incredible collection of bash gotchas.

One tricky thing about documenting DNS gotchas is that different people are going to run into different gotchas – if you’re just configuring DNS for your personal domain once every 3 years, you’re probably going to run into different gotchas than someone who administrates DNS for a domain with heavy traffic.

A couple more quick reasons:

infrequent exposure

A lot of people only deal with DNS extremely infrequently. And of course if you only touch DNS every 3 years it’s going to be harder to learn!

I think cheat sheets (like “here are the steps to changing your nameservers”) can really help with this.

it’s hard to experiment with

DNS can be scary to experiment with – you don’t want to mess up your domain. We built Mess With DNS to make this one a little easier.

that’s all for now

I’d love to hear other thoughts about what makes DNS (or your favourite mysterious technology) hard to learn.

Lima: a nice way to run Linux VMs on Mac

Hello! Here’s a new entry in the “cool software julia likes” section.

A little while ago I started using a Mac, and one of my biggest frustrations with it is that often I need to run Linux-specific software. For example, the nginx playground I posted about the other day only works on Linux because it uses Linux namespaces (via bubblewrap) to sandbox nginx. And I’m working on another playground right now that uses bubblewrap too.

This post is very short, it’s just to say that Lima seems nice and much simpler to get started with than Vagrant.

enter Lima!

I was complaining about this to a friend, and they mentioned Lima, which stands for Linux on Mac. I’d heard of colima (another way to run Linux containers on Mac), but I hadn’t realized that Lima also just lets you run VMs.

It was surprisingly simple to set up. I just had to:

  1. Install Lima (I did nix-env -iA nixpkgs.lima but you can also install it with brew install lima)
  2. Run limactl start default to start the VM
  3. Run lima to get a shell

That’s it! By default it mounts your home directory as read-only inside the VM.

There’s a config file in ~/.lima/default/lima.yaml, but I haven’t needed to change it yet.

some nice things about Lima

Some things I appreciate about Lima (as opposed to Vagrant which I’ve used in the past and found kind of frustrating) are:

  1. it provides a default config
  2. it automatically downloads an Ubuntu 22.04 image to use in the VM (which is what I would have probably picked anyway)
  3. it mounts my entire home directory inside the VM, which I really like as a default choice (it feels very seamless)

I think the paradigm of “I have a single chaotic global Linux VM which I use for all my projects” might work better for me than super carefully configured per-project VMs. Though I’m sure that you can have carefully configured per-project VMs with Lima too if you want, I’m just only using the default VM.

problem 1: I don’t know how to mount directories read-write

I wanted to have my entire home directory mounted read-only, but have some subdirectories (like ~/work/nginx-playground) mounted read-write. I did some research and here’s what I found:

Maybe I’ll figure out how to mount directories read-write later, I’m not too bothered by working around it for now.

problem 2: networking

I’m trying to set up some weird networking stuff (this tun/tap setup) in Lima and while it appeared to work at first, actually the tun network device seems to be unreliable in a weird way for reasons I don’t understand.

Another weird Lima networking thing: here’s what gets printed out when I ping a machine:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
ping: Warning: time of day goes back (-7148662230695168869us), taking countermeasures
ping: Warning: time of day goes back (-7148662230695168680us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms
wrong data byte #16 should be 0x10 but was 0x0
#16	0 6 0 1 6c 55 ad 64 0 0 0 0 72 95 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48	1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-6518721232815721329us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16	0 6 0 2 6d 55 ad 64 0 0 0 0 2f 9d 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48	1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-4844789546316441458us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16	0 6 0 3 6e 55 ad 64 0 0 0 0 69 b3 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48	1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-3834857329877608539us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16	0 6 0 4 6f 55 ad 64 0 0 0 0 6c c0 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48	1c 1d 1e 1f 20 21 22 23
ping: Warning: time of day goes back (-2395394298978302982us), taking countermeasures
64 bytes from 8.8.8.8: icmp_seq=0 ttl=255 time=0.000 ms (DUP!)
wrong data byte #16 should be 0x10 but was 0x0
#16	0 6 0 5 70 55 ad 64 0 0 0 0 65 d3 9 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 1a 1b
#48	1c 1d 1e 1f 20 21 22 23

This seems to be a known issue with ICMP.

why not use containers?

I wanted a VM and not a Linux container because:

  1. the playground runs on a VM in production, not in a container, and generally it’s easier to develop in a similar environment to production
  2. all of my playgrounds use Linux namespaces, and I don’t know how to create a namespace inside a container. Probably you can but I don’t feel like figuring it out and it seems like an unnecessary distraction.
  3. on Mac you need to run containers inside a Linux VM anyway, so I’d rather use a VM directly and not introduce another unnecessary layer

OrbStack seems nice too

After I wrote this, a bunch of people commented to say that OrbStack is great. I was struggling with the networking in Lima (like I mentioned above) so I tried out OrbStack and the network does seem to be better.

ping acts normally, unlike in Lima:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=113 time=19.8 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=113 time=15.9 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=113 time=23.1 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=113 time=22.7 ms

The setup steps for OrbStack are:

  1. Download OrbStack from the website
  2. In the GUI, create a VM
  3. Run orb
  4. That’s it

So it seems equally simple to set up.

that’s all!

Some other notes:

  • It looks like Lima works on Linux too
  • a bunch of people on Mastodon also said colima (built on top of Lima) is a nice Docker alternative on Mac for running Linux containers

Open sourcing the nginx playground

Hello! In 2021 I released a small playground for testing nginx configurations called nginx playground. There’s a blog post about it here.

This is an extremely short post to say that at the time I didn’t make it open source, but I am making it open source now. It’s not a lot of code but maybe it’ll be interesting to someone, and maybe someone will even build on it to make more playgrounds! I’d love to see an HAProxy playground or something in a similar vein.

Here’s the github repo. The frontend is in static/ and the backend is in api/. The README is mostly an extended apology for the developer experience and a note that the project is unmaintained. But I did test that the build instructions work!

why didn’t I open source this before?

I’m not very good at open source. Some of the problems I have with open sourcing things are:

  • I dislike (and am very bad at) maintaining open source projects – I usually ignore basically all feature requests and most bug reports and then feel bad about it. I handed off maintainership of both of the open source projects that I started (rbspy and rust-bcc) to other people who are doing a MUCH better job than I ever did.
  • Sometimes the developer experience for the project is pretty bad
  • Sometimes there’s configuration in the project (like the fly.toml or the analytics I have set up) which doesn’t really make sense for other people to copy

new approach: don’t pretend I’m going to improve it

In the past I’ve had some kind of belief that I’m going to improve the problems with my code later. But I haven’t touched this project in more than a year and I think it’s unlikely I’m going to go back to it unless it breaks in some dramatic way.

So instead of pretending I’m going to improve things, I decided to just:

  • tell people in the README that the project is unmaintained
  • write down all the security caveats I know about
  • test the build instructions I wrote to make sure that they work (on a fresh machine, even!)
  • explain (but do not fix!!) some of the messy parts of the project

that’s all!

Maybe I will open source more of my tiny projects in the future, we’ll see! Thanks to Sumana Harihareswara for helping me think through this.

New zine: How Integers and Floats Work

Hello! On Wednesday, we released a new zine: How Integers and Floats Work!

You can get it for $12 here: https://wizardzines.com/zines/integers-floats, or get a 13-pack of all my zines here.

Here’s the cover:

the table of contents

Here’s the table of contents!

Now let’s talk about some of the motivations for writing this zine!

motivation 1: demystify binary

I wrote this zine because I used to find binary data really impenetrable. There are all these 0s and 1s! What does it mean?

But if you look at any binary file format, most of it is integers! For example, if you look at the DNS parsing in Implement DNS in a Weekend, it’s all about encoding and decoding a bunch of integers (plus some ASCII strings, which arguably are also arrays of integers).

So I think that learning how integers work in depth is a really nice way to get started with understanding binary file formats. The zine also talks about some other tricks for encoding binary data into integers with binary operations and bit flags.
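Here's a small Python sketch of those ideas: the same integer as bytes in two byte orders, how signedness changes what the bytes mean, and packing values into one byte with bit operations. The READ/WRITE/EXEC flag names are made up for illustration and aren't from the zine.

```python
# The same integer as the 2 bytes you'd see in a binary file
n = 11653
print(n.to_bytes(2, "big").hex())     # 2d85
print(n.to_bytes(2, "little").hex())  # 852d

# The same bytes mean different numbers depending on signedness
raw = b"\xff\xfe"
print(int.from_bytes(raw, "big", signed=False))  # 65534
print(int.from_bytes(raw, "big", signed=True))   # -2

# Bit flags: several yes/no values stored in one integer (hypothetical flags)
READ, WRITE, EXEC = 0b001, 0b010, 0b100
perms = READ | EXEC          # turn on two flags
print(bool(perms & WRITE))   # False
print(bool(perms & EXEC))    # True

# Two 4-bit fields packed into one byte, like the first byte of an IPv4
# header (version 4, header length 5)
version, header_len = 4, 5
packed = (version << 4) | header_len
print(hex(packed))                 # 0x45
print(packed >> 4, packed & 0x0F)  # 4 5
```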

motivation 2: explain floating point

The second motivation was to explain floating point. Floating point is pretty weird! (see examples of floating point problems for a very long list)

And almost all explanations of floating point I’ve read have been really math and notation heavy in a way that I find pretty unpleasant and confusing, even though I love math more than most people (I did a pure math degree) and am pretty good at it.

We spent weeks working on a clearer explanation of floating point with minimal math jargon and lots of pictures and I think we got there. Here’s one example page, on the floating point number line:

it comes with a playground: memory spy!

One of my favourite ways to learn about how my computer represents things in memory has been to use a debugger to look at the memory of a real program.

But C debuggers like gdb are pretty hard to use at first! So Marie and I made a playground called Memory Spy. It runs a C debugger behind the scenes, but it provides a much simpler interface – there are a bunch of very simple example C programs, and you can just click on each line to view how the variable on that line is represented in memory.

Here’s a screenshot:

Memory Spy is inspired by Philip Guo’s great Python Tutor.

float.exposed is great

When doing demos and research for this zine, I found myself reaching for float.exposed a lot to show how numbers are encoded in floating point. It’s by Bartosz Ciechanowski, who has tons of other great visualizations on his site.

I loved it so much that I made a clone called integer.exposed for integers (with permission), so that people could look at integers in a similar way.
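If you want to poke at the same encoding from code, here's a minimal Python sketch that splits a 64-bit float into the sign, exponent, and fraction fields. float_fields is a hypothetical helper name, and this is just one way to get at the bits, not how float.exposed works internally.

```python
import struct

def float_fields(x: float):
    """Split a 64-bit float into its sign, exponent and fraction bits."""
    # Reinterpret the 8 bytes of the float as one unsigned 64-bit integer
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, fraction

sign, exponent, fraction = float_fields(0.1)
print(sign, exponent - 1023)  # 0 -4
print(hex(fraction))

# 0.1 can't be stored exactly, which is where surprises like this come from
print(0.1 + 0.2)  # 0.30000000000000004
```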

some blog posts I wrote along the way

Here are a few blog posts I wrote while thinking about how to write this zine:

you can get a print copy shipped to you!

There’s always been the option to print the zines yourself on your home printer.

But this time there’s a new option too: you can get a print copy shipped to you! (just click on the “print version” link on this page)

The only caveat is print orders will ship in August – I need to wait for orders to come in to get an idea of how many I should print before sending it to the printer.

people who helped with this zine

I don’t make these zines by myself!

I worked with Marie LeBlanc Flanagan every morning for 5 months to clarify explanations and build memory spy.

The cover is by Vladimir Kašiković, Gersande La Flèche did copy editing, Dolly Lanuza did editing, and another friend did technical review.

Stefan Karpinski gave a talk 10 years ago at the Recurse Center (I even blogged about it at the time) which was the first explanation of floating point that ever made any sense to me. He also explained how signed integers work to me in a Mastodon post a few months ago, when I was in the middle of writing the zine.

And finally, I want to thank all the beta readers – 60 of you read the zine and left comments about what was confusing, what was working, and ideas for how to make it better. It made the end product so much better.

thank you

As always: if you’ve bought zines in the past, thank you for all your support over the years. I couldn’t do this without you.