Reading List

The most recent articles from a list of feeds I subscribe to.

In a git repository, where do your files live?

Hello! I was talking to a friend about how git works today, and we got onto the topic – where does git store your files? We know that it’s in your .git directory, but where exactly in there are all the versions of your old files?

For example, this blog is in a git repository, and it contains a file called content/post/2019-06-28-brag-doc.markdown. Where is that in my .git folder? And where are the old versions of that file? Let’s investigate by writing some very short Python programs.

git stores files in .git/objects

Every previous version of every file in your repository is in .git/objects. For example, for this blog, .git/objects contains about 2700 files.

$ find .git/objects/ -type f | wc -l
2761

note: .git/objects actually has more information than “every previous version of every file in your repository”, but we’re not going to get into that just yet

Here’s a very short Python program (find-git-object.py) that finds out where any given file is stored in .git/objects.

import hashlib
import sys


def object_path(content):
    header = f"blob {len(content)}\0"
    data = header.encode() + content
    digest = hashlib.sha1(data).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"


with open(sys.argv[1], "rb") as f:
    print(object_path(f.read()))

What this does is:

  • read the contents of the file
  • calculate a header (blob 16673\0) and combine it with the contents
  • calculate the sha1 sum (8ae33121a9af82dd99d6d706d037204251d41d54 in this case)
  • translate that sha1 sum into a path (.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54)

We can run it like this:

$ python3 find-git-object.py content/post/2019-06-28-brag-doc.markdown
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54

jargon: “content addressed storage”

The term for this storage strategy (where the filename of an object in the database is the same as the hash of the file’s contents) is “content addressed storage”.

One neat thing about content addressed storage is that if I have two files (or 50 files!) with the exact same contents, that doesn’t take up any extra space in Git’s database – if the hash of the contents is aabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, they’ll both be stored in .git/objects/aa/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb.
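You can check that with the object_path function from find-git-object.py above – hashing the same bytes always gives the same path (a tiny illustration with made-up contents):

# two identical files hash to the same object path, so git
# only needs to store one copy
a = object_path(b"exact same contents\n")
b = object_path(b"exact same contents\n")
print(a == b)  # True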

how are those objects encoded?

If I try to look at this file in .git/objects, it gets a bit weird:

$ cat .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
x^A<8D><9B>}s<E3>Ƒ<C6><EF>o|<8A>^Q<9D><EC>ju<92><E8><DD>\<9C><9C>*<89>j<FD>^...

What’s going on? Let’s run file on it:

$ file .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54: zlib compressed data

It’s just compressed! We can write another little Python program called decompress.py that uses the zlib module to decompress the data:

import zlib
import sys

with open(sys.argv[1], "rb") as f:
    content = f.read()
    print(zlib.decompress(content).decode())

Now let’s decompress it:

$ python3 decompress.py .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54 
blob 16673---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... the entire blog post ...

So this data is encoded in a pretty simple way: there’s this blob 16673\0 thing, and then the full contents of the file.
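If you wanted to pull those two parts apart in Python, it’s one split on the first null byte – here’s a little sketch that extends decompress.py:

import sys
import zlib

with open(sys.argv[1], "rb") as f:
    data = zlib.decompress(f.read())

# a git object is "<type> <size>\0<contents>"
header, contents = data.split(b"\0", 1)
object_type, size = header.decode().split(" ")
print(object_type, size, len(contents))  # for our blob: "blob 16673 16673"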

there aren’t any diffs

One thing that surprised me the first time I learned it: there aren’t any diffs here! That file is the 9th version of that blog post, but the version git stores in .git/objects is the whole file, not the diff from the previous version.

Git actually sometimes also does store files as diffs (when you run git gc it can combine multiple different files into a “packfile” for efficiency), but I have never needed to think about that in my life so we’re not going to get into it. Aditya Mukerjee has a great post called Unpacking Git packfiles about how the format works.

what about older versions of the blog post?

Now you might be wondering – if there are 8 previous versions of that blog post (before I fixed some typos), where are they in the .git/objects directory? How do we find them?

First, let’s find every commit where that file changed with git log:

$ git log --oneline  content/post/2019-06-28-brag-doc.markdown
c6d4db2d
423cd76a
7e91d7d0
f105905a
b6d23643
998a46dd
67a26b04
d9999f17
026c0f52
72442b67

Now let’s pick a previous commit, let’s say 026c0f52. Commits are also stored in .git/objects, and we can try to look at it there. But the commit isn’t there! ls .git/objects/02/6c* doesn’t have any results! You know how we mentioned “sometimes git packs objects to save space but we don’t need to worry about it”? I guess now is the time that we need to worry about it.

So let’s take care of that.

let’s unpack some objects

So we need to unpack the objects from the pack files. I looked it up on Stack Overflow and apparently you can do it like this:

$ mv .git/objects/pack/pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack .
$ git unpack-objects < pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack

This is weird repository surgery so it’s a bit alarming, but I can always just clone the repository from GitHub again if I mess it up, so I wasn’t too worried.

After unpacking all the object files, we end up with way more objects: about 20000 instead of about 2700. Neat.

$ find .git/objects/ -type f | wc -l
20138

back to looking at a commit

Now we can go back to looking at our commit 026c0f52. You know how we said that not everything in .git/objects is a file? Some of them are commits! And to figure out where the old version of our post content/post/2019-06-28-brag-doc.markdown is stored, we need to dig pretty deep into this commit.

The first step is to look at the commit in .git/objects.

commit step 1: look at the commit

The commit 026c0f52 is now in .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4 after doing some unpacking, and we can look at it like this:

$ python3 decompress.py .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4
commit 211tree 01832a9109ab738dac78ee4e95024c74b9b71c27
parent 72442b67590ae1fcbfe05883a351d822454e3826
author Julia Evans <julia@jvns.ca> 1561998673 -0400
committer Julia Evans <julia@jvns.ca> 1561998673 -0400

brag doc

We can also get the same information with git cat-file -p 026c0f52, which does the same thing but does a better job of formatting the data. (the -p option means “format it nicely please”)

commit step 2: look at the tree

This commit has a tree. What’s that? Well let’s take a look. The tree’s ID is 01832a9109ab738dac78ee4e95024c74b9b71c27, and we can use our decompress.py script from earlier to look at that git object. (though I had to remove the .decode() to get the script to not crash)

$ python3 decompress.py .git/objects/01/832a9109ab738dac78ee4e95024c74b9b71c27
b'tree 396\x00100644 .gitignore\x00\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\x8e:h\xad100644 README.md\x00~\xba\xec\xb3\x11\xa0^\x1c\xa9\xa4?\x1e\xb9\x0f\x1cfG\x96\x0b

This is formatted in kind of an unreadable way. The main display issue here is that the object hashes (\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\x8e:h\xad…) are raw bytes instead of being encoded in hexadecimal. So we see \xc3\xf7`$8\x9b\x8d instead of c3f76024389b8d. Let’s switch over to using git cat-file -p which formats the data in a friendlier way, because I don’t feel like writing a parser for that.

$ git cat-file -p 01832a9109ab738dac78ee4e95024c74b9b71c27
100644 blob c3f76024389b8d4f192f18b77d7cc7ce8e3a68ad	.gitignore
100644 blob 7ebaecb311a05e1ca9a43f1eb90f1c6647960bc1	README.md
100644 blob 0f21dc9bf1a73afc89634bac586271384e24b2c9	Rakefile
100644 blob 00b9d54abd71119737d33ee5d29d81ebdcea5a37	config.yaml
040000 tree 61ad34108a327a163cdd66fa1a86342dcef4518e	content <-- this is where we're going next
040000 tree 6d8543e9eeba67748ded7b5f88b781016200db6f	layouts
100644 blob 22a321a88157293c81e4ddcfef4844c6c698c26f	mystery.rb
040000 tree 8157dc84a37fca4cb13e1257f37a7dd35cfe391e	scripts
040000 tree 84fe9c4cb9cef83e78e90a7fbf33a9a799d7be60	static
040000 tree 34fd3aa2625ba784bced4a95db6154806ae1d9ee	themes

This is showing us all of the files I had in the root directory of the repository as of that commit. Looks like I accidentally committed some file called mystery.rb at some point which I later removed.
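(By the way, if you did feel like writing that tree parser, the format isn’t too bad – each entry is a mode, a space, a filename, a null byte, and then the 20-byte binary hash. Here’s a quick sketch, lightly tested at best:)

import sys
import zlib

def parse_tree(data):
    # a tree object is "tree <size>\0" followed by entries of
    # "<mode> <name>\0" + 20 bytes of binary sha1
    header, body = data.split(b"\0", 1)
    while body:
        mode_and_name, body = body.split(b"\0", 1)
        mode, name = mode_and_name.split(b" ", 1)
        sha1, body = body[:20], body[20:]
        print(mode.decode(), sha1.hex(), name.decode())

with open(sys.argv[1], "rb") as f:
    parse_tree(zlib.decompress(f.read()))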

Our file is in the content directory, so let’s look at that tree: 61ad34108a327a163cdd66fa1a86342dcef4518e

commit step 3: yet another tree

$ git cat-file -p 61ad34108a327a163cdd66fa1a86342dcef4518e

040000 tree 1168078878f9d500ea4e7462a9cd29cbdf4f9a56	about
100644 blob e06d03f28d58982a5b8282a61c4d3cd5ca793005	newsletter.markdown
040000 tree 1f94b8103ca9b6714614614ed79254feb1d9676c	post <-- where we're going next!
100644 blob 2d7d22581e64ef9077455d834d18c209a8f05302	profiler-project.markdown
040000 tree 06bd3cee1ed46cf403d9d5a201232af5697527bb	projects
040000 tree 65e9357973f0cc60bedaa511489a9c2eeab73c29	talks
040000 tree 8a9d561d536b955209def58f5255fc7fe9523efd	zines

Still not done…

commit step 4: one more tree….

The file we’re looking for is in the post/ directory, so there’s one more tree:

$ git cat-file -p 1f94b8103ca9b6714614614ed79254feb1d9676c	
.... MANY MANY lines omitted ...
100644 blob 170da7b0e607c4fd6fb4e921d76307397ab89c1e	2019-02-17-organizing-this-blog-into-categories.markdown
100644 blob 7d4f27e9804e3dc80ab3a3912b4f1c890c4d2432	2019-03-15-new-zine--bite-size-networking-.markdown
100644 blob 0d1b9fbc7896e47da6166e9386347f9ff58856aa	2019-03-26-what-are-monoidal-categories.markdown
100644 blob d6949755c3dadbc6fcbdd20cc0d919809d754e56	2019-06-23-a-few-debugging-resources.markdown
100644 blob 3105bdd067f7db16436d2ea85463755c8a772046	2019-06-28-brag-doc.markdown <-- found it!!!!!

Here the 2019-06-28-brag-doc.markdown is the last file listed because it was the most recent blog post when it was published.

commit step 5: we made it!

Finally we have found the object file where a previous version of my blog post lives! Hooray! It has the hash 3105bdd067f7db16436d2ea85463755c8a772046, so it’s in .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046.

We can look at it with decompress.py

$ python3 decompress.py .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046 | head
blob 15924---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... rest of the contents of the file here ...

This is the old version of the post! If I ran git checkout 026c0f52 content/post/2019-06-28-brag-doc.markdown or git restore --source 026c0f52 content/post/2019-06-28-brag-doc.markdown, that’s what I’d get.

this tree traversal is how git log works

This whole process we just went through (find the commit, go through the various directory trees, search for the filename we wanted) seems kind of long and complicated but this is actually what’s happening behind the scenes when we run git log content/post/2019-06-28-brag-doc.markdown. It needs to go through every single commit in your history, check the version (for example 3105bdd067f7db16436d2ea85463755c8a772046 in this case) of content/post/2019-06-28-brag-doc.markdown, and see if it changed from the previous commit.
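Here’s a rough sketch of that work in Python, using git rev-parse to ask for the blob hash of the file at each commit (this is just to illustrate the amount of work involved – it’s not literally how git implements it, and the helper here is my own invention):

import subprocess

def blob_hash(commit, path):
    # "git rev-parse COMMIT:PATH" prints the hash of the blob for PATH
    # as of COMMIT, and fails if the file didn't exist yet
    result = subprocess.run(
        ["git", "rev-parse", f"{commit}:{path}"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()

path = "content/post/2019-06-28-brag-doc.markdown"
commits = subprocess.check_output(["git", "rev-list", "HEAD"], text=True).split()

previous = None
for commit in reversed(commits):  # oldest commit first
    current = blob_hash(commit, path)
    if current != previous:
        print(commit[:8])  # the file changed in this commit
    previous = current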

That’s why git log FILENAME is a little slow sometimes – I have 3000 commits in this repository and it needs to do a bunch of work for every single commit to figure out if the file changed in that commit or not.

how many previous versions of files do I have?

Right now I have 1530 files tracked in my blog repository:

$ git ls-files | wc -l
1530

But how many historical files are there? We can list everything in .git/objects to see how many object files there are:

$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | wc -l
20135

Not all of these represent previous versions of files though – as we saw before, lots of them are commits and directory trees. But we can write another little Python script called find-blobs.py that goes through all of the objects and checks if it starts with blob or not:

import zlib
import sys

# read object IDs (like "8ae331...") from stdin, one per line, and
# print the ones whose decompressed contents start with "blob"
for line in sys.stdin:
    line = line.strip()
    filename = f".git/objects/{line[0:2]}/{line[2:]}"
    with open(filename, "rb") as f:
        contents = zlib.decompress(f.read())
        if contents.startswith(b"blob"):
            print(line)

$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | python3 find-blobs.py | wc -l
6713

So it looks like there are 6713 - 1530 = 5183 old versions of files lying around in my git repository that git is keeping around for me in case I ever want to get them back. How nice!

that’s all!

Here’s the gist with all the code for this post. There’s not very much.

I thought I already knew how git worked, but I’d never really thought about pack files before so this was a fun exploration. I also don’t spend too much time thinking about how much work git log is actually doing when I ask it to track the history of a file, so that was fun to dig into.

As a funny postscript: as soon as I committed this blog post, git got mad about how many objects I had in my repository (I guess 20,000 is too many!) and ran git gc to compress them all into packfiles. So now my .git/objects directory is very small:

$ find .git/objects/ -type f | wc -l
14

Notes on using a single-person Mastodon server

I started using Mastodon back in November, and it’s the Twitter alternative where I’ve been spending most of my time recently, mostly because the Fediverse is where a lot of the Linux nerds seem to be right now.

I’ve found Mastodon quite a bit more confusing than Twitter because it’s a distributed system, so here are a few technical things I’ve learned about it over the last 10 months. I’ll mostly talk about what using a single-person server has been like for me, as well as a couple of notes about the API, DMs and ActivityPub.

I might have made some mistakes, please let me know if I’ve gotten anything wrong!

what’s a mastodon instance?

First: Mastodon is a decentralized collection of independently run servers instead of One Big Server. The software is open source.

In general, if you have an account on one server (like ruby.social), you can follow people on another server (like hachyderm.io), and they can follow you.

I’m going to use the terms “Mastodon server” and “Mastodon instance” interchangeably in this post.

on choosing a Mastodon instance

These were the things I was concerned about when choosing an instance:

  1. An instance name that I was comfortable being part of my online identity. For example, I probably wouldn’t want to be @b0rk@infosec.exchange because I’m not an infosec person.
  2. The server’s stability. Most servers are volunteer-run, and volunteer moderation work can be exhausting – will the server really be around in a few years? For example mastodon.technology and mastodon.lol shut down.
  3. The admins’ moderation policies.
  4. That server’s general reputation with other servers. I started out on mastodon.social, but some servers choose to block or limit mastodon.social for various reasons.
  5. The community: every Mastodon instance has a local timeline with all posts from users on that instance – would I be interested in reading the local timeline?
  6. Whether my account would be a burden for the admin of that server (since I have a lot of followers).

In the end, I chose to run my own Mastodon server because it seemed simplest – I could pick a domain I liked, and I knew I’d definitely agree with the moderation decisions because I’d be in charge.

I’m not going to give server recommendations here, but here’s a list of the top 200 most common servers people who follow me use.

using your own domain

One big thing I wondered was – can I use my own domain (and have the username @b0rk@jvns.ca or something) but be on someone else’s Mastodon server?

The answer to this seems to be basically “no”: if you want to use your own domain on Mastodon, you need to run your own server. (you can kind of do this, but it’s more like an alias or redirect – if I used that method to direct b0rk@jvns.ca to b0rk@mastodon.social, my posts would still show up as being from b0rk@mastodon.social)
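(My understanding is that this alias/redirect method works via WebFinger: when someone searches for an account, their server fetches a JSON document from the account’s domain, and you can serve a static file there that points at your real account. You can see what that document looks like for any account – example.com and you are placeholders here:)

$ curl "https://example.com/.well-known/webfinger?resource=acct:you@example.com"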

There’s also other ActivityPub software (Takahē) that supports people bringing their own domain in a first-class way.

notes on having my own server

I really wanted to have a way to use my own domain name for identity, but to share server hosting costs with other people. This isn’t possible on Mastodon right now, so I decided to set up my own server instead.

I chose to run a Mastodon server (instead of some other ActivityPub implementation) because Mastodon is the most popular one. Good managed Mastodon hosting is readily available, there are tons of options for client apps, and I know for sure that my server will work well with other people’s servers.

I use masto.host for Mastodon hosting, and it’s been great so far. I have nothing interesting to say about what it’s like to operate a Mastodon instance because I know literally nothing about it. Masto.host handles all of the server administration and Mastodon updates, and I never think about it at all.

Right now I’m on their $19/month (“Star”) plan, but it’s possible I could use a smaller plan with no problems. Right now their cheapest plan is $6/month and I expect that would be fine for someone with a smaller account.

Some things I was worried about when embarking on my own Mastodon server:

  • I wanted to run the server at social.jvns.ca, but I wanted my username to be b0rk@jvns.ca instead of b0rk@social.jvns.ca. To get this to work I followed these Setting up a personal fediverse ID directions from Jacob Kaplan-Moss and it’s been fine.
  • The administration burden of running my own server. I imported a small list of servers to block/defederate from but didn’t do anything else. That’s been fine.
  • Reply and profile visibility. This has been annoying, and we’ll talk about it next.

downsides to being on a single-person server

Being on a 1-person server has some significant downsides. To understand why, you need to understand a little about how Mastodon works.

Every Mastodon server has a database of posts. A server’s database only contains posts that were explicitly sent to it by another server.

Some reasons that servers might receive posts:

  • someone on the server follows the user who posted
  • a post mentions someone on the server

As a 1-person server, my server does not receive that many posts! I only get posts from people I follow or posts that explicitly mention me in some way.

This causes several problems:

  1. when I visit the profile of someone on Mastodon who I don’t already follow, my server will not fetch the profile’s content (it’ll fetch their profile picture, description, and pinned posts, but not any of their post history). So their profile appears as if they’ve never posted anything.
  2. bad reply visibility: when I look at the replies to somebody else’s post (even if I follow them!), I don’t see all of the replies, only the ones which have made it to my server. If you want to understand the exact rules about who can see which replies (which are quite complicated!), here’s a great deep dive by Sebastian Jambor. I think it’s possible to end up in a state where no one person can see all of the replies, including the original poster.
  3. favourite and boost counts are inaccurate – usually posts show up having at most 1 or 2 favourites / boosts, even if the post was actually favourited or boosted hundreds of times. I think this is because it only counts favourites/boosts from people I follow.

All of these things will happen to users of any small Mastodon server, not just 1-person servers.

bad reply visibility makes conversations harder

A lot of people are on smaller servers, so when they’re participating in a conversation, they can’t see all the replies to the post.

This means that replies can get pretty repetitive because people literally cannot see each other’s replies. This is especially annoying for posts that are popular or controversial, because the person who made the post has to keep reading similar replies over and over again by people who think they’re making the point for the first time.

To get around this (as a reader), you can click “open link to post” or something in your Mastodon client, which will open up the page on the poster’s server where you can read all of the replies. It’s pretty annoying though.

As a poster, I’ve tried to reduce repetitiveness in replies by:

  • putting requests in my posts like “(no need to reply if you don’t remember, or if you’ve been using the command line comfortably for 15 years — this question isn’t for you :) )”
  • occasionally editing my posts to include very common replies
  • very occasionally deleting the post if it gets too out of hand

The Mastodon devs are extremely aware of these issues; there are a bunch of GitHub issues about them:

My guess is that there are technical reasons these features are difficult to add – those issues have been open for 5-7 years.

The Mastodon devs have said that they plan to improve reply fetching, but that it requires a significant amount of work.

some visibility workarounds

Some people have built workarounds for fetching profiles / replies.

Also, there are a couple of Mastodon clients which will proactively fetch replies. For iOS:

  • Mammoth does it automatically
  • Mona will fetch posts if I click “load from remote server” manually

I haven’t tried those yet though.

other downsides of running your own server: discovery is much harder

Mastodon instances have a “local timeline” where you can see everything other people on the server are posting, and a “federated timeline” which shows sort of a combined feed from everyone followed by anyone on the server. This means that you can see trending posts and get an idea of what’s going on and find people to follow. You don’t get that if you’re on a 1-person server – it’s just me talking to myself! (plus occasional interjections from my reruns bot).

Some workarounds people mentioned for this:

  • you can populate your federated timeline with posts from another instance by using a relay. I haven’t done this but someone else said they use FediBuzz and I might try it out.
  • some mastodon clients (like apparently Moshidon on Android) let you follow other instances

If anyone else on small servers has suggestions for how to make discovery easier I’d love to hear them.

account migration

When I moved to my own server from mastodon.social, I needed to run an account migration to move over my followers. First, here’s how migration works:

  1. Account migration does not move over your posts. All of my posts stayed on my old account. This is part of why I moved to running my own server – I didn’t want to ever lose my posts a second time.
  2. Account migration does not move over the list of people you follow/mute/block. But you can import/export that list in your Mastodon settings so it’s not a big deal. If you follow private accounts they’ll have to re-approve your follow request.
  3. Account migration does move over your followers

The follower move was the part I was most worried about. Here’s how it turned out:

  • over ~24 hours, most of my followers moved to the new account
  • one or two servers did not get the message about the account migration for some reason, so about 2000 followers were “stuck” and didn’t migrate. I fixed this by waiting 30 days and re-running the account migration, which moved over most of the remaining followers. There’s also a tootctl command that the admin of the old instance can run to retry the migration
  • about 200 of my followers never migrated over, I think because they’re using ActivityPub software other than Mastodon which doesn’t support account migration. You can see the old account here

using the Mastodon API is great

One thing I love about Mastodon is – it has an API that’s MUCH easier to use than Twitter’s API. I’ve always been frustrated with how difficult it is to navigate large Twitter threads, so I made a small mastodon thread view website that lets you log into your Mastodon account. It’s pretty janky and it’s really only made for me to use, but I’ve really appreciated the ability to write my own janky software to improve my Mastodon experience.

Some notes on the Mastodon API:
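One note: lots of read-only endpoints don’t need authentication at all – something like this should show the most recent public post on my server (anything fancier generally needs an OAuth token):

$ curl "https://social.jvns.ca/api/v1/timelines/public?limit=1"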

Next I’ll talk about a few general things about Mastodon that confused or surprised me that aren’t specific to being on a single-person instance.

DMs are weird

The way Mastodon DMs work surprised me in a few ways:

  • Technically DMs are just regular posts with visibility limited to the people mentioned in the post. This means that if you accidentally mention someone in a DM (“@x is such a jerk”), it’s possible to accidentally send the message to them
  • DMs aren’t very private: the admins on the sending and receiving servers can technically read your DMs if they have access to the database. So they’re not appropriate for sensitive information.
  • Turning off DMs is weird. Personally I don’t like receiving DMs from strangers – it’s too much to keep track of and I’d prefer that people email me. On Twitter, I can just turn it off and people won’t see an option to DM me. But on Mastodon, when I turn off notifications for DMs, anyone can still “DM” me, but the message will go into a black hole and I’ll never see it. I put a note in my profile about this.

defederation and limiting

There are a couple of different ways for a server to block another Mastodon server. I haven’t really had to do this much but people talk about it a lot and I was confused about the difference, so:

  • A server can defederate from another server (this seems to be called suspend in the Mastodon docs). This means that nobody on a server can follow someone from the other server.
  • A server can limit (also known as “silence”) a user or server. This means that content from that user is only visible to that user’s followers – people can’t discover the user through retweets (aka “boosts” on Mastodon).

One thing that wasn’t obvious to me is that the list of servers a given server defederates from or limits is sometimes hidden, so it’s hard to suss out what’s going on if you’re considering joining a server, or trying to understand why you can’t see certain posts.

there’s no search for posts

There’s no way to search past posts you’ve read. If I see something interesting on my timeline and want to find it later, I usually can’t. (Mastodon has an Elasticsearch-based search feature, but it only allows you to search your own posts, your mentions, your favourites, and your bookmarks)

These limitations on search are intentional (and a very common source of arguments) – it’s a privacy / safety issue. Here’s a summary from Tim Bray with lots of links.

It would be personally convenient for me to be able to search more easily but I respect folks’ safety concerns so I’ll leave it at that.

My understanding is that the Mastodon devs are planning to add opt-in search for public posts relatively soon.

other ActivityPub software

We’ve been talking about Mastodon a lot, but not everyone who I follow is using Mastodon: Mastodon uses a protocol called ActivityPub to distribute messages.

Here are some examples of other software I see people talking about, in no particular order:

I’m probably missing a bunch of important ones.

what’s the difference between Mastodon and other ActivityPub software?

This confused me for a while, and I’m still not super clear on how ActivityPub works. What I’ve understood is:

  • ActivityPub is a protocol (you can explore how it works with blinry’s nice JSON explorer)
  • Mastodon servers communicate with each other (and with other ActivityPub servers) using ActivityPub
  • Mastodon clients communicate with their server using the Mastodon API, which is its own thing
  • There’s also software like GoToSocial that aims to be compatible with the Mastodon API, so that you can use a Mastodon client with it

more mastodon resources

  • Fedi.Tips seems to be a great introduction
  • I think you can still use FediFinder to find folks you followed on Twitter on Mastodon
  • I’ve been using the Ivory client on iOS, but there are lots of great clients. Elk is an alternative web client that folks seem to like.

I haven’t written here about what Mastodon culture is like because other people have done a much better job of talking about it than me, but of course it’s the biggest thing that affects your experience, and it was the thing that took me longest to get a handle on. A few links:

that’s all!

I don’t regret setting up a single-user server – even though it’s inconvenient, it’s important to me to have control over my social media. I think “have control over my social media” is more important to me than it is to most other people though, because I use Twitter/Mastodon a lot for work.

I am happy that I didn’t start out on a single-user server though – I think it would have made getting started on Mastodon a lot more difficult.

Mastodon is pretty rough around the edges sometimes but I’m able to have more interesting conversations about computers there than I am on Twitter (or Bluesky), so that’s where I’m staying for now.

What helps people get comfortable on the command line?

Sometimes I talk to friends who need to use the command line, but are intimidated by it. I never really feel like I have good advice (I’ve been using the command line for too long), and so I asked some people on Mastodon:

if you just stopped being scared of the command line in the last year or three — what helped you?

(no need to reply if you don’t remember, or if you’ve been using the command line comfortably for 15 years — this question isn’t for you :) )

This list is still a bit shorter than I would like, but I’m posting it in the hopes that I can collect some more answers. There obviously isn’t one single thing that works for everyone – different people take different paths.

I think there are three parts to getting comfortable: reducing risks, motivation and resources. I’ll start with risks, then a couple of motivations and then list some resources.

ways to reduce risk

A lot of people are (very rightfully!) concerned about accidentally doing some destructive action on the command line that they can’t undo.

A few strategies people said helped them reduce risks:

  • regular backups (one person mentioned they accidentally deleted their entire home directory last week in a command line mishap, but it was okay because they had a backup)
  • For code, using git as much as possible
  • Aliasing rm to a tool like safe-rm or rmtrash so that you can’t accidentally delete something you shouldn’t (or just rm -i)
  • Mostly avoid using wildcards; use tab completion instead. (my shell will tab complete rm *.txt and show me exactly what it’s going to remove)
  • Fancy terminal prompts that tell you the current directory, machine you’re on, git branch, and whether you’re root
  • Making a copy of files if you’re planning to run an untested / dangerous command on them
  • Having a dedicated test machine (like a cheap old Linux computer or Raspberry Pi) for particularly dangerous testing, like testing backup software or partitioning
  • Use --dry-run options for dangerous commands, if they’re available
  • Build your own --dry-run options into your shell scripts (there’s a small sketch of this after the list)
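Here’s what that last idea might look like in practice – a minimal (and entirely hypothetical) shell script that wraps its destructive commands in a function which just prints them when --dry-run is passed:

#!/bin/bash
# usage: ./cleanup.sh --dry-run to preview, ./cleanup.sh to actually run
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

DRY_RUN=0
if [ "$1" = "--dry-run" ]; then
    DRY_RUN=1
fi

run rm -rf ./old-backups
run mv ./new-backup ./old-backups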

a “killer app”

A few people mentioned a “killer command line app” that motivated them to start spending more time on the command line. For example:

  • ripgrep
  • jq
  • wget / curl
  • git (some folks found they preferred the git CLI to using a GUI)
  • ffmpeg (for video work)
  • yt-dlp
  • hard drive data recovery tools (from this great story)

A couple of people also mentioned getting frustrated with GUI tools (like heavy IDEs that use all your RAM and crash your computer) and being motivated to replace them with much lighter weight command line tools.

inspiring command line wizardry

One person mentioned being motivated by seeing cool stuff other people were doing with the command line, like:

explain shell

Several people mentioned explainshell where you can paste in any shell incantation and get it to break it down into different parts.

history, tab completion, etc:

There were lots of little tips and tricks mentioned that make it a lot easier to work on the command line, like:

  • up arrow to see the previous command
  • Ctrl+R to search your bash history
  • navigating inside a line with Ctrl+w (to delete a word), Ctrl+a (to go to the beginning of the line), Ctrl+e (to go to the end), and Ctrl+left arrow / Ctrl+right arrow (to jump back/forward a word)
  • setting bash history to unlimited
  • cd - to go back to the previous directory
  • tab completion of filenames and command names
  • learning how to use a pager like less to read man pages or other large text files (how to search, scroll, etc)
  • backing up configuration files before editing them
  • using pbcopy/pbpaste on Mac OS to copy/paste from your clipboard to stdout/stdin
  • on Mac OS, you can drag a folder from the Finder into the terminal to get its path

fzf

Lots of mentions of using fzf as a better way to fuzzy search shell history. Some other things people mentioned using fzf for:

  • picking git branches (git checkout $(git for-each-ref --format='%(refname:short)' refs/heads/ | fzf))
  • quickly finding files to edit (nvim $(fzf))
  • switching kubernetes contexts (kubectl config use-context $(kubectl config get-contexts -o name | fzf --height=10 --prompt="Kubernetes Context> "))
  • picking a specific test to run from a test suite

The general pattern here is that you use fzf to pick something (a file, a git branch, a command line argument), fzf prints the thing you picked to stdout, and then you insert that as the command line argument to another command.

You can also use fzf as a tool to automatically preview a command’s output and quickly iterate, for example:

  • automatically previewing jq output (echo '' | fzf --preview "jq {q} < YOURFILE.json")
  • or for sed (echo '' | fzf --preview "sed {q} YOURFILE")
  • or for awk (echo '' | fzf --preview "awk {q} YOURFILE")

You get the idea.

Folks will generally define an alias for their fzf incantations so they can type gcb or something to quickly pick a git branch to check out.
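For example, a bash alias for the branch-picking incantation above might look like this (a sketch – adjust for your own shell):

alias gcb='git checkout $(git for-each-ref --format="%(refname:short)" refs/heads/ | fzf)'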

raspberry pi

Some people started using a Raspberry Pi, where it’s safer to experiment without worrying about breaking your computer (you can just erase the SD card and start over!).

a fancy shell setup

Lots of people said they got more comfortable with the command line when they started using a more user-friendly shell setup like oh-my-zsh or fish. I really agree with this one – I’ve been using fish for 10 years and I love it.

A couple of other things you can do here:

  • some folks said that making their terminal prettier helped them feel more comfortable (“make it pink!”).
  • set up a fancy shell prompt to give you more information (for example you can make the prompt red when a command fails). Specifically transient prompts (where you set a super fancy prompt for the current command, but a much simpler one for past commands) seem really nice.

Some tools for theming your terminal:

  • I use base16-shell
  • powerlevel10k is a popular fancy zsh theme which has transient prompts
  • starship is a fancy prompt tool
  • on a Mac, I think iTerm2 is easier to customize than the default terminal

a fancy file manager

A few people mentioned fancy terminal file managers like ranger or nnn, which I hadn’t heard of.

a helpful friend or coworker

Someone who can answer beginner questions and give you pointers is invaluable.

shoulder surfing

Several mentions of watching someone more experienced using the terminal – there are lots of little things that experienced users don’t even realize they’re doing which you can pick up.

aliases

Lots of people said that making their own aliases or scripts for commonly used tasks felt like a magical “a ha!” moment, because:

  • they don’t have to remember the syntax
  • then they have a list of their most commonly used commands that they can summon easily

cheat sheets to get examples

A lot of man pages don’t have examples, for example the openssl s_client man page has no examples. This makes it a lot harder to get started!

People mentioned a couple of cheat sheet tools, like:

  • tldr.sh
  • cheat (which has the bonus of being editable – you can add your own commands to reference later)
  • um (an incredibly minimal system that you have to build yourself)

For example the cheat page for openssl is really great – I think it includes almost everything I’ve ever actually used openssl for in practice (except the -servername option for openssl s_client).

One person said that they configured their .bash_profile to print out a cheat sheet every time they log in.

don’t try to memorize

A couple of people said that they needed to change their approach – instead of trying to memorize all the commands, they realized they could just look up commands as needed and they’d naturally memorize the ones they used the most over time.

(I actually recently had the exact same realization about learning to read x86 assembly – I was taking a class and the instructor said “yeah, just look everything up every time to start, eventually you’ll learn the most common instructions by heart”)

Some people also said the opposite – that they used a spaced repetition app like Anki to memorize commonly used commands.

vim

One person mentioned that they started using vim on the command line to edit files, and once they were using a terminal text editor it felt more natural to use the command line for other things too.

Also apparently there’s a new editor called micro which is like a nicer version of pico/nano, for folks who don’t want to learn emacs or vim.

use Linux on the desktop

One person said that they started using Linux as their main daily driver, and having to fix Linux issues helped them learn. That’s also how I got comfortable with the command line back in ~2004 (I was really into installing lots of different Linux distributions to try to find my favourite one), but my guess is that it’s not the most popular strategy these days.

being forced to only use the terminal

Some people said that they took a university class where the professor made them do everything in the terminal, or that they created a rule for themselves that they had to do all their work in the terminal for a while.

workshops

A couple of people said that workshops like Software Carpentry workshops (an introduction to the command line, git, and Python/R programming for scientists) helped them get more comfortable with the command line.

You can see the software carpentry curriculum here.

books & articles

a few that were mentioned:

articles:

books:

videos:

Some tactics for writing in public

Someone recently asked me – “how do you deal with writing in public? People on the internet are such assholes!”

I’ve often heard the advice “don’t read the comments”, but actually I’ve learned a huge amount from reading internet comments on my posts from strangers over the years, even if sometimes people are jerks. So I want to explain some tactics I use to try to make the comments on my posts more informative and useful to me, and to try to minimize the number of annoying comments I get.

talk about facts

On here I mostly talk about facts – either facts about computers, or stories about my experiences using computers.

For example this post about tcpdump contains some basic facts about how to use tcpdump, as well as an example of how I’ve used it in the past.

Talking about facts means I get a lot of fact-based comments like:

  • people sharing their own similar (or different) experiences (“I use tcpdump a lot to look at our RTP sequence numbers”)
  • pointers to other resources (“the documentation from F5 about tcpdump is great”)
  • other interesting related facts I didn’t mention (“you can use tcpdump -X too”, “netsh on windows is great”, “you can use sudo tcpdump -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420' to filter for HTTP GET requests”)
  • potential problems or gotchas (“be careful about running tcpdump as root, try just setting the required capabilities instead”)
  • questions (“Is there a way to place the BPF filter after IP packet reassembly?” or “what’s the advantage of tcpdump over wireshark?”)
  • mistakes I made

In general, I’d say that people’s comments about facts tend to stay pretty normal. The main kinds of negative comments I get about facts are:

  • occasionally people get a little rude about facts I didn’t mention (“Didn’t use -n in any of the examples…please…”). I think I didn’t mention -n in that post because at the time I didn’t know why the -n flag was useful (it’s useful because it turns off this annoying reverse DNS lookup that tcpdump does by default so you can see the IP addresses).
  • people are also sometimes weird about mistakes. I mostly try to head this off by trying to be self-aware about my knowledge level on a topic, and saying “I’m not sure…” when I’m not sure about something.

stories are great

I think stories encourage pretty good discussion. For example, why you should understand (a little) about TCP is a story about a time it was important for me to understand how TCP worked.

When I share stories about problems I solved, the comments really help me understand how what I learned fits into a bigger context. For example:

  • is this a common problem? people will often comment saying “this happened to me too!”
  • what are other common related problems that come up?
  • are there other possible solutions I didn’t consider?

Also I think these kinds of stories are incredibly important – that post describes a bug that was VERY hard for me to solve, and the only reason I was able to figure it out in the first place was that I read this blog post.

ask technical questions

Often in my blog posts I ask technical questions that I don’t know the answer to (or just mention “I don’t know X…”). This helps people focus their replies a little bit – an obvious comment to make is to provide an answer to the question, or explain the thing I didn’t know!

This is fun because it feels like a guaranteed way to get value out of people’s comments – people LOVE answering questions, and so they get to look smart, and I get the answer to a question I have! Everyone wins!

fix mistakes

I make a lot of mistakes in my blog posts, because I write about a lot of things that are on the edge of my knowledge. When people point out mistakes, I often edit the blog post to fix it.

Usually I’ll stay near a computer for a few hours after I post a blog post so that I can fix mistakes quickly as they come up.

Some people are very careful to list every single error they made in their blog posts (“errata: the post previously said X which was wrong, I have corrected it to say Y”). Personally I make mistakes constantly and I don’t have time for that so I just edit the post to fix the mistakes.

ask for examples/experiences, not opinions

A lot of the time when I post a blog post, people on Twitter/Mastodon will reply with various opinions they have about the thing. For example, someone recently replied to a blog post about DNS saying that they love using zone files and dislike web interfaces for managing DNS records. That’s not an opinion I share, so I asked them why.

They explained that there are some DNS record types (specifically TLSA) that they find often aren’t supported in web interfaces. I didn’t know that people used TLSA records, so I learned something! Cool!

I’ve found that asking people to share their experiences (“I wanted to use X DNS record type and I couldn’t”) instead of their opinions (“DNS web admin interfaces are bad”) leads to a lot of useful information and discussion. I’ve learned a lot from it over the years, and written a lot of tweets like “which DNS record types have you needed?” to try to extract more information about people’s experiences.

I try to model the same behaviour in my own work when I can – if I have an opinion, I’ll try to explain the experiences I’ve had with computers that caused me to have that opinion.

start with a little context

I think internet strangers are more likely to reply in a weird way when they have no idea who you are or why you’re writing this thing. It’s easy to make incorrect assumptions! So often I’ll mention a little context about why I’m writing this particular blog post.

For example:

A little while ago I started using a Mac, and one of my biggest frustrations with it is that often I need to run Linux-specific software.

or

I’ve started to run a few more servers recently (nginx playground, mess with dns, dns lookup), so I’ve been thinking about monitoring.

or

Last night, I needed to scan some documents for some bureaucratic reasons. I’d never used a scanner on Linux before and I was worried it would take hours to figure out…

avoid causing boring conversations

There are some kinds of programming conversations that I find extremely boring (like “should people learn vim?” or “is functional programming better than imperative programming?“). So I generally try to avoid writing blog posts that I think will result in a conversation/comment thread that I find annoying or boring.

For example, I wouldn’t write about my opinions about functional programming: I don’t really have anything interesting to say about it and I think it would lead to a conversation that I’m not interested in having.

I don’t always succeed at this of course (it’s impossible to predict what people are going to want to comment about!), but I try to avoid the most obvious flamebait triggers I’ve seen in the past.

There are a bunch of “flamebait” triggers that can set people off on a conversation that I find boring: cryptocurrency, tailwind, DNSSEC/DoH, etc. So I have a weird catalog in my head of things not to mention if I don’t want to start the same discussion about that thing for the 50th time.

Of course, if you think that conversations about functional programming are interesting, you should write about functional programming and start the conversations you want to have!

Also, it’s often possible to start an interesting conversation about a topic where the conversation is normally boring. For example I often see the same talking points about IPv6 vs IPv4 over and over again, but I remember the comments on Reasons for servers to support IPv6 being pretty interesting. In general if I really care about a topic I’ll talk about it anyway, but I don’t care about functional programming very much so I don’t see the point of bringing it up.

preempt common suggestions

Another kind of “boring conversation” I try to avoid is suggestions of things I have already considered. Like when someone says “you should do X” but I already know I could have done X and chose not to because of A B C.

So I often will add a short note like “I decided not to do X because of A B C” or “you can also do X” or “normally I would do X, here I didn’t because…”. For example, in this post about nix, I list a bunch of Nix features I’m choosing not to use (nix-shell, nix flakes, home manager) to avoid a bunch of helpful people telling me that I should use flakes.

Listing the things I’m not doing is also helpful to readers – maybe someone new to nix will discover nix flakes through that post and decide to use them! Or maybe someone will learn that there are exceptions to when a certain “best practice” is appropriate.

set some boundaries

Recently on Mastodon I complained about some gross terminology (“domain information groper”) that I’d just noticed in the dig man page on my machine. A few dudes in the replies (who by now have all deleted their posts) asked me to prove that the original author intended it to be offensive (which of course is beside the point, there’s just no need to have a term widely understood to be referring to sexual assault in the dig man page) or tried to explain to me why it actually wasn’t a problem.

So I blocked a few people and wrote a quick post:

man so many dudes in the replies demanding that i prove that the person who named dig “domain information groper” intended it in an offensive way. Big day for the block button I guess :)

I don’t do this too often, but I think it’s very important on social media to occasionally set some rules about what kind of behaviour I won’t tolerate. My goal here is usually to drive away some of the assholes (they can unfollow me!) and try to create a more healthy space for everyone else to have a conversation about computers in.

Obviously this only works in situations (like Twitter/Mastodon) where I have the ability to garden my following a little bit over time – I can’t do this on HN or Reddit or Lobsters or whatever and wouldn’t try.

As for fixing it – the dig maintainers removed the problem language years ago, but Mac OS still has a very outdated version for license reasons.

(you might notice that this section is breaking the “avoid boring conversations” rule above, this section was certain to start a very boring argument, but I felt it was important to talk about boundaries so I left it in)

don’t argue

Sometimes people seem to want to get into arguments or make dismissive comments. I don’t reply to them, even if they’re wrong. I dislike arguing on the internet and I’m extremely bad at it, so it’s not a good use of my time.

analyze negative comments

If I get a lot of negative comments that I didn’t expect, I try to see if I can get something useful out of it.

For example, I wrote a toy DNS resolver once and some of the commenters were upset that I didn’t handle parsing the DNS packet. At the time I thought this was silly (I thought DNS parsing was really straightforward and that it was obvious how to do it, who cares that I didn’t handle it?) but I realized that maybe the commenters didn’t think it was easy or obvious, and wanted to know how to do it. Which makes sense! It’s not obvious at all if you haven’t done it before!

Those comments partly inspired implement DNS in a weekend, which focuses much more heavily on the parsing aspects, and which I think is a much better explanation of how to write a DNS resolver. So ultimately those comments helped me a lot, even if I found them annoying at the time.

(I realize this section makes me sound like a Perfectly Logical Person who does not get upset by negative public criticism, I promise this is not at all the case and I have 100000 feelings about everything that happens on the internet and get upset all the time. But I find that analyzing the criticism and trying to take away something useful from it helps a bit)

that’s all!

Thanks to Shae, Aditya, Brian, and Kamal for reading a draft of this.

Some other similar posts I’ve written in the past:

Behind "Hello World" on Linux

Today I was thinking about – what happens when you run a simple “Hello World” Python program on Linux, like this one?

print("hello world")

Here’s what it looks like at the command line:

$ python3 hello.py
hello world

But behind the scenes, there’s a lot more going on. I’ll describe some of what happens, and (much much more importantly!) explain some tools you can use to see what’s going on behind the scenes yourself. We’ll use readelf, strace, ldd, debugfs, /proc, ltrace, dd, and stat. I won’t talk about the Python-specific parts at all – just what happens when you run any dynamically linked executable.

Here’s a table of contents:

  1. parse “python3 hello.py”
  2. figure out the full path to python3
  3. stat, under the hood
  4. time to fork
  5. the shell calls execve
  6. get the binary’s contents
  7. find the interpreter
  8. dynamic linking
  9. go to _start
  10. write a string

before execve

Before we even start the Python interpreter, there are a lot of things that have to happen. What executable are we even running? Where is it?

1: The shell parses the string python3 hello.py into a command to run and a list of arguments: python3, and ['hello.py']

A bunch of things like glob expansion could happen here. For example if you run python3 *.py, the shell will expand that into python3 hello.py

2: The shell figures out the full path to python3

Now we know we need to run python3. But what’s the full path to that binary? The way this works is that there’s a special environment variable named PATH.

See for yourself: Run echo $PATH in your shell. For me it looks like this.

$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

When you run a command, the shell will search every directory in that list (in order) to try to find a match.

In fish (my shell), you can see the path resolution logic here. It uses the stat system call to check if files exist.

See for yourself: Run strace -e stat, and then run a command like python3. You should see output like this:

stat("/usr/local/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/bin/python3", {st_mode=S_IFREG|0755, st_size=5479736, ...}) = 0

You can see that it finds the binary at /usr/bin/python3 and stops: it doesn’t continue searching /sbin or /bin.

(if this doesn’t work for you, instead try strace -o out bash, and then grep stat out. One reader mentioned that their version of libc uses a different system call instead of stat)

2.1: A note on execvp

If you want to run the same PATH searching logic as the shell does without reimplementing it yourself, you can use the libc function execvp (or one of the other exec* functions with p in the name).
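For example, Python exposes this as os.execvp, which does the same PATH search:

import os

# searches every directory in PATH for "python3", then replaces the
# current process with it (this line never returns if it succeeds)
os.execvp("python3", ["python3", "hello.py"])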

3: stat, under the hood

Now you might be wondering – Julia, what is stat doing? Well, when your OS opens a file, it happens in 2 steps.

  1. It maps the filename to an inode, which contains metadata about the file
  2. It uses the inode to get the file’s contents

The stat system call just returns the contents of the file’s inode – it doesn’t read the contents at all. The advantage of this is that it’s a lot faster. Let’s go on a short adventure into inodes. (this great post “A disk is a bunch of bits” by Dmitry Mazin has more details)

$ stat /usr/bin/python3
  File: /usr/bin/python3 -> python3.9
  Size: 9         	Blocks: 0          IO Block: 4096   symbolic link
Device: fe01h/65025d	Inode: 6206        Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-08-03 14:17:28.890364214 +0000
Modify: 2021-04-05 12:00:48.000000000 +0000
Change: 2021-06-22 04:22:50.936969560 +0000
 Birth: 2021-06-22 04:22:50.924969237 +0000
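You can get at the same inode metadata from Python with os.lstat (lstat rather than stat, so that it inspects the symlink itself instead of following it):

import os

s = os.lstat("/usr/bin/python3")
print(s.st_ino)        # 6206, the inode number
print(s.st_size)       # 9, the length of the link target "python3.9"
print(oct(s.st_mode))  # 0o120777: a symlink with permissions 777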

See for yourself: Let’s go see where exactly that inode is on our hard drive.

First, we have to find our hard drive’s device name

$ df
...
tmpfs             100016      604     99412   1% /run
/dev/vda1       25630792 14488736  10062712  60% /
...

Looks like it’s /dev/vda1. Next, let’s find out where the inode for /usr/bin/python3 is on our hard drive:

$ sudo debugfs /dev/vda1
debugfs 1.46.2 (28-Feb-2021)
debugfs:  imap /usr/bin/python3
Inode 6206 is part of block group 0
	located at block 658, offset 0x0d00

I have no idea how debugfs is figuring out the location of the inode for that filename, but we’re going to leave that alone.

Now, we need to calculate how many bytes into the hard drive “block 658, offset 0x0d00” is. Each block is 4096 bytes, so we need to go 4096 * 658 + 0x0d00 bytes in. A calculator tells me that’s 2698496.
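(Actually, the shell can do that arithmetic for us:)

$ echo $((4096 * 658 + 0x0d00))
2698496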

$ sudo dd if=/dev/vda1 bs=1 skip=2698496 count=256 2>/dev/null | hexdump -C
00000000  ff a1 00 00 09 00 00 00  f8 b6 cb 64 9a 65 d1 60  |...........d.e.`|
00000010  f0 fb 6a 60 00 00 00 00  00 00 01 00 00 00 00 00  |..j`............|
00000020  00 00 00 00 01 00 00 00  70 79 74 68 6f 6e 33 2e  |........python3.|
00000030  39 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |9...............|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 12 4a 95 8c  00 00 00 00 00 00 00 00  |.....J..........|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 2d cb 00 00  |............-...|
00000080  20 00 bd e7 60 15 64 df  00 00 00 00 d8 84 47 d4  | ...`.d.......G.|
00000090  9a 65 d1 60 54 a4 87 dc  00 00 00 00 00 00 00 00  |.e.`T...........|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Neat! There’s our inode! You can see it says python3 in it, which is a really good sign. We’re not going to go through all of this, but the ext4 inode struct from the Linux kernel says that the first 16 bits are the “mode”, or permissions. So let’s work out how ffa1 corresponds to file permissions.

  • The bytes ffa1 correspond to the number 0xa1ff, or 41471 (because x86 is little endian)
  • 41471 in octal is 0120777
  • This is a bit weird – that file’s permissions could definitely be 777, but what are the first 3 digits? I’m not used to seeing those! You can find out what the 012 means in man inode (scroll down to “The file type and mode”). There’s a little table that says 012 means “symbolic link”.
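
We can double-check that decoding with a couple of lines of Python:

import struct

# interpret the inode's first 2 bytes (ff a1) as a little-endian
# 16-bit integer, then print it in octal
mode = struct.unpack("<H", bytes.fromhex("ffa1"))[0]
print(mode)       # 41471
print(oct(mode))  # 0o120777: 012 is "symbolic link", 0777 is the permissions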

Let’s list the file and see if it is in fact a symbolic link with permissions 777:

$ ls -l /usr/bin/python3
lrwxrwxrwx 1 root root 9 Apr  5  2021 /usr/bin/python3 -> python3.9

It is! Hooray, we decoded it correctly.

4: Time to fork

We’re still not ready to start python3. First, the shell needs to create a new child process to run it in. The way new processes start on Unix is a little weird – first the process clones itself, and then runs execve, which replaces the cloned process with a new program.

See for yourself: Run strace -e clone bash, then run python3. You should see something like this:

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f03788f1a10) = 3708100

3708100 is the PID of the new process, which is a child of the shell process.
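
Here’s what that clone-then-execve pattern looks like in Python – os.fork calls libc’s fork(), which glibc implements with the clone system call:

import os

pid = os.fork()  # clone the current process
if pid == 0:
    # child: replace ourselves with python3, searching PATH like the shell
    os.execvp("python3", ["python3", "-c", "print('hello from the child')"])
else:
    # parent: wait for the child to exit, like the shell does
    os.waitpid(pid, 0)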

Some more tools to look at what’s going on with processes:

  • pstree will show you a tree of all the processes on your system
  • cat /proc/PID/stat shows you some information about the process. The contents of that file are documented in man proc. For example the 4th field is the parent PID.

4.1: What the new process inherits

The new process (which will become python3) has inherited a bunch of things from the shell. For example, it’s inherited:

  1. environment variables: you can look at them with cat /proc/PID/environ | tr '\0' '\n'
  2. file descriptors for stdout and stderr: look at them with ls -l /proc/PID/fd
  3. a working directory (whatever the current directory is)
  4. namespaces and cgroups (if it’s in a container)
  5. the user and group that’s running it
  6. probably more things I’m not thinking of right now

5: The shell calls execve

Now we’re ready to start the Python interpreter!

See for yourself: Run strace -f -e execve bash, then run python3. The -f is important because we want to follow any forked child subprocesses. You should see something like this:

[pid 3708381] execve("/usr/bin/python3", ["python3"], 0x560397748300 /* 21 vars */) = 0

The first argument is the binary, and the second argument is the list of command line arguments. The command line arguments get placed in a special location in the program’s memory so that it can access them when it runs.
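
Here’s a little sketch of where those end up inside a Python process – the argument list that execve was given shows up as sys.argv, and the environment (the “21 vars”) as os.environ:

import sys
import os

print(sys.argv)            # the argv array that execve was given
print(os.environ["PATH"])  # one of the "21 vars" from the environment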

Now, what’s going on inside execve?

6: get the binary’s contents

First, we need to open the python3 binary file and read its contents. So far we’ve only used the stat system call to access its metadata, but now we need its contents.

Let’s look at the output of stat again:

$ stat /usr/bin/python3
  File: /usr/bin/python3 -> python3.9
  Size: 9         	Blocks: 0          IO Block: 4096   symbolic link
Device: fe01h/65025d	Inode: 6206        Links: 1
...

This takes up 0 blocks of space on the disk because the contents of the symbolic link (python3.9) are stored in the inode itself. You can see them here, in the binary contents of the inode above (split across 2 lines of the hexdump output):

00000020  00 00 00 00 01 00 00 00  70 79 74 68 6f 6e 33 2e  |........python3.|
00000030  39 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |9...............|

So we’ll need to open /usr/bin/python3.9 instead. All of this is happening inside the kernel, so you won’t see another system call for it.

Every file is made up of a bunch of blocks on the hard drive. I think each of these blocks on my system is 4096 bytes, so the minimum amount of disk space a regular file can take up is 4096 bytes – even if the file is only 5 bytes long, it still uses a full 4KB block.

See for yourself: We can find the block numbers using debugfs like this (again, I got these instructions from Dmitry Mazin’s “A disk is a bunch of bits” post):

$ debugfs /dev/vda1
debugfs:  blocks /usr/bin/python3.9
145408 145409 145410 145411 145412 145413 145414 145415 145416 145417 145418 145419 145420 145421 145422 145423 145424 145425 145426 145427 145428 145429 145430 145431 145432 145433 145434 145435 145436 145437

Now we can use dd to read the first block of the file. We’ll set the block size to 4096 bytes, skip 145408 blocks, and read 1 block.

$ dd if=/dev/vda1 bs=4096 skip=145408 count=1 2>/dev/null | hexdump -C | head
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 3e 00 01 00 00 00  c0 a5 5e 00 00 00 00 00  |..>.......^.....|
00000020  40 00 00 00 00 00 00 00  b8 95 53 00 00 00 00 00  |@.........S.....|
00000030  00 00 00 00 40 00 38 00  0b 00 40 00 1e 00 1d 00  |....@.8...@.....|
00000040  06 00 00 00 04 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000050  40 00 40 00 00 00 00 00  40 00 40 00 00 00 00 00  |@.@.....@.@.....|
00000060  68 02 00 00 00 00 00 00  68 02 00 00 00 00 00 00  |h.......h.......|
00000070  08 00 00 00 00 00 00 00  03 00 00 00 04 00 00 00  |................|
00000080  a8 02 00 00 00 00 00 00  a8 02 40 00 00 00 00 00  |..........@.....|
00000090  a8 02 40 00 00 00 00 00  1c 00 00 00 00 00 00 00  |..@.............|

You can see that we get the exact same output as if we read the file with cat, like this:

$ cat /usr/bin/python3.9 | hexdump -C | head
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 3e 00 01 00 00 00  c0 a5 5e 00 00 00 00 00  |..>.......^.....|
00000020  40 00 00 00 00 00 00 00  b8 95 53 00 00 00 00 00  |@.........S.....|
00000030  00 00 00 00 40 00 38 00  0b 00 40 00 1e 00 1d 00  |....@.8...@.....|
00000040  06 00 00 00 04 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000050  40 00 40 00 00 00 00 00  40 00 40 00 00 00 00 00  |@.@.....@.@.....|
00000060  68 02 00 00 00 00 00 00  68 02 00 00 00 00 00 00  |h.......h.......|
00000070  08 00 00 00 00 00 00 00  03 00 00 00 04 00 00 00  |................|
00000080  a8 02 00 00 00 00 00 00  a8 02 40 00 00 00 00 00  |..........@.....|
00000090  a8 02 40 00 00 00 00 00  1c 00 00 00 00 00 00 00  |..@.............|
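
We can do the same read dd is doing from Python (you’ll need root; the device name and block number are the ones we found above):

# seek to block 145408 on the disk and read one 4096-byte block
with open("/dev/vda1", "rb") as disk:
    disk.seek(4096 * 145408)
    block = disk.read(4096)

print(block[:4])  # b'\x7fELF' -- the ELF magic number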

an aside on magic numbers

This file starts with ELF, which is a “magic number”, or a byte sequence that tells us that this is an ELF file. ELF is the binary file format on Linux.

Different file formats have different magic numbers, for example the magic number for gzip is 1f8b. The magic number at the beginning is how file blah.gz knows that it’s a gzip file.

I think file has a variety of heuristics for figuring out the file type of a file, not just magic numbers, but the magic number is an important one.
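
Here’s a toy version of that magic number check in Python (just a sketch – the real file command knows about hundreds of formats and uses other heuristics too):

# map a few well-known magic numbers to file types
MAGIC_NUMBERS = {b"\x7fELF": "ELF", b"\x1f\x8b": "gzip", b"%PDF": "PDF"}

def guess_type(path):
    with open(path, "rb") as f:
        start = f.read(4)
    for magic, name in MAGIC_NUMBERS.items():
        if start.startswith(magic):
            return name
    return "unknown"

print(guess_type("/usr/bin/python3.9"))  # ELF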

7: find the interpreter

Let’s parse the ELF file to see what’s in there.

See for yourself: Run readelf -a /usr/bin/python3.9. Here’s what I get (though I’ve redacted a LOT of stuff):

$ readelf -a /usr/bin/python3.9
ELF Header:
    Class:                             ELF64
    Machine:                           Advanced Micro Devices X86-64
...
->  Entry point address:               0x5ea5c0
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
  INTERP         0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
                 0x000000000000001c 0x000000000000001c  R      0x1
->      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
        ...
->        1238: 00000000005ea5c0    43 FUNC    GLOBAL DEFAULT   13 _start

Here’s what I understand of what’s going on here:

  1. it’s telling the kernel to run /lib64/ld-linux-x86-64.so.2 to start this program. This is called the dynamic linker and we’ll talk about it next
  2. it’s specifying an entry point (at 0x5ea5c0, which is where this program’s code starts) – we can check that address ourselves, as shown below
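
In a 64-bit ELF file, the entry point is stored as a little-endian 64-bit integer at offset 24 of the header, so a few lines of Python can read it out:

import struct

# the ELF64 header: 16 ident bytes, then e_type, e_machine, e_version,
# and the 8-byte entry point address at offset 24
with open("/usr/bin/python3.9", "rb") as f:
    header = f.read(32)

assert header[:4] == b"\x7fELF"
(entry,) = struct.unpack_from("<Q", header, 24)
print(hex(entry))  # 0x5ea5c0, matching readelf's "Entry point address"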

Now let’s talk about the dynamic linker.

8: dynamic linking

Okay! We’ve read the bytes from disk and we’ve started this “interpreter” thing. What next? Well, if you run strace -o out.strace python3, you’ll see a bunch of stuff like this right after the execve system call:

execve("/usr/bin/python3", ["python3"], 0x560af13472f0 /* 21 vars */) = 0
brk(NULL)                       = 0xfcc000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=32091, ...}) = 0
mmap(NULL, 32091, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f718a1e3000
close(3)                        = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 l\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=149520, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f718a1e1000
...
close(3)                        = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3

This all looks a bit intimidating at first, but the part I want you to pay attention to is openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0". This is opening a C threading library called pthread that the Python interpreter needs to run.

See for yourself: If you want to know which libraries a binary needs to load at runtime, you can use ldd. Here’s what that looks like for me:

$ ldd /usr/bin/python3.9
	linux-vdso.so.1 (0x00007ffc2aad7000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2fd6554000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2fd654e000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
	libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)

You can see that the first library listed is /lib/x86_64-linux-gnu/libpthread.so.0, which is why it was loaded first.

on LD_LIBRARY_PATH

I’m honestly still a little confused about dynamic linking. Some things I know:

  • Dynamic linking happens in userspace and the dynamic linker on my system is at /lib64/ld-linux-x86-64.so.2. If you’re missing the dynamic linker, you can end up with confusing bugs, like a misleading “file not found” error for a binary that clearly exists
  • The dynamic linker uses the LD_LIBRARY_PATH environment variable to find libraries
  • The dynamic linker will also use the LD_PRELOAD environment variable to override any dynamically linked function you want (you can use this for fun hacks, or to replace your default memory allocator with an alternative one like jemalloc)
  • there are some mprotects in the strace output which are marking the library code as read-only, for security reasons
  • on Mac, it’s DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH

You might be wondering – if dynamic linking happens in userspace, why don’t we see a bunch of stat system calls where it’s searching through LD_LIBRARY_PATH for the libraries, the way we did when bash was searching the PATH?

That’s because ld has a cache in /etc/ld.so.cache, and all of those libraries have already been found in the past. You can see it opening the cache in the strace output – openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3.

There are a bunch of system calls after dynamic linking in the full strace output that I still don’t really understand (what’s prlimit64 doing? where does the locale stuff come in? what’s gconv-modules.cache? what’s rt_sigaction doing? what’s arch_prctl? what’s set_tid_address and set_robust_list?). But this feels like a good start.

aside: ldd is actually a simple shell script!

Someone on Mastodon pointed out that ldd is actually a shell script that just sets the LD_TRACE_LOADED_OBJECTS=1 environment variable and starts the program. So you can do exactly the same thing like this:

$ LD_TRACE_LOADED_OBJECTS=1 python3
	linux-vdso.so.1 (0x00007ffe13b0a000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f01a5a47000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f01a5a41000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
	libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)

Apparently ld is also a binary you can just run, so /lib64/ld-linux-x86-64.so.2 --list /usr/bin/python3.9 does the same thing.

on init and fini

Let’s talk about this line in the strace output:

set_tid_address(0x7f58880dca10)         = 3709103

This seems to have something to do with threading, and I think this might be happening because the pthread library (like every other dynamically loaded library) gets to run initialization code when it’s loaded. The code that runs when a library is loaded is in its .init section (or maybe also the .ctors section).

See for yourself: Let’s take a look at that using readelf:

$ readelf -a /lib/x86_64-linux-gnu/libpthread.so.0
...
  [10] .rela.plt         RELA             00000000000051f0  000051f0
       00000000000007f8  0000000000000018  AI       4    26     8
  [11] .init             PROGBITS         0000000000006000  00006000
       000000000000000e  0000000000000000  AX       0     0     4
  [12] .plt              PROGBITS         0000000000006010  00006010
       0000000000000560  0000000000000010  AX       0     0     16
...

This library doesn’t have a .ctors section, just an .init. But what’s in that .init section? We can use objdump to disassemble the code:

$ objdump -d /lib/x86_64-linux-gnu/libpthread.so.0
Disassembly of section .init:

0000000000006000 <_init>:
    6000:       48 83 ec 08             sub    $0x8,%rsp
    6004:       e8 57 08 00 00          callq  6860 <__pthread_initialize_minimal>
    6009:       48 83 c4 08             add    $0x8,%rsp
    600d:       c3                      retq

So it’s calling __pthread_initialize_minimal. I found the code for that function in glibc, though I had to find an older version of glibc because it looks like in more recent versions libpthread is no longer a separate library.

I’m not sure whether this set_tid_address system call actually comes from __pthread_initialize_minimal, but at least we’ve learned that libraries can run code on startup through the .init section.

Here’s a note from man elf on the .init section:

$ man elf
 .init  This section holds executable instructions that contribute to the process initialization code.  When a program starts to run,
              the system arranges to execute the code in this section before calling the main program entry point.

There’s also a .fini section in the ELF file that runs at the end, and some files have .ctors / .dtors (constructor and destructor) sections as well.
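
One way to see that loading a library runs code: ctypes.CDLL wraps dlopen, and dlopen runs the library’s .init / constructor code as part of loading it. (libm probably doesn’t do anything exciting in its initializer – this is just a sketch to show the mechanism, assuming a glibc system.)

import ctypes

# this is a dlopen("libm.so.6") under the hood: any .init / constructor
# code in libm runs at this line, before we've called anything from it
libm = ctypes.CDLL("libm.so.6")
print(libm.cos)  # now we can look up symbols in it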

Okay, that’s enough about dynamic linking.

9: go to _start

After dynamic linking is done, we go to _start in the Python interpreter. Then it does all the normal Python interpreter things you’d expect.

I’m not going to talk about this because here I’m interested in general facts about how binaries are run on Linux, not the Python interpreter specifically.

10: write a string

We still need to print out “hello world” though. Under the hood, the Python print function calls some function from libc. But which one? Let’s find out!

See for yourself: Run ltrace -o out python3 hello.py.

$ ltrace -o out python3 hello.py
$ grep hello out
write(1, "hello world\n", 12) = 12

So it looks like it’s calling write.

I honestly am always a little suspicious of ltrace – unlike strace (which I would trust with my life), I’m never totally sure that ltrace is actually reporting library calls accurately. But in this case it seems to be working. And if we look at the cpython source code, it does seem to be calling write() in some places. So I’m willing to believe that.
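
You can also make the same write call yourself from Python, skipping print entirely:

import os

# issue a write on file descriptor 1 (stdout), just like the one
# ltrace showed us
os.write(1, b"hello world\n")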

what’s libc?

We just said that Python calls the write function from libc. What’s libc? It’s the C standard library, and it’s responsible for a lot of basic things like:

  • allocating memory with malloc
  • file I/O (opening, closing, reading, and writing files)
  • executing programs (with execvp, like we mentioned before)
  • looking up DNS records with getaddrinfo
  • managing threads with pthread

Programs don’t have to use libc (on Linux, Go famously doesn’t use it and calls Linux system calls directly instead), but most other programming languages I use (node, Python, Ruby, Rust) all use libc. I’m not sure about Java.

You can find out if you’re using libc by running ldd on your binary: if you see something like libc.so.6, that’s libc.

why does libc matter?

You might be wondering – why does it matter that Python calls the libc write and then libc calls the write system call? Why am I making a point of saying that libc is in the middle?

I think in this case it doesn’t really matter (AFAIK the write libc function maps pretty directly to the write system call).

But there are different libc implementations, and sometimes they behave differently. The two main ones are glibc (GNU libc) and musl libc.

For example, until recently musl’s getaddrinfo didn’t support TCP DNS – here’s a blog post talking about a bug that this caused.

a little detour into stdout and terminals

In this program, stdout (the 1 file descriptor) is a terminal. And you can do funny things with terminals! Here’s one:

  1. In a terminal, run ls -l /proc/self/fd/1. I get /dev/pts/2
  2. In another terminal window, write echo hello > /dev/pts/2
  3. Go back to the original terminal window. You should see hello printed there!
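
The same trick works from Python – a terminal is just a device file you can open and write to (replace /dev/pts/2 with whatever ls -l /proc/self/fd/1 printed in the other terminal):

# write to another terminal's device file; the text appears there
with open("/dev/pts/2", "w") as f:
    f.write("hello\n")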

that’s all for now!

Hopefully you have a better idea of how hello world gets printed! I’m going to stop adding more details for now because this is already pretty long, but obviously there’s more to say and I might add more if folks chip in with extra details. I’d especially love suggestions for other tools you could use to inspect parts of the process that I haven’t explained here.

Thanks to everyone who suggested corrections / additions – I’ve edited this blog post a lot to incorporate more things :)

Some things I’d like to add if I can figure out how to spy on them:

  • the kernel loader and ASLR (I haven’t figured out yet how to use bpftrace + kprobes to trace the kernel loader’s actions)
  • TTYs (I haven’t figured out how to trace the way write(1, "hello world", 11) gets sent to the TTY that I’m looking at)

I’d love to see a Mac version of this

One of my frustrations with Mac OS is that I don’t know how to introspect my system on this level – when I print hello world, I can’t figure out how to spy on what’s going on behind the scenes the way I can on Linux. I’d love to see a really in-depth explainer.

Some Mac equivalents I know about:

  • ldd -> otool -L
  • readelf -> otool
  • supposedly you can use dtruss or dtrace on mac instead of strace but I’ve never been brave enough to turn off system integrity protection to get it to work
  • strace -> sc_usage seems to be able to collect stats about syscall usage, and fs_usage about file usage

more reading

Some more links: