Reading List
The most recent articles from a list of feeds I subscribe to.
Making a DNS query in Ruby from scratch
Hello! A while back I wrote a post about how to write a toy DNS resolver in Go.
In that post I left out “how to generate and parse DNS queries” because I thought it was boring, but a few people pointed out that they did not know how to parse and generate DNS queries and they were interested in how to do it.
This made me curious – how much work is it to do the DNS parsing? It turns out we can do it in a pretty nice 120-line Ruby program, which is not that bad.
So here’s a quick post on how to generate DNS queries and parse DNS responses! We’re going to do it in Ruby because I’m giving a talk at a Ruby conference soon, and this blog post is partly prep for that talk :). I’ve tried to keep it readable for folks who don’t know Ruby though, I’ve only used pretty basic Ruby code.
At the end we’re going to have a very simple toy Ruby version of dig that can
look up domain names like this:
$ ruby dig.rb example.com
example.com 20314 A 93.184.216.34
The whole thing is about 120 lines of code, so it’s not that much. (The final program is dig.rb if you want to skip the explanations and just read some code.) We won’t implement the “how a DNS resolver works” from the previous post because, well, we already did that. Let’s get into it!
Along the way I’m going to try to explain how you could figure out some of this stuff yourself if you were trying to figure out how DNS queries are formatted from scratch. Mostly that’s “poke around in Wireshark” and “read RFC 1035, the DNS RFC”.
step 1: open a UDP socket
We need to actually send our queries, so to do that we need to open a UDP
socket. We’ll send our queries to 8.8.8.8, Google’s DNS server.
Here’s the code to set up a UDP connection to 8.8.8.8, port 53 (the DNS port).
require 'socket'
sock = UDPSocket.new
sock.bind('0.0.0.0', 12345)
sock.connect('8.8.8.8', 53)
a quick note on UDP
I’m not going to say too much about UDP here, but I will say that the basic unit of computer networking is the “packet” (a packet is a string of bytes), and in this program we’re going to do the simplest possible thing you can do with a computer network – send 1 packet and receive 1 packet in response.
So UDP is a way to send packets in the simplest possible way.
It’s the most common way to send DNS queries, though you can also use TCP or DNS-over-HTTPS instead.
step 2: copy a DNS query from Wireshark
Next: let’s say we have no idea how DNS works but we want to send a working query as fast as possible. The easiest way to get a DNS query to play with and make sure our UDP connection is working is to just copy one that already works!
So that’s what we’re going to do, using Wireshark (an incredible packet analysis tool)
The steps I used to do this are roughly:
- Open Wireshark and click ‘capture’
- Enter udp.port == 53 as a filter (in the search bar)
- Run ping example.com in my terminal (to generate a DNS query)
- Click on the DNS query (“Standard query A example.com”)
- Right click on “Domain Name System (query)” in the bottom left pane
- Click ‘Copy’ -> ‘as a hex stream’
- Now I have “b96201000001000000000000076578616d706c6503636f6d0000010001” on my clipboard, to use in my Ruby program. Hooray!
step 3: decode the hex stream and send the DNS query
Now we can send our DNS query to 8.8.8.8! Here’s what that looks like: we just need to add 5 lines of code
hex_string = "b96201000001000000000000076578616d706c6503636f6d0000010001"
bytes = [hex_string].pack('H*')
sock.send(bytes, 0)
# get the reply
reply, _ = sock.recvfrom(1024)
puts reply.unpack('H*')
[hex_string].pack('H*') is translating our hex string into a byte string. At
this point we don’t really know what this data means but we’ll get there in a
second.
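To make the hex conversion a bit less mysterious, here is a small round trip on the query we copied. It is only a sketch that assumes the hex string from above; `unpack1` is the standard String method that returns the first unpacked value.

```ruby
hex_string = "b96201000001000000000000076578616d706c6503636f6d0000010001"

# pack('H*') turns each pair of hex digits into one byte
bytes = [hex_string].pack('H*')
puts bytes.bytesize  # 29 (58 hex digits / 2)

# unpack1('H*') goes the other way: bytes back into a hex string
puts bytes.unpack1('H*') == hex_string  # true
```

29 bytes is also the query length that tcpdump reports in the output further down.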
We can also take this opportunity to make sure our program is working and is sending valid data, using tcpdump. How I did that:
- Run sudo tcpdump -ni any port 53 and host 8.8.8.8 in a terminal tab
- In a different terminal tab, run this Ruby program (ruby dns-1.rb)
Here’s what the output looks like:
$ sudo tcpdump -ni any port 53 and host 8.8.8.8
08:50:28.287440 IP 192.168.1.174.12345 > 8.8.8.8.53: 47458+ A? example.com. (29)
08:50:28.312043 IP 8.8.8.8.53 > 192.168.1.174.12345: 47458 1/0/0 A 93.184.216.34 (45)
This is really good - we can see the DNS request (“what’s the IP for
example.com”) and the response (“it’s 93.184.216.34”). So everything is
working. Now we just need to, you know, figure out how to generate and decode this data ourselves.
step 4: learn a little about how DNS queries are formatted
Now that we have a DNS query for example.com, let’s learn about what it means.
Here’s our query, formatted as hex.
b96201000001000000000000076578616d706c6503636f6d0000010001
If you poke around in Wireshark, you’ll see that this query has 2 parts:
- The header (b96201000001000000000000)
- The question (076578616d706c6503636f6d0000010001)
step 5: make the header
Our goal in this step is to generate the byte string
b96201000001000000000000, but with a Ruby function instead of hardcoding it.
So: the header is 12 bytes. What do those 12 bytes mean? If you look at Wireshark (or read RFC 1035), you’ll see that it’s 6 2-byte numbers concatenated together.
The 6 numbers correspond to the query ID, the flags, and then the number of questions, answer records, authoritative records, and additional records in the packet.
We don’t need to worry about what all those things are yet though – we just need to put in 6 numbers.
And luckily we know exactly which 6 numbers to put because our goal is to
literally generate the string b96201000001000000000000.
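If you want to see those 6 numbers without opening Wireshark, here is a quick sketch that unpacks the header bytes we copied. The variable names are just the ones used in this post, not anything official.

```ruby
header_hex = "b96201000001000000000000"
header = [header_hex].pack('H*')

# 'n6' means: six 16-bit unsigned integers in network (big-endian) byte order
id, flags, num_questions, num_answers, num_auth, num_additional = header.unpack('n6')

puts id              # 47458, the same query ID tcpdump printed
puts flags.to_s(16)  # 100
puts num_questions   # 1
puts [num_answers, num_auth, num_additional].inspect  # [0, 0, 0]
```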
So here’s a function to make the header. (note: there’s no return because you don’t need to write return in Ruby if it’s the last line of the function)
def make_question_header(query_id)
  # id, flags, num questions, num answers, num auth, num additional
  [query_id, 0x0100, 0x0001, 0x0000, 0x0000, 0x0000].pack('nnnnnn')
end
This is very short because we’ve hardcoded everything except the query ID.
what’s nnnnnn?
You might be wondering what nnnnnn is in .pack('nnnnnn'). That’s a format
string telling .pack() how to convert that array of 6 numbers into a byte
string.
The documentation for .pack is here, and it says that n means
“16-bit unsigned, network (big-endian) byte order”.
16 bits is the same as 2 bytes, and we need to use network byte order because this is computer networking. I’m not going to explain byte order right now (though I do have a comic attempting to explain it)
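Here is a tiny sketch of what byte order changes in practice, comparing `n` with `v` (the little-endian 16-bit format in the same pack documentation). It is only meant to show that the same number can come out as two different byte strings.

```ruby
number = 0x0100

big_endian    = [number].pack('n')  # network byte order: most significant byte first
little_endian = [number].pack('v')  # the opposite order, just for comparison

p big_endian.bytes     # [1, 0]
p little_endian.bytes  # [0, 1]
```

DNS wants the first one, which is why every number in this post gets packed with `n` (or `N` for 32-bit numbers).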
test the header code
Let’s quickly test that our make_question_header function works.
puts make_question_header(0xb962) == ["b96201000001000000000000"].pack("H*")
This prints out “true”, so we win and we can move on.
step 5: encode the domain name
Next we need to generate the question (“what’s the IP for example.com?“). This has 3 parts:
- the domain name (for example “example.com”)
- the query type (for example “A” is for “IPv4 Address”)
- the query class (which is always the same: 1 is for IN, which is for INternet)
The hardest part of this is the domain name so let’s write a function to do that.
example.com is encoded in a DNS query, in hex, as 076578616d706c6503636f6d00. What does that mean?
Well, if we translate the bytes into ASCII, it looks like this:
07 65 78 61 6d 70 6c 65 03 63 6f 6d 00
 7  e  x  a  m  p  l  e  3  c  o  m  0
So each segment (like example) has its length (like 7) in front of it.
Here’s the Ruby code to translate example.com into 7 e x a m p l e 3 c o m 0:
def encode_domain_name(domain)
  domain
    .split(".")
    .map { |x| x.length.chr + x }
    .join + "\0"
end
Other than that, to finish generating the question section we just need to append the type and class onto the end of the domain name.
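As a quick check, here is a sketch that builds the whole question section and prints it as hex, so we can compare it with the question bytes we copied from Wireshark. `TYPE_A` and `CLASS_IN` are names I made up for this example; the post itself just passes the number 1.

```ruby
def encode_domain_name(domain)
  domain
    .split(".")
    .map { |x| x.length.chr + x }
    .join + "\0"
end

TYPE_A = 1    # "A" record (hypothetical constant name)
CLASS_IN = 1  # "IN", for internet (hypothetical constant name)

# the type and class are two 16-bit numbers appended after the encoded name
question = encode_domain_name("example.com") + [TYPE_A, CLASS_IN].pack('nn')
puts question.unpack1('H*')
# 076578616d706c6503636f6d0000010001
```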
step 6: write make_dns_query
Here’s the final function to make a DNS query:
def make_dns_query(domain, type)
  query_id = rand(65535)
  header = make_question_header(query_id)
  question = encode_domain_name(domain) + [type, 1].pack('nn')
  header + question
end
Here’s all the code we’ve written before in dns-2.rb –
it’s still only 29 lines.
now for the parsing
Now that we’ve managed to generate a DNS query, we get into the hard part: the parsing. Again, we’ll split this into a bunch of different steps:
- parse a DNS header
- parse a DNS name
- parse a DNS record
The hardest part of this (maybe surprisingly) is going to be “parse a DNS name”.
step 7: parse the DNS header
Let’s start with the easiest part: the DNS header. We already talked about how it’s 6 numbers concatenated together.
So all we need to do is
- read the first 12 bytes
- convert that into an array of 6 numbers
- put those numbers in a class for convenience
Here’s the Ruby code to do that.
class DNSHeader
  attr_reader :id, :flags, :num_questions, :num_answers, :num_auth, :num_additional
  def initialize(buf)
    hdr = buf.read(12)
    @id, @flags, @num_questions, @num_answers, @num_auth, @num_additional = hdr.unpack('nnnnnn')
  end
end
Quick Ruby note: attr_reader is a Ruby thing that means “make these instance
variables accessible as methods”. So you can call header.flags to look at the
@flags variable.
We can call this with DNSHeader.new(buf). Not so bad.
Let’s move on to the hardest part: parsing a domain name.
step 8: parse a domain name
First, let’s write a partial version.
def read_domain_name_wrong(buf)
  domain = []
  loop do
    len = buf.read(1).unpack('C')[0]
    break if len == 0
    domain << buf.read(len)
  end
  domain.join('.')
end
This repeatedly reads 1 byte and then reads that length into a string until the length is 0.
This works great the first time we see a domain name (example.com) in our DNS response.
trouble with domain names: compression!
But the second time example.com appears, we run into trouble – in Wireshark,
it says that the domain is represented cryptically as just the 2 bytes c00c.
This is something called DNS compression and if we want to parse any DNS responses we’re going to have to implement it.
This is luckily not that hard. All c00c is saying is:
- The first 2 bits (0b11.....) mean “DNS compression ahead!”
- The remaining 14 bits are an integer. In this case that integer is 12 (0x0c), so that means “go back to the 12th byte in the packet and use the domain name you find there”
If you want to read more about DNS compression, I found the explanation in the DNS RFC relatively readable.
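Here is that bit arithmetic on its own, as a small sketch using the 2 bytes c00c from Wireshark. Nothing here is new: it is just the two bullet points above written as Ruby.

```ruby
pointer = ["c00c"].pack('H*')
first_byte, second_byte = pointer.unpack('CC')

# the top 2 bits being 11 marks this as a compression pointer
is_pointer = (first_byte & 0b11000000) == 0b11000000

# the other 14 bits (6 from the first byte, 8 from the second) are the offset
offset = ((first_byte & 0b00111111) << 8) | second_byte

puts is_pointer  # true
puts offset      # 12
```

Offset 12 is the byte right after the 12-byte header, which is where the domain name in the question section starts.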
step 9: implement DNS compression
So we need a more complicated version of our read_domain_name function
Here it is.
def read_domain_name(buf)
  domain = []
  loop do
    len = buf.read(1).unpack('C')[0]
    break if len == 0
    if len & 0b11000000 == 0b11000000
      # weird case: DNS compression!
      second_byte = buf.read(1).unpack('C')[0]
      offset = ((len & 0x3f) << 8) + second_byte
      old_pos = buf.pos
      buf.pos = offset
      domain << read_domain_name(buf)
      buf.pos = old_pos
      break
    else
      # normal case
      domain << buf.read(len)
    end
  end
  domain.join('.')
end
Basically what’s happening is:
- if the first 2 bits are 0b11, we need to do DNS compression. Then:
  - read the second byte and do a little bit arithmetic to convert that into the offset
  - save the current position in the buffer
  - read the domain name at the offset we calculated
  - restore our position in the buffer
This is kind of messy but it’s the most complicated part of parsing the DNS response, so we’re almost done!
a DNS compression exploit
Someone pointed out that a malicious actor could exploit this code by sending a
DNS response with a DNS compression entry that points to itself, so that
read_domain_name would end up in an infinite loop. I won’t update it (the
code is already complicated enough!) but a real DNS parser would be
smarter and deal with that. For example here’s the code that avoids infinite loops in miekg/dns
There are also probably other edge cases that would be problematic if this were a real DNS parser.
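For the curious, here is one way the infinite loop could be avoided: count how many compression pointers we have followed and give up after a limit. This is only a sketch, not what miekg/dns does; `read_domain_name_safe` and `MAX_POINTER_JUMPS` are names I invented, and the limit of 10 is arbitrary.

```ruby
require 'stringio'

MAX_POINTER_JUMPS = 10 # hypothetical limit, chosen arbitrarily for this sketch

def read_domain_name_safe(buf, jumps = 0)
  # a pointer that points at itself would otherwise recurse forever
  raise ArgumentError, 'too many compression pointers' if jumps > MAX_POINTER_JUMPS

  labels = []
  loop do
    len = buf.read(1).unpack1('C')
    break if len == 0
    if (len & 0b11000000) == 0b11000000
      # compression pointer: follow it, but remember how many we've followed
      offset = ((len & 0x3f) << 8) + buf.read(1).unpack1('C')
      old_pos = buf.pos
      buf.pos = offset
      labels << read_domain_name_safe(buf, jumps + 1)
      buf.pos = old_pos
      break
    else
      labels << buf.read(len)
    end
  end
  labels.join('.')
end

# a normal name followed by a pointer back to offset 0
good = StringIO.new("\x07example\x03com\x00\xc0\x00".b)
puts read_domain_name_safe(good)  # example.com
puts read_domain_name_safe(good)  # example.com (via the pointer)
```

A buffer containing just the 2 bytes c0 00 (a pointer to itself) raises an ArgumentError with this version instead of looping forever.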
step 10: parse a DNS query
You might think “why do we need to parse a DNS query? This is the response!”. But every DNS response has the original query in it, so we need to parse it.
Here’s the code for parsing the DNS query.
class DNSQuery
  attr_reader :domain, :type, :cls
  def initialize(buf)
    @domain = read_domain_name(buf)
    @type, @cls = buf.read(4).unpack('nn')
  end
end
There’s not very much to it: the type and class are 2 bytes each.
step 11: parse a DNS record
This is the exciting part – the DNS record is where our query data lives! The “rdata field” (“record data”) is where the IP address we’re going to get in response to our DNS query lives.
Here’s the code:
class DNSRecord
  attr_reader :name, :type, :class, :ttl, :rdlength, :rdata
  def initialize(buf)
    @name = read_domain_name(buf)
    # type (2 bytes), class (2 bytes), TTL (4 bytes), rdata length (2 bytes)
    @type, @class, @ttl, @rdlength = buf.read(10).unpack('nnNn')
    @rdata = buf.read(@rdlength)
  end
end
We also need to do a little work to make the rdata field human readable. The
meaning of the record data depends on the record type – for example for an
“A” record it’s a 4-byte IP address, but for a “CNAME” record it’s a domain
name.
So here’s a DNSRecord method to make the record data human readable. (To use it, initialize would call @parsed_rdata = read_rdata(buf, @rdlength) instead of reading the raw bytes into @rdata.)
def read_rdata(buf, length)
  @type_name = TYPES[@type] || @type
  if @type_name == "CNAME" or @type_name == "NS"
    read_domain_name(buf)
  elsif @type_name == "A"
    buf.read(length).unpack('C*').join('.')
  else
    buf.read(length)
  end
end
This function uses this TYPES hash to map the record type to a human-readable name:
TYPES = {
  1 => "A",
  2 => "NS",
  5 => "CNAME",
  # there are a lot more but we don't need them for this example
}
The most interesting part of read_rdata is probably the line buf.read(length).unpack('C*').join('.') – it’s saying “hey, an IP address is 4 bytes,
so convert it into an array of 4 numbers and then join those with dots”.
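Here is that one line in isolation, as a sketch with the 4 bytes of example.com’s address typed in by hand.

```ruby
# the 4 raw bytes of an A record's rdata
rdata = [93, 184, 216, 34].pack('C4')
puts rdata.bytesize  # 4

# 'C*' unpacks every byte as an unsigned 8-bit number
numbers = rdata.unpack('C*')
puts numbers.join('.')  # 93.184.216.34
```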
step 12: finish parsing the DNS response
Now we’re ready to parse the DNS response!
Here’s some code to do that:
require 'stringio'

class DNSResponse
  attr_reader :header, :queries, :answers, :authorities, :additionals
  def initialize(bytes)
    # StringIO lets us call .read and .pos on the reply like it's a file
    buf = StringIO.new(bytes)
    @header = DNSHeader.new(buf)
    @queries = (1..@header.num_questions).map { DNSQuery.new(buf) }
    @answers = (1..@header.num_answers).map { DNSRecord.new(buf) }
    @authorities = (1..@header.num_auth).map { DNSRecord.new(buf) }
    @additionals = (1..@header.num_additional).map { DNSRecord.new(buf) }
  end
end
This mostly just calls the other functions we’ve written to parse the DNS response.
It uses this cute (1..@header.num_answers).map construction to create an
array of 2 DNS records if @header.num_answers is 2. (which is maybe a
little bit of Ruby magic but I think it’s kind of fun and hopefully isn’t too hard
to read)
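If the range trick is unfamiliar, here is a tiny sketch of it on its own. `parse_n` is a made-up helper that returns strings instead of parsing real records.

```ruby
# hypothetical stand-in for "parse count records from the buffer"
def parse_n(count)
  (1..count).map { |i| "record #{i}" }
end

p parse_n(2)  # ["record 1", "record 2"]
p parse_n(0)  # [] (the range 1..0 is empty, so the block never runs)
```

The empty case matters here: most responses have 0 authority records, and we want an empty array rather than an error.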
We can integrate this code into our main function like this:
sock.send(make_dns_query("example.com", 1), 0) # 1 is "A", for IP address
reply, _ = sock.recvfrom(1024)
response = DNSResponse.new(reply) # parse the response!!!
puts response.answers[0]
Printing out the records looks awful though (it says something like
#<DNSRecord:0x00000001368e3118>). So we need to write some pretty printing
code to make it human readable.
step 13: pretty print our DNS records
We need to add a .to_s method to DNS records to make them have a nice string
representation. This is just a 1-line method in DNSRecord:
def to_s
  "#{@name}\t\t#{@ttl}\t#{@type_name}\t#{@parsed_rdata}"
end
You also might notice that I left out the class field of the DNS record. That’s because it’s
always the same (IN for “internet”) so I felt it was redundant. Most DNS tools
(like real dig) will print out the class though.
and we’re done!
Here’s our final main function:
def main
  # connect to google dns
  sock = UDPSocket.new
  sock.bind('0.0.0.0', 0)
  sock.connect('8.8.8.8', 53)
  # send query
  domain = ARGV[0]
  sock.send(make_dns_query(domain, 1), 0)
  # receive & parse response
  reply, _ = sock.recvfrom(1024)
  response = DNSResponse.new(reply)
  response.answers.each do |record|
    puts record
  end
end
I don’t think there’s too much to say about this – we connect, send a query, print out each of the answers, and exit. Success!
$ ruby dig.rb example.com
example.com 18608 A 93.184.216.34
You can see the final program as a gist here: dig.rb. You could add more features to it if you want, like
- pretty printing for other query types
- options to print out the “authority” and “additional” sections of the DNS response
- retries
- making sure that the DNS response we see is actually a response to the query we sent (the query ID has to match!)
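As a starting point for that last item, here is a rough sketch of a query ID check. `response_matches_query?` is a name I invented, and the reply bytes are made-up headers rather than real captured packets.

```ruby
def response_matches_query?(query_bytes, reply_bytes)
  # a DNS header is 12 bytes, so anything shorter can't be a valid reply
  return false if reply_bytes.nil? || reply_bytes.bytesize < 12

  query_id = query_bytes.unpack1('n')
  reply_id, reply_flags = reply_bytes.unpack('nn')

  # the top bit of the flags (QR) is 1 for responses and 0 for queries
  is_response = (reply_flags & 0x8000) != 0
  is_response && reply_id == query_id
end

query      = ["b96201000001000000000000"].pack('H*')
good_reply = ["b96281800001000100000000"].pack('H*') # made-up header, same ID
bad_reply  = ["123481800001000100000000"].pack('H*') # made-up header, different ID

puts response_matches_query?(query, good_reply)  # true
puts response_matches_query?(query, bad_reply)   # false
puts response_matches_query?(query, query)       # false (it's a query, not a response)
```

In main, this could run right after recvfrom, before handing the reply to DNSResponse.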
Also you can let me know on Twitter if I’ve made a mistake in this post somewhere – I wrote this pretty quickly so I probably got something wrong.
Why do domain names sometimes end with a dot?
Hello! When I was writing the zine How DNS Works
earlier this year, someone asked me – why do people sometimes put a dot at the
end of a domain name? For example, if you look up the IP for example.com by
running dig example.com, you’ll see this:
$ dig example.com
example.com. 5678 IN A 93.184.216.34
dig has added a . to the end of example.com – now it’s example.com.! What’s up with that?
Also, some DNS tools require domains to have a "." at the end: if you try to pass example.com to miekg/dns, like this, it’ll fail:
// trying to send this message will return an error
m := new(dns.Msg)
m.SetQuestion("example.com", dns.TypeA)
Originally I thought I knew the answer to this (“uh, the dot at the end means the domain is fully qualified?”). And that’s true – a fully qualified domain name is a domain with a “.” at the end!
But that doesn’t explain why dots at the end are useful or important.
in a DNS request/response, domain names don’t have a trailing “.”
I once (incorrectly) thought the answer to “why is there a dot at the end?” might be “In a DNS request/response, domain names have a “.” at the end, so we put it in to match what actually gets sent/received by your computer”. But that’s not true at all!
When a computer sends a DNS request or response, the domain names in it don’t have a trailing dot. Actually, the domain names don’t have any dots.
Instead, they’re encoded as a series of length/string pairs. For example,
the domain example.com is encoded as these 13 bytes:
7example3com0
So there are no dots at all. Instead, an ASCII domain name (like “example.com”) gets translated into the format used in a DNS request / response by various DNS software.
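To make this concrete, here is a small sketch using the toy encode_domain_name function from the Ruby DNS post above. It only shows what this toy encoder does; other DNS software may treat the trailing dot differently before encoding.

```ruby
def encode_domain_name(domain)
  domain
    .split(".")
    .map { |x| x.length.chr + x }
    .join + "\0"
end

without_dot = encode_domain_name("example.com")
with_dot    = encode_domain_name("example.com.")

puts without_dot.bytesize       # 13
puts without_dot.include?(".")  # false
# Ruby's split drops the empty string after the final ".",
# so both spellings turn into exactly the same bytes
puts with_dot == without_dot    # true
```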
So let’s talk about one place where domain names are translated into DNS responses: zone files.
the trailing “.” in zone files
One way that some people manage DNS records for a domain is to create a text
file called a “zone file” and then configure some DNS server software (like nsd
or bind) to serve the DNS records specified in that zone file.
Here’s an imaginary zone file for example.com:
orange 300 IN A 1.2.3.4
fruit 300 IN CNAME orange
grape 3000 IN CNAME example.com.
In this zone file, anything that doesn’t end in a "." (like "orange") gets
.example.com added to it. So "orange" is shorthand for
"orange.example.com". The DNS server knows from its configuration that this
is a zone file for example.com, so it knows to automatically append
example.com at the end of any name that doesn’t end with a dot.
I assume the idea here is just to save typing – you could imagine writing this zone file by fully typing out all of the domain names:
orange.example.com. 300 IN A 1.2.3.4
fruit.example.com. 300 IN CNAME orange.example.com.
grape.example.com. 3000 IN CNAME example.com.
But that’s a lot of typing.
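The rule the DNS server applies can be sketched in a couple of lines of Ruby. `qualify` is a hypothetical helper, and it ignores everything else zone files can do (like @ or $ORIGIN); it only shows the “append the zone unless there’s a trailing dot” idea.

```ruby
# hypothetical helper: turn a name from a zone file into a full domain name
def qualify(name, origin)
  # names ending in "." are already complete
  return name if name.end_with?(".")
  "#{name}.#{origin}"
end

origin = "example.com."
puts qualify("orange", origin)        # orange.example.com.
puts qualify("example.com.", origin)  # example.com.
puts qualify("example.com", origin)   # example.com.example.com. (oops: forgot the dot!)
```

The last line is the classic zone file mistake: leave off the trailing dot and the zone name gets added a second time.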
you don’t need zone files to use DNS
Even though the zone file format is defined in the official DNS RFC (RFC 1035), you don’t have to use zone files at all to use DNS. For example, AWS Route 53 doesn’t use zone files to store DNS records! Instead you create records through the web interface or API, and I assume they store records in some kind of database and not a bunch of text files.
Route 53 (like many other DNS tools) does support importing and exporting zone files though and it can be a good way to migrate records from one DNS provider to another.
the trailing “.” in dig
Now, let’s talk about dig’s output:
$ dig example.com
; <<>> DiG 9.18.1-1ubuntu1.1-Ubuntu <<>> +all example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10712
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;example.com. IN A
;; ANSWER SECTION:
example.com. 81239 IN A 93.184.216.34
One weird thing about this is that almost every line starts with a ;;. What’s
up with that? Well ; is the comment character in zone files!
So I think the reason that dig prints out its output in this weird way is so that if you wanted, you could just paste this into a zone file and have it work without any changes.
This also explains why there’s a . at the end of example.com. – zone files
require a trailing dot at the end of a domain name (because otherwise they’re
interpreted as being relative to the zone). So dig does too.
I really wish dig had a +human flag that printed out all of this information
in a more human readable way, but for now I’m too lazy to put in the work to
actually contribute code to do that (and I’m a pretty bad C programmer) so I’ll
just complain about it on my blog instead :)
the trailing "." in curl
Let’s talk about another case where the trailing "." shows up: curl!
One of the computers in my house is called “grapefruit”, and it’s running a
webserver. Here’s what happens if I run curl grapefruit:
$ curl grapefruit
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
It works! Cool. But what happens if I add a . at the end? Suddenly it doesn’t work:
$ curl grapefruit.
curl: (6) Could not resolve host: grapefruit.
What’s going on? To understand, we need to learn about search domains:
meet search domains
When I run curl grapefruit, how does that get translated into a DNS request?
You might think that my computer would send a request for the domain
grapefruit, right? But that’s not true.
Let’s use tcpdump to see what domain is actually being looked up:
$ sudo tcpdump -i any port 53
[...] A? grapefruit.lan. (32)
It’s actually sending a request for grapefruit.lan. What’s up with that?
Well, what’s going on is that:
- To look up grapefruit, curl calls a function called getaddrinfo
- getaddrinfo looks in a file on my computer called /etc/resolv.conf
- /etc/resolv.conf contains these 2 lines:
  nameserver 127.0.0.53
  search lan
- Because it sees search lan, getaddrinfo adds a lan at the end of grapefruit and looks up grapefruit.lan instead
when are search domains used?
Now we know something weird: that when we look up a domain, sometimes an extra
thing (like lan) will be added to the end. But when does that happen?
- If we put a "." at the end of the domain (like curl grapefruit.), then search domains aren’t used
- If the domain has a "." inside it (like example.com has a dot in it), then by default search domains aren’t used either. But this can be changed with configuration (see this blog post about ndots that talks about this more)
So now we know why curl grapefruit. has different results than curl grapefruit – it’s because one looks up the domain grapefruit. and the other one looks up grapefruit.lan.
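Here is a rough Ruby model of those two rules. It is a simplification: `first_lookup_name` is a name I made up, it only returns the first name that would be looked up, and real resolvers have more behaviour than this (for example, what they try next if the first lookup fails).

```ruby
# hypothetical model of "which name gets looked up first?"
def first_lookup_name(name, search_domains, ndots: 1)
  # rule 1: a trailing dot means "this is the whole name"
  return name if name.end_with?(".")
  # rule 2: enough dots inside the name means it's treated as complete
  return "#{name}." if name.count(".") >= ndots
  return "#{name}." if search_domains.empty?
  # otherwise, add the first search domain
  "#{name}.#{search_domains.first}."
end

puts first_lookup_name("grapefruit", ["lan"])   # grapefruit.lan.
puts first_lookup_name("grapefruit.", ["lan"])  # grapefruit.
puts first_lookup_name("example.com", ["lan"])  # example.com.
```

The `ndots` parameter is the configuration knob mentioned above: with a higher value, names like example.com would get the search domain added first too.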
how does my computer know what search domain to use?
When I connect to my router, it uses DHCP to tell me that its search domain is lan –
it’s the same way that my computer gets assigned an IP address.
so why do people put a dot at the end of domain names?
Now that we know about zone files and search domains, here’s why I think people like to put dots at the end of a domain name.
There are two contexts where domain names are modified and get something else added to the end:
- in a zone file for example.com, grapefruit gets translated to grapefruit.example.com
- on my local network (with my computer configured to use the search domain lan), grapefruit gets translated to grapefruit.lan
So because domain names can actually be translated to something else in some
cases, people like to put a "." at the end to communicate “THIS IS THE
DOMAIN NAME, NOTHING GETS ADDED AT THE END, THIS IS THE WHOLE THING”. Because
otherwise it can get confusing.
The technical term for “THIS IS THE WHOLE THING” is “fully qualified domain
name” or “FQDN”. So google.com. is a fully qualified domain name, and
google.com isn’t.
I always have to remind myself of the reasons for this because I rarely use
zone files or search domains, so I often feel like – “of course I mean
google.com and not google.com.something.else! Why would I mean anything
else?? That’s silly!”
But some people do use zone files and search domains (search domains are used in Kubernetes, for example!), so the “.” at the end is useful to make it 100% clear that nothing else should be added.
when to put a “.” at the end?
Here are a couple of quick notes about when to put a “.” at the end of your domain names:
Yes: when configuring DNS
It’s never bad to use fully qualified domain names when configuring DNS. You don’t always have to: a non-fully-qualified domain name will often work just fine as well, but I’ve never met a piece of DNS software that wouldn’t accept a fully qualified domain name.
And some DNS software requires it: right now the DNS server I use for jvns.ca
makes me put a "." at the end of domain names (for example in CNAME records)
and warns me that otherwise it’ll append .jvns.ca to whatever I typed in. I don’t
agree with this design decision but it’s not a big deal, I just put a “.” at
the end.
No: in a browser
Confusingly, it often doesn’t work to put a "." at the end of a domain name in a
browser! For example, if I type https://twitter.com. into my browser, it
doesn’t work! It gives me a 404.
I think what’s going on here is that it’s setting the HTTP Host header to
Host: twitter.com. (with the trailing dot) and the web server on the other end is expecting Host: twitter.com (without it).
Similarly, https://jvns.ca. gives me an SSL error for some reason.
I think relative domain names used to be more common
One last thing: I think that “relative” domain names (like me using
grapefruit to refer to the other computer in my house, grapefruit.lan) used
to be more commonly used, because DNS was developed in the context of
universities or other big institutions which have big internal networks.
On the internet today, it seems like it’s more common to use “absolute” domain
names (like example.com).
How to send raw network packets in Python with tun/tap
Hello!
Recently I’ve been working on a project where I implement a bunch of tiny toy working versions of computer networking protocols in Python without using any libraries, as a way to explain how computer networking works.
I’m still working on writing up that project, but today I wanted to talk about how to do the very first step: sending network packets in Python.
In this post we’re going to send a SYN packet (the first packet in a TCP
connection) from a tiny Python program, and get a reply from example.com. All the code from this post is in this gist.
what’s a network packet?
A network packet is a byte string. For example, here’s the first packet in a TCP connection:
b'E\x00\x00,\x00\x01\x00\x00@\x06\x00\xc4\xc0\x00\x02\x02"\xc2\x95Cx\x0c\x00P\xf4p\x98\x8b\x00\x00\x00\x00`\x02\xff\xff\x18\xc6\x00\x00\x02\x04\x05\xb4'
I’m not going to talk about the structure of this byte string in this post (though I’ll say that this particular byte string has two parts: the first 20 bytes are the IP header and the rest is the TCP part).
The point is that to send network packets, we need to be able to send and receive strings of bytes.
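If you’re curious about those two parts, here is a small sketch that splits the packet and reads a few fields from the IP part. It’s in Ruby (like the DNS code earlier on this page) rather than Python, and the hex string is my own transcription of the byte string above, so treat it as an illustration rather than part of the original program.

```ruby
# the SYN packet from above, transcribed to hex: IP part, then TCP part
syn_hex = "4500002c00010000400600c4c000020222c29543" \
          "780c0050f470988b000000006002ffff18c60000020405b4"
syn = [syn_hex].pack('H*')

# version+header length, TOS, total length, id, fragment info,
# TTL, protocol, checksum, source address, destination address
version_ihl, _tos, total_length, _id, _frag, _ttl, protocol, _checksum, src, dst =
  syn.unpack('CCnnnCCna4a4')

# the low 4 bits of the first byte are the header length in 4-byte words
ip_header_length = (version_ihl & 0x0f) * 4
tcp_part = syn[ip_header_length..-1]

puts ip_header_length             # 20
puts tcp_part.bytesize            # 24
puts total_length                 # 44
puts protocol                     # 6 (TCP)
puts src.unpack('C4').join('.')   # 192.0.2.2
puts dst.unpack('C4').join('.')   # 34.194.149.67
```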
why tun/tap?
The problem with writing your own TCP implementation on Linux (or any operating system) is – the Linux kernel already has a TCP implementation!
So if you send out a SYN packet on your normal network interface to a host like example.com, here’s what will happen:
- you send a SYN packet to example.com
- example.com replies with a SYN ACK (so far so good!)
- the Linux kernel on your machine gets the SYN ACK, thinks “wtf?? I didn’t make this connection??”, and closes the connection
- you’re sad. no TCP connection for you.
I was talking to a friend about this problem a few years ago and he said “you should use tun/tap!“. It took quite a few hours to figure out how to do that though, which is why I’m writing this blog post :)
tun/tap gives you a “virtual network device”
The way I like to think of tun/tap is – imagine I have a tiny computer in my
network which is sending and receiving network packets. But instead of it being
a real computer, it’s just a Python program I wrote.
That explanation is honestly worse than I would like. I wish I understood exactly how tun/tap devices interfaced with the real Linux network stack but unfortunately I do not, so “virtual network device” is what you’re getting. Hopefully the code examples below will make it all a bit more clear.
tun vs tap
The system called “tun/tap” lets you create two kinds of network interfaces:
- “tun”, which lets you send IP-layer packets
- “tap”, which lets you send Ethernet-layer packets
We’re going to be using tun, because that’s what I could figure out how to get to work. It’s possible that tap would work too.
how to create a tun interface
Here’s how I created a tun interface with IP address 192.0.2.2.
sudo ip tuntap add name tun0 mode tun user $USER
sudo ip link set tun0 up
sudo ip addr add 192.0.2.1 peer 192.0.2.2 dev tun0
sudo iptables -t nat -A POSTROUTING -s 192.0.2.2 -j MASQUERADE
sudo iptables -A FORWARD -i tun0 -s 192.0.2.2 -j ACCEPT
sudo iptables -A FORWARD -o tun0 -d 192.0.2.2 -j ACCEPT
These commands do two things:
- Create the tun device with the IP 192.0.2.2 (and give your user access to write to it)
- set up iptables to proxy packets from that tun device to the internet using NAT
The iptables part is very important because otherwise the packets would only exist inside my computer and wouldn’t be sent to the internet, and what fun would that be?
I’m not going to explain this ip addr add command because I don’t understand
it, I find ip to be very inscrutable and for now I’m resigned to just copying
and pasting ip commands without fully understanding them. It does work
though.
how to connect to the tun interface in Python
Here’s a function to open a tun interface, you call it like openTun(b'tun0') (the name has to be a byte string).
I figured out how to write it by searching through the
scapy source code for “tun”.
import struct
from fcntl import ioctl

def openTun(tunName):
    # tunName must be bytes (like b"tun0") because of the "16s" format below
    tun = open("/dev/net/tun", "r+b", buffering=0)
    LINUX_IFF_TUN = 0x0001
    LINUX_IFF_NO_PI = 0x1000
    LINUX_TUNSETIFF = 0x400454CA
    flags = LINUX_IFF_TUN | LINUX_IFF_NO_PI
    # 16-byte interface name, 2-byte flags, then padding
    ifs = struct.pack("16sH22s", tunName, flags, b"")
    ioctl(tun, LINUX_TUNSETIFF, ifs)
    return tun
All this is doing is
- opening /dev/net/tun in binary mode
- calling an ioctl to tell Linux that we want a tun device, and that the one we want is called tun0 (or whatever tunName we’ve passed to the function).
Once it’s open, we can read from and write to it like any other file in Python.
let’s send a SYN packet!
Now that we have the openTun function, we can send a SYN packet!
Here’s what the Python code looks like, using the openTun function.
syn = b'E\x00\x00,\x00\x01\x00\x00@\x06\x00\xc4\xc0\x00\x02\x02"\xc2\x95Cx\x0c\x00P\xf4p\x98\x8b\x00\x00\x00\x00`\x02\xff\xff\x18\xc6\x00\x00\x02\x04\x05\xb4'
tun = openTun(b"tun0")
tun.write(syn)
reply = tun.read(1024)
print(repr(reply))
If I run this as sudo python3 syn.py, it prints out the reply from example.com:
b'E\x00\x00,\x00\x00@\x00&\x06\xda\xc4"\xc2\x95C\xc0\x00\x02\x02\x00Px\x0cyvL\x84\xf4p\x98\x8c`\x12\xfb\xe0W\xb5\x00\x00\x02\x04\x04\xd8'
Obviously this is a pretty silly way to send a SYN packet – a real implementation would have actual code to generate that byte string instead of hardcoding it, and we would parse the reply instead of just printing out the raw byte string. But I didn’t want to go into the structure of TCP in this post so that’s what we’re doing.
looking at these packets with tcpdump
If we run tcpdump on the tun0 interface, we can see the packet we sent and the answer from example.com:
$ sudo tcpdump -ni tun0
12:51:01.905933 IP 192.0.2.2.30732 > 34.194.149.67.80: Flags [S], seq 4101019787, win 65535, options [mss 1460], length 0
12:51:01.932178 IP 34.194.149.67.80 > 192.0.2.2.30732: Flags [S.], seq 3300937416, ack 4101019788, win 64480, options [mss 1240], length 0
Flags [S] is the SYN we sent, and Flags [S.] is the SYN ACK packet in
response! We successfully communicated! And the Linux network stack didn’t
interfere at all!
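To connect the raw bytes that syn.py printed with what tcpdump shows, here's a small sketch that parses the reply from earlier. parse_tcp_packet is a name I made up, it assumes an IPv4 packet carrying TCP, and it ignores TCP options. The sequence number in these bytes won't match the tcpdump output above, I assume because the two were captured on different runs, but the ports, the ack number, the flags and the window size line up.

```python
import socket
import struct

reply = b'E\x00\x00,\x00\x00@\x00&\x06\xda\xc4"\xc2\x95C\xc0\x00\x02\x02\x00Px\x0cyvL\x84\xf4p\x98\x8c`\x12\xfb\xe0W\xb5\x00\x00\x02\x04\x04\xd8'

# Flag bits, with the same letters tcpdump prints ("." means ACK)
FLAG_NAMES = [(0x01, "F"), (0x02, "S"), (0x04, "R"), (0x08, "P"), (0x10, ".")]

def parse_tcp_packet(packet):
    ip_header_len = (packet[0] & 0x0F) * 4          # low 4 bits: length in 32-bit words
    src, dst = struct.unpack("!4s4s", packet[12:20])
    tcp = packet[ip_header_len:]
    sport, dport, seq, ack, _offset, flags, window = struct.unpack("!HHIIBBH", tcp[:16])
    return {
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "sport": sport,
        "dport": dport,
        "seq": seq,
        "ack": ack,
        "flags": "".join(name for bit, name in FLAG_NAMES if flags & bit),
        "window": window,
    }

parsed = parse_tcp_packet(reply)
for key, value in parsed.items():
    print(key, value)
```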
tcpdump also shows us how NAT is working
We can also run tcpdump on my real network interface (wlp3s0, my wireless card), to see the packets being sent and received. We’ll pass -i wlp3s0 instead of -i tun0.
$ sudo tcpdump -ni wlp3s0 host 34.194.149.67
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wlp3s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:56:01.204382 IP 192.168.1.181.30732 > 34.194.149.67.80: Flags [S], seq 4101019787, win 65535, options [mss 1460], length 0
12:56:01.228239 IP 34.194.149.67.80 > 192.168.1.181.30732: Flags [S.], seq 144769955, ack 4101019788, win 64480, options [mss 1240], length 0
12:56:05.334427 IP 34.194.149.67.80 > 192.168.1.181.30732: Flags [S.], seq 144769955, ack 4101019788, win 64480, options [mss 1240], length 0
12:56:13.524973 IP 34.194.149.67.80 > 192.168.1.181.30732: Flags [S.], seq 144769955, ack 4101019788, win 64480, options [mss 1240], length 0
12:56:29.705007 IP 34.194.149.67.80 > 192.168.1.181.30732: Flags [S.], seq 144769955, ack 4101019788, win 64480, options [mss 1240], length 0
A couple of things to notice here:
- The IP addresses are different – that iptables rule from above has rewritten them from 192.0.2.2 to 192.168.1.181. This rewriting is called “network address translation”, or “NAT”.
- We’re getting a bunch of replies from example.com – it’s doing an exponential backoff where it retries after 4 seconds, then 8 seconds, then 16 seconds. This is because we didn’t finish the TCP handshake – we just sent a SYN and left it hanging! There’s actually a type of DDOS attack like this called SYN flooding, but just sending one or two SYN packets isn’t a big deal.
- I had to add host 34.194.149.67 because there are a lot of TCP packets being sent on my real wifi connection so I needed to ignore those
I’m not totally sure why we see more SYN replies on wlp3s0 than on tun0, my
guess is that it’s because we only read 1 reply in our Python program.
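If you want to double check the "4 seconds, then 8, then 16" backoff, here's a quick sketch that computes the gaps between the four SYN ACK timestamps in the wlp3s0 capture above. The timestamps are copied from that output.

```python
from datetime import datetime

# Timestamps of the four [S.] packets from the wlp3s0 capture
stamps = ["12:56:01.228239", "12:56:05.334427", "12:56:13.524973", "12:56:29.705007"]
times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]

# Seconds between each reply and the next one, rounded to the nearest second
gaps = [round((later - earlier).total_seconds()) for earlier, later in zip(times, times[1:])]
print(gaps)  # [4, 8, 16]
```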
this is pretty easy and really reliable
The last time I tried to implement TCP in Python I did it with something called “ARP spoofing”. I won’t talk about that here (there are some posts about it on this blog back in 2013), but this way is a lot more reliable.
And ARP spoofing is kind of a sketchy thing to do on a network you don’t own.
here’s the code
I put all the code from this blog post in this gist. If you want to try it yourself, you can run:
bash setup.sh # needs to run as root, has lots of `sudo` commands
python3 syn.py # runs as a regular user
It only works on Linux, but I think there’s a way to set up tun/tap on Mac too.
a plug for scapy
I’ll close with a plug for scapy here: it’s a really great Python networking library for doing this kind of experimentation without writing all the code yourself.
This post is about writing all the code yourself though so I won’t say more about it than that.
Some ways to get better at debugging
Hello! I’ve been working on writing a zine about debugging for a while (here’s an early draft of the table of contents).
As part of that I thought it might be fun to read some academic papers about debugging, and last week Greg Wilson sent me some papers about academic research into debugging.
One of those papers (Towards a framework for teaching debugging [paywalled]) had a categorization I really liked of the different kinds of knowledge/skills we need to debug effectively. It comes from another more general paper on troubleshooting: Learning to Troubleshoot: A New Theory-Based Design Architecture.
I thought the categorization was a very useful structure for thinking about how to get better at debugging, so I’ve reframed the five categories in the paper into actions you can take to get better at debugging.
Here they are:
1. learn the codebase
To debug some code, you need to understand the codebase you’re working with. This seems kind of obvious (of course you can’t debug code without understanding how it works!).
This kind of learning happens pretty naturally over time, and actually debugging is also one of the best ways to learn how a new codebase works – seeing how something breaks helps you learn a lot about how it works.
The paper calls this “System Knowledge”.
2. learn the system
The paper mentions that you need to understand the programming language, but I think there’s more to it than that – to fix bugs, often you need to learn a lot about the broader environment than just the language.
For example, if you’re a backend web developer, some “system” knowledge you might need includes:
- how HTTP caching works
- CORS
- how database transactions work
I find that I often have to be a bit more intentional about learning systemic things like this – I need to actually take the time to look them up and read about them.
The paper calls this “Domain Knowledge”.
3. learn your tools
There are lots of debugging tools out there, for example:
- debuggers (gdb etc)
- browser developer tools
- profilers
- strace / ltrace
- tcpdump / wireshark
- core dumps
- and even basic things like error messages (how do you read them properly)
I’ve written a lot about debugging tools on this blog, and definitely learning these tools has made a huge difference to me.
The paper calls this “Procedural Knowledge”.
4. learn strategies
This is the fuzziest category, we all have a lot of strategies and heuristics we pick up along the way for how to debug efficiently. For example:
- writing a unit test
- writing a tiny standalone program to reproduce the bug
- finding a working version of the code and seeing what changed
- printing out a million things
- adding extra logging
- taking a break
- explaining the bug to a friend and then figuring out what’s wrong halfway through
- looking through the github issues to see if anything matches
I’ve been thinking a lot about this category while writing the zine, but I want to keep this post short so I won’t say more about it here.
The paper calls this “Strategic Knowledge”.
5. get experience
The last category is “experience”. The paper has a really funny comment about this:
Their findings did not show a significant difference in the strategies employed by the novices and experts. Experts simply formed more correct hypotheses and were more efficient at finding the fault. The authors suspect that this result is due to the difference in the programming experience between novices and experts.
This really resonated with me – I’ve had SO MANY bugs that were really frustrating and difficult the first time I ran into them, and very straightforward the fifth or tenth or 20th time.
This also feels like one of the most straightforward categories of knowledge to acquire to me – all you need to do is investigate a million bugs, which is our whole life as programmers anyway :). It takes a long time but I feel like it happens pretty naturally.
The paper calls this “Experiential Knowledge”.
that’s all!
I’m going to keep this post short, I just really liked this categorization and wanted to share it.
A toy remote login server
Hello! The other day we talked about what happens when you press a key in your terminal.
As a followup, I thought it might be fun to implement a program that’s like a tiny ssh server, but without the security. You can find it on github here, and I’ll explain how it works in this blog post.
the goal: “ssh” to a remote computer
Our goal is to be able to login to a remote computer and run commands, like you do with SSH or telnet.
The biggest difference between this program and SSH is that there’s literally no security (not even a password) – anyone who can make a TCP connection to the server can get a shell and run commands.
Obviously this is not a useful program in real life, but our goal is to learn a little more about how terminals work, not to write a useful program.
(I will run a version of it on the public internet for the next week though, you can see how to connect to it at the end of this blog post)
let’s start with the server!
We’re also going to write a client, but the server is the interesting part, so let’s start there. We’re going to write a server that listens on a TCP port (I picked 7777) and creates remote terminals for any client that connects to it to use.
When the server receives a new connection it needs to:
- create a pseudoterminal for the client to use
- start a bash shell process for the client to use
- connect bash to the pseudoterminal
- continuously copy information back and forth between the TCP connection and the pseudoterminal
I just said the word “pseudoterminal” a lot, so let’s talk about what that means.
what’s a pseudoterminal?
Okay, what the heck is a pseudoterminal?
A pseudoterminal is a lot like a bidirectional pipe or a socket – you have two ends, and they can both send and receive information. You can read more about the information being sent and received in what happens if you press a key in your terminal.
Basically the idea is that on one end, we have a TCP connection, and on the
other end, we have a bash shell. So we need to hook one part of the
pseudoterminal up to the TCP connection and the other end to bash.
The two parts of the pseudoterminal are called:
- the “pseudoterminal master”. This is the end we’re going to hook up to the TCP connection.
- the “slave pseudoterminal device”. We’re going to set our bash shell’s stdout, stderr, and stdin to this.
Once they’re connected, we can communicate with bash over our TCP connection
and we’ll have a remote shell!
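To make the "two ends" idea concrete, here's a tiny Python sketch (Python instead of Go just because it's short). It assumes Linux and uses os.openpty() from the standard library to get both ends, then passes some bytes in each direction. Nothing is hooked up to bash or TCP yet.

```python
import os

# master_fd is the end we'd hook up to the TCP connection,
# slave_fd is the end we'd give to bash
master_fd, slave_fd = os.openpty()

# Anything written to the slave end (bash's output) can be read from the master end.
# The terminal may rewrite line endings on the way, so "\n" can come out as "\r\n".
os.write(slave_fd, b"hello from bash\n")
from_slave = os.read(master_fd, 1024)

# Anything written to the master end (what the user typed) can be read from the slave end.
os.write(master_fd, b"ls\n")
from_master = os.read(slave_fd, 1024)

print(from_slave)
print(from_master)
print(os.isatty(slave_fd))  # the slave end counts as a real terminal
```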
why do we need this “pseudoterminal” thing anyway?
You might be wondering – Julia, if a pseudoterminal is kind of like a socket,
why can’t we just set our bash shell’s stdout / stderr / stdin to the TCP
socket?
And you can! We could write a TCP connection handler like this that does exactly that, it’s not a lot of code (server-notty.go).
func handle(conn net.Conn) {
    tty, _ := conn.(*net.TCPConn).File()
    // start bash with tcp connection as stdin/stdout/stderr
    cmd := exec.Command("bash")
    cmd.Stdin = tty
    cmd.Stdout = tty
    cmd.Stderr = tty
    cmd.Start()
}
It even kind of works – if we connect to it with nc localhost 7778, we can
run commands and look at their output.
But there are a few problems. I’m not going to list all of them, just two.
problem 1: Ctrl + C doesn’t work
The way Ctrl + C works in a remote login session is
- you press Ctrl + C
- That gets translated to 0x03 and sent through the TCP connection
- The terminal receives it
- the Linux kernel on the other end notes “hey, that was a Ctrl + C!”
- Linux sends a SIGINT to the appropriate process (more on what the “appropriate process” is exactly later)
If the “terminal” is just a TCP connection, this doesn’t work, because when you
send 0x03 to a TCP connection, Linux won’t magically send SIGINT to any
process.
problem 2: top doesn’t work
When I try to run top in this shell, I get the error message top: failed tty get. If we strace it, we see this system call:
ioctl(2, TCGETS, 0x7ffec4e68d60) = -1 ENOTTY (Inappropriate ioctl for device)
So top is running an ioctl on file descriptor 2 (stderr) to get some
information about the terminal. But Linux is like “hey, this isn’t a terminal!”
and returns an error.
There are a bunch of other things that go wrong, but hopefully at this point you’re convinced that we actually need to set bash’s stdout/stderr to be a terminal, not some other thing like a socket.
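Here's a small Python sketch of both problems, assuming Linux. It uses socket.socketpair() as a stand-in for the TCP connection (that's my simplification, not what the server does), and terminal_settings_error is a made-up helper. Asking a socket for its terminal settings fails with ENOTTY, and a 0x03 byte sent over a socket is just a byte.

```python
import errno
import os
import socket
import termios

sock_a, sock_b = socket.socketpair()   # stand-in for a TCP connection
master_fd, slave_fd = os.openpty()     # a real pseudoterminal, for comparison

def terminal_settings_error(fd):
    """Return None if fd answers terminal ioctls, else the errno name."""
    try:
        termios.tcgetattr(fd)          # asks the kernel for terminal settings, like top does
        return None
    except termios.error as e:
        return errno.errorcode[e.args[0]]

socket_result = terminal_settings_error(sock_a.fileno())
pty_result = terminal_settings_error(slave_fd)
print(socket_result, pty_result)

# Problem 1: on a socket, 0x03 is ordinary data. Nothing turns it into a SIGINT.
sock_a.sendall(b"\x03")
received = sock_b.recv(1)
print(received)
```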
So let’s start looking at the server code and see what creating a pseudoterminal actually looks like.
step 1: create a pseudoterminal
Here’s some Go code to create a pseudoterminal on Linux. This is copied from github.com/creack/pty, but I removed some of the error handling to make the logic a bit easier to follow:
pty, _ := os.OpenFile("/dev/ptmx", os.O_RDWR, 0)
sname := ptsname(pty)
unlockpt(pty)
tty, _ := os.OpenFile(sname, os.O_RDWR|syscall.O_NOCTTY, 0)
In English, what we’re doing is:
- open /dev/ptmx to get the “pseudoterminal master”. Again, that’s the part we’re going to hook up to the TCP connection
- get the filename of the “slave pseudoterminal device”, which is going to be /dev/pts/13 or something.
- “unlock” the pseudoterminal so that we can use it. I have no idea what the point of this is (why is it locked to begin with?) but you have to do it for some reason
- open /dev/pts/13 (or whatever number we got from ptsname) to get the “slave pseudoterminal device”
What do those ptsname and unlockpt functions do? They just make some
ioctl system calls to the Linux kernel. All of the communication with the
Linux kernel about terminals seems to be through various ioctl system calls.
Here’s the code, it’s pretty short: (again, I just copied it from creack/pty)
func ptsname(f *os.File) string {
    var n uint32
    ioctl(f.Fd(), syscall.TIOCGPTN, uintptr(unsafe.Pointer(&n)))
    return "/dev/pts/" + strconv.Itoa(int(n))
}

func unlockpt(f *os.File) {
    var u int32
    // use TIOCSPTLCK with a pointer to zero to clear the lock
    ioctl(f.Fd(), syscall.TIOCSPTLCK, uintptr(unsafe.Pointer(&u)))
}
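For comparison, here's roughly the same four steps as a Python sketch using fcntl.ioctl directly. The two ioctl numbers are hardcoded and are an assumption on my part: they're the values used on x86-64 and ARM Linux, and some other architectures use different numbers. open_pty is a made-up name.

```python
import fcntl
import os
import struct

# Linux ioctl request numbers (assumed: x86-64 / ARM values)
TIOCGPTN = 0x80045430     # get the pty number
TIOCSPTLCK = 0x40045431   # set or clear the pty lock

def open_pty():
    # 1. open /dev/ptmx to get the pseudoterminal master
    master_fd = os.open("/dev/ptmx", os.O_RDWR | os.O_NOCTTY)
    # 2. ask the kernel which /dev/pts/N belongs to this master
    buf = fcntl.ioctl(master_fd, TIOCGPTN, struct.pack("I", 0))
    (number,) = struct.unpack("I", buf)
    slave_name = f"/dev/pts/{number}"
    # 3. unlock it by passing a pointer to an int that is 0
    fcntl.ioctl(master_fd, TIOCSPTLCK, struct.pack("i", 0))
    # 4. open the slave pseudoterminal device
    slave_fd = os.open(slave_name, os.O_RDWR | os.O_NOCTTY)
    return master_fd, slave_fd, slave_name

master_fd, slave_fd, slave_name = open_pty()
print(slave_name)
```

In real Python code you'd probably just call os.openpty() instead, which does all of this for you.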
step 2: hook the pseudoterminal up to bash
The next thing we have to do is connect the pseudoterminal to bash. Luckily,
that’s really easy – here’s the Go code for it! We just need to start a new
process and set the stdin, stdout, and stderr to tty.
cmd := exec.Command("bash")
cmd.Stdin = tty
cmd.Stdout = tty
cmd.Stderr = tty
cmd.SysProcAttr = &syscall.SysProcAttr{
    Setsid: true,
}
cmd.Start()
Easy! Though – why do we need this Setsid: true thing, you might ask? Well,
I tried commenting out that code to see what went wrong. It turns out that what
goes wrong is – Ctrl + C doesn’t work anymore!
Setsid: true creates a new session for the new bash process. But why does
that make Ctrl + C work? How does Linux know which process to send SIGINT
to when you press Ctrl + C, and what does that have to do with sessions?
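Before getting to the answer, here's a sketch of what this same step could look like in Python, assuming Linux with bash installed. To keep it predictable it doesn't start an interactive shell: it runs one bash command that checks whether stdin and stdout are terminals. start_new_session=True is subprocess's way of calling setsid() in the child, which is my rough equivalent of Setsid: true.

```python
import os
import subprocess

master_fd, slave_fd = os.openpty()

proc = subprocess.Popen(
    ["bash", "-c", "test -t 0 && test -t 1 && echo bash sees a terminal"],
    stdin=slave_fd, stdout=slave_fd, stderr=slave_fd,
    start_new_session=True,   # setsid() in the child
)
os.close(slave_fd)            # the parent only needs the master end

# Read everything bash wrote to the terminal. On Linux, reading the master
# fails with an OSError once the last copy of the slave end has been closed.
output = b""
while True:
    try:
        chunk = os.read(master_fd, 1024)
    except OSError:
        break
    if not chunk:
        break
    output += chunk
proc.wait()
print(output)
```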
how does Linux know which process to send Ctrl + C to?
I found this pretty confusing, so I reached for my favourite book for learning about this kind of thing: the linux programming interface, specifically chapter 34 on process groups and sessions.
That chapter contains a few key facts: (#3, #4, and #5 are direct quotes from the book)
1. Every process has a session id and a process group id (which may or may not be the same as its PID)
2. A session is made up of multiple process groups
3. All of the processes in a session share a single controlling terminal.
4. A terminal may be the controlling terminal of at most one session.
5. At any point in time, one of the process groups in a session is the foreground process group for the terminal, and the others are background process groups.
6. When you press Ctrl+C in a terminal, SIGINT gets sent to all the processes in the foreground process group
What’s a process group? Well, my understanding is that:
- processes in the same pipe (x | y | z) are in the same process group
- processes you start on the same shell line (x && y && z) are in the same process group
- child processes are by default in the same process group, unless you explicitly decide otherwise
I didn’t know most of this (I had no idea processes had a session ID!) so this was kind of a lot to absorb. I tried to draw a sketchy ASCII art diagram of the situation
(maybe) terminal --- session --- process group --- process
| |- process
| |- process
|- process group
|
|- process group
So when we press Ctrl+C in a terminal, here’s what I think happens:
- \x03 gets written to the “pseudoterminal master” of a terminal
- Linux finds the session for that terminal (if it exists)
- Linux finds the foreground process group for that session
- Linux sends SIGINT to every process in that process group
If we don’t create a new session for our new bash process, our new pseudoterminal
actually won’t have any session associated with it, so nothing happens when
we press Ctrl+C. But if we do create a new session, then the new
pseudoterminal will have the new session associated with it.
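You can watch this happen with a short Python sketch, assuming Linux. It uses pty.fork() from the standard library, which forks a child into a new session with a fresh pseudoterminal as its controlling terminal. The parent then "presses Ctrl+C" by writing 0x03 to the master end. The "ready" handshake is just something I added so the parent doesn't send the byte before the child is set up.

```python
import os
import pty
import signal
import time

pid, master_fd = pty.fork()
if pid == 0:
    # Child: new session, and the pty is its controlling terminal
    try:
        # Use the default SIGINT behaviour (die) instead of Python's KeyboardInterrupt
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        os.write(1, b"ready\n")
        time.sleep(30)
    finally:
        os._exit(0)

# Parent: wait until the child says it's ready
seen = b""
while b"ready" not in seen:
    seen += os.read(master_fd, 1024)

# "Press Ctrl+C": write 0x03 to the pseudoterminal master
os.write(master_fd, b"\x03")

_, status = os.waitpid(pid, 0)
killed_by = signal.Signals(os.WTERMSIG(status)).name if os.WIFSIGNALED(status) else None
print(killed_by)
```

If the child had not been put in its own session with the pty as its controlling terminal, I'd expect the 0x03 to just sit there as input and the child to keep sleeping.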
how to get a list of all your sessions
As a quick aside, if you want to get a list of all the processes on your Linux machine, sorted by process group ID, you can run:
$ ps -eo user,pid,pgid,sess,cmd | sort -k3
This includes the PID, process group ID, and session ID. As an example of the output, here are the two processes in the pipeline:
bork 58080 58080 57922 ps -eo user,pid,pgid,sess,cmd
bork 58081 58080 57922 sort -k3
You can see that they share the same process group ID and session ID, but of course they have different PIDs.
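You can also look at these IDs from inside a program. Here's a Python sketch, assuming a Unix system: it forks a child, records the child's process group ID and session ID before and after it calls os.setsid(), and sends them back over a pipe. By default the child should have the same IDs as its parent, and after setsid() both should be the child's own PID.

```python
import os

read_fd, write_fd = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: report (pgid, sid) before and after starting a new session
    try:
        before = (os.getpgid(0), os.getsid(0))
        os.setsid()
        after = (os.getpgid(0), os.getsid(0))
        os.write(write_fd, ("%d %d %d %d" % (before + after)).encode())
    finally:
        os._exit(0)

os.close(write_fd)
os.waitpid(pid, 0)
numbers = [int(n) for n in os.read(read_fd, 1024).split()]
child_before = tuple(numbers[:2])
child_after = tuple(numbers[2:])
parent_ids = (os.getpgid(0), os.getsid(0))

print("parent:", parent_ids)
print("child before setsid:", child_before)
print("child after setsid:", child_after, "child pid:", pid)
```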
That was kind of a lot but that’s all we’re going to say about sessions and process groups in this post. Let’s keep going!
step 3: set the window size
We need to tell the terminal how big to be!
Again, I just copied this from creack/pty. I decided to hardcode the size to 80x24.
Setsize(tty, &Winsize{
    Cols: 80,
    Rows: 24,
})
Like with getting the terminal’s pts filename and unlocking it, setting the
size is just one ioctl system call:
func Setsize(t *os.File, ws *Winsize) {
    ioctl(t.Fd(), syscall.TIOCSWINSZ, uintptr(unsafe.Pointer(ws)))
}
Pretty simple! We could do something smarter and get the real window size, but I’m too lazy.
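Here's a sketch of the same ioctl from Python, assuming Linux, plus the matching ioctl to read the size back. set_size and get_size are names I made up. The struct the kernel expects is four unsigned shorts: rows, columns, and two pixel sizes that I'm leaving as 0.

```python
import fcntl
import os
import struct
import termios

master_fd, slave_fd = os.openpty()

def set_size(fd, rows, cols):
    # struct winsize: rows, cols, width in pixels, height in pixels
    fcntl.ioctl(fd, termios.TIOCSWINSZ, struct.pack("HHHH", rows, cols, 0, 0))

def get_size(fd):
    buf = fcntl.ioctl(fd, termios.TIOCGWINSZ, struct.pack("HHHH", 0, 0, 0, 0))
    rows, cols, _, _ = struct.unpack("HHHH", buf)
    return rows, cols

set_size(slave_fd, 24, 80)
size = get_size(slave_fd)
print(size)  # (24, 80)
```

The "something smarter" version would probably be to call get_size (or os.get_terminal_size) on the client's real terminal and somehow send the result to the server, which the protocol here has no way to do.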
step 4: copy information between the TCP connection and the pseudoterminal
As a reminder, our rough steps to set up this remote login server were:
1. create a pseudoterminal for the client to use
2. start a bash shell process
3. connect bash to the pseudoterminal
4. continuously copy information back and forth between the TCP connection and the pseudoterminal
We’ve done 1, 2, and 3, now we just need to ferry information between the TCP connection and the pseudoterminal.
There are two io.Copy calls, one to copy the input from the TCP connection, and one to copy the output to the TCP connection. Here’s what the code looks like:
go func() {
    io.Copy(pty, conn)
}()
io.Copy(conn, pty)
The first one is in a goroutine just so they can both run in parallel.
Pretty simple!
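If you don't know Go, here's roughly what those two io.Copy calls are doing, as a Python sketch with a thread in place of the goroutine. copy_fd and relay are made-up names, and to keep the example self-contained it uses two socketpairs as stand-ins for the TCP connection and the pseudoterminal.

```python
import os
import socket
import threading

def copy_fd(src_fd, dst_fd):
    """Copy bytes from src_fd to dst_fd until EOF or a read error."""
    while True:
        try:
            data = os.read(src_fd, 4096)
        except OSError:
            return
        if not data:
            return
        while data:                       # os.write may write only part of the data
            written = os.write(dst_fd, data)
            data = data[written:]

def relay(conn_fd, pty_fd):
    # One direction runs in a background thread, the other in this thread
    threading.Thread(target=copy_fd, args=(conn_fd, pty_fd), daemon=True).start()
    copy_fd(pty_fd, conn_fd)

# Stand-ins: client <-> conn is the "TCP connection", fake_pty <-> shell_side is the "pty"
client, conn = socket.socketpair()
fake_pty, shell_side = socket.socketpair()
threading.Thread(target=relay, args=(conn.fileno(), fake_pty.fileno()), daemon=True).start()

client.sendall(b"ls\n")                   # the user types a command
to_shell = shell_side.recv(1024)
shell_side.sendall(b"file.txt\n")         # the shell prints some output
to_client = client.recv(1024)
print(to_shell, to_client)
```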
step 5: exit when we’re done
I also added a little bit of code to close the TCP connection when the command exits
go func() {
    cmd.Wait()
    conn.Close()
}()
And that’s it for the server! You can see all of the Go code here: server.go.
next: write a client
Next, we have to write a client. This is a lot easier than the server because we don’t need to do quite as much terminal setup. There are just 3 steps:
- Put the terminal into raw mode
- copy stdin/stdout to the TCP connection
- reset the terminal
client step 1: put the terminal into “raw” mode
We need to put the client terminal into “raw” mode so that every time you press a key, it gets sent to the TCP connection immediately. If we don’t do this, everything will only get sent when you press enter.
“Raw mode” isn’t actually a single thing, it’s a bunch of flags that you want to turn off. There’s a good tutorial explaining all the flags we have to turn off called Entering raw mode.
Like everything else with terminals, this requires ioctl system calls. In
this case we get the terminal’s current settings, modify them, and save the old
settings so that we can restore them later.
I figured out how to do this in Go by going to https://grep.app and typing in
syscall.TCSETS to find some other Go code that was doing the same thing.
func MakeRaw(fd uintptr) syscall.Termios {
    // from https://github.com/getlantern/lantern/blob/devel/archive/src/golang.org/x/crypto/ssh/terminal/util.go
    var oldState syscall.Termios
    ioctl(fd, syscall.TCGETS, uintptr(unsafe.Pointer(&oldState)))
    newState := oldState
    newState.Iflag &^= syscall.ISTRIP | syscall.INLCR | syscall.ICRNL | syscall.IGNCR | syscall.IXON | syscall.IXOFF
    newState.Lflag &^= syscall.ECHO | syscall.ICANON | syscall.ISIG
    ioctl(fd, syscall.TCSETS, uintptr(unsafe.Pointer(&newState)))
    return oldState
}
client step 2: copy stdin/stdout to the TCP connection
This is exactly like what we did with the server. It’s very little code:
go func() {
    io.Copy(conn, os.Stdin)
}()
io.Copy(os.Stdout, conn)
client step 3: restore the terminal’s state
We can put the terminal back into the mode it started in like this (another ioctl!):
func Restore(fd uintptr, oldState syscall.Termios) {
    ioctl(fd, syscall.TCSETS, uintptr(unsafe.Pointer(&oldState)))
}
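Here's a sketch of the same make-raw / restore pair in Python using the termios module, assuming a Unix system. It turns off the same flags as the Go code above. So that it can run without messing with your actual terminal, the example tries it on a fresh pseudoterminal instead of stdin. make_raw and restore are names I picked to mirror the Go functions.

```python
import os
import termios

IFLAG, LFLAG = 0, 3   # positions of the input and local flags in the tcgetattr() list

def make_raw(fd):
    old_state = termios.tcgetattr(fd)
    new_state = termios.tcgetattr(fd)     # a second copy that we modify
    new_state[IFLAG] &= ~(termios.ISTRIP | termios.INLCR | termios.ICRNL |
                          termios.IGNCR | termios.IXON | termios.IXOFF)
    new_state[LFLAG] &= ~(termios.ECHO | termios.ICANON | termios.ISIG)
    termios.tcsetattr(fd, termios.TCSANOW, new_state)
    return old_state

def restore(fd, old_state):
    termios.tcsetattr(fd, termios.TCSANOW, old_state)

master_fd, slave_fd = os.openpty()
old = make_raw(slave_fd)
raw_lflag = termios.tcgetattr(slave_fd)[LFLAG]
restore(slave_fd, old)
restored_lflag = termios.tcgetattr(slave_fd)[LFLAG]

raw_bits = termios.ECHO | termios.ICANON | termios.ISIG
print(raw_lflag & raw_bits == 0)          # all three flags are off in raw mode
print(restored_lflag == old[LFLAG])       # and back to the original afterwards
```

In a real client you'd call make_raw on stdin's file descriptor and put the restore call in a finally block, so the terminal gets reset even if something crashes.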
we did it!
We have written a tiny remote login server that lets anyone log in! Hooray!
Obviously this has zero security so I’m not going to talk about that aspect.
it’s running on the public internet! you can try it out!
For the next week or so I’m going to run a demo of this on the internet at
tetris.jvns.ca. It runs tetris instead of a shell because I wanted to avoid
abuse, but if you want to try it with a shell you can run it on your own
computer :).
If you want to try it out, you can use netcat as a client instead of the
custom Go client program we wrote, because copying information to/from a TCP
connection is what netcat does. Here’s how:
stty raw -echo && nc tetris.jvns.ca 7777 && stty sane
This will let you play a terminal tetris game called tint.
You can also use the client.go program and run go run client.go tetris.jvns.ca 7777.
this is not a good protocol
This protocol where we just copy bytes from the TCP connection to the terminal and nothing else is not good because it doesn’t allow us to send over information like the terminal or the actual window size of the terminal.
I thought about implementing telnet’s protocol so that we could use telnet as a client, but I didn’t feel like figuring out how telnet works so I didn’t. (the server 30% works with telnet as is, but a lot of things are broken, I don’t quite know why, and I didn’t feel like figuring it out)
it’ll mess up your terminal a bit
As a warning: using this server to play tetris will probably mess up your terminal a bit because it sets the window size to 80x24. To fix that I just closed the terminal tab after running that command.
If we wanted to fix this for real, we’d need to restore the window size after we’re done, but then we’d need a slightly more real protocol than “just blindly copy bytes back and forth with TCP” and I didn’t feel like doing that.
Also it sometimes takes a second to disconnect after the program exits for some reason, I’m not sure why that is.
other tiny projects
That’s all! There are a couple of other similar toy implementations of programs I’ve written here: