Reading List

The most recent articles from a list of feeds I subscribe to.

Writing Javascript without a build system

Hello! I’ve been writing some Javascript this week, and as always when I start a new frontend project, I was faced with the question: should I use a build system?

I want to talk about what’s appealing to me about build systems, why I (usually) still don’t use them, and why I find it frustrating that some frontend Javascript libraries require that you use a build system.

I’m writing this because most of the writing I see about JS assumes that you’re using a build system, and it can be hard to navigate for folks like me who write very simple small Javascript projects that don’t require a build system.

what’s a build system?

The idea is that you have a bunch of Javascript or Typescript code, and you want to translate it into different Javascript code before you put it on your website.

Build systems can do lots of useful things, like:

  • combining 100s of JS files into one big bundle (for efficiency reasons)
  • translating Typescript into Javascript
  • typechecking Typescript
  • minification
  • adding polyfills to support older browsers
  • compiling JSX
  • treeshaking (remove unused JS code to reduce file sizes)
  • building CSS (like tailwind does)
  • and probably lots of other important things

Because of this, if you’re building a complex frontend project today, probably you’re using a build system like webpack, rollup, esbuild, parcel, or vite.

Lots of those features are appealing to me, and I’ve used build systems in the past for some of these reasons: Mess With DNS uses esbuild to translate Typescript and combine lots of files into one big file, for example.
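For example, bundling with esbuild can be a single command, something like this (the filenames here are made up for illustration):

$ npx esbuild src/script.ts --bundle --minify --outfile=build/bundle.js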

the goal: easily make changes to old tiny websites

I make a lot of small simple websites, I have approximately 0 maintenance energy for any of them, and I change them very infrequently.

My goal is that if I have a site that I made 3 or 5 years ago, I’d like to be able to, in 20 minutes:

  • get the source from github on a new computer
  • make some changes
  • put it on the internet

But my experience with build systems (not just Javascript build systems!) is that if you have a 5-year-old site, often it’s a huge pain to get the site built again.

And because most of my websites are pretty small, the advantage of using a build system is pretty small – I don’t really need Typescript or JSX. I can just have one 400-line script.js file and call it a day.

example: trying to build the SQL playground

One of my sites (the sql playground) uses a build system (it’s using Vue). I last edited that project 2 years ago, on a different machine.

Let’s see if I can still easily build it today on my machine. To start out, we have to run npm install. Here’s the output I get.

$ npm install
[lots of output redacted]
npm ERR! code 1
npm ERR! path /Users/bork/work/sql-playground.wizardzines.com/node_modules/grpc
npm ERR! command failed
npm ERR! command sh /var/folders/3z/g3qrs9s96mg6r4dmzryjn3mm0000gn/T/install-b52c96ad.sh
npm ERR! CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/surface/init.o
npm ERR!   CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/avl/avl.o
npm ERR!   CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/backoff/backoff.o
npm ERR!   CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/channel/channel_args.o
npm ERR!   CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/channel/channel_stack.o
npm ERR!   CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/channel/channel_stack_builder.o
npm ERR!   CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/channel/channel_trace.o
npm ERR!   CXX(target) Release/obj.target/grpc/deps/grpc/src/core/lib/channel/channelz.o

There’s some kind of error building grpc. No problem. I don’t really need that dependency anyway, so I can just take 5 minutes to tear it out and rebuild. Now I can npm install and everything works.

Now let’s try to build the project:

$ npm run build
  Building for production...Error: error:0308010C:digital envelope routines::unsupported
    at new Hash (node:internal/crypto/hash:71:19)
    at Object.createHash (node:crypto:130:10)
    at module.exports (/Users/bork/work/sql-playground.wizardzines.com/node_modules/webpack/lib/util/createHash.js:135:53)
    at NormalModule._initBuildHash (/Users/bork/work/sql-playground.wizardzines.com/node_modules/webpack/lib/NormalModule.js:414:16)
    at handleParseError (/Users/bork/work/sql-playground.wizardzines.com/node_modules/webpack/lib/NormalModule.js:467:10)
    at /Users/bork/work/sql-playground.wizardzines.com/node_modules/webpack/lib/NormalModule.js:499:5
    at /Users/bork/work/sql-playground.wizardzines.com/node_modules/webpack/lib/NormalModule.js:356:12
    at /Users/bork/work/sql-playground.wizardzines.com/node_modules/loader-runner/lib/LoaderRunner.js:373:3
    at iterateNormalLoaders (/Users/bork/work/sql-playground.wizardzines.com/node_modules/loader-runner/lib/LoaderRunner.js:214:10)
    at iterateNormalLoaders (/Users/bork/work/sql-playground.wizardzines.com/node_modules/loader-runner/lib/LoaderRunner.js:221:10)
    at /Users/bork/work/sql-playground.wizardzines.com/node_modules/loader-runner/lib/LoaderRunner.js:236:3
    at runSyncOrAsync (/Users/bork/work/sql-playground.wizardzines.com/node_modules/loader-runner/lib/LoaderRunner.js:130:11)
    at iterateNormalLoaders (/Users/bork/work/sql-playground.wizardzines.com/node_modules/loader-runner/lib/LoaderRunner.js:232:2)
    at Array.<anonymous> (/Users/bork/work/sql-playground.wizardzines.com/node_modules/loader-runner/lib/LoaderRunner.js:205:4)
    at Storage.finished (/Users/bork/work/sql-playground.wizardzines.com/node_modules/enhanced-resolve/lib/CachedInputFileSystem.js:43:16)
    at /Users/bork/work/sql-playground.wizardzines.com/node_modules/enhanced-resolve/lib/CachedInputFileSystem.js:79:9

This stack overflow answer suggests running export NODE_OPTIONS=--openssl-legacy-provider to fix this error.

That works, and finally I can npm run build to build the project.

This isn’t really that bad (I only had to remove a dependency and pass a slightly mysterious node option!), but I would rather not be derailed by those build errors.

for me, a build system isn’t worth it for small projects

For me, a complicated Javascript build system just doesn’t seem worth it for small 500-line projects – it means giving up being able to easily update the project in the future in exchange for some pretty marginal benefits.

esbuild seems a little more stable

I want to give a quick shoutout to esbuild: I learned about esbuild in 2021 and used it for a project, and so far it does seem like a more reliable way to build JS projects.

I just tried to build an esbuild project that I last touched 8 months ago on a new computer, and it worked. But I can’t say for sure if I’ll be able to easily build that project in 2 years. Maybe it will still work – I hope so!

not using a build system is usually pretty easy

Here’s what the part of the nginx playground code that imports all the libraries looks like:

<script src="js/vue.global.prod.js"></script>
<script src="codemirror-5.63.0/lib/codemirror.js"></script>
<script src="codemirror-5.63.0/mode/nginx/nginx.js"></script>
<script src="codemirror-5.63.0/mode/shell/shell.js"></script>
<script src="codemirror-5.63.0/mode/javascript/javascript.js"></script>
<link rel="stylesheet" href="codemirror-5.63.0/lib/codemirror.css">
<script src="script.js "></script>

This project is also using Vue, but it just uses a <script src to load Vue – there’s no build process for the frontend.

a no-build-system template for using Vue

A couple of people asked how to get started writing Javascript without a build system. Of course you can write vanilla JS if you want, but my usual framework is Vue 3.

Here’s a tiny template I built for starting a Vue 3 project with no build system. It’s just 2 files and ~30 lines of HTML/JS.
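As a rough sketch (this isn’t the exact template, just a minimal illustration of the idea, using a self-hosted copy of Vue like the nginx playground does):

<div id="app">{{ message }}</div>
<script src="js/vue.global.prod.js"></script>
<script>
  // the "global build" of Vue exposes a Vue object, no imports needed
  const { createApp } = Vue;
  createApp({
    data() {
      return { message: "hello!" };
    },
  }).mount("#app");
</script>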

some libraries require you to use a build system

This build system stuff is on my mind recently because I’m using CodeMirror 5 for a new project this week, and I saw there was a new version, CodeMirror 6.

So I thought – cool, maybe I should use CodeMirror 6 instead of CodeMirror 5. But – it seems like you can’t use CodeMirror 6 without a build system (according to the migration guide). So I’m going to stick with CodeMirror 5.

Similarly, you used to be able to just download Tailwind as a giant CSS file, but Tailwind 3 doesn’t seem to be available as a big CSS file at all anymore, you need to run Javascript to build it. So I’m going to keep using Tailwind 2 for now. (I know, I know, you’re not supposed to use the big CSS file, but it’s only 300KB gzipped and I really don’t want a build step)

(edit: it looks like Tailwind released a standalone CLI in 2021 which seems like a nice option)

I’m not totally sure why some libraries don’t provide a no-build-system version – maybe distributing a no-build-system version would add a lot of additional complexity to the library, and the maintainer doesn’t think it’s worth it. Or maybe the library’s design means that it’s not possible to distribute a no-build-system version for some reason.

I’d love more tips for no-build-system javascript

My main strategies so far are:

  • search for “CDN” on a library’s website to find a standalone javascript file
  • use https://unpkg.com to see if the library has a built version I can use
  • host my own version of libraries instead of relying on a CDN that might go down
  • write my own simple integrations instead of pulling in another dependency (for example I wrote my own CodeMirror component for Vue the other day)
  • if I want a build system, use esbuild

A couple of other things that look interesting but that I haven’t looked into:

Print copies of The Pocket Guide to Debugging have arrived

Hello! We released The Pocket Guide to Debugging back in December, and here’s a final update: the print copies are done printing and they’ve arrived at the warehouse, ready to ship to anyone who wants one.

You can buy the print or PDF version now, and if you preordered it, your copy should already have shipped. Some people have told me that they already received theirs! Email me if you haven’t gotten the shipping confirmation.

some pictures

Here are some photos of what the print version looks like:

what was involved in printing it

In case anyone is interested, here’s what was involved in putting together the print version:

  1. Make a PDF copy that people can print on their home printer (with a 360-line Python program)
  2. Test on my home printer that the “print at home version” prints properly
  3. Release the “print at home” version (this was back in December)
  4. Take a couple of weeks off, since it’s the end of the year
  5. Ask the illustrator to make a back cover
  6. Get a quote from the print company
  7. Agonize a bit over whether to print the zine as perfect bound or saddle stitched (stapled). Pick perfect bound.
  8. Find out from the print company how wide the spine has to be
  9. With the help of the illustrator, make a design for the spine.
  10. Get an ISBN number (just a couple of clicks at Library and Archives Canada)
  11. Get a bar code for the ISBN (from bookow), edit it to make it a little smaller, and put it on the back cover
  12. Send the new PDF to the print company and request a print proof
  13. Wait a week or so for the proof to get shipped across the continent
  14. Once the proof arrives, realize that the inner margins are too small, because it was perfect bound and perfect bound books need bigger margins (We’d already tried to account for that, but we didn’t make them big enough)
  15. Measure various books I have around the house and print some new sample pages to figure out the right margins
  16. Painstakingly manually readjust every single page to have slightly different proportions, so that I can increase the margins
  17. Edit the Python script to make a new PDF with the bigger margins
  18. Send the final files to the print company
  19. Wait a week for them to print 1500 copies
  20. The print copies arrive at the warehouse!
  21. Wait another 3 business days for the (amazing) folks who do the shipping to send out all 700 or so preorders
  22. Success!

Printing 1500 copies of something is always a little scary, but I’m really happy with how it turned out.

thanks so much to everyone who preordered!

If you preordered the print version, thanks so much for your patience – having the preorders really helps me decide how many to print.

And please let me know if something went wrong – 1 or 2 packages always get lost in the mail and while I can’t help find them, it’s very easy for me to just ship you another one :)

Why does 0.1 + 0.2 = 0.30000000000000004?

Hello! I was trying to write about floating point yesterday, and I found myself wondering about this calculation, with 64-bit floats:

>>> 0.1 + 0.2
0.30000000000000004

I realized that I didn’t understand exactly how it worked. I mean, I know floating point calculations are inexact, and I know that you can’t exactly represent 0.1 in binary, but: there’s a floating point number that’s closer to 0.3 than 0.30000000000000004! So why do we get the answer 0.30000000000000004?

If you don’t feel like reading this whole post with a bunch of calculations, the short answer is that 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 lies exactly between 2 floating point numbers, 0.299999999999999988897769753748434595763683319091796875 (usually printed as 0.3) and 0.3000000000000000444089209850062616169452667236328125 (usually printed as 0.30000000000000004). The answer is 0.30000000000000004 (the second one) because its significand is even.

how floating point addition works

This is roughly how floating point addition works:

  1. Add together the numbers (with extra precision)
  2. Round the result to the nearest floating point number

So let’s use these rules to calculate 0.1 + 0.2. I just learned how floating point addition works yesterday so it’s possible I’ve made some mistakes in this post, but I did get the answers I expected at the end.

step 1: find out what 0.1 and 0.2 are

First, let’s use Python to figure out what the exact values of 0.1 and 0.2 are, as 64-bit floats.

>>> f"{0.1:.80f}"
'0.10000000000000000555111512312578270211815834045410156250000000000000000000000000'
>>> f"{0.2:.80f}"
'0.20000000000000001110223024625156540423631668090820312500000000000000000000000000'

These really are the exact values: because floating point numbers are in base 2, you can represent them all exactly in base 10. You just need a lot of digits sometimes :)
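A quick aside: Python’s decimal module is another way to see the exact value, since converting a float to a Decimal is exact:

>>> from decimal import Decimal
>>> Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')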

step 2: add the numbers together

Next, let’s add those numbers together. We can add the fractional parts together as integers to get the exact answer:

>>> 1000000000000000055511151231257827021181583404541015625 + 2000000000000000111022302462515654042363166809082031250
3000000000000000166533453693773481063544750213623046875

So the exact sum of those two floating point numbers is 0.3000000000000000166533453693773481063544750213623046875

This isn’t our final answer though because 0.3000000000000000166533453693773481063544750213623046875 isn’t a 64-bit float.
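If you want to double-check that sum, Python’s fractions module does exact rational arithmetic on floats:

>>> from fractions import Fraction
>>> Fraction(0.1) + Fraction(0.2) == Fraction(0.3)
False
>>> float(Fraction(0.1) + Fraction(0.2))
0.30000000000000004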

step 3: look at the nearest floating point numbers

Now, let’s look at the floating point numbers around 0.3. Here’s the closest floating point number to 0.3 (usually written as just 0.3, even though that isn’t its exact value):

>>> f"{0.3:.80f}"
'0.29999999999999998889776975374843459576368331909179687500000000000000000000000000'

We can figure out the next floating point number after 0.3 by serializing 0.3 to 8 bytes with struct.pack, adding 1, and then using struct.unpack:

>>> struct.pack("!d", 0.3)
b'?\xd3333333'
# manually add 1 to the last byte
>>> next_float = struct.unpack("!d", b'?\xd3333334')[0]
>>> next_float
0.30000000000000004
>>> f"{next_float:.80f}"
'0.30000000000000004440892098500626161694526672363281250000000000000000000000000000'

Apparently you can also do this with math.nextafter:

>>> import math
>>> math.nextafter(0.3, math.inf)
0.30000000000000004

So the two 64-bit floats around 0.3 are 0.299999999999999988897769753748434595763683319091796875 and 0.3000000000000000444089209850062616169452667236328125

step 4: find out which one is closest to our result

It turns out that 0.3000000000000000166533453693773481063544750213623046875 is exactly in the middle of 0.299999999999999988897769753748434595763683319091796875 and 0.3000000000000000444089209850062616169452667236328125.

You can see that with this calculation:

>>> (3000000000000000444089209850062616169452667236328125000 + 2999999999999999888977697537484345957636833190917968750) // 2 == 3000000000000000166533453693773481063544750213623046875
True

So neither of them is closest.

how does it know which one to round to?

In the binary representation of a floating point number, there’s a number called the “significand”. In cases like this (where the result is exactly in between 2 successive floating point numbers), it’ll round to the one with the even significand.

In this case that’s 0.3000000000000000444089209850062616169452667236328125

We actually saw the significand of this number a bit earlier:

  • 0.30000000000000004 is struct.unpack('!d', b'?\xd3333334')
  • 0.3 is struct.unpack('!d', b'?\xd3333333')

The last digit of the big endian hex representation of 0.30000000000000004 is 4, so that’s the one with the even significand (because the significand is at the end).
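You can also check the parity directly: the significand is the low 52 bits, so the last bit of each float’s binary representation tells you which one is even:

>>> int.from_bytes(struct.pack('!d', 0.3), 'big') & 1
1
>>> int.from_bytes(struct.pack('!d', 0.30000000000000004), 'big') & 1
0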

let’s also work out the whole calculation in binary

Above we did the calculation in decimal, because that’s a little more intuitive to read. But of course computers don’t do these calculations in decimal – they’re done in a base 2 representation. So I wanted to get an idea of how that worked too.

I don’t think this binary calculation part of the post is particularly clear, but it was helpful for me to write out. There are really a lot of numbers and it might be terrible to read.

how 64-bit floats numbers work: exponent and significand

64-bit floating point numbers are represented with 2 integers (an exponent and a significand), plus a 1-bit sign.

Here’s the equation for how the exponent and significand correspond to an actual number:

$$\text{sign} \times 2^\text{exponent} (1 + \frac{\text{significand}}{2^{52}})$$

For example, if the exponent was 1, the significand was 2**51, and the sign was positive, we’d get

$$2^{1} (1 + \frac{2^{51}}{2^{52}})$$

which is equal to 2 * (1 + 0.5), or 3.

step 1: get the exponent and significand for 0.1 and 0.2

I wrote some inefficient functions to get the exponent and significand of a positive float in Python:

import struct

def get_exponent(f):
    # the exponent is in the first 12 bits (this includes the sign bit,
    # but the sign bit is 0 for positive floats)
    bytestring = struct.pack('!d', f)
    return int.from_bytes(bytestring, byteorder='big') >> 52

def get_significand(f):
    # the significand is in the last 52 bits
    bytestring = struct.pack('!d', f)
    x = int.from_bytes(bytestring, byteorder='big')
    exponent = get_exponent(f)
    return x ^ (exponent << 52)

I’m ignoring the sign bit (the first bit) because we only need these functions to work on two numbers (0.1 and 0.2) and those two numbers are both positive.

First, let’s get the exponent and significand of 0.1. We need to subtract 1023 to get the actual exponent because that’s how floating point works.

>>> get_exponent(0.1) - 1023
-4
>>> get_significand(0.1)
2702159776422298

The way these numbers work together to get 0.1 is 2**exponent + significand / 2**(52 - exponent).

Here’s that calculation in Python:

>>> 2**-4 + 2702159776422298 / 2**(52 + 4)
0.1

(you might legitimately be worried about floating point accuracy issues with this calculation, but in this case I’m pretty sure it’s fine because these numbers by definition don’t have accuracy issues – the floating point numbers starting at 2**-4 go up in steps of 1/2**(52 + 4))

We can do the same thing for 0.2:

>>> get_exponent(0.2) - 1023
-3
>>> get_significand(0.2)
2702159776422298

And here’s how that exponent and significand work together to get 0.2:

>>> 2**-3 + 2702159776422298 / 2**(52 + 3)
0.2

(by the way, it’s not a coincidence that 0.1 and 0.2 have the same significand – it’s because x and 2*x always have the same significand)

step 2: rewrite 0.1 to have a bigger exponent

0.2 has a bigger exponent than 0.1 – -3 instead of -4.

So we need to rewrite

2**-4 + 2702159776422298 / 2**(52 + 4)

to be X / 2**(52 + 3)

If we solve for X in 2**-4 + 2702159776422298 / 2**(52 + 4) = X / 2**(52 + 3), we get:

X = 2**51 + 2702159776422298 / 2

We can calculate that in Python pretty easily:

>>> 2**51 + 2702159776422298 // 2
3602879701896397

step 3: add the significands

Now we’re trying to do this addition

2**-3 + 2702159776422298 / 2**(52 + 3) + 3602879701896397 / 2**(52 + 3)

So we need to add together 2702159776422298 and 3602879701896397:

>>> 2702159776422298  + 3602879701896397
6305039478318695

Cool. But 6305039478318695 is more than 2**52 - 1 (the maximum value for a significand), so we have a problem:

>>> 6305039478318695 > 2**52
True

step 4: increase the exponent

Right now our answer is

2**-3 + 6305039478318695 / 2**(52 + 3)

First, let’s subtract 2**52 to get

2**-2 + 1801439850948199 / 2**(52 + 3)

This is almost perfect, but the 2**(52 + 3) at the end there needs to be a 2**(52 + 2).

So we need to divide 1801439850948199 by 2. This is where we run into inaccuracies – 1801439850948199 is odd!

>>> 1801439850948199  / 2
900719925474099.5

It’s exactly in between two integers, so we round to the nearest even number (which is what the floating point specification says to do), so our final floating point number result is:

>>> 2**-2 + 900719925474100 / 2**(52 + 2)
0.30000000000000004

That’s the answer we expected:

>>> 0.1 + 0.2
0.30000000000000004

this probably isn’t exactly how it works in hardware

The way I’ve described the operations here isn’t literally exactly what happens when you do floating point addition (it’s not “solving for X”, for example) – I’m sure there are a lot of efficient tricks. But I think it’s about the same idea.

printing out floating point numbers is pretty weird

We said earlier that the floating point number that prints as 0.3 isn’t actually equal to 0.3. It’s actually this number:

>>> f"{0.3:.80f}"
'0.29999999999999998889776975374843459576368331909179687500000000000000000000000000'

So when you print out that number, why does it display 0.3?

The computer isn’t actually printing out the exact value of the number, instead it’s printing out the shortest decimal number d which has the property that our floating point number f is the closest floating point number to d.
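You can see the round-trip property at work in Python: 0.3 is the shortest decimal that parses back to that exact float, so that’s what gets printed:

>>> float('0.3') == 0.3
True
>>> repr(0.3)
'0.3'
>>> repr(0.1 + 0.2)
'0.30000000000000004'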

It turns out that doing this efficiently isn’t trivial at all, and there are a bunch of academic papers about it, like Printing Floating-Point Numbers Quickly and Accurately or How to Print Floating-Point Numbers Accurately.

would it be more intuitive if computers printed out the exact value of a float?

Rounding to a nice clean decimal value is nice, but in a way I feel like it might be more intuitive if computers just printed out the exact value of a floating point number – it might make it seem a lot less surprising when you get weird results.

To me, 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125 feels less surprising than 0.1 + 0.2 = 0.30000000000000004.

Probably this is a bad idea, it would definitely use a lot of screen space.

a quick note on PHP

Someone in the comments somewhere pointed out that <?php echo (0.1 + 0.2 );?> prints out 0.3. Does that mean that floating point math is different in PHP?

I think the answer is no – if I run:

<?php echo ((0.1 + 0.2) - 0.3);?> on this page, I get the exact same answer as in Python: 5.5511151231258E-17. So it seems like the underlying floating point math is the same.

I think the reason that 0.1 + 0.2 prints out 0.3 in PHP is that PHP’s algorithm for displaying floating point numbers is less precise than Python’s – it’ll display 0.3 even if that number isn’t the closest floating point number to 0.3.

that’s all!

I kind of doubt that anyone had the patience to follow all of that arithmetic, but it was helpful for me to write down, so I’m publishing this post anyway. Hopefully some of this makes sense.

Examples of problems with integers

Hello! A few days back we talked about problems with floating point numbers.

This got me thinking – but what about integers? Of course integers have all kinds of problems too – anytime you represent a number in a small fixed amount of space (like 8/16/32/64 bits), you’re going to run into problems.

So I asked on Mastodon again for examples of integer problems and got all kinds of great responses again. Here’s a table of contents.

example 1: the small database primary key
example 2: integer overflow/underflow
aside: how do computers represent negative integers?
example 3: decoding a binary format in Java
example 4: misinterpreting an IP address or string as an integer
example 5: security problems because of integer overflow
example 6: the case of the mystery byte order
example 7: modulo of negative numbers
example 8: compilers removing integer overflow checks
example 9: the && typo

Like last time, I’ve written some example programs to demonstrate these problems. I’ve tried to use a variety of languages in the examples (Go, Javascript, Java, and C) to show that these problems don’t just show up in super low level C programs – integers are everywhere!

Also I’ve probably made some mistakes in here, I learned several things while writing this.

example 1: the small database primary key

One of the most classic (and most painful!) integer problems is:

  1. You create a database table where the primary key is a 32-bit unsigned integer, thinking “4 billion rows should be enough for anyone!”
  2. You are massively successful and eventually, your table gets close to 4 billion rows
  3. oh no!
  4. You need to do a database migration to switch your primary key to be a 64-bit integer instead

If the primary key actually reaches its maximum value I’m not sure exactly what happens, I’d imagine you wouldn’t be able to create any new database rows and it would be a very bad day for your massively successful service.

example 2: integer overflow/underflow

Here’s a Go program:

package main

import "fmt"

func main() {
	var x uint32 = 5
	var length uint32 = 0
	if x < length-1 {
		fmt.Printf("%d is less than %d\n", x, length-1)
	}
}

This slightly mysteriously prints out:

5 is less than 4294967295

That’s true, but it’s not what you might have expected.

what’s going on?

In unsigned 32-bit arithmetic, 0 - 1 wraps around: the result is the 4 bytes 0xFFFFFFFF.

There are 2 ways to interpret those 4 bytes:

  1. As a signed integer (-1)
  2. As an unsigned integer (4294967295)

Go here is treating length - 1 as an unsigned integer, because we defined x and length as uint32s (the “u” is for “unsigned”). So it’s testing if 5 is less than 4294967295, which it is!
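If you want to play with this wraparound without writing Go, you can simulate unsigned 32-bit arithmetic in Python by reducing mod 2**32:

>>> (0 - 1) % 2**32
4294967295
>>> 5 < (0 - 1) % 2**32
True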

what do we do about it?

I’m not actually sure if there’s any way to automatically detect integer overflow errors in Go. (though it looks like there’s a github issue from 2019 with some discussion)

Some brief notes about other languages:

  • Lots of languages (Python, Java, Ruby) don’t have unsigned integers at all, so this specific problem doesn’t come up
  • In C, you can compile with clang -fsanitize=unsigned-integer-overflow. Then if your code has an overflow/underflow like this, the program will crash.
  • Similarly in Rust, if you compile your program in debug mode it’ll crash if there’s an integer overflow. But in release mode it won’t crash, it’ll just happily decide that 0 - 1 = 4294967295.

The reason Rust doesn’t check for overflows if you compile your program in release mode (and the reason C and Go don’t check) is that – these checks are expensive! Integer arithmetic is a very big part of many computations, and making sure that every single addition isn’t overflowing makes it slower.

aside: how do computers represent negative integers?

I mentioned in the last section that 0xFFFFFFFF can mean either -1 or 4294967295. You might be thinking – what??? Why would 0xFFFFFFFF mean -1?

So let’s talk about how computers represent negative integers for a second.

I’m going to simplify and talk about 8-bit integers instead of 32-bit integers, because there are fewer of them and it works basically the same way.

You can represent 256 different numbers with an 8-bit integer: 0 to 255

00000000 -> 0
00000001 -> 1
00000010 -> 2
...
11111111 -> 255

But what if you want to represent negative integers? We still only have 8 bits! So we need to reassign some of these and treat them as negative numbers instead.

Here’s the way most modern computers do it:

  1. Every number that’s 128 or more becomes a negative number instead
  2. How to know which negative number it is: take the positive integer you’d expect it to be, and then subtract 256

So 255 becomes -1, 128 becomes -128, and 200 becomes -56.

Here are some maps of bits to numbers:

00000000 -> 0
00000001 -> 1
00000010 -> 2
01111111 -> 127
10000000 -> -128 (previously 128)
10000001 -> -127 (previously 129)
10000010 -> -126 (previously 130)
...
11111111 -> -1 (previously 255)

This gives us 256 numbers, from -128 to 127.

And 11111111 (or 0xFF, or 255) is -1.

For 32 bit integers, it’s the same story, except it’s “every number that’s 2^31 or more becomes negative” and “subtract 2^32”. And similarly for other integer sizes.

That’s how we end up with 0xFFFFFFFF meaning -1.
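Here’s a little sketch of that reinterpretation in Python – struct lets you pack bytes as an unsigned integer and unpack the same bytes as a signed one:

>>> import struct
>>> struct.unpack('!b', struct.pack('!B', 200))[0]   # 8-bit: 200 becomes -56
-56
>>> struct.unpack('!i', struct.pack('!I', 4294967295))[0]  # 32-bit: 0xFFFFFFFF becomes -1
-1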

there are multiple ways to represent negative integers

The way we just talked about of representing negative integers (“it’s the equivalent positive integer, but you subtract 2^n”) is called two’s complement, and it’s the most common on modern computers. There are several other ways though, the wikipedia article has a list.

weird thing: the absolute value of -128 is negative

This Go program has a pretty simple abs() function that computes the absolute value of an integer:

package main

import (
	"fmt"
)

func abs(x int8) int8 {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	fmt.Println(abs(-127))
	fmt.Println(abs(-128))
}

This prints out:

127
-128

This is because the signed 8-bit integers go from -128 to 127 – there is no +128! Some programs might crash when you try to do this (it’s an overflow), but Go doesn’t.

Now that we’ve talked about signed integers a bunch, let’s dig into another example of how they can cause problems.

example 3: decoding a binary format in Java

Let’s say you’re parsing a binary format in Java, and you want to get the first 4 bits of the byte 0x90. The correct answer is 9.

public class Main {
    public static void main(String[] args) {
        byte b = (byte) 0x90;
        System.out.println(b >> 4);
    }
}

This prints out “-7”. That’s not right!

what’s going on?

There are two things we need to know about Java to make sense of this:

  1. Java doesn’t have unsigned integers.
  2. Java can’t right shift bytes, it can only shift integers. So anytime you shift a byte, it has to be promoted into an integer.

Let’s break down what those two facts mean for our little calculation b >> 4:

  • In bits, 0x90 is 10010000. This starts with a 1, which means that it’s more than 128, which means it’s a negative number
  • Java sees the >> and decides to promote 0x90 to an integer, so that it can shift it
  • The way you convert a negative byte to a 32-bit integer is to add a bunch of 1s at the beginning (“sign extension”). So now our 32-bit integer is 0xFFFFFF90 (F being 15, or 1111)
  • Now we right shift (b >> 4). By default, Java does a signed shift, which means that it adds 0s to the beginning if it’s positive, and 1s to the beginning if it’s negative. (>>> is an unsigned shift in Java)
  • We end up with 0xFFFFFFF9 (having cut off the last 4 bits and added more 1s at the beginning)
  • As a signed integer, that’s -7!

what can you do about it?

I don’t know what the actual idiomatic way to do this in Java is, but the way I’d naively approach fixing this is to put in a bit mask before doing the right shift. So instead of:

b >> 4

we’d write

(b & 0xFF) >> 4

b & 0xFF seems redundant (b is already a byte!), but it’s actually not because b is being promoted to an integer.

Now instead of 0x90 -> 0xFFFFFF90 -> 0xFFFFFFF9, we end up calculating 0x90 -> 0xFFFFFF90 -> 0x00000090 -> 0x00000009, which is the result we wanted: 9.

And when we actually try it, it prints out “9”.

Also, if we were using a language with unsigned integers, the natural way to deal with this would be to treat the value as an unsigned integer in the first place. But that’s not possible in Java.

example 4: misinterpreting an IP address or string as an integer

I don’t know if this is technically a “problem with integers” but it’s funny so I’ll mention it: Rachel by the bay has a bunch of great examples of things that are not integers being interpreted as integers. For example, “HTTP” is 0x48545450 and 2130706433 is 127.0.0.1.

She points out that you can actually ping any integer, and it’ll convert that integer into an IP address, for example:

$ ping 2130706433
PING 2130706433 (127.0.0.1): 56 data bytes
$ ping 132848123841239999988888888888234234234234234234
PING 132848123841239999988888888888234234234234234234 (251.164.101.122): 56 data bytes

(I’m not actually sure how ping is parsing that second integer or why ping accepts these giant larger-than-2^64-integers as valid inputs, but it’s a fun weird thing)

example 5: security problems because of integer overflow

Another integer overflow example: here’s a search for CVEs involving integer overflows. There are a lot! I’m not a security person, but here’s one random example: this json parsing library bug

My understanding of that json parsing bug is roughly:

  • you load a JSON file that’s 3GB or so (about 3,000,000,000 bytes)
  • due to an integer overflow, the code allocates close to 0 bytes of memory instead of ~3GB
  • but the JSON file is still 3GB, so it gets copied into the tiny buffer with almost 0 bytes of memory
  • this overwrites all kinds of other memory that it’s not supposed to

The CVE says “This vulnerability mostly impacts process availability”, which I think means “the program crashes”, but sometimes this kind of thing is much worse and can result in arbitrary code execution.

My impression is that there are a large variety of different flavours of security vulnerabilities caused by integer overflows.

example 6: the case of the mystery byte order

One person said that they do scientific computing and sometimes they need to read files which contain data with an unknown byte order.

Let’s invent a small example of this: say you’re reading a file which contains 4 bytes - 00, 00, 12, and 81 (in that order), that you happen to know represent a 4-byte integer. There are 2 ways to interpret that integer:

  1. 0x00001281 (which translates to 4737). This order is called “big endian”
  2. 0x81120000 (which translates to 2165440512). This order is called “little endian”.

Which one is it? Well, maybe the file contains some metadata that specifies the endianness. Or maybe you happen to know what machine it was generated on and what byte order that machine uses. Or maybe you just read a bunch of values, try both orders, and figure out which makes more sense. Maybe 2165440512 is too big to make sense in the context of whatever your data is supposed to mean, or maybe 4737 is too small.
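In Python, the struct module makes it easy to try both interpretations of those 4 bytes:

>>> import struct
>>> data = bytes([0x00, 0x00, 0x12, 0x81])
>>> struct.unpack('>I', data)[0]  # big endian
4737
>>> struct.unpack('<I', data)[0]  # little endian
2165440512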

A couple more notes on this:

  • this isn’t just a problem with integers, floating point numbers have byte order too
  • this also comes up when reading data from a network, but in that case the byte order isn’t a “mystery”, it’s just going to be big endian. But x86 machines (and many others) are little endian, so you have to swap the byte order of all your numbers.

example 7: modulo of negative numbers

This is more of a design decision about how different programming languages define their math operations, but it’s still a little weird and lots of people mentioned it.

Let’s say you write -13 % 3 in your program, or 13 % -3. What’s the result?

It turns out that different programming languages do it differently, for example in Python -13 % 3 = 2 but in Javascript -13 % 3 = -1.

There’s a table in this blog post that describes a bunch of different programming languages’ choices.
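You can actually see both conventions inside Python: the % operator gives a result with the sign of the divisor, while math.fmod truncates towards zero the way C’s (and Javascript’s) % does:

>>> -13 % 3
2
>>> import math
>>> math.fmod(-13, 3)
-1.0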

example 8: compilers removing integer overflow checks

We’ve been hearing a lot about integer overflow and why it’s bad. So let’s imagine you try to be safe and include some checks in your programs – after each addition, you make sure that the calculation didn’t overflow. Like this:

#include <stdio.h>

#define INT_MAX 2147483647

int check_overflow(int n) {
    if (n + 100 < 0)
        return -1;
    return 0;
}

int main() {
    int result = check_overflow(INT_MAX);
    printf("%d\n", result);
}

check_overflow here should return -1 (failure), because INT_MAX + 100 is more than the maximum integer size.

$ gcc  check_overflow.c  -o check_overflow && ./check_overflow
-1 
$ gcc -O3 check_overflow.c  -o check_overflow && ./check_overflow
0

That’s weird – when we compile with gcc, we get the answer we expected, but with gcc -O3, we get a different answer. Why?

what’s going on?

My understanding (which might be wrong) is:

  1. Signed integer overflow in C is undefined behavior. I think that’s because different C implementations might be using different representations of signed integers (maybe they’re using one’s complement instead of two’s complement or something)
  2. “undefined behaviour” in C means “the compiler is free to do literally whatever it wants after that point” (see the post With undefined behaviour, anything is possible by Raph Levien for a lot more)
  3. Some compiler optimizations assume that undefined behaviour will never happen. They’re free to do this, because – if that undefined behaviour did happen, then they’re allowed to do whatever they want, so “run the code that I optimized assuming that this would never happen” is fine.
  4. So this if (n + 100 < 0) check is irrelevant – if that did happen, it would be undefined behaviour, so there’s no need to execute the contents of that if statement.

So, that’s weird. I’m not going to write a “what can you do about it?” section here because I’m pretty out of my depth already.

I certainly would not have expected that though.

My impression is that “undefined behaviour” is really a C/C++ concept, and doesn’t exist in other languages in the same way except in the case of “your program called some C code in an incorrect way and that C code did something weird because of undefined behaviour”. Which of course happens all the time.

example 9: the && typo

This one was mentioned as a very upsetting bug. Let’s say you have two integers and you want to check that they’re both nonzero.

In Javascript, you might write:

if (a && b) {
    /* some code */
}

But you could also make a typo and type:

if (a & b) {
    /* some code */
}

This is still perfectly valid code, but it means something completely different – it’s a bitwise and instead of a boolean and. Let’s go into a Javascript console and look at bitwise vs boolean and for 9 and 4:

> 9 && 4
4
> 9 & 4
0
> 4 && 5
5
> 4 & 5
4

It’s easy to imagine this turning into a REALLY annoying bug since it would be intermittent – often x & y does turn out to be truthy if x && y is truthy.

what to do about it?

For Javascript, ESLint has a no-bitwise check, which requires you to manually flag “no, I actually know what I’m doing, I want to do bitwise and” if you use a bitwise and in your code. I’m sure many other linters have a similar check.

that’s all for now!

There are definitely more problems with integers than this, but this got pretty long again and I’m tired of writing again so I’m going to stop :)

Examples of floating point problems

Hello! I’ve been thinking about writing a zine about how things are represented on computers in bytes, so I was thinking about floating point.

I’ve heard a million times about the dangers of floating point arithmetic, like:

  • addition isn’t associative (x + (y + z) is different from (x + y) + z)
  • if you add very big values to very small values, you can get inaccurate results (the small numbers get lost!)
  • you can’t represent very large integers as floating point numbers
  • NaN/infinity values can propagate and cause chaos
  • there are two zeros (+0 and -0), and they’re not represented the same way
  • denormal/subnormal values are weird

But I find all of this a little abstract on its own, and I really wanted some specific examples of floating point bugs in real-world programs.

So I asked on Mastodon for examples of how floating point has gone wrong for them in real programs, and as always folks delivered! Here are a bunch of examples. I’ve also written some example programs for some of them to see exactly what happens. Here’s a table of contents:

how does floating point work?
floating point isn’t “bad” or random
example 1: the odometer that stopped
example 2: tweet IDs in Javascript
example 3: a variance calculation gone wrong
example 4: different languages sometimes do the same floating point calculation differently
example 5: the deep space kraken
example 6: the inaccurate timestamp
example 7: splitting a page into columns
example 8: collision checking

None of these 8 examples talk about NaNs or +0/-0 or infinity values or subnormals, but it’s not because those things don’t cause problems – it’s just that I got tired of writing at some point :).

Also I’ve probably made some mistakes in this post.

how does floating point work?

I’m not going to write a long explanation of how floating point works in this post, but here’s a comic I wrote a few years ago that talks about the basics:

floating point isn’t “bad” or random

I don’t want you to read this post and conclude that floating point is bad. It’s an amazing tool for doing numerical calculations. So many smart people have done so much work to make numerical calculations on computers efficient and accurate! Two points about how all of this isn’t floating point’s fault:

  • Doing numerical computations on a computer inherently involves some approximation and rounding, especially if you want to do it efficiently. You can’t always store an arbitrary amount of precision for every single number you’re working with.
  • Floating point is standardized (IEEE 754), so operations like addition on floating point numbers are deterministic – my understanding is that 0.1 + 0.2 will always give you the exact same result (0.30000000000000004), even across different architectures. It might not be the result you expected, but it’s actually very predictable.

My goal for this post is just to explain what kind of problems can come up with floating point numbers and why they happen so that you know when to be careful with them, and when they’re not appropriate.

Now let’s get into the examples.

example 1: the odometer that stopped

One person said that they were working on an odometer that was continuously adding small amounts to a 32-bit float to measure distance travelled, and things went very wrong.

To make this concrete, let’s say that we’re adding numbers to the odometer 1cm at a time. What does it look like after 10,000 kilometers?

Here’s a C program that simulates that:

#include <stdio.h>
int main() {
    float meters = 0;
    int iterations = 1000000000;
    for (int i = 0; i < iterations; i++) {
        meters += 0.01;
    }
    printf("Expected: %f km\n", 0.01 * iterations / 1000 );
    printf("Got: %f km \n", meters / 1000);
}

and here’s the output:

Expected: 10000.000000 km
Got: 262.144012 km

This is VERY bad – it’s not a small error, 262km is a LOT less than 10,000km. What went wrong?

what went wrong: gaps between floating point numbers get big

The problem in this case is that, for 32-bit floats, 262144.0 + 0.01 = 262144.0. So it’s not just that the number is inaccurate, it’ll actually never increase at all! If we travelled another 10,000 kilometers, the odometer would still be stuck at 262144 meters (aka 262.144km).

Why is this happening? Well, floating point numbers get farther apart as they get bigger. In this example, for 32-bit floats, here are 3 consecutive floating point numbers:

  • 262144.0
  • 262144.03125
  • 262144.0625

I got those numbers by going to https://float.exposed/0x48800000 and incrementing the ‘significand’ number a couple of times.

So, there are no 32-bit floating point numbers between 262144.0 and 262144.03125. Why is that a problem?

The problem is that 262144.03125 is about 262144.0 + 0.03. So when we try to add 0.01 to 262144.0, the exact result is less than halfway to the next representable number, so it rounds back down and the sum just stays at 262144.0.

Also, it’s not a coincidence that 262144 is a power of 2 (it’s 2^18). The gaps between floating point numbers change after every power of 2, and at 2^18 the gap between 32-bit floats is 0.03125, up from 0.015625 just below.
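You can reproduce the stuck odometer in a couple of lines of Python with numpy’s 32-bit floats:

>>> import numpy as np
>>> print(np.float32(262144.0) + np.float32(0.01))
262144.0
>>> np.float32(262144.0) + np.float32(0.01) == np.float32(262144.0)
True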

one way to solve this: use a double

Using a 64-bit float fixes this – if we replace float with double in the above C program, everything works a lot better. Here’s the output:

Expected: 10000.000000 km
Got: 9999.999825 km

There are still some small inaccuracies here – we’re off about 17 centimeters. Whether this matters or not depends on the context: being slightly off could very well be disastrous if we were doing a precision space maneuver or something, but it’s probably fine for an odometer.

Another way to improve this would be to increment the odometer in bigger chunks – instead of adding 1cm at a time, maybe we could update it less frequently, like every 50cm.

If we use a double and increment by 50cm instead of 1cm, we get the exact correct answer:

Expected: 10000.000000 km
Got: 10000.000000 km

A third way to solve this could be to use an integer: maybe we decide that the smallest unit we care about is 0.1mm, and then measure everything as integer multiples of 0.1mm. I have never built an odometer so I can’t say what the best approach is.
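For example, here’s a sketch of the integer approach in Python (the 0.1mm unit is just an illustration – and a billion-iteration loop is slow in Python, but the arithmetic is exact):

units = 0                          # distance in 0.1mm units
for _ in range(1_000_000_000):     # 1cm at a time, 10,000 km total
    units += 100                   # 1cm = 100 units of 0.1mm
print(units / 10_000_000, "km")    # 10,000,000 units per km -> 10000.0 km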

example 2: tweet IDs in Javascript

Javascript only has floating point numbers – it doesn’t have an integer type. The biggest integer you can represent in a 64-bit floating point number is 2^53.

But tweet IDs are big numbers, bigger than 2^53. The Twitter API now returns them as both integers and strings, so that in Javascript you can just use the string ID (like “1612850010110005250”), but if you tried to use the integer version in JS, things would go very wrong.

You can check this yourself by taking a tweet ID and putting it in the Javascript console, like this:

>> 1612850010110005250 
   1612850010110005200

Notice that 1612850010110005200 is NOT the same number as 1612850010110005250!! It’s 50 less!

This particular issue doesn’t happen in Python (or any other language that I know of), because Python has integers. Here’s what happens if we enter the same number in a Python REPL:

In [3]: 1612850010110005250
Out[3]: 1612850010110005250

Same number, as you’d expect.

example 2.1: the corrupted JSON data

This is a small variant of the “tweet IDs in Javascript” issue, but even if you’re not actually writing Javascript code, numbers in JSON are still sometimes treated as if they’re floats. This mostly makes sense to me because JSON has “Javascript” in the name, so it seems reasonable to decode the values the way Javascript would.

For example, if we pass some JSON through jq, we see the exact same issue: the number 1612850010110005250 gets changed into 1612850010110005200.

$ echo '{"id": 1612850010110005250}' | jq '.'
{
  "id": 1612850010110005200
}

But it’s not consistent across all JSON libraries: Python’s json module will decode 1612850010110005250 as the correct integer.
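For example:

>>> import json
>>> json.loads('{"id": 1612850010110005250}')
{'id': 1612850010110005250}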

Several people mentioned issues with sending floats in JSON: either they were trying to send a large integer (like a pointer address) in JSON and it got corrupted, or they were sending smaller floating point values back and forth repeatedly and the value slowly diverged over time.

example 3: a variance calculation gone wrong

Let’s say you’re doing some statistics, and you want to calculate the variance of many numbers. Maybe more numbers than you can easily fit in memory, so you want to do it in a single pass.

There’s a simple (but bad!!!) algorithm you can use to calculate the variance in a single pass, from this blog post. Here’s some Python code:

import numpy as np

def calculate_bad_variance(nums):
    sum_of_squares = 0
    sum_of_nums = 0
    N = len(nums)
    for num in nums:
        sum_of_squares += num**2
        sum_of_nums += num
    mean = sum_of_nums / N
    variance = (sum_of_squares - N * mean**2) / N

    print(f"Real variance: {np.var(nums)}")
    print(f"Bad variance: {variance}")

First, let’s use this bad algorithm to calculate the variance of 5 small numbers. Everything looks pretty good:

In [2]: calculate_bad_variance([2, 7, 3, 12, 9])
Real variance: 13.84
Bad variance: 13.840000000000003 <- pretty close!

Now, let’s try the same thing with 100,000 large numbers that are very close together (uniformly distributed between 100000000 and 100000000.06):

In [7]: calculate_bad_variance(np.random.uniform(100000000, 100000000.06, 100000))
Real variance: 0.00029959105209321173
Bad variance: -138.93632 <- OH NO

This is extremely bad: not only is the bad variance way off, it’s NEGATIVE! (the variance is never supposed to be negative, it’s always zero or more)

what went wrong: catastrophic cancellation

What’s going on here is similar to our odometer problem: the sum_of_squares number gets extremely big (about 10^21 or 2^69), and at that point, the gap between consecutive 64-bit floating point numbers is also very big – it’s 2**17, or 131072. So we just lose all precision in our calculations.

The term for this problem is “catastrophic cancellation” – we’re subtracting two very large floating point numbers which are both going to be pretty far from the correct value of the calculation, so the result of the subtraction is also going to be wrong.

The blog post I mentioned before talks about a better algorithm people use to compute variance called Welford’s algorithm, which doesn’t have the catastrophic cancellation issue.
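Here’s a minimal sketch of Welford’s algorithm in Python (see that blog post for the real details):

def welford_variance(nums):
    # a single pass, without ever subtracting two huge numbers
    mean = 0.0
    M2 = 0.0    # running sum of squared distances from the current mean
    count = 0
    for x in nums:
        count += 1
        delta = x - mean
        mean += delta / count
        M2 += delta * (x - mean)
    return M2 / count    # population variance, like np.var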

And of course, the solution for most people is to just use a scientific computing library like Numpy to calculate variance instead of trying to do it yourself :)

example 4: different languages sometimes do the same floating point calculation differently

A bunch of people mentioned that different platforms will do the same calculation in different ways. One way this shows up in practice is – maybe you have some frontend code and some backend code that do the exact same floating point calculation. But it’s done slightly differently in Javascript and in PHP, so your users end up seeing discrepancies and getting confused.

In principle you might think that different implementations should work the same way because of the IEEE 754 standard for floating point, but here are a couple of caveats that were mentioned:

  • math operations in libc (like sin/log) behave differently in different implementations. So code using glibc could give you different results than code using musl
  • some x86 instructions can use 80 bit precision for some double operations internally instead of 64 bit precision. Here’s a GitHub issue talking about that

I’m not very sure about these points and I don’t have concrete examples I can reproduce.

example 5: the deep space kraken

Kerbal Space Program is a space simulation game, and it used to have a bug called the Deep Space Kraken where when you moved very fast, your ship would start getting destroyed due to floating point issues. This is similar to the other problems we’ve talked about involving big floating point numbers (like the variance problem), but I wanted to mention it because:

  1. it has a funny name
  2. it seems like a very common bug in video games / astrophysics / simulations in general – if you have points that are very far from the origin, your math gets messed up

Another example of this is the Far Lands in Minecraft.

example 6: the inaccurate timestamp

I promise this is the last example of “very large floating point numbers can ruin your day”. But! Just one more! Let’s imagine that we try to represent the current Unix epoch in nanoseconds (about 1673580409000000000) as a 64-bit floating point number.

This is no good! 1673580409000000000 is about 2^60 (crucially, bigger than 2^53), and the next 64-bit float after it is 1673580409000000256.

So this would be a great way to end up with inaccuracies in your time math. Of course, time libraries actually represent times as integers, so this isn’t usually a problem. (there’s always still the year 2038 problem, but that’s not related to floats)

In general, the lesson here is that sometimes it’s better to use integers.
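You can see the precision loss directly: two nanosecond timestamps that differ by 1 become the same 64-bit float (the gap between floats at that magnitude is 256):

>>> float(1673580409000000000) == float(1673580409000000001)
True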

example 7: splitting a page into columns

Now that we’ve talked about problems with big floating point numbers, let’s do a problem with small floating point numbers.

Let’s say you have a page width, and a column width, and you want to figure out:

  1. how many columns fit on the page
  2. how much space is left over

You might reasonably try floor(page_width / column_width) for the first question and page_width % column_width for the second question. Because that would work just fine with integers!

In [5]: math.floor(13.716 / 4.572)
Out[5]: 3

In [6]: 13.716 % 4.572
Out[6]: 4.571999999999999

This is wrong! The amount of space left is 0!

A better way to calculate the amount of space left might have been 13.716 - 3 * 4.572, which gives us a very small negative number.

I think the lesson here is to never calculate the same thing in 2 different ways with floats.

This is a very basic example but I can kind of see how this would create all kinds of problems if I was doing page layout with floating point numbers, or doing CAD drawings.

example 8: collision checking

Here’s a very silly Python program, that starts a variable at 1000 and decrements it until it collides with 0. You can imagine that this is part of a pong game or something, and that a is a ball that’s supposed to collide with a wall.

a = 1000
while a != 0:
    a -= 0.001

You might expect this program to terminate. But it doesn’t! a is never 0, instead it goes from 1.673494676862619e-08 to -0.0009999832650532314.

The lesson here is that instead of checking for float equality, usually you want to check if two numbers are different by some very small amount. Or here we could just write while a > 0.
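In Python, math.isclose is one built-in way to do the “different by some very small amount” check:

>>> import math
>>> 0.1 + 0.2 == 0.3
False
>>> math.isclose(0.1 + 0.2, 0.3)
True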

that’s all for now

I didn’t even get to NaNs (there are so many of them!) or infinity or +0 / -0 or subnormals, but we’ve already written 2000 words and I’m going to just publish this.

I might write another followup post later – that Mastodon thread has literally 15,000 words of floating point problems in it, there’s a lot of material! Or I might not, who knows :)