Reading List
Time is not a synchronization primitive from Christine Dodrill's Blog RSS feed.
Time is not a synchronization primitive
Programming is so complicated. I know this is an example of the nostalgia paradox in action, but it easily feels like everything has gotten so much more complicated over the course of my career. One of the biggest things that is really complicated is the fact that working with other people is always super complicated.
One of the axioms you end up working with is "assume best intent". This has sometimes been used as a dog-whistle to defend pathological behavior; but really there is a good idea at the core of this: everyone is really trying to do the best that they can given their limited time and energy and it's usually better to start from the position of "the system that allowed this failure to happen is the thing that must be fixed".
However, we work with other people and this can result in things that can troll you on accident. One of the biggest sources of friction is when people end up creating tests that can fail for no reason. To make this even more fun, this will end up breaking people's trust in CI systems. This lack of trust trains people that it's okay for CI to fail because sometimes it's not your fault. This leads to hacks like the flaky attribute on python where it will ignore test failures. Or even worse, it trains people to merge broken code to main because they're trained that sometimes CI just fails but everything is okay.
Today I want to talk about one of the most common ways that I see things fall apart. This has caused tests, production-load-bearing bash scripts, and normal application code to be unresponsive at best and randomly break at worst. It's when people use time as a synchronization mechanism.
Time as an effect
I think that the best way to explain this is to start with a flaky test that I wrote years ago and break it down to explain why things are flaky and what I mean by a "synchronization mechanism". Consider this Go test:
func TestListener(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go func() {
lis, err := net.Listen("tcp", ":1337")
if err != nil {
t.Error(err)
return
}
defer lis.Close()
for {
select {
case <- ctx.Done():
return
default:
}
conn, err := lis.Accept()
if err != nil {
t.Error(err)
return
}
// do something with conn
}()
time.Sleep(150*time.Millisecond)
conn, err := net.Dial("tcp", "127.0.0.1:1337")
if err != nil {
t.Error(err)
return
}
// do something with conn
}
This code starts a new goroutine that opens a network listener on port 1337 and then waits for it to be active before connecting to it. Most of the time, this will work out okay. However there's a huge problem lurking at the core of this: This test will take a minimum of 150 milliseconds to run no matter what. If the logic of starting a test server is lifted into a helper function then every time you create a test server from any downstream test function, you spend that additional 150 milliseconds.
Additionally, the TCP listener is probably ready near instantly, but also if you run multiple tests in parallel then they'll all fight for that one port and then everything will fail randomly.
This is what I mean by "synchronization primitive". The idea here is that by having the main test goroutine wait for the other one to be ready, we are using the effect of time passing (and the Go runtime scheduling/executing that other goroutine) as a way to make sure that the server is ready for the client to connect. When you are synchronizing the state of two goroutines (the client being ready to connect and the server being ready for connections), you generally want to use something that synchronizes that state, such as a channel or even by eliminating the need to synchronize things at all.
Consider this version of that test:
func TestListener(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
lis, err := net.Listen("tcp", ":0")
if err != nil {
t.Error(err)
return
}
go func() {
defer lis.Close()
for {
select {
case <- ctx.Done():
return
default:
}
conn, err := lis.Accept()
if err != nil {
t.Error(err)
return
}
// do something with conn
}()
conn, err := net.Dial(lis.Addr().Network(), lis.Addr().String())
if err != nil {
t.Error(err)
return
}
// do something with conn
}
Not only have we gotten rid of that time.Sleep call, we also made it support having multiple instances of the server in parallel! This code is ultimately much more robust than the old test ever was and will easily scale for your needs. If your tests took a total of 600 ms to run each, cutting out that one 150 ms sleep removes 25% of the wait!
Putting it into practice
So let's put this into practice and make this kind of behavior more
difficult to cause. Let's add a roadblock for trying to use
time.Sleep
in tests by using the nosleep
linter. nosleep
is a Go
linter that checks for the presence of time.Sleep
in your test code
and fails your code if it finds it. That's it. That's the whole tool.
You can run it against your Go code by installing it with go install
:
go install within.website/x/linters/cmd/nosleep@latest
And then you can run it with the nosleep
command:
nosleep ./...
I do recognize that sometimes you actually do need to use time as a
synchronization method because god is dead and you have no other
option. If this does genuinely happen, you can use the magic command
//nosleep:bypass here's a very good reason
. If you don't put a
reason there, the magic comment won't work.
Let me know how it works for you! Add it to your CI config if you dare.