Reading List
The most recent articles from a list of feeds I subscribe to.
Tips for analyzing logs
Hello! I’ve been working on writing a zine about debugging for a while now (we’re getting close to finishing it!!!!), and one of the pages is about analyzing logs. I asked for some tips on Mastodon and got WAY more tips than could fit on the page, so I thought I’d write a quick blog post.
I’m going to talk about log analysis in the context of distributed systems debugging (you have a bunch of servers with different log files and you need to work out what happened) since that’s what I’m most familiar with.
search for the request’s ID
Often log lines will include a request ID, so searching for the request ID of a failed request will show all the log lines for that request.
This is a GREAT way to cut things down, and it’s one of the first helpful tips I got about distributed systems debugging – I was staring at a bunch of graphs on a dashboard fruitlessly trying to find patterns, and a coworker gave me the advice (“julia, try looking at the logs for a failed request instead!”). That turned out to be WAY more effective in that case.
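For plain log files, this can be as simple as a grep. A minimal sketch (the `request_id=` field name and the sample lines are made up; real logs will differ):

```shell
# create a tiny sample log (stand-in for your real log files)
printf '%s\n' \
  '2024-05-01T10:00:00Z INFO  request_id=abc123 starting request' \
  '2024-05-01T10:00:01Z INFO  request_id=def456 starting request' \
  '2024-05-01T10:00:02Z ERROR request_id=abc123 upstream timed out' > reqs.log

# pull out every line for the failed request
grep 'request_id=abc123' reqs.log
```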
correlate between different systems
Sometimes one set of logs doesn’t have the information you need, but you can get that information from a different service’s logs about the same request.
If you’re lucky, they’ll both share a request ID.
More often, you’ll need to manually piece together context from clues and the timestamps of the request.
This is really annoying but I’ve found that often it’s worth it and gets me a key piece of information.
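If both services log sortable timestamps at the start of each line (an assumption — many formats don't), you can interleave two logs into one timeline with plain `sort`:

```shell
# invented sample logs from two services handling the same request
printf '%s\n' \
  '2024-05-01T10:00:00Z gateway: forwarding request' \
  '2024-05-01T10:00:03Z gateway: got 500 from backend' > gateway.log
printf '%s\n' \
  '2024-05-01T10:00:01Z backend: handling request' \
  '2024-05-01T10:00:02Z backend: database timeout' > backend.log

# interleave both logs into one timeline (works because ISO 8601
# timestamps sort lexicographically)
sort gateway.log backend.log > timeline.log
cat timeline.log
```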
beware of time issues
If you’re trying to correlate events based on time, there are a couple of things to be aware of:
- sometimes the time in a logging system is based on the time the log was ingested, not the time that the event actually happened. Sometimes you have to write a date parser to get the actual time the event happened.
- different machines can have slightly skewed clocks
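One quick way to compare timestamps from different machines is to normalize them to epoch seconds. A sketch using GNU `date` (the `-d` flag is GNU-specific; BSD/macOS `date` works differently):

```shell
# convert two timestamps (say, from two different machines' logs)
# to epoch seconds so any skew is easy to see
t1=$(date -u -d '2024-05-01T10:00:00Z' +%s)
t2=$(date -u -d '2024-05-01T10:00:03Z' +%s)
echo "difference: $((t2 - t1)) seconds"
```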
log lines for the same request can be very far apart
Especially if a request takes a long time (maybe it took 5 minutes because of a long timeout!), the log lines for the request might be much more spread out than you expected. You can accumulate many thousands of log lines in 5 minutes!
Searching for the request ID really helps with this – it makes it harder to accidentally miss a log entry with an important clue.
Also, log lines can occasionally get completely lost if a server dies.
build a timeline
Keeping all of the information straight in your head can get VERY confusing, so I find it helpful to keep a debugging document where I copy and paste bits of information.
This might include:
- key error messages
- links to relevant dashboards / log system searches
- pager alerts
- graphs
- human actions that were taken (“right before this message, we restarted the load balancer…”)
- my interpretation of various messages (“I think this was caused by…”)
reformat them into a table
Sometimes I’ll reformat the log lines to print just the fields I’m interested in, to make them easier to scan. I’ve done this on the command line with a simple awk command:
cat ... | awk '{print $5, $8}'
but also with fancy log analysis tools (like Splunk) that let you make a table on the web
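To get an actual aligned table on the command line, you can pipe the awk output through `column -t`. A sketch with invented sample lines (which field numbers you pick depends entirely on your log format):

```shell
# invented sample log
printf '%s\n' \
  '2024-05-01T10:00:00Z INFO  request_id=abc123 status=200 duration=12ms' \
  '2024-05-01T10:00:02Z ERROR request_id=abc123 status=500 duration=5000ms' > fields.log

# keep just timestamp, level and status, aligned into columns
awk '{print $1, $2, $4}' fields.log | column -t
```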
check that a “suspicious” error is actually new
Sometimes I’ll notice a suspicious error in the logs and think “OH THERE’S THE CULPRIT!!!”. But when I search for that message to make sure that it’s actually new, I’ll find out that this error actually happens constantly during normal operation, and that it’s completely unrelated to the (new) situation that I’m dealing with.
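One quick sanity check: count how often the "suspicious" message appears in a log from before the incident. A sketch with an invented error message and log file:

```shell
# yesterday's log: the "suspicious" error was already happening
printf '%s\n' \
  'ERROR connection reset by peer' \
  'INFO  request ok' \
  'ERROR connection reset by peer' > yesterday.log

# a nonzero count means the error predates today's incident,
# so it's probably not the culprit
count=$(grep -c 'connection reset by peer' yesterday.log)
echo "seen $count times yesterday"
```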
use the logs to make a graph
Some log analysis tools will let you turn your log lines into a graph to detect patterns.
You can also make a quick histogram with grep and sort. For example I’ve
often done something like:
grep -o 'some regex' | sort | uniq -c | sort -n
to count how many times each distinct match of my regular expression appears
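Here's that pipeline filled in with a concrete (invented) example, counting HTTP status codes in an access log:

```shell
# invented sample access log
printf '%s\n' \
  'GET /api/users status=200' \
  'GET /api/users status=500' \
  'GET /api/posts status=200' \
  'GET /api/posts status=200' > access.log

# histogram of status codes, most common last
grep -oE 'status=[0-9]+' access.log | sort | uniq -c | sort -n
```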
filter out irrelevant lines
You can remove irrelevant lines with grep like this:
cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | grep -v THING4
for the reply guys: yes, we all know you don’t need to use cat here :)
Or if your log system has some kind of query language, you can search for NOT THING1 AND NOT THING2 ...
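The chain of `grep -v`s can also be collapsed into a single alternation with `grep -vE`, which does the same filtering in one pass:

```shell
# invented sample log with some noise lines
printf '%s\n' 'keep me' 'THING1 noise' 'THING2 noise' 'also keep' > noisy.log

# same result as grep -v THING1 | grep -v THING2, in one command
grep -vE 'THING1|THING2' noisy.log
```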
find the first error
Often an error causes a huge cascade of related errors. Digging into the later errors can waste a lot of your time – you need to start by finding the first thing that triggered the cascade. Often you don’t need to understand the exact details of why the 15th thing in the error cascade failed; you can just fix the original problem and move on.
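For a plain log file, `grep -n ... | head -1` will point you at the first error in the cascade. A sketch with an invented cascade:

```shell
# invented sample log: one root cause, then a cascade
printf '%s\n' \
  '10:00:00 INFO  ok' \
  '10:00:05 ERROR db connection refused' \
  '10:00:06 ERROR query failed' \
  '10:00:07 ERROR request failed' > cascade.log

# the first error is usually the one worth reading
first=$(grep -n 'ERROR' cascade.log | head -1)
echo "$first"
```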
scroll through the log really fast
If you already have an intuition for what log lines for this service should normally look like, sometimes scrolling through them really fast will reveal something that looks off.
turn the log level up (or down)
Sometimes turning up the log level will give you a key error message that explains everything.
But other times, you’ll get overwhelmed by a million irrelevant messages
because the log level is set to INFO, and you need to turn the log level down.
put it in a spreadsheet/database
I’ve never tried this myself, but a couple of people suggested copying parts of the logs into a spreadsheet (with the timestamp in a different column) to make it easier to filter / sort.
You could also put the data into SQLite or something (maybe with sqlite-utils?) if you want to be able to run SQL queries on your logs.
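If you want to try the SQLite route, here's a minimal sketch using the `sqlite3` command-line shell (the CSV columns are invented; in csv mode, `.import` creates the table from the header row when it doesn't exist yet):

```shell
# invented sample: logs already parsed into CSV with a header row
printf '%s\n' \
  'ts,level,msg' \
  '2024-05-01T10:00:00Z,INFO,ok' \
  '2024-05-01T10:00:02Z,ERROR,timeout' > parsed.csv

rm -f logs.db
sqlite3 logs.db <<'EOF'
.mode csv
.import parsed.csv logs
EOF

# now you can run arbitrary SQL over your logs
errors=$(sqlite3 logs.db "SELECT count(*) FROM logs WHERE level='ERROR';")
echo "$errors"
```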
on generating good logs
A bunch of people also had thoughts on how to output easier-to-analyze logs. This is a bigger topic than a few bullet points but here are a few quick things:
- use a standard schema/format to make them easier to parse
- include a transaction ID/request ID, to make it easier to filter for all lines related to a single transaction/request
- include relevant information. For example, “ERROR: Invalid msg size” is less helpful than “ERROR: Invalid msg size. Msg-id 234, expected size 54, received size 0”.
- avoid logging personally identifiable information
- use a logging framework instead of plain print statements (this gives you things like log levels and a standard structure)
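As a sketch of what structured output buys you, here's a tiny hypothetical shell helper that emits one JSON log line per call, with a level, a request ID, and a message (a real logging framework would handle this for you, plus levels and filtering):

```shell
# hypothetical helper: one JSON log line per call
log() {
  printf '{"ts":"%s","level":"%s","request_id":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3"
}

# the message carries the relevant context, not just "invalid msg size"
line=$(log ERROR req-234 "invalid msg size: expected 54, got 0")
echo "$line"
```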
that’s all!
Let me know on Twitter/Mastodon if there’s anything I missed! I might edit this to add a couple more things.
Response: "Note-taking apps: Bear and Joplin"
It's interesting to me that Will was able to move from Bear to Joplin and generally enjoys it as much as he did Bear.
I use and love Bear, and the closest I have to that with a more cross-platform-friendly constraint is Standard Notes, who also recently added regional pricing. Not to my surprise, though, as they were willing to offer economic difficulty discounts for a long time. All you had to do was drop them an email, but now it's all automated.
Still, I'm not sure I can leave Bear just yet. I do have one eye on a migration to SN because Bear is, as Will mentioned, strictly Apple-only, and in the long-run, that is very much a "problem."
What's going to convince me SN is good enough is when their editors feel as polished as Bear's do. My experience with its editors in the past has left me wanting much more. Bear is also cheaper, which is very high on my priority list right now.
I encourage you to get in touch over email by using the following convenience link for any discussion: comment via email.
Automagically assimilating NixOS machines into your Tailnet with Terraform
ACT Rules CG published implementations
The ACT Rules Community Group (CG) writes testing rules for accessibility standards conformance. In the process, they find consensus in interpretation issues. This week, the CG published an “Implementations” page that shows these rules are not used in just one place, but in many: from manual methodologies to semi-automated (axe DevTools) and automated (axe-core, Alfa, Qualweb and more) tools.
Before going into why this is great, I will give some context. Accessibility standards like WCAG define criteria to conform with; they help makers of websites understand what a web without barriers means in practice. For instance, in WCAG, it means colours have sufficient contrast, pages have titles, buttons have names, and so on. In WAI-ARIA, it means elements with a cell role must be contained in or owned by an element with the row role, to name just one example. These, and all other requirements (“MUST”s) in standards, are testable requirements. This makes testing basically the process of finding out whether you meet the standards (you, a product, a browser, a tool, an assistive technology, etc). It is a great lens to view accessibility through, and definitely a lens I use a lot when teaching.
But if you ever speak about the same issue with multiple accessibility specialists, you'll find that there can be differences in interpretation of these standards (see also: The Accessibility Interpretation Problem by Glenda Sims and Wilco Fiers). One way to avoid interpretation issues would probably be to have thousands of success criteria in WCAG that allow for zero interpretation, but that imaginary world would be a much worse world. Instead, agreeing on how to test might be the best option we have.
The ACT Rules Community Group brings together makers of tools and other stakeholders to make testing accessibility standards more consistent, by finding agreement. They then write rules according to a format they specified. The rules that apply specifically to WCAG are also reviewed by AGWG, the group that produces WCAG.
This is tedious work and I, for one, am very pleased to see these rules now have implementations across different methodologies and tools. We'll probably see them more and more in the wild (I know I will in some tools I use). Congrats, Community Group, and keep up the great work!
Originally posted as ACT Rules CG published implementations on Hidde's blog.