The ElasticSearch Rant

Hero image: volcano-waifu. Image generated by SCMix -- volcano, hellfire, burning, fire, 1girl, light green hair, dark green eyes, hoodie, denim, long hair, portrait, masterpiece, best quality, high quality, absurdres, tarot, detailed background

Mara is hacker
<Mara> This post is a rant. Do not take this as seriously as you would other posts.

As a part of my continued efforts to heal, one of the things I've been trying to do is avoid being overly negative and venomous about technology. I don't want to be angry when I write things on this blog. I don't want to be known as someone who is venomous and hateful. This is why I've been disavowing my articles about the V programming language among other things.

ElasticSearch makes it difficult for me to keep this streak up.

I have never had the misfortune of using technology that has worn down my sanity and made me feel like I was fundamentally wrong about my understanding of computer science the way ElasticSearch has. Not since Kubernetes, at least, and I totally disavow using Kubernetes now on the advice of my therapist. Maybe ElasticSearch is actually this bad, but I'm so close to the end that I feel like I have no choice but to keep pressing forward.

This post outlines all of my suffering getting ElasticSearch working for something at work. I have never suffered quite this much, and I am now two months into a two-week project. I have to work on this on Fridays so that I'm not pissed off and angry at computers for the rest of the week.

Buckle up.

For doublerye to read only
Cadey is coffee
<Cadey> Hi, manager! Surely you are reading this post. This post is not an admission of defeat. This post should not be construed as anything but a list of everything that I've had to suffer through to get search working. I'm sorry this has taken so long but I can only do so much when the tools are lying to me.

Limbo

ElasticSearch is a database, I guess, but the main thing it's used for is as a search engine. The basic idea of a search engine is to be a central aggregation point where you can feed in a bunch of documents (such as blog posts or knowledge base entries) and then let users search for words in those documents. As it turns out, the Markdown front matter present in most Markdown deployments, combined with the rest of the body reduced to plain text, is good enough to feed in as a corpus for the search machine.

However, there was a slight problem: I wanted to index documents from two projects, and they use different dialects of Markdown. Markdown is about as specified as the POSIX standard, and one of the dialects was MDX, a tool that lets you mix Markdown and React components. At first I thought this was horrifying and awful, but I've since come around on it and think that MDX is actually pretty convenient in practice. The other dialect was whatever Hugo uses, but specifically with a bunch of custom shortcodes that I had to parse and replace.

Mara is hacker
<Mara> Explaining the joke: the POSIX standard has a lot of core behavior that is formally defined as "implementation defined", casually known as "undefined behavior". This does not make it easy for programs to do fancy things and stay portable, so most of the time people copy the semantics of other POSIX OSes just to make life easier.

Turns out writing a parser that could scrape out the important bits was easy, and I had it working within a few hours of hacking thanks to judicious abuse of line-by-line file reading.
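
Here's a minimal sketch of what I mean by line-by-line scraping, assuming a "---"-fenced front matter block. This is the shape of the approach, not the code that actually shipped:

import (
	"bufio"
	"io"
)

// splitFrontMatter reads a Markdown file line by line and separates the
// "---"-fenced front matter from the body. A hypothetical sketch of the
// approach, not the real parser.
func splitFrontMatter(r io.Reader) (meta, body []string, err error) {
	sc := bufio.NewScanner(r)
	inMeta := false
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "---" && !inMeta && len(meta) == 0 && len(body) == 0:
			inMeta = true // opening fence
		case line == "---" && inMeta:
			inMeta = false // closing fence
		case inMeta:
			meta = append(meta, line)
		default:
			body = append(body, line)
		}
	}
	return meta, body, sc.Err()
}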

Cadey is coffee
<Cadey> Sorry to whoever has to maintain that code! It was my only real option.

Regardless, this worked and I got to a point where I was happy with the JSON objects I was producing. Now we can get to the real fun: actually using ElasticSearch.

Lust

When we first wanted to implement search, we were gonna use something else (maybe Sonic), but we eventually realized that ElasticSearch was the correct option for us. I was horrified because I have only ever had bad experiences with it. I was assured that most of the issues were with running it on your own hardware and that using Elastic Cloud was the better option. We were also very lucky at work because we had someone from Elastic's DevRel team with us. I was told that there was this neat feature named AppSearch that would automatically crawl and index everything for us, so I didn't need to write that hacky code at all.

So we set up AppSearch and it actually worked pretty well at first. We didn't have to care and AppSearch diligently scraped over all of the entries, adding them to ElasticSearch without us having to think. This was one of the few parts of this process where everything went fine and things were overall very convenient.

After being shown how to make raw queries to ElasticSearch with the Kibana developer tools UI (which unironically is an amazing tool for doing funky crap with ElasticSearch), I felt like I could get things set up easily. I was feeling hopeful.

Gluttony

Then I tried to start using the Go library for ElasticSearch. I'm going to paste one of the Go snippets I wrote for this; it's from the part of the indexing process where you write objects to ElasticSearch.

data, err := json.Marshal(esEntry)
if err != nil {
	log.Fatalf("failed to marshal entry: %v", err)
}

// The second argument is the literal document body as raw JSON bytes,
// wrapped in an io.Reader. Keying on the entry ID means re-running the
// indexer overwrites documents instead of duplicating them.
resp, err := es.Index("site-search-kb", bytes.NewBuffer(data), es.Index.WithDocumentID(entry.ID))
if err != nil {
	log.Fatal(err)
}

// The client doesn't raise errors for unsuccessful HTTP statuses, so the
// status code has to be checked by hand.
switch resp.StatusCode {
case http.StatusOK, http.StatusCreated:
	// ok
default:
	log.Fatalf("failed to index entry %q: %s", entry.Title, resp.String())
}

To make things clearer here: I am using the ElasticSearch API bindings from Elastic. This is the code you have to write with it. You have to feed it raw JSON bytes (which I found out was the literal document body after a lot of fruitless searching through the documentation; I'll get back to the documentation later) as an io.Reader (in this case a bytes.Buffer wrapping the byte slice of JSON). This was not documented in the Go code. I had to figure this out by searching GitHub for that exact function name.

Aoi is wut
<Aoi> Wait, you're using an API wrapper. Why do you need to check the HTTP status code manually? Those status codes are documented somewhere, right?
Cadey is percussive-maintenance
<Cadey> Oh dear, bless your sweet and innocent heart. The API client only really handles authentication and randomly trying queries between several nodes in a cluster. It doesn't handle raising errors on unsuccessful HTTP status codes. God forbid you have to read anything out of the reply, because you have to parse the JSON yourself.
Aoi is coffee
<Aoi> How does this program exist.

Oh yeah, I forgot to mention this but they don't ship error types for ElasticSearch errors, so you have to json-to-go them yourself. I really wish they shipped the types for this and handled that for you. Holy cow.
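
For the record, this is roughly what I ended up hand-rolling. The field names are my best guess at the error response shape after feeding one through json-to-go, not an official type from Elastic:

// esError mirrors the parts of an ElasticSearch error response that I care
// about. Reverse-engineered, not an official type.
type esError struct {
	Error struct {
		Type   string `json:"type"`
		Reason string `json:"reason"`
	} `json:"error"`
	Status int `json:"status"`
}

Then, when a request fails, you get to decode it yourself:

if resp.IsError() {
	var ee esError
	if err := json.NewDecoder(resp.Body).Decode(&ee); err != nil {
		log.Fatalf("failed to decode the error response: %v", err)
	}
	log.Fatalf("elasticsearch error: %s: %s", ee.Error.Type, ee.Error.Reason)
}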

Greed

Now you may wonder why I went through that process if AppSearch was working well for us. It was automatically indexing everything and it should have been fine. No. It was not fine, but the reason it's not fine is very subtle and takes a moment to really think through.

In general, you can break most webpages down into three basic parts:

  • The header which usually includes navigation links and the site name
  • The article contents (such as this unhinged rant about ElasticSearch)
  • The footer which usually includes legal overhead and less frequently used navigation links (such as linking to social media).

When you are indexing things, you usually want to index that middle segment, which will usually account for the bulk of the page's contents. There are many ways to do this, but the most common is the Readability algorithm, which extracts the "signal" of a page. The reader view in Firefox and Safari uses something like it to break the text free from its HTML prison.
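
As a toy illustration of the idea (and only the idea; this is nowhere near a real Readability implementation, and it is not what AppSearch or our code does), extracting the middle of a page can look like this:

import (
	"io"
	"strings"

	"golang.org/x/net/html"
)

// extractMain returns the text inside <main> or <article> elements, skipping
// <nav>, <header>, <footer>, and script/style noise.
func extractMain(r io.Reader) (string, error) {
	doc, err := html.Parse(r)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	var walk func(n *html.Node, inMain bool)
	walk = func(n *html.Node, inMain bool) {
		if n.Type == html.ElementNode {
			switch n.Data {
			case "nav", "header", "footer", "script", "style":
				return // boilerplate: skip this subtree entirely
			case "main", "article":
				inMain = true
			}
		}
		if n.Type == html.TextNode && inMain {
			sb.WriteString(n.Data)
			sb.WriteString(" ")
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c, inMain)
		}
	}
	walk(doc, false)
	return strings.TrimSpace(sb.String()), nil
}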

So, knowing this, it seems reasonable that AppSearch would do this, right? It doesn't make sense to index the site name and navigation links because that would mean that searching for the term "Tailscale" would get you utterly useless search results.

Guess what AppSearch does?

Aoi is facepalm
<Aoi> Oh god, how does this just keep getting worse every time you mention anything? Who hurt the people working on this?

Even more fun, you'd assume this would be configurable. It's not. The UI has a lot of configuration options but this seemingly obvious configuration option wasn't a part of it. I can only imagine how sites use this in practice. Do they just return article HTML when the AppSearch user-agent is used? How would you even do that?

I didn't want to figure this out. So we decided to index things manually.

Anger

So at this point, let's assume that there are documents in ElasticSearch from the whole AppSearch thing. We knew we weren't gonna use AppSearch, but I wanted some façade of progress so I wouldn't feel like I was slipping into insanity. I decided to try searching things in ElasticSearch because even though the stuff from AppSearch was not ideal, there were documents in ElasticSearch and I could search for them. Most of the programs that I see using ElasticSearch have a fairly common query syntax that lets you write searches like this:

author:Xe DevRel

And then you can search for articles written by Xe about DevRel. I thought that this would roughly be the case with how you query ElasticSearch.

Turns out this is nowhere near the truth. You actually search by POST-ing a JSON document to the ElasticSearch server and reading back a JSON document with the results. This is a bit strange to me, but I guess this means that you'd have to implement your own search DSL (which probably explains why all of those search DSLs vaguely felt like special snowflakes). The other main problem is how ElasticSearch uses JSON.
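
For reference, here's roughly what the body you POST to the _search endpoint looks like, written out the way you'd run it in the Kibana developer tools. The author field here is illustrative, not our real schema:

POST /site-search-kb/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "author": "Xe" } },
        { "match": { "body_content": "DevRel" } }
      ]
    }
  }
}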

To be fair, the way that ElasticSearch uses JSON probably makes sense in Java or JavaScript, but not in Go. In Go encoding/json expects every JSON field to only have one type. In basically every other API I've seen, it's easy to handle this because most responses really do have one field mean one thing. There's rare exceptions like message queues or event buses where you have to dip into json.RawMessage to use JSON documents as containers for other JSON documents, but overall it's usually fine.

ElasticSearch is not one of these cases. You can have a mix where some fields are bare empty objects (which is possible but annoying to synthesize in Go) and others take potentially dynamic child values.
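
As a tiny example of what I mean, even a match_all query needs a bare empty object as a value, which in Go means synthesizing something like map[string]any{} and hoping encoding/json does the right thing. A sketch, not code from the project:

// {"query": {"match_all": {}}} is the simplest possible search. Expressing
// "an empty JSON object" in Go means reaching for an empty map or struct{}{}.
body := map[string]any{
	"query": map[string]any{
		"match_all": map[string]any{}, // marshals to {}
	},
}

data, err := json.Marshal(body)
if err != nil {
	log.Fatal(err)
}
fmt.Println(string(data)) // {"query":{"match_all":{}}}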

Mara is hacker
<Mara> To be fair there is a typed variant of the ElasticSearch client for Go that does attempt to polish over most of the badness, but they tried to make it very generic in ways that just fundamentally don't work in Go. I'm pretty sure the same type-level tricks would work a lot better in Rust.

Go's type system is insufficiently typeful to handle the unrestrained madness that is ElasticSearch JSON.

When I was writing my search querying code, I tried to use mholt's excellent json-to-go which attempts to convert arbitrary JSON documents into Go types. This did work, but the moment we needed to customize things it became a process of "convert it to Go, shuffle the output a bit, and then hope things would turn out okay". This is fine for the Go people on our team, but the local ElasticSearch expert wasn't a Go person.

Cadey is coffee
<Cadey> If you are said ElasticSearch expert that has worked with me throughout this whole debacle, please don't take my complaining about this all as a slight against you. You have been the main reason that I haven't given up on this project and when this thing gets shipped I am going to make every effort to highlight your support throughout this. I couldn't have made it this far without you.

Then my manager suggested something as a joke. He suggested using text/template to generate the correct JSON for querying ElasticSearch. It worked. We were both amazed. The even more cursed thing about it was how I quoted the string.

In text/template, you can set variables like this:

{{ $foo := <expression> }}

And these values can be the results of arbitrary template expressions, such as the printf function:

{{ $q := printf "%q" .Query }}

The thing that makes me laugh about this is that the grammar of %q in Go is enough like the grammar for JSON strings that I don't really have to care (unless people start piping invalid Unicode into it, in which case things will probably fall over and the client code will fall back to suggesting people search via Google). This is serendipitous, but very convenient for my use case.

If it ever becomes an issue, I'll probably encode the string to JSON with json.Marshal, cast it to a string, and then have that be the thing passed to the template. I don't think it will matter until we have articles with emoji or Hanzi in them, which doesn't seem likely any time soon.
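
Put together, the template approach looks roughly like this. The index and field names here are illustrative, not the real ones:

import (
	"bytes"
	"text/template"
)

var searchTmpl = template.Must(template.New("search").Parse(`
{{- $q := printf "%q" .Query -}}
{
  "query": {
    "multi_match": {
      "query": {{ $q }},
      "fields": ["title", "body_content"]
    }
  }
}`))

// renderQuery turns a user's search string into the JSON body that gets
// POST-ed to the _search endpoint.
func renderQuery(query string) (string, error) {
	var buf bytes.Buffer
	err := searchTmpl.Execute(&buf, struct{ Query string }{Query: query})
	return buf.String(), err
}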

Aoi is wut
<Aoi> I guess this would also be really good for making things maintainable such that a non-Go expert can maintain it too. At first this seemed like a totally cursed thing, but somehow I guess it's a reasonable idea? Have I just been exposed to way too much horror to think this is terrible?

Treachery

Another fun thing that I was running into when I was getting all this set up is that seemingly all of the authentication options for ElasticSearch are broken in ways that defy understanding. The basic code samples tell you to get authentication credentials from Elastic Cloud and use that in your program in order to authenticate.

This does not work. Don't try doing this. This will fail and you will spend hours ripping your hair out because the documentation is lying to you. I am unaware of any situation where the credentials in Elastic Cloud will let you contact ElasticSearch servers, and overall this was really frustrating to find out the hard way. The credentials in Elastic Cloud are for programmatically spinning up more instances in Elastic Cloud.

Aoi is coffee
<Aoi> Why isn't any of this documented?

ElasticSearch also has a system for getting an API key to make requests against the service so that you can use the principle of least privilege to limit access accordingly. This doesn't work either. What you actually need to do is use username and password authentication. This is the thing that works.
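
For what it's worth, the Go client configuration that finally authenticated looks something like this. The endpoint URL is a placeholder for your deployment's ElasticSearch endpoint (not anything from the Elastic Cloud console), and the credentials come from the environment:

es, err := elasticsearch.NewClient(elasticsearch.Config{
	// Placeholder endpoint; use the deployment's ElasticSearch URL.
	Addresses: []string{"https://my-deployment.example.es.io:9243"},
	Username:  os.Getenv("ES_USERNAME"),
	Password:  os.Getenv("ES_PASSWORD"),
})
if err != nil {
	log.Fatalf("failed to create the ElasticSearch client: %v", err)
}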

Cadey is coffee
<Cadey> To be honest, I felt like giving up at this point. Using a username and password was a thing that we could do, but that felt wrong in a way that is difficult for me to describe. I was exceedingly happy that anything worked and it felt like I was crawling up a treadmill to steal forward millimeters at a time while the floor beneath me was pulling away meters at a time. This is when I started to wonder if working on a project like this to have "impact" for advancing my career was really worth the sanity cost.

Then I found out that the ping API call is a privileged operation and you need administrative permissions to use it. Same with the "whoami" call.

Aoi is rage
<Aoi> Okay, how the fuck does that make any sense at all? Isn't it a fairly common pattern to do a "check if the server is working and the authentication is being accepted appropriately" call before you try to do anything with the service? Why would you protect that? Why isn't that a permissionless call? What about the "whoami" call? Why would you need to lock that down behind any permissions? This just makes no sense. I don't get it. Maybe I'm missing something here, maybe there's some fundamental aspect about database design that I'm missing. Am I missing anything here or am I just incompetent?
Numa is happy
<Numa> Welcome to the CLUB!
Aoi is facepalm
<Aoi> This never ends, does it?
Numa is happy
<Numa> Nope. Welcome to hell.

Really though, that last string of "am I missing something fundamental about how database design works or am I incompetent?" thoughts is one of the main things that has been really fucking with me throughout this entire saga. I try to not let things that I work on really bother me (mostly so I can sleep well at night), but this whole debacle has been very antithetical to that goal.

Blasphemy

Once we got things into production in a testing capacity (after probably spending a bit of my sanity that I won't get back), we noticed that random HTTP responses were returning 503 errors without a response body. This bubbled up weirdly for people and ended up with the frontend code failing in a strange way (turns out it didn't always suggest searching Google when things fail). After a bit of searching, I think I found out what was going on, and it made me sad. But to understand why, let's talk about what ElasticSearch does when it returns fields from a document.

In theory, any attribute in an ElasticSearch document can have one or more values. Consider this hypothetical JSON document:

{
  "id": "https://xeiaso.net/blog/elasticsearch",
  "slug": "blog/elasticsearch",
  "body_content": "I've had nightmares that are less bad than this shit",
  "tags": ["rant", "philosophy", "elasticsearch"]
}

If you index such a document into ElasticSearch and do a query, it'll show up in your response like this:

{
  "id": ["https://xeiaso.net/blog/elasticsearch"],
  "slug": ["blog/elasticsearch"],
  "body_content": ["I've had nightmares that are less bad than this shit"],
  "tags": ["rant", "philosophy", "elasticsearch"]
}

And at some level, this really makes sense and is one of the few places where ElasticSearch is making a sensible decision when it comes to presenting user input back to the user. It makes sense to put everything into a string array.

However, in Go this is really inconvenient. If you run into a situation where you search for "NixOS" and the highlighter doesn't return any values for the article (even though it really should, because the article is about NixOS), you can get a case where there's somehow no highlighted portion. Then, because you assumed that it would always return something (let's be fair: this is a reasonable assumption), you try to index the 0th element of an empty array and it panics and crashes at runtime.

Aoi is wut
<Aoi> Oh, wait, is this why Rust makes Vec indexing return an Option<T> instead of panicking if the index is out of bounds like Go does? That design choice was confusing to me, but it makes a lot more sense now.

We were really lucky that "NixOS" was one of the terms that did this behavior, otherwise I suspect we would never have found it. I did a little hack where it'd return the first 150 characters of an article instead of a highlighted portion if no highlighted portion could be found (I don't know why this would be the case but we're rolling with it I guess) and that seems to work fine...until we start using Hanzi/emoji in articles and we end up cutting a family in half. We'll deal with that when we need to I guess.
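
The hack is basically this (a sketch, assuming the highlight comes back as a slice of strings; the byte slicing at the end is exactly the part that will cut a family in half):

// snippetFor returns the first highlight fragment, or a naive fallback when
// ElasticSearch hands back an empty array. Hypothetical helper, not the exact
// production code.
func snippetFor(fragments []string, body string) string {
	if len(fragments) > 0 {
		return fragments[0]
	}
	// Fallback: the first 150 bytes of the article. This slices bytes, not
	// runes, so multi-byte characters (Hanzi, emoji) can get chopped in half.
	if len(body) > 150 {
		return body[:150]
	}
	return body
}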

Thievery

While I was hacking at this, I kept hearing mentions that the TypeScript client was a lot better, mostly due to TypeScript's type system being so damn flexible. You can do the N-Queens problem in the type solver alone!

However, this is not the case, and some of it is not really Elastic's fault: the entire JavaScript ecosystem is garbage right now. As for why, consider this code:

import * as dotenv from "dotenv";

This will blow up and fail if you try to run it with ts-node. Why? It's because it's using ECMAScript Modules instead of "classic" CommonJS imports. This means that you have to transpile your code from the code you wish you could write (using the import keyword) to the code you have to write (using the require function) in order to run it.

Or you can do what I did and just give up and use CommonJS imports:

const dotenv = require("dotenv");

That works too, unless you try to import an ECMAScript module; then you have to use the import() function in an async function context, but the top level isn't an async context, so you have to do something like:

(async () => {
  const foo = await import("foo");
})();

This does work, but it is a huge pain to not be able to use the standard import syntax that you should be using anyways (and in many cases, the rest of your project is probably already using this standard import syntax).

Mara is hacker
<Mara> This is one of the cases where Deno gets things right: Deno uses ECMAScript modules by default. Why can't everything else just do this? So frustrating.

Anyways, once you get past the undefined authentication semantics again, you can get to the point where your client is ready to poke the server. Then you take a look at the documentation for creating an index.

Aoi is wut
<Aoi> Seriously, what is it with these ElasticSearch clients and not having strictly defined authentication semantics, I don't get it.

In case Elastic has fixed it, I have recreated the contents of the documentation for the indices.create function below:

create

Creates an index with optional settings and mappings.

Endpoint documentation

client.indices.create(...)

Upon seeing this, I almost signed off of work for the day. What are the function arguments? What can you put in the function? Presumably it takes a JSON object of some kind, but what keys can you put in the object? Is this library like the Go one where it's thin wrappers around the raw HTTP API? How do JSON fields bubble up into HTTP request parts?

Turns out that there's a whole lot of conventions in the TypeScript client that I totally missed because I was looking right at the documentation for the function I wanted to look at. Every method call takes a JSON object that has a bunch of conventions for how JSON fields map to HTTP request parts, and I missed it because that's not mentioned in the documentation for the method I want to read about.

Actually wait, that's apparently a lie because the documentation doesn't actually spell out what the conventions are.

Cadey is coffee
<Cadey> I realize that my documentation consumption strategy may be a bit backwards here. I usually start from the immediate thing I'm trying to do and then work backwards from there to my solution. A code sample for the function in question at the place where the function is documented would have helped a lot.

I had to use ChatGPT as a debugging tool of last resort in order to get things working at all. To my astonishment, the suggestions that ChatGPT made worked. I have never seen anything documented as poorly as this and I thought that the documentation for NixOS was bad.

Cadey is percussive-maintenance

Fin

If your documentation is bad, your user experience is bad. Companies are usually cursed to recreate copies of their communication structures in their products, and with the way the Elastic documentation is laid out I have to wonder if there is any communication at all inside there.

One of my coworkers was talking about her experience trying to join Elastic as a documentation writer and apparently part of the hiring test was to build something using ElasticSearch. Bless her heart, but this person in particular is not a programmer. This isn't a bad thing, it's perfectly reasonable to not expect people with different skillsets to be that cross-functional, but good lord if I'm having this much trouble doing basic operations with the tool I can't expect anyone else to really be able to do it without a lot of hand-holding. That coworker asked if it was a Kobayashi Maru situation (for the zoomers in my readership: this is an intentionally set up no-win scenario designed to test how you handle making all the correct decisions and still losing), and apparently it was not.

Any sufficiently bad recruiting process is indistinguishable from hazing.

I am so close to the end with all of this. I thought that I would put off finalizing and posting this to my blog until I was completely done with the project, but I'm two months into a two-week project now. From what I hear, I apparently got the ElasticSearch stuff working rather quickly (???), and I just don't really know how people are expected to use this. I had an ElasticSearch expert on my side and we regularly ran into issues with basic product functionality that made me start to question how Elastic is successful at all.

I guess the fact that ElasticSearch is the most flexible option on the market helps. When you start to really understand what you can do with it, there are a lot of really cool things that I don't think anything else on the market could realistically accomplish in as little time as ElasticSearch does.

ElasticSearch is just such a huge pain in the ass that it's making me ultimately wonder if it's really worth using and supporting as a technology.