From 96ae6401c598e0d1e9c3bbb684439949a7adf21c Mon Sep 17 00:00:00 2001
From: Avraham Sakal
Date: Wed, 20 Nov 2024 22:02:27 -0500
Subject: [PATCH] write journal-entries/2024-11-20

---
 src/content/journal-entries/2024-11-20.md | 68 ++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 src/content/journal-entries/2024-11-20.md

diff --git a/src/content/journal-entries/2024-11-20.md b/src/content/journal-entries/2024-11-20.md
new file mode 100644
index 0000000..e3481df
--- /dev/null
+++ b/src/content/journal-entries/2024-11-20.md
@@ -0,0 +1,68 @@
+---
+title: Elasticsearch Ingestion Daemon
+date: 2024-11-20
+---
+
+I still saw requests coming in through an old Cloudflare-Worker-based proxy I had set up before I released the current one, which rewrites `m3u8` files on-the-fly in addition to proxying the segment files themselves (among other features). I had updated our website and app to use the new proxy; where were these requests coming from? I inspected some requests as they came in using the Cloudflare interface, and found that the User-Agent was always one of our apps, and different versions of it at that. We still had un-updated versions of our app out in the wild, but I also saw requests from the latest version! How could this be?
+
+Then it dawned on me: I vaguely recalled seeing in the app code that the app sometimes doesn't fetch a lecture when it's chosen, because the data is already available from the search results. Our search endpoints query Elasticsearch, which is _not_ the source of truth--MySQL is. So obviously it was serving stale data. Lo and behold, I ran a query and found old-style lecture video URLs in Elasticsearch. The lecture documents had to be re-ingested from MySQL.
+
+We had LogStash keeping Elasticsearch in sync with MySQL about a year and a half ago, but it kept running out of memory. I ended up writing a daemon in TypeScript to do the job. It works by only ingesting the latest data from MySQL (i.e. rows whose `updated_at` value is newer than the last ingestion time). But yesterday, when I re-defined the `vimeo_video_links` column in MySQL, I didn't set the `updated_at` value to the current time on all the rows. Perhaps I should have, but it didn't seem like the right thing to do, because the records weren't _updated_ semantically; they were just re-structured, and _represented_ the same data.
+
+Anyway, I reset the daemon's cursor state so it would start ingesting from the beginning.
+
+The next problem was that, in order to prevent the daemon from OOMing and/or overloading MySQL, I have it fetch small batches of records (500 at a time). Each batch is supposed to begin at the row right after where the previous batch ended. I used `updated_at` as the cursor state (i.e. the next batch began with the first lecture whose `updated_at` was after the last lecture in the previous batch). I then discovered that many lectures had the same `updated_at` value (strange, yes, but many things have happened over the years, including bulk-update scripts; and in MySQL the `updated_at` column only had a resolution of seconds), and many batch boundaries fell squarely among such lectures. Since the next batch began at the first lecture with an `updated_at` that was strictly _newer_ than the last lecture in the previous batch, any remaining lectures sharing that boundary timestamp were skipped (I couldn't use `>=` as the comparison operator, because then the script would keep getting back the same batch of lectures). So I ordered by `updated_at` _and_ by `id`, and adjusted the comparison to consider both `updated_at` and `id`. This solved the problem.
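+
+Roughly, the fixed batch query ended up looking like this. It's a minimal sketch: the `lectures` table name and the `mysql2` client are assumptions for illustration, not a record of the daemon's exact code:
+
+```typescript
+// A sketch of the keyset-pagination fetch; the `lectures` table name
+// and the mysql2 client are assumptions for illustration.
+import { Connection, RowDataPacket } from "mysql2/promise";
+
+interface Cursor {
+  updatedAt: Date;
+  id: number;
+}
+
+const BATCH_SIZE = 500;
+
+async function fetchNextBatch(conn: Connection, cursor: Cursor) {
+  // Rows are ordered by (updated_at, id), and the WHERE clause resumes
+  // strictly after the last row of the previous batch, so ties on
+  // updated_at are broken by id instead of being skipped or re-fetched.
+  const [rows] = await conn.query<RowDataPacket[]>(
+    `SELECT *
+       FROM lectures
+      WHERE updated_at > ?
+         OR (updated_at = ? AND id > ?)
+      ORDER BY updated_at ASC, id ASC
+      LIMIT ?`,
+    [cursor.updatedAt, cursor.updatedAt, cursor.id, BATCH_SIZE]
+  );
+  return rows;
+}
+```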
+
+By the way, the reason I used this cursor state rather than a plain `OFFSET` is that records may change between fetching batches, so what was at "offset 501" from the point of view of the first query may not be at "offset 501" from the point of view of the second query. Namely, if a record from an already-done batch is updated (and thus its `updated_at` value changes to `NOW()`), it jumps to the end of the sort order and everything after its old position shifts down by one: what was at "offset 501" is now at "offset 500" in the second query, but the second query starts from `OFFSET 501`, so it skips that one record.
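+
+And the outer loop around that query looks something like this (again just a sketch; `indexIntoElasticsearch` and `persistCursor` are stand-in names, not the daemon's real helpers). Because each batch is anchored on the last row's actual `(updated_at, id)` values rather than a numeric position, rows that get updated mid-run can't shift the next batch's starting point:
+
+```typescript
+// Hypothetical driver loop; indexIntoElasticsearch and persistCursor
+// are stand-in names, declared here only so the sketch type-checks.
+declare function indexIntoElasticsearch(rows: RowDataPacket[]): Promise<void>;
+declare function persistCursor(cursor: Cursor): Promise<void>;
+
+async function ingestAll(conn: Connection, start: Cursor): Promise<void> {
+  let cursor = start;
+  while (true) {
+    const rows = await fetchNextBatch(conn, cursor);
+    if (rows.length === 0) break; // caught up with MySQL
+    await indexIntoElasticsearch(rows); // bulk-index the batch
+    const last = rows[rows.length - 1];
+    cursor = { updatedAt: last.updated_at, id: last.id };
+    await persistCursor(cursor); // a restart resumes from here
+  }
+}
+```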