How to stop synchronous code in a web worker?

I recently ported Zebra, an Othello/Reversi AI, to Rust and then wrote a web client that uses a WASM build of the Rust port.

At some point in the process, I ran into what seemed like an unsolvable problem, one that took me almost a year to figure out:

How do I actually stop the engine?

When the engine is "thinking", it can take an arbitrarily long time, depending on how you set the search depth. Not only that, it's also a synchronous process, so other frontends (WZebra, Reversatile) run the engine on a separate thread to avoid blocking UI input. When the user interacts and the search has to stop, the UI thread sets a stop flag and the engine thread stops the search as soon as it reads the flag.

Naturally, I worked towards the same solution in the web client, using a web worker to run the engine and keep the UI thread free for user interaction. But here's the catch - you can't create the stop flag. On the web, every thread has its own event loop, and threads can only send messages to each other. If you send a message to a web worker, the worker has to finish its current event loop iteration and return; only then does the event loop run the next message handler. There's no way to block or wait on a condition variable.

let stopFlag = false
self.addEventListener("message", message => {
    if (message.data === 'start-engine-search') {
        while(stopFlag === false) {
            // This code now blocks the worker thread
            // How can we stop it from the main thread?
        }
    } else if (message.data === 'stop-flag') {
        // this doesn't work, because this message handler
        // can't get executed until the previous handler exits
        stopFlag = true
    }
})

We'd need something like a synchronous readNextMessage call to read the message from the main thread, but that API doesn't exist - you have to return and wait for the event loop to call your callback instead.

What are the options?

This is not a totally unsolvable problem, but the available solutions are not great.

Atomics and SharedArrayBuffer

There's a way to do exactly this using relatively new features, Atomics and SharedArrayBuffer. The problem is that they were disabled to mitigate hardware exploits such as Spectre and are not widely supported at the moment, so this is a non-starter.
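For comparison, here is a minimal sketch of what the stop flag could look like in an environment where SharedArrayBuffer is exposed (in browsers this generally requires the page to be cross-origin isolated). The helper names createStopFlag, requestStop and isStopRequested are made up for illustration and are not part of WebZebra.

```typescript
// Hypothetical helpers; none of these names come from the WebZebra code.

// Main thread: allocate one shared 32-bit slot. The buffer would then be
// sent to the worker with postMessage, which shares it instead of copying it.
function createStopFlag(): SharedArrayBuffer {
    return new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
}

// Main thread: set the flag. The worker can observe the change
// without returning to its event loop.
function requestStop(buffer: SharedArrayBuffer): void {
    Atomics.store(new Int32Array(buffer), 0, 1);
}

// Worker: cheap synchronous check, callable from inside the search loop.
function isStopRequested(buffer: SharedArrayBuffer): boolean {
    return Atomics.load(new Int32Array(buffer), 0) === 1;
}
```

This is the "single atomic read" that the rest of this post tries to imitate without shared memory.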

Kill the worker

Another solution is to kill the whole worker and start a new one. This is a pretty ugly solution, but it could work. The problem with Zebra is that it has a lot of internal state that would have to be saved and restored, so it would require some refactoring. Managing the workers would also get more complicated. I was pretty sceptical about this solution, so I didn't consider it very seriously.
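To make the trade-off concrete, here is a rough sketch of the kill-and-restart approach. It is not something I implemented; RestartableEngine, WorkerLike, the message shape and the "engine-worker.js" file name are all hypothetical. The worker is created through a factory so the sketch doesn't depend on a browser.

```typescript
// Minimal shape of the Worker API that this sketch relies on.
interface WorkerLike {
    postMessage(message: unknown): void;
    terminate(): void;
}

// Hypothetical wrapper: owns one worker and replaces it whenever a search is aborted.
class RestartableEngine {
    private worker: WorkerLike;
    private readonly spawn: () => WorkerLike;

    // In a browser, `spawn` would be something like () => new Worker("engine-worker.js").
    constructor(spawn: () => WorkerLike) {
        this.spawn = spawn;
        this.worker = spawn();
    }

    // The engine state has to travel with every request, because a fresh worker knows nothing.
    startSearch(savedState: unknown): void {
        this.worker.postMessage({ type: "start-engine-search", state: savedState });
    }

    // terminate() stops the worker even in the middle of a loop, but it also
    // throws away everything the worker had in memory.
    abortSearch(): void {
        this.worker.terminate();
        this.worker = this.spawn();
    }
}
```

The savedState parameter is where the pain is: everything the engine keeps in memory would have to be serializable and restorable.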

State Machine transform

The only clean solution that I saw at first was to break up the computation into slices by transforming the engine code into a granular state machine - instead of doing the recursive search all at once, we'd save the state every once in a while and return to the event loop, scheduling the next step with something like setTimeout(..., 0). This gives the event loop an opportunity to actually deliver the stop message and call the handler that aborts the search.
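As a toy illustration of the idea (nothing like the real engine), here is a sketch that counts the nodes of a tree in slices. The recursion is replaced by an explicit list of pending nodes, which is the "state" that survives between slices. All names (TreeNode, SearchState, runSlice, searchInSlices) and the slice size of 1000 are assumptions made up for this example.

```typescript
// Toy stand-in for a recursive search: count the nodes of a tree.
interface TreeNode {
    children: TreeNode[];
}

// The explicit state that replaces the call stack of the recursive version.
interface SearchState {
    pending: TreeNode[]; // nodes still to visit
    visited: number;     // result so far
    stopped: boolean;    // would be set by the stop message handler
}

// Runs at most `budget` steps, then returns. True means the search is finished.
function runSlice(state: SearchState, budget: number): boolean {
    for (let i = 0; i < budget && state.pending.length > 0; i++) {
        const node = state.pending.pop()!;
        state.visited++;
        state.pending.push(...node.children);
    }
    return state.pending.length === 0;
}

// Returns to the event loop between slices so that a queued stop message can run.
function searchInSlices(state: SearchState, onDone: (visited: number) => void): void {
    if (state.stopped) {
        return; // aborted between two slices
    }
    if (runSlice(state, 1000)) {
        onDone(state.visited);
    } else {
        setTimeout(() => searchInSlices(state, onDone), 0);
    }
}
```

For a tree walk this is easy. For a heavily optimized alpha-beta search with shared state, every function on the call path would need the same treatment.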

This is doable, but it's a huge amount of work. The engine is around 50k lines of complicated, fine-tuned and heavily optimized code, with a lot of shortcuts and micro-optimizations and a ton of shared state. Many algorithms look quite different when you rewrite them in state machine form. I had actually already done that for the main loop of the engine, to solve the same problem with getting user input for the next move. That was a much smaller problem - only a single function call - but even that made the code much more complex and hard to navigate. And it also took a lot of time.

One way to cheat here is to make the functions in the engine async - I tried this with the previous problem, but it wasn't very good either, because it broke all the synchronous users of that code and introduced a bunch of other complexity and new problems to solve.

The solution is in the air

At first, I just postponed the problem, set the search depth low and worked on other things. I gave this some more thought after maybe a year. I went to bed after a rough day (covid + work) and started pondering.

What if there is a way for the worker to get information from the main thread synchronously? Maybe I could use some flag in localStorage? Well, no, because localStorage is not available in a web worker. Maybe I can use IndexedDB? Still no, because it only has an asynchronous API.

A web worker has a pretty restricted set of APIs. Are there any notable synchronous ones? Yes, but most of them are useless, because they don't really communicate with the outside world. There's the location API, which you can change from the main thread, but this doesn't work because the web worker can't see the changes until the next event loop iteration.

There's one interesting API, though - the classic old-school XMLHttpRequest with the async=false flag. This one looks promising, but what would this request call? Some remote server? That's kinda expensive just for a stop flag, right? Besides that, I don't even have a server, WebZebra is just a static site.

The ultimate hack

And then it all clicked. I can use URL.createObjectURL to create a URL to a local dummy object, give this URL to a worker and let the worker continuously poll it with a synchronous HTTP request. When the main thread needs to stop the worker, it just calls URL.revokeObjectURL. That invalidates the URL, the worker's next request fails and the worker knows it should exit.

I was so excited to try this that I got up from bed and implemented a prototype right away - and it worked. After a year of despair, the solution, which I called stopToken, was unbelievably simple:

// give this token to a worker and the main thread
function createStopToken(): string {
    return URL.createObjectURL(new Blob());
}

// call this from the main thread when you want to stop the worker
function stop(stopToken: string): void {
    URL.revokeObjectURL(stopToken);
}

// call this from the worker periodically
function shouldStop(stopToken: string): boolean {
    let xhr = new XMLHttpRequest();
    xhr.open("GET", stopToken, /* async= */false);
    try {
        xhr.send(null);
    } catch (e) {
        return true // request failed, URL has been revoked
    }
    return false // URL is still valid, we can continue
}

Pretty cool fake atomic, right?

In practice, there's some more juggling around this - you have to recreate the token each time you need to start some stoppable operation. Also, the check is quite expensive (as you might expect). It's not really that bad, because the request doesn't escape the browser, but still, it's usually around 1-2ms, which is quite a lot for something which should ideally be a single atomic read.

The code in the engine is checking the flag quite frequently, so I implemented some throttling/caching logic, to avoid making the request too often. I also skip the check completely during the first 300ms after the search starts, because it's unlikely someone will stop it that soon, and we don't want to block the engine when it does the most valuable work at the beginning.
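A minimal sketch of what such throttling could look like is below. It is an illustration, not the exact code from the repository: createThrottledCheck, its parameters and the 50 ms minimum interval are assumptions. Only the 300 ms grace period comes from the description above. The expensive check and the clock are passed in, so in the worker you would wrap something like () => shouldStop(stopToken).

```typescript
// Hypothetical throttling wrapper. Apart from the 300 ms grace period,
// the names and values here are assumptions, not taken from WebZebra.
function createThrottledCheck(
    check: () => boolean,                 // the expensive check, e.g. () => shouldStop(stopToken)
    now: () => number = () => Date.now(), // injectable clock, in milliseconds
    gracePeriodMs = 300,                  // skip all checks right after the search starts
    minIntervalMs = 50,                   // minimum time between two real checks
): () => boolean {
    const startedAt = now();
    let lastCheckAt = startedAt;
    let stopped = false;

    return () => {
        if (stopped) {
            return true; // once stopped, stay stopped without asking again
        }
        const t = now();
        if (t - startedAt < gracePeriodMs || t - lastCheckAt < minIntervalMs) {
            return false; // too early: reuse the last (negative) answer
        }
        lastCheckAt = t;
        stopped = check();
        return stopped;
    };
}
```

The engine can then call the returned function as often as it likes. Most calls cost only a clock read, and the synchronous request happens at most once per interval.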

Happy end

So here it is - I think it's a pretty neat trick.

The code is on GitHub. Feel free to use it when you encounter a similar problem. I explicitly added the MIT license to the stopToken file to make it more permissive, but be careful with other code from that repo - most of it has to be GPL 2, because it's derived from the original Zebra, which is also GPL 2.

Now I am going to dig through Stack Overflow to find that one desperate question without an answer that I encountered when trying to resolve this, and hopefully make that person's day a bit brighter.

Discussions: r/javascript, r/programming, Hacker News

7 May 2022