
Async Global Isolation

When developing systems with internal state that cannot be accessed by multiple writers simultaneously, it's common to wrap the state in locks or other concurrency primitives. The goal is to ensure that any operation on the state is both atomic and isolated: the operation should either complete fully or be rolled back to a known state, and two or more operations on the shared state should never run concurrently.

This scenario gets more interesting in single-threaded, co-operative systems (e.g., async), where multiple coroutines exist at once but only one can be running at any point in time. Here, an atomic and isolated transaction on some state can be achieved by simply not yielding during the transaction. If the operation succeeds, the coroutine continues as normal until it eventually yields back to the event loop. If it fails, the coroutine cleans up and either yields immediately or continues on. Either path results in an atomic, isolated transaction, since no other coroutine could have interacted with the state at the same time.
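As a minimal, self-contained sketch (not from the original article; the names `balance` and `transfer` are mine), here is a toy "transaction" that stays atomic purely because it never awaits between its read and its write:

```python
import asyncio

balance = 0

async def transfer(amount: int) -> None:
    # No await between the read and the write, so no other coroutine
    # can observe or modify `balance` mid-transaction.
    global balance
    current = balance
    balance = current + amount

async def main() -> None:
    # 100 concurrent "transactions" still produce a consistent result.
    await asyncio.gather(*(transfer(1) for _ in range(100)))

asyncio.run(main())
print(balance)  # → 100
```

Had `transfer` awaited between reading `current` and writing `balance`, interleaved coroutines could overwrite each other's updates.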

The issue here is that relying on the coroutine not yielding during the operation is both error prone (many developers do not understand how co-operative async works) and implicitly turns local state into global state. We may only need to ensure that the state is isolated within a specific context; preventing all other coroutines from running promotes this to a global lock.

I have a small example that highlights this.


Recently, I wrote some async Python code to handle OAuth token authorization. Multiple async tasks existed simultaneously, and any one of them could use an instance of an authorization class to add a shared OAuth token to their HTTP requests. If the token was invalid, they would refresh it inline with their request. Here's an abridged example of it:

class OAuthAuthorizer:
    _token: str
    _alock: asyncio.Lock

    def _is_token_valid(self) -> bool: ...

    def _add_auth_to_request(self, request: Request) -> Request: ...

    async def _refresh_token(self) -> None: ...

    async def _maybe_refresh_token(self) -> None:
        if not self._is_token_valid():
            await self._refresh_token()

    # Requests flow through this function and have authorization added
    # before being sent to the remote server.
    async def handle_request(self, request: Request) -> Request:
        if not self._is_token_valid():
            async with self._alock:
                await self._maybe_refresh_token()
        # Add "Authorization: Bearer <token>" to the request
        return self._add_auth_to_request(request)

If one task made a request and found an invalid token, it would begin refreshing the token. If a second task then attempted to make a request while the first task was refreshing, it would hit the lock and block until the first task finished. The second self._is_token_valid() check, inside _maybe_refresh_token(), handles the case where we blocked on the lock but were not the one refreshing, and so need to check whether the token was refreshed while we were waiting.
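To see this double-check pattern in action, here is a runnable, hypothetical stand-in (FakeAuthorizer and its refresh counter are mine, not the original code) that counts refreshes under concurrent requests:

```python
import asyncio

class FakeAuthorizer:
    """Toy stand-in for the authorizer above; counts refreshes."""

    def __init__(self) -> None:
        self._token = ""
        self._alock = asyncio.Lock()
        self.refresh_count = 0

    def _is_token_valid(self) -> bool:
        return bool(self._token)

    async def _refresh_token(self) -> None:
        self.refresh_count += 1
        await asyncio.sleep(0)  # simulate the network round trip (yields!)
        self._token = "fresh"

    async def _maybe_refresh_token(self) -> None:
        if not self._is_token_valid():  # the second validity check
            await self._refresh_token()

    async def handle_request(self) -> None:
        if not self._is_token_valid():
            async with self._alock:
                await self._maybe_refresh_token()

async def main() -> FakeAuthorizer:
    auth = FakeAuthorizer()
    # Five concurrent requests all see an invalid token up front,
    # but only the first one that acquires the lock actually refreshes.
    await asyncio.gather(*(auth.handle_request() for _ in range(5)))
    return auth

auth = asyncio.run(main())
print(auth.refresh_count)  # → 1
```

Every waiting task re-checks validity after acquiring the lock, so the refresh runs exactly once.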

Could this code be rewritten to avoid locks and still remain isolated? Given the above code, we could simply remove the async from _refresh_token() and run it synchronously:

class OAuthAuthorizer:

    def _refresh_token(self) -> None: ...

    def handle_request(self, request: Request) -> Request:
        if not self._is_token_valid():
            self._refresh_token()
        return self._add_auth_to_request(request)

Now, when a task begins refreshing the token, it blocks while making its external request for a new token. No other task can run during this period, so the operation is guaranteed to be isolated, and that guarantee lets us remove the lock. We also remove async from handle_request to mark it as a blocking operation.

As mentioned earlier, there are two issues here. One is that we now must maintain the invariant that no other coroutine can run while refreshing the token. The other is that we just promoted this context-specific state to a global state.

Invalid Lock Invariants

Let's look at the first issue. If we aren't careful, we can easily introduce an await during the refresh. Doing so could cause more than one coroutine to begin the refresh operation concurrently. Let's say we have a new requirement where we need to attach user information to our OAuth refresh requests:

class OAuthAuthorizer:

    _user_manager: UserManager

    def _refresh_token(self, user: User) -> None: ...

    async def _fetch_user_information(self) -> User: ...

    async def handle_request(self, request: Request) -> Request:
        if not self._is_token_valid():
            user = await self._user_manager.fetch_current_user()
            self._refresh_token(user)
        return self._add_auth_to_request(request)

A well-intentioned developer introduced a new _user_manager attribute with an async method on it. The user manager was likely built for async, so calling it asynchronously is a reasonable assumption. In doing so, they marked handle_request as async again and added an await for the call. The result is that we now release control of the coroutine in the middle of the refresh process, which can easily allow another coroutine to enter the refresh block while we wait for our user information. The operation is no longer isolated, and multiple refreshes can occur.
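A stripped-down, hypothetical reproduction (names like BrokenAuthorizer and _fetch_user are mine) makes the failure observable: every task passes the validity check before any of them finishes the awaited user fetch, so every task refreshes:

```python
import asyncio

class BrokenAuthorizer:
    """The await between the validity check and the refresh yields to
    the event loop, so multiple tasks can each start a refresh."""

    def __init__(self) -> None:
        self._token = ""
        self.refresh_count = 0

    def _is_token_valid(self) -> bool:
        return bool(self._token)

    async def _fetch_user(self) -> str:
        await asyncio.sleep(0)  # yields: other tasks run here
        return "user"

    def _refresh_token(self, user: str) -> None:
        self.refresh_count += 1
        self._token = "fresh"

    async def handle_request(self) -> None:
        if not self._is_token_valid():
            user = await self._fetch_user()  # invariant broken
            self._refresh_token(user)

async def main() -> BrokenAuthorizer:
    auth = BrokenAuthorizer()
    await asyncio.gather(*(auth.handle_request() for _ in range(5)))
    return auth

auth = asyncio.run(main())
print(auth.refresh_count)  # → 5, one redundant refresh per task
```

All five tasks reach the await before the first one resumes, so the validity check is already stale by the time each refresh runs.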

Adding a sync interface to UserManager is possible, but likely expands the scope of the change, so the developer may not be interested in doing that. I'm also not aware of an "approved" way to block on another awaitable without letting any other coroutines run. Spawning an entirely new event loop should work, but is obviously a heavyweight solution.
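For completeness, here is a sketch of that heavyweight workaround: block the event-loop thread on a helper thread that runs the awaitable on its own, fresh event loop. block_on is a hypothetical helper of mine, not an asyncio API:

```python
import asyncio
import threading

def block_on(coro):
    """Run a coroutine to completion on a fresh event loop in a helper
    thread, blocking the calling thread. No coroutine on the caller's
    loop can run in the meantime. Heavyweight, as noted above."""
    result = {}

    def runner() -> None:
        result["value"] = asyncio.run(coro)  # new loop, new thread

    t = threading.Thread(target=runner)
    t.start()
    t.join()  # blocks this thread; our own event loop makes no progress
    return result["value"]

async def fetch_user() -> str:
    await asyncio.sleep(0)
    return "user"

print(block_on(fetch_user()))  # → user
```

This preserves the no-yield invariant at the cost of a thread and a whole event loop per call, which is rarely worth it.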

Adding a comment to the code to ensure developers don't do this is the only solution I can think of. As with any code comment, it is only useful if developers actually read it and keep it up to date.

Promoted Global State

The second issue involves global state versus context-specific state. A global state is one where the entire program relies on its existence to operate. Global state is rarely a good design choice (though not universally), but that's not the point of this article.

Taking the previous example, where we removed the lock and relied on the coroutine blocking during its refresh request, we have the following (repeated from earlier):

class OAuthAuthorizer:

    def _refresh_token(self) -> None: ...

    def handle_request(self, request: Request) -> Request:
        if not self._is_token_valid():
            self._refresh_token()
        return self._add_auth_to_request(request)

In removing the lock, we took a context-specific barrier and moved it out of the class and into the event loop. We now rely on the event loop's guarantee of running only a single coroutine at a time to ensure isolation. Even tasks that have nothing to do with the refresh process are blocked from running.

It should be fairly easy to see how this causes a "stop the world"-type lockup in the program. That is a fairly big downgrade for this specific scenario; it would be rare for every task in the program to immediately depend on a single OAuth process.
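A small, self-contained demonstration (the timings and function names are mine): a synchronous refresh stalls an unrelated task that only wanted to sleep for 10 ms:

```python
import asyncio
import time

async def blocking_refresh() -> None:
    # A synchronous call blocks the whole event loop, not just this task.
    time.sleep(0.2)  # stands in for a blocking HTTP round trip

async def unrelated_work() -> float:
    start = time.monotonic()
    await asyncio.sleep(0.01)  # only wants to wait 10 ms
    return time.monotonic() - start

async def main() -> float:
    task = asyncio.create_task(unrelated_work())
    await asyncio.sleep(0)   # let unrelated_work start its timer
    await blocking_refresh()  # stops the world for 200 ms
    return await task

elapsed = asyncio.run(main())
print(f"unrelated task waited {elapsed:.2f}s")  # well over 0.2s, not 0.01s
```

The unrelated task's 10 ms timer expires almost immediately in wall-clock time, but the event loop cannot resume it until the blocking refresh returns.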

Should you do this?

Maybe. I don't know what problem you are working on. In some cases, stopping the entire event loop may already be the result of a lock around the state. If every task in your program relies on the same shared state, then a lock around that state is basically already global. Moving the lock to the event loop will simplify the code, but then force developers to pay more attention to the invariant of never yielding during the operation. An explicit lock, while more complex, is at least a flag that forces developers to think about the operation before changing it.
