Author Topic: Avoiding Race Conditions in Asynchronous Code  (Read 4821 times)

Offline leeor_net

  • Administrator
  • Hero Member
  • *****
  • Posts: 2350
  • OPHD Lead Developer
    • LairWorks Entertainment
Avoiding Race Conditions in Asynchronous Code
« on: October 22, 2018, 11:32:56 PM »
Hooman suggested I make a separate thread for this. So that's what I'm doing as this is a rather complex problem with a few different solutions.

So first to explain. Contrary to popular belief race conditions aren't limited to multi-threaded code. While most often observed in multi-threaded code, single thread asynchronous code is just as capable of ending up in race conditions. That stated, what is a race condition? Wikipedia explains it like this:

Race conditions arise in software when an application depends on the sequence or timing of processes or threads for it to operate properly. As with electronics, there are critical race conditions that result in invalid execution and bugs. Critical race conditions often happen when the processes or threads depend on some shared state. Operations upon shared states are critical sections that must be mutually exclusive. Failure to obey this rule opens up the possibility of corrupting the shared state.

It's the last statement in this quote that's applicable to the issue I ran into with OutpostHD and the event handling.

In an effort to keep this as understandable as possible, I'm going to address only the core components and design paradigms used providing code examples with high level pseudocode... so if something is unclear let me know so I can improve on that. So let's dig into it.

OutpostHD uses an event system based on a paradigm called Signals & Slots (sigslots). This is a design concept introduced by Qt and is a simple and effective implementation of the Observer Pattern. Basically, it's a way to 'signal' to other objects that something has happened. There are other uses of signals but it's a very effective way to handle events like key strokes, button presses, mouse motion, window state changes, etc. in a GUI system which is exactly what OutpostHD does.

There are three common functions used in the implementation of a sigslots interface: connect, disconnect and emit. The implication of what these functions do should be obvious; connect an observer to a signal, disconnect an observer from a signal and emit a signal to observers. The low level details are unimportant.

OPHD's GUI Controls connect themselves to a number of signals (MouseDown, MouseUp, KeyDown, KeyUp and MoveMove). In many of the handler functions for these (the observers), there was code along these lines:

Code: [Select]
on MouseDown()
{
    if (not visible or not enabled) return;

    /* do stuff related to mouse down event here */
}

Every handler function needs some form of code like this. Suffice it to say this makes it very easy to make mistakes where controls are responding to events that they shouldn't be responding to.

In an effort to reduce boiler plate code like the above, I started experimenting with having controls connect and disconnect themselves from signals based on visibility, enabled and focus states. E.g., if not visible or not enabled or not has focus, disconnect. Connect for the inverse. Simple, right? Right. Here's where the race condition comes in.

In order to implement the full screen UI panels as described in this form post, I built UI Panels that would be shown or hidden based on user input. Again, simple, right? So if I clicked on a Factory in the Map View, the Master Factory Report UI Panel would be displayed. Here's where the sigslots implementation in NAS2D fails and causes a race condition.

In NAS2D's implementation of sigslots, slots (or observers) are stored in a Signal as a list. The problem arises during the iteration over the list:

Code: [Select]
let iterator = slot_list.begin();
while(iterator is not equal to slot_list.end())
{
    iterator.emit();
    iterator = slot_list.next();
}

During this update loop, particularly in MouseDown, OPHD would set the Master Factory Report UI Panel to Visible. This would cause all of the controls in the UI Panel to connect to their respective Signals. Because many of these controls would connect to the MouseDown signal, this would invalidate the iterator in the original loop and this is the race condition that existed and would cause a very immediate program termination. To visualize it:


The Yellow?Green (can't tell, colorblind) segment is the normal flow of the iteration process. The problem is during the emit phase of the process -- the handler in red causes an invalidation of the iterator in the yellow?green process. Instant program crash. A race condition caused by asynchronous code modifying the slot list during an iteration over the signal list. Whoops.

So there are two possible solutions here. The easy solution: don't modify the slot list when responding to emitted signals (e.g., never connect/disconnect an observer from an observer function). The harder but much better solution: lock the slots list during iteration and defer any changes until after the iteration is complete (aka, semaphores).

Both options have their pros and cons. The first solution is a lot faster and easier to implement... but leads to more flaggy code (e.g., if (not visible or not enabled or not hasfocus) return). On the other hand, the second solution takes more time to implement properly and leads to more complex sigslots code... but avoids the race condition outlined here.

For OutpostHD, I opted for the first solution. I wanted to get through this problem and just move on with as little modification to the middleware as possible.

Interestingly enough, I had almost exactly the same problem early on in OPHD's development with the Robots and Structures. These could change states during an iteration over the list and could destroy itself or otherwise modify the list. This would lead to iterator invalidation and thus a crash. In both cases I went a semaphore solution -- lock the list, iterate over the list, defer modifications to the list until after the iteration is over.



Side note: Hooman objected to my use of the terms 'race condition' and 'semaphore' but according to all definitions I could find (that were not language or problem domain specific) these terms seem to be entirely accurate and appropriate for the issues described here. Discuss.
« Last Edit: October 23, 2018, 09:01:27 AM by leeor_net »

Offline Hooman

  • Administrator
  • Hero Member
  • *****
  • Posts: 4954
Re: Avoiding Race Conditions in Asynchronous Code
« Reply #1 on: October 23, 2018, 08:00:29 AM »
Hmm, your problem description makes what's going on much more clear.

I'm curious, when you show and hide windows, do you create and destroy the objects associated with them, or do you have multiple windows in memory, with one marked as active? I've noticed connect and disconnect calls, though I mostly saw it associated with constructors and destructors, so I assume it is tied to the lifetime of the UI objects. I suppose this may depend on the UI object in question.

It looks like the events are all dispatched from a single global source, and delivered directly to the listener without any routing.

I was thinking, what if you had routing of UI events. Instead of a button listening for global UI events, it could listen to events from its container, such as the window it is part of. The window could get it's events from upstream, either listening to the global object, or some other container objects. When an event is fired, it can be sent to a listening window, which can in turn deliver it to a listening button. If the window is not active, it would not route events, and so you wouldn't need to worry about connecting or disconnecting buttons from a global event source.

In terms of the checks done in every listener routine, you could potentially move those up to the parent object. If a window had a concept of a visible control, at a defined location, then it would know if a message was potentially of interest based on the control's location, visibility, and enabled status. The checks then become a part of message routing, rather than something every handler has to take care of.

For the full screen UI panels, I assume you mean the UI never really closes, it's just hidden from view, but still active in memory. If you had a list of such full screen panels, the container of all those panels would decide which one to route messages to. Then there's no need to connect or disconnect anything.

This would side step the list update problem, and iterator invalidation, in two separate ways. One, by not connecting or disconnecting anything, and instead routing the events (for always active elements). The other being you now have separate lists of listening objects at different levels, and you wouldn't need to update a list you were iterating over. Even if you were creating/destroying objects (non-always active elements), which involved connections and disconnections, those operations would be done on a different object than the one being iterated over. You iterate over the parent while updating the leafs.



As for terminology, I'd say it's not a race condition because the code is not truly asynchronous. The updates occur in the code called by emit, and is hence synchronous. To get a race condition, you need asynchronous operations, which could happen with multithreading, or with hardware level events (things that happen outside the current thread of the CPU).

By that light, semaphores don't make sense, as they are for signalling between different threads. You don't really have separate threads though. I'm tempted to suggest the term Coroutine, which is used in non-preemptive cooperative multitasking, though I'm not sure its use is quite appropriate here. A coroutine can typically run multiple times, passing control back and forth, and so may need it's own stack, much like a thread. What you have here is a simple subroutine. When the call done by emit finishes and returns, that's it, that method is done.

You're right that you have a problem with updates to a shared resource. I'd be hesitant to use the term critical section outside the realm of threading though. Especially since it generally implies a locking mechanism, which makes no sense in a single threaded context.

I will admit the problem is related though, with some of the terms rather close in meaning, but perhaps a little too specific so as to not apply. Perhaps there is a more general term for this problem, which encompasses both the multi threaded case, and single threaded case. Nothing coming to mind though.

Offline leeor_net

  • Administrator
  • Hero Member
  • *****
  • Posts: 2350
  • OPHD Lead Developer
    • LairWorks Entertainment
Re: Avoiding Race Conditions in Asynchronous Code
« Reply #2 on: October 23, 2018, 09:30:07 AM »
I'm curious, when you show and hide windows, do you create and destroy the objects associated with them, or do you have multiple windows in memory, with one marked as active?

They're resident in memory and behave according to finite states (e.g., visible, enabled, has_focus... actually I think those are the only three). Anyway, destroying them between show/hide calls would be extremely inefficient. Memory usage is negligible for these controls except those that use larger resources (like images).

I've noticed connect and disconnect calls, though I mostly saw it associated with constructors and destructors, so I assume it is tied to the lifetime of the UI objects. I suppose this may depend on the UI object in question.

Right. This is part of the solution -- connect them at construction and disconnect them during destruction and respond to events based on state.

It looks like the events are all dispatched from a single global source, and delivered directly to the listener without any routing.

Yes and yes.

I was thinking, what if you had routing of UI events. Instead of a button listening for global UI events, it could listen to events from its container, such as the window it is part of. The window could get it's events from upstream, either listening to the global object, or some other container objects. When an event is fired, it can be sent to a listening window, which can in turn deliver it to a listening button. If the window is not active, it would not route events, and so you wouldn't need to worry about connecting or disconnecting buttons from a global event source.

The first time I built GUI code years and years and years ago this is exactly how I did it. I passed an actual Event structure which had a 'consumed' field. The problem? This gets really complicated really fast and gets difficult to manage. So I opted for something much simpler. Sure, it's not as efficient but iterating over a list of 10 objects and having those objects route accordingly is much harder to get right than just iterating over a list of 60 or 70 objects. And since the iteration only happens during actual events the loss in efficiency isn't really noticed (especially on modern machines running at 2.5+ ghz vs. 800 mhz from back in the 90's where this really would have mattered).

In terms of the checks done in every listener routine, you could potentially move those up to the parent object. If a window had a concept of a visible control, at a defined location, then it would know if a message was potentially of interest based on the control's location, visibility, and enabled status. The checks then become a part of message routing, rather than something every handler has to take care of.

The UIContainer control does this with mouse motion events and setting focus for controls. Controls by default set themselves to have focus when constructed. Only when they're contained in a UIContainer does that change. The master factory view demonstrates this with the list box outlines changing color when they gain focus.

For the full screen UI panels, I assume you mean the UI never really closes, it's just hidden from view, but still active in memory. If you had a list of such full screen panels, the container of all those panels would decide which one to route messages to. Then there's no need to connect or disconnect anything.

Right, this would make sense and is sort of how it currently works in terms of setting visibility of the panels and handling focus. But, as with the above, the routing code can get really complicated really fast and is difficult to get right. Regardless, you're not wrong. It would probably be a better design but it would require a rewrite of the GUI code, something I'm not really prepared to do at this time. It's a case of "push it out, it's good enough". I can worry about elegant design later. :D



As for terminology, I'd say it's not a race condition because the code is not truly asynchronous. The updates occur in the code called by emit, and is hence synchronous. To get a race condition, you need asynchronous operations, which could happen with multithreading, or with hardware level events (things that happen outside the current thread of the CPU).

By that light, semaphores don't make sense, as they are for signalling between different threads. You don't really have separate threads though. I'm tempted to suggest the term Coroutine, which is used in non-preemptive cooperative multitasking, though I'm not sure its use is quite appropriate here. A coroutine can typically run multiple times, passing control back and forth, and so may need it's own stack, much like a thread. What you have here is a simple subroutine. When the call done by emit finishes and returns, that's it, that method is done.

You're right that you have a problem with updates to a shared resource. I'd be hesitant to use the term critical section outside the realm of threading though. Especially since it generally implies a locking mechanism, which makes no sense in a single threaded context.

I will admit the problem is related though, with some of the terms rather close in meaning, but perhaps a little too specific so as to not apply. Perhaps there is a more general term for this problem, which encompasses both the multi threaded case, and single threaded case. Nothing coming to mind though.

I suppose? But I'm still hung up on the need to go with the common definition of a CompSci semaphore (for instance) versus the metaphor that it's intended to be. Let's look at the real-world definition of a semaphore:

Quote
noun
  • a system of sending messages by holding the arms or two flags or poles in certain positions according to an alphabetic code.
  • an apparatus for signaling, consisting of an upright with movable parts.

verb
  • send (a message) by semaphore or by signals resembling semaphore.
    "Josh stands facing the rear and semaphoring the driver's intentions"

Does this not exactly fit what I'm doing with the deferred insertions? The 'semaphore' that I'm using is literally a signal to incoming code calls to prevent a crash. That variable acts as the flag man. Same thing, no?

As for the race condition, I still think it perfectly defines the problem here -- two piece of code that, if not careful of eachother, can corrupt a shared resource. The list of slots is the shared resource and modifying it during an iteration loop creates all kinds of problems.

I mean, I guess from the standpoint of not having multiple definitions for the same term it makes sense to have 'race condition' and 'semaphore' have only one well defined meaning. I'm hesitant to invent new terminology though because do we really need Yet Another BuzzwordTM?

Hrmm... this is a more complex problem than I anticipated... even moreso than the issue we're talking about in OPHD.  :o

Offline Hooman

  • Administrator
  • Hero Member
  • *****
  • Posts: 4954
Re: Avoiding Race Conditions in Asynchronous Code
« Reply #3 on: October 27, 2018, 11:06:49 PM »
How does it defer insertions? Does it use a secondary list when the primary list is being iterated over?

Offline leeor_net

  • Administrator
  • Hero Member
  • *****
  • Posts: 2350
  • OPHD Lead Developer
    • LairWorks Entertainment
Re: Avoiding Race Conditions in Asynchronous Code
« Reply #4 on: October 27, 2018, 11:23:36 PM »
Yes, that's how the Structure and Robot lists do it. A 'deferred' list.