Hooman
suggested I make a separate thread for this. So that's what I'm doing as this is a rather complex problem with a few different solutions.
So first to explain. Contrary to popular belief race conditions aren't limited to multi-threaded code. While most often observed in multi-threaded code, single thread asynchronous code is just as capable of ending up in race conditions. That stated, what is a race condition? Wikipedia explains it like this:
Race conditions arise in software when an application depends on the sequence or timing of processes or threads for it to operate properly. As with electronics, there are critical race conditions that result in invalid execution and bugs. Critical race conditions often happen when the processes or threads depend on some shared state. Operations upon shared states are critical sections that must be mutually exclusive. Failure to obey this rule opens up the possibility of corrupting the shared state.
It's the last statement in this quote that's applicable to the issue I ran into with OutpostHD and the event handling.
In an effort to keep this as understandable as possible, I'm going to address only the core components and design paradigms used providing code examples with high level pseudocode... so if something is unclear let me know so I can improve on that. So let's dig into it.
OutpostHD uses an event system based on a paradigm called
Signals & Slots (sigslots). This is a design concept introduced by Qt and is a simple and effective implementation of the
Observer Pattern. Basically, it's a way to 'signal' to other objects that something has happened. There are other uses of signals but it's a very effective way to handle events like key strokes, button presses, mouse motion, window state changes, etc. in a GUI system which is exactly what OutpostHD does.
There are three common functions used in the implementation of a sigslots interface:
connect,
disconnect and
emit. The implication of what these functions do should be obvious;
connect an observer to a signal,
disconnect an observer from a signal and
emit a signal to observers. The low level details are unimportant.
OPHD's GUI Controls connect themselves to a number of signals (MouseDown, MouseUp, KeyDown, KeyUp and MoveMove). In many of the handler functions for these (the observers), there was code along these lines:
on MouseDown()
{
if (not visible or not enabled) return;
/* do stuff related to mouse down event here */
}
Every handler function needs some form of code like this. Suffice it to say this makes it very easy to make mistakes where controls are responding to events that they shouldn't be responding to.
In an effort to reduce boiler plate code like the above, I started experimenting with having controls connect and disconnect themselves from signals based on visibility, enabled and focus states. E.g., if not visible or not enabled or not has focus, disconnect. Connect for the inverse. Simple, right? Right. Here's where the race condition comes in.
In order to implement the full screen UI panels as described in
this form post, I built UI Panels that would be shown or hidden based on user input. Again, simple, right? So if I clicked on a Factory in the Map View, the Master Factory Report UI Panel would be displayed. Here's where the sigslots implementation in NAS2D fails and causes a race condition.
In NAS2D's implementation of sigslots, slots (or observers) are stored in a Signal as a list. The problem arises during the iteration over the list:
let iterator = slot_list.begin();
while(iterator is not equal to slot_list.end())
{
iterator.emit();
iterator = slot_list.next();
}
During this update loop, particularly in MouseDown, OPHD would set the Master Factory Report UI Panel to Visible. This would cause all of the controls in the UI Panel to
connect to their respective Signals. Because many of these controls would connect to the MouseDown signal, this would invalidate the iterator in the original loop and this is the race condition that existed and would cause a very immediate program termination. To visualize it:
The Yellow?Green (can't tell, colorblind) segment is the normal flow of the iteration process. The problem is during the emit phase of the process -- the handler in red causes an invalidation of the iterator in the yellow?green process. Instant program crash. A race condition caused by asynchronous code modifying the slot list during an iteration over the signal list. Whoops.
So there are two possible solutions here. The easy solution: don't modify the slot list when responding to emitted signals (e.g., never connect/disconnect an observer from an observer function). The harder but much better solution: lock the slots list during iteration and defer any changes until after the iteration is complete (aka,
semaphores).
Both options have their pros and cons. The first solution is a lot faster and easier to implement... but leads to more flaggy code (e.g.,
if (not visible or not enabled or not hasfocus) return). On the other hand, the second solution takes more time to implement properly and leads to more complex sigslots code... but avoids the race condition outlined here.
For OutpostHD, I opted for the first solution. I wanted to get through this problem and just move on with as little modification to the middleware as possible.
Interestingly enough, I had almost exactly the same problem early on in OPHD's development with the Robots and Structures. These could change states during an iteration over the list and could destroy itself or otherwise modify the list. This would lead to iterator invalidation and thus a crash. In both cases I went a semaphore solution -- lock the list, iterate over the list, defer modifications to the list until after the iteration is over.
Side note: Hooman objected to my use of the terms 'race condition' and 'semaphore' but according to all definitions I could find (that were not language or problem domain specific) these terms seem to be entirely accurate and appropriate for the issues described here. Discuss.