Friday, September 9, 2011

“Long Running Processes”, “Asynchronous Communication”, and “Everything In the Database”

This blog post comes out of technical design considerations for the Market game, specifically interoperability (multiple ways of interacting with the core services), security (in particular, application-layer responsibilities), and scalability.

Security/Design Concerns
The problem was initially raised in relation to executing long-running processes in the Service layer (in my case, specifically the "production process"), but it morphed into a major design and architecture rethink.

The production process is the process by which an item is created - consuming ingredients and taking a specified amount of time to complete.
I was working on an AI process that would continually run item production, and initially had a simple method for production:

- Choose what to create
- Determine how long it would take
- Start a timer (short)
- On timer completion, create the item, start the timer again
- Start a timer (long)
- On timer completion, stop production and choose something else
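
For illustration, that client-side loop looked roughly like the sketch below. This is a reconstruction rather than actual game code; the class and member names (AiProducer, ChooseItemToCreate, CreateItem) and the intervals are invented.

```csharp
using System.Timers;

// Rough sketch of the original client-driven loop. All names and values here
// are invented for illustration - this is not actual game code.
public class AiProducer
{
    private readonly Timer buildTimer = new Timer();    // short timer: one item build
    private readonly Timer switchTimer = new Timer();   // long timer: change what is built
    private string currentItem;

    public void Start()
    {
        currentItem = ChooseItemToCreate();

        buildTimer.Interval = GetBuildTimeMs(currentItem);
        buildTimer.AutoReset = false;
        buildTimer.Elapsed += delegate
        {
            CreateItem(currentItem);   // the CLIENT creates the item - this is the security hole
            buildTimer.Start();        // and immediately starts the next build
        };

        switchTimer.Interval = 60000;  // after a while, stop and choose something else
        switchTimer.Elapsed += delegate
        {
            buildTimer.Stop();
            currentItem = ChooseItemToCreate();
            buildTimer.Interval = GetBuildTimeMs(currentItem);
            buildTimer.Start();
        };

        buildTimer.Start();
        switchTimer.Start();
    }

    private string ChooseItemToCreate() { return "IronSword"; }   // placeholder AI decision
    private double GetBuildTimeMs(string item) { return 5000; }   // placeholder build time
    private void CreateItem(string item) { /* writes straight into game state */ }
}
```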

The issue here is that the AI process (the client) is responsible for the production process, which would be a very bad thing to leave up to a client. This is a fundamental security issue, not only for games (where we don't want cheaters) but also for general business processes (we don't want the client to run banking transactions).

The obvious fact here is that the service should be performing the production of an item; the client should request to 'produceAnItem', which performs the steps to create an item, including waiting the correct amount of 'build time'. The AI client can then worry about the 'big picture' which is specific to its own processing (choosing what to build). By doing this in our service method we are relying on either a blocking call to the service method, or a callback/event to the client when the action is complete.
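
As a sketch only, the 'obvious' service-side version might look like this. The method and helper names are assumptions, and the blocking Sleep stands in for waiting out the build time:

```csharp
using System.Threading;

// Hypothetical blocking form of the service method. The server now owns the
// rules (ingredients, build time, item creation), but the caller is stuck
// waiting - or needs a callback - until the build finishes. All names here
// are assumptions for the sketch.
public class ProductionService
{
    public Item ProduceAnItem(int playerId, string itemType)
    {
        ConsumeIngredients(playerId, itemType);       // server-side validation and cost
        int buildTimeMs = GetBuildTimeMs(itemType);   // server decides how long it takes
        Thread.Sleep(buildTimeMs);                    // 'waiting the build time' blocks the call
        return CreateItem(playerId, itemType);        // the item only exists once the server says so
    }

    private void ConsumeIngredients(int playerId, string itemType) { /* check and deduct */ }
    private int GetBuildTimeMs(string itemType) { return 5000; }
    private Item CreateItem(int playerId, string itemType) { return new Item { Type = itemType }; }
}

public class Item
{
    public string Type { get; set; }
}
```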

Asynchronous Issues
This works fine for 'connected' systems, but asynchronous, request/response systems such as WCF or ASP.NET will not be able to run a service method designed this way. For example, using WCF to process this request means that our WCF call will either block until complete, meaning the service method could time out; or the WCF call will complete immediately, but when the callback/event fires we have no communication channel back to inform the client.
WCF can work around this by using duplex communication, but this is limited to WCF and further limited to the full .NET Framework (i.e. no Silverlight/WP7 support), so this cannot be used (it is also unreliable).

Polling and Feedback
A generally accepted solution then is to start the process in the service method and have the client check back to see if the process is complete. While this can be bandwidth inefficient if your polling frequency is too high, it is a reliable solution. Polling also solves one of the key issues with long-running processes: progress reporting. Each time the client checks for completion, the server can respond with the current progress information.
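
A minimal sketch of what that looks like from the client's side, assuming a hypothetical CheckProduction operation that returns both progress and completion:

```csharp
using System;
using System.Threading;

// Hypothetical polling loop on the client. IProductionService is shown as a
// cut-down view with only the operation the poller needs; the names and the
// ProductionStatus shape are assumptions for this sketch.
public interface IProductionService
{
    ProductionStatus CheckProduction(Guid productionRunId);
}

public class ProductionStatus
{
    public bool IsComplete { get; set; }
    public int PercentComplete { get; set; }
}

public static class ProductionPoller
{
    public static void WaitForCompletion(IProductionService service, Guid productionRunId)
    {
        while (true)
        {
            ProductionStatus status = service.CheckProduction(productionRunId);
            Console.WriteLine("Production {0}: {1}% complete", productionRunId, status.PercentComplete);

            if (status.IsComplete)
                break;

            Thread.Sleep(5000);   // polling interval - too short wastes bandwidth, too long adds latency
        }
    }
}
```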

This then brings me to the "Everything in the Database" point. If we have a long-running process triggered by a WCF call or running on a background thread (i.e. anywhere other than a global static variable), then we cannot (easily) access that process to determine progress or completion. So while our service could be sitting there creating all these items, how does the client know that their particular run is completed? To support this we need to actually write a "ProductionRun" record into the database for the requested production; the long-running process can then update it, and we can read it back when the client asks for progress. Potentially more importantly, we can recover a production run after an application/server crash, as we have all the details persisted.
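
The persisted record does not need to be complicated. Something like the sketch below (the ProductionRun shape and property names are assumptions on my part) is enough to answer progress queries and to recover after a restart, because progress is derived from the stored times rather than from in-memory state:

```csharp
using System;

// Hypothetical shape of the persisted production run. Because the requested
// item, the start time and the calculated completion time are all stored,
// progress can be derived on demand and an interrupted run can be picked up
// again after a restart.
public class ProductionRun
{
    public Guid Id { get; set; }
    public int PlayerId { get; set; }
    public string ItemType { get; set; }
    public DateTime StartedUtc { get; set; }
    public DateTime CompletesUtc { get; set; }   // calculated up front by the service
    public bool IsComplete { get; set; }

    // Progress is a function of the stored times - no in-memory state required.
    public int PercentComplete(DateTime nowUtc)
    {
        if (IsComplete || nowUtc >= CompletesUtc)
            return 100;

        double total = (CompletesUtc - StartedUtc).TotalSeconds;
        if (total <= 0)
            return 100;

        double elapsed = (nowUtc - StartedUtc).TotalSeconds;
        return (int)(100 * elapsed / total);
    }
}
```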

OK, so we now have a working solution across any client type:

Client -> Request production run
Service -> Create production run record (including calculated completion time) and start the production timer
Client -> Check for completion/progress
Client -> Check for completion/progress
Service -> On production complete, mark the production run complete and do the item creation etc.
Client -> Check for completion/progress
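
Pulled together, the service side of that exchange might look something like the sketch below. The WCF attributes (ServiceContract/OperationContract) are real, but the operation names, the repository abstraction, and the reuse of the ProductionRun and ProductionStatus shapes from the earlier sketches are all assumptions:

```csharp
using System;
using System.ServiceModel;

// Hypothetical broker-style contract for the flow above: one call to start a
// run, one call the client can poll.
[ServiceContract]
public interface IProductionService
{
    [OperationContract]
    Guid RequestProductionRun(int playerId, string itemType);

    [OperationContract]
    ProductionStatus CheckProduction(Guid productionRunId);
}

// Assumed persistence abstraction - in practice a table plus simple queries.
public interface IProductionRunRepository
{
    void Save(ProductionRun run);
    ProductionRun Get(Guid productionRunId);
}

public class ProductionService : IProductionService
{
    private readonly IProductionRunRepository repository;

    public ProductionService(IProductionRunRepository repository)
    {
        this.repository = repository;
    }

    public Guid RequestProductionRun(int playerId, string itemType)
    {
        DateTime now = DateTime.UtcNow;
        ProductionRun run = new ProductionRun
        {
            Id = Guid.NewGuid(),
            PlayerId = playerId,
            ItemType = itemType,
            StartedUtc = now,
            CompletesUtc = now.AddSeconds(GetBuildTimeSeconds(itemType)),   // calculated completion time
            IsComplete = false
        };
        repository.Save(run);   // everything in the database
        return run.Id;          // the client polls with this id
    }

    public ProductionStatus CheckProduction(Guid productionRunId)
    {
        // A timer (or the worker process described below) marks the run
        // complete and creates the item; this call only reads the record.
        ProductionRun run = repository.Get(productionRunId);
        return new ProductionStatus
        {
            IsComplete = run.IsComplete,
            PercentComplete = run.PercentComplete(DateTime.UtcNow)
        };
    }

    private int GetBuildTimeSeconds(string itemType) { return 30; }   // placeholder build time
}
```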

Worker Processing / Message Bus
The above process can be modified slightly to reduce the responsibility of the "Service" and introduce a "Worker" system that performs the processing of actions. The Service method becomes a broker that writes requests to the database and returns responses to the client. A separate process, running on a separate system, then reads the requests from the database and performs the actions they describe. This allows for increased scalability and reliability, as we reduce the responsibility of the client-facing system and can use multiple workers to process these requests. This is essentially a Message Bus architecture, which is a proven and reliable architecture for highly scalable solutions, and taking the solution described above, implementing a Message Bus would not require a major application redesign.
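
A worker under this scheme can be as simple as a loop that picks up due production runs and completes them. The sketch below assumes the ProductionRun shape from earlier and a hypothetical, cut-down work-queue interface with a 'due but not yet complete' query; a real message bus or queue would replace the Sleep-based polling:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical worker process. It has no client-facing endpoints; it just
// reads pending requests from the database, does the work, and writes the
// results back. The interface members here are assumptions for this sketch.
public interface IProductionRunWorkQueue
{
    IEnumerable<ProductionRun> GetDueIncompleteRuns(DateTime nowUtc);
    void MarkComplete(Guid productionRunId);
}

public class ProductionWorker
{
    private readonly IProductionRunWorkQueue workQueue;

    public ProductionWorker(IProductionRunWorkQueue workQueue)
    {
        this.workQueue = workQueue;
    }

    public void Run()
    {
        while (true)
        {
            DateTime now = DateTime.UtcNow;
            foreach (ProductionRun run in workQueue.GetDueIncompleteRuns(now))
            {
                CreateItem(run.PlayerId, run.ItemType);   // the actual item creation
                workQueue.MarkComplete(run.Id);           // clients see this on their next poll
            }

            Thread.Sleep(1000);   // worker polling interval; a real message bus/queue removes this
        }
    }

    private void CreateItem(int playerId, string itemType) { /* write the item to the database */ }
}
```

Multiple workers can run this same loop against the same database, which is where the scalability comes from; the client-facing broker never gets any busier.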
