Multi-user time based simulation - deploying new version without downtime

Question

I am making a multiplayer strategy simulation game. The game runs in turns of fixed duration, e.g. 1 minute.

For every user, there is a set of state variables that can change every turn, e.g. amount of worker units assigned to gathering food, amount of food already gathered, etc.
The value of each variable in a turn is calculated from the values in the previous turn, e.g. food_in_turn3 = food_in_turn2 + food_workers_turn2 * food_gathered_per_worker.
The user can tweak some of these variables, e.g. how many workers are assigned to gathering food.

The state variables for each user can be saved in the DB every turn, or can be lazily calculated on user request, and only stored to DB when this lazy calculation happens. E.g. when the user's last known turn is turn 5, and the request is asking for turn 10, we do calculations for 5 turns and return the last values, the ones for turn 10. Right now I have the lazy solution implemented.
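To make the lazy approach concrete, here is a minimal Python sketch of it, using only the food example from above. The names `UserState`, `advance_one_turn` and `catch_up` are made up for illustration, and the real calculation is far more involved than this.

```python
from dataclasses import dataclass

# Value taken from the example in the question; the real rules are more complex.
FOOD_GATHERED_PER_WORKER = 1


@dataclass(frozen=True)
class UserState:
    turn: int          # last turn for which this state is known
    food: int          # food gathered so far
    food_workers: int  # workers currently assigned to gathering food


def advance_one_turn(state: UserState) -> UserState:
    """Compute the state for the next turn from the previous turn's state."""
    return UserState(
        turn=state.turn + 1,
        food=state.food + state.food_workers * FOOD_GATHERED_PER_WORKER,
        food_workers=state.food_workers,
    )


def catch_up(state: UserState, requested_turn: int) -> UserState:
    """Lazily replay every missing turn up to and including requested_turn."""
    while state.turn < requested_turn:
        state = advance_one_turn(state)
    return state
```

With a last known state at turn 5 and a request for turn 10, `catch_up` runs five turns and returns the state for turn 10, which is the value that would then be stored in the DB.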

Let's say I want to deploy a new version of the service, where the calculation of the state in a new turn has changed. E.g. food_gathered_per_worker was changed from 1 to 2 in the new version. I would like this update to come into effect for all users on the same turn, otherwise it can be unfair for some users.

How do I handle deploying this without downtime? The deployment is just docker in kubernetes pods.

Some simple solutions I considered that would not work:

1. If I just deploy the new service and shut down the old one, the lazy calculation would start from different points for different users whenever there's a new request. E.g. user1's last state was at 10:30, user2's last state was at 10:40, new service deployed at 10:50 --> user1 would have 20 minutes (10:30-10:50) of turns lazily calculated with the new service, while user2 would have 10 minutes (10:40-10:50). Problem: The difference in calculation for the period 10:30-10:40 would be unfair to one of the users.
2. Update the state for every user up to the last game turn that should run with the old version. Start deploying the new version and pause the simulation until deployment is done. Problem: this would be downtime.
3. Whenever I want to deploy, calculate the state for every user for several turns into the future, until a predefined time of switching to the new service, e.g. 11:00. Deploy the new version, with the new calculation configured to take effect at 11:00. Until then, return the pre-calculated state. Problem: If the user wants to tweak something during this pre-calculated period, the state for all the following turns until 11:00 would need to be recalculated. However, the old version of the service, which should be used here, might have already been shut down and replaced by the new one. Recalculating with the new simulation service would bring us back to the first case, where results would be unfair for some users.

The example with food gathering is a simplification - the actual calculations are non-trivial, and would be too complicated to store externally and load dynamically while the service is running.

Possible solution I see:

Start from idea 3 above, and keep both the old and new version up. Have a smart gateway/load-balancer service that notices two versions are running, and redirects incoming requests from users correctly during the transition time. If the user is tweaking state variables, redirect to the old service and recalculate future states until the switching time of 11:00. When we pass the switch time of 11:00, the old version can be shut down and all user requests will go to the new version of the simulation.

Am I overengineering this? Is there a simpler solution I'm missing?

DavidT · Accepted Answer · 2024-09-16 16:15:48Z

4

You have two different calculations. Per your example, the difference may be as simple as changing the value of a constant, or it may be arbitrarily complicated.

The new version of your application needs to be able to perform either calculation. Further, it needs to be able to combine them. For example, if you have a cut-over time of 11:00 and a user requests an update at some point after 11:00, you may need to apply the old logic for all turns before 11:00, then switch to the new logic for any turns after 11:00.
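A rough Python sketch of what "combining" the two calculations could look like, assuming turns are numbered and the cut-over is expressed as a turn number. `CUTOVER_TURN`, `UserState`, `food_per_worker` and `catch_up` are hypothetical names, and the rates 1 and 2 are just the ones from the question's example.

```python
from dataclasses import dataclass

# Hypothetical: the first turn that is computed with the new rules.
CUTOVER_TURN = 10


@dataclass(frozen=True)
class UserState:
    turn: int          # last turn for which this state is known
    food: int
    food_workers: int


def food_per_worker(turn: int) -> int:
    """Pick the rule by the turn being computed, not by the request time."""
    return 2 if turn >= CUTOVER_TURN else 1  # new rate : old rate


def advance_one_turn(state: UserState) -> UserState:
    next_turn = state.turn + 1
    gathered = state.food_workers * food_per_worker(next_turn)
    return UserState(next_turn, state.food + gathered, state.food_workers)


def catch_up(state: UserState, requested_turn: int) -> UserState:
    """Replay missing turns; a single request may cross the cut-over."""
    while state.turn < requested_turn:
        state = advance_one_turn(state)
    return state
```

Because the rule is chosen per turn, two users whose last stored states are at different turns still end up with the same result for the same history, regardless of when each one sends a request.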

With respect to deploying the change: if you are certain that you can deploy the new version of the app (removing all old instances) before 10:30, it may be as simple as hard-coding a cut-over time of 11:00 into the new version. That way no user sees any change in functionality until 11:00, and then every user sees the change.

If you want more flexibility with respect to when the cut-over occurs, you can create or integrate some dynamic configuration loading / feature flagging system. The critical point is that you need the cut-over time as a configuration variable, so that you can ensure the value is replicated to all instances before it needs to take effect.
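As a sketch of the simplest form of this, the cut-over could be read from an environment variable. The variable name `SIM_CUTOVER_TURN` and both function names are assumptions made for this example; a real feature-flag system would replace `load_cutover_turn`, but the decision function would stay the same.

```python
import os
from typing import Mapping, Optional


def load_cutover_turn(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Read the cut-over turn from configuration; None means no cut-over is set.

    SIM_CUTOVER_TURN is a hypothetical variable name chosen for this sketch.
    """
    raw = env.get("SIM_CUTOVER_TURN")
    return int(raw) if raw else None


def use_new_rules(turn: int, cutover_turn: Optional[int]) -> bool:
    """The new calculation applies from the cut-over turn onwards."""
    return cutover_turn is not None and turn >= cutover_turn
```

The important property is that the value reaches every instance before the cut-over turn arrives. If some instance still has no value (or a different one) at that point, users served by it would get a different calculation for the same turn.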

TL;DR - the problem you describe is simply a specialized version of backwards compatibility - the solution is just to maintain at least one prior version of the logic in future versions so that you can cut over at an arbitrary point.


+1. Let me add that such a deployment can be done in two steps: one where both calculations are integrated into the service (and can be switched on and off by some flag), and a second where the deprecated calculation is removed from the code, since it is not needed anymore (and should not pile up as waste).

– Doc Brown, 2024-09-16 19:43

This sounds like the "possible solution" I wrote, except with both versions of the calculation put into one service instead of a separate service for each of the two calculations existing side by side. Indeed this sounds good - I avoided this idea, thinking that the service should work without needing to have knowledge of deployment, but calling it a "specialized version of backward compatibility" puts it in the right perspective.

– devil0150, 2024-09-16 19:57

The "specialization" is the fact that a single client request may need to use both old and new logic, if there are multiple turns to be processed that cross the cut-over boundary. With two different application versions deployed, neither version can handle such a request.

– DavidT, 2024-09-16 20:25
