I've been discussing with Govert how to handle wedged games more cleanly. We have had multiple games generate over 20 GB of logfiles, at least partly due to a broker sending huge numbers (10^8 or more) of HourlyCharge messages. This slows the server to a crawl and risks filling the disk.
The problem is that if we kill a server process, the brokers don't know about it and don't recover. They will either get an exception or not, but in most cases they will neither exit nor proceed to the next session. They are effectively hung at this point. We put a lot of effort into making the login process resilient and robust, but had not worked on this issue.
For the short term, I have implemented a small feature in the server that will allow us to abort a game early in a way that looks to brokers like a normal end-of-game. It sends the sim-end message, sends out the results, etc. We think this will take care of almost all of the problem.
For the longer term, brokers need to be somewhat more robust to failed servers or failed communications. We are discussing some sort of heartbeat mechanism, but it will require an update to broker code, and we don't want to impose that on you right now.
So I have a couple of questions:
1. If we were to come up with an update to the broker that times out in a configurable way and goes on to the next session (or quits, if there are no more sessions), how many of you would be interested in implementing it now? Basically, if you are using the current sample broker core package unchanged, it should be a simple update for you.
2. Do any of you have alternative ideas for making the broker/server connection more robust in these situations?
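To make question 1 concrete, the timeout could look something like the sketch below. This is only an illustration: the class SessionWatchdog, its method names, and the idea of treating any incoming server message as a sign of life are assumptions, not code from the current sample broker.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical watchdog for a broker session. It records the time of the
 * last message received from the server and reports when the server has
 * been silent for longer than a configurable timeout.
 */
final class SessionWatchdog {
    private final long timeoutMillis;   // configurable silence limit
    private final LongSupplier clock;   // time source, injected for testing
    private long lastMessageMillis;     // time of the most recent message

    SessionWatchdog(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastMessageMillis = clock.getAsLong();
    }

    /** Call whenever any message arrives from the server. */
    synchronized void messageReceived() {
        lastMessageMillis = clock.getAsLong();
    }

    /** True once the server has been silent for longer than the timeout. */
    synchronized boolean isExpired() {
        return clock.getAsLong() - lastMessageMillis > timeoutMillis;
    }
}
```

A broker would call messageReceived() from its message-handling path and check isExpired() periodically, for example with System::currentTimeMillis as the clock. On expiry it would treat the session as over and either go on to the next session or quit. The clock is injected only so the timeout can be exercised without actually waiting.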
Of course, we hope not to have to abort any games, and it's reasonable to expect that games will run and end normally once broker problems like the HourlyCharge flood are resolved. One bit of good news is that the login problems that plagued us in June seem to be resolved.