Technical explanation, take your time reading

qmmuta · August 24, 2011, 11:28

Original URL
http://altdevblogaday.com/2011/0 … -a-tale-of-desyncs/

Author
Forrest Smith
Software Engineer at Uber Entertainment, where he shipped Monday Night Combat on 360 and Steam. An ordinary programmer by day, but by night he… is also an ordinary programmer working on iPhone side projects.

Synchronous RTS Engines and a Tale of Desyncs
Forrest Smith 2:00 pm on July 9, 2011 39 Comments

Have you ever played a game like Starcraft or Supreme Commander and gotten an error message that says “Desync Detected” followed by the game closing? Do you wonder what that means? It stems from certain engine architectures commonly used by RTS games [1].

My experience in this area comes from working with the Supreme Commander engine at Gas Powered Games. Starcraft and Warcraft 3 have had desync bugs during beta periods so it’s safe to say they work in a similar manner. For simplicity’s sake I’m going to discuss the SupCom engine specifically from this point forward. Finding similarities between the SupCom engine and other games shall be left as an exercise for the reader.

Requirements

First things first, what are the requirements for our game? To help give an idea, here’s the announcement trailer for Supreme Commander 1 (2006).

It must support 8 player multiplayer, on the internet, with hundreds of units per army. That’s thousands of units in a single game. Holy crap. The typical FPS client-server architecture will clearly not work. With so many units it would require multiple orders of magnitude more bandwidth than is acceptable.

How can we accomplish this feat?
Synchronous Engine Architecture

With a fully synchronous lockstep architecture! In a synchronous engine every client executes the exact same code at the exact same frame rate. Let that sink in for a moment. In an 8 player game of SupCom every player has an identical copy of the game state and follows an identical code path. Instead of transferring per-unit state information (position, health, etc.) over the network, only player input needs to be sent [2]. If all players have an identical game state and process the same input then their output state should also be identical.
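To make the "same state plus same input gives same output" idea concrete, here is a minimal Python sketch. Everything in it (the `GameState` class, the `(unit_id, dx, dy)` command shape, integer positions) is a hypothetical stand-in for illustration, not SupCom's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    tick: int = 0
    # unit id -> (x, y); integers only, to keep the toy simulation exact
    positions: dict = field(default_factory=dict)


def apply_command(state, command):
    """Apply one hypothetical move command of the form (unit_id, dx, dy)."""
    unit_id, dx, dy = command
    x, y = state.positions.get(unit_id, (0, 0))
    state.positions[unit_id] = (x + dx, y + dy)


def sim_tick(state, commands):
    # Sort so every client applies the commands in the same order,
    # regardless of the order in which the packets arrived.
    for command in sorted(commands):
        apply_command(state, command)
    state.tick += 1


def run_client(inputs_per_tick):
    """Each client starts from the same state and sees the same inputs."""
    state = GameState()
    for commands in inputs_per_tick:
        sim_tick(state, commands)
    return state


# Only this small input stream would need to cross the network.
inputs = [[("tank", 1, 0)], [], [("tank", 0, 2), ("bot", 5, 5)]]
client_a = run_client(inputs)
client_b = run_client(inputs)
```

Both clients end up with equal state without ever exchanging positions, which is the whole point of the architecture.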

It’s the same principle as instant replays in many games, including shooters. Have you ever wondered why file sizes of instant replays are so tiny? It’s because the replay file only needs to store player inputs. Simply re-run the game feeding the inputs from the replay file and you’ll get the exact same result. This is also why replays stop working [3] when the game is patched and why you frequently can’t rewind them [4]. It’s also why some RTS games don’t allow join in progress. For a player to join mid-game the entire game state would have to be sync’d. If the game has 3000 units that’s just too much.
Layers

Take a look at the video if you haven’t already. What frame rate do you think the game is running at? The correct answer is 10 frames per second. Wait, what? It looks far smoother than 10 fps you say! It is and it isn’t. The game is actually running at two separate frame rates.

The SupCom engine uses two layers – simulation and user. The simulation layer runs at a fixed 10 fps all the time. This is what you could consider the “real game”. All units, all AI, and all physics are updating within a SimTick function running at a mere 10 Hz. Each SimTick needs to run within 100 ms or the game will play in slow motion. In a multiplayer game, if one player is unable to fully process the SimTick in 100 ms then all other players are held up and have to wait.

The user layer runs at full frame rate. This layer can be thought of as strictly visual. User interface, rendering, animation, and even displayed unit position can run at a silky smooth 60 fps. Each UserTick updates at a variable time delta which is used to interpolate the game state (such as unit positions). This is why the game can look and feel smooth when it’s secretly slow in the background.
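A common way to structure two such layers is a fixed-timestep loop with render interpolation. The sketch below is a generic illustration of that pattern, not SupCom's code: the 100 ms tick comes from the article, while the function names, the one-dimensional position, and the one-unit-per-tick movement are assumptions made for the example.

```python
SIM_TICK_MS = 100       # 10 Hz simulation, as described in the article
UNITS_PER_TICK = 1.0    # hypothetical unit speed: one distance unit per SimTick


def interpolate(prev_pos, curr_pos, alpha):
    """Blend between the last two simulated positions for display."""
    return prev_pos + (curr_pos - prev_pos) * alpha


def run(frame_times_ms):
    """Feed in variable render-frame durations; return what gets drawn."""
    prev_pos = curr_pos = 0.0
    accumulator_ms = 0
    rendered = []
    for frame_ms in frame_times_ms:
        accumulator_ms += frame_ms
        # Sim layer: advance in fixed 100 ms steps only.
        while accumulator_ms >= SIM_TICK_MS:
            prev_pos = curr_pos
            curr_pos += UNITS_PER_TICK
            accumulator_ms -= SIM_TICK_MS
        # User layer: draw somewhere between the last two sim states.
        alpha = accumulator_ms / SIM_TICK_MS
        rendered.append(interpolate(prev_pos, curr_pos, alpha))
    return rendered


frames = run([50, 50, 50, 50])
```

With four 50 ms frames the simulation ticks only twice, yet the drawn position moves on every frame after the first tick. Note that in this pattern the picture trails the simulation by up to one tick, which is the price of smooth interpolation.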
Determinism

Hold on a moment, the clever readers cry! If each player is independently updating the game state, does that mean the game simulation must be fully deterministic? It sure does. Isn’t that hard? Yep. It’s even more difficult in the modern world of multi-threading.

A lot of pain in creating a deterministic game stems from floating point numbers. Rather than cover this topic in-depth I refer readers to Glenn Fiedler’s fantastic post on the subject - Floating Point Determinism.

In the comments Elijah specifically discusses Supreme Commander. Setting the CPU to strictly follow the IEEE 754 standard gets the job done. It comes with a performance cost and the game can never perform an operation with an undefined result, but you shouldn’t be doing that anyways now should you?
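One small, well-known source of floating point trouble is that addition is not associative, so two builds or code paths that merely sum the same numbers in a different order can end up with different bits. The snippet below demonstrates that with ordinary Python floats (IEEE 754 doubles). The integer variant is included only for contrast and is not a claim about how SupCom stored its values.

```python
a, b, c = 0.1, 0.2, 0.3

# Same three numbers, different grouping.
left = (a + b) + c
right = a + (b + c)
floats_agree = left == right

# Contrast: the same quantities as integer thousandths are order-independent.
ia, ib, ic = 100, 200, 300
ints_agree = (ia + ib) + ic == ia + (ib + ic)
```

A difference in the last bit looks harmless, but in a lockstep simulation it is enough to make state hashes disagree.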
Inherent Latency

There are some distinct downsides to a synchronous multiplayer game. Aside from the complexity of creating a massive deterministic simulation there is also some required latency on input. I went over how each user in a multiplayer game is updating an identical game state and they only need to process input. This means for any new piece of input it can’t be processed until all clients agree on which tick to process it!

For example, three players – player A, B, and C – are all running SimTick 1. During this time Player A issues an attack command. The UI instantly flashes in response as UserTick updates at 60 Hz. In a single player game this attack command would be processed in SimTick 2 (0-100ms latency). However, all three players must execute the command during the same SimTick to get the same results. Instead of attempting to process the command on SimTick 2, Player A sends a network packet to Player B and Player C with data to execute on SimTick 4 (200-300ms latency). This gives enough time for all players to receive the command. The game may be forced to stall if input information is not received and/or acknowledged in some form. I don’t know what that mechanism was exactly for SupCom, but I’ll update this post if I can find out. The exact number of SimTicks into the future to execute a command can be dynamically determined based on the peer-to-peer topology [5].
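The scheduling idea in this example can be sketched in a few lines of Python. The three-tick delay mirrors the example above (issued during SimTick 1, executed on SimTick 4). The `CommandSchedule` class and its method names are hypothetical, and a real engine might choose the delay dynamically, as noted.

```python
from collections import defaultdict

COMMAND_DELAY_TICKS = 3  # assumption taken from the example: tick 1 -> tick 4


class CommandSchedule:
    """Hypothetical per-client queue of commands keyed by execution tick."""

    def __init__(self):
        self.by_tick = defaultdict(list)

    def issue(self, current_tick, player, command):
        # Stamp the command with a future tick; this stamped command is
        # what would be sent to every other player.
        execute_tick = current_tick + COMMAND_DELAY_TICKS
        self.by_tick[execute_tick].append((player, command))
        return execute_tick

    def commands_for(self, tick):
        # Sorted so every client processes the tick's commands identically.
        return sorted(self.by_tick.pop(tick, []))


schedule = CommandSchedule()
stamped_tick = schedule.issue(1, "A", "attack")
nothing_yet = schedule.commands_for(2)
on_time = schedule.commands_for(4)
```

The command does nothing on SimTick 2 or 3 and takes effect on SimTick 4 on every machine at once.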

The latency from player click to unit response is always going to be at least 0-100ms (the next SimTick). This can be masked in a few ways. The UI response, usually something flashing, is immediate. There is frequently an audio response as well. “My life for Aiur!” “Zug Zug”

In a single-player game this is fine, but in multiplayer it can become noticeable as the delay is likely several hundred milliseconds. I always wanted to experiment with immediate UserTick animation responses. For example, if you issue a move command the unit could start slowly moving in the user layer immediately and then blend into its “true” sim layer location when the command actually executes. This would be extra helpful to more twitchy games such as Demigod or DOTA. There are some pretty ugly edge cases to handle though so I’ve never had the chance. I’d love to hear what other folks have done in the comments.
Desyncs – The Bugs from Hell

One of the most vile bugs in the universe is the desync bug. They’re mean sons of bitches. The grand assumption to the entire engine architecture is all players being fully synchronous. What happens if they aren’t? What if the simulations diverge? Chaos. Anger. Suffering. [6]

In SupCom the entire game state is hashed once per second. If any clients disagree on the hash that’s it. Game over. The end. An error box pops up that says “Desync Detected” and you have to quit. Something in their SimTicks varied and now the games are different. They have diverged and they will only get further apart from this point on. There is no recovery mechanism.
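A rough sketch of hash-based desync detection is shown below. The article does not say which hash function or state layout SupCom uses, so SHA-256 and the `unit id -> (x, y, health)` dictionary here are assumptions chosen for illustration.

```python
import hashlib


def hash_state(units):
    """Hash a toy game state: dict of unit id -> (x, y, health), all ints."""
    digest = hashlib.sha256()
    # Walk units in sorted order so the hash never depends on dict ordering.
    for unit_id in sorted(units):
        x, y, health = units[unit_id]
        digest.update(f"{unit_id}:{x}:{y}:{health};".encode("utf-8"))
    return digest.hexdigest()


def detect_desync(hashes_by_player):
    """True if any two players report different hashes."""
    return len(set(hashes_by_player.values())) > 1


state_a = {"tank": (1, 2, 100), "bot": (5, 5, 40)}
state_b = {"bot": (5, 5, 40), "tank": (1, 2, 100)}   # same state, other order
state_c = {"tank": (1, 3, 100), "bot": (5, 5, 40)}   # tank is off by one

in_sync = detect_desync({"A": hash_state(state_a), "B": hash_state(state_b)})
diverged = detect_desync({"A": hash_state(state_a), "C": hash_state(state_c)})
```

Only the short hash needs to be exchanged, so the check costs almost no bandwidth compared to sending state.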

Desyncs are usually programmer error. A given desync may repro in only 5% of games lasting more than 60 minutes. Fixing a desync generally involves a binary search of printf-ing the current memory hash as the state is walked. On low repro-rate desyncs this usually leads to a massive spamming of the hash while a half dozen machines loop the game as fast as they can waiting for it to break. Adding insult to injury, one of the most common cases is an uninitialized variable.
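The binary search over hashed state might look something like the following. This is a simplified sketch under strong assumptions: both machines have dumped their state as flat lists of equal length, and the function names are made up. In practice the comparison happens across machines through logged hashes rather than inside one process.

```python
import hashlib


def section_hash(values, lo, hi):
    """Hash one slice of a flattened state dump."""
    return hashlib.sha256(repr(values[lo:hi]).encode("utf-8")).hexdigest()


def first_difference(state_a, state_b):
    """Return the index of the first differing entry, or None if identical.

    Assumes both dumps have the same length.
    """
    lo, hi = 0, len(state_a)
    if section_hash(state_a, lo, hi) == section_hash(state_b, lo, hi):
        return None
    # Invariant: the first difference lies somewhere in [lo, hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if section_hash(state_a, lo, mid) != section_hash(state_b, lo, mid):
            hi = mid    # the left half already disagrees
        else:
            lo = mid    # left half matches, so look right
    return lo


machine_1 = list(range(16))
machine_2 = list(machine_1)
machine_2[11] = 99      # one value has drifted on the second machine
culprit = first_difference(machine_1, machine_2)
```

Each round halves the region to inspect, so even a very large state needs only a handful of hash comparisons to pin down the first entry that differs.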
A Demigod Tale

A lot of my work with the SupCom engine was actually on Demigod, which used a modified version of the engine.

Near the end of development a long-standing but highly infrequent desync was handed to me. In Demigod there are dozens of tiny fodder that run across the map. On extremely rare occasion the location of a fodder would vary by a few centimeters on different machines. Sounds innocuous, but like the wings of a butterfly a hurricane of woe can ensue.

I distinctly recall not being sure I could fix it and my lead saying “I’m trusting you to get this fixed. I know you will.” Geez, no pressure right? Every morning we had a 10am stand up and every day my response was the same – “desync hunting.” After almost a week of spiraling into madness I found the issue. If you watched the trailer you’ll notice some hero abilities that knock fodder into the air. When the giant walking castle smashes his hammer down all the fodder go flying. The bug was a pointer to a steering based pathfinding component that dangled until the fodder crashed into the ground and disappeared.

For a desync to occur it wasn’t just that simple. First the fodder had to be killed by one of only a few special abilities. This deleted the pathing component and dangled a pointer. With our memory allocator the deleted component’s memory block was simply moved to a free list, but otherwise unchanged. Then, before the fodder landed, a new memory allocation needed to occur. That allocation needed to be handed the same memory block used by our just-deleted pathing component. Then and only then would a desync occur. Appropriately setting the pointer to NULL fixed the issue.
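The chain of events is easier to follow with a toy model. Python has no raw pointers, so the sketch below fakes one with a slot index into a list, and the `FreeListAllocator` class is a hypothetical stand-in for the engine's allocator. It only illustrates why the stale read looks fine until the block is reused.

```python
class FreeListAllocator:
    """Toy allocator: released blocks go on a free list, contents untouched."""

    def __init__(self, size):
        self.memory = [None] * size
        # Reversed so the lowest slot is handed out first.
        self.free = list(range(size - 1, -1, -1))

    def allocate(self, value):
        slot = self.free.pop()
        self.memory[slot] = value
        return slot

    def release(self, slot):
        # Like the allocator in the story: the block is only marked free.
        self.free.append(slot)


heap = FreeListAllocator(4)
pathing = heap.allocate("pathing: steer north")
dangling = pathing                    # the fodder keeps this "pointer"

heap.release(pathing)                 # fodder is killed, component deleted
stale_read = heap.memory[dangling]    # old data is still there: looks fine

heap.allocate("projectile data")      # an unrelated allocation reuses the block
reused_read = heap.memory[dangling]   # now the fodder reads someone else's data

dangling = None                       # the fix: clear the reference on delete
```

As long as nothing reuses the block, the bug is invisible, which matches how rarely the desync showed up.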
Final Thoughts

This has been a very brief overview of a synchronous engine as used by Supreme Commander. I know many older games have worked in a similar manner. The latest generation may be much fancier, particularly in terms of handling input latency. I know that Starcraft 2 can desync so it’s likely similar. Other games to look at would be Heroes of Newerth or League of Legends. They aren’t nearly as massive as SupCom and feel highly responsive but I haven’t investigated in depth to see what clever tricks they pull.

If anyone has a good desync war story please share in the comments.
Footnotes
[1] Halo actually uses a synchronous lockstep model for Campaign co-op and Firefight.
[2] In SupCom input is handled as commands to groups of units. Commands to move, attack, defend, use ability, etc.
[3] Old replays can be supported if you can run the old game code and data.
[4] Rewinding was accomplished in Halo replays by storing “save points” which store the game state. You can’t smoothly rewind, but you can jump to any previous save points and play forward. I think.
[5] SupCom uses a fully peer-to-peer networking system.
[6] But no force lightning unfortunately.

qmmuta · August 24, 2011, 11:29

http://altdevblogaday.com/2011/07/24/synchronous-rts-engines-2-sync-harder/

Synchronous RTS Engines 2: Sync Harder
Forrest Smith 3:00 pm on July 24, 2011 10 Comments

This is a follow-up to my last post, Synchronous RTS Engines and a Tale of Desyncs, which, much to my surprise, was quite popular! It’s received at least four times as many views as all my other posts combined. Even better were the numerous user comments in the post itself, on Hacker News, and Reddit. If it generated interesting discussion elsewhere please let me know as I’d love to see more reader comments.

For Part 2 I have two goals. First, to answer questions that frequently came up in the various comments sections. Second, to share some of the fantastic replies that came from various readers.
Recap

Before all that, let’s recap key points from the first post. If you haven’t read it yet I strongly recommend doing so. My experience with synchronous engines is from working with the Supreme Commander engine at Gas Powered Games so that’s what I talked about.
All clients in a Supreme Commander game run fully in sync.
Multiplayer games are fully peer-to-peer
There are two layers, Simulation and User.
Sim Layer runs at a fixed 10 fps.
Sim Layer includes everything important – physics, ai, movement, etc.
User Layer runs freely up to 60 fps.
User Layer includes UI, animation, and other strictly visual elements.
All clients run 100% identical simulation layers.
Only input commands are sent over the network.
The game is fully deterministic to allow each client to update based only on input.
If that’s too much to keep in the front lobes at once then I recommend re-reading the original post.
Command Messages

One area I glossed over was how command messages are handled across the network. Luckily, I work with the person who wrote it, William Howe-Lott, so I asked for details! Player input is handled in the form of commands to groups of units. The command (attack, move, stop, etc) for a group of units is bundled up and sent across the network to all other players. The message defines a Sim Tick slightly in the future to execute it on due to the synchronous engine.

SupCom uses what I shall unscientifically call the AckAck method [1]. Assume four players A, B, C, and D. Player A issues an attack command sometime during Sim Tick 7. Player A sends the command to player B, C, and D to execute on Sim Tick 11. Player B sends an acknowledgement, an Ack, to player A to say that they got it. Player B also sends a message to players C and D – an AckAck for lack of a better term – to tell them that they got player A’s input. Players C and D do the same thing. Each player will only process a command when every player in the game has acknowledged that they have it. For example player B will not execute the player A command until they know that player C and player D also have the command. It’s a lot of messages, but it works and it shipped.
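The bookkeeping each player has to do under this scheme can be sketched as below. This is only an interpretation of the description above: the `Peer` class, its method names, and the string command id are all hypothetical, and message transport is left out entirely.

```python
class Peer:
    """Hypothetical per-player record of who is known to hold each command."""

    def __init__(self, name, all_players):
        self.name = name
        self.all_players = set(all_players)
        self.holders = {}  # command id -> set of players known to have it

    def receive_command(self, command_id, sender):
        # The sender has the command, and now so does this peer.
        self.holders.setdefault(command_id, set()).update({sender, self.name})

    def receive_ack(self, command_id, from_player):
        # An Ack (or AckAck) tells this peer that from_player has it too.
        self.holders.setdefault(command_id, set()).add(from_player)

    def can_execute(self, command_id):
        # Only execute once every player in the game is known to hold it.
        return self.holders.get(command_id, set()) == self.all_players


player_b = Peer("B", ["A", "B", "C", "D"])
player_b.receive_command("attack@11", sender="A")
after_command = player_b.can_execute("attack@11")

player_b.receive_ack("attack@11", "C")
after_one_ack = player_b.can_execute("attack@11")

player_b.receive_ack("attack@11", "D")
after_all_acks = player_b.can_execute("attack@11")
```

Player B holds the command from the start but refuses to run it until it has heard from both C and D.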

An alternative method, and popular suggestion, would be to process a tick as soon as you have commands for all other players. There’d be no need to wait on Acks or AckAcks. As soon as Player B has messages from players A (attack), C (empty), and D (empty) it’s good to go. This could work, but has a nasty edge case. Imagine that player A sends an attack command for Sim Tick 11 to players B and C and then disconnects. Player D is stuck while players B and C processed Sim Tick 11 and moved on to 12. This is recoverable as player B or C could send player A’s command to player D. This would allow player D to execute Sim Tick 11 and then everyone can boot player A from the game and carry on. It can become pretty messy.
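The edge case is easy to reproduce with a tiny model of the alternative rule. The function name and the `(player, tick)` message tuples are assumptions made for this sketch.

```python
PLAYERS = ["A", "B", "C", "D"]


def ready_for_tick(received, me, tick):
    """Alternative rule: run the tick once a message (possibly an empty
    command) has arrived from every other player for that tick."""
    others = set(PLAYERS) - {me}
    return all((player, tick) in received for player in others)


# Player A sent its Sim Tick 11 command to B and C, then disconnected
# before the copy for D went out.
received_by_b = {("A", 11), ("C", 11), ("D", 11)}
received_by_d = {("B", 11), ("C", 11)}

b_proceeds = ready_for_tick(received_by_b, "B", 11)
d_proceeds = ready_for_tick(received_by_d, "D", 11)
```

B moves on to Sim Tick 12 while D is stuck waiting for a message that will never come unless B or C forwards it.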
Disconnect Recovery

Two of the most popular questions were on the topic of recovering from a game disconnect or from a desync. Let’s look at the disconnect case first. Say you are playing a skirmish with one or more friends and your internet connection blips out causing you to get booted. Can you rejoin? How can you get back in sync? There are two obvious approaches.

First, a full resync. Starting from scratch you need to receive a full copy of the entire state from the other players. This is rather similar to loading a save game file. In SupCom all weapons are simulated projectiles with physics. With multiple massive armies this can lead to managing on the order of 8000 entities. Some forum searching tells me that save files for SupCom 1 were in the 50 – 200 MB range uncompressed. Multiplayer games may be even larger. Compressed size is on the order of tens of megabytes. Keeping in mind 2006-era user upload speeds (upload, not download), the time to transfer state will average a few minutes, with many users taking up to half an hour if not longer.
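A quick back-of-the-envelope calculation shows how such numbers can come about. The 30 MB payload and the two upload speeds below are illustrative assumptions, not measurements from SupCom or figures from the article.

```python
def transfer_minutes(size_megabytes, upload_kbit_per_s):
    """Time to upload a payload, using 1 MB = 1,000,000 bytes."""
    bits = size_megabytes * 1_000_000 * 8
    seconds = bits / (upload_kbit_per_s * 1_000)
    return seconds / 60


# Assumed: 30 MB of compressed state, sent by a single peer.
fast_uplink = transfer_minutes(30, 1000)   # assumed 1 Mbit/s upload
slow_uplink = transfer_minutes(30, 128)    # assumed 128 kbit/s upload
```

Under these assumptions the fast uplink needs 4 minutes and the slow one a little over 31 minutes, which is in line with the range described above.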

Second, re-sim the game. This would be similar to playing a replay file. Reconnect to the server, receive all input commands, and run the game from scratch until you catch up. The input data is tiny so that’s good. You can also ignore the user layer and run the sim layer as fast as possible (keeping a per frame “time delta” of 100ms) which helps. Worst case scenario is still awful. For a game two hours deep it will take many users a full hour to resimulate.
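The worst case follows from simple arithmetic on the tick count. In the sketch below the 10 Hz tick rate is from the article, while the 20 ticks per second that a slower machine can sustain is an assumption picked to match the "full hour" figure. The sketch also ignores that the live game keeps advancing while the rejoining player catches up.

```python
SIM_HZ = 10  # SimTicks per second of game time, from the article


def resim_minutes(game_minutes, ticks_per_second):
    """Wall-clock minutes needed to replay a game of the given length."""
    total_ticks = game_minutes * 60 * SIM_HZ
    return total_ticks / ticks_per_second / 60


two_hour_ticks = 120 * 60 * SIM_HZ
# Assumed: this machine can only simulate at twice real time (20 ticks/s).
worst_case = resim_minutes(120, 20)
```

A two hour game is 72,000 SimTicks, so a machine that can only run the sim at double speed needs a full 60 minutes to replay it.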

A third option I haven’t put much thought into would be to keep the full game state when getting disconnected and use that as a base. I don’t think a delta-based resync from that point would be useful. Given the amount of data devoted to short lifespan projectiles and units that continuously move and die the savings wouldn’t be amazing. Running the sim from the disconnect point however would help a lot. The game could be saved on disconnect, reloaded on reconnect, and then only a few minutes of Sim Ticks to catch up on.

All of these methods however are far, far easier said than done. I have a question for the readers. How many RTS games have supported reconnections? I do not believe any Blizzard game has thus far. I’m sure others have, but I can’t name any.
Desync Recovery

Now to discuss desync recovery. It’s very much an extension of disconnect recovery so everything stated above applies. How could we implement it?

First thought is to find the offending bits of data and fix them. For a two user game it’s impossible to know which user has the correct state. For multiple users the incorrect user may be obvious. The issue with desyncs is that the bug is always there, it just shows up some of the time. If any user has desync’d it’s likely all users have as well. Worst case would be all users desync’d in a different manner. There is no correct state! Even worse would be two users who performed an undefined operation (using deleted memory) but did so in the same way! Technically they aren’t desync’d, but the game is possibly “wrong”. Desyncs are evil sons of bitches.
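The "who is wrong?" problem can be illustrated with a simple majority vote over state hashes. This is a hypothetical sketch, not something the article says SupCom does, and as noted above even a clear majority does not prove that the majority's state is correct.

```python
from collections import Counter


def find_outliers(hashes_by_player):
    """Return players who disagree with a strict majority hash,
    or None when no strict majority exists."""
    counts = Counter(hashes_by_player.values())
    top_hash, top_count = counts.most_common(1)[0]
    if top_count * 2 <= len(hashes_by_player):
        return None  # nobody can claim to hold the "correct" state
    return sorted(p for p, h in hashes_by_player.items() if h != top_hash)


two_players = find_outliers({"A": "x1", "B": "y2"})
one_bad = find_outliers({"A": "x1", "B": "x1", "C": "y2"})
all_different = find_outliers({"A": "x1", "B": "y2", "C": "z3"})
```

The two player game and the everyone-differs game both come back with no answer, which are exactly the unresolvable scenarios.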

At this point we’re quickly going down the rabbit hole. No matter what you do there will always be unresolvable scenarios. Writing a recovery mechanism would take a huge amount of time to fix an issue that frankly shouldn’t have happened to begin with. Man-months of time could be devoted to developing a recovery system, or you could just fix the desyncs.
Reader Highlights

Numerous game devs from other projects posted about their games working in a similar manner for multiplayer. Little Big Planet 1/2, Commandos, Praetorians, Command & Conquer, Age of Empires, Halo Wars, Starcraft, Warcraft, Madden/NCAA football, Halo co-op/firefight, and many more. There were also some posts discussing the pain required to create a deterministic game for replay purposes, particularly for physics based games. I apologize to all devs who had nightmare flashbacks due to my post, sorry about that.

Spring RTS is an open source RTS engine that I had never heard of. It’s quite cool looking. Support for online or LAN, massive numbers of units, giant maps, and other neat stuff. It even appears to support join in progress via re-simulation.

Back in 2001 Paul Bettner and Mark Terrano wrote an article similar to mine for Age of Empires – 1500 Archers on a 28.8: Network Programming in Age of Empires and Beyond. Their implementation follows the same core ideas as SupCom. It’s an old article but highly relevant even today. It goes more in-depth than I did so if you are working on a synchronous type game I highly recommend giving it a read.

My favorite comment, by far, is the following. “Posts like this make me want to switch fields again! Video games can have so many interesting problems to solve.” So very, very true. If there is anything video games do not have a shortage of it would be interesting problems to solve.
Footnotes
[1] The name comes from William writing AckAck on the whiteboard because it got the point across.

ACU008 · October 21, 2011, 02:51

Can anyone translate this into Chinese????

Try asking Xiao Hei

advlex · October 24, 2011, 15:44

= =?

This is about synchronization; it doesn't have much to do with the game itself, does it...