Courses.cs.washington.edu



03/05/2019 CSEP503 Lecture Notes

Title: "Patches, root cause analysis, post mortem Panel"

Guest: Henry Goffin

Henry Goffin Bio:
Entered the video game industry about 15 years ago. Worked at EA, Valve, and other studios. Recently worked on Steam at Valve, focusing on scaling issues. Now working at Microsoft in Xbox Studios. His experience runs the gamut: front-end PHP code to DB scaling, payment methods, security, reverse engineering, debugging, and post mortems, and he has shipped 6 commercial titles. He likes video gaming because people are always ready to advance to the next frontier; stagnation isn’t an option.

Dealing with patches for clients that want to have complete control of all updates, balancing the importance of the patch vs stability, etc.

Q&A:

Q: How does the mobile market, where updates are automatic and involuntary, compare to PC, where many people are still using Windows 7?

Henry: There will always be legacy clients that will never be updated. DOTA 2 was targeted at China, and XP machines were targeted as minimum spec so that old, pirated machines would be able to play the game. When balancing the lowest common denominator (spec-wise) against the fidelity of the game, the lowest common denominator almost always wins. “We wish we could use these features, but we can’t because X% of our users are still on [old version]”. Mac and PC are both trying to push towards an automated update model.

Theo: For Android, the API provided by the system for getting information from the device was buggy, was interfered with by ISPs, or ran into faulty hardware. Amazon started creating a versioning system to only surface apps compatible with the specific device. Amazon’s app store model was one of ‘entitlement’: if you bought an app for one particular phone, you are ‘entitled’ to the appropriate app on whatever device you happen to be accessing the App Store on, so a complex compatibility matrix had to be developed.
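The entitlement idea Theo describes can be sketched as a lookup that, for each app a customer owns, surfaces only a build compatible with the device in hand. This is a minimal illustration, not Amazon's actual system: the `Device` and `AppVariant` fields, the API-level and touchscreen checks, and the "prefer the most demanding compatible build" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Device:
    # Hypothetical device attributes; a real matrix would track many more.
    model: str
    api_level: int
    has_touchscreen: bool


@dataclass(frozen=True)
class AppVariant:
    # One build of an app, with the (assumed) requirements it places on a device.
    app_id: str
    version: str
    min_api_level: int
    requires_touchscreen: bool


def is_compatible(variant: AppVariant, device: Device) -> bool:
    """Return True if the device meets every requirement of this build."""
    if device.api_level < variant.min_api_level:
        return False
    if variant.requires_touchscreen and not device.has_touchscreen:
        return False
    return True


def variants_for_device(
    owned_app_ids: Iterable[str],
    catalog: List[AppVariant],
    device: Device,
) -> Dict[str, Optional[AppVariant]]:
    """Map each owned app to the build to surface on this device.

    Among compatible builds, the one with the highest minimum API level is
    chosen (an assumed tie-break). None means no build fits the device.
    """
    result: Dict[str, Optional[AppVariant]] = {}
    for app_id in owned_app_ids:
        candidates = [
            v for v in catalog
            if v.app_id == app_id and is_compatible(v, device)
        ]
        result[app_id] = max(
            candidates, key=lambda v: v.min_api_level, default=None
        )
    return result
```

In this sketch the customer's purchase is tied to the `app_id`, not to a specific build, which is what lets the same entitlement resolve to different builds on different devices.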
An intern wrote ‘Sherlock’ and ‘Moriarty’ apps to try to dig out legitimate information about a phone given valid and rooted configurations, respectively. Now you may even have to determine whether the device is a phone at all, rather than a smart fridge or other appliance.

“Semantic versioning” is the most common versioning convention: Major.Minor.Patch. If someone messes it up in a public repository, it will black out half the userbase.

Henry: In video games the SOP is to never have dependencies: basically, build everything you need into a brick of code in order to avoid any versioning issues.

Q: You mentioned cracking. How was that skillset applicable to your position in the game development field?

Henry: I used to crack games to figure out how they worked, for fun. Developers are brought in to do the reverse: to scan the computers of cheaters, identify foreign executables, and build an anti-cheat fingerprint. Also, software compatibility at the OS level is hard. Different vendors’ computers come with their own odd custom programs, and these interfere with games. The solution can be to actually buy the problematic machines (specific laptops etc.) and investigate what’s causing the bug.

Q: Do you see these issues being helped by platforms moving to the Cloud?

Henry: Once you only have a thin access layer on the client side, it will be much easier to manage. If something’s wrong with your Chromebook, just reboot or reformat it and you’ll be back up and running. However, the flip side is that it presents a new set of security risks (man-in-the-middle attacks etc.).

Henry: The internet is not as pure a connection between the client and the server as you’d like to think it is. You’d like to think it’s passing data cleanly between the two, but more often than not there’s some intermediary logic (internal network monitors etc.), and that can be manipulated as well.
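To make the Major.Minor.Patch convention from the semantic versioning note above concrete, here is a minimal sketch of parsing and comparing such version strings. It assumes plain three-part numeric versions only (no pre-release or build tags), and the `is_compatible_upgrade` rule (same major version, not older) is one common reading of the convention, not something stated in the lecture.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    # Tuple ordering compares major first, then minor, then patch.
    major: int
    minor: int
    patch: int


_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)", re.ASCII)


def parse_version(text: str) -> Version:
    """Parse 'Major.Minor.Patch' into a Version, rejecting anything else."""
    match = _VERSION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a Major.Minor.Patch version: {text!r}")
    return Version(*(int(part) for part in match.groups()))


def is_compatible_upgrade(installed: Version, candidate: Version) -> bool:
    """Assumed rule: an upgrade is safe if the major version is unchanged
    and the candidate is not older than what is installed."""
    return candidate.major == installed.major and candidate >= installed
```

Comparing parsed tuples avoids the classic string-comparison mistake where "1.10.0" sorts before "1.9.3". A mislabeled release (for example, a breaking change shipped as a patch bump) would pass this check, which is how one bad version number in a public repository can break many downstream users at once.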
You can’t trust any part of the stack: unless it’s all in your building, you laid the cable, and you wrote the operating system, there’s something in there that will screw you.

Henry: CPUs are full of bugs. Macs are a terrible platform to develop on because they haven’t felt any pressure to update their drivers, and there is essentially a slew of ‘known issues’ that they have no intention of fixing.

Post Mortems from Henry:

Engineering post mortems are usually centered around high-impact bugs whose root cause took time to discover. It can be fun to run a post mortem because the bug is often the product of an esoteric edge case and is basically a puzzle to figure out.

Henry worked on “State of Decay 2”, which shipped with a known bug where the UI would just fail to render. It turned out the UI went away because of misunderstandings of multithreaded timing and memory transfers (see: esoteric edge case). Typically 50+% of engineers don’t have any exposure to the systems level, dwelling entirely on the application level. Henry was the systems guy on the State of Decay team and used the post mortem for this issue to do a dive on the hardware fault that caused it and to teach the other developers about this sort of issue. The specific issue wasn’t likely to come up again, but an issue like it could. Post mortems are useful for revealing new types of problems.

Q: Do you think video games today require more patches than before connectivity was assumed for all users?

A: Yes, but also because every game is expected to be at least as good as every other game that has ever been released. The systems that drive games this generation are 100x as complicated as those from a generation ago, and so on. Super Mario Bros 3 has a few known bugs but the total code is pretty small; an error rate of 3% or so maps to around 300 bugs. Something like Assassin’s Creed Origins is bigger than some operating systems, so a similar error rate would result in a much higher number of bugs.
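The scaling argument above is simple proportionality, and a short sketch makes the numbers explicit. The 3% rate and the 300-bug figure come from the notes; the 10,000-line size is only what those two figures imply together, and the one-million-line code base is a hypothetical stand-in for a large modern game, not a measured value.

```python
def estimated_bugs(lines_of_code: int, error_rate: float) -> int:
    """Estimate bug count as a flat fraction of lines (a deliberately crude model)."""
    return round(lines_of_code * error_rate)


def lines_implied(bug_count: int, error_rate: float) -> int:
    """Invert the model: how many lines would yield this many bugs at this rate?"""
    return round(bug_count / error_rate)


ERROR_RATE = 0.03  # "3% or so", as quoted in the notes

# Size implied by the notes' figures for the small game (an inference, not a fact).
small_game_lines = lines_implied(300, ERROR_RATE)

# Hypothetical large game: one million lines at the same error rate.
large_game_bugs = estimated_bugs(1_000_000, ERROR_RATE)
```

Under these assumptions `small_game_lines` is 10,000 and `large_game_bugs` is 30,000, so the same error rate yields 100 times as many bugs in the larger code base.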
It’s very hard to make a game that doesn’t have a million lines of code now unless you’re working in a minimalist indie space.

Q: So how do you test millions of lines of code?

A: That is an open problem, and games are inevitably buggier because of it. Throwing bodies at the problem (armies of testers) is one way to help, but it requires a vast QA budget. We don’t know how to build testable software that is that interactive and that deep. We hope that AI/ML can be used to improve things (e.g. virtual game players), but no one has built an at-scale testing mechanism like this yet. All we can do is have a very basic test suite and require developers to run it before checking in code, but if that test suite doesn’t cover a bug then we may not discover it.

Story from Jim: In my last job at Google I was running a team in the Virtual Machine org, and one of our customers called in with a problem: their data was being corrupted when moving from one machine to another. We went through the code, found the problem, and fixed it, but then realized it was an old bug, so it could not have started causing corruption on its own. It turned out that a driver had been changed in a way that altered the error behavior triggered by the bug, allowing the system to continue (erroneously) instead of halting.

Q: Thoughts on the rise and fall of the mobile market / Zynga?

A: I don’t have a lot of insight into the mobile market. I have more experience with ‘sit down, at home’ experiences, which have not gone through the same boom and bust cycle that mobile games have. Mobile games are tough because it’s very hard to get people to know about the game, so app stores are full of crap games because people don’t expect to make money. The top 10 games are the same as they were a year ago because the most effective way to draw new users is to be in the top 10… I like working in a market that I know, and the mobile industry is a mystery to me.