Courses.cs.washington.edu



Panel Notes (3/5/2019) – taken by Misha Dhar
Patches, root cause analysis, post mortem
Guests: Henry, Theo, Jim

- Introduction from Henry
  - H: I’ve stuck my finger in pretty much every pie there is
- Theo presents his background wrt today’s topic, specifically some of his work at Amazon
- Jim presents his background, touching on last week’s lecture, and the “Google perspective”

Henry:
- Platform owners (e.g. Microsoft/XBL) like to maintain control of release schedules
- Bugs often sit in the queue for a while, and pushing patches/fixes is costly in many ways
- How does one make the right business call to balance a quality product vs paying for QA, waiting, etc?
- User interface/experience tradeoffs come with frequent patching
- Question: How does this correlate in the PC game market?
  - Depends on the user’s OS. E.g. tons of people are still using Windows XP, especially in China
  - So we had to build software that was backwards compatible
  - Annoying because we have to duplicate work and reduce to the lowest common denominator
  - Need to make business decisions on what and who to support
  - Android is another good example – most devices don’t run the same version
  - Windows 10 is probably the last standalone Windows – using service packs/updates to push new features
  - Apple is doing the same, and keeps users on the latest platform to make life easier for devs

Theo:
- Can have issues where functionality might be present on the device, but blocked by the phone service provider
- Can’t always trust that incoming data is right, even if a connection to a sensor/feature “works”
- Want to present only truly compatible apps to a user
- Had to create a device fingerprint identifier, because we can’t always trust the data that comes in
- Brief anecdote about intern project to read device data (Sherlock and Moriarty)

Henry:
- Relieved to have fewer compatibility/sensor touchpoints on PC/console

Theo:
- A “bloody mess”
- Semantic versioning
  - Open-source software versioning pattern: <major>.<minor>.<patch>
  - Many systems use versioning patterns and numbering to validate compatibility

Henry:
- Remove external dependencies – they cause too much of a headache, even when they do something good and fix a bug!
- Question: In what way is reverse engineering/cracking pertinent in the video game industry?
  - Brief anecdote from Henry describing how this can assist with tracking down issues
  - Gained the skillset to cheat at games (a long time ago), but it has been valuable professionally
  - Devs today are kept on staff to get ahead of cheaters before they can compromise the integrity of the game, which keeps loyal, honest customers happy
- Software compatibility is difficult
  - So much bloatware
  - Often vendor-specific programs that are useless
  - Issues can be device/software combo specific
  - Need to test on these specific combos to really repro the issue on the device
- Question: What is the impact of the cloud on these issues?
  - Makes things easier to maintain
  - Can just reboot and restore
  - Simplifies support model
  - Not all good news – can also introduce other complexities or vulnerabilities
  - Security vs stability

Theo:
- Easier to hide and mitigate issues from customer’s POV

Henry:
- Information pathways on the internet are far more complicated than we imagine
  - Lots of devices/processes between you and what you’re trying to access
  - More touchpoints means a higher chance of failure at any single one
- You can’t REALLY trust any part of the stack, unless you’ve literally built everything yourself
- Apple’s drivers are a crapshoot – just aren’t very good. So even though there is hardware unity, still not good to dev on. And since Apple doesn’t care (not a priority for them), it never gets fixed unless you “know a guy”.
- Intel – CPUs are FULL of bugs, and you will never know about 99% of them
  - Windows ships with secret Intel microcode updates
  - Allows Intel to quietly push firmware updates via Windows Update
  - If you run Linux, you’d better know about this, otherwise you’re SOL
  - Patch space being exhausted can brick a machine
- 99% of computer activities are simple
  - Users rarely notice issues
  - Issues often impact the top level performance, not day-to-day perf
- Softlock (application/OS failure) vs hardlock (hardware failure)
- BSOD – Windows halts to prevent the user from causing more damage
- Question: Tell us about engineering postmortems
  - Usually centered around some horrible bug
  - Spread the knowledge of the issue as widely as possible within the org to prevent a repeat
  - Also a good way to look smart in front of everyone else
  - Working on State of Decay
    - Shipped with a known bug
    - Sometimes the 2D interface overlaid on the 3D interface would ‘go away’
    - Had to dig deep to root cause
  - Value of a postmortem is the chance to expose engineers to the systems level, because most are many layers up the stack
  - H used the postmortem as a teaching tool as well as a way to bring the team together and expose them to things they didn’t know existed
  - Postmortems are important for raising awareness of TYPES of problems, not necessarily solving them. Need to identify the issue so the right expert can come and fix it.
- Question: Do video games need more patches today than in the past due to internet connectivity?
  - Yes, but not because of internet connectivity. Rather, it is to keep up with the competition and because our systems today are way more complex than they ever were (by several orders of magnitude)
  - Older games don’t have many touchpoints for “failure”, i.e. fewer bugs
  - Newer games have millions of lines of code, can be larger than many OSes
  - More realistic games mean dramatically increased complexity, and since gamers expect this, can’t go back to the old days from a business standpoint
- Question: So how do you test this many lines of code?
  - You can’t!
  - It’s why games are buggier these days
  - EA/Microsoft throw bodies at the game
    - Many hours of QA testing
    - Have to invest a lot of resources
  - Don’t know how to build testable software at this scale
  - Hoping AI/ML can streamline the process
    - Use bots to proceed thru the game flow and stress the system
    - Still very theoretical

Jim:
- Story of a tester who supplied automation to stress programs
  - An update had to pass the running test before he would test it
- JIT compiler too complicated to test
  - Generate random code that meets specs
  - JIT compiler had to run on this and not fail
  - This is the bar for any manual testing to follow
- Was running part of the VM team at Google
  - Customer called in with a serious issue
  - Data was being corrupted
  - Team investigated code, found issue, fixed issue, and pushed it out
  - Then found that the bug they fixed had been around for a long time – this was not the real issue
  - Fixing the issue in the driver caused failures elsewhere

Henry:
- Can’t always isolate the root cause
- Sometimes fixing one bug can cause a ton of other problems
- Need to move to languages that make it harder to introduce bugs
- Question: Thoughts on mobile gaming, especially wrt Zynga and its rise/fall
  - Don’t have a lot of insight into the mobile market
  - Focused more on traditional console/PC gaming
  - Less of a boom-bust cycle
  - Hard to make revenue in the mobile market because it depends on marketing the game and hoping it goes viral, i.e. user acquisition
  - App stores are saturated with garbage games
  - Best way to make revenue is to be in the top 10 – rich stay rich
  - Consoles curate their markets and will feature top content – no flood because they control access to their markets
  - Not a healthy market
  - Most companies don’t understand their own business model, which is why they boom and then bust

................
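
Theo's point about the <major>.<minor>.<patch> pattern being used to validate compatibility can be illustrated with a minimal sketch. The rule here (same major version, installed version at least the required one) is a common convention chosen for illustration, not something stated in the panel, and the names `Version`, `parse_version`, and `is_compatible` are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a '<major>.<minor>.<patch>' string into a Version tuple."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a <major>.<minor>.<patch> version: {text!r}")
    return Version(*(int(p) for p in parts))


def is_compatible(installed: str, required: str) -> bool:
    """Assumed rule: same major version, and installed is not older than required."""
    have, need = parse_version(installed), parse_version(required)
    # Tuples compare field by field, so with equal majors this checks minor, then patch.
    return have.major == need.major and have >= need
```

Under this assumed rule, `is_compatible("1.4.2", "1.3.0")` is true, while a major version bump such as `is_compatible("2.0.0", "1.3.0")` is treated as a breaking change and rejected.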
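
Jim's JIT story (generate random code that meets the specs, then require the compiler to run it without failing) can be sketched roughly as below. This is only an illustration under loud assumptions: the "language" is a toy of integer arithmetic, Python's own `compile`/`eval` stands in for the compiler under test, and the comparison against a reference interpreter is an added check that goes beyond the "must not fail" bar described in the notes. All function names are hypothetical.

```python
import random

OPS = ["+", "-", "*"]


def random_expr(rng: random.Random, depth: int = 3):
    """Build a random expression tree: an int leaf or an (op, left, right) tuple."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randint(-9, 9)
    return (rng.choice(OPS), random_expr(rng, depth - 1), random_expr(rng, depth - 1))


def to_source(expr) -> str:
    """Render the tree as fully parenthesized source text."""
    if isinstance(expr, int):
        return f"({expr})"
    op, left, right = expr
    return f"({to_source(left)} {op} {to_source(right)})"


def interpret(expr) -> int:
    """Reference evaluator that walks the tree directly."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    a, b = interpret(left), interpret(right)
    return a + b if op == "+" else a - b if op == "-" else a * b


def run_compiled(source: str) -> int:
    """Stand-in for the compiler under test: Python's built-in compile and eval."""
    return eval(compile(source, "<random>", "eval"), {"__builtins__": {}})


def stress(seed: int, rounds: int = 200) -> int:
    """Run random programs through both paths; return how many were checked."""
    rng = random.Random(seed)
    for _ in range(rounds):
        expr = random_expr(rng)
        source = to_source(expr)
        # Any crash or mismatch surfaces here, with the offending program as the message.
        assert run_compiled(source) == interpret(expr), source
    return rounds
```

Seeding the generator keeps every failing program reproducible, which matters when the generated input is the only description of the bug.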

