Performance focus sprint

Summary

Although keeping performance in check is an important part of my job, it is up to production to decide when it is appropriate to devote a large amount of resources to fixing problems. This is an account of how I handled a sudden switch in priorities, when performance went from a neglected red flag in the background to the most important task of the whole team’s sprint.

Goal

The shipping requirement for our milestone was to hit the maximum fps allowed by the platform as a baseline, with occasional spikes tolerated. To achieve that I had a small group of tech designers and programmers able to investigate and fix issues, plus a relatively large number of artists, level designers and vfx/sfx artists who could “lend a hand”. My internal objective was therefore to avoid wasting a single person-hour on interdependencies, and to keep the small group of “fixers” from spending time collecting data that anyone else on the team could gather for them. This would require extensive coordination and planning.

Execution

Before the focus sprint began, some preliminary work was necessary. I organized the workflow, divided tasks into stages and wrote the necessary documentation. During the focus sprint I coordinated the work of the team, 14 people, a solid three quarters of the core team. I had to ensure that no duplicate investigations happened and that everyone always had everything they needed for their current task, with the next task already lined up. Once the team’s work was up to speed, I focused on activating and testing both ApplicationSpaceWarp and 4x MSAA anti-aliasing. Afterwards we held a retrospective and agreed on new rules to keep performance up to standard, as well as on new best practices.

Preparation

I divided the workflow into 3 stages for each issue:

  • PE – Preliminary Evaluation
  • DC – Data Collection
  • IF – Investigation & Fixing

Before the start of the focus sprint I already had an array of data collected, both through our CI/CD pipeline performance tests and from captures made by the QA team.
I had already spent a couple of days preparing a performance report and had a good grasp of the issues known from that data. I coupled that with a list of our game features and levels.

This gave me a first assessment of the knowns & unknowns we had to deal with.
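
As an illustration of what that first report amounted to in practice, here is a minimal sketch of how per-level frame-time captures could be summarized and the worst offenders flagged. The CSV layout, the column names and the 72 Hz frame budget are assumptions made for the example, not our actual capture format.

    import csv
    from collections import defaultdict
    from statistics import mean

    # Assumed frame budget: a 72 Hz display target gives ~13.9 ms per frame.
    FRAME_BUDGET_MS = 1000.0 / 72.0
    SPIKE_THRESHOLD_MS = FRAME_BUDGET_MS * 1.5  # occasional dips tolerated, big spikes flagged

    def summarize_capture(path):
        """Group frame-time samples by level and flag the ones over budget.

        Expects a CSV with 'level' and 'frame_ms' columns (hypothetical format).
        """
        samples = defaultdict(list)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                samples[row["level"]].append(float(row["frame_ms"]))

        report = []
        for level, times in samples.items():
            avg = mean(times)
            spikes = sum(1 for t in times if t > SPIKE_THRESHOLD_MS)
            report.append((level, avg, spikes, avg > FRAME_BUDGET_MS))
        # Worst offenders first, ready to be triaged into the workflow stages.
        return sorted(report, key=lambda r: r[1], reverse=True)

    for level, avg, spikes, over_budget in summarize_capture("qa_capture.csv"):
        status = "OVER BUDGET" if over_budget else "ok"
        print(f"{level}: avg {avg:.1f} ms, {spikes} spikes -> {status}")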

The known issues were thus assigned the DC or IF stage depending on data availability. The rest was listed under PE.
With some help from other team leads, guidelines were prepared on how to execute every step of our workflow, how to use all the necessary profiling tools and how to make local builds. Everything was communicated to the team by the end of the week, so that we could be ready to start the performance work at the beginning of the new sprint.
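
To make that triage rule concrete, the decision boils down to something like the sketch below. The Issue fields are hypothetical, but the logic is the one described above: unknowns go to PE, known issues go to DC or IF depending on whether captures already exist.

    from dataclasses import dataclass, field
    from enum import Enum

    class Stage(Enum):
        PE = "Preliminary Evaluation"
        DC = "Data Collection"
        IF = "Investigation & Fixing"

    @dataclass
    class Issue:
        name: str
        known: bool = False                            # already pointed at by existing reports?
        captures: list = field(default_factory=list)   # profiling/RenderDoc captures on hand

    def triage(issue: Issue) -> Stage:
        """Assign the starting stage from data availability."""
        if not issue.known:
            return Stage.PE   # nothing solid yet: someone has to take a first look
        if issue.captures:
            return Stage.IF   # data is already there: ready for a fixer
        return Stage.DC       # known problem, but captures still need to be made

    # Example: a known issue with a capture on file goes straight to the fixers.
    print(triage(Issue("particle overdraw in the hub level", known=True, captures=["renderdoc_01"])))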

The focus sprint

Initially we had only a small number of tasks in the IF stage, ready for the fixers to start. Then there was a medium number of tasks that required a data collection pass (in-depth CPU captures or RenderDoc captures were needed), and a vast number of tasks in need of a preliminary evaluation (where we either had superficial data hinting at a problem or none at all, so someone had to take a look).

The week was kickstarted with a meeting with all available technical designers and programmers in order to distribute the tasks in the IF stage. Everyone got something to start working on, which left almost no such tasks in reserve. This was a risk: if they finished quickly they would have had to start working on DC-stage tasks, wasting precious fixer time.

My focus at this stage was to:

  • coordinate the people doing the PE work, so that we could line up new DC tasks if needed or close that line of investigation if no further action was required
  • ensure that the people doing DC work delivered the data properly, then create Jira tickets for the associated IF-stage tasks with links to the available information
  • contact the people doing IF work to find owners for the newly created IF tasks, while also checking in for any red flags

Slowly a buffer of fully prepared IF tasks started to form, and by day 3 I was free to focus on my own task: ApplicationSpaceWarp, which lets the application render at half the display refresh rate while the runtime synthesizes the in-between frames, and 4x MSAA anti-aliasing.
We also had a mid-way meeting for people to share the lessons they were learning and useful tips & tricks that could speed up each other’s work.
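
For context on why SpaceWarp was worth the effort, the frame-budget arithmetic is roughly the following; the 72 Hz refresh rate is an assumption for the example, since the exact target depends on the device configuration.

    # Rough frame-budget arithmetic behind ApplicationSpaceWarp (72 Hz display assumed).
    DISPLAY_HZ = 72
    budget_without_asw = 1000.0 / DISPLAY_HZ        # ~13.9 ms to render every displayed frame
    budget_with_asw = 1000.0 / (DISPLAY_HZ / 2)     # ~27.8 ms: the app renders every other frame,
                                                    # the runtime synthesizes the ones in between
    print(f"without ASW: {budget_without_asw:.1f} ms, with ASW: {budget_with_asw:.1f} ms per app frame")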

Post-mortem

Before the end of the sprint we got to the point where all aspects of the game had been checked and the target FPS was hit on all features and in all levels, now even with MSAA 4x active.
The team was congratulated for a job well done, and the tech designers and programmers were invited to a retrospective meeting to consolidate the lessons learned along the way.
We ended up with newly agreed best practices.

To ensure that performance would stay up, a new rule was established: every playtest from that point on would happen with a visible fps counter, and any poorly performing area would be immediately flagged.

Retrospect

Although the preparation time was not ideal, we managed to achieve our performance goals, learn new lessons, establish new practices and improve visual quality. The studio’s culture was also significantly affected by the now widespread awareness that performance is everyone’s responsibility, not just something a couple of people need to be bothered with.

Although I would have preferred a gradual, earlier intervention rather than such a sudden disruption of work, I must admit that the impact on the team would not have been the same. The lessons learned this way would not have been adopted as widely without that emphasis.