Sept. 19, 2008

Memory Debugging on "24: The Game" (PS2)

Author's Note

I originally wrote this piece for an internal Sony discussion about memory debugging, about 2 years ago. In lieu of actually creating any original content for this website, I thought I'd resurrect this piece and polish it to make it less dry than the original report. It's unusual, since games developers rarely admit in public to anything that doesn't sound like a shiny, new, perfect game engine. But game development is now more the art of the possible than the art of whizzy code.

I should also point out that a lot of the hard work was done by the two people who worked on the systems originally: Chris Sorrell and Gavin Bell. Kudos to them.

Background to "24"

I originally started working on "24" in mid-2005. By that point the game had already had over 2 years of development on it. "24" was based on Cambridge Studio's in-house PS2 engine, started in 1999; it was the fourth game built on that engine. The engine was originally developed for Primal, a third-person action adventure with on-foot action. "24" had a very different set of requirements, as it attempted to incorporate free-roaming driving missions. The loading patterns for driving areas caused vastly more "problematic" memory usage than the slower on-foot sections. There were also quite a few subtle issues caused by persistent memory allocations, which we will see later.

With a couple of months left on the project we were still having serious memory and stability problems, so extra staff was added to work on diagnostics (i.e. me).

Some Terminology

Terminology for memory allocation seems to change with every engine I see, so here is the list for this piece. (This differs from the class names in our code, which reused "pool" in more than one context).

  • Heap: a large block of memory using a standard memory-allocation algorithm: for us, it was usually a "best-fit" algorithm;
  • Pool: a chunk of memory acting more like an array of fixed-size allocations, for quick allocation. Its own internal array was allocated from a heap;
  • Allocator: a chunk of memory allocated on the heap, inside which another custom allocator type was used (usually first-fit).
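To make the "pool" idea above concrete, here is a minimal sketch of a fixed-size-slot pool whose backing array comes from a heap. The names (FixedPool and so on) are mine, not the engine's, and plain malloc stands in for the engine's heap allocator.

    #include <cstddef>
    #include <cstdlib>

    // A minimal fixed-size-slot pool. Its backing array is one allocation
    // from a heap (plain malloc here). Free slots are kept in an intrusive
    // singly-linked list threaded through the slots themselves.
    class FixedPool
    {
    public:
        FixedPool(std::size_t slotSize, std::size_t slotCount)
        {
            // Round the slot size up so each free slot can hold a
            // free-list pointer and stays pointer-aligned.
            m_slotSize = ((slotSize + sizeof(void*) - 1) / sizeof(void*)) * sizeof(void*);
            m_storage = static_cast<char*>(std::malloc(m_slotSize * slotCount));
            m_freeList = 0;
            for (std::size_t i = 0; i < slotCount; ++i)
            {
                void* slot = m_storage + i * m_slotSize;
                *static_cast<void**>(slot) = m_freeList;   // push slot onto free list
                m_freeList = slot;
            }
        }
        ~FixedPool() { std::free(m_storage); }

        void* Alloc()
        {
            if (!m_freeList) return 0;                     // pool exhausted
            void* slot = m_freeList;
            m_freeList = *static_cast<void**>(slot);       // pop the head
            return slot;
        }

        void Free(void* slot)
        {
            *static_cast<void**>(slot) = m_freeList;       // push back onto the list
            m_freeList = slot;
        }

    private:
        std::size_t m_slotSize;
        char*       m_storage;   // the pool's own array, allocated from a heap
        void*       m_freeList;
    };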

The Standard Heap Memory Allocator

The core memory allocator had been custom-written from the ground up, six years earlier. It was, in essence, a best-fit allocator using boundary tags (as far as I could tell). It also had some custom bucketing code to try to boost performance: to skip any candidates that were known to be too small for a potential allocation, it tracked the free areas in "size buckets" and only looked in buckets covering sizes as big as, or bigger than, the amount of memory you wanted to allocate.
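As a rough illustration of that bucketing trick (not the engine's actual code), the free blocks can be kept in power-of-two size buckets, and the best-fit search then only needs to start at the first bucket that could possibly hold the request:

    #include <cstddef>
    #include <list>

    struct FreeBlock { char* addr; std::size_t size; };

    const int kNumBuckets = 24;                  // bucket i holds free blocks of [2^i, 2^(i+1))
    std::list<FreeBlock> g_buckets[kNumBuckets]; // maintained by the alloc/free code (not shown)

    static int BucketFor(std::size_t size)
    {
        int b = 0;
        while ((std::size_t(1) << (b + 1)) <= size && b < kNumBuckets - 1) ++b;
        return b;
    }

    // Best-fit search, but skipping whole buckets that cannot possibly
    // contain a block big enough for the request.
    FreeBlock* FindBestFit(std::size_t request)
    {
        FreeBlock* best = 0;
        for (int b = BucketFor(request); b < kNumBuckets; ++b)
        {
            for (std::list<FreeBlock>::iterator it = g_buckets[b].begin();
                 it != g_buckets[b].end(); ++it)
            {
                if (it->size >= request && (!best || it->size < best->size))
                    best = &*it;
            }
            // If this bucket held a fit, every higher bucket only holds
            // larger blocks, so they cannot fit any tighter -- stop here.
            if (best) break;
        }
        return best;
    }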

The basic heap allocator also offered the usual additions:

  • Alignment for the base allocation, plus alignment for elements of arrays;
  • Deferred free processing (for render allocations). This is where the allocation would stay around for a couple of frames after it was explicitly freed, to allow it to be read by the in-progress GS graphics rendering (see the sketch after this list).
  • "Slave" allocations, where deallocation of one memory location would cause the slaved allocations to also be freed. (By the end of "24" this was unused, after a rewrite of the rendering system)

Memory layouts and patterns

There were several heaps in the game:

  • The main, or "global" heap: for all allocations over 8K in size, or any allocation that could be relocated. In the final shipped disk build, this was about 23MB in size, out of the PS2's 32MB total.
  • The "bantam heap": this was a separate heap, for any allocations under 8K in size. This ended up around 5MB in size. If the bantam heap overflowed, and it did, allocations would overspill into the main heap. (Those of a nervous disposition will get worried hearing that).
  • Debug heap: on non-debug builds, debug allocations would go in the bantam heap or be blocked at compile time (debug allocations like this were prevented unless they absolutely had to be allowed).
  • Havok heap: Havok kept a completely separate heap of about 2.5MB, not managed by our systems.

All the heaps and their sizes were allocated at boot-up of the game. We couldn't reinitialise heaps at runtime, as the engine design meant that some systems (particularly the render system, as it turned out), couldn't cope with it.
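In pseudo-configuration form, the boot-time split looked something like the sketch below. The sizes are the rough figures quoted above, the class and function names are invented, and the trivial Heap struct just stands in for the real allocator so the bantam-to-main overspill reads as code:

    #include <cstddef>
    #include <cstdlib>

    // Trivial stand-in for the engine's heap class, for illustration only.
    struct Heap
    {
        void  Init(std::size_t bytes) { m_capacity = bytes; m_used = 0; }
        void* Alloc(std::size_t size)
        {
            if (m_used + size > m_capacity) return 0;   // heap "full"
            m_used += size;                             // (no real free list here)
            return std::malloc(size);
        }
        std::size_t m_capacity, m_used;
    };

    Heap g_mainHeap;     // ~23MB: allocations over 8K, or anything relocatable
    Heap g_bantamHeap;   // ~5MB : allocations under 8K
    // (Havok kept its own ~2.5MB heap outside this system.)

    void InitMemoryAtBoot()
    {
        g_mainHeap.Init(23u * 1024 * 1024);
        g_bantamHeap.Init(5u * 1024 * 1024);
        // Heaps were never reinitialised after this point.
    }

    void* GameAlloc(std::size_t size)
    {
        if (size < 8 * 1024)
        {
            if (void* p = g_bantamHeap.Alloc(size))
                return p;
            return g_mainHeap.Alloc(size);   // bantam overflow: overspill into main
        }
        return g_mainHeap.Alloc(size);
    }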

The Hard Bits

The Really Hard Bit was getting rid of memory fragmentation.

The resource system and loading system on "24" made a separate allocation for each loaded resource (and possibly extra allocations on post-load processing). During runtime, there could be up to around 2500 allocations in the main memory heap, most of which would be resources.

These were the main root causes of the memory crashes we had on "24":

  • Heap fragmentation: the main heap in particular would fill up due to fragmentation. Even with resource relocation (see later), we would often see out-of-memory crashes with around 1.5MB still free in the heap. Fragmentation was also extremely unpredictable across different play-throughs.
  • Granular handle-based loading system. Our game resource system was completely handle-based and tried to reuse already-loaded resource data. Rather than having an explicit process for loading a chunk of assets, creating a handle for an object would register a request to load the data, and it would appear some time later as a separate allocation. Certain "group" handle requests would pull in a block of files linked to that group (an informal form of WAD handling, although without compression); an example would be the set of geometry files for an area of map. Each file within that group would still count as a separate resource and have separate allocations. Creating a handle to an already-loaded object would simply reuse that resource and increment its reference count. Resources were deallocated only when all handle references to the object were lost.

    While this was often a boon for programmers’ productivity (although all our code had to check for handles pointing to loaded data), it meant that it was very difficult to explicitly control loading and unloading of resource allocations, and so to have any control over the memory allocation patterns. Code would often trigger loads and unloads almost invisibly.

    As a result, resources loaded in at the same time, and hence usually allocated close to one another, could have completely different deallocation times. The research literature identifies this as a core root cause of fragmentation. (A minimal sketch of the handle mechanism follows this list.)
  • Leak tracking. This was particularly problematic. Until the end of development, the engine did not reset most systems between levels/game starts. This led to leaks going undetected for long stages of development. In the end, there was a lot of work recoding systems so we could reset them, and adding tagging information to reliably track these problems. In addition, some systems did not leak per se, but held on to allocations beyond level scope. See the "Conclusions" section for more about this.
  • Bantam heap overflow. For historical reasons, we had to allow the bantam heap to overflow, particularly in areas with heavy rendering or lots of NPCs. Normally the situation would "right itself" when memory levels dropped again. However, some allocations would persist for the course of a whole level, or in some cases even longer (see "Leak tracking" above).
  • Render allocations. Memory allocation for rendering was highly dynamic and used granular allocations in the bantam heap, then the main heap. This would cause spikes in memory usage and frequent bantam heap overflow. However, the renderer would often try to reuse allocations to save processor time, which meant it hung on to allocations in the main heap. This exacerbated the fragmentation.
  • Load system differences. To try to improve load times on PS2, initially-loaded data for the start of a level was compressed and interleaved with the movies, and also supplied as a WAD (if the movie was skipped). However, this also caused the memory allocation patterns to vary wildly, as allocations for movie playback buffers affected the patterns. As a result, memory fragmentation was far worse after a movie was played back. This situation only ever happened on final disk builds, so having diagnostic data on disk builds on Test machines was essential for us.
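To make the handle mechanics in the "granular handle-based loading system" point above concrete, here is a minimal, invented sketch of the reference-counting behaviour. It is greatly simplified: the real system also handled group/WAD requests and asynchronous loading, and its handles were opaque rather than just names.

    #include <map>
    #include <string>

    // Each loaded file is a separate resource with its own allocation and
    // a reference count. Creating a handle registers a load request (data
    // arrives some frames later); a second handle to the same resource just
    // bumps the count; the allocation goes away only with the last handle.
    struct Resource
    {
        Resource() : data(0), refCount(0), loaded(false) {}
        void* data;        // separate allocation per loaded file
        int   refCount;
        bool  loaded;      // filled in later by the loader
    };

    static std::map<std::string, Resource> g_resources;

    // "Handle" here is just the resource name, for illustration.
    std::string AcquireHandle(const std::string& name)
    {
        Resource& r = g_resources[name];
        if (r.refCount++ == 0)
        {
            // First reference: queue the load; the allocation appears
            // some time later.  (QueueLoadRequest(name) in a real system.)
        }
        return name;
    }

    // All calling code had to cope with handles whose data wasn't loaded yet.
    bool IsLoaded(const std::string& handle)
    {
        std::map<std::string, Resource>::const_iterator it = g_resources.find(handle);
        return it != g_resources.end() && it->second.loaded;
    }

    void ReleaseHandle(const std::string& handle)
    {
        std::map<std::string, Resource>::iterator it = g_resources.find(handle);
        if (it == g_resources.end()) return;
        if (--it->second.refCount == 0)
        {
            // Last reference gone: the resource allocation is freed,
            // whenever that happens to be relative to its neighbours
            // in the heap.  (FreeResourceData(...) in a real system.)
            g_resources.erase(it);
        }
    }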

Custom heap allocation types

Although best-fit allocation performs well in general tests, we ended up with some interesting problems innate to its approach. In particular, memory "islands" caused by persistent allocations in otherwise free space caused us serious problems later on.

A bit of reading up showed that best-fit allocators are very prone to this, which is intuitive when you start to think about it. Guess what? We were using a best-fit allocator.

Very late in the project we added custom allocation strategies to force allocations to the top or bottom of memory heaps, or into heaps that they would not have used with the default strategy. An example would be that when movies with embedded WAD data were played back, movie buffers were forced to the top of the heap, and resource allocations to the bottom. Although this was an unpleasant hack, it gave us a measure of robustness against the "islands" problem. It was invaluable in actually getting the game to ship. (In essence we were switching to using a first-fit allocator for critical areas).
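The "force to one end of the heap" hack amounted to passing an extra hint with the allocation and then taking the first fit found scanning from that end. A rough sketch of the idea, not the real interface; the free-list details are invented:

    #include <cstddef>

    enum AllocHint { ALLOC_DEFAULT, ALLOC_BOTTOM, ALLOC_TOP };

    struct FreeBlock { char* addr; std::size_t size; FreeBlock* next; FreeBlock* prev; };

    // Doubly-linked free list kept in address order, maintained by the
    // heap's alloc/free code (not shown here).
    static FreeBlock* g_freeListHead = 0;
    static FreeBlock* g_freeListTail = 0;

    FreeBlock* FindBlock(std::size_t size, AllocHint hint)
    {
        if (hint == ALLOC_TOP)
        {
            // First fit scanning down from the top of the heap: movie
            // buffers and other short-lived giants went up here.
            for (FreeBlock* b = g_freeListTail; b; b = b->prev)
                if (b->size >= size) return b;
        }
        else if (hint == ALLOC_BOTTOM)
        {
            // First fit from the bottom: resource allocations were packed
            // down here, away from the movie buffers.
            for (FreeBlock* b = g_freeListHead; b; b = b->next)
                if (b->size >= size) return b;
        }
        // ALLOC_DEFAULT would fall through to the usual best-fit search
        // (omitted in this sketch).
        return 0;
    }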

Runtime Relocation

Resource relocation was added to the engine before "24". It had helped fix the fragmentation problems we had seen on earlier titles. However, on "24" it wasn’t powerful enough to fix everything. This was mainly because the mix of relocatable to non-relocatable objects in the heap had shifted between projects – there weren’t enough movable objects in the heap any more, and we didn’t have time to recode systems to cope with relocation.

Relocatable types were:

  • Animations
  • Textures
  • Model instances
  • Localised text

Everything else (in particular, character model data) was non-relocatable. Analysing the patterns of how this caused fragmentation was essential for us.

The relocation allocator code just tended to push resources to the earliest big-enough free space at the top of memory. To keep overheads down, relocation was constrained to a maximum of 4 resources, or 64K, per frame. Generally, fragmentation wasn’t limited by this relocation rate; it was more that the memory patterns were unrecoverable by that point because large, non-relocatable resources sat at inconvenient positions in memory.
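The per-frame relocation budget was a simple throttle. A sketch of the shape of it, with hypothetical names; the free-space search is stubbed out, and the real code obviously also had to fix up anything referring to the moved data:

    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct Relocatable
    {
        void*       data;
        std::size_t size;
    };

    // Candidates the relocation pass would like to move, filled in elsewhere.
    static std::vector<Relocatable*> g_relocationCandidates;

    const std::size_t kMaxResourcesPerFrame = 4;
    const std::size_t kMaxBytesPerFrame     = 64 * 1024;

    // Stand-in: a real implementation would ask the heap for the earliest
    // big-enough free block towards the top of memory.
    static void* FindEarliestFreeSpace(std::size_t) { return 0; }

    // Called once per frame: move at most 4 resources or 64K, whichever
    // limit is hit first.
    void RelocateStep()
    {
        std::size_t moved = 0, bytes = 0;
        for (std::size_t i = 0; i < g_relocationCandidates.size(); ++i)
        {
            Relocatable* r = g_relocationCandidates[i];
            if (moved >= kMaxResourcesPerFrame) break;
            if (bytes + r->size > kMaxBytesPerFrame) break;

            void* dest = FindEarliestFreeSpace(r->size);
            if (!dest) continue;

            std::memcpy(dest, r->data, r->size);   // move the resource...
            // ...and patch anything pointing at it (FixUpReferences in a real system)
            r->data = dest;

            ++moved;
            bytes += r->size;
        }
    }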

In conclusion, although we had a very sophisticated memory system, it stored up quite a lot of problems for us later down the line. What had served us comparatively well in early games didn’t stand up very well for a game with very different gameplay styles.

If I had my time again, and was the Lead, I would force non-relocatable and relocatable resources to be separate heaps, and set my budgets accordingly (easier said than done, though).

Runtime Statistics

As might have become obvious by now, our engine did a lot of allocation. Loading up a driving level and checking the number of active allocations gave a figure of over 13000! Of these allocations, 12000 were in the bantam ("small") heap.

At the same point there were 1360 separate loaded resource files, of which 900 were theoretically relocatable. However, in a typical test these relocatable resources comprised only about 6MB of the 20+MB of allocated heap space (this ratio varied, of course).

Some tests we did after the game was released showed that we were doing around 200-300 memory operations (including frees) per frame during shooting levels. On driving levels it was around 100-200, but with spikes of up to 700 per frame when streamed loading caused object creation (one spike of 5000 deallocations in a single frame was seen during gameplay). On previous PS2 titles this was actually worse; a lot of work went into recoding rendering allocations, which cut the figure down quite significantly. Nearly all these per-frame allocations were small and went into the bantam heap.

Clearly this would have an effect on real-time update for any external tool. We can get some more exhaustive statistics on this if anyone wants them.

To run reliably on driving levels, we had to maintain an overall level of at least 1.5MB free in the main heap. Any less and fragmentation was liable to cause a crash.

Debug Facilities

Per-allocation data

These are the debug fields saved per allocation on the main build types.

  Feature                            Debug   Internal disks   Master
  Allocation size                    Yes     Yes              Yes
  Allocation time                    Yes     Yes              Yes
  Full name tag (supplied in code)   Yes     No               No
  Allocation PC                      Yes     Yes              No
  Mission ID                         Yes     Yes              No

Runtime overhead was 32 bytes per allocation on debug disks and masters, and 48 bytes per allocation on full-debug builds.
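As a rough illustration only (the layout and field sizes here are invented, and the quoted 32/48-byte figures will also include the allocator's own bookkeeping), the per-allocation debug data amounts to a small, conditionally compiled header matching the table above:

    #include <cstddef>

    // Hypothetical layout: MASTER_BUILD and FULL_DEBUG_BUILD stand in for
    // whatever the real build configuration defines were.
    struct AllocDebugInfo
    {
        std::size_t   size;          // allocation size
        unsigned int  allocTime;     // frame/tick when the allocation was made
    #if !defined(MASTER_BUILD)
        unsigned int  allocPC;       // return address of the allocating code
        unsigned char missionId;     // which mission was active at the time
        unsigned char pad[3];
    #endif
    #if defined(FULL_DEBUG_BUILD)
        char          nameTag[16];   // truncated name tag supplied in code
    #endif
        // ...plus the allocator's own boundary-tag bookkeeping per block.
    };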

The "mission ID" field was a late addition. It allowed us to track quite effectively leaks between missions, particularly when doing automated tests, and see how long some allocations actually persisted.

Using name tags for allocations often led to problems. Cut-and-paste code and template code often made it difficult to provide useful tags, and a lot of STL-style classes ended up with all their allocations sharing the same name until we reworked the code. This led to a lot of subtle leaks being ignored by programmers!

Boundary corruption checks

There was a facility to do boundary corruption checks by tagging the data beyond allocations with known values. Surprisingly, this feature wasn’t that useful to us, although it took a lot of work – we found only two cases where it fired, and they were probably debuggable anyway, as the bugs would have corrupted internal memory pointers.

The usefulness of this feature was lessened because the checks usually fired a long time after the code responsible for the corruption had run, sometimes several seconds later. Checks were usually done when the block itself, or a surrounding block, changed state (usually by being freed). Ironically, bugs that corrupted the memory system's own pointers tended to come to light earlier – usually during the per-frame integrity checks on debug builds.
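For reference, the mechanism is the usual one: write known sentinel bytes just past the end of each allocation and complain if they have changed by the time the block changes state. A minimal sketch, not the engine's code (the guard size, sentinel value and names are made up, and malloc stands in for the heap; a real allocator would keep the size in its block header rather than take it as a parameter):

    #include <cassert>
    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    const unsigned char kSentinel   = 0xFD;
    const std::size_t   kGuardBytes = 16;

    void* CheckedAlloc(std::size_t size)
    {
        // Allocate a little extra and fill it with a known pattern.
        unsigned char* p = static_cast<unsigned char*>(std::malloc(size + kGuardBytes));
        std::memset(p + size, kSentinel, kGuardBytes);
        return p;
    }

    void CheckedFree(void* ptr, std::size_t size)
    {
        // If later code trampled past the end of the allocation, this fires
        // -- but only now, which can be seconds after the real culprit ran.
        unsigned char* p = static_cast<unsigned char*>(ptr);
        for (std::size_t i = 0; i < kGuardBytes; ++i)
            assert(p[size + i] == kSentinel && "buffer overrun detected behind allocation");
        std::free(p);
    }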

Reports

Memory reports were normally triggered from in-game menus and provided the usual set of filtering per-pool and per-type, full dumps, sorting by tag etc. I guess this is all pretty standard.

In-game fragmentation display

The in-game display of fragmentation gave us a good quick overview of memory layout. It had a small memory footprint so we could leave it operational on all disks other than master submissions.

In-game fragmentation display. Top panel is main heap, lower left is bantam heap (dangerously full!). Different colours represent unique allocations.

When used in-game, there were keyboard controls to browse memory and check the type of any selected allocation. There were also options to highlight allocations with specific names, allocations made after a set time, or allocations by the mission in which they were made. This gave a quick and easy way of spotting obvious leaks, and of watching allocation patterns (a sobering experience).

Unfortunately, since the control was coarse, it wasn’t that useful for in-depth analysis; we generally had to resort to written log data for that.

Memory card logs

In the final stages we saw a lot of discrepancies between even "near disk" builds -- where all build options were the same as a disk build except for using the disk as data source -- and the disk builds themselves. We also had a lot of cases where memory fragmentation was only being reported on testers’ PS2 Debug kits.

Since we couldn’t afford the IOP memory to add network support, we logged quite a lot of statistics to memory card files. This had near-zero runtime/memory overhead.

Statistics included:

  • Full memory dumps at given points (in particular after level shutdowns);
  • Overall memory levels every 5 seconds, very useful for finding memory hotspots;
  • Full exception crash information, including memory areas around register addresses (the name of the log file matching the play-through would then appear on the exception screen for the testers to add to crash reports).

The dumps were all in normal ASCII format; this made parsing very easy. Generally the testers played all day, then we copied all their memory card logs off for later analysis. Using the memory cards also meant that the testers didn’t need any external tools running; they could use the system in a fire-and-forget manner.

We then very quickly knocked up Python scripts to parse the outputs. By also parsing the MAP file that matched the disk’s executable, we could do lookups on all the code that allocated the memory, rather than storing debug strings at runtime.
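The engine side of this was deliberately dumb: one comma-separated line per allocation, with the allocation PC written as a raw hex address so the offline scripts, rather than the runtime, turned it into a function name via the MAP file. Something along these lines as a sketch, with an invented field order and fprintf standing in for the real memory card file I/O:

    #include <cstdio>

    // Hypothetical per-allocation record; in the real engine this data
    // lived in the allocator's debug fields.
    struct AllocRecord
    {
        unsigned int address;    // where the block lives in the heap
        unsigned int size;       // allocation size in bytes
        unsigned int allocTime;  // when it was allocated
        unsigned int allocPC;    // return address of the allocating code
        int          missionId;  // mission active at allocation time
    };

    // Dump one CSV-like line per allocation. allocPC stays as hex; the
    // offline scripts resolve it to a function name using the MAP file for
    // the matching executable, so no strings are needed on disk builds.
    void DumpAllocation(std::FILE* log, const AllocRecord& a)
    {
        std::fprintf(log, "%08x,%u,%u,%08x,%d\n",
                     a.address, a.size, a.allocTime, a.allocPC, a.missionId);
    }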

The scripts could then give full annotated and totalised reports, and a visual representation of the memory layout (similar to the in-game fragmentation display). There was no interactive use of the tools; we never really found a case where it was necessary. If we really wanted the details of an exact allocation, we just looked in the text reports.

The exception handler scripts could also generate a full binary image file that could get loaded into the PS2 debugger – this was terrific for checking out complicated crashes.

In short, this system was probably critical in hunting down most of the really nasty crash problems we had.

Output from a Python script showing the global heap memory layout extracted from memory card logs. This shows a snapshot dumped after an out-of-memory exception, drawn in a similar way to the in-game fragmentation display. White areas are remaining free memory. The text at the start of each allocation is the truncated name of the C++ function that did the allocation. In this view, all allocations from the same code have the same colour.

The same view as above, but this time filtered to show allocations made inside the current level (in pink) and those from before (green). Here it’s clearer to the trained eye that we are seeing minor fragmentation in the middle of the heap. This is because of allocations that live longer than one mission – here, script allocations and the language manager.

Allocation histories

The debug builds supported a rolling buffer of the last 8192 memory operations in the system. It was optionally built in using conditional compilation.
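A rolling history like this is just a fixed circular buffer of operation records, compiled in or out as a whole. A sketch of the shape of it, with invented names and fields:

    #include <cstddef>

    #define MEM_HISTORY_ENABLED 1   // conditional compilation, as in the engine

    #if MEM_HISTORY_ENABLED
    enum MemOpType { MEMOP_ALLOC, MEMOP_FREE, MEMOP_RELOCATE };

    struct MemOp
    {
        MemOpType    type;
        unsigned int address;
        unsigned int size;
        unsigned int pc;       // who did it
        unsigned int time;     // when
    };

    // Rolling buffer of the last 8192 memory operations; old entries are
    // simply overwritten once the buffer wraps.
    const std::size_t  kHistorySize = 8192;
    static MemOp       g_history[kHistorySize];
    static std::size_t g_historyNext = 0;

    void RecordMemOp(const MemOp& op)
    {
        g_history[g_historyNext] = op;
        g_historyNext = (g_historyNext + 1) % kHistorySize;
    }
    #endif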

Surprisingly, we only found this feature useful on one occasion. This was a crash bug involving render DMA and cache invalidation on non-disk builds, where the fact that near-consecutive allocations partially overlapped caused unexpected data cache write-backs.

Some Conclusions

"Leaks" between levels

On PS1 our approach to memory leaks was simple: we made a simplified snapshot of the memory state at the start of, say, a level. Then, on exiting the level, we compared the memory state to the snapshot. Anything that didn’t match was a leak.
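In code terms the PS1 scheme was about as simple as it sounds: record the live allocations at the start, diff against them at the end. A sketch, assuming each allocation can be identified by its address and size (the snapshot function is a stand-in for walking the allocator's block list):

    #include <cstdio>
    #include <set>
    #include <utility>

    // Identify each live allocation by (address, size) -- enough for the
    // simple PS1-style scheme described above.
    typedef std::pair<unsigned int, unsigned int> AllocKey;
    typedef std::set<AllocKey> MemSnapshot;

    // Stand-in: a real implementation would walk the allocator's live blocks.
    MemSnapshot TakeSnapshot() { return MemSnapshot(); }

    void ReportLeaks(const MemSnapshot& atLevelStart)
    {
        MemSnapshot now = TakeSnapshot();
        for (MemSnapshot::const_iterator it = now.begin(); it != now.end(); ++it)
        {
            // Anything live now that wasn't live at level start is a
            // suspected leak (or, as it turned out on PS2, quite often a
            // buffer that had legitimately grown and never shrunk back).
            if (atLevelStart.find(*it) == atLevelStart.end())
                std::printf("possible leak: 0x%08x (%u bytes)\n", it->first, it->second);
        }
    }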

On PS2 this approach fell down quite badly. As well as the fact that not all systems shut down on level exit, it seems that C++ code (or maybe just our C++ code!) gave memory usage that didn’t fit this model. One example would be our implementation of dynamically sized arrays, similar to STL vectors. Although they did not leak memory, they would end up with internal buffers that grew but never shrank back if the number of objects within them decreased. Often these reallocated buffers would be created at extremely inopportune times, e.g. when the bantam heap had overflowed into the main heap, causing consistent fragmentation.

Early on in PS2 development the decision was taken not to assert when possible leaks were flagged. (This may well have been a mistake.) The warnings were always ignored by users; often the warnings were spurious, and in many cases even when they were valid they were unhelpful, e.g. along the lines of "there are more generic list objects in the world than when you started".

Given no asserts, the main approaches we found were:

  • a list of "acceptable" tags for leaks. This was actually quite dangerous and led to some allocations that looked OK sneaking through later on.
  • lots of hard work checking every potential leak. This is what we actually ended up doing.

The automated tests and memory card logs were invaluable in gathering the data for the "hard work".

Real-time update versus snapshots

At the start of doing the final work on memory debugging we thought that real-time update would be invaluable. As it turned out, it wasn’t nearly as important as we expected. Although the in-game display was handy for getting an overall view of what was going on, most of the crucial events were only exposed in later analysis.

The most useful debug capability to me, by far, was the ability to dump a snapshot at a known point (triggered from code), and then potentially visualise it offline. This let us see obvious leaks and fragmentation islands at important points very clearly. Because we still had allocation times in the debug dumps, it was easy to trace patterns of allocation.

The other really obvious point was that having this data available in an easily parsed format was vital. We just used simple CSV-like text files. This meant that offline processing was trivial and allowed us to automate it.

Conclusions

Here are the main things to take away from all this:

  • Think about your memory policy as early as possible. Get at least the ability to dump the memory layout as soon as you can -- you can go far with just that
  • Understand the different types of memory allocator, and their advantages and disadvantages. If I had my time again, I'd stick with first-fit allocators. Be aware that fragmentation is caused at the time of de-allocation, not allocation -- so try to ensure neighbouring allocations get freed at roughly the same time.
  • Measure -- don't guess! Every man and his dog has an opinion about why you're running out of memory. It's the art. It's the maps. It's Programmer X doing lots of little allocations. Ignore them, sit down, and find out how much things are actually taking. We only needed to look at about the top 20 memory-allocating functions to account for 80-90% of the memory.
  • You don't need to go overboard on memory pools. I've seen games that split all their allocations into different pools. In most cases these waste memory -- you can intermingle the allocations and still track them, particularly if the allocations are relatively small. 3 or 4 pools is usually enough.
  • Don't mix relocatable and non-relocatable allocations. It might buy you some time, but mainly you are just postponing the point that your problems start...
  • Don't over-engineer your debugging. The time spent getting that "perfect system" can be better used elsewhere.

And finally...

  • Everything is a trade-off. People might think "what a rubbish system", but if you don't ship your game, you won't even have an article to write...