Switching to a Data-Oriented Mindset

10-08-24

One programming paradigm that is overwhelmingly popular is Object Oriented programming. Even though languages like Golang doesn't have "OOP" in the way that C++ and Java has OOP. You can still model your data in an OOP-like fashion. My first intro to data-oriented programming was during my time doing data-engineering. Our main focus is to enable our data-scientist to work more efficiently. We'll typically model data in ways we think they'll have to access it.

In OOP you model the data in a way we think of real world objects. In a Data-Oriented Design you're modeling data in a way we're trying to use it. As an example, lets say we have a collection of Dogs

        
          type Dog struct {
            name  string
            age   int
            breed string
          }

          dog1 := Dog{"Bob", 4, "Yorkie"}
          dog2 := Dog{"Rob", 8, "Husky"}
          dog3 := Dog{"Lob", 2, "Dachsund"}

          var AOSdogs []Dog := []Dog{dog1, dog2, dog3}

An Array-of-Structs is the typical way we've learned to organize our data. We breakdown a problem and model the data into what a Dog is. But in a big data setting we're usually working with a large collections of data. I don't actually care about what each individual dog is, I'm trying to analyze the names, age and breed and compare the values amongst themselves. So the more data-oriented approach would be:

      
        type Dogs struct {
          names  []string
          ages   []int
          breeds []string
        }

        names := []string{"Bob", "Rob", "Lob"}
        ages := []int{4, 8, 2}
        breeds := []string{"Yorkie", "Husky", "Dachsund"}

        var SOAdogs = Dogs{names, ages, breeds}

In this example, we've created a Struct-Of-Arrays. We don't have a single representation of a Dog anymore, we're focused on what the data looks like as a group. This way of structuring data is alot more useful when doing operations on many Dogs at one time. I've never dove too deep into micro-architecture, but in general Modern CPU's favor data that is contiguous in memory. This stackoverflow post explains it really well https://stackoverflow.com/a/2021868

Pitfalls I ran into in Gamedev

In the tactics game I'm working on, I have a Grid class which contains info of each cell and any data thats occupying that cell. This includes a pointer to a character which also contains data needed for rendering, animation data, coordinates and spritesheets.

        
          type GridCell struct {
            x0         float32
            y0         float32
            unit       *Unit
            isOccupied bool
          }

          type Character struct {
            pX          float32
            pY          float32
            idleAnim    AnimationData
            spritesheet *ebiten.Image
          }

          type MGrid struct {
            turnState    TurnState
            grid         [GRIDSIZE][GRIDSIZE]GridCell
            pc           PlayerCursor
            Units        []Unit
            selectedUnit *Unit
          }

When I wrote this originally I sort of knew it was a bad idea to pass around pointers but it was hard to break away from the OOP mindset so I did it anyways. It's just more intuitive to think of things as individual objects rather than a group of related data. Games are messy, things will inevitably mutate other things, data is inherently tied to other moving parts within the system. Thinking in the form of individual pieces just does not work nicely. When I needed to move characters around the board the typical way to access Character data was to use indices of the Grid than access the objects attributes. Directly acessing a Character is fast, but introduced common problems associated with pointers, like nil pointers and dangling references. But the main problem I dealt with was implementing a history system. Since this is a tactics game it's important to rollback board states. Theres 2 ways to implement this. For every action, you do the opposite action or the simpler way is to just save the entire board state after every action. The former is more space efficient but harder to implement, the latter takes up more space but easier to implement since it's just the entire Grid saved into an array. For each action I was saving so many copies of the same pointer that didn't actually contain the data needed for history to work. I was running into so many bugs and had to refactor.

One thing I noticed is that I had several ways to access to move positions. I already had an array of characters that I was looping through each frame, while also having code that was accessing characters via the Grid Indices. Having 2 ways of accessing the same code made things incredibly confusing.

        
          type GridCell struct {
            x0     float32
            y0     float32
            CharacterID int
          }

          type Character struct {
            id        int
            pX        int
            pY        int
            rd RenderData
          }

          type MGrid struct {
            turnState    TurnState
            grid         [GRIDSIZE][GRIDSIZE]GridCell
            pc           PlayerCursor
            Units        []*Unit
            selectedUnit int
          }

The changes I made are small but the new one rid of having pointers within each cell. Whenever mutating any piece of data, I need to do that operation within the array of characters. This adds an extra look up step, but for now the characterID matches the indice, so it's O(1). But even if I use a UUID, I could easily switch to a hashtable so lookup will still be fast. These changes made debugging alot easier because of the simplified memory model. My debugger no longer has pointers everywhere, and accessing data is now more consistent since every operation goes through an array.

Possible future changes

So these changes don't completely show the Struct-Of-Array style of modeling data completely. It's more of a mix because I didn't want to drastically change everything all at once when I wasn't completely sure doing ID lookups was the right way to go. But after the refactor it fixed alot of the issues I was dealing with before.

      
        type Characters struct {
          ids    []int
          pX     []int
          pY     []int
          rd     []RenderData
        }

The code above is something I might possibly do in the future. Going back to how I don't really need to care for individual pieces of data, most things need to be done in a batch. Need to draw things to screen? I don't need IDs but I need every position and everyones RenderData. Mutating positions? I need everyones position and ID's but don't need RenderData. Tracking position history? I only need a list of positions. I rarely need to operate on every piece of data at once. Instead, most of my operations focus on specific sets of data. This makes Data-Oriented programming a bit more compelling.

The decision between Struct of Arrays (SoA) and Array of Structs (AoS) comes down to how data is accessed and processed. With AoS, individual data points are more easily accessible and managed as complete units, making it simpler for operations that involve entire structs. However, in scenarios where most operations involve handling specific fields in bulk—such as drawing positions or updating multiple attributes at once—SoA offers better performance and memory efficiency due to improved cache locality.