Speeding up Unitys Vector in lists/dictionaries

By | 2018-01-13

Introduction

With this post I am digging into some performance improvements for Unity 3D’s Vector2, Vector3, Vector4 and Quaternion. The short version is that they really need IEquatable<T>  and can benefit from a better GetHashCode() . I’m demonstrating this with example of how it severely decreased performance in my project.

Adding IEquatable<T> has no side-effects, if is actually best practice and documented. More info can be read at https://docs.microsoft.com/en-us/dotnet/api/system.iequatable-1?view=netframework-4.7.2 Quote from docs:

For a value type, you should always implement IEquatable<T> and override Equals(Object) for better performance. Equals(Object) boxes value types and relies on reflection to compare two values for equality. Both your implementation of Equals(T) and your override of Equals(Object) should return consistent results.

Voxel engine

Christmas, spare time, Unity got support for 32-bit mesh indexes and so thought I would look into writing a voxel rendering engine and optimizing it a bit.

The engine is basically a Minecraft-style block based rendering engine. This is pretty basic stuff. You store volumetric information about the whole world in a format that can be divided up and rendered part by part. A common approach is to use chunks of 16x16x16 (4096) blocks, each consisting of a number which references a mesh model (often a cube) and a texture. Even air is represented as a number.

The trick with voxel rendering is always what you do not render. In a flat world with completely filled ground you would only render the flat surface of the ground, not the air, and not the walls in-between each block underground. So two sides of a block facing each other should not render. In the picture above I have randomized the data (air / solid block) to challenge the engine a bit. My test setup keeps randomizing the data at a very high rate.

The format I chose was the simplest I could think of. Each block has a set of triangles, and each triangle has a direction. This allows me to determine which sides not to render when blocks are adjacent to each other (covering each others sides). Rendering the blocks is a simpe for-loop iterating the chunks volumetric data.

In the initial setup I would generate all meshes (2 per surface) for each side (6 sides) for each cube (32.768 cubes) regardless. This meant uploading 393.216 meshes to the graphics card. Though only triangles and normals actually needed was uploaded, which could be anywhere from 0 to 32.76862. I chose a chunk size of 32x32x32 (32.768 blocks), which put it around 1 chunk every few seconds..

Optimizing: 2-4 chunks/sec

Initial optimization was to only add the meshes actually needed, which worked fine and speed was much better. To optimize it further I used a Dictionary<Vector3, *>  to keep track of existing positions. This allowed me to not add the same position twice, and hence in corner situations share mesh positions in triangles. It’s not an huge save in terms of veritces, but I was interested in seeing the result.

The result was not encouraging: 2-4 chunks per second. Which was interesting. I have tested the speed of Dictionary<TKey, TValue> before and found it to perform very well. Dictionary operates on int that it gets from GetHashCode(), and uses object comparison to sort out collisions. Using Vector3 as key would add a GetHashCode and 3 float compares. This adds a few more CPU instructions to the mix, but its not a major change.

Profiling

So what is happening here? Why is it slow? Luckily these are questions profilers like to answer. And it so just happen to be that I write my projects in a shared setup with a Unit Test runner outside of Unity, so I can run a proper profiler on it.

From this we see that Dictionary is spending 50% of total rendering time. That seems a bit odd. I am using Vector3 as key in the dictionary, so next step is to decompile UnityEngine.dll to have a look at Vector3.

Decompiling Unitys Vector3

Vector3 contains a lot of functions and data. I have extracted the bits relevant to Dictionary. GetHashCode()  is used to get an int which is used for indexing. Since the hashcode can cause collisions (two different objects share same hashcode) Dictionary will compare the objects to see that they are in fact different, then keep a list of them.

Immediately we can see that Vector3 does have GetHashCode, that is nice.

Where is IEquatable<T>?

Vector3 is lacking IEquatable<Vector3> .

Non-generic objects such as ArrayList  that operate on Object-types requires Object.Equals()  to be overridden for structs, or it will fail on .Remove and .Contains. But generic objects such as Dictionary<TKey, TValue> , HashSet<T>  and List<T>  require both GetHashCode()  (except for List<T> )and IEquatable<T> . Lacking this implementation, Dictionary is forced to use object.Equals, which as we see is an “object” type. Passing a struct (value type) as parameter to an object type causes boxing, which in turn produces work for garbage collector.

This means that any List<T>.Contains() , List<T>.IndexOf() , List<T>.Remove() , Dictionary<TKey, TValue>  and HashSet<T>  operations would generate objects that GC have to clean up. In the case of list, one object for every Equals-operation. This meaning in some cases one object for every item in the list for every operation. That is pretty bad.

Tip! You can implement a IEqualityComparer<T>  and pass it to your strongly typed objects such as Dictionary<TKey, TValue>(IEqualityComparer)  to work around this problem.

Testing with IEquatable<T>: 100 chunks/sec

So I might have a clue as to why things are slower than expected. I  implement a simple FasterVector3 which I just initialize from a Vector3. I get as slight overhead on copying the data into a new object, but still, lets see…

So that took us from a variable 2-4 chunks per second up to a fairly stable ~100 chunks per second. It was variable before because GC was working “randomly”, while now we have zero GC. I will get back to proof of that a bit later.

 

Profiling looks slightly better, but not quite what I expected. We have solved the GC-problem, which gave us a fairly good speed increase. But Dictionary time only decreased by around 10 percent point. The next suspect on my list is GetHashCode() .

GetHashCode()

The purpose of GetHashCode()  is to provide an Int32 hash of the object. Although this is not an unique representation of the object, it gives lookup objects such as Dictionary<TKey, TValue>  and HashSet<T>  something to index on. If two different objects return the same hashcode we have whats called a collision, forcing Dictionary to revert into a flat list mode of operation on the colliding objects. Immediately I can see that I don’t like that >>2 bitshift, which causes loss of precision. But these things are a bit tricky, and a more extensive test setup is required.

It should be noted that the GetHashCode()  call on a float returns the 32-bit raw value of the float, which is a very good representation of the value (0 collisions on a single float). Running a quick test of Unitys implementation shows that as long as you keep the values low (-/+ 4 million of 0) then there are no collisions on Z<<2, but once you get past that similar numbers start colliding. Meaning edge cases is a problem, the further from 0 you get the bigger the problem. Collision on 10000000: Hashcode 314975648
Collision on 10000001: Hashcode 314975648
Collision on 10000002: Hashcode 314975648
Collision on 10000003: Hashcode 314975648
Collision on 10000004: Hashcode 314975649
Collision on 10000005: Hashcode 314975649
Collision on 10000006: Hashcode 314975649
Collision on 10000007: Hashcode 314975649
Collision on 10000008: Hashcode 314975650
Collision on 10000009: Hashcode 314975650
Collision on 10000010: Hashcode 314975650
Collision on 10000011: Hashcode 314975650

In my use case a single chunk would never be that large, so I would never encounter these limits. But collisions don’t show the whole picture of the internal mechanics of indexing, we need proper tests to experiment further.

How about precomputing GetHashCode()?

To experiment I added caching of the hashcode to my FasterVector3. When created hashcode is computed, and after that it is served from memory. That seems to improve my use case a bit, but at the cost of memory. My Vector3 now went from 3×4 (12) to 4×4 (16) bytes, taking up 33% more memory, memory bandwidth and CPU cache. I also short-circuited Equals(Vector3)  to check hashcode first, saving me a few cycles.

Running benchmarks

So far I have used Unity to tell me how many chunks per second I can render. Initially at 2-4 chunks per second this was an easy way, but as the speed increased it became difficult to get a good number because other factors started affecting result. For example how many chunks was on the screen in total, render queue machanism, etc.. But the tests showed that further investigation is warranted.

So I set up a test project with BenchmarkDotNet and isolated the components in question. I had so far looked at Vector3, but from decompiling Vector2, Vector4 and Quaternion I can see the same most likely applies.

The tests are fairly basic. I set up a radius of 50, meaning 100x100x100 around 0 and add the vectors to a dictionary.

Since I wanted to test the actual performance with some changes to GetHashCode I also run the tests on an object similar to Unitys Vector3 where the only change I’ve made is to the hashcode, as well as calling GetHashCode directly on these.

The new and improved Vector3 looks like this.

With the helper object:

The results: 50x-100x improvement in speed

From this we can see that “GetHashCode() using Vector3.Equals(object)” causes allocation, and hence some work for garbage collect. Using IEquatable<T>  consistently avoids this. Note that all Tedd:Vector3.Equals(object) are only there for reference to see differences.

For Vector4 and Quaternion the result is even better, at almost 150x increase in speed.

Though isolated, my implementation of GetHashCode() was slower than Unitys, tests with only GetHashCode improvement shows its faster when used in Dictionary.

While I was at it I noticed that ToString could benefit from a small change too. Normally we don’t print Vector3.ToString() so often, but what we may not realize is that if included in a log-line it usually gets executed even if logging is disabled. And besides, why not make it faster? Though this one only gave us a 1,6x speed increase.

For List<T> with 33% of adds last item is immediately removed after being added we see the same problem with garbage collection as well as some increase in speed. Lists are then ~2-4,5x faster.

Conclusion

For my specific use case where I am optimizing meshes before rendering I saw a dramatic performance increase. Not only did the code overall execute 25 times faster, but it was a lot smoother with less garbage collection. These are fixes that would be quick for Unity to test and include in a future release.

I set up isolated tests and ran benchmarks on both Mono and .Net with LegacyJit, RyuJit and Llvm. Though I have not published the results of all 16 tests with ~15 subtests the result is consistent on all of them.

In the isolated use of Dictionary<TKey, TValue>  with IEquatable<T>  we see a performance increase of 2.3 to 4.3x depending on what Jit and platform it runs on. This test has very little searching/removing, so I will re-visit it with better test cases.

In the isolated use of List<T> with IEquatable<T>  and new GetHashCode()  we see a performance increase of 50x to 100x depending on what Jit and platform it runs on.

My implementation of GetHashCode()  was slower than Unitys implementation by itself, but was faster in actual use.

Benchmark code (with results)

Remember not execute code from random strangers on the internet. I provide you code with result for reference.

ZIP-source with results

(PS! I made TeddVector3 x,y,z readonly … Should not be. Also missing overload for == and !=.)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.