What are the implications of boxing/unboxing? How can we decrease heap memory allocation? Read on to find out.
In this post, we will look at how integral Scala types such as
Long are represented down to the bytecode level. This will help us understand what the performance effects of using them in generic classes are. We will also explore the functionalities that the Scala compiler provides us for mitigating such performance penalties. Furthermore, we will take a look at concrete benchmark results and convince ourselves that boxing/unboxing can have a significant effect on the latency of an application.
Scala has a tremendously powerful type system that allows us to express complex concepts. In addition, language and compiler designers have put a significant amount of engineering effort into making sure that the expressiveness of the language does not come at the expense of its performance. Furthermore, many lessons learned in the past (primarily from Java) have been put into practice. The result is a language with a logical and coherent type system. One clear example of that is the fact that Scala does not have the concept of primitive types in the same way Java does. Most of us know that Java defines 8 primitive types -
byte, short, int, long, float, double, char and boolean. One of the benefits of working with the so-called primitive types is that they are not instantiated as objects and most of the time can live entirely on the stack. This has a number of important implications, probably the most notable being that the cost of object instantiation and subsequent garbage collection is reduced. This is particularly important in systems that need to be highly performant and provide predictable latency characteristics. Garbage collection cycles often incur what is known as “stop the world” events, which can significantly reduce the predictability of a system running on the JVM. This is why working with primitive types in the critical sections of latency-sensitive systems is usually preferred.
Unfortunately, a logical gap exists in Java’s type system. Because primitive types cannot be contained in generic containers (a consequence of the way generics are implemented in Java, through type erasure), a wrapper type exists for each primitive type. For example, we have the type
int which has the wrapper type
Integer. And because generic collection types such as
List<T> can only hold wrapper types, every time we try to add an
int type to a
List<Integer>, we end up converting the primitive to a wrapper object automatically. This process happens transparently and is called autoboxing. The problem with autoboxing is that every time we do this conversion a new instance of
Integer (or whatever the wrapper class might be) is created. This instance, however, does not live on the stack but is instead allocated on the heap, which comes with all the performance implications mentioned earlier.
Scala takes a different approach and does not partition its type system into primitive and wrapper types. After all, this is part of the reason why its type system is referred to as a “Unified Type System”. Instead, all value types (Int, Double, Long, etc.) are defined as classes that extend
AnyVal. Upon compilation, the generated bytecode represents them as primitive types, thereby avoiding heap memory allocation. This is indeed a clever way to provide a unified view of types while still preserving the benefits that JVM primitives provide. Unfortunately, however, when it comes to using these
AnyVal types as generic type arguments, things cannot be optimized that easily.
If we have a class with a type argument, and we parametrize it with any of Scala’s value types like
Long, we are faced with the problem of boxing. The bytecode produced will actually create a wrapper object containing the primitive value and allocate it on the heap instead of keeping it only on the stack. Let’s look at an example. Imagine you want to model a rook that can occupy a position on a chess board. A piece of code that places a rook on every square of a 2048 by 2048 board might look like:
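The original snippet is not reproduced in this excerpt; a minimal sketch of what such code might look like (the class and object names are assumptions):

```scala
// Non-generic rook: both coordinates are plain Ints.
case class Rook(x: Int, y: Int)

object PlacingRooks {
  val size = 2048

  // Place a rook on every square of a size x size board.
  def fillBoard(): IndexedSeq[Rook] =
    for {
      x <- 0 until size
      y <- 0 until size
    } yield Rook(x, y)
}
```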
Once compiled, all of the
Int objects will be treated as primitive
int types and will in fact live on the stack, thereby not occupying any heap space. To prove that to ourselves, let’s look at the bytecode that the Scala compiler generates.
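The original listing is omitted here; a rough, illustrative sketch of the relevant part of the javap -c output (exact constant-pool indices and local-variable slots will differ) could look like:

```
// illustrative only: creating one Rook per square, with both
// coordinates staying primitive ints on the operand stack
new           PlacingRooks$Rook
dup
iload_2       // x as a primitive int
iload_3       // y as a primitive int
invokespecial PlacingRooks$Rook."<init>":(II)V
```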
This is the code which gets executed when the for expression yields each generated square. The code is called 2048 * 2048 times. Looking at the operations, it is clear that we do not allocate any heap objects for the passed coordinates.
Imagine now however that we want to modify our code a bit in order to allow the developer who uses our
Rook class to choose any type for the coordinates x and
y. Perhaps the chess board is so big that its width and height do not fit in the 32 bits that are available in an
Int. Or perhaps someone wants to use a custom type to express the position. An easy way to provide that functionality is to modify our code to use generics.
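The generic snippet is not reproduced in this excerpt; a minimal sketch, assuming the same names as before:

```scala
// Generic rook: the coordinate type is left to the caller.
case class Rook[T](x: T, y: T)

object PlacingRooks {
  val size = 2048

  def fillBoard(): IndexedSeq[Rook[Int]] =
    for {
      x <- 0 until size
      y <- 0 until size
    } yield Rook(x, y) // each Int is boxed to a java.lang.Integer here
}
```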
This provides us with the flexibility to choose our own representation of the x and
y coordinates. However, flexibility comes at a certain price:
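The listing itself is not shown here; an illustrative sketch of the boxing calls the text refers to (simplified from real javap -c output) might be:

```
// illustrative only: both coordinates are boxed before the call,
// because the generic constructor takes Object arguments
new           PlacingRooks$Rook
dup
iload_2
invokestatic  scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
iload_3
invokestatic  scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
invokespecial PlacingRooks$Rook."<init>":(Ljava/lang/Object;Ljava/lang/Object;)V
```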
If we take a look at the generated bytecode, we can clearly see that we are calling
BoxesRunTime.boxToInteger for each of the position arguments when creating a
Rook[Int]. If we ignore the small pool of cached Integer values that boxing can reuse, and do a bit of simple math, we can convince ourselves that for a board of size 2048 in each direction we are creating 2048 * 2048 * 2 = 8388608 wrapper objects on the heap. Each of these objects incurs a number of performance penalties, which can prove quite harmful in a latency-sensitive system.
If we really want to be sure that the latter is true, we can delve deeper and profile the two programs. Let’s look at the state of the heap for the two versions of our little Rook placement program. In the generic version, when we map the heap with
jmap -histo, we observe the following:
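The raw histogram is not reproduced here; a trimmed sketch of the kind of output jmap -histo prints, keeping only the Integer row discussed below (all other rows elided), would be:

```
 num     #instances         #bytes  class name
----------------------------------------------
   1:       8827741      141243856  java.lang.Integer
   ...
```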
We have allocated 8827741 Integer objects that live on the heap, which amounts to 141.2 MB of data. Not a huge memory footprint in itself, but you can imagine how this can grow tremendously for applications dealing with much more data. As a result, we might incur many more GC pauses and degrade the overall performance of the application. In comparison, using the non-generic version of our program, which works only with Int types, we have significantly fewer
Integer objects living in the heap memory:
To put it simply, we have decreased the memory footprint of the
Integer objects from 141.2 MB to 27.7 MB. Of course, there is some additional Integer allocation coming from the Scala collections framework itself and from the fact that we are using for comprehensions. This, however, is not the topic of the discussion at hand, so we ignore it. If you want to dig deeper, you can look at the full bytecode representation and try to understand for yourself what is really happening under the hood when we use generators such as
start to end or
start until end.
One possible solution is to provide separate implementations for all the different types that we need; for instance, an IntRook and a
LongRook. This will solve our performance problems and will allow the runtime to avoid boxing the primitive types into wrappers. Although this might sound like a plausible solution, it comes at the price of bloating the code base with a number of classes that are largely the same. All in all, it does not seem like the best solution in practice. Thankfully, the Scala compiler provides us with a feature called specialization. What this allows us to do is mark the T type parameter as
@specialized. Upon compilation, this will generate concrete implementations for all the primitive Scala value types and will avoid the boxing problem. Following is a code snippet that uses the
@specialized annotation to do that.
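A minimal sketch of the specialized version, assuming the same Rook shape as before:

```scala
// Specializing T for Int only: besides the generic class, the compiler
// also emits an int-based variant (named Rook$mcI$sp in the bytecode)
// that keeps both coordinates as primitive ints.
case class Rook[@specialized(Int) T](x: T, y: T)
```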
Note that we specialize just for
Int. This means that an Int version of the class will be generated by the compiler and will be used whenever we parametrize our Rook instance to use positions of type
Int. If we omit the type parameter of the specialized annotation and just use
@specialized, the compiler will generate additional class versions for all primitive value types, such as
Rook[Boolean], etc. To convince ourselves that this is what happens, we can take a look at the bytecode representation:
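The listing is not reproduced here; an illustrative sketch of the relevant javap -c excerpt (exact offsets and slots will differ) could be:

```
// illustrative only: the Int-specialized subclass is instantiated,
// and its constructor takes primitive ints, so nothing is boxed
new           PlacingRooks$Rook$mcI$sp
dup
iload_2
iload_3
invokespecial PlacingRooks$Rook$mcI$sp."<init>":(II)V
```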
We can see that in the preceding sequence of bytecode ops the class that is getting instantiated is in fact
PlacingRooks$Rook$mcI$sp, which is the Int-specialized version of our Rook case class. The other observation that can be made is that there is no sign of any boxing, as the class is compiled to work directly with
int primitive types.
As already mentioned, boxing and unboxing can have quite a significant impact on the performance of a piece of code. It mainly has two performance hits: the cost of allocating the wrapper objects on the heap, and the increased garbage collection pressure caused by those extra objects.
To put that into more concrete perspective, the code was benchmarked using JMH, running 20 measurement iterations after 10 warm-up iterations. What was measured was the operation of filling a board of size 2048 by 2048 with
Rook objects. The average time to complete the task is measured using the specialized and the non specialized version of the class:
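The benchmark code itself is not shown in this excerpt; a sketch of a JMH harness matching the described setup (class and method names are assumptions) could look like:

```scala
import org.openjdk.jmh.annotations._

// Two variants of the same class, so both can live in one benchmark.
case class BoxedRook[T](x: T, y: T)
case class FastRook[@specialized(Int) T](x: T, y: T)

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@Warmup(iterations = 10)
@Measurement(iterations = 20)
class RookBenchmark {
  val size = 2048

  // Generic version: every Int coordinate is boxed.
  @Benchmark
  def fillGeneric(): IndexedSeq[BoxedRook[Int]] =
    for { x <- 0 until size; y <- 0 until size } yield BoxedRook(x, y)

  // Specialized version: coordinates stay primitive ints.
  @Benchmark
  def fillSpecialized(): IndexedSeq[FastRook[Int]] =
    for { x <- 0 until size; y <- 0 until size } yield FastRook(x, y)
}
```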
The results speak for themselves. As we can see, the non-specialized version of the code is over 65% slower. Bear in mind that this measures only the average time; there are other profound performance implications that stem from maintaining a much larger heap.
There are quite a few types in the standard library that can make use of specialization in order to avoid autoboxing when instantiated. One such type is the Tuple. Most of us already know that there are 22 tuple classes to cover the range of arities needed (e.g.
Tuple2[+T1, +T2]). Below are the declarations of the Tuple classes for arity 1 and arity 3.
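Paraphrased from the Scala 2 standard library sources (the toString overrides in their bodies are omitted here):

```scala
// Tuple1 is specialized, but only for Int, Long and Double.
final case class Tuple1[@specialized(Int, Long, Double) +T1](_1: T1)
  extends Product1[T1]

// Tuple3 is not specialized at all.
final case class Tuple3[+T1, +T2, +T3](_1: T1, _2: T2, _3: T3)
  extends Product3[T1, T2, T3]
```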
An interesting observation is that Tuple1 is specialized but only for Int, Long and Double, while Tuple3 is not specialized at all. If you peek at the source code of other standard library classes, you will see that
@specialized has been used quite conservatively. The reason for that is simple - the number of generated class files grows exponentially as the number of type parameters increases. Take Tuple3, for example: if it were specialized for all 9 primitive types plus 1 reference type, that would result in 10^3 = 1000 generated classes. It is pretty clear that this quickly becomes unmanageable as the number of type parameters grows. For that reason, one needs to be careful when using
@specialized, as excessive specialization can have a significant impact on compile times and bytecode size.
There are considerable performance penalties introduced by the JVM whenever primitive types and generics are combined. This is mainly due to design decisions embedded into the virtual machine itself. Although Scala does a great job of papering over some of Java’s problems by providing a Unified Type System, the performance of Scala applications can still suffer from excessive heap memory allocation. This is why Scala provides the
@specialized annotation, to be used as a hint to the compiler that generation of more specific classes from the generic ones is needed. Although this increases the size of the compiled artifacts, it can result in a significant performance boost when used appropriately.