Reference: http://java-performance.info/primitive-types-collections-trove-library/
This article describes the Trove library, which contains a set of primitive collection implementations. It covers the latest version of Trove (3.1a1 at the time of writing).

Why should you use Trove? Why not keep using the well-known JDK collections? The answer is performance and memory consumption. Trove doesn't use any java.lang.Number subclasses internally, so you don't have to pay for boxing/unboxing each time you pass a primitive value to (or query one from) a collection. Besides, you don't waste memory on the boxed numbers (24 bytes for a Long or Double, 16 bytes for smaller types) and the references to them. For example, to store an Integer in a JDK map you need 4 bytes for a reference (or 8 bytes on huge heaps) plus 16 bytes for the Integer instance. Trove, on the other hand, uses just 4 bytes to store an int. Unlike java.util.HashMap, Trove also doesn't create a Map.Entry for each key-value pair, further reducing the map memory footprint. For sets, it doesn't use a Map internally, just keeping the set values.
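A quick plain-JDK illustration of the boxing cost mentioned above: every autoboxed Integer outside the cached -128..127 range is a distinct object, which is exactly the per-element overhead Trove avoids:

```java
// Demonstrates that autoboxing allocates separate Integer objects for
// values outside the Integer cache (-128..127), while cached values are shared.
public class BoxingDemo {
    public static void main(String[] args) {
        Integer a = 1000, b = 1000;      // two separate boxed objects
        Integer c = 100, d = 100;        // both come from the Integer cache
        System.out.println(a == b);      // prints false (different objects)
        System.out.println(a.equals(b)); // prints true  (same numeric value)
        System.out.println(c == d);      // prints true  (cached instance reused)
    }
}
```

The `a == b` result is why boxed keys must always be compared with equals; a primitive int in a Trove collection has neither the extra object nor this pitfall.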
There are 3 main collection types in Trove: array lists, sets and maps. There are also queues, stacks and linked lists, but they are less important (and instances of these collections usually tend to be rather small).
Array lists
All array lists are built on top of an array of the corresponding data type (for example, int[] for TIntArrayList). There is a small issue you should deal with: the value used for absent elements (the default value). It is zero by default, but you can override it using:

```java
public TIntArrayList(int capacity, int no_entry_value);
public static TIntArrayList wrap(int[] values, int no_entry_value);
```
There are two useful methods called getQuick/setQuick – they access the underlying array without any additional checks. As a side effect they allow access to elements between the list size and capacity (don't rely on this too much – it is still better to add values legally and to getQuick values only while you are inside the array list boundaries).
Each Trove array list has several helper methods which implement the java.util.Collections functionality:

```java
public void reverse()
public void reverse(int from, int to)
public void shuffle(java.util.Random rand)
public void sort()
public void sort(int fromIndex, int toIndex)
public void fill(int val)
public void fill(int fromIndex, int toIndex, int val)
public int binarySearch(int value)
public int binarySearch(int value, int fromIndex, int toIndex)
public int max()
public int min()
public int sum()
```
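For readers without Trove at hand, java.util.Arrays offers rough analogues of several of these helpers on the plain int[] that a TIntArrayList wraps (a sketch, not Trove API):

```java
import java.util.Arrays;

// JDK analogues of the list helpers above, applied to a plain int[].
public class HelpersDemo {
    public static void main(String[] args) {
        int[] a = { 5, 3, 1, 4, 2 };
        Arrays.sort(a);                                // like TIntArrayList.sort()
        System.out.println(Arrays.binarySearch(a, 4)); // prints 3
        Arrays.fill(a, 0, 2, 9);                       // like fill(fromIndex, toIndex, val)
        System.out.println(Arrays.toString(a));        // prints [9, 9, 3, 4, 5]
    }
}
```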
Trove array lists support three iteration types:
- By index – use any loop from index = 0 to list.size() - 1. This approach is useful when you need to know the index of the current element.
- Using an iterator: there are several iterators defined for each primitive type. This one seems to be the least useful because it doesn't provide any new functionality compared to the other methods.
- Using the functional approach – a user-defined method implementing one of the Trove interfaces is called for each value in the list.
The first two iteration methods are familiar to Java developers, so let's take a closer look at the last one. There are two iteration methods defined for Trove array lists (example for TIntArrayList):

```java
public boolean forEach( TIntProcedure procedure );
public boolean forEachDescending( TIntProcedure procedure );
```

The first method iterates over all elements from the beginning of the list to the end, the second from the end to the beginning. The implemented method should usually return true; if it returns false, the iteration stops. That's how one could print all elements in the array list (yes, too verbose compared to Scala...):

```java
lst.forEach( new TIntProcedure() {
    @Override
    public boolean execute( final int value ) {
        System.out.println( value );
        return true;
    }
});
```

There is another pair of methods:
```java
public TIntList grep( TIntProcedure condition );
public TIntList inverseGrep( TIntProcedure condition );
```

They filter the array list: the first method accumulates all values for which the implemented method returns true, the second does the opposite, accumulating all values for which false was returned.

This approach has one disadvantage – if you need to update a primitive variable from these methods, you'll have to use one of the Mutable* classes from the Apache Commons Lang package or implement one yourself:

```java
private static final class MutableInt {
    public int value;
}
```

You have to use such a class because all local variables accessed from inner classes must be final, so you can't update a primitive type variable; instead you update a non-final field of a (final) holder object.
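Here is a self-contained sketch of that holder pattern. TIntProcedure is replaced by a hypothetical plain-Java IntProcedure interface and forEach by a loop over an int[], so the example runs without Trove on the classpath:

```java
// Sketch of the mutable-holder pattern, with plain-Java stand-ins for
// Trove's TIntProcedure and TIntArrayList.forEach.
interface IntProcedure {
    boolean execute(int value);
}

final class MutableInt {
    public int value;
}

public class HolderDemo {
    // Mimics TIntArrayList.forEach: stops early when the procedure returns false.
    static boolean forEach(int[] values, IntProcedure p) {
        for (int v : values) {
            if (!p.execute(v)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        final MutableInt sum = new MutableInt();
        forEach(new int[]{1, 2, 3, 4}, new IntProcedure() {
            @Override
            public boolean execute(int value) {
                sum.value += value; // legal: 'sum' is final, its field is not
                return true;
            }
        });
        System.out.println(sum.value); // prints 10
    }
}
```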
The for*-methods are also IDE friendly – most IDEs support autocompletion of method parameters, including interface implementations. You simply need to type forEach( new and press your IDE's autocompletion key combination.

Sets
Trove primitive sets use one array to store the set values (using open addressing) and one more array of bytes to track each cell's status: free, used or deleted. Besides, there are THashSet<E> and TLinkedHashSet<E>, which can replace java.util.HashSet and java.util.LinkedHashSet from the JDK. These object sets don't use a separate array of bytes to keep the slot state (unlike the primitive sets).

Both JDK sets are based on HashMap<E, Object>, where the Object values are just cell state flags. HashMaps, in turn, are based on an array of Map.Entry. As a result, a JDK HashSet uses three Objects per stored value. This is especially expensive when you need to store primitive values in a set. JDK LinkedHashSet is built on top of HashSet by adding doubly linked references to each map entry to keep the order in which records were inserted.

Let's see how much memory a set of 10M integers consumes (in all cases we create a set with an initial capacity of 10M elements and a fill factor of 0.75, which is the default fill factor of JDK maps/sets):
| JDK HashSet&lt;Integer&gt; | JDK LinkedHashSet&lt;Integer&gt; | Trove THashSet&lt;Integer&gt; | Trove TLinkedHashSet&lt;Integer&gt; | Trove TIntHashSet |
| --- | --- | --- | --- | --- |
| 521M | 598M | 207M | 258M | 69M |

Here is the memory consumption test code:
```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

import gnu.trove.set.TIntSet;
import gnu.trove.set.hash.THashSet;
import gnu.trove.set.hash.TIntHashSet;
import gnu.trove.set.hash.TLinkedHashSet;

public class TroveTests {
    private static final int SIZE = 10 * 1000 * 1000;
    private static final float FILL_FACTOR = 0.75f;

    public static void main(String[] args) {
        testJdkHashSet();
        testJdkLinkedHashSet();
        testTHashSet();
        testTLinkedHashSet();
        testTIntHashSet();
    }

    private static long getUsedMemory() {
        System.gc();
        return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    }

    private static void testJdkHashSet() {
        final long start = getUsedMemory();
        final Set<Integer> set = new HashSet<>( SIZE, FILL_FACTOR );
        for ( int i = 0; i < SIZE; ++i )
            set.add( i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "HashSet<Integer> = " + size + 'M' );
        if ( set.size() == 1 ) System.out.println( 1 ); // keeps 'set' reachable
    }

    private static void testJdkLinkedHashSet() {
        final long start = getUsedMemory();
        final Set<Integer> set = new LinkedHashSet<>( SIZE, FILL_FACTOR );
        for ( int i = 0; i < SIZE; ++i )
            set.add( i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "LinkedHashSet<Integer> = " + size + 'M' );
        if ( set.size() == 1 ) System.out.println( 1 );
    }

    private static void testTHashSet() {
        final long start = getUsedMemory();
        final Set<Integer> set = new THashSet<>( SIZE, FILL_FACTOR );
        for ( int i = 0; i < SIZE; ++i )
            set.add( i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "THashSet<Integer> = " + size + 'M' );
        if ( set.size() == 1 ) System.out.println( 1 );
    }

    private static void testTLinkedHashSet() {
        final long start = getUsedMemory();
        final Set<Integer> set = new TLinkedHashSet<>( SIZE, FILL_FACTOR );
        for ( int i = 0; i < SIZE; ++i )
            set.add( i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "TLinkedHashSet<Integer> = " + size + 'M' );
        if ( set.size() == 1 ) System.out.println( 1 );
    }

    private static void testTIntHashSet() {
        final long start = getUsedMemory();
        final TIntSet set = new TIntHashSet( SIZE, FILL_FACTOR, -1 );
        for ( int i = 0; i < SIZE; ++i )
            set.add( i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "TIntHashSet = " + size + 'M' );
        if ( set.size() == 1 ) System.out.println( 1 );
    }
}
```

There is no difference between Java 7 and 8 in this test. Trove THashSet uses a simple array of Objects to keep the set. TLinkedHashSet adds a TIntList to keep the order of records. So it makes sense to use Trove Object sets as a replacement for the JDK sets, because they implement the Collection, Set and Iterable interfaces.

Note that for a set of integral values (short/int/long) it may be worth replacing such a set with a bit set. Read the "Bit sets" article for more details.

Let's return to primitive type sets.
The addAll, containsAll, removeAll and retainAll methods are overloaded to accept java.util.Collection as well as TTypeCollection (where Type is any of the primitive types) and primitive type arrays as arguments, providing compatibility with JDK collections, Trove collections and lower level code.

Maps
Trove maps can be divided into 4 groups:

- Maps from one primitive type to another.
- Maps from objects to primitive types and vice versa.
- A java.util.HashMap replacement called THashMap.
- Maps from objects to primitive types/objects which use a user-defined hashing strategy.
All primitive value maps are built on top of three arrays – one for keys, another for values and the last one for cell states (free, occupied, deleted). Object value maps need only two arrays – keys and values. All of them use open addressing to determine the position of a key in a map. How does it work?

A key's hash code is calculated and taken modulo the internal array length. That is the initial position of the search. After that, some prime number is added to this position (again modulo the array length) to find the next probe position. Any other function could also be applied to compute the new position. Eventually an empty cell is found and the new element is inserted at that position. The same logic is used to find a key in the map; the only difference is that the search key must be compared to the existing key at each step.
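As an illustration (not Trove's actual implementation), here is a minimal open addressing sketch in plain Java with a fixed prime probe step, showing the insert and lookup logic described above:

```java
import java.util.Arrays;

// Minimal open-addressing int set: probe from hash(key) % capacity in
// fixed prime-sized steps until the key or a free slot is found.
// Demo only: no deletion, no resizing, and it assumes the table never fills up.
public class OpenAddressingDemo {
    private static final int FREE = Integer.MIN_VALUE; // sentinel for empty slots
    private final int[] keys;
    private final int step = 7; // prime step; must be coprime with the capacity

    public OpenAddressingDemo(int capacity) {
        keys = new int[capacity];
        Arrays.fill(keys, FREE);
    }

    // Probe until we find the key or an empty slot.
    private int probe(int key) {
        int pos = Math.abs(Integer.hashCode(key)) % keys.length;
        while (keys[pos] != FREE && keys[pos] != key) {
            pos = (pos + step) % keys.length; // next probe position
        }
        return pos;
    }

    public void add(int key)          { keys[probe(key)] = key; }
    public boolean contains(int key)  { return keys[probe(key)] == key; }
}
```

For example, in a table of capacity 16 the keys 42 and 58 both hash to slot 10; the second insert walks the probe sequence to the next free slot, and lookups retrace the same path.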
This means that if your keys implement a sufficiently good hashCode method and your map fill factor is not too close to 1 (by too close I mean anything above 0.9), the cost of hash map operations is expected to be O(1). Keep in mind that the hashCode method should also be sufficiently fast, otherwise you risk spending more time calculating hash codes than working with the hash map.

Methods called keys, keySet, valueCollection and values can be used to iterate over map keys and values, but it is better to use either an iterator or the functional approach: forEachEntry/Key/Value. Both iterators and callbacks are roughly equally fast, but use an iterator if you need to throw an exception or accumulate some scalar value. Otherwise use any of the forEach* methods – they produce more structured code.

You should also keep in mind that any Trove map/set method returning an array has to allocate that array (which costs CPU and RAM). On the other hand, methods returning collections or iterators return thin wrappers over the Trove sets/maps.
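The same wrapper-versus-array distinction exists in the JDK and is easy to demonstrate: keySet() is a live view over the map, while toArray() materialises (allocates) a snapshot:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// keySet() is a thin view that follows later map changes;
// toArray() copies the current contents into a fresh array.
public class ViewDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        Set<String> view = map.keySet();     // thin wrapper, no copying
        Object[] snapshot = view.toArray();  // allocates a new array
        map.put("b", 2);
        System.out.println(view.size());     // prints 2 - the view follows the map
        System.out.println(snapshot.length); // prints 1 - the array does not
    }
}
```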
Trove map iterators are slightly unusual. The main difference from JDK iterators is that two fields are accessible at each step – a key and a value. That's why you can't implicitly advance to the next entry by calling a next method. Instead, there is an advance method, whose sole purpose is to move to the next map entry. Keys and values are accessible via the key and value methods. The current entry can be removed from the map by the remove method, and its value can be updated by the setValue method. The Trove JavaDoc provides perfect examples:

```java
// accessing keys/values through an iterator:
for ( TDoubleIntIterator it = map.iterator(); it.hasNext(); ) {
    it.advance();
    if ( satisfiesCondition( it.key() ) ) {
        doSomethingWithValue( it.value() );
    }
}

// modifying values in-place through iteration:
for ( TDoubleIntIterator it = map.iterator(); it.hasNext(); ) {
    it.advance();
    if ( satisfiesCondition( it.key() ) ) {
        it.setValue( newValueForKey( it.key() ) );
    }
}

// deleting entries during iteration:
for ( TDoubleIntIterator it = map.iterator(); it.hasNext(); ) {
    it.advance();
    if ( satisfiesCondition( it.key() ) ) {
        it.remove();
    }
}

// faster iteration by avoiding hasNext():
TDoubleIntIterator iterator = map.iterator();
for ( int i = map.size(); i-- > 0; ) {
    iterator.advance();
    doSomethingWithKeyAndValue( iterator.key(), iterator.value() );
}
```

New map methods
Trove collections are not plain vanilla collections – there is always something new around. All examples are given for TDoubleIntMap.

First of all, there is

```java
int putIfAbsent( double v, int i )
```

which has already proved useful in ConcurrentHashMap. It allows the user to get rid of the containsKey call in

```java
if ( !map.containsKey( key ) )
    map.put( key, value );
```

making only one map lookup instead of two.
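The same single-lookup idiom is available for boxed JDK maps via Map.putIfAbsent (Java 8+); Trove's primitive version additionally avoids the boxing shown here:

```java
import java.util.HashMap;
import java.util.Map;

// putIfAbsent stores the value only when the key is missing and
// returns the previous value (or null) in a single lookup.
public class PutIfAbsentDemo {
    public static void main(String[] args) {
        Map<Double, Integer> map = new HashMap<>();
        map.putIfAbsent(1.0, 10);                // key absent: stored, returns null
        Integer prev = map.putIfAbsent(1.0, 99); // key present: map unchanged
        System.out.println(prev);                // prints 10
        System.out.println(map.get(1.0));        // prints 10
    }
}
```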
The next useful method is

```java
public int adjustOrPutValue( double key, int adjust_amount, int put_amount )
```

If you are building a frequency map (counting the occurrences of values in some collection), you usually write something like:

```java
final double[] array = new double[ 100 ];
/* ... array is filled here ... */
final TDoubleIntMap map = new TDoubleIntHashMap( 100 );
for ( final double key : array ) {
    final int cnt = map.get( key );
    map.put( key, cnt + 1 );
}
```

With Trove maps this can be simplified to:

```java
for ( final double key : array ) {
    map.adjustOrPutValue( key, 1, 1 );
}
```

There are two more similar methods:
```java
public boolean increment( double key )
public boolean adjustValue( double key, int amount )
```

They update the value for the given key, but they won't set it initially. So if the key is in the map, its value is updated and both methods return true; otherwise nothing happens and false is returned. That's why they are not the best candidates for the previous example.

The last but not least useful method is

```java
boolean retainEntries( TDoubleIntProcedure procedure )
```

If you need to remove some entries from a map, this is the right method for the job. Just provide a predicate which returns true for all entries to be retained and false for all other entries. The only small disadvantage of this method is that it doesn't compact the map after cleaning, so if you expect to remove a lot of entries, call

```java
public void compact()
```

afterwards to shrink and rehash the map.
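For comparison, Java 8 JDK maps offer rough boxed analogues of adjustOrPutValue and retainEntries: Map.merge and entrySet().removeIf. A sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class MapMethodsDemo {
    public static void main(String[] args) {
        // Frequency map via Map.merge: store 1 if absent, otherwise add 1 -
        // the boxed analogue of adjustOrPutValue(key, 1, 1).
        double[] array = { 1.5, 2.5, 1.5, 1.5 };
        Map<Double, Integer> freq = new HashMap<>();
        for (double key : array) {
            freq.merge(key, 1, Integer::sum);
        }
        System.out.println(freq.get(1.5)); // prints 3

        // retainEntries analogue: removeIf deletes what the predicate matches,
        // so the condition is inverted relative to Trove's "retain" semantics.
        freq.entrySet().removeIf(e -> e.getValue() < 2);
        System.out.println(freq.size()); // prints 1
    }
}
```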
Hashing strategies (foreign classes, arrays)
Maps usually rely on their keys overriding the equals and hashCode methods. This is fine as long as you can modify the key class. But in some cases you can't: either you aren't allowed to modify the class, or you want to use different hashing/comparison logic in some special cases. For example, if there is a set of strings representing some sort of ids and all these strings start with a common prefix (a state code, for example), you may want to skip this prefix in order to improve the hashCode quality.

Another case is arrays. By default they use the implementations inherited from Object. java.util.Arrays provides implementations of these two methods for all array types (all primitive types and Object). So, in order to use these helper methods, we need a custom hashing strategy:

```java
final TObjectIntCustomHashMap<String[]> map = new TObjectIntCustomHashMap<String[]>(
    new HashingStrategy<String[]>() {
        @Override
        public int computeHashCode( String[] array ) {
            return Arrays.hashCode( array );
        }

        @Override
        public boolean equals( String[] ar1, String[] ar2 ) {
            return Arrays.equals( ar1, ar2 );
        }
    });
```

Now such a map would print
100 in the following code snippet:

```java
final String[] array1 = new String[] { "String 1", "String 2" };
final String[] array2 = new String[] { "String 1", "String 2" };
map.put( array1, 100 );
System.out.println( map.get( array2 ) );
```

Another useful case is replacing java.util.IdentityHashMap. For Object to primitive type mappings we can use TObject&lt;primitive type&gt;CustomHashMap (for example, TObjectIntCustomHashMap); for object to object mappings, TCustomHashMap. We could implement the required hashing strategy manually:
```java
final TObjectIntCustomHashMap<String> identityMap1 = new TObjectIntCustomHashMap<String>(
    new HashingStrategy<String>() {
        @Override
        public int computeHashCode( String str ) {
            return System.identityHashCode( str );
        }

        @Override
        public boolean equals( String str1, String str2 ) {
            return str1 == str2;
        }
    });
```

But it has already been implemented by the library developers:

```java
final TObjectIntCustomHashMap<String> identityMap2 =
    new TObjectIntCustomHashMap<String>( IdentityHashingStrategy.INSTANCE );
```

JDK IdentityHashMap uses its own data structure to store data – keys and values are interleaved in the same array. This is a CPU cache-friendly solution (provided that you find the correct slot in the array using the key hash code on the first attempt). It also doesn't keep redundant Map.Entry objects, so it consumes less memory than java.util.HashMap.

Here is a small test. We add 10M string-to-int mappings to three maps: IdentityHashMap, HashMap and TObjectIntCustomHashMap. Of course, this is just a memory consumption test, because HashMap cannot substitute for IdentityHashMap.
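Before the memory numbers, a small plain-JDK illustration of why the two maps are not interchangeable: two equal but distinct String instances form one entry in a HashMap and two entries in an IdentityHashMap:

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// HashMap compares keys with equals/hashCode; IdentityHashMap uses ==
// and System.identityHashCode, so equal-but-distinct keys stay separate.
public class IdentityDemo {
    public static void main(String[] args) {
        String s1 = new String("key");
        String s2 = new String("key"); // equals(s1) but a different object
        Map<String, Integer> plain = new HashMap<>();
        Map<String, Integer> identity = new IdentityHashMap<>();
        plain.put(s1, 1);    plain.put(s2, 2);
        identity.put(s1, 1); identity.put(s2, 2);
        System.out.println(plain.size());    // prints 1
        System.out.println(identity.size()); // prints 2
    }
}
```

The Trove identity strategy above gives a map the IdentityHashMap behaviour while keeping Trove's memory layout. The memory test itself: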
```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

import gnu.trove.map.custom_hash.TObjectIntCustomHashMap;
import gnu.trove.strategy.IdentityHashingStrategy;

public class TroveTests {
    private static final int SIZE = 10 * 1000 * 1000;
    private static final float FILL_FACTOR = 0.75f;

    public static void main(String[] args) {
        testJdkHashMap();
        testJdkIdentityHashMap();
        testTroveIdentityHashMap();
    }

    private static long getUsedMemory() {
        System.gc();
        return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    }

    private static void testJdkHashMap() {
        final long start = getUsedMemory();
        final Map<String, Integer> map = new HashMap<>( SIZE, FILL_FACTOR );
        for ( int i = 0; i < SIZE; ++i )
            map.put( "string #" + i, i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "HashMap<String, Integer> = " + size + 'M' );
        if ( map.size() == 1 ) System.out.println( 1 ); // keeps 'map' reachable
    }

    private static void testJdkIdentityHashMap() {
        final long start = getUsedMemory();
        final Map<String, Integer> map = new IdentityHashMap<>( SIZE );
        for ( int i = 0; i < SIZE; ++i )
            map.put( "string #" + i, i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "IdentityHashMap<String, Integer> = " + size + 'M' );
        if ( map.size() == 1 ) System.out.println( 1 );
    }

    private static void testTroveIdentityHashMap() {
        final long start = getUsedMemory();
        // INSTANCE is available from Trove 3.1a1
        final TObjectIntCustomHashMap<String> map =
            new TObjectIntCustomHashMap<>( IdentityHashingStrategy.INSTANCE, SIZE, FILL_FACTOR, -1 );
        for ( int i = 0; i < SIZE; ++i )
            map.put( "string #" + i, i );
        final long size = ( getUsedMemory() - start ) / 1024 / 1024;
        System.out.println( "TObjectIntCustomHashMap<String> = " + size + 'M' );
        if ( map.size() == 1 ) System.out.println( 1 );
    }
}
```
| JDK HashMap&lt;String, Integer&gt; | JDK IdentityHashMap&lt;String, Integer&gt; | Trove TObjectIntCustomHashMap&lt;String&gt; |
| --- | --- | --- |
| 1207M | 967M | 797M |

JDK to Trove migration hints
Migrating your code from JDK maps/sets to Trove maps/sets is quite straightforward. Nevertheless, there are a few traps to remember.
Converting object-to-object maps
HashMap&lt;T, V&gt; (where T and V are not Numbers)
This is the easiest case – just change the collection type in its initialization from HashMap to THashMap (just add the letter 'T' and fix the imports). Your code should work without any other changes, provided that you have declared your maps using the Map interface instead of the HashMap class (otherwise it is a good time to start following this good practice).

HashMap&lt;Number, T&gt;
You need to change both the declaration and the initialization of this map. For example, Map&lt;Integer, String&gt; should be replaced with TIntObjectHashMap&lt;String&gt;. Generally you won't need any other changes, except in the case of the keySet method – it returns a Trove primitive-based set (TIntSet in the example). You may want (note the difference from 'need') to check the map key accesses and get rid of the now redundant boxed types.

HashMap&lt;T, Number&gt;
Similar to the previous case, both the declaration and the initialization should be changed. For example, Map&lt;String, Integer&gt; should be replaced with TObjectIntHashMap&lt;String&gt;. But this time you must pay much more attention.

All Trove "something-to-primitive" maps require specifying a "no value", which is zero by default. You need it because the get/remove methods can not return null any more – they return a primitive instead. You have two possible solutions:

- Choose a "no value". For example, it could be any negative value for a time since epoch.
- Don't make such a choice. Instead, wrap all methods relying on the "no value" with a containsKey call. The disadvantage of this approach is the cost of a double lookup in too many cases. So, choose a "no value".

A common mistake in this type of migration is keeping null-checks on get/remove results. Remember, these methods return the "no value" instead of null in Trove!

HashMap&lt;Number, Number&gt;
Your new type would be a fully primitive map like TIntLongHashMap. Besides that, look for the same problems as listed above in the "object-to-number map" section.

IdentityHashMap
As mentioned above, the IdentityHashMap initialization should be replaced with new TCustomHashMap&lt;&gt;( IdentityHashingStrategy.INSTANCE ). You don't need any other changes if your IdentityHashMap was declared using an interface.

LinkedHashMap
Sorry, Trove does not have any maps with a fixed iteration order. But it may be a good time to check if you actually need a fixed iteration order for your map.
SortedMap
There are no sorted maps in Trove.
HashSet
Similarly to HashMap, all you need to do is replace HashSet with THashSet in the set initialization.

LinkedHashSet
Similarly to HashSet, all you need to do is replace LinkedHashSet with TLinkedHashSet in the set initialization.

Map/set keys/values iteration
You should avoid using any methods returning an array of keys/values in high performance code. These methods allocate a new array, iterate over the whole collection to collect the valid keys/values, and return the array of results. Use iterators, methods returning collections, or the forEach* methods instead – all of them return thin wrappers over the underlying collections.

Trove to JDK memory consumption comparison
The following table was borrowed from the "Memory consumption of popular Java data types – part 2" article.
| JDK collection | Size | Possible Trove substitution | Size |
| --- | --- | --- | --- |
| HashMap | 32 * SIZE + 4 * CAPACITY bytes | THashMap | 8 * CAPACITY bytes |
| HashSet | 32 * SIZE + 4 * CAPACITY bytes | THashSet | 4 * CAPACITY bytes |
| LinkedHashMap | 40 * SIZE + 4 * CAPACITY bytes | None | |
| LinkedHashSet | 40 * SIZE + 4 * CAPACITY bytes | TLinkedHashSet | 8 * CAPACITY bytes |
| TreeMap, TreeSet | 40 * SIZE bytes | None | |

Let's compare the memory consumption of a few numeric hash maps/sets. Note that sets/maps with primitive values require one more byte per slot – it is used to track the slot state.
| JDK collection | Size | Possible Trove substitution | Size |
| --- | --- | --- | --- |
| HashMap&lt;Integer, Integer&gt; | (32(Map.Entry) + 16(Integer1) + 16(Integer2)) * SIZE + 4 * CAPACITY bytes | TIntIntHashMap | 9(int+int+byte state) * CAPACITY bytes |
| HashSet&lt;Integer&gt; | (32(Map.Entry) + 16(Integer)) * SIZE + 4 * CAPACITY bytes | TIntHashSet | 5(int+byte state) * CAPACITY bytes |
| LinkedHashSet&lt;Integer&gt; | (40(LinkedHashMap.Entry) + 16(Integer)) * SIZE + 4 * CAPACITY bytes | TLinkedHashSet | 8(data ref+int order) * CAPACITY + 16(Integer) * SIZE bytes |

Let's compare theory with practice. We will create maps/sets with a size of 10M and a fill factor of 0.75 (which means an initial actual capacity of 13.33M slots):
| Collection | Theoretical size (fill factor=0.75) | Actual size (fill factor=0.75) |
| --- | --- | --- |
| HashMap&lt;Integer, Integer&gt; | (32(Map.Entry) + 16(Integer1) + 16(Integer2)) * SIZE + 4 * CAPACITY = 64*SIZE + 4*CAPACITY = 610M + 53M = 663M | 669M |
| TIntIntHashMap | 9(int+int+byte state) * CAPACITY bytes = 120M | 124M |
| HashSet&lt;Integer&gt; | (32(Map.Entry) + 16(Integer)) * SIZE + 4 * CAPACITY bytes = 458M + 53M = 511M | 521M |
| TIntHashSet | 5(int+byte state) * CAPACITY bytes = 67M | 69M |
| LinkedHashSet&lt;Integer&gt; | (40(LinkedHashMap.Entry) + 16(Integer)) * SIZE + 4 * CAPACITY bytes = 534M + 53M = 587M | 598M |
| TLinkedHashSet&lt;Integer&gt; | 8(data ref+int order) * CAPACITY + 16(Integer) * SIZE bytes = 106M + 152M = 258M | 258M |

The lesson to learn from these numbers is that the combination of boxed primitives and inefficient data structures can skyrocket the data storage cost (compare the 521M required by HashSet&lt;Integer&gt; with the theoretical minimum of 40M for 10M int values).

Summary
- The main reason to use Trove maps/sets is the seriously reduced memory consumption compared to JDK maps/sets. If there is a large array list/set/map with keys or values that could be a primitive type, it is worth replacing it with a Trove collection. Maps from a primitive type to a primitive type are especially worth replacing.
- Trove maps and sets support custom hashing strategies, which allow you to implement map/set-specific equals and hashCode logic, for example to implement an identity set or map.
- Trove collections implement several additional methods, like grep, retainEntries or adjustOrPutValue, which reduce the code required for many common tasks.
- JDK to Trove migration is quite an easy task – in many cases all you need to do is add the letter 'T' to the initialization: new HashMap becomes new THashMap. Migration to primitive-based collections requires a bit more work, but this work will be paid off by a massive reduction in memory consumption.