Optimization Game - How Far Can You Go?
November 5, 2024 · Rabbit HolesI came across this question which was asked in some big tech on Leetcode, and thought I’d make a challenge out of it.
How can you optimize the below code?
LinkedList<Integer> list;
Long sum = 0;
for (int i = 0; i < list.size(); ++i) {
Long value = list.get(i);
sum += value;
}
Identifying Initial Bottlenecks
The first major issue with this code is the use of list.get(i). Since LinkedList doesn’t support O(1) retrieval by index, each get() call iterates up to that index, making the overall complexity O(n²). To fix this, we can use an Iterator to traverse the LinkedList or switch to an ArrayList, which offers O(1) access time due to its contiguous memory structure.
Another inefficiency is the unnecessary auto-boxing when initializing Long sum = 0, which could be simplified. Let’s address these two points in our first optimization:
LinkedList<Integer> list;
Long sum = 0L;
for (Integer value : list) {
sum += value;
}
Using an enhanced for-loop with an Iterator or switching to an ArrayList ensures O(n) traversal. Additionally, initializing sum with 0L avoids any auto-boxing.
At this point, we have an optimized, generic version of the code suitable for most situations. However, for the sake of experimentation, let’s explore further optimizations without focusing on edge cases. This exercise will give us insight into different performance trade-offs, touching on some core Java and computer science concepts along the way.
Going Further with Arrays
Instead of a LinkedList, we can use a primitive int array, which offers significant advantages:
1. Cache Locality
Arrays in Java are contiguous blocks of memory, which allows the CPU to load large chunks of the array into the cache after the first access. This enables faster access to subsequent elements due to spatial locality. Here’s a simplified memory layout comparison to illustrate:
Address Contents | Address Contents
ffff 0000 data[0] | ffff 1000 l_data
ffff 0040 data[1] | ....
ffff 0080 data[2] | ffff 3460 l_data->next
ffff 00c0 data[3] | ....
ffff 0100 data[4] | ffff 8dc0 l_data->next->next
| ffff 8e00 l_data->next->next->next
| ....
| ffff 8f00 l_data->next->next->next->next
When we access data[0], the CPU loads a chunk of memory starting at ffff 0000 into the cache, enabling quick access to data[1], data[2], and so on. By contrast, a LinkedList has scattered nodes, each pointing to the next. Accessing l_data->next requires loading a different part of memory, resulting in more cache misses and reduced efficiency.
2. Dereferencing Overhead
In a LinkedList, each element resides in a node with a pointer to the next element. This design causes:
- Indirect Access: Accessing elements requires traversing pointers, adding multiple memory lookups and slowing access.
- Pointer Chasing: Nodes are scattered in memory, leading to frequent cache misses as the CPU fetches each pointer location separately, increasing access latency.
- Memory Overhead: Each node’s pointer and wrapper object (e.g.,
Integer) consume extra memory, which reduces cache efficiency and adds garbage collection overhead.
In contrast, an int[] array stores values contiguously, allowing direct access without pointers, minimizing cache misses, and avoiding the extra memory and garbage collection costs. This results in faster, more efficient sequential access.
Utilizing Parallel Streams
Since we’re using Java, we can leverage parallel streams to speed up summation over large datasets. Here’s a quick overview of how parallel streams operate:
- Splitting the Data Source: A parallel stream divides the data source (e.g., a collection or array) into chunks. Each chunk is assigned to a separate thread for concurrent processing, leveraging multiple CPU cores.
- Work-Stealing: Java’s
ForkJoinPoolpowers parallel streams and supports work-stealing. If one thread finishes early, it can take over work from another thread that’s still processing.
Parallel streams work best with large data sources, as the overhead of splitting and managing multiple threads is only worth it when there’s significant computation. In this example, we’ll assume a large dataset and implement parallel streams:
Optimizing Further: Choosing int over long
Finally, to squeeze out the last bits of performance, we can replace long with int, which has some subtle advantages:
- Memory Efficiency:
intuses 4 bytes, whereaslonguses 8. For large datasets, this reduces memory footprint. - Cache Efficiency: Smaller data types like
intoccupy less space in the cache, which may reduce cache misses when handling large arrays. - Avoiding Type Promotion: In Java, mixing
intandlongin expressions promotesintvalues tolong, introducing minor overhead. By keeping everything asint, we skip this conversion.
With all these optimizations, here’s our final optimized code:
int[] array = { /* your integers */ };
int sum = java.util.Arrays.stream(array).parallel().sum();
Although the result itself is not of much use, the journey was worth it. We ventured through several fundamental concepts trying to optimize it to the extreme.
I think this is as far as you can go staying in bounds of the language. Can you go any further?