#1BRC Timeline
1BRC Timeline
Important moments, key ideas and most performance improvements in The One Billion Row Challenge
The goal is to learn which changes most improved performance, and also to give credit to the original authors.
ps: it doesn’t necessarily mean that other people copied the code. Others might have arrived at the same/similar solution on their own.
Leaderboard Changes
git log -p --reverse README.md | rg "\+\|\s1"
# | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
---|---|---|---|---|---|
1. | 04:13.449 | CalculateAverage.java (baseline) | 21.0.1-open | Gunnar Morling | |
1. | 02:08.845 | CalculateAverage_royvanrijn.java | 21.0.1-open | Roy van Rijn | Process lines in parallel |
1. | 02:08.315 | CalculateAverage_itaske.java | 21.0.1-open | itaske | |
1. | 00:38.510 | CalculateAverage_bjhara.java | 21.0.1-open | Hampus Ram | Memory mapped file using FileChannel and MappedByteBuffer |
1. | 00:23.366 | link | 21.0.1-open | Roy van Rijn | SWAR |
1. | 00:14.848 | link | 21.0.1-graalce | Sam Pullara | Lazy String creation and better hash table? |
1. | 00:12.063 | link | 21.0.1-graalce | Sam Pullara | |
1. | 00:10.870 | link | 21.0.1-graalce | Elliot Barlas | |
1. | 00:07.999 | link | 21.0.1-GraalVM Native - Unsafe | Roy van Rijn | |
1. | 00:07.620 | link | 21.0.1-open | Quan Anh Mai | Vector API, Magic Branchless Number Parsing |
1. | 00:06.159 | link | 21.0.1-GraalVM Native - Unsafe | royvanrijn | |
1 | 00:03.604 | link | 21.0.1-GraalVM Native - Unsafe | Roy van Rijn | New server |
1 | 00:03.044 | link | 21.0.1-GraalVM Native - Unsafe | Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen | |
1 | 00:02.941 | link | 21.0.1-GraalVM Native - Unsafe | Roy van Rijn | |
1 | 00:02.904 | link | 21.0.1-GraalVM Native - Unsafe | Roy van Rijn | |
1 | 00:02.575 | link | 21.0.1-open | Quan Anh Mai | Use Unsafe His previous time after server changed was 00:03.258, so 683ms faster. |
1 | 00:02.552 | link | 21.0.1-GraalVM Native - Unsafe | Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen | |
1* | 00:02.461 | link | 21.0.1-GraalVM Native - Unsafe | Artsiom Korzun | |
1* | 00:02.477 | link | 21.0.1-GraalVM Native - Unsafe | Van Phu DO | |
1 | 00:02.336 | link | 21.0.1-GraalVM Native - Unsafe | Van Phu DO | |
1 | 00:02.195 | link | 21.0.1-GraalVM Native - Unsafe | Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen | Subprocess spawn trick |
1 | 00:02.019 | link | 21.0.2-GraalVM Native - Unsafe | Artsiom Korzun | |
1 | 00:01.893 | link | 21.0.2-GraalVM Native - Unsafe | Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen | Process 2MB segments to make all threads finish at the same time. Process with 3 scanners in parallel in the same thread. |
1 | 00:01.878 | link | 21.0.2-GraalVM Native - Unsafe | Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen | |
1 | 00:01.832 | link | 21.0.2-GraalVM Native - Unsafe | Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen | |
1 | 00:01.645 | link | 21.0.2-GraalVM Native - Unsafe | Jaromir Hamala | |
1 | 00:01.535 | link | 21.0.2-GraalVM Native - Unsafe | Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen |
Timeline
Dec 28, 2023 07:44:58 - Project is created
Jan 1, 2024 11:38:16 - Baseline time: 04:13.449
Jan 1, 2024 14:33:40 - royvanrijn - Processing lines in parallel - 02:08.845
Map<String, Measurement> resultMap = Files.lines(Path.of(FILE)).parallel()
Jan 2, 2024 10:04:29 - bjhara - Memory Mapped File - 00:38.510
https://github.com/gunnarmorling/1brc/pull/10
Key idea:
- Memory mapped file using
FileChannel
andMappedByteBuffer
private static Stream<ByteBuffer> splitFileChannel(final FileChannel fileChannel) throws IOException {
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(new Iterator<ByteBuffer>() {
private static final long CHUNK_SIZE = 1024 * 1024 * 10L;
private final long size = fileChannel.size();
private long start = 0;
@Override
public boolean hasNext() {
return start < size;
}
@Override
public ByteBuffer next() {
try {
MappedByteBuffer mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, start,
Math.min(CHUNK_SIZE, size - start));
// don't split the data in the middle of lines
// find the closest previous newline
int realEnd = mappedByteBuffer.limit() - 1;
while (mappedByteBuffer.get(realEnd) != '\n')
realEnd--;
realEnd++;
mappedByteBuffer.limit(realEnd);
start += realEnd;
return mappedByteBuffer;
}
catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
}, Spliterator.IMMUTABLE), false);
}
Jan 2, 2024 16:14:32 - padreati - Vector API - 00:50.547
Key ideas:
- First use of the Vector API
JAVA_OPTS="--enable-preview --add-modules jdk.incubator.vector"
time java $JAVA_OPTS --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage_padreati
+ <compilerArgs>
+ <compilerArg>--enable-preview</compilerArg>
+ <compilerArg>--add-modules</compilerArg>
+ <compilerArg>java.base,jdk.incubator.vector</compilerArg>
+ </compilerArgs>
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
Jan 3, 2024 - royvanrijn - SWAR - 00:23.366
Key ideas:
- SIMD Within A Register (SWAR) for finding ‘;’.
- Iterates over a long, instead of a byte.
- Use
int
instead ofdouble
- Branchless parse
int
https://github.com/gunnarmorling/1brc/pull/5
/**
* Changelog:
*
* Initial submission: 62000 ms
* Chunked reader: 16000 ms
* Optimized parser: 13000 ms
* Branchless methods: 11000 ms
* Adding memory mapped files: 6500 ms (based on bjhara's submission)
* Skipping string creation: 4700 ms
* Custom hashmap... 4200 ms
* Added SWAR token checks: 3900 ms
* Skipped String creation: 3500 ms (idea from kgonia)
* Improved String skip: 3250 ms
* Segmenting files: 3150 ms (based on spullara's code)
* Not using SWAR for EOL: 2850 ms
*
* Best performing JVM on MacBook M2 Pro: 21.0.1-graal
* `sdk use java 21.0.1-graal`
*
*/
findNextSWAR
/**
* -------- This section contains SWAR code (SIMD Within A Register) which processes a bytebuffer as longs to find values:
*/
private static final long SEPARATOR_PATTERN = compilePattern((byte) ';');
private int findNextSWAR(ByteBuffer bb, long pattern, int start, int limit) {
int i;
for (i = start; i <= limit - 8; i += 8) {
long word = bb.getLong(i);
int index = firstAnyPattern(word, pattern);
if (index < Long.BYTES) {
return i + index;
}
}
// Handle remaining bytes
for (; i < limit; i++) {
if (bb.get(i) == (byte) pattern) {
return i;
}
}
return limit; // delimiter not found
}
private static int firstAnyPattern(long word, long pattern) {
final long match = word ^ pattern;
long mask = match - 0x0101010101010101L;
mask &= ~match;
mask &= 0x8080808080808080L;
return Long.numberOfTrailingZeros(mask) >>> 3;
}
Read more about this here: https://richardstartin.github.io/posts/finding-bytes
compilePattern
It replicates the byte value in a long.
private static long compilePattern(byte value) {
return ((long) value << 56) | ((long) value << 48) | ((long) value << 40) | ((long) value << 32) |
((long) value << 24) | ((long) value << 16) | ((long) value << 8) | (long) value;
}
jshell> var p = compilePattern((byte) ';')
p ==> 4268070197446523707
jshell> Long.toHexString(p)
$4 ==> "3b3b3b3b3b3b3b3b"
jshell> Integer.toHexString((byte) ';')
$5 ==> "3b"
branchlessParseInt
/**
* Branchless parser, goes from String to int (10x):
* "-1.2" to -12
* "40.1" to 401
* etc.
*
* @param input
* @return int value x10
*/
private static int branchlessParseInt(final byte[] input, int start, int length) {
// 0 if positive, 1 if negative
final int negative = ~(input[start] >> 4) & 1;
// 0 if nr length is 3, 1 if length is 4
final int has4 = ((length - negative) >> 2) & 1;
final int digit1 = input[start + negative] - '0';
final int digit2 = input[start + negative + has4];
final int digit3 = input[start + negative + has4 + 2];
return (-negative ^ (has4 * (digit1 * 100) + digit2 * 10 + digit3 - 528) - negative); // 528 == ('0' * 10 + '0')
}
franz1981 comment on ByteBuffer and big-endian
By default ByteBuffers uses Big Indian, which is native for M1/M2…but not on x86, which is the one used for the benchmark. It translates in:
wrong position, while found a useless reverse bytes
It would be better to store in a static final, the native byte order and based on it, correctly reverse the bytes yourself (look at what I have done in the Netty code). Conversely you can always assume little endian, setting the mapped byte buffer order, and leave M1 to have the performance hit
https://github.com/gunnarmorling/1brc/pull/5#discussion_r1440157267
franz1981 - get long idea
Hash code here has a data dependency: you either manually unroll this or just relax the hash code by using a var handle and use getLong amortizing the data dependency in batches, handing only the last 7 (or less) bytes separately, using the array. In this way the most of computation would like resolve in much less loop iterations too, similar to https://github.com/apache/activemq-artemis/blob/25fc0342275b29cd73123523a46e6e94582597cd/artemis-commons/src/main/java/org/apache/activemq/artemis/utils/ByteUtil.java#L299
Jan 3, 2024 - Interesting comment about ZGC by fisk:
https://github.com/gunnarmorling/1brc/pull/15#issuecomment-1875495420
https://github.com/fisk/jdk/commit/8ce820de84b7031ced52fb63d190d9c8546f6730
@lobaorn do you really think you can nerd snipe me like that with benchmarking games?
Having said that... I implemented some experimental leyden support for generational ZGC, with some object streaming shenanigans allowing loading of archived objects, and support for archived code from the JIT cache so we can run with compiled code immediately and dodge most of the warmup costs.
On my machine with 256 cores...
real 0m2.318s
user 1m24.229s
sys 0m2.009s
...beat that!
Oh and here are the JVM flags: -XX:+UseLargePages -XX:+UseZGC -XX:+ZGenerational -Xms8G -Xmx8G -XX:+AlwaysPreTouch -XX:ConcGCThreads=4 -XX:+UnlockDiagnosticVMOptions -XX:-ZProactive -XX:ZTenuringThreshold=1 -Xlog:gc*:file=fatroom_gc.log -XX:SharedArchiveFile=Fatroom-dynamic.jsa -XX:+ReplayTraining -XX:+ArchiveInvokeDynamic -XX:+LoadCachedCode -XX:CachedCodeFile=Fatroom-dynamic.jsa-sc
Here is my experimental Generational ZGC leyden branch: https://github.com/fisk/jdk/tree/1brc_genz_leyden
Takes about 1/3 of the time compared to normal ZGC on my machine.
Jan 3, 2024 11:35:51 - Nice 35 lines solution by Sam Pullara
package dev.morling.onebrc;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.stream.Collectors;
public class CalculateAverage_naive {
record Result(double min, double max, double sum, long count) {
}
public static void main(String[] args) throws FileNotFoundException {
long start = System.currentTimeMillis();
var results = new BufferedReader(new FileReader("./measurements.txt"))
.lines()
//.parallel() // I don't know why Sam removed this. But it makes it much faster.
.map(l -> l.split(";"))
.collect(Collectors.toMap(
parts -> parts[0],
parts -> {
double temperature = Double.parseDouble(parts[1]);
return new Result(temperature, temperature, temperature, 1);
},
(oldResult, newResult) -> {
double min = Math.min(oldResult.min, newResult.min);
double max = Math.max(oldResult.max, newResult.max);
double sum = oldResult.sum + newResult.sum;
long count = oldResult.count + newResult.count;
return new Result(min, max, sum, count);
}, ConcurrentSkipListMap::new));
System.out.println(System.currentTimeMillis() - start);
System.out.println(results);
}
}
Jan 3, 2024 11:35:51 - spullara - lazy String creation and better hash table (?) - 00:14.848
https://github.com/gunnarmorling/1brc/pull/21
It was not clear at first what made such big difference, because his solution:
- does not use SWAR
- work with doubles
…, but creating String name only when aggregating certainly speeds things up.
class ByteArrayToResultMap {
public static final int MAPSIZE = 1024*128;
Result[] slots = new Result[MAPSIZE];
byte[][] keys = new byte[MAPSIZE][];
private int hashCode(byte[] a, int fromIndex, int length) {
int result = 0;
int end = fromIndex + length;
for (int i = fromIndex; i < end; i++) {
result = 31 * result + a[i];
}
return result;
}
// ...
}
Jan 3, 2024 - ebarlas - Changed JVM to GraalVM CE 21.0.1
https://github.com/gunnarmorling/1brc/pull/45
Jan 3, 2024 - ddimtirov - use Epsilon GC, MemorySegment and Arena
Nice example of how to replace MappedByteBuffer
with MemorySegment
:
https://github.com/gunnarmorling/1brc/pull/32
switched to the foreign memory access preview API for another 10% speedup
-JAVA_OPTS="-XX:+UseParallelGC"
+# --enable-preview to use the new memory mapped segments
+# We don't allocate much, so just give it 1G heap and turn off GC; the AlwaysPreTouch was suggested by the ergonomics
+JAVA_OPTS="--enable-preview -Xms1g -Xmx1g -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:+AlwaysPreTouch"
time java $JAVA_OPTS --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage_ddimtirov
Jan 4, 2024 - royvanrijn - Trying to fix the endian issue #68
https://github.com/gunnarmorling/1brc/pull/68/files
Jan 4, 2024 - artsiomkorzun first submission
https://github.com/gunnarmorling/1brc/pull/83
Jan 5, 2024 - yemreinci - Calculate the hashcode while reading the data
https://github.com/gunnarmorling/1brc/pull/86
Improvements
- Calculate the hashcode while reading the data, instead of later in the hashmap implementation. This is expected to increase instruction level parallelism as CPU can work on the math while waiting for data from memory/cache.
- Convert the number parsing while loop into a few if statements. While loop with a switch-case inside is likely not so great for the branch predictor.
int hash = 0;
while ((b = bb.get(currentPosition++)) != ';') {
buffer[offset++] = b;
hash = (hash << 5) - hash + b;
}
Jan 5, 2024 3:02 GMT-3 - artsiomkorzun - AtomicInteger, AtomicReference
AtomicInteger counter = new AtomicInteger();
AtomicReference<Aggregates> result = new AtomicReference<>();
Aggregator[] aggregators = new Aggregator[PARALLELISM];
Jan 6, 2024 6:55 AM GMT-3 - thomaswue - Unsafe, GraalVM Native Image - 9.625
https://github.com/gunnarmorling/1brc/pull/70
Update: Thanks to tuning from @mukel using sun.misc.Unsafe to directly access the mapped memory, it is now down to 1.28s (total CPU user time 32.2s) on my machine. Also, instead of PGO, this is now just using a single native image build run with tuning flags “-O3” and “-march=native”.
mmap the entire file, use Unsafe directly instead of ByteBuffer, avoid byte[] copies. These tricks give a ~30% speedup, over an already fast implementation.
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk use java 21.0.1-graal 1>&2
NATIVE_IMAGE_OPTS="--gc=epsilon -O3 -march=native --enable-preview"
native-image $NATIVE_IMAGE_OPTS -cp target/average-1.0.0-SNAPSHOT.jar -o image_calculateaverage_thomaswue dev.morling.onebrc.CalculateAverage_thomaswue
private static final Unsafe UNSAFE = initUnsafe();
private static Unsafe initUnsafe() {
try {
Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
theUnsafe.setAccessible(true);
return (Unsafe) theUnsafe.get(Unsafe.class);
}
catch (NoSuchFieldException | IllegalAccessException e) {
throw new RuntimeException(e);
}
}
Jan 6, 2024 1:01 PM GMT-3 - merykitty - Vector API, magic branchless number parsing - 00:07.620
https://github.com/gunnarmorling/1brc/pull/114
// The technique here is to align the key in both vectors so that we can do an
// element-wise comparison and check if all characters match
var nodeKey = ByteVector.fromArray(BYTE_SPECIES, node.data, BYTE_SPECIES.length() - localOffset);
var eqMask = line.compare(VectorOperators.EQ, nodeKey).toLong();
long validMask = (-1L >>> -semicolonPos) << localOffset;
if ((eqMask & validMask) == validMask) {
aggr = node.aggr;
break;
}
// Parse a number that may/may not contain a minus sign followed by a decimal with
// 1 - 2 digits to the left and 1 digits to the right of the separator to a
// fix-precision format. It returns the offset of the next line (presumably followed
// the final digit and a '\n')
private static long parseDataPoint(Aggregator aggr, MemorySegment data, long offset) {
long word = data.get(JAVA_LONG_LT, offset);
// The 4th binary digit of the ascii of a digit is 1 while
// that of the '.' is 0. This finds the decimal separator
// The value can be 12, 20, 28
int decimalSepPos = Long.numberOfTrailingZeros(~word & 0x10101000);
int shift = 28 - decimalSepPos;
// signed is -1 if negative, 0 otherwise
long signed = (~word << 59) >> 63;
long designMask = ~(signed & 0xFF);
// Align the number to a specific position and transform the ascii code
// to actual digit value in each byte
long digits = ((word & designMask) << shift) & 0x0F000F0F00L;
// Now digits is in the form 0xUU00TTHH00 (UU: units digit, TT: tens digit, HH: hundreds digit)
// 0xUU00TTHH00 * (100 * 0x1000000 + 10 * 0x10000 + 1) =
// 0x000000UU00TTHH00 +
// 0x00UU00TTHH000000 * 10 +
// 0xUU00TTHH00000000 * 100
// Now TT * 100 has 2 trailing zeroes and HH * 100 + TT * 10 + UU < 0x400
// This results in our value lies in the bit 32 to 41 of this product
// That was close :)
long absValue = ((digits * 0x640a0001) >>> 32) & 0x3FF;
long value = (absValue ^ signed) - signed;
aggr.min = Math.min(value, aggr.min);
aggr.max = Math.max(value, aggr.max);
aggr.sum += value;
aggr.count++;
return offset + (decimalSepPos >>> 3) + 3;
}
TODO Understand how this code works.
Interesting way to create a HashMap
, using a static function
instead of the constructor HashMap(int initialCapacity)
https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/HashMap.html#newHashMap(int)
var res = HashMap.<String, Aggregator> newHashMap(processorCnt);
API note in the JDK:
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/HashMap.java#L462
/**
* Constructs an empty {@code HashMap} with the specified initial
* capacity and the default load factor (0.75).
*
* @apiNote
* To create a {@code HashMap} with an initial capacity that accommodates
* an expected number of mappings, use {@link #newHashMap(int) newHashMap}.
*
* @param initialCapacity the initial capacity.
* @throws IllegalArgumentException if the initial capacity is negative.
*/
public HashMap(int initialCapacity) {
this(initialCapacity, DEFAULT_LOAD_FACTOR);
}
Jan 7, 2024 - royvanrijn - Adding Unsafe and merykitty’s branchless parser - 00:06.159
https://github.com/gunnarmorling/1brc/pull/194
Jan 7, 2024 - thomaswue - search and compare > 1 byte - 00:06.532
https://github.com/gunnarmorling/1brc/pull/213
This is some tuning over the initial submission now searching for the delimiter and comparing names more than 1 byte at a time. Should be ~30% faster than initial version.
Jan 9, 2024 - mukel - Optimizing findDelimiter
https://github.com/gunnarmorling/1brc/pull/263#discussion_r1446699670
private static int findDelimiter(long word) {
long input = word ^ 0x3B3B3B3B3B3B3B3BL;
long tmp = (input - 0x0101010101010101L) & ~input & 0x8080808080808080L;
return Long.numberOfTrailingZeros(tmp) >>> 3;
}
TODO Understand how this code works.
Jan 10, 2024 - Server change - Solutions got faster
Old server: dedicated virtual cloud environment (Hetzner CCX33)
New server: Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM)
royvanrijn: 6.159 -> 3.604
thomaswue.: 6.532 -> 3.911
merykitty.: 7.620 -> 4.496
Jan 11, 2024 - mtopolnik - focus on 10k (what was really asked for!)
https://github.com/gunnarmorling/1brc/pull/246
My primary focus was performance on the “new” dataset with 10,000 keys and names in the full range of 1-100 bytes. I find that the timing on that dataset degrades to 10.1 seconds.
You can generate the dataset with create_measurements3.sh, and I encourage contestants to try it out and use as the optimization target. It’s unquestionably a kick to see oneself at the top of the leaderboard, but it’s more fun (and useful!) to design a solution that works well beyond the 416 station names of max length 26 chars.
Jan 14, 2024 - Cliff Click submission =D - 4.741
Adding Unsafe:
https://github.com/gunnarmorling/1brc/pull/185/commits/5f8fb7ce09d3f74ab5beb68522008ac2687e8c32
Jan 15, 2024 - jerrinot - Instruction-Level Parallelism (ILP) - 3.409
https://github.com/gunnarmorling/1brc/pull/424
This initial submission is aiming to exploit instruction-level parallelism: Each thread pulls from multiple chunks in each iteration. This breaks nice sequential access, but a CPU gets independent streams so it can go bananas with out-of-order processing.
There are optimization opportunities left on the table, this is to give me some feeling of how it may perform on the testing box.
Credits:
@mtopolnik - for many conversations we have had in the last 2 weeks!
@merykitty - for the amazing branch-less parser
@royvanrijn - for the hashing-as-you-go idea
chunkStartOffsets[0] = inputBase;
chunkStartOffsets[chunkCount] = inputBase + length;
Processor[] processors = new Processor[THREAD_COUNT];
Thread[] threads = new Thread[THREAD_COUNT];
for (int i = 0; i < THREAD_COUNT; i++) {
long startA = chunkStartOffsets[i * chunkPerThread];
long endA = chunkStartOffsets[i * chunkPerThread + 1];
long startB = chunkStartOffsets[i * chunkPerThread + 1];
long endB = chunkStartOffsets[i * chunkPerThread + 2];
long startC = chunkStartOffsets[i * chunkPerThread + 2];
long endC = chunkStartOffsets[i * chunkPerThread + 3];
long startD = chunkStartOffsets[i * chunkPerThread + 3];
long endD = chunkStartOffsets[i * chunkPerThread + 4];
Processor processor = new Processor(startA, endA, startB, endB, startC, endC, startD, endD);
processors[i] = processor;
Thread thread = new Thread(processor);
threads[i] = thread;
thread.start();
}
Jan 16, 2024 - plevart: Look Mom No Unsafe! - 5.336
ffb09bf4bf0b41835b3340415be4f3c34565c126
Final version - 4.676
Jan 31, 2024 - Shiplëv submission =D - No Unsafe! - 4.884
Beautiful solution with great comments!!!
Jan 31, 2024 - Serkan ÖZAL - Fastest Solution running on the JVM!!! - 21.0.1-open - 1.880
Amazing code using the Vector API.
https://github.com/gunnarmorling/1brc/pull/679
First Use of Important Concepts
MappedByteBuffer
$ git log -S"MappedBy" --reverse
commit 6b13d52b676d4ccfe7d766ac46f81abee8225816
Author: Hampus Ram
Date: Tue Jan 2 14:04:29 2024 +0100
Implementation using memory mapped file
Vector API
$ git log -S"jdk.incubator.vector" --reverse
commit 17218485709af71ebd7cb7c2c47f3abc262f5a2d
Author: Aurelian Tutuianu
Date: Tue Jan 2 21:14:32 2024 +0200
- implementation by padreati
AtomicReference
git log -S"AtomicReference" --reverse
commit 0ba5cf33d4a4ce290dbe4fa5e4e4c37b424c8675
Author: Richard Startin <richardstartin@apache.org>
Date: Wed Jan 3 18:15:06 2024 +0000
richardstartin submission
+import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;
+import java.util.concurrent.atomic.AtomicReferenceArray;
+import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;
$ git log -S"AtomicReference;" --reverse
commit cec579b506d448afd2365e0197ed5b98bd0d6a5e
Author: Artsiom Korzun <akorzun@deltixlab.com>
Date: Fri Jan 5 12:15:54 2024 +0100
improved artsiomkorzun solution
He uses it to merge the results!
AtomicInteger counter = new AtomicInteger();
AtomicReference<Aggregates> result = new AtomicReference<>();
int parallelism = Runtime.getRuntime().availableProcessors();
Aggregator[] aggregators = new Aggregator[parallelism];
for (int i = 0; i < aggregators.length; i++) {
aggregators[i] = new Aggregator(counter, result, fileAddress, fileSize, segmentCount);
aggregators[i].start();
}
// ...
while (!result.compareAndSet(null, aggregates)) {
Aggregates rights = result.getAndSet(null);
if (rights != null) {
aggregates.merge(rights);
}
}
That’s a great idea, because when joining on threads sequentially, the first thread you’re waiting might be the last one to finish …
TODO: can’t that code get in an infinite loop?
You can use AtomicReference when applying optimistic locks. You have a shared object and you want to change it from more than 1 thread.
You can create a copy of the shared object Modify the shared object You need to check that the shared object is still the same as before - if yes, then update with the reference of the modified copy.
As other thread might have modified it and/can modify between these 2 steps. You need to do it in an atomic operation. this is where AtomicReference can help
HamoriZ answered Jun 12, 2015 at 12:30 https://stackoverflow.com/a/30803137/339561
MemorySegment
$ git log -S"MemorySegm" --reverse
commit d73457872f0b9990ab6258d0a96d3edad3b24b1a
Author: Dimitar Dimitrov
Date: Thu Jan 4 04:22:39 2024 +0900
ddimtirov - switched to the foreign memory access preview API for another 10% speedup
Epsilon GC
$ git log -S"Epsilon" --reverse
commit d73457872f0b9990ab6258d0a96d3edad3b24b1a
Author: Dimitar Dimitrov
Date: Thu Jan 4 04:22:39 2024 +0900
Unsafe
$ git log -S"Unsafe" --reverse
commit a53aa2e6fdbcc74e0c6a9a308f0a8c3507b0e06a
Author: Thomas Wuerthinger
Date: Sat Jan 6 10:55:07 2024 +0100
Instruction-Level Parallelism (ILP)
I didn’t look at all other solutions, so if there was a previous one, let me know =)
commit dbdd89a84779761ca092e5aaeb6f6e92394a422d
Author: Jaromir Hamala
Date: Mon Jan 15 18:55:22 2024 +0100
jerrinot's initial submission (#424)
* initial version
let's exploit that superscalar beauty!
Conclusion
There were a few changes that really changed the game.
The others were minor improvements and mostly people combining code
from other solutions.
Here are the important changes:
- Process lines in parallel (one line change =));
- Memory map the file, split in chunks and process them in parallel;
- Great hash table;
- Create UTF-8 String for each city after the main loop;
- SWAR;
- Vector API;
- Unsafe;
- ILP.
That’s it for the timeline. If I missed something, let me know here: https://twitter.com/tivrfoa/status/1754607677362106552
In the next post I’ll explore the performance impact of using Unsafe
.
References
https://github.com/gunnarmorling/1brc
https://en.wikipedia.org/wiki/SWAR
Multiple byte processing with full-word instructions - 1975
GENERAL-PURPOSE SIMD WITHIN A REGISTER:PARALLEL PROCESSING ON CONSUMER MICROPROCESSORS - 2003