We found an embedding indexing bottleneck in the most unexpected place: JSON parsing
While benchmarking Nixiesearch, we discovered that JSON parsing took up 20% of indexing time. Switching from the user-friendly Circe to the low-level Jsoniter made it 5× (wow!) faster.
As a bold and perhaps overly optimistic person, I’m building a search engine from scratch, the right way: Nixiesearch. Think Elasticsearch, but on top of S3.
When evaluating a new search engine, performance is usually the second thing you look at, right after glancing through its feature list. The third, of course, is checking whether it’s written in Rust.
Oh no, yet another vendor benchmark
You should always take vendor benchmarks with a grain of salt. If a vendor’s solution ranks #1, there’s a good chance someone couldn’t resist the temptation to cherry-pick the results in a not entirely fair way.
That said, here I am, building yet another vendor benchmark — github.com/hseb-benchmark/hseb. But I’m fully aware that Nixiesearch isn’t going to be the fastest here; the goal is simply to uncover every possible bottleneck and fix them one by one.
Indexing process
HSEB spins up a search engine Docker container and batch-loads a document corpus into it. For Nixiesearch, I also added an async-profiler directly inside the container to generate nice flamegraphs showing what’s happening under the hood. The indexing-time flamegraph, however, looked suspicious.
The main culprit turned out to be java.lang.Float.parseFloat — a JDK method used to convert strings to floats. Normally, you’d just accept that you can’t make a standard library method any faster. But what if you actually can?
How JSON parsing works
Most traditional JSON parsers prioritize flexibility over performance and typically make two passes over your data:
1. Convert the input into an internal AST-like representation, similar to what Python’s json.loads() does.
2. Map the AST onto your domain object.
In most real-world applications, JSON parsing isn’t a major bottleneck, except in edge cases with very large payloads. And payloads containing document embeddings can be enormous; some embedding models, like voyage, have a dimensionality of 2048!
So, can we make it faster by skipping the AST phase and speeding up number parsing a bit? Yes - by using iterator-based parsers and a little bit of Computer Science.
Meet simdjson and jsoniter
Simdjson is one of the fastest JSON parsing libraries ever built, and its parse_number implementation is a masterpiece of engineering - about 1,500 lines of SIMD-optimized C++ code.
The catch is that simdjson is a native library, while Nixiesearch is written in Scala on top of the JVM. That means we can’t easily use it without sacrificing managed runtime portability:
- We’d need to bundle separate native builds of the library for every supported platform (including Windows, which I’d like to avoid as much as possible).
- JNI calls from the JVM into native code are expensive: they can’t be inlined and add significant per-call overhead.
Another drawback of iterator-based parsers like simdjson is that they’re very low-level. Instead of simply writing something like json.as[Document] and letting the parser map fields to your DTO automatically, you have to deal with JSON tokens manually - losing much of the flexibility and convenience of high-level parsers.
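To give a feel for the token-level style, here’s a toy single-pass parser for a JSON array of floats in plain Java. The parseFloatArray helper is mine, not any library’s API: each number is converted as soon as its token ends, with no intermediate tree.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenDemo {
    // Toy single-pass parser for a JSON array of numbers, e.g. "[0.1, 0.2]".
    // No AST is built: each number token is converted the moment it ends.
    static float[] parseFloatArray(String json) {
        List<Float> out = new ArrayList<>();
        int i = json.indexOf('[') + 1;
        int start = i;
        for (; i <= json.length(); i++) {
            char c = i < json.length() ? json.charAt(i) : ']';
            if (c == ',' || c == ']') {
                String tok = json.substring(start, i).trim();
                if (!tok.isEmpty()) out.add(Float.parseFloat(tok));
                start = i + 1;
                if (c == ']') break;
            }
        }
        float[] res = new float[out.size()];
        for (int j = 0; j < res.length; j++) res[j] = out.get(j);
        return res;
    }

    public static void main(String[] args) {
        float[] v = parseFloatArray("[0.1, 0.2, 0.3]");
        System.out.println(v.length); // prints 3
    }
}
```

Real iterator-based parsers work the same way, just over a proper token stream instead of raw characters — and with a much faster number conversion than Float.parseFloat, which is exactly what the rest of this post is about.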
The good news is that simdjson’s number parsing logic has been ported to many popular JSON libraries - and one of them is jsoniter.
It can generate decoders for arbitrary DTO classes at compile time, so code like json.as[Document] still works seamlessly. And it’s still an iterator-based decoder, so when you’re brave enough to go low-level, you can.
Clinger’s algorithm
Both simdjson and jsoniter use an optimistic number-parsing approach known as Clinger’s algorithm. The core insight: when you parse a decimal number like “123.45” into a binary float, you need to convert from base 10 to base 2 with correct rounding. Clinger showed that for most practical numbers, you can do this with simple floating-point arithmetic instead of the expensive arbitrary-precision math the JVM stdlib falls back to.
Given a decimal number represented as:
- a significand f (the digits as an integer)
- an exponent e (a power of 10)

Example: “123.45” → f = 12345, e = -2 (since 12345 × 10^-2 = 123.45)
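Splitting a plain decimal string into f and e is straightforward; here is a toy illustration in Java (the decompose helper is mine, and it deliberately ignores signs and scientific notation):

```java
public class DecomposeDemo {
    // Toy decomposition of a decimal string like "123.45" into
    // significand f = 12345 and exponent e = -2.
    // No sign handling, no "1e5"-style notation — just digits and a dot.
    static long[] decompose(String s) {
        long f = 0;
        int e = 0;
        boolean afterDot = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.') { afterDot = true; continue; }
            f = f * 10 + (c - '0');     // accumulate digits into an integer
            if (afterDot) e--;          // each fractional digit lowers e by 1
        }
        return new long[] { f, e };
    }

    public static void main(String[] args) {
        long[] fe = decompose("123.45");
        System.out.println(fe[0] + " " + fe[1]); // prints "12345 -2"
    }
}
```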
Clinger’s fast path: if -22 ≤ e ≤ 22 and f ≤ 2^53 - 1, compute the result directly in plain double arithmetic as double(f) × 10^e (or double(f) / 10^-e when e is negative). Otherwise, fall back to the slow arbitrary-precision path.
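Here’s a minimal sketch of the fast path in plain Java — a toy illustration, not simdjson’s or jsoniter’s actual code; the slow path simply delegates to Double.parseDouble:

```java
public class ClingerDemo {
    // Powers of ten 10^0 .. 10^22 are all exactly representable as doubles,
    // and each step of this repeated multiplication stays exact.
    private static final double[] POW10 = new double[23];
    static {
        double p = 1.0;
        for (int i = 0; i < POW10.length; i++) { POW10[i] = p; p *= 10.0; }
    }

    // Toy Clinger fast path: the decimal value is f * 10^e.
    static double parse(long f, int e) {
        if (f <= (1L << 53) - 1 && e >= -22 && e <= 22) {
            // double(f) is exact and 10^|e| is exact, so a single IEEE 754
            // multiply/divide yields the correctly rounded result.
            double d = (double) f;
            return e >= 0 ? d * POW10[e] : d / POW10[-e];
        }
        // Slow path: defer to the JDK's arbitrary-precision parser.
        return Double.parseDouble(f + "e" + e);
    }

    public static void main(String[] args) {
        System.out.println(parse(12345, -2)); // prints 123.45
    }
}
```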
Why this works:

Magic Condition 1: exponent in [-22, 22]
Every positive power of ten from 10^0 up to 10^22 is exactly representable as an IEEE 754 double. (Negative powers like 10^-1 = 0.1 are not exact — which is why the fast path divides by the exact 10^|e| for negative exponents instead of multiplying by an inexact 10^e.) Either way, the power of ten introduces no rounding error.

Magic Condition 2: significand ≤ 2^53 - 1
Doubles have 53 bits of precision, so any integer up to 9,007,199,254,740,991 (2^53 - 1) converts to a double without loss. So double(f) is exact.
When you multiply or divide two exact doubles, IEEE 754 guarantees the result is correctly rounded to the nearest representable double. No precision lost!
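Both conditions are easy to check from the JVM itself. The sketch below uses BigDecimal’s exact-from-double constructor to inspect the true binary value of a double:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ExactnessDemo {
    public static void main(String[] args) {
        // 10^22 is the largest power of ten exactly representable as a double...
        boolean exact22 = new BigDecimal(1e22)
            .compareTo(new BigDecimal(BigInteger.TEN.pow(22))) == 0;
        // ...while 10^23 is already rounded.
        boolean exact23 = new BigDecimal(1e23)
            .compareTo(new BigDecimal(BigInteger.TEN.pow(23))) == 0;

        // Integers up to 2^53 - 1 round-trip through double without loss;
        // one past 2^53 collapses onto a neighboring value.
        boolean maxExact = (long) (double) ((1L << 53) - 1) == (1L << 53) - 1;
        boolean pastInexact = (double) ((1L << 53) + 1) == (double) (1L << 53);

        System.out.println(exact22 + " " + exact23);     // prints "true false"
        System.out.println(maxExact + " " + pastInexact); // prints "true true"
    }
}
```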
Clinger’s algorithm vs naive float parsing
I implemented a simple JMH-based benchmark to compare the old Circe-based decoder with the new jsoniter-based one across different embedding sizes - and the results were impressive.
Clinger’s fast-path decoding turned out to be roughly 4.5× faster than the naive approach! In addition to much faster number parsing, skipping the AST phase entirely also cut down memory allocations dramatically:
For a 1024-dimensional embedding, the old parser was allocating a whopping 347 KB of RAM, while the new one uses only 19 KB - an 18× reduction. Not bad.
Yes, you can beat the stdlib
The PR for this change is already merged to master and will be part of the upcoming Nixiesearch 0.8.x release.
Yes, switching the JSON decoder ended up requiring a massive refactor - sorry for the 3,000 changed lines for what seems like a minor change. But hey, the refactoring will continue until morale improves.