A new search engine in 2024? Yes, but stateless (index on S3), serverless (no cluster state), and with all the Lucene features: filters, autocomplete, facets. And also with local embedding & RAG inference.
Regarding sharding: I think one reason people use shards, concurrent retrieval and ranking, can be mitigated by a concurrent searcher that processes segments in parallel. That parallelism is of course limited to the number of CPU cores available on a given node, but for most folks that should be enough. I'm not up to date on whether Lucene already has this, but I remember AWS people discussing it a few years ago. The problem with segments is that they can be heavily unbalanced in size.
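If I remember right, the building block for this in Lucene is an IndexSearcher constructed with an Executor: it then runs the query over groups of segments concurrently. A minimal sketch of what I mean (the index path and field name are made up):

```java
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentSearch {
  public static void main(String[] args) throws Exception {
    // One thread per core: the searcher fans the query out over segment slices on this pool.
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
      // Passing the executor enables intra-query parallelism across segments.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      TopDocs top = searcher.search(new TermQuery(new Term("title", "lucene")), 10);
      System.out.println("hits: " + top.totalHits);
    } finally {
      pool.shutdown();
    }
  }
}
```

IIRC the searcher also groups small segments into slices so you don't burn a task per tiny segment, which softens the imbalance problem a bit.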
Another reason for sharding, I think, is indexing throughput; there, some folks might need some form of partitioning of the input.
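I'm thinking of something as simple as hash-routing documents across a few independent writers, which can then be fed from separate threads or nodes; a rough sketch (the shard count, paths and field names are just placeholders):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PartitionedIndexer {
  private final IndexWriter[] shards;

  PartitionedIndexer(int numShards) throws Exception {
    shards = new IndexWriter[numShards];
    for (int i = 0; i < numShards; i++) {
      shards[i] = new IndexWriter(
          FSDirectory.open(Paths.get("/tmp/index-shard-" + i)),
          new IndexWriterConfig(new StandardAnalyzer()));
    }
  }

  // Route each document to a shard by hashing its id, so shards can be written independently.
  void index(String id, String body) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));
    doc.add(new TextField("body", body, Field.Store.NO));
    shards[Math.floorMod(id.hashCode(), shards.length)].addDocument(doc);
  }

  void close() throws Exception {
    for (IndexWriter writer : shards) {
      writer.close();
    }
  }
}
```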
Amazing post!
As for the Lucene side, it all boils down to the segment merge strategy used (which is a Lucene feature itself, not an Elastic/Solr one):
* AFAIK Elasticsearch/Solr and also Nixiesearch all do segment parallelism by default, as you've described: you search over segments in parallel and then combine the results into the final response.
* But then the segments should be nicely balanced in size: otherwise the small segments finish quickly, but you still have to wait for the largest one (a quick way to check this on a real index is sketched right after this list).
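To see how balanced the segments of an index actually are, you can list the reader's leaves, one per segment (the index path is made up):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class SegmentSizes {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
      // Each leaf corresponds to one segment; maxDoc() is a rough proxy for its size.
      for (LeafReaderContext leaf : reader.leaves()) {
        System.out.printf("segment ord=%d docs=%d%n", leaf.ord, leaf.reader().maxDoc());
      }
    }
  }
}
```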
I've checked the Lucene docs, and this doesn't seem to be an issue with the default merge policy, TieredMergePolicy:
https://lucene.apache.org/core/9_12_0/core/org/apache/lucene/index/TieredMergePolicy.html
> Merges segments of approximately equal size, subject to an allowed number of segments per tier. This is similar to LogByteSizeMergePolicy, except this merge policy is able to merge non-adjacent segment, and separates how many segments are merged at once (setMaxMergeAtOnce(int)) from how many segments are allowed per tier (setSegmentsPerTier(double)). This merge policy also does not over-merge (i.e. cascade merges).
> For normal merging, this policy first computes a "budget" of how many segments are allowed to be in the index. If the index is over-budget, then the policy sorts segments by decreasing size (pro-rating by percent deletes), and then finds the least-cost merge. Merge cost is measured by a combination of the "skew" of the merge (size of largest segment divided by smallest segment), total merge size and percent deletes reclaimed, so that merges with lower skew, smaller size and those reclaiming more deletes, are favored.
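And if segment sizes ever do become a problem, the knobs are on the policy itself; roughly like this (the values are placeholders, not recommendations):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicyConfig {
  public static IndexWriterConfig config() {
    TieredMergePolicy policy = new TieredMergePolicy();
    policy.setSegmentsPerTier(10.0);        // how many similarly-sized segments per tier
    policy.setMaxMergeAtOnce(10);           // how many segments get merged in one go
    policy.setMaxMergedSegmentMB(5 * 1024); // cap on the size of a merged segment
    return new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(policy);
  }
}
```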