CNDB-10629 Faster SAI predicate selectivity estimation #1253

pkolaczk · 2024-08-29T15:08:38Z

CNDB-10629: Estimate predicate selectivity using histograms

Fixes https://github.com/riptano/cndb/issues/10629

This commit introduces a new system for estimation of predicate
selectivities used by the SAI query planner.

So far the SAI planner had to perform search on each index referenced
by the query in order to learn the expected number of rows matched
by each predicate. That required hitting each sstable index,
going down the KD-tree/trie, and collecting/merging the posting lists.
Even though the posting lists didn't have to be iterated fully, in
some cases the cost of obtaining the posting lists could be
significant. That effort could be wasted if the planner decided
to not include the index in the final plan.

This new feature adds extended statistical information to the
SegmentMetadata. When building index segments, a histogram and
a table of most frequent terms are built and saved with the segment
metadata. Then that information is used to estimate the number
of matching rows by each query predicate at the query planning stage.
Because histograms are tiny and all kept in memory, this is orders
of magnitude faster than searching the index.
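
For readers unfamiliar with the technique, here is a minimal sketch of how a histogram plus a table of most frequent terms can answer row-count questions without touching the index. It is an illustration under assumptions, not the `TermsDistribution` code from this PR: terms are simplified to `long` values, buckets store cumulative row counts, and every name (`TermsHistogramSketch`, `estimateRange`, and so on) is hypothetical.

```java
import java.util.Map;

// Hypothetical sketch; not the TermsDistribution class added by this PR.
final class TermsHistogramSketch
{
    private final long minTerm;                      // smallest indexed term
    private final long[] bucketMax;                  // inclusive upper bound of bucket i, ascending
    private final long[] cumulativeRows;             // rows with term <= bucketMax[i]
    private final Map<Long, Long> mostFrequentTerms; // term -> exact row count

    TermsHistogramSketch(long minTerm, long[] bucketMax, long[] cumulativeRows,
                         Map<Long, Long> mostFrequentTerms)
    {
        this.minTerm = minTerm;
        this.bucketMax = bucketMax;
        this.cumulativeRows = cumulativeRows;
        this.mostFrequentTerms = mostFrequentTerms;
    }

    // Estimated rows with term <= value, interpolating linearly inside the bucket.
    double estimateLessOrEqual(long value)
    {
        if (value < minTerm)
            return 0.0;
        double lowerBound = minTerm;
        double rowsBelow = 0.0;
        for (int i = 0; i < bucketMax.length; i++)
        {
            if (value <= bucketMax[i])
            {
                double width = bucketMax[i] - lowerBound;
                double fraction = width == 0 ? 1.0 : (value - lowerBound) / width;
                return rowsBelow + fraction * (cumulativeRows[i] - rowsBelow);
            }
            lowerBound = bucketMax[i];
            rowsBelow = cumulativeRows[i];
        }
        return rowsBelow; // value is above the last bucket: all rows match
    }

    // Estimated rows with lowExclusive < term <= highInclusive.
    double estimateRange(long lowExclusive, long highInclusive)
    {
        return Math.max(0.0, estimateLessOrEqual(highInclusive) - estimateLessOrEqual(lowExclusive));
    }

    // Exact count for a frequent term, otherwise a rough density-based guess.
    double estimateEquals(long value)
    {
        Long exact = mostFrequentTerms.get(value);
        if (exact != null)
            return exact.doubleValue();
        return estimateRange(value - 1, value);
    }
}
```

Each estimate only reads a few array slots and a map held in memory, which is where the speedup over walking posting lists would come from. The linear interpolation inside a bucket is an assumption of this sketch and can be off for skewed data, which is one reason to track the most frequent terms separately.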

Summary of the code changes introduced by this commit:

* Introduced `TermsDistribution` class that records histograms / most
  frequent terms in memory and quickly estimates the number of rows
  matched by predicates.
* Introduced `SegmentMetadataBuilder` class responsible for building
  `SegmentMetadata`s with `TermsDistribution`s. This class provides
  interceptors that plug into the index building process and allow
  it to intercept additions to the index in an ordered way.
* Introduced a new version of the `OnDiskFormat` of SAI indexes: "EA".
  Unfortunately adding new segment metadata could not be performed in
  a forward-compatible way because `SegmentMetadata` did not store
  any length fields in the serialization. If any of the indexes
  is on the older version, the whole estimation will be done the old
  way, by performing search.
* Memory indexes got new methods for estimation of row counts.
  Estimation can be performed faster than running the search.
* `QueryController` was reorganized to use the new / old system
  of estimation, depending on the index version.
* `QueryView` has been refactored and made more flexible.
  Query view builder is now an inner class,
  following the pattern we use in many other places.
  It doesn't need an explicit reference to `Orderer` now, but takes
  a more generic `IndexContext`.
* `QueryController` now uses `QueryView` / `QueryView.Builder` to lock
  references to indexes for all types of queries (earlier it used
  `QueryView` for vector search and similar duplicated logic inline for
  non-vector search) - overall a lot of code removed / simplified
  there.
* `TermsIterator` is now responsible for both increasing and decreasing
  the reference counts on the referenced iterators. This allows
  sharing the same `QueryView` between multiple iterators.
  `QueryView#referencedIndexes` and `referencedMemtables` collections
  are now immutable.
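
To illustrate the forward-compatibility point about missing length fields, here is a small sketch of a length-prefixed record. It does not reproduce the SAI `SegmentMetadata` format; the layout, the single `numRows` field, and the class and method names are all hypothetical. The idea it shows is generic: when each record starts with its own length, a reader that only knows the older fields can skip bytes appended by a newer writer.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; not the SAI SegmentMetadata serialization.
final class LengthPrefixedRecordSketch
{
    // Writes [length][numRows][extra bytes from a newer format version].
    static void writeRecord(ByteBuffer out, long numRows, byte[] extra)
    {
        out.putInt(Long.BYTES + extra.length); // length of everything after the prefix
        out.putLong(numRows);
        out.put(extra);
    }

    // An "old" reader: understands only numRows and skips whatever follows it.
    static long readNumRows(ByteBuffer in)
    {
        int length = in.getInt();
        int end = in.position() + length;
        long numRows = in.getLong();
        in.position(end); // jump past fields this reader does not know about
        return numRows;
    }
}
```

Without the length prefix, the old reader would have no way to find where the next record starts once extra bytes are appended, which matches the reason given above for introducing a new format version instead.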

Limitations:

* Memory indexes do not use histograms at the moment.
  For point lookups, search is performed the old way, because
  this is usually very fast anyway. For range queries, we perform
  search only partially and estimate the total number of matches by
  extrapolation.
* Prefiltering the set of indexes selected for intersections
  got removed from `QueryController`.
  This optimization was complex and looked misplaced.
  If performance testing shows the removal of it caused a performance
  regression, it should be reintroduced in the planner.
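
The partial search with extrapolation mentioned for memory-index range queries can be pictured roughly as follows. This is a sketch under assumptions, not the method added in this PR: the in-memory index is modelled as a `NavigableMap` from `long` terms to row counts, the scan stops after a fixed number of terms, and the result is scaled by the fraction of the key range covered so far. All names are hypothetical, `low <= high` is assumed, and overflow for extreme bounds is ignored.

```java
import java.util.Map;
import java.util.NavigableMap;

// Hypothetical sketch; not the memory index estimation code from this PR.
final class RangeExtrapolationSketch
{
    // Estimates rows with low <= term <= high after visiting at most maxTermsToVisit terms.
    static double estimateRows(NavigableMap<Long, Integer> rowsPerTerm,
                               long low, long high, int maxTermsToVisit)
    {
        long rowsSeen = 0;
        int visited = 0;
        long lastTerm = low;
        for (Map.Entry<Long, Integer> e : rowsPerTerm.subMap(low, true, high, true).entrySet())
        {
            if (visited == maxTermsToVisit)
            {
                // Stopped early: assume the rest of the range is as dense as the scanned part.
                double covered = (double) (lastTerm - low + 1) / (high - low + 1);
                return rowsSeen / covered;
            }
            rowsSeen += e.getValue();
            lastTerm = e.getKey();
            visited++;
        }
        return rowsSeen; // the whole range was scanned, so the count is exact
    }
}
```

The uniform-density assumption is the weak spot of this kind of estimate; it trades accuracy on skewed data for a bounded amount of work at planning time.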


sonarcloud · 2024-09-18T16:47:21Z

Quality Gate passed

Issues
7 New issues
2 Accepted issues

Measures
0 Security Hotspots
90.4% Coverage on New Code
0.1% Duplication on New Code

See analysis details on SonarCloud

pkolaczk changed the title ~~10629 sai stats~~ CNDB-10629 Faster SAI predicate selectivity estimation Aug 29, 2024

pkolaczk force-pushed the 10629-sai-stats branch 3 times, most recently from 443ba9e to d592fb2 Compare August 29, 2024 16:02

pkolaczk force-pushed the 10629-sai-stats branch 3 times, most recently from 5f146f1 to aa61fdf Compare September 17, 2024 07:58

pkolaczk marked this pull request as ready for review September 17, 2024 08:15

pkolaczk force-pushed the 10629-sai-stats branch from 74d7428 to 9fea046 Compare September 18, 2024 15:14

pkolaczk requested a review from michaeljmarshall September 18, 2024 15:19
