Implement ShardManagerTokenAware class to split shards along node-token boundaries #1255
To generate `additionalSplits` many tokens the way that the allocation strategy would do it when new nodes are added, we need to do the following:

1. Find `tokensPerNode`. This should be `sortedTokens.size() / endPoints.size()` or `DatabaseDescriptor.getNumTokens()` (these two are expected to be the same; we should bail out if they aren't, or rely on an explicitly specified value in the UCS configuration).
2. Compute `splitPointNodes = Math.ceil(additionalSplits / tokensPerNode)`.
3. Generate `splitPointNodes` many new fake node ids, which need to be assigned to racks round-robin.
4. Generate `tokensPerNode` many tokens for each of these fake nodes (i.e. use `TokenAllocation.create(...)` and then repeatedly call `allocate` on that object). See `OfflineTokenAllocator.MultiNodeAllocator`, which starts from empty and generates tokens for the given number of nodes in each rack; we want a variation of this which starts with a `TokenMetadata` that matches the current one, and then continues adding fake nodes until the required number are generated.
5. Take `additionalSplits` many tokens from the new ones (instead of using the selection scheme below).
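A minimal sketch of this procedure, under stated assumptions: tokens are plain `long`s, at least two already exist, and `allocateOneToken` is a stand-in that just bisects the widest gap (the real code would drive Cassandra's token allocation, e.g. `TokenAllocation` / `OfflineTokenAllocator`, which is rack-aware):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Sketch only: mirrors the arithmetic and round-robin rack assignment described above. */
public class AdditionalSplitSketch
{
    static List<Long> generateAdditionalSplits(List<Long> sortedTokens, int endpointCount,
                                               List<String> racks, int additionalSplits)
    {
        int tokensPerNode = sortedTokens.size() / endpointCount;       // or num_tokens from config
        int splitPointNodes = (int) Math.ceil((double) additionalSplits / tokensPerNode);

        TreeSet<Long> current = new TreeSet<>(sortedTokens);
        List<Long> generated = new ArrayList<>();
        for (int n = 0; n < splitPointNodes; ++n)
        {
            String rack = racks.get(n % racks.size());                 // fake node n assigned to racks round-robin
            for (int t = 0; t < tokensPerNode; ++t)
            {
                long token = allocateOneToken(current, rack);
                current.add(token);
                generated.add(token);
            }
        }
        // Only the first additionalSplits tokens are used; a higher shard count regenerates
        // the same prefix, so truncation keeps the levels aligned.
        return generated.subList(0, additionalSplits);
    }

    /** Stand-in allocator: splits the widest gap in half. Assumes at least two tokens exist;
     *  the real allocator takes the rack into account. */
    static long allocateOneToken(TreeSet<Long> tokens, String rack)
    {
        Long prev = null, bestStart = null;
        long bestLen = -1;
        for (long t : tokens)
        {
            if (prev != null && t - prev > bestLen) { bestLen = t - prev; bestStart = prev; }
            prev = t;
        }
        return bestStart + bestLen / 2;
    }
}
```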
I was thinking about it the wrong way. I was focused on finding the right splits for the existing nodes instead of just adding fake new nodes until we have enough tokens. This seems more straightforward.
I would have thought the selection scheme would give us better data distribution, as opposed to truncating the list of new tokens. Also, if we truncate the list, does that present issues for ensuring that higher levels of UCS have the same splits as lower levels?
I just pushed up a new commit with a proposed implementation. I plan to write tests for it tomorrow.
No, it does not, because the same tokens will be generated again when we generate the higher-count levels (in other words, we can cache the smaller, pre-truncation generation round and use it to save some work when generating the bigger one).
The token allocation gives the biggest-impact token first, which is also the one that splits the current biggest token range, so taking them in order should be a good enough choice and we can save the effort. Also, I'm not sure that looking for the closest-to-even split can't cause bigger oddities.
We can't permit multiple `evenSplitPoints` to map to the same `nodeAlignedSplitPoint`, because this effectively reduces the number of split points.

One thing we can do is remove entries from the node-aligned list when we use them, and enforce that the ones picked for smaller shard counts are also picked, by recursively generating for `shardCount / 2` first until `shardCount` is not divisible by 2.
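A hedged sketch of that recursive scheme, assuming evenly spaced target points over a `[rangeStart, rangeEnd]` span and a mutable `TreeSet` of node-aligned tokens (all names here are illustrative, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Sketch: the split points chosen for shardCount / 2 are always a subset of those chosen
 *  for shardCount (while shardCount stays even), and each token is used at most once. */
public class RecursiveSplitSelectionSketch
{
    static List<Long> selectSplitPoints(int shardCount, long rangeStart, long rangeEnd,
                                        TreeSet<Long> remainingNodeAligned)
    {
        List<Long> chosen = new ArrayList<>();
        // Generate the smaller level first so its picks are reused verbatim.
        if (shardCount % 2 == 0 && shardCount > 1)
            chosen.addAll(selectSplitPoints(shardCount / 2, rangeStart, rangeEnd, remainingNodeAligned));

        for (int i = 1; i < shardCount; ++i)
        {
            if (shardCount % 2 == 0 && i % 2 == 0)
                continue;                          // even positions were handled by the shardCount / 2 call
            long even = rangeStart + (rangeEnd - rangeStart) / shardCount * i;   // "evenSplitPoint" i
            Long nearest = nearestUnused(remainingNodeAligned, even);
            if (nearest != null)
            {
                chosen.add(nearest);
                remainingNodeAligned.remove(nearest);   // remove on use, so no token is picked twice
            }
        }
        Collections.sort(chosen);
        return chosen;
    }

    static Long nearestUnused(TreeSet<Long> tokens, long target)
    {
        Long lo = tokens.floor(target), hi = tokens.ceiling(target);
        if (lo == null) return hi;
        if (hi == null) return lo;
        return target - lo <= hi - target ? lo : hi;
    }
}
```

Because the `shardCount / 2` level runs first and each chosen token is removed from the pool, the smaller level's picks are by construction contained in the larger level's.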
I am pretty sure I already did this, though it might not be in the ideal way. As we iterate over `sortedTokens`, we find the position of the nearest neighbor, and then we set `pos` to the index of the next token in `nodeAlignedSplitPoints`. We then use `pos` as the lower bound in the binary search to find the closest split point. I haven't confirmed that this solution maintains the rule that "UCS requires that the splitting points for a given density are also splitting points for all higher densities".

I can see that the recursive solution would trivially ensure that rule. Do you think we should prefer that approach?
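For comparison, a rough sketch of my reading of that forward-only selection (illustrative names again, and it assumes enough node-aligned tokens remain past `pos` for every split point):

```java
import java.util.List;

/** Sketch: map each evenly spaced split point to a node-aligned token, never moving the
 *  search lower bound backwards, so two split points cannot pick the same token. */
public class ForwardOnlySelectionSketch
{
    static long[] alignSplitPoints(long[] evenSplitPoints, List<Long> nodeAlignedSplitPoints)
    {
        long[] result = new long[evenSplitPoints.length];
        int pos = 0;                                    // lower bound for the next search
        for (int i = 0; i < evenSplitPoints.length; ++i)
        {
            int idx = closestAtOrAfter(nodeAlignedSplitPoints, evenSplitPoints[i], pos);
            result[i] = nodeAlignedSplitPoints.get(idx);
            pos = idx + 1;                              // the next split point must pick a later token
        }
        return result;
    }

    /** Index of the token closest to target among indexes >= lowerBound (a linear scan for
     *  clarity; the real code would binary-search). */
    static int closestAtOrAfter(List<Long> tokens, long target, int lowerBound)
    {
        int best = lowerBound;
        for (int i = lowerBound; i < tokens.size(); ++i)
        {
            if (Math.abs(tokens.get(i) - target) <= Math.abs(tokens.get(best) - target))
                best = i;
            else
                break;                                  // distances only grow once past the target
        }
        return best;
    }
}
```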
I think that mine works assuming we have the power-of-two token allocator.
Yes, I think this could work too (if we don't select a specific token, it would be because we selected it for a different split point). Added a couple of comments on ensuring no repetition, and not exhausting the sorted tokens too early.
This method needs to be unit-tested.
I hit a bug in this part of the code when working on adding unit tests, and I think that it would be better to have a cleaner and clearer design. In general, I don't yet see how we guarantee that we select the same tokens at successively higher levels as tokens are added to and removed from the sorted tokens list. Is it safe to assume a certain …

I tried implementing this, but I ran into trouble because I assumed I would have enough split points if I broke the space up by the nearest token to the midpoint. That code doesn't work because there is no guarantee that the sorted tokens are split equally. Is that design close to what you were thinking? Or do I literally need to add calls to remove tokens from `sortedTokens`, and then at each successive recursive call get the next power of two worth of tokens until I hit the end of `sortedTokens`?
Yes, I did have literal removal from the list in mind. It seems, however, that we can make your original solution work by making sure we always report the right number of splits, by doing something like this:
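The snippet referenced above isn't part of this extract. One guess at its shape, purely as an assumption: make the recursion divide the requested split count rather than the token range, clamping the chosen midpoint index so each side keeps enough tokens; `indexNearestMidpoint` and the structure below are illustrative, not the PR's actual fix:

```java
import java.util.List;

public class SplitCountSketch
{
    /** Pick exactly 'count' split points from sortedTokens[from, to), assuming the range
     *  holds at least 'count' tokens: divide the requested count, not the token range. */
    static void collectSplits(List<Long> sortedTokens, int from, int to, int count, List<Long> out)
    {
        if (count == 0)
            return;
        int leftCount = (count - 1) / 2;
        int rightCount = count - 1 - leftCount;
        int mid = indexNearestMidpoint(sortedTokens, from, to);
        // Clamp so each side keeps enough tokens for its requested share of the splits.
        mid = Math.max(from + leftCount, Math.min(mid, to - 1 - rightCount));
        collectSplits(sortedTokens, from, mid, leftCount, out);
        out.add(sortedTokens.get(mid));
        collectSplits(sortedTokens, mid + 1, to, rightCount, out);
    }

    /** Index (in [from, to)) of the token nearest to the midpoint of the covered token span. */
    static int indexNearestMidpoint(List<Long> sortedTokens, int from, int to)
    {
        long lo = sortedTokens.get(from), hi = sortedTokens.get(to - 1);
        long midpoint = lo + (hi - lo) / 2;
        int best = from;
        for (int i = from; i < to; ++i)
            if (Math.abs(sortedTokens.get(i) - midpoint) < Math.abs(sortedTokens.get(best) - midpoint))
                best = i;
        return best;
    }
}
```

A caller would invoke something like `collectSplits(sortedTokens, 0, sortedTokens.size(), shardCount - 1, out)`, assuming `sortedTokens.size() >= shardCount - 1`.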
Could you run this through your test to see if it works? If not, could you upload the test so that I can play with it?
That passed my test, thank you! I'll add some more tests proving that, as we add nodes, the split points continue to be aligned.
This algorithm fails to choose the same tokens if we add one shard at a time. For example, given sorted tokens

`[-9193912203636246467, -4846620020038852605, -1955575638654777768, 1845313162618248341, 4442481910831405714, 7317158111889931131]`

3 shards chooses split points `[-1955575638654777768, 1845313162618248341]`, while 4 shards chooses split points `[-4846620020038852605, 1845313162618248341, 4442481910831405714]`.

Observe that the 3 split points do not contain the 2 split points. I think this is acceptable because the shard count is always `baseShardCount * 2^n`. Is that correct? I'll push up a test shortly that confirms this works with power-of-2 growth.
I just pushed a commit with `testRangeEndsAreFromTokenListAndContainLowerRangeEnds` to show that this works for power-of-2 shard counts.
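The test itself isn't included in this extract; the containment property its name suggests could be expressed roughly like this, against a hypothetical `splitPoints(shardCount)` accessor (the real method and class names differ):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: every split point chosen at one power-of-2 level must reappear at the next. */
public class PowerOfTwoContainmentSketch
{
    static void assertLowerCountsContained(int baseShardCount, int maxDoublings)
    {
        Set<Long> previous = new HashSet<>();
        for (int d = 0; d <= maxDoublings; ++d)
        {
            int shardCount = baseShardCount << d;
            Set<Long> points = new HashSet<>(splitPoints(shardCount));
            if (!points.containsAll(previous))
                throw new AssertionError("split points for " + shardCount
                                         + " shards do not contain those of the previous level");
            previous = points;
        }
    }

    // Hypothetical stand-in for the implementation under test.
    static List<Long> splitPoints(int shardCount)
    {
        throw new UnsupportedOperationException("stand-in");
    }
}
```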
Yeah, we are only guaranteed that `findNearest(x)` is a subset of `findNearest(y)` when `x` is a subset of `y`. The logic is that if we couldn't select the closest value to a point, it is only because we have already picked it for another point. This shouldn't be too hard to prove formally if we need to do that (I did check if ChatGPT 4o is smart enough to do it, without success).