Write Ahead Log & Rebroadcast #392

Open · Stebalien opened this issue Jul 1, 2024 · 4 comments · May be fixed by #640

@Stebalien (Member)
We've implemented limited rebroadcasting to help lagging/restarting nodes catch up. However, if 66%+ of the network crashes after starting an instance but before sending a single decide message, the network could decide on two different values for the same instance.

A simple solution here is write-ahead logging. That is:

  1. Before sending any message (maybe limit it to commit messages?) log/sync the message to disk. To save space, we could just record a single message template.
  2. On restart, re-load all (or maybe the last round? commits only?) messages from the last instance started but with no decision.
  3. Rebroadcast those messages and resume from that point.

Of course, nothing will help if the actual disks die. But this will at least help us recover in case someone finds a way to crash the entire network all at once.
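
A minimal sketch of the write-ahead step in Go, assuming a hypothetical `walEntry` record and a plain append-only file (the real gpbft message types and storage layer would differ):

```go
package wal

import (
	"encoding/json"
	"fmt"
	"os"
)

// walEntry is a hypothetical record of a message we are about to broadcast.
// In practice it would carry the GPBFT instance, round, phase and payload
// (or a single message template, per point 1 above).
type walEntry struct {
	Instance uint64 `json:"instance"`
	Round    uint64 `json:"round"`
	Phase    string `json:"phase"` // e.g. "COMMIT"
	Payload  []byte `json:"payload"`
}

// Log appends the entry to the write-ahead log and fsyncs it before the
// caller is allowed to broadcast the corresponding message.
func Log(path string, e walEntry) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	buf, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err := f.Write(append(buf, '\n')); err != nil {
		return err
	}
	// Sync so the record survives a crash of the whole process.
	if err := f.Sync(); err != nil {
		return fmt.Errorf("syncing WAL: %w", err)
	}
	return nil
}
```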

The specific attack I'm worried about is as follows:

  1. An attacker listens for commit messages and waits until they see a quorum (enough to reach a "decision").
  2. The attacker checks to see if it knows of a better tipset at the same height (more weight). E.g., the attacker may choose to withhold a block to make this happen.
  3. The attacker then uses some previously unknown exploit to crash all Lotus nodes.
  4. The attacker submits a certificate for the "forgotten" decision to some bridge.
  5. The network restarts/resumes.
  6. The network agrees on a different value.
  7. The bridge is now borked.
@Stebalien (Member, Author)

An alternative is to wait an instance. That is, always consider the latest finality certificate as "pending" until one has been built on top of it. We can do this safely due to the power table lookback. The network would have to be willing to "switch" decisions while the latest instance is still pending.

This lookback won't be completely transparent to the client, but it shouldn't be that hard to implement...
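
Roughly, the client-side view under this alternative could look like the sketch below; `Certificate` is a stand-in for the real finality certificate type, not the repository's actual API:

```go
package pending

// Certificate is a stand-in for the real finality certificate type.
type Certificate struct {
	Instance uint64
}

// Finalized returns the certificates a client may treat as final under the
// "wait an instance" alternative: the newest certificate stays pending until
// a later instance has been built on top of it.
func Finalized(chain []Certificate) []Certificate {
	if len(chain) == 0 {
		return nil
	}
	// The tail certificate is still "pending" and could, in principle, be
	// switched out when the network resumes.
	return chain[:len(chain)-1]
}
```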

@Kubuxu (Contributor) commented Jul 4, 2024

We discussed this in person. The alternative is not a good solution because we don't have a hash link (and even the existence of that additional finality certificate has serious consequences).

@Stebalien (Member, Author)

We still need a way to re-start an instance halfway through... which is going to require some work.
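
Continuing the WAL sketch above (same hypothetical `walEntry` type), restarting halfway through an instance could look roughly like this: reload the logged messages for any instance that never reached a decision and hand them back for rebroadcast.

```go
package wal

import (
	"bufio"
	"encoding/json"
	"errors"
	"os"
)

// Replay reads back the entries logged for instances that never reached a
// decision so the participant can resume mid-instance and rebroadcast what
// it already sent. Names and layout are illustrative only.
func Replay(path string, lastDecided uint64) ([]walEntry, error) {
	f, err := os.Open(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil, nil // nothing logged, nothing to resume
	} else if err != nil {
		return nil, err
	}
	defer f.Close()

	var pending []walEntry
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var e walEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			return nil, err
		}
		// Keep only messages from instances newer than the last decision.
		if e.Instance > lastDecided {
			pending = append(pending, e)
		}
	}
	return pending, scanner.Err()
}
```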

github-merge-queue bot pushed a commit that referenced this issue Jul 13, 2024
* Remove manifest versions

This also allows us to re-config without re-bootstrap (by restarting the
local F3 instance). This is only safe for non-consensus parameters and
will be horribly unsafe until we fix #392.

Also:

- Removes reliance on hashing json, relies on the network name and
  manual equality checks.
- Removes versions. We now expect the version to be explicitly specified
  in the network name.
- Starts message sequence numbers at the current time so we don't need
  to save them.
- Remove the EC from the dynamic manifest provider as it's unused.
- Tests.

fixes #468

* additional equality test

* send updates whenever the manifest changes
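
The "starts message sequence numbers at the current time" bullet can be pictured roughly as below; the function name and millisecond granularity are assumptions, not the repository's actual code.

```go
package seq

import "time"

// initialSequence derives a starting message sequence number from the wall
// clock so it is larger after every restart without persisting the last
// value used.
func initialSequence() uint64 {
	return uint64(time.Now().UnixMilli())
}
```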
@Stebalien (Member, Author)

Punting to M2 because this isn't absolutely required for testing.

jennijuju added the P1 label Aug 19, 2024
rjan90 linked a pull request Sep 12, 2024 that will close this issue
Status: In progress · 3 participants