Write Ahead Log & Rebroadcast #392

Open · Stebalien opened this issue Jul 1, 2024 · 4 comments · May be fixed by #640

@Stebalien (Member)
We've implemented limited rebroadcasting to help lagging/restarting nodes catch up. However, if 66%+ of the network crashes after starting an instance but before sending a single decide message, the network could decide on two different values for the same instance.

A simple solution here is write-ahead logging. That is:

  1. Before sending any message (maybe limit it to commit messages?) log/sync the message to disk. To save space, we could just record a single message template.
  2. On restart, re-load all (or maybe the last round? commits only?) messages from the last instance started but with no decision.
  3. Rebroadcast those messages and resume from that point.

Of course, nothing will help if the actual disks die. But this will at least help us recover in case someone finds a way to crash the entire network all at once.
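
A minimal sketch of the write-ahead step in Go, assuming a hypothetical `walEntry` record and a plain append-only file (the real gpbft message types and storage layer would differ):

```go
package wal

import (
	"encoding/json"
	"fmt"
	"os"
)

// walEntry is a hypothetical record of a message we are about to broadcast.
// In practice it would carry the GPBFT instance, round, phase and payload
// (or a single message template, per point 1 above).
type walEntry struct {
	Instance uint64 `json:"instance"`
	Round    uint64 `json:"round"`
	Phase    string `json:"phase"` // e.g. "COMMIT"
	Payload  []byte `json:"payload"`
}

// Log appends the entry to the write-ahead log and fsyncs it before the
// caller is allowed to broadcast the corresponding message.
func Log(path string, e walEntry) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	buf, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err := f.Write(append(buf, '\n')); err != nil {
		return err
	}
	// Sync so the record survives a crash of the whole process.
	if err := f.Sync(); err != nil {
		return fmt.Errorf("syncing WAL: %w", err)
	}
	return nil
}
```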

The specific attack I'm worried about is as follows:

  1. An attacker listens for commit messages and waits until they see a quorum (enough to reach a "decision").
  2. The attacker checks to see if it knows of a better tipset at the same height (more weight). E.g., the attacker may choose to withhold a block to make this happen.
  3. The attacker then uses some previously unknown exploit to crash all Lotus nodes.
  4. The attacker submits a certificate for the "forgotten" decision to some bridge.
  5. The network restarts/resumes.
  6. The network agrees on a different value.
  7. The bridge is now borked.
@Stebalien (Member, Author)

An alternative is to wait an instance. That is, always consider the latest finality certificate as "pending" until one has been built on top of it. We can do this safely due to the power table lookback. The network would have to be willing to "switch" decisions while the latest instance is still pending.

This lookback won't be completely transparent to the client, but it shouldn't be that hard to implement...
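
Roughly, the client-side view under this alternative could look like the sketch below; `Certificate` is a stand-in for the real finality certificate type, not the repository's actual API:

```go
package pending

// Certificate is a stand-in for the real finality certificate type.
type Certificate struct {
	Instance uint64
}

// Finalized returns the certificates a client may treat as final under the
// "wait an instance" alternative: the newest certificate stays pending until
// a later instance has been built on top of it.
func Finalized(chain []Certificate) []Certificate {
	if len(chain) == 0 {
		return nil
	}
	// The tail certificate is still "pending" and could, in principle, be
	// switched out when the network resumes.
	return chain[:len(chain)-1]
}
```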

@Kubuxu (Contributor) commented Jul 4, 2024

We discussed this in person. The alternative is not a good solution because we don't have a hash link (and even the existence of that additional finality certificate has serious consequences).

@Stebalien (Member, Author)

We still need a way to re-start an instance halfway through... which is going to require some work.
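
Continuing the WAL sketch above (same hypothetical `walEntry` type), restarting halfway through an instance could look roughly like this: reload the logged messages for any instance that never reached a decision and hand them back for rebroadcast.

```go
package wal

import (
	"bufio"
	"encoding/json"
	"errors"
	"os"
)

// Replay reads back the entries logged for instances that never reached a
// decision so the participant can resume mid-instance and rebroadcast what
// it already sent. Names and layout are illustrative only.
func Replay(path string, lastDecided uint64) ([]walEntry, error) {
	f, err := os.Open(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil, nil // nothing logged, nothing to resume
	} else if err != nil {
		return nil, err
	}
	defer f.Close()

	var pending []walEntry
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var e walEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			return nil, err
		}
		// Keep only messages from instances newer than the last decision.
		if e.Instance > lastDecided {
			pending = append(pending, e)
		}
	}
	return pending, scanner.Err()
}
```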

github-merge-queue bot pushed a commit that referenced this issue Jul 13, 2024
* Remove manifest versions

This also allows us to re-config without re-bootstrap (by restarting the
local F3 instance). This is only safe for non-consensus parameters and
will be horribly unsafe until we fix #392.

Also:

- Removes reliance on hashing json, relies on the network name and
  manual equality checks.
- Removes versions. We now expect the version to be explicitly specified
  in the network name.
- Starts message sequence numbers at the current time so we don't need
  to save them.
- Remove the EC from the dynamic manifest provider as it's unused.
- Tests.

fixes #468

* additional equality test

* send updates whenever the manifest changes
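
The "starts message sequence numbers at the current time" bullet can be pictured roughly as below; the function name and millisecond granularity are assumptions, not the repository's actual code.

```go
package seq

import "time"

// initialSequence derives a starting message sequence number from the wall
// clock so it is larger after every restart without persisting the last
// value used.
func initialSequence() uint64 {
	return uint64(time.Now().UnixMilli())
}
```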
@Stebalien (Member, Author)

Punting to M2 because this isn't absolutely required for testing.

jennijuju added the P1 label Aug 19, 2024
rjan90 linked a pull request Sep 12, 2024 that will close this issue
Status: In progress · 3 participants