Warning:
This component has no documented requirements. A new requirements page should be created and linked to this component.
The caffeinated node watchdog is a component that runs in the background of a Nitro caffeinated node and monitors for situations that could cause, or have already caused, the node's Espresso-derived state to differ from the state implied by the finalized L1 (i.e. the state that the rollup will eventually finalize). This can happen due to safety violations in Espresso, bugs in the caffeinated node, edge cases like force inclusion, etc.
Such cases are viewed as severe and exceptionally rare, so the main concern is detecting them and preventing or limiting potential loss of funds caused by the caffeinated node serving invalid states. Recovering gracefully is viewed as less important. When the watchdog detects such a case, it panics the caffeinated node, at which point clients are expected to fall back to a slower but more reliable vanilla Nitro node deriving from the L1.
Tech Debt:
This entire component is tech debt. Eventually we want the caffeinated node to smoothly handle even the most catastrophic cases by deriving from both Espresso and the L1, and reorging the chain if the L1 derivation eventually catches up to the Espresso derivation but reaches an inconsistent state.
The watchdog consists of several tasks that run asynchronously to monitor different circumstances which could cause an inconsistency in the state derivation. Any one of these tasks can kill the whole process. Where possible, we try to detect circumstances that could cause an inconsistent state but haven't yet, and panic before the node ever reports an inconsistent state (e.g. the Force Inclusion Checker). However, in cases that should never happen but could if there is a bug in the caffeinated node or in Espresso, the best we can do is detect inconsistencies soon after the fact and panic to limit the damage (e.g. the State Checker).
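As a rough illustration of this structure, the sketch below shows one way such a watchdog could wire its checkers together. The Watchdog and Checker names and the shared Start method are hypothetical, not taken from the actual codebase.

package watchdog

import "context"

// Checker is any background task that monitors one class of potential
// inconsistency in the caffeinated node's state derivation.
type Checker interface {
	// Start launches the checker's polling loop. A checker panics the process
	// if it detects (or anticipates) an inconsistent state.
	Start(ctx context.Context) error
}

type Watchdog struct {
	checkers []Checker // e.g. the force inclusion checker and the state checker
}

// Start launches every checker; each one keeps running asynchronously after
// Start returns.
func (w *Watchdog) Start(ctx context.Context) error {
	for _, c := range w.checkers {
		if err := c.Start(ctx); err != nil {
			return err
		}
	}
	return nil
}

Each checker panics on detection rather than returning an error, so a single detection by any one task is enough to bring down the whole process.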
This component is responsible for checking whether a message might be force included. It is critical for the security of Espresso confirmations to check whether a delayed message is nearing the force inclusion window on the sequencer inbox due to a potentially malicious sequencer. Nitro allows delayed messages that are submitted directly to the L1 to be force included via a call to the SequencerInbox on the rollup's parent chain, as a way to short-circuit a malicious sequencer attempting to censor those messages. Because such a message sidesteps the batch posting process, it takes a different route into the rollup's final state, one that does not involve posting the transaction to Espresso, which would invalidate Espresso confirmations.
The CaffNode's responsibility is to securely provide execution and state updates backed by Espresso confirmations. If a message were force included without being sent to Espresso, the CaffNode would produce an incorrect internal state and serve incorrect data to any consumers of its RPC. To handle this case, the caff node will panic if a message is near the force inclusion window. Force inclusion is a drastic case that requires a malicious (and currently centralized) sequencer to occur. Panicking the process is also drastic, but it fundamentally prevents loss of funds, so a temporary loss of performance and liveness is an acceptable trade-off. This approach also gives time and notice for human intervention to run a non-malicious sequencing service on the network, so that an honest batch poster can properly include the delayed message and prevent force inclusion.
When the checker's Start function is called, it starts a polling loop that checks whether a force inclusion is possible:
func (s *ForceInclusionChecker) Start(ctxIn context.Context) error {
	s.StopWaiter.Start(ctxIn, s)

	// Poll on the configured interval; if a check fails, retry sooner.
	err := s.CallIterativelySafe(func(ctx context.Context) time.Duration {
		err := s.checkIfMessageCanBeForceIncluded(ctx)
		if err != nil {
			return s.retryTime
		}
		log.Info("Now processing hotshot block", "block number", s.nextHotshotBlockNum)
		return s.pollingInterval
	})
	return err
}
This function checks whether a delayed message was added within a tolerance window of the force inclusion window, or earlier, and therefore has the potential to be force included.
It works as follows:
It first reads the sequencer inbox contract to find the index of the last delayed message that was included in a batch.
Then it calls the getForceIncludableEvent function, which in turn queries the Bridge contract for a MessageDelivered event emitted at least 20 hours ago. A MessageDelivered event is emitted whenever a new delayed message is added to the Bridge contract.
If the index of the last delayed message posted in a batch is less than the message index of that event, the message can be force included.
If that is the case, panic the caff node.
func (s *ForceInclusionChecker) checkIfMessageCanBeForceIncluded(ctx context.Context) error {
	// Index of the last delayed message that has been read into a batch.
	lastDelayedMessage, err := s.l1Client.SequencerInboxContract.GetDelayedMessagesRead(ctx)
	if err != nil {
		return err
	}

	if s.isForceInclusionPossibleSoon(lastDelayedMessage) {
		panic("delayed message is approaching the force inclusion window")
	}
	return nil
}
getOldForceIncludableEvent is inspired by the arbitrum-sdk: it uses multiple calls over a range of blocks to fetch delayed messages older than 20 hours and check whether some message is likely to be force included.
Its subtle logic resides in getEventsAndIncreaseRange. This function gets the block number at the start of the force inclusion tolerance window, like so:
func getForceInclusionToleranceBlockNumber() uint64 {
	maxTimeVariation := sequencerInbox.maxTimeVariation()
	currentParentChainBlock := getCurrentParentChainBlock()

	// The latest parent chain block whose delayed messages would already fall
	// inside the tolerance window around the force inclusion deadline.
	lastBadBlockNumber := currentParentChainBlock.number - maxTimeVariation.delayBlocks + config.BlockThresholdTolerance
	lastBadBlockTime := currentParentChainBlock.timestamp - maxTimeVariation.delaySeconds + config.SecondThresholdTolerance

	lastBadBlock := findFirstParentChainBlockBelow(lastBadBlockNumber, lastBadBlockTime)

	return lastBadBlock.number
}
This function returns the first block in the chain that could include a delayed message inside the tolerance window, based on our configured tolerances with respect to the force inclusion window.
Then, in findFirstParentChainBlockBelow(), the inclusion checker starts at the block number provided and iterates backwards through blocks until it reaches a delayed message event.
This block number is then used by getForceIncludableEvent, like so:
func isForceInclusionPossibleSoon(lastDelayedMessageIndex uint64) bool {
	start := getForceInclusionToleranceBlockNumber()
	// Iterate backwards from the start block by the max range step config value.
	for i := start; ; i -= config.maxRangeStep {
		events := getDelayedEventsInRange(i-config.maxRangeStep, i)
		if len(events) > 0 {
			event := events[len(events)-1] // we want to see the most recent event
			if event.MessageIndex < lastDelayedMessageIndex {
				// Already read into a batch: nothing in the window can be force included.
				return false
			}
			return true
		}
	}
}
This function starts at the first block that is within the force inclusion tolerance, and iterates backwards looking for the first delayed message event.
If the event's message index is below the last delayed message index, there is no unsequenced delayed message within the tolerance window, so the function returns false. Otherwise it returns true and the caller panics the process.
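The getDelayedEventsInRange helper used in the loop above is not defined in this document. The following is a purely illustrative sketch in the same pseudocode style as the blocks above: the eventInfo fields, the bridgeLog type, and the bridgeLogsInRange helper are hypothetical stand-ins for however the real implementation queries the Bridge contract's MessageDelivered logs.

// bridgeLog mirrors the fields of a MessageDelivered log that we care about
// (hypothetical; the real log carries more data).
type bridgeLog struct {
	messageIndex uint64
	blockNumber  uint64
}

// bridgeLogsInRange is assumed to be wired up elsewhere to fetch the Bridge
// contract's MessageDelivered logs between two parent chain block numbers.
var bridgeLogsInRange func(from, to uint64) []bridgeLog

type eventInfo struct {
	MessageIndex uint64 // index of the delayed message
	BlockNumber  uint64 // parent chain block in which it was delivered
}

func getDelayedEventsInRange(from, to uint64) []eventInfo {
	var events []eventInfo
	for _, l := range bridgeLogsInRange(from, to) {
		events = append(events, eventInfo{MessageIndex: l.messageIndex, BlockNumber: l.blockNumber})
	}
	return events
}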
As an optimization, this check runs only when the delayed message count on the sequencer inbox has not been updated for a specified interval.
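A minimal sketch of that gating logic, in the same pseudocode style as above; the lastSeenDelayedCount, lastCountChange, and staleInterval fields on the checker are hypothetical names for whatever the real configuration uses.

// Only run the expensive backwards scan when the delayed message count read by
// the sequencer inbox has been stale for longer than the configured interval.
func (s *ForceInclusionChecker) shouldRunScan(currentDelayedCount uint64, now time.Time) bool {
	if currentDelayedCount != s.lastSeenDelayedCount {
		// Delayed messages are still being read into batches; remember the new
		// count and skip the scan for now.
		s.lastSeenDelayedCount = currentDelayedCount
		s.lastCountChange = now
		return false
	}
	// The count has not moved recently, so the sequencer may be withholding
	// delayed messages; run the force inclusion scan.
	return now.Sub(s.lastCountChange) >= s.staleInterval
}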
The state checker is an optional task which monitors for a worst-case scenario by constantly comparing the state of the caffeinated node to that of a (slower, more reliable) full node deriving from the L1. If the states ever differ at the same position in the chain, the caffeinated node panics. Such a divergence will usually be detected some blocks behind the head reported by the caffeinated node, since the L1 node takes longer to provide a state for those blocks. Thus, if this does happen, the caffeinated node may report an invalid state for newer blocks for a period of some minutes.
The state checker is a simple loop that utilizes the RPCs of both the caffeinated node and another node whose URL should be provided by the operator. It assumes both nodes are honest, so the intention is for risk-intolerant operators to run both their own caffeinated node and their own node deriving from the L1, using the latter to panic the former and as a fallback in case the caffeinated node reports the wrong state.
The loop run by the state checker consists of the following steps:
Get the latest finalized header from the L1-deriving node, using the eth_getBlockByNumber RPC call with the finalized block tag.
Get the corresponding header from the caffeinated node using eth_getBlockByNumber.
If the two headers have different values for their stateRoot field, panic.
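A minimal sketch of one iteration of this loop, assuming both nodes expose standard Ethereum JSON-RPC and using go-ethereum's ethclient; the function name, error handling, and panic message are illustrative rather than taken from the actual implementation.

package watchdog

import (
	"context"
	"fmt"
	"math/big"

	"github.com/ethereum/go-ethereum/ethclient"
	"github.com/ethereum/go-ethereum/rpc"
)

// checkStateOnce compares the state root of the latest finalized block, as seen
// by the L1-deriving node, with the caffeinated node's header for the same block.
func checkStateOnce(ctx context.Context, l1DerivingNode, caffNode *ethclient.Client) error {
	// eth_getBlockByNumber with the "finalized" block tag on the L1-deriving node.
	// (Recent go-ethereum versions map the negative FinalizedBlockNumber to the tag.)
	finalized, err := l1DerivingNode.HeaderByNumber(ctx, big.NewInt(int64(rpc.FinalizedBlockNumber)))
	if err != nil {
		return err
	}
	// The corresponding header from the caffeinated node.
	caffHeader, err := caffNode.HeaderByNumber(ctx, finalized.Number)
	if err != nil {
		return err
	}
	// If the stateRoot fields differ, the caffeinated node has diverged: panic.
	if caffHeader.Root != finalized.Root {
		panic(fmt.Sprintf("state root mismatch at block %v: caffeinated %v vs finalized %v",
			finalized.Number, caffHeader.Root, finalized.Root))
	}
	return nil
}

The state checker simply calls a function like this on its polling interval, retrying on transient RPC errors.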