Contents |
Data availability is the property of data being persistently stored and made available for retrieval, usually with strong cryptoeconomic guarantees of retreivability.
The low-level VID protocol works on byte strings. In the GCL, the data being dispersed is more structured: a list of discrete transactions, each of which has a variably sized payload of bytes and belongs to a particular namespace. Thus the VID protocol requires lowering this structured payload into a flat byte string.
This encoding is non-trivial because it is done by the untrusted leader in HotShot consensus. It must be impossible for a malicious leader to construct an encoding which passes all the checks that replicas are capable of enforcing with the limited information they receive (VID shares and VID common) but is later found not to be decodable into the structured payload format. Thus, we assign deterministic interpretations to various kinds of pathological encodings that HotShot consensus nodes cannot exclude before voting on a proposal.
In the rest of this section, text formatted in red indicates an edge case that must only be handled in case of a malicious encoder.
Note:
Starting with protocol version 0.3, some of these edge cases became obsolete, as the underlying VID protocol was changed allowing replicas to provide certain stronger guarantees about the payload encoding at the time they vote on a proposal. This primarily results from replicas having access to a separate VID commitment for each namespace.
Nevertheless, the original payload encoding was kept unchanged to minimize the difference with version ≤ 0.2, and avoid maintaining two separate implementations of payload encoding and decoding.
The encoding consists of two parts:
is a binary-encoded mapping indicating what range of the raw payload corresponds to each namespace in the block. This mapping is succinct, and so it is not dispersed using VID; instead it is included wholesale in the header. This allows clients to discern certain properties about the payload, like the location of a particular namespace within it, without downloading the entire thing.
is a binary-encoded list of namespaces. Each namespace is preceded by a transaction table, which provides the byte ranges within the namespace corresponding to each of a list of transactions. The entire raw payload is encoded to a flat byte string and dispersed using VID.
Format A namespace table is essentialy a variable length list of namespace table entries, where each entry is the pair of a namespace ID and the offset in bytes of the end of that namespace in the raw payload.
Field | Size (bytes) |
Number of entries | 4 |
Entry 0 namespace ID | 4 |
Entry 0 offset | 4 |
Entry 1 namespace ID | 4 |
Entry 1 offset | 4 |
... | ... |
The first field indicates the number n of entries in the table as a little-endian unsigned integer. If the entire table length is smaller than 4 bytes then the missing bytes are zero-padded.
The remaining fields encode table entries. Each entry consumes exactly 8 bytes. The number n could be anything, including a number much larger than the number of entries that could fit in the namespace table. As such, the actual number of entries in the table is defined as the minimum of n and the maximum number of whole entries that could fit in the table, which is
The first 4 bytes of each table entry indicate the namespace ID for this entry. Any table entry whose namespace ID is a duplicate of a previous entry is ignored. The next 4 bytes of each table entry indicate the offset of the end of the namespace in the raw payload. This is a little-endian unsigned integer.
Interpretation The primary operation on the namespace table is to extract the range in the raw payload of the bytes corresponding to a given namespace N. For this, one needs both the start and end offsets for N. The end offset is encoded explicitly in the table entry for N, but the start offset is defined implicitly:
If N occupies the ith entry in the namespace table for i > 0, then the start offset for N is defined as the end offset of the i − 1th entry.
Even if the i − 1th entry would otherwise be ignored (due to a duplicate namespace ID or any other reason), that entry’s end offset still defines the start offset of N. This rule guarantees that both start and end offsets for any namespace N can be read from a constant-size byte range in the namespace table, and it eliminates the need to traverse an unbounded number of previous entries of the namespace table looking for a previous non-ignored entry.
The start offset of the 0th entry in the table is defined to be 0.
The start and end offsets as defined by the namespace table, nsStart and nsEnd, could be anything. As such, the actual start and end offsets, start and end, are defined so as to ensure that the byte range is well-defined and in-bounds for the raw payload:
end = min(nsEnd,len(rawPayloadBytes))
start = min(nsStart,end)
Note that this dependece on the length of the raw payload means the namespace table cannot be directly interpreted on its own. To make sense of the namespace table, you need the raw payload, or at least its length. Thus, these two objects are tightly coupled parts of the same overall object: a block payload.
It is possible that a maliciously encoded namespace table could leave bytes at the end of the raw payload which are not included in any namespace. This is not a problem, but an honestly encoded namespace table will not do this; the end offset of the final namespace will equal the byte length of the raw payload.
It is also possible that a maliciously encoded namespace table could define two distinct namespaces whose byte ranges overlap. This is okay; if a malicious proposer wants to define their block this way, they can. But in an honestly encoded namespace table, all namespaces will be consecutive and disjoint.
The raw payload is a simple concatenation of namespace payloads. The boundaries of these namespace payloads within the raw payload are defined externally by the Namespace Table, so no extra top-level metadata is needed.
Each individual namespace payload is encoded as two concatenated byte sequences:
Transaction Table A transaction table is essentialy a variable length list of transaction table entries, where each entry is the offset in bytes of the end of that transaction in the transaction payloads section of the namespace payload.
Field | Size (bytes) |
Number of entries | 4 |
Transaction 0 offset | 4 |
Transaction 1 offset | 4 |
... | ... |
The first field indicates the number n of entries in the table as a little-endian unsigned integer. If the entire namespace payload byte length is smaller than 4 bytes then the missing bytes are zero-padded.
The remaining fields encode table entries. Each entry consumes exactly 4 bytes. The number n could be anything, including a number much larger than the number of entries that could fit in the namespace payload. As such, the actual number of entries in the table is defined as the minimum of n and the maximum number of whole entries that could fit in the namespace payload, which is
Transaction Payloads The transaction payloads consist of any bytes in the namespace payload beyond the transaction table. In order to extract the payload bytes of a single transaction T, one needs both the start and end offsets for T. The end offset can be read directly from the transaction table. The start offset is defined implicitly:
If T occupies the ith entry in the transaction table for i > 0, then the start offset for T is defined as the end offset of the i − 1th entry in the table.
The start offset of the 0th entry in the table is defined to be 0.
Thus, both start and end offsets for any transaction T can be read from a contiguous, constant-size byte range in the transaction table. This property facilitates transaction proofs.
The start and end offsets as defined by the transaction table, txStart and txEnd, could be anything. As such, the actual start and end offsets, start and end, are defined so as to ensure that the byte range is well-defined and in-bounds for the transaction payloads section:
end = min(txEnd,len(nsPayloadBytes) − len(txTableBytes))
start = min(txStart,end)
To get the byte range for T relative to the namespace payload, the range
start..end
is translated by the byte length of the transaction table as declared in the
table itself, suitably truncated to fit within the namespace payload. If the transaction
table declares a huge number of entries that cannot fit into the namespace payload
then all transactions in this namespace have a zero-length byte range whose
start and end offsets are both equal to the byte length of the namespace
payload.
It is possible that a maliciously encoded transaction table could leave bytes at the end of the namespace payload which are not included in any transaction. This is not a problem, but an honestly encoded transaction table will not do this; the end offset of the final transaction will equal the byte length of the transaction payloads section.
It is also possible that a maliciously encoded transaction table could define two distinct transactions whose byte ranges overlap. This is okay; if a malicious proposer wants to define their block this way, they can. But in an honestly encoded transaction table, all transactions will be consecutive and disjoint.