37 Data Availability

Chapter 37
Data Availability

Contents

37.1 Overview
37.2 CDN
37.3 DA Committee
37.4 VID
37.4.1 Data Structures
Common
Shares
Commitment
37.4.2 Payload Encoding
Namespace Table
Format
Interpretation
Raw Payload
Transaction Table
Transaction Payloads

37.1 # Overview

Data availability is the property of data being persistently stored and made available for retrieval, usually with strong cryptoeconomic guarantees of retreivability.

37.2 # CDN

37.3 # DA Committee

37.4 # VID

37.4.1 # Data Structures

# Common

# Shares

# Commitment

37.4.2 # Payload Encoding

The low-level VID protocol works on byte strings. In the GCL, the data being dispersed is more structured: a list of discrete transactions, each of which has a variably sized payload of bytes and belongs to a particular namespace. Thus the VID protocol requires lowering this structured payload into a flat byte string.

This encoding is non-trivial because it is done by the untrusted leader in HotShot consensus. It must be impossible for a malicious leader to construct an encoding which passes all the checks that replicas are capable of enforcing with the limited information they receive (VID shares and VID common) but is later found not to be decodable into the structured payload format. Thus, we assign deterministic interpretations to various kinds of pathological encodings that HotShot consensus nodes cannot exclude before voting on a proposal.

In the rest of this section, text formatted in red indicates an edge case that must only be handled in case of a malicious encoder.

Note:

Starting with protocol version 0.3, some of these edge cases became obsolete, as the underlying VID protocol was changed allowing replicas to provide certain stronger guarantees about the payload encoding at the time they vote on a proposal. This primarily results from replicas having access to a separate VID commitment for each namespace.

Nevertheless, the original payload encoding was kept unchanged to minimize the difference with version ≤ 0.2, and avoid maintaining two separate implementations of payload encoding and decoding.

The encoding consists of two parts:

A namespace table: is a binary-encoded mapping indicating what range of the raw payload corresponds to each namespace in the block. This mapping is succinct, and so it is not dispersed using VID; instead it is included wholesale in the header. This allows clients to discern certain properties about the payload, like the location of a particular namespace within it, without downloading the entire thing.
The raw payload: is a binary-encoded list of namespaces. Each namespace is preceded by a transaction table, which provides the byte ranges within the namespace corresponding to each of a list of transactions. The entire raw payload is encoded to a flat byte string and dispersed using VID.

# Namespace Table

Format A namespace table is essentialy a variable length list of namespace table entries, where each entry is the pair of a namespace ID and the offset in bytes of the end of that namespace in the raw payload.


Field	Size (bytes)


Number of entries	4

Entry 0 namespace ID	4
Entry 0 offset	4

Entry 1 namespace ID	4
Entry 1 offset	4

...	...

Table 37.1: Namespace table layout

The first field indicates the number n of entries in the table as a little-endian unsigned integer. If the entire table length is smaller than 4 bytes then the missing bytes are zero-padded.

The remaining fields encode table entries. Each entry consumes exactly 8 bytes. The number n could be anything, including a number much larger than the number of entries that could fit in the namespace table. As such, the actual number of entries in the table is defined as the minimum of n and the maximum number of whole entries that could fit in the table, which is

len(tableBytes)−-4 ⌊ 8 ⌋

The first 4 bytes of each table entry indicate the namespace ID for this entry. Any table entry whose namespace ID is a duplicate of a previous entry is ignored. The next 4 bytes of each table entry indicate the offset of the end of the namespace in the raw payload. This is a little-endian unsigned integer.

Interpretation The primary operation on the namespace table is to extract the range in the raw payload of the bytes corresponding to a given namespace N. For this, one needs both the start and end offsets for N. The end offset is encoded explicitly in the table entry for N, but the start offset is defined implicitly:

If N occupies the ith entry in the namespace table for i > 0, then the start offset for N is defined as the end offset of the i − 1th entry.
Even if the i − 1th entry would otherwise be ignored (due to a duplicate namespace ID or any other reason), that entry’s end offset still defines the start offset of N. This rule guarantees that both start and end offsets for any namespace N can be read from a constant-size byte range in the namespace table, and it eliminates the need to traverse an unbounded number of previous entries of the namespace table looking for a previous non-ignored entry.
The start offset of the 0th entry in the table is defined to be 0.

The start and end offsets as defined by the namespace table, nsStart and nsEnd, could be anything. As such, the actual start and end offsets, start and end, are defined so as to ensure that the byte range is well-defined and in-bounds for the raw payload:

end = min(nsEnd,len(rawPayloadBytes))
start = min(nsStart,end)

Note that this dependece on the length of the raw payload means the namespace table cannot be directly interpreted on its own. To make sense of the namespace table, you need the raw payload, or at least its length. Thus, these two objects are tightly coupled parts of the same overall object: a block payload.

It is possible that a maliciously encoded namespace table could leave bytes at the end of the raw payload which are not included in any namespace. This is not a problem, but an honestly encoded namespace table will not do this; the end offset of the final namespace will equal the byte length of the raw payload.

It is also possible that a maliciously encoded namespace table could define two distinct namespaces whose byte ranges overlap. This is okay; if a malicious proposer wants to define their block this way, they can. But in an honestly encoded namespace table, all namespaces will be consecutive and disjoint.

# Raw Payload

The raw payload is a simple concatenation of namespace payloads. The boundaries of these namespace payloads within the raw payload are defined externally by the Namespace Table, so no extra top-level metadata is needed.

Each individual namespace payload is encoded as two concatenated byte sequences:

1.: A transaction table, which defines the byte offets of individual transactions within the namespace payload.
2.: A sequence of transaction payloads

Transaction Table A transaction table is essentialy a variable length list of transaction table entries, where each entry is the offset in bytes of the end of that transaction in the transaction payloads section of the namespace payload.


Field	Size (bytes)


Number of entries	4

Transaction 0 offset	4

Transaction 1 offset	4

...	...

Table 37.2: Transaction table layout

The first field indicates the number n of entries in the table as a little-endian unsigned integer. If the entire namespace payload byte length is smaller than 4 bytes then the missing bytes are zero-padded.

The remaining fields encode table entries. Each entry consumes exactly 4 bytes. The number n could be anything, including a number much larger than the number of entries that could fit in the namespace payload. As such, the actual number of entries in the table is defined as the minimum of n and the maximum number of whole entries that could fit in the namespace payload, which is

len(namespacePayloadBytes)−-4 ⌊ 4 ⌋

Transaction Payloads The transaction payloads consist of any bytes in the namespace payload beyond the transaction table. In order to extract the payload bytes of a single transaction T, one needs both the start and end offsets for T. The end offset can be read directly from the transaction table. The start offset is defined implicitly:

If T occupies the ith entry in the transaction table for i > 0, then the start offset for T is defined as the end offset of the i − 1th entry in the table.
The start offset of the 0th entry in the table is defined to be 0.

Thus, both start and end offsets for any transaction T can be read from a contiguous, constant-size byte range in the transaction table. This property facilitates transaction proofs.

The start and end offsets as defined by the transaction table, txStart and txEnd, could be anything. As such, the actual start and end offsets, start and end, are defined so as to ensure that the byte range is well-defined and in-bounds for the transaction payloads section:

end = min(txEnd,len(nsPayloadBytes) − len(txTableBytes))
start = min(txStart,end)

To get the byte range for T relative to the namespace payload, the range start..end is translated by the byte length of the transaction table as declared in the table itself, suitably truncated to fit within the namespace payload. If the transaction table declares a huge number of entries that cannot fit into the namespace payload then all transactions in this namespace have a zero-length byte range whose start and end offsets are both equal to the byte length of the namespace payload.

It is possible that a maliciously encoded transaction table could leave bytes at the end of the namespace payload which are not included in any transaction. This is not a problem, but an honestly encoded transaction table will not do this; the end offset of the final transaction will equal the byte length of the transaction payloads section.

It is also possible that a maliciously encoded transaction table could define two distinct transactions whose byte ranges overlap. This is okay; if a malicious proposer wants to define their block this way, they can. But in an honestly encoded transaction table, all transactions will be consecutive and disjoint.

[next] [prev] [prev-tail] [front] [up]

Chapter 37Data Availability