Description
The WARC-Record-ID has been used in several datasets derived from Common Crawl data as the ID field / column to reference records and establish provenance. See for example the FineWeb or GneissWeb datasets. To establish the provenance link between the source and a derived dataset, a relation table is currently required for join operations. This table must include both the WARC-Record-ID and the URL plus capture timestamp. See "Announcing GneissWeb Annotations" for further information.
Adding the WARC-Record-ID directly to the columnar index would allow for faster joins without the need for the relation table.
Because the estimated size of the record ID column is large, exhaustive testing of the variant implementations is required.
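To make the difference between the two join paths concrete, here is a minimal sketch using an in-memory SQLite database. The table names (`derived`, `relation`, `ccindex`), the reduced column set, and the sample row are assumptions for illustration only; the real columnar index has many more columns and would be queried with Athena, Spark, etc.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical, heavily reduced schemas for illustration.
CREATE TABLE derived  (warc_record_id TEXT, text TEXT);
CREATE TABLE relation (warc_record_id TEXT, url TEXT, fetch_time TEXT);
CREATE TABLE ccindex  (url TEXT, fetch_time TEXT, warc_filename TEXT,
                       warc_record_offset INTEGER, warc_record_id TEXT);
""")

# Made-up sample record, not taken from any real crawl.
rid = "<urn:uuid:0f4b3c9e-8a57-4d2b-9c1e-3f6a7b8c9d0e>"
url, ts = "https://example.com/", "2024-01-01T00:00:00Z"
con.execute("INSERT INTO derived VALUES (?, ?)", (rid, "example text"))
con.execute("INSERT INTO relation VALUES (?, ?, ?)", (rid, url, ts))
con.execute("INSERT INTO ccindex VALUES (?, ?, ?, ?, ?)",
            (url, ts, "example.warc.gz", 1234, rid))

# Today: two joins, going through the relation table via URL + timestamp.
via_relation = con.execute("""
    SELECT d.warc_record_id, i.warc_filename, i.warc_record_offset
    FROM derived d
    JOIN relation r ON r.warc_record_id = d.warc_record_id
    JOIN ccindex  i ON i.url = r.url AND i.fetch_time = r.fetch_time
""").fetchall()

# Proposed: a single join on the new warc_record_id column.
direct = con.execute("""
    SELECT d.warc_record_id, i.warc_filename, i.warc_record_offset
    FROM derived d
    JOIN ccindex i ON i.warc_record_id = d.warc_record_id
""").fetchall()
```

Both queries return the same rows; the second one no longer needs the relation table.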
See also:
- Proposed column name: `warc_record_id` (analogous to `warc_record_offset` etc.)
- Decide on the representation:
  - Include the surrounding `<urn:uuid:...>`
  - Strip the surrounding angle brackets `<...>`. Note that URL indexes may strip `<>`, e.g., for the WARC-Target-URI.
  - Only keep the bare UUID (as whatever data type)
- Decide on the data type to store the WARC-Record-ID
  - The ID is a UUID, i.e. a 128-bit integer
  - Parquet does not have a 128-bit integer data type, so the options are:
    - FIXED_LEN_BYTE_ARRAY (used for the UUID logical type)
    - arbitrary length BYTE_ARRAY
  - Length of the value (depending on the representation):
    - 16 bytes to hold just the 128-bit integer in big-endian encoding
    - 32 bytes of hex digits
    - 36 bytes of hex digits including the four hyphens used for grouping
    - 47 bytes including `<urn:uuid:...>`
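The four candidate lengths listed above can be checked with a few lines of Python and the standard `uuid` module. This is only a sketch: the helper name `representations` and the sample ID are made up for illustration.

```python
import uuid

def representations(record_id: str) -> dict[str, bytes]:
    """Return the candidate encodings of a WARC-Record-ID like '<urn:uuid:...>'."""
    bare = record_id.strip("<>").removeprefix("urn:uuid:")
    u = uuid.UUID(bare)
    return {
        "binary": u.bytes,                          # 128-bit integer, big-endian
        "hex": u.hex.encode("ascii"),               # hex digits only
        "hyphenated": str(u).encode("ascii"),       # hex digits plus four hyphens
        "urn": f"<urn:uuid:{u}>".encode("ascii"),   # full WARC-Record-ID form
    }

# Made-up sample ID, not taken from any real crawl.
example = "<urn:uuid:0f4b3c9e-8a57-4d2b-9c1e-3f6a7b8c9d0e>"
lengths = {name: len(value) for name, value in representations(example).items()}
print(lengths)
```

The 16-byte form round-trips via `uuid.UUID(bytes=...)`, so the other representations can be rebuilt at query time if needed.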
- Evaluate compression given the representation and data type:
  - Compression will hopefully bring the variant representations and data types down to a similar size.
  - Common Crawl's WARC writer uses a type 4 (pseudo-randomly generated) UUID.
  - Entropy is high. Only the 6 bits representing the UUID version and variant are fixed and can be saved by compression.
  - So, the lower bound for the compressed size is 15-16 bytes per UUID.
  - The GneissWeb annotations use the full representation (`<urn:uuid:...>`) as a variable-length array and spend 22.5 bytes on average to hold a UUID.
  - Using a condensed representation and data type may reduce the storage footprint of the new column non-trivially. Because larger values mean fewer rows per row group, this also affects the storage and query performance of other columns.
  - Estimated footprint for an index of 3 billion records: roughly 60 GiB for a 22.5-byte representation and roughly 40 GiB for a 15.5-byte one.
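A rough way to get a feeling for the compression behaviour before touching Parquet is to compress a batch of random type 4 UUIDs in each representation. The sketch below uses `zlib` on a plain concatenation of values, which is an assumption for illustration: Parquet page encodings, length prefixes for BYTE_ARRAY, and codecs such as zstd or snappy will give different numbers. The helper names and the sample size are made up.

```python
import random
import uuid
import zlib

def random_uuid4(rng: random.Random) -> uuid.UUID:
    # Seeded stand-in for uuid.uuid4() so that runs are repeatable.
    return uuid.UUID(int=rng.getrandbits(128), version=4)

def compressed_bytes_per_id(encode, n: int = 20_000, seed: int = 0) -> float:
    """Compress n concatenated IDs and return the average compressed size per ID."""
    rng = random.Random(seed)
    blob = b"".join(encode(random_uuid4(rng)) for _ in range(n))
    return len(zlib.compress(blob, 9)) / n

encoders = {
    "binary (16)": lambda u: u.bytes,
    "hex (32)": lambda u: u.hex.encode("ascii"),
    "hyphenated (36)": lambda u: str(u).encode("ascii"),
    "urn (47)": lambda u: f"<urn:uuid:{u}>".encode("ascii"),
}
per_id = {name: compressed_bytes_per_id(enc) for name, enc in encoders.items()}

def footprint_gib(records: int, bytes_per_id: float) -> float:
    """Column footprint in GiB for a given number of records."""
    return records * bytes_per_id / 2**30

for name, size in per_id.items():
    gib = footprint_gib(3_000_000_000, size)
    print(f"{name:16s} {size:6.2f} bytes/ID  {gib:6.1f} GiB for 3 billion records")
```

For reference, `footprint_gib(3_000_000_000, 22.5)` is about 62.9 GiB and `footprint_gib(3_000_000_000, 15.5)` about 43.3 GiB.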
- Consider using type 7 UUID as `WARC-Record-ID`
- Exhaustively test querying and processing using Athena, Presto, Trino, Spark, DuckDB, Hive, etc.
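Regarding the type 7 option: a type 7 UUID puts a 48-bit Unix timestamp in milliseconds into the leading bytes, followed by the version, 12 random bits, the variant, and 62 more random bits (RFC 9562). The sketch below builds one by hand with the standard library to show the layout; `uuid7_sketch` is a hypothetical helper, not what the WARC writer does today.

```python
import os
import time
import uuid
from typing import Optional

def uuid7_sketch(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a type 7 UUID: 48-bit ms timestamp, version, variant, 74 random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits, 74 are used
    rand_a = rand >> 68                            # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                # low 62 bits
    value = (unix_ms & ((1 << 48) - 1)) << 80      # bits 127..80: timestamp
    value |= 0x7 << 76                             # bits 79..76: version 7
    value |= rand_a << 64                          # bits 75..64: rand_a
    value |= 0b10 << 62                            # bits 63..62: variant
    value |= rand_b                                # bits 61..0: rand_b
    return uuid.UUID(int=value)

a = uuid7_sketch(1_700_000_000_000)
b = uuid7_sketch(1_700_000_000_001)
print(a, b, a.bytes < b.bytes)
```

Because records in the same WARC file are captured close together in time, their IDs would share leading bytes. Whether this actually improves compression or row-group statistics in the columnar index would need to be tested.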