
Commit 34c6c87

Task 6 (#12)
Implement task 6 of the whirlwind Java tour
1 parent 4c97de4 commit 34c6c87

6 files changed (+430, -42 lines)

.editorconfig

Lines changed: 4 additions & 5 deletions
```diff
@@ -6,11 +6,10 @@ root = true
 end_of_line = lf
 insert_final_newline = true
 
-# LF: not sure about this
-# [*.java]
-# charset = utf-8
-# indent_style = space
-# indent_size = 4
+[*.java]
+charset = utf-8
+indent_style = space
+indent_size = 4
 
 [Makefile]
 indent_style = tab
```

Makefile

Lines changed: 27 additions & 32 deletions
```diff
@@ -1,57 +1,52 @@
 build:
 	mvn clean package
 
-cdxj: build ensure_jwarc
+cdxj: build jwarc.jar
 	@echo "creating *.cdxj index files from the local warcs"
 	java -jar jwarc.jar cdxj data/whirlwind.warc.gz > data/whirlwind.warc.cdxj
 	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wet.gz --records conversion" > data/whirlwind.warc.wet.cdxj
 	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wat.gz --records metadata" > data/whirlwind.warc.wat.cdxj
 
-extract:
+extract: jwarc.jar
 	@echo "creating extraction.* from local warcs, the offset numbers are from the cdxj index"
 	java -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 > extraction.html
 	java -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 > extraction.txt
 	java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > extraction.json
 	@echo "hint: python -m json.tool extraction.json"
 
-# cdx_toolkit:
-# 	@echo demonstrate that we have this entry in the index
-# 	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
-# 	@echo
-# 	@echo cleanup previous work
-# 	rm -f TEST-000000.extracted.warc.gz
-# 	@echo retrieve the content from the commoncrawl s3 bucket
-# 	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
-# 	@echo
-# 	@echo index this new warc
-# 	cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
-# 	cat TEST-000000.extracted.warc.cdxj
-# 	@echo
-# 	@echo iterate this new warc
-# 	python ./warcio-iterator.py TEST-000000.extracted.warc.gz
-# 	@echo
-#
+cdx_toolkit: jwarc.jar
+	@echo demonstrate that we have this entry in the index
+	curl 'https://index.commoncrawl.org/CC-MAIN-2024-22-index?url=an.wikipedia.org/wiki/Escopete&output=json&from=20240518015810&to=20240518015810'
+	@echo
+	@echo cleanup previous work
+	rm -f TEST-000000.extracted.warc.gz
+	@echo retrieve the content from the commoncrawl data server
+	curl --request GET --url 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz' --header 'Range: bytes=80610731-80628153' > TEST-000000.extracted.warc.gz
+	@echo
+	@echo index this new warc
+	java -jar jwarc.jar cdxj TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
+	cat TEST-000000.extracted.warc.cdxj
+	@echo
+	@echo iterate this new warc
+	java -jar jwarc.jar ls TEST-000000.extracted.warc.gz
+	@echo
 
 download_collinfo:
 	@echo "downloading collinfo.json so we can find out the crawl name"
 	curl -o data/collinfo.json https://index.commoncrawl.org/collinfo.json
 
 CC-MAIN-2024-22.warc.paths.gz:
 	@echo "downloading the list from s3, requires s3 auth even though it is free"
 	@echo "note that this file should be in the repo"
-	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > data/CC-MAIN-2024-22.warc.paths.gz
+	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > data/CC-MAIN-2024-22.warc.paths.gz
+
+duck_ccf_local_files: build
+	@echo "warning! only works on Common Crawl Foundation's development machine"
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="ccf_local_files"
 
-# duck_local_files:
-# 	@echo "warning! 300 gigabyte download"
-# 	python duck.py local_files
-#
-# duck_ccf_local_files:
-# 	@echo "warning! only works on Common Crawl Foundadtion's development machine"
-# 	python duck.py ccf_local_files
-#
-# duck_cloudfront:
-# 	@echo "warning! this might take 1-10 minutes"
-# 	python duck.py cloudfront
-#
+duck_cloudfront: build
+	@echo "warning! this might take 1-10 minutes"
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="cloudfront"
 
 jwarc.jar:
 	@echo "downloading JWarc JAR"
```

README.md

Lines changed: 71 additions & 4 deletions
[…]

Make sure you compress WARCs the right way!

## Task 6: Query the full CDX index and download those captures from AWS S3

Some of our users only want to download a small subset of the crawl. They want to run queries against an index: either the CDX index we just talked about, or the columnar index, which we'll talk about later.

The CDX server API is documented [here](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference) and can be accessed through an HTTP API.

Right now there is no Java-specific tool for querying the CDX index; nevertheless, we do have a very useful Python tool for working with it: [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit). Please refer to the [Python Whirlwind Tour](https://github.com/commoncrawl/whirlwind-python) for more details.

In this task we will achieve the same results using direct HTTP API calls and JWARC.

Run

```make cdx_toolkit```

The output looks like this:

<details>
<summary>Click to view output</summary>

```
demonstrate that we have this entry in the index
curl https://index.commoncrawl.org/CC-MAIN-2024-22-index?url=an.wikipedia.org/wiki/Escopete&output=json&from=20240518015810&to=20240518015810

{"urlkey": "org,wikipedia,an)/wiki/escopete", "timestamp": "20240518015810", "url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17423", "offset": "80610731", "filename": "crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz", "languages": "spa", "encoding": "UTF-8"}

cleanup previous work
rm -f TEST-000000.extracted.warc.gz
retrieve the content from the commoncrawl s3 bucket (offset: 80628153 = 80610731 + 17423 - 1)
curl --request GET \
    --url https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz \
    --header 'Range: bytes=80610731-80628153' > TEST-000000.extracted.warc.gz

index this new warc
java -jar jwarc.jar cdxj TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
cat TEST-000000.extracted.warc.cdxj
org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "406", "filename": "TEST-000000.extracted.warc.gz"}

iterate this new warc
java -jar jwarc.jar ls TEST-000000.extracted.warc.gz
0 response 200 https://an.wikipedia.org/wiki/Escopete
```

</details>
There's a lot going on here, so let's unpack it a little.

#### Check that the crawl has a record for the page we are interested in

We check for capture results by querying index.commoncrawl.org with GET parameters, specifying the crawl (`CC-MAIN-2024-22-index`), the exact URL `an.wikipedia.org/wiki/Escopete`, and the timestamp range `from=20240518015810` and `to=20240518015810`.
The result tells us that the crawl successfully fetched this page at timestamp `20240518015810`.
* Captures are named by the SURT key and the timestamp.

[//]: # (* If you need to search across all crawls, instead of `--crawl CC-MAIN-2024-22`, you could pass `--cc` to search across all crawls.)
[//]: # (Here I'm tempted to mention that you should use the columnar index for this kind of operation, however cdx_toolkit iterates over all crawls when called with --cc, if I'm not wrong)
* You can use the parameter `limit=<N>` to limit the number of results returned; in this case, because we have restricted the timestamp range to a single value, we only expect one result.
* URLs may be specified with wildcards to return even more results: `"an.wikipedia.org/wiki/Escop*"` matches `an.wikipedia.org/wiki/Escopulión` and `an.wikipedia.org/wiki/Escopete`.
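The same index query can be built and sent straight from Java with the JDK's built-in `java.net.http` client. The sketch below is illustrative and not part of this commit: `CdxQuery`, `queryUri`, and `surtKey` are hypothetical names, and `surtKey` is a deliberately simplified version of SURT canonicalization (real SURT handling also covers ports, user info, query strings, and so on).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CdxQuery {
    // Build the CDX API query URL for a single capture, as in the Makefile.
    static URI queryUri(String crawl, String url, String timestamp) {
        return URI.create("https://index.commoncrawl.org/" + crawl + "-index"
                + "?url=" + url + "&output=json"
                + "&from=" + timestamp + "&to=" + timestamp);
    }

    // Simplified SURT-style key: reverse the dot-separated host, then
    // append ")" and the lower-cased path.
    static String surtKey(String hostAndPath) {
        int slash = hostAndPath.indexOf('/');
        String[] host = hostAndPath.substring(0, slash).split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = host.length - 1; i >= 0; i--) {
            key.append(host[i]).append(i > 0 ? "," : "");
        }
        return key.append(')')
                  .append(hostAndPath.substring(slash).toLowerCase())
                  .toString();
    }

    public static void main(String[] args) throws Exception {
        URI uri = queryUri("CC-MAIN-2024-22",
                "an.wikipedia.org/wiki/Escopete", "20240518015810");
        // Requires network access; each line of the response body is one
        // JSON capture record.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(uri).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Note that `surtKey("an.wikipedia.org/wiki/Escopete")` yields `org,wikipedia,an)/wiki/escopete`, matching the `urlkey` field in the output above.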
#### Retrieve the fetched content as WARC

Next, we make another HTTP call to retrieve the content and save it locally as a new WARC file, this time requesting it from `data.commoncrawl.org` using the WARC filename, offset, and length returned by the index query.
This creates the WARC file `TEST-000000.extracted.warc.gz`.

[//]: # (Here there is no warcinfo when getting from data.commoncrawl.org, right?)
[//]: # (which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested. )
* If you check the cURL command, you'll find that it uses the offset and length of the WARC record (as returned by the CDX index query) to make an HTTP byte-range request to `data.commoncrawl.org` that isolates and returns just the single record we want from the full file. It downloads only the response WARC record, because our CDX index has only the response records indexed.
* The limit, timestamp, and crawl index parameters, as well as URL wildcards, all affect which captures the index query returns, and therefore which records you end up downloading.
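The download step can be sketched in Java too: pull `offset`, `length`, and `filename` out of the CDX JSON line, compute the inclusive end of the byte range, and issue the ranged request. `RangeFetch` and `field` are names made up for this sketch, and the `field` helper is a naive string scan that assumes plain, unescaped JSON string values; a real program would use a JSON parser such as Gson, which this commit adds to the `pom.xml`.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class RangeFetch {
    // Naive extraction of one string field from a CDX JSON line.
    // Assumes the value is a plain string with no escaped quotes.
    static String field(String json, String key) {
        String marker = "\"" + key + "\": \"";
        int start = json.indexOf(marker) + marker.length();
        return json.substring(start, json.indexOf('"', start));
    }

    // The Range header's end offset is inclusive: offset + length - 1
    // (e.g. 80610731 + 17423 - 1 = 80628153, as in the Makefile).
    static String rangeHeader(long offset, long length) {
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    public static void main(String[] args) throws Exception {
        // One capture record, as returned by the CDX query (abbreviated).
        String cdxLine = "{\"length\": \"17423\", \"offset\": \"80610731\", "
                + "\"filename\": \"crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz\"}";
        long offset = Long.parseLong(field(cdxLine, "offset"));
        long length = Long.parseLong(field(cdxLine, "length"));
        String filename = field(cdxLine, "filename");

        // Requires network access; the ranged response body is itself a
        // valid gzipped WARC containing just the one record we asked for.
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://data.commoncrawl.org/" + filename))
                .header("Range", rangeHeader(offset, length))
                .build();
        HttpClient.newHttpClient().send(request,
                HttpResponse.BodyHandlers.ofFile(
                        Path.of("TEST-000000.extracted.warc.gz")));
    }
}
```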
#### Indexing the WARC and viewing its contents

Finally, we run `jwarc cdxj`, which processes the WARC to make a CDXJ index of it as in Task 3, and then we list the records using `jwarc ls` as in Task 2.
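If you'd rather peek at the downloaded record from Java instead of via `jwarc ls`: a WARC record's header block is plain text up to the first blank line inside the gzipped member, so the JDK's `GZIPInputStream` is enough for a quick look. This is only a sketch (`WarcHeaders` and `headers` are made-up names); a real reader should use the jwarc library, which also handles multi-record files and binary payloads.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class WarcHeaders {
    // Collect the first WARC record's header lines: text lines up to the
    // first blank line inside the gzipped stream.
    static List<String> headers(InputStream gzipped) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8));
        List<String> lines = new ArrayList<>();
        for (String line = reader.readLine(); line != null && !line.isEmpty();
                line = reader.readLine()) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Expects the file downloaded by the cdx_toolkit Makefile target.
        try (InputStream in = new FileInputStream("TEST-000000.extracted.warc.gz")) {
            headers(in).forEach(System.out::println);
        }
    }
}
```

On the file downloaded above, this should print lines such as `WARC/1.0` and `WARC-Type: response`.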
## Task 7: Find the right part of the columnar index

[…]

1. Use the DuckDb techniques from [Task 8](#task-8-query-using-the-columnar-index--duckdb-from-outside-aws) and the [Index Server](https://index.commoncrawl.org) to find a new webpage in the archives.
2. Note its URL, WARC filename, and timestamp.
3. Now open up the Makefile from [Task 6](#task-6-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions from the cdx_toolkit section.
4. Repeat the cdx_toolkit steps, but for the page and date range you found above.

## Congratulations!

You have completed the Whirlwind Tour of Common Crawl's Datasets using Java! You should now understand the different filetypes we have in our corpus and how to interact with Common Crawl's datasets using Java. To see what other people have done with our data, see the [Examples page](https://commoncrawl.org/examples) on our website. Why not join our Discord through the Community tab?

## Other datasets

data/CC-MAIN-2024-22.warc.paths.gz

817 Bytes (binary file not shown)

pom.xml

Lines changed: 10 additions & 1 deletion
```diff
@@ -26,7 +26,16 @@
         <artifactId>jwarc</artifactId>
         <version>0.33.0</version>
     </dependency>
-
+    <dependency>
+        <groupId>org.duckdb</groupId>
+        <artifactId>duckdb_jdbc</artifactId>
+        <version>1.1.3</version>
+    </dependency>
+    <dependency>
+        <groupId>com.google.code.gson</groupId>
+        <artifactId>gson</artifactId>
+        <version>2.11.0</version>
+    </dependency>
 </dependencies>
 
 <build>
```
