Commit 358813f

fix: correct instructions for downloading from S3 and http
1 parent 5cfc5e9 commit 358813f

1 file changed, 5 additions and 3 deletions

README.md
@@ -794,7 +794,8 @@ If you want to run many of these queries, and you have a lot of disk space, you'
 All of these scripts run the same SQL query and should return the same record (written as a parquet file).
 
 ```shell
-aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ .'
+mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
+aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
 ```
 
 > [!IMPORTANT]
@@ -805,8 +806,8 @@ aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/s
 If, by any other chance, you don't have access through the AWS CLI:
 
 ```shell
-mkdir -p cc-main-2024-22
-cd cc-main-2024-22
+mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
+cd 'crawl=CC-MAIN-2024-22/subset=warc'
 
 wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
 gunzip cc-index-table.paths.gz
@@ -819,6 +820,7 @@ grep 'subset=warc' cc-index-table.paths | \
 wget -O "$2" "$1"
 ' _
 
+rm cc-index-table.paths
 cd -
 ```
 
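The HTTP fallback in the second hunk depends on turning each line of `cc-index-table.paths` into a download URL and a local file name. The middle of that pipeline is not shown in this diff, so the following is only a minimal sketch of that mapping. It assumes each line is a path relative to `https://data.commoncrawl.org/` and that each file is saved under its base name. The helper names `path_to_url` and `path_to_local_name` and the sample path are hypothetical and do not come from the README.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the README being changed.

# Assumption: paths listed in cc-index-table.paths are relative to this prefix.
BASE_URL='https://data.commoncrawl.org'

# Build the download URL for one path line.
path_to_url() {
    printf '%s/%s\n' "$BASE_URL" "$1"
}

# Derive the local file name by stripping the directory part of the path.
path_to_local_name() {
    basename "$1"
}

# Made-up sample entry; real entries come from cc-index-table.paths.
sample='cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/part-00000-example.gz.parquet'

path_to_url "$sample"
path_to_local_name "$sample"
```

Both helpers are plain string operations, so they can be checked without network access before being wired into a `wget -O <local name> <url>` call like the one visible in the third hunk.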