
Commit 25595fc

fix(doc): remove repetitions and copy-paste leftover (#16)
* fix(doc): remove repetitions and copy-paste leftover
* fix(doc): move warnings above
1 parent 566831a commit 25595fc

File tree

1 file changed: 8 additions, 9 deletions


README.md

Lines changed: 8 additions & 9 deletions
@@ -790,17 +790,18 @@ The program then writes that one record into a local Parquet file, does a second
 
 ### Bonus: download a full crawl index and query with DuckDB
 
-If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
-All of these scripts run the same SQL query and should return the same record (written as a parquet file).
+In case you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly.
+
+> [!IMPORTANT]
+> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
+
+To download the crawl index, there are two options: if you have access to the CCF AWS buckets, run:
 
 ```shell
 mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
 aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
 ```
 
-> [!IMPORTANT]
-> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
-
 If, by any other chance, you don't have access through the AWS CLI:
 
 ```shell
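The `mkdir -p` plus `aws s3 sync` pair in the hunk above depends on the index's Hive-style partition layout, in which directory names are `key=value` pairs that readers such as DuckDB can expose as partition columns. A minimal local sketch of that layout (no download involved; the directory name matches the README's download step):

```shell
# Recreate the same partition path the README's download step uses;
# each path component is a key=value pair (the "Hive-style" layout).
mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'

# The glob shows the two partition levels: crawl, then subset.
ls -d crawl=*/subset=*
```

Parquet files synced into this directory then carry `crawl` and `subset` as implicit columns when scanned with partition-aware readers.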
@@ -822,7 +823,7 @@ rm cc-index-table.paths
 cd -
 ```
 
-The structure should be something like this:
+In both ways, the file structure should be something like this:
 ```shell
 tree my_data
 my_data
@@ -835,10 +836,8 @@ my_data
 
 Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
 
-> [!IMPORTANT]
-> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
+Both `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a parquet file).
 
-All of these scripts run the same SQL query and should return the same record (written as a parquet file).
 
 ## Bonus 2: combine some steps
 
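The diff refers to "the same SQL query" without showing it; the actual statement lives in the repository's Makefile targets. Purely as an illustration of the kind of query DuckDB runs over a Hive-partitioned local copy of the index (the `WHERE` filter and the output filename here are our assumptions, not the project's real query):

```sql
-- Hypothetical sketch, not the Makefile's actual query: scan the local
-- Parquet shards, pick one matching record, write it back out as Parquet.
COPY (
    SELECT *
    FROM read_parquet('crawl=CC-MAIN-2024-22/subset=warc/*.parquet',
                      hive_partitioning = true)
    WHERE url_host_name = 'example.com'
    LIMIT 1
) TO 'result.parquet' (FORMAT parquet);
```

With `hive_partitioning = true`, DuckDB derives the `crawl` and `subset` columns from the directory names rather than from the Parquet files themselves.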

0 commit comments
