Commit 68ba160

committed
doc: add example of data structure
1 parent 4d5b1c5 commit 68ba160

1 file changed

+12
-3
lines changed

README.md

Lines changed: 12 additions & 3 deletions
@@ -801,8 +801,6 @@ aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/s
801801
> [!IMPORTANT]
802802
> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
803803
804-
805-
806804
If you don't have access through the AWS CLI:
807805

808806
```shell
@@ -824,7 +822,18 @@ rm cc-index-table.paths
824822
cd -
825823
```
826824

827-
then you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
825+
The downloaded directory structure should look something like this:
826+
```shell
827+
tree my_data
828+
my_data
829+
└── crawl=CC-MAIN-2024-22
830+
    └── subset=warc
831+
        ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
832+
        ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
833+
        ├── part-00002-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
834+
```
835+
836+
Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
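
Before running the make target, it can help to confirm that the local copy follows the `crawl=<ID>/subset=warc` layout. The following is a minimal sketch, not part of the repository: the helper name `count_index_files` is hypothetical, and it assumes that `LOCAL_DIR` points at the directory that directly contains the `crawl=...` folder (`my_data` in the `tree` example). Adjust the path if your copy is rooted elsewhere.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the repository): count the Parquet part
# files found under <dir>/crawl=<crawl-id>/subset=warc.
count_index_files() {
  local dir="$1" crawl="$2"
  # find prints one line per matching file; a missing directory yields 0.
  find "$dir/crawl=$crawl/subset=warc" -maxdepth 1 -type f \
    -name 'part-*.parquet' 2>/dev/null | wc -l | tr -d ' '
}

# Assumption: LOCAL_DIR is the directory that contains crawl=CC-MAIN-2024-22.
LOCAL_DIR="${LOCAL_DIR:-my_data}"
n=$(count_index_files "$LOCAL_DIR" "CC-MAIN-2024-22")

if [ "$n" -gt 0 ]; then
  echo "found $n parquet files"
else
  echo "no parquet files under $LOCAL_DIR/crawl=CC-MAIN-2024-22/subset=warc" >&2
fi
```

If the count is zero, the download step most likely wrote the files to a different location, or the directory names do not match the expected partition layout.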
828837

829838
> [!IMPORTANT]
830839
> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
