Commit e46fdd1

fix: path to the local paths file
1 parent 04505f7 commit e46fdd1

2 files changed: 3 additions and 2 deletions

README.md

Lines changed: 2 additions & 1 deletion
@@ -796,7 +796,8 @@ If you want to run many of these queries, and you have a lot of disk space, you'
 aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ .'
 ```
 
-(**Bonus bonus:** If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```.)
+> [!IMPORTANT]
+> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
 
 All of these scripts run the same SQL query and should return the same record (written as a parquet file).
 
src/main/java/org/commoncrawl/whirlwind/Duck.java

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ public static List<String> getFiles(Algorithm algo, String crawl) throws IOExcep
         case CLOUDFRONT: {
             String externalPrefix = String
                     .format("https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=%s/subset=warc/", crawl);
-            String pathsFile = crawl + ".warc.paths.gz";
+            String pathsFile = Paths.get("data", crawl + ".warc.paths.gz").toString();
 
             List<String> files = new ArrayList<>();
             try (GZIPInputStream gzis = new GZIPInputStream(new FileInputStream(pathsFile));
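
The change above moves the lookup of the paths file from the working directory to a `data/` subdirectory. As a rough, standalone sketch of that idea (not the repository's actual code): the class name `PathsFileSketch` and the helpers `localPathsFile` and `readPaths` are hypothetical, and the sketch assumes the gzipped file holds one entry per line. How `Duck.java` combines each entry with `externalPrefix` is not visible in this hunk, so that step is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical illustration; names below are not taken from the repository.
class PathsFileSketch {

    // Mirrors the changed line: resolve the paths file under "data/"
    // instead of the current working directory.
    public static Path localPathsFile(String crawl) {
        return Paths.get("data", crawl + ".warc.paths.gz");
    }

    // Assumption: the gzipped file lists one entry per line.
    // Blank lines are skipped; surrounding whitespace is trimmed.
    public static List<String> readPaths(Path pathsFile) throws IOException {
        List<String> entries = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(pathsFile)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    entries.add(trimmed);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        String crawl = args.length > 0 ? args[0] : "CC-MAIN-2024-22";
        Path pathsFile = localPathsFile(crawl);
        if (!Files.exists(pathsFile)) {
            // Fail with a clear message rather than a bare FileNotFoundException.
            System.err.println("Paths file not found: " + pathsFile.toAbsolutePath());
            return;
        }
        System.out.println(readPaths(pathsFile).size() + " entries in " + pathsFile);
    }
}
```

Building the path with `Paths.get` rather than string concatenation lets the JDK pick the platform's separator, which is presumably why the commit uses it.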
