# x-crawl

English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/document/cn.md)

XCrawl is a multifunctional Node.js crawler library. It can crawl HTML, JSON, file resources, and more through simple configuration.

## Highlights

- Call the API to crawl HTML, JSON, file resources, etc.
- Batch requests can be sent asynchronously or synchronously

## Install

Take npm as an example:

```shell
npm install x-crawl
```

## Example

Get the title of https://docs.github.com/zh/get-started as an example:

```js
// Import module ES/CJS
import XCrawl from 'x-crawl'

// Create a crawler instance
const docsXCrawl = new XCrawl({
  baseUrl: 'https://docs.github.com',
  timeout: 10000,
  intervalTime: { max: 2000, min: 1000 }
})

// Call the fetchHTML API to crawl
docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})
```

## Core concepts

### XCrawl

Create a crawler instance via new XCrawl.

#### Type

```ts
class XCrawl {
  private readonly baseConfig
  constructor(baseConfig?: IXCrawlBaseConifg)
  fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
  fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
  fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
}
```

#### <div id="myXCrawl">Example</div>

myXCrawl is the crawler instance used in the following examples.

```js
const myXCrawl = new XCrawl({
  baseUrl: 'https://xxx.com',
  timeout: 10000,
  // The interval between requests; only takes effect for multiple requests
  intervalTime: {
    max: 2000,
    min: 1000
  }
})
```
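
The `{ max, min }` form of `intervalTime` above implies a randomized wait between requests. As a rough plain-JS sketch of that idea (illustrative only, not x-crawl's actual implementation):

```js
// Illustrative helper: pick a wait time from an IIntervalTime-style value.
// A plain number means a fixed interval; an object means a value in [min, max].
function pickInterval(intervalTime) {
  if (typeof intervalTime === 'number') return intervalTime
  const { max, min = 0 } = intervalTime
  return min + Math.floor(Math.random() * (max - min + 1))
}

console.log(pickInterval(800)) // always 800
console.log(pickInterval({ max: 2000, min: 1000 })) // some value between 1000 and 2000
```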

#### About the mode

The mode option defaults to async.

- async: in batch requests, the next request is sent without waiting for the current request to complete
- sync: in batch requests, the next request is sent only after the current request completes

If an interval time is set, the crawler waits for that interval to elapse before sending the next request.
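
The difference between the two modes can be sketched in plain JS (`sendRequest` below is a hypothetical stand-in for a single HTTP request, not part of x-crawl):

```js
// Hypothetical stand-in for a single HTTP request
const sendRequest = (url) => new Promise((resolve) => setTimeout(() => resolve(url), 10))

// async mode: dispatch every request immediately, collect results as they settle
function batchAsync(urls) {
  return Promise.all(urls.map((url) => sendRequest(url)))
}

// sync mode: start each request only after the previous one has completed
async function batchSync(urls) {
  const results = []
  for (const url of urls) {
    results.push(await sendRequest(url))
  }
  return results
}
```

Async mode finishes faster, while sync mode guarantees at most one request in flight at a time.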

### fetchHTML

fetchHTML is a method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, usually used to crawl HTML.

#### Type

```ts
function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
```

#### Example

```js
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})
```

### fetchData

fetchData is a method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, usually used to crawl APIs to obtain JSON data and so on.

#### Type

```ts
function fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
```

#### Example

```js
const requestConifg = [
  { url: '/xxxx', method: 'GET' },
  { url: '/xxxx', method: 'GET' },
  { url: '/xxxx', method: 'GET' }
]

myXCrawl.fetchData({
  requestConifg, // request config, can be IRequestConfig | IRequestConfig[]
  intervalTime: 800 // interval between requests; only takes effect for multiple requests
}).then(res => {
  console.log(res)
})
```

### fetchFile

fetchFile is a method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, usually used to crawl files, such as pictures, PDF files, etc.

#### Type

```ts
function fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
```

#### Example

```js
const requestConifg = [
  { url: '/xxxx' },
  { url: '/xxxx' },
  { url: '/xxxx' }
]

myXCrawl.fetchFile({
  requestConifg,
  fileConfig: {
    storeDir: path.resolve(__dirname, './upload') // folder to store the files
  }
}).then(fileInfos => {
  console.log(fileInfos)
})
```

## Types

- IAnyObject

```ts
interface IAnyObject extends Object {
  [key: string | number | symbol]: any
}
```

- IMethod

```ts
type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
```

- IRequestConfig

```ts
interface IRequestConfig {
  url: string
  method?: IMethod
  headers?: IAnyObject
  params?: IAnyObject
  data?: any
  timeout?: number
}
```

- IIntervalTime

```ts
type IIntervalTime = number | {
  max: number
  min?: number
}
```

- IFetchBaseConifg

```ts
interface IFetchBaseConifg {
  requestConifg: IRequestConfig | IRequestConfig[]
  intervalTime?: IIntervalTime
}
```

- IFetchCommon

```ts
type IFetchCommon<T> = {
  id: number
  statusCode: number | undefined
  headers: IncomingHttpHeaders // node:http type
  data: T
}[]
```

- IFileInfo

```ts
interface IFileInfo {
  fileName: string
  mimeType: string
  size: number
  filePath: string
}
```

- IXCrawlBaseConifg

```ts
interface IXCrawlBaseConifg {
  baseUrl?: string
  timeout?: number
  intervalTime?: IIntervalTime
  mode?: 'async' | 'sync' // default: 'async'
}
```
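
As a usage sketch, a base config that opts into sync mode might look like this (the URL and numbers are placeholders):

```js
// Placeholder values; the object matches the IXCrawlBaseConifg shape above
const baseConfig = {
  baseUrl: 'https://xxx.com',
  timeout: 10000,
  intervalTime: { max: 2000, min: 1000 },
  mode: 'sync' // wait for each request to finish before sending the next
}
```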

- IFetchHTMLConfig

```ts
interface IFetchHTMLConfig extends IRequestConfig {}
```

- IFetchDataConfig

```ts
interface IFetchDataConfig extends IFetchBaseConifg {
}
```

- IFetchFileConfig

```ts
interface IFetchFileConfig extends IFetchBaseConifg {
  fileConfig: {
    storeDir: string
  }
}
```

## More

If you have any **questions** or **needs**, please submit **Issues** at https://github.com/coder-hxl/x-crawl/issues.