Commit 1cb1b2e

ikreymer and tw4l authored
Update Behaviors Docs (#820)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
1 parent f2dac05 commit 1cb1b2e

5 files changed

+246
-25
lines changed


docs/docs/user-guide/behaviors.md

Lines changed: 230 additions & 15 deletions
@@ -1,31 +1,63 @@
11
# Browser Behaviors
22

3-
Browsertrix Crawler supports automatically running customized in-browser behaviors. The behaviors auto-play videos (when possible), auto-fetch content that is not loaded by default, and also run custom behaviors on certain sites.
3+
Browsertrix Crawler supports automatically running customized behaviors on each page. Several types of behaviors are supported, including built-in, background, and site-specific behaviors. It is also possible to add fully user-defined custom behaviors that trigger specific actions on certain pages.
44

5-
To run behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors are enabled by default, the equivalent of `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as autoscroll, use `--behaviors autoscroll`.
5+
## Built-In Behaviors
66

7-
The site-specific behavior (or autoscroll) will start running after the page is finished its initial load (as defined by the `--waitUntil` settings). The behavior will then run until finished or until the behavior timeout is exceeded. This timeout can be set (in seconds) via the `--behaviorTimeout` flag (90 seconds by default). Setting the timeout to 0 will allow the behavior to run until it is finished.
7+
The built-in behaviors include the following background behaviors which run 'in the background' continually checking for changes:
8+
9+
- Autoplay: find and start playing (when possible) any video or audio on the page (and in each iframe).
10+
- Autofetch: find and start fetching any URLs that may not be fetched by default, such as other resolutions in `img` tags, `data-*`, lazy-loaded resources, etc.
11+
- Autoclick: select all tags (default: `a` tag, customizable via `--clickSelector`) that may be clickable and attempt to click them while avoiding navigation away from the page.
812

9-
See [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for more info on all of the currently available behaviors.
13+
There is also a built-in 'main' behavior, which runs to completion (or until a timeout is reached):
1014

11-
Browsertrix Crawler includes a `--pageExtraDelay`/`--delay` option, which can be used to have the crawler sleep for a configurable number of seconds after behaviors before moving on to the next page.
15+
- Autoscroll: Determine if a page might need scrolling, and scroll either up or down while new elements are being added. Continue until timeout is reached or scrolling is no longer possible.
1216

13-
To disable behaviors for a crawl, use `--behaviors ""`.
17+
## Site-Specific Behaviors
1418

15-
## Additional Custom Behaviors
19+
Browsertrix also comes with several 'site-specific' behaviors, which run only on specific sites. These behaviors will run instead of Autoscroll and will run until completion or timeout. Currently, site-specific behaviors include major social media sites.
1620

17-
Custom behaviors can be mounted into the crawler and ran from there, or downloaded from a URL.
21+
Refer to [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for the latest list of site-specific behaviors.
1822

19-
Each behavior should contain a single class that implements the behavior interface. See [the behaviors tutorial](https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md) for more info on how to write behaviors.
23+
User-defined custom behaviors are also considered site-specific.
24+
25+
## Enabling Behaviors
2026

21-
The first behavior which returns true for `isMatch()` will be run on a given page.
27+
To enable built-in behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors except Autoclick are enabled by default, the equivalent of `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as Autoscroll, use `--behaviors autoscroll`.
2228

23-
The repeatable `--customBehaviors` flag can accept:
29+
To only use Autoclick but not Autoscroll, use `--behaviors autoclick,autoplay,autofetch,siteSpecific`.
2430

25-
- A path to a directory of behavior files
26-
- A path to a single behavior file
27-
- A URL for a single behavior file to download
28-
- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located)
31+
The `siteSpecific` option enables all site-specific behaviors, but only one behavior can run per site. Each site-specific behavior specifies which site it should run on.
32+
33+
To disable all behaviors, use `--behaviors ""`.
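
As a rough illustration of the rules above, the following sketch models how a `--behaviors` value maps to a set of enabled behaviors. The `resolveBehaviors` helper is hypothetical and is not part of Browsertrix Crawler; only the default list and the empty-string case are taken from the text.

```javascript
// Hypothetical model of the --behaviors option; not the crawler's own code.
// The default list is the one stated in the docs (Autoclick is not included).
const DEFAULT_BEHAVIORS = ["autoscroll", "autoplay", "autofetch", "siteSpecific"];

function resolveBehaviors(value) {
  // Option not passed at all: the documented defaults apply.
  if (value === undefined) {
    return [...DEFAULT_BEHAVIORS];
  }
  // Split the comma-separated list; an empty string yields no behaviors.
  return value
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

console.log(resolveBehaviors());             // documented defaults
console.log(resolveBehaviors("autoscroll")); // a single behavior
console.log(resolveBehaviors(""));           // all behaviors disabled
```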
34+
35+
## Behavior and Page Timeouts
36+
37+
Browsertrix includes a number of timeouts, including before, during and after running behaviors.
38+
The timeouts are as follows:
39+
40+
- `--waitUntil`: the page load condition(s) to wait for *before* doing anything else.
41+
- `--postLoadDelay`: how long to wait *before* starting any behaviors, but after page has finished loading. A custom behavior can override this (see below).
42+
- `--behaviorTimeout`: maximum time to spend on running site-specific / Autoscroll behaviors (can be less if behavior finishes early).
43+
- `--pageExtraDelay`: how long to wait *after* finishing behaviors (or after `behaviorTimeout` has been reached) before moving on to next page.
44+
45+
A site-specific behavior (or Autoscroll) will start after the page has loaded (as determined by `--waitUntil`) and after a further `--postLoadDelay` seconds.
46+
47+
The behavior will then run until finished or at most until `--behaviorTimeout` is reached (90 seconds by default).
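
To make the ordering of these settings concrete, the sketch below adds up the longest time a page could take once it has loaded. The `maxPostLoadSeconds` helper is hypothetical; only the 90-second `behaviorTimeout` default comes from the text, and the zero defaults for the two delays are assumptions made for the example.

```javascript
// Hypothetical helper: upper bound (in seconds) on the time spent on a page
// after it has finished loading. Assumes all values are positive numbers.
function maxPostLoadSeconds({
  postLoadDelay = 0,    // assumed default, for illustration only
  behaviorTimeout = 90, // default stated in the docs
  pageExtraDelay = 0,   // assumed default, for illustration only
} = {}) {
  // Delay before behaviors + behavior budget + delay after behaviors.
  return postLoadDelay + behaviorTimeout + pageExtraDelay;
}

console.log(maxPostLoadSeconds());
console.log(maxPostLoadSeconds({ postLoadDelay: 5, behaviorTimeout: 30, pageExtraDelay: 2 }));
```

A behavior that finishes early shortens the middle term, so the result is a ceiling rather than an estimate.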
48+
49+
## Loading Custom Behaviors
50+
51+
Browsertrix Crawler also supports fully user-defined behaviors, which have all the capabilities of the built-in behaviors.
52+
53+
They can use a library of provided functions, and run on one or more pages in the crawl.
54+
55+
Custom behaviors are specified with the `--customBehaviors` flag, which can be repeated and can accept the following options.
56+
57+
- A path to a single behavior file. This can be mounted into the crawler as a volume.
58+
- A path to a directory of behavior files. This can be mounted into the crawler as a volume.
59+
- A URL for a single behavior file to download. This should be a URL that the crawler has access to.
60+
- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located). This should be a git repo the crawler has access to without additional auth.
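
The `git+https://` form can be read as a repository URL plus two optional query parameters. The sketch below shows one way to split such a value apart; `parseGitBehaviorSource` is a hypothetical helper written for illustration, and the crawler's own handling may differ.

```javascript
// Hypothetical parser for the git+https form described above.
function parseGitBehaviorSource(source) {
  if (!source.startsWith("git+")) {
    return null; // not a git source (could be a file path or a plain URL)
  }
  const url = new URL(source.slice("git+".length));
  const branch = url.searchParams.get("branch"); // null if not given
  const path = url.searchParams.get("path");     // null if not given
  // Drop the query string so only the repository URL remains.
  url.search = "";
  return { repo: url.toString(), branch, path };
}

console.log(
  parseGitBehaviorSource(
    "git+https://git.example.com/custom-behaviors?branch=dev&path=path/to/behaviors",
  ),
);
```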
2961

3062
### Examples
3163

@@ -52,3 +84,186 @@ docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --u
5284
```sh
5385
docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --customBehaviors "git+https://git.example.com/custom-behaviors?branch=dev&path=path/to/behaviors"
5486
```
87+
88+
## Creating Custom Behaviors
89+
90+
A custom behavior file can be in one of the following supported formats:
91+
- JSON User Flow
92+
- JavaScript / TypeScript (compiled to JavaScript)
93+
94+
### JSON Flow Behaviors
95+
96+
Browsertrix Crawler 1.6 and up supports replaying the JSON User Flow format generated by [DevTools Recorder](https://developer.chrome.com/docs/devtools/recorder), which is built into Chrome DevTools.
97+
98+
This format can be generated by using the DevTools Recorder to create a series of steps, which are serialized to JSON.
99+
100+
The format represents a series of steps that should happen on a particular page.
101+
102+
The recorder is capable of picking the right selectors interactively and supports events such as `click`, `change`, `waitForElement` and more. See the [feature reference](https://developer.chrome.com/docs/devtools/recorder/reference) for a more complete list.
103+
104+
#### User Flow Extensions
105+
106+
Browsertrix extends the functionality compared to DevTools Recorder in the following ways:
107+
108+
- Browsertrix Crawler will attempt to continue even if the initial step fails, for up to 3 failures.
109+
110+
- If a step is repeated 3 or more times, Browsertrix Crawler will attempt to repeat the step as far as it can until the step fails.
111+
112+
- Browsertrix Crawler ignores the `navigate` and `viewport` steps. The `navigate` step is used to match when a particular user flow should run, but does not navigate away from the page.
113+
114+
- If the `navigate` step is removed, the user flow can run on every page in the crawl.
115+
116+
- A `customStep` step with name `runOncePerCrawl` can be added to indicate that a user flow should run only once for a given crawl.
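
For orientation, a user flow using these extensions might look roughly like the object below, written here as JavaScript and serialized to JSON. The title, URL, and selector are placeholders, and the field names (`title`, `steps`, `type`, `selectors`, `offsetX`, `offsetY`, `parameters`) are assumed to follow the DevTools Recorder export format; compare against a flow exported from the Recorder before relying on them.

```javascript
// Illustrative user flow; all concrete values are placeholders.
const userFlow = {
  title: "Example flow",
  steps: [
    // Used by Browsertrix Crawler only to decide which pages the flow runs on.
    { type: "navigate", url: "https://my-site.example.com/" },
    // Marks the flow as one that should run only once per crawl.
    { type: "customStep", name: "runOncePerCrawl", parameters: {} },
    // An ordinary recorded interaction (selector is hypothetical).
    { type: "click", selectors: [["#load-more"]], offsetX: 1, offsetY: 1 },
  ],
};

// Serialize to the JSON form that would be saved as a behavior file.
const userFlowJson = JSON.stringify(userFlow, null, 2);
console.log(userFlowJson);
```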
117+
118+
### JavaScript Behaviors
119+
120+
The main native format of custom behaviors is a JavaScript class.
121+
122+
There should be a single class per file, and it should be of the following format:
123+
124+
#### Behavior Class
125+
126+
```javascript
127+
class MyBehavior
128+
{
129+
// required: an id for this behavior, will be displayed in the logs
130+
// when the behavior is run.
131+
static id = "My Behavior Id";
132+
133+
// required: a function that checks if a behavior should be run
134+
// for a given page.
135+
// This function can check the DOM / window.location to determine
136+
// what page it is on. The first behavior that returns 'true'
137+
// for a given page is used on that page.
138+
static isMatch() {
139+
return window.location.href === "https://my-site.example.com/";
140+
}
141+
142+
// optional: if true, will also check isMatch() and possibly run
143+
// this behavior in each iframe.
144+
// if false, or not defined, this behavior will be skipped for iframes.
145+
static runInIframes = false;
146+
147+
// optional: if defined, provides a way to define a custom way to determine
148+
// when a page has finished loading beyond the standard 'load' event.
149+
//
150+
// if defined, the crawler will await 'awaitPageLoad()' before moving on to
151+
// post-crawl processing operations, including link extraction, screenshots,
152+
// and running main behavior
153+
async awaitPageLoad() {
154+
155+
}
156+
157+
// required: the main behavior async iterator, which should yield for
158+
// each 'step' in the behavior.
159+
// When the iterator finishes, the behavior is done.
160+
// (See below for more info)
161+
async* run(ctx) {
162+
//... yield ctx.getState("starting behavior");
163+
164+
// do something
165+
166+
//... yield ctx.getState("a step has been performed");
167+
}
168+
}
169+
```
170+
171+
#### Behavior run() loop
172+
173+
The `run()` loop provides the main loop for the behavior to run. It must be an [async iterator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/AsyncIterator), which means that it can optionally call `yield` to return state to the crawler and allow it to print the state.
174+
175+
For example, a behavior that iterates over elements and then clicks them either once or twice (based on the value of a custom `.clickTwice` property) could be written as follows:
176+
177+
```javascript
178+
async* run(ctx) {
179+
let click = 0;
180+
let dblClick = 0;
181+
for await (const elem of document.querySelectorAll(".my-selector")) {
182+
if (elem.clickTwice) {
183+
elem.click();
184+
elem.click();
185+
dblClick++;
186+
} else {
187+
elem.click();
188+
click++;
189+
}
190+
ctx.log({msg: "Clicked on elem", click, dblClick});
191+
}
192+
}
193+
```
194+
195+
This behavior will run to completion and log every click. However, it cannot be paused and resumed (as supported in ArchiveWeb.page) and generally cannot be interrupted.
196+
197+
One approach is to yield after every major 'step' in the behavior, for example:
198+
199+
```javascript
200+
async* run(ctx) {
201+
let click = 0;
202+
let dblClick = 0;
203+
for await (const elem of document.querySelectorAll(".my-selector")) {
204+
if (elem.clickTwice) {
205+
elem.click();
206+
elem.click();
207+
dblClick++;
208+
// allows behavior to be paused here
209+
yield {msg: "Double-clicked on elem", click, dblClick};
210+
} else {
211+
elem.click();
212+
click++;
213+
// allows behavior to be paused here
214+
yield {msg: "Single-clicked on elem", click, dblClick};
215+
}
216+
}
217+
}
218+
```
219+
220+
The data that is yielded will be logged in the `behaviorScriptCustom` context.
221+
222+
This allows for the behavior to log the current state of the behavior and allow for it to be gracefully
223+
interrupted after each logical 'step'.
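
The reason yielding makes a behavior interruptible can be seen with plain JavaScript, outside the browser. In the sketch below, `run()` is a stand-in for a behavior and `drive()` is a hypothetical consumer, not the crawler's actual driver: it receives each yielded state and may stop cleanly between steps.

```javascript
// Stand-in for a behavior's run() method: yields after each step.
async function* run() {
  for (let step = 1; step <= 5; step++) {
    yield { msg: "step done", step };
  }
}

// Hypothetical driver: pulls one step at a time and can stop between yields.
async function drive(iterator, shouldStop) {
  const seen = [];
  for await (const state of iterator) {
    seen.push(state);
    if (shouldStop(state)) {
      break; // ends the generator at the current yield point
    }
  }
  return seen;
}

drive(run(), (state) => state.step === 3).then((seen) => {
  console.log(seen.map((s) => s.step)); // [ 1, 2, 3 ]
});
```

A `run()` that never yields gives the consumer no such stopping points, which is why the earlier version can only run to completion.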
224+
225+
#### getState() function
226+
227+
A common pattern is to increment a particular counter, and then return the whole state.
228+
229+
A convenience function `getState()` is provided to simplify this and avoid the need to create custom counters.
230+
231+
Using this standard function, the above code might be condensed as follows:
232+
233+
```javascript
234+
async* run(ctx) {
235+
const { Lib } = ctx;
236+
for await (const elem of document.querySelectorAll(".my-selector")) {
237+
if (elem.clickTwice) {
238+
elem.click();
239+
elem.click();
240+
yield Lib.getState("Double-Clicked on elem", "dblClick");
241+
} else {
242+
elem.click();
243+
yield Lib.getState("Single-Clicked on elem", "click");
244+
}
245+
}
246+
}
247+
```
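
To show what such a helper does conceptually, here is a minimal stand-in that increments a named counter and returns every counter along with the message. `makeGetState` and the shape of the returned object are assumptions for illustration; the real `getState()` in Browsertrix Behaviors may differ in signature and return value.

```javascript
// Hypothetical stand-in for a getState-style helper.
function makeGetState() {
  const state = {}; // counters accumulated across calls
  return function getState(msg, counter) {
    if (counter) {
      // Increment the named counter, creating it on first use.
      state[counter] = (state[counter] || 0) + 1;
    }
    // Return a copy so callers cannot mutate the internal counters.
    return { state: { ...state }, msg };
  };
}

const getState = makeGetState();
getState("Single-Clicked on elem", "click");
console.log(getState("Double-Clicked on elem", "dblClick"));
```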
248+
249+
#### Utility Functions
250+
251+
In addition to `getState()`, Browsertrix Behaviors includes [a small library of other utility functions](https://github.com/webrecorder/browsertrix-behaviors/blob/main/src/lib/utils.ts) which are available to behaviors under `ctx.Lib`.
252+
253+
Some of these functions which may be of use to behaviors authors are:
254+
255+
- `scrollAndClick`: scroll element into view and click
256+
- `sleep`: sleep for specified timeout (ms)
257+
- `waitUntil`: wait until a certain predicate is true
258+
- `waitUntilNode`: wait until a DOM node exists
259+
- `xpathNode`: find a DOM node by xpath
260+
- `xpathNodes`: find and iterate all DOM nodes by xpath
261+
- `xpathString`: find a string attribute by xpath
262+
- `iterChildElem`: iterate over all child elements of given element
263+
- `iterChildMatches`: iterate over all child elements that match a specific xpath
264+
- `isInViewport`: determine if a given element is in the visible viewport
265+
- `scrollToOffset`: scroll to particular offset
266+
- `scrollIntoView`: smoothly scroll particular element into view
267+
- `getState`: increment a state counter and return all state counters + string message
268+
269+
More detailed references will be added in the future.
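
Until then, the descriptions of `sleep` and `waitUntil` above can be pictured with the following sketch. These are hypothetical re-implementations written for illustration, including the assumed `interval` parameter; they are not the library's source.

```javascript
// Hypothetical equivalent of `sleep`: resolve after the given number of ms.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical equivalent of `waitUntil`: poll a predicate until it is true.
// The polling interval parameter and its default are assumptions.
async function waitUntil(predicate, interval = 10) {
  while (!predicate()) {
    await sleep(interval);
  }
}

// Usage: wait for a flag that is set asynchronously.
let ready = false;
setTimeout(() => {
  ready = true;
}, 50);
waitUntil(() => ready).then(() => console.log("ready"));
```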

docs/docs/user-guide/cli-options.md

Lines changed: 13 additions & 7 deletions
@@ -19,8 +19,7 @@ Options:
1919
crawl configuration (can also be se
2020
t via CRAWL_ID env var), defaults to
2121
combination of Docker container hos
22-
tname and collection
23-
[string] [default: "@hostname-@id"]
22+
tname and collection [string]
2423
--waitUntil Puppeteer page.goto() condition to w
2524
ait for before continuing, can be mu
2625
ltiple separated by ','
@@ -88,6 +87,9 @@ Options:
8887
[number] [default: 1000000000]
8988
--generateWACZ, --generatewacz, --ge If set, generate WACZ on disk
9089
nerateWacz [boolean] [default: false]
90+
--useSHA1 If set, sha-1 instead of sha-256 has
91+
hes will be used for creating record
92+
s [boolean] [default: false]
9193
--logging Logging options for crawler, can inc
9294
lude: stats (enabled by default), js
9395
errors, debug
@@ -100,16 +102,17 @@ Options:
100102
[array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
101103
, "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
102104
", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
103-
orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
104-
inks", "sitemap", "wacz", "replay", "proxy"] [default: []]
105+
orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
106+
atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
107+
[]]
105108
--logExcludeContext Comma-separated list of contexts to
106109
NOT include in logs
107110
[array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
108111
, "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
109112
", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
110-
orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
111-
inks", "sitemap", "wacz", "replay", "proxy"] [default: ["recorderNetwork","jsE
112-
rror","screencast"]]
113+
orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
114+
atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
115+
["recorderNetwork","jsError","screencast"]]
113116
--text Extract initial (default) or final t
114117
ext to pages.jsonl or WARC resource
115118
record(s)
@@ -236,6 +239,9 @@ Options:
236239
[array] [default: []]
237240
--logErrorsToRedis If set, write error messages to redi
238241
s [boolean] [default: false]
242+
--logBehaviorsToRedis If set, write behavior script messag
243+
es to redis
244+
[boolean] [default: false]
239245
--writePagesToRedis If set, write page objects to redis
240246
[boolean] [default: false]
241247
--maxPageRetries, --retries If set, number of times to retry a p

docs/docs/user-guide/proxies.md

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ Only key-based authentication is supposed for SSH proxies for now.
7474

7575
## Browser Profiles
7676

77-
The above proxy settings also apply to [Browser Profile Creation](../browser-profiles), and browser profiles can also be created using proxies, for example:
77+
The above proxy settings also apply to [Browser Profile Creation](browser-profiles.md), and browser profiles can also be created using proxies, for example:
7878

7979
```sh
8080
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts

docs/mkdocs.yml

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ nav:
6363

6464
markdown_extensions:
6565
- toc:
66-
toc_depth: 3
66+
toc_depth: 4
6767
permalink: true
6868
- pymdownx.highlight:
6969
anchor_linenums: true

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
{
22
"name": "browsertrix-crawler",
3-
"version": "1.6.0-beta.1",
3+
"version": "1.6.0",
44
"main": "browsertrix-crawler",
55
"type": "module",
66
"repository": "https://github.com/webrecorder/browsertrix-crawler",
