Analyze links and extract relevant data.
This requires a few services in order to function properly:
- A MongoDB (>= 3) server:
  - MONGO_URL defaults to mongodb://localhost:27017
  - MONGO_DB defaults to link-proxy
- An S3 compatible service:
  - S3_ENDPOINT defaults to http://localhost:9000
  - S3_ACCESS_KEY defaults to minio
  - S3_SECRET_KEY defaults to minio-s3cr3t
- A Redis server:
  - REDIS_HOST defaults to localhost
  - REDIS_PORT defaults to 6379
The docker-compose file in geodatagouv/docker exposes all these services for an easy development setup.
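When scripting around these services, it can help to resolve the same variables with the same fallbacks. The following is a minimal sketch, not link-proxy's own configuration code: the `DEFAULTS` table copies the documented defaults, and `resolve_settings` is a hypothetical helper name.

```python
import os

# Defaults copied from the list of required services.
DEFAULTS = {
    "MONGO_URL": "mongodb://localhost:27017",
    "MONGO_DB": "link-proxy",
    "S3_ENDPOINT": "http://localhost:9000",
    "S3_ACCESS_KEY": "minio",
    "S3_SECRET_KEY": "minio-s3cr3t",
    "REDIS_HOST": "localhost",
    "REDIS_PORT": "6379",
}


def resolve_settings(environ=None):
    """Return each setting, taken from the environment or its documented default.

    `environ` is any mapping; it falls back to os.environ when omitted.
    """
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}
```

Calling `resolve_settings()` with no argument reads the current process environment, which makes it easy to print the effective values before starting the web or worker service.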
It also requires the following:
This exposes two services:
- A web service that you can run using yarn start:web.
- A worker service that you can run using yarn start:worker.
Both services are available as docker images:
$ docker pull geodatagouv/link-proxy-web:latest
$ docker pull geodatagouv/link-proxy-worker:latest

Run all dependency services by using the dependencies.yml docker-compose file in docker/dev:

$ docker-compose -f dependencies.yml up

The link-proxy apps are also available in the apps.yml file. If you just need to run all the services, run:

$ docker-compose -f dependencies.yml -f apps.yml up

Create a check for a given URL.
It takes a JSON object with a required location property, containing the URL to check.
Example
$ curl localhost:5000 -d '{"location": "https://geo.data.gouv.fr/robots.txt"}'
{
  "_id": "5aa167645d88a1a73a42995e",
  "createdAt": "2018-03-08T16:40:03.011Z",
  "locations": [
    "https://geo.data.gouv.fr/robots.txt"
  ],
  "updatedAt": "2018-03-08T16:40:03.011Z"
}

Retrieve all the downloads for the given link.
Example
$ curl localhost:5000/5aa167645d88a1a73a42995e
{
  "_id": "5aa167645d88a1a73a42995e",
  "createdAt": "2018-03-08T16:40:03.011Z",
  "updatedAt": "2018-03-08T16:40:03.081Z",
  "locations": [
    "https://geo.data.gouv.fr/robots.txt"
  ],
  "downloads": [
    {
      "_id": "5baa8d29b22014e817d13541",
      "createdAt": "2018-03-08T16:40:03.094Z",
      "type": "document",
      "archive": false,
      "files": [
        "robots.txt"
      ],
      "url": "http://localhost:9000/link-proxy-files/geo.data.gouv.fr/2018-03-08/5aa16763670cb515e9bf2d12-robots.txt"
    }
  ]
}

Find a link based on its URL. It will redirect (302) to the matching link, if found.
Example
$ curl -v 'localhost:5000/?location=https://geo.data.gouv.fr/robots.txt'
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 5000 (#0)
> GET /?location=https://geo.data.gouv.fr/robots.txt HTTP/1.1
> Host: localhost:5000
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 302 Found
< Location: /5aa167645d88a1a73a42995e
< Date: Fri, 09 Mar 2018 15:54:16 GMT
< Connection: keep-alive
< Content-Length: 0
<
* Connection #0 to host localhost left intact

Retrieve the list of the past 20 checks for a link.
Example
$ curl localhost:5000/5aa167645d88a1a73a42995e/checks
[
  {
    "number": 1,
    "createdAt": "2018-03-08T16:40:03.011Z",
    "updatedAt": "2018-03-08T16:40:03.081Z",
    "state": "finished",
    "statusCode": 200
  }
]

Retrieve a check for a link.
Example
$ curl localhost:5000/5aa167645d88a1a73a42995e/checks/1
[
  {
    "number": 1,
    "createdAt": "2018-03-08T16:40:03.011Z",
    "updatedAt": "2018-03-08T16:40:03.081Z",
    "state": "finished",
    "options": {
      "noCache": false
    },
    "statusCode": 200
  }
]
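The endpoints shown so far can also be driven from a script. Below is a minimal sketch using only the Python standard library. It assumes the web service listens on localhost:5000 as in the curl examples, and that it accepts a JSON body sent with a `Content-Type: application/json` header (the curl example does not set one, so this is an assumption). All function names are hypothetical helpers, not part of link-proxy.

```python
import json
import urllib.parse
import urllib.request

# Assumption: same address as in the curl examples.
BASE_URL = "http://localhost:5000"


def create_check_request(location, base_url=BASE_URL):
    """Build the POST request that creates a check for `location`."""
    body = json.dumps({"location": location}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/",
        data=body,
        # Assumption: the service parses JSON request bodies.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def link_url(link_id, base_url=BASE_URL):
    """URL of a link, which lists its downloads."""
    return f"{base_url}/{link_id}"


def checks_url(link_id, number=None, base_url=BASE_URL):
    """URL of the checks list for a link, or of one check when `number` is given."""
    url = f"{base_url}/{link_id}/checks"
    return url if number is None else f"{url}/{number}"


def find_link_url(location, base_url=BASE_URL):
    """URL of the lookup-by-location endpoint, with the location percent-encoded."""
    return base_url + "/?" + urllib.parse.urlencode({"location": location})


def fetch_json(request_or_url):
    """Send the request (redirects are followed by urllib) and decode the JSON reply."""
    with urllib.request.urlopen(request_or_url) as response:
        return json.load(response)
```

With the services running, `fetch_json(create_check_request("https://geo.data.gouv.fr/robots.txt"))` should return the link object, and its `_id` can then be passed to `link_url` or `checks_url`. Because urllib follows redirects by default, fetching `find_link_url(...)` would return the matching link directly rather than the 302 response.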
Find the latest non-running check for a link. It will redirect (302) to the matching check, if found.
Example
$ curl -v 'localhost:5000/5aa167645d88a1a73a42995e/checks/latest'
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 5000 (#0)
> GET /5aa167645d88a1a73a42995e/checks/latest HTTP/1.1
> Host: localhost:5000
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 302 Found
< Location: /5aa167645d88a1a73a42995e/checks/1
< Date: Thu, 27 Sep 2018 14:17:37 GMT
< Connection: keep-alive
< Content-Length: 0
<
* Connection #0 to host localhost left intact

Whenever a check is successful, an HTTP notification can be sent to other applications using webhooks.
The following body will be POSTed to a listening web service:
{
  "check": {
    "linkId": "5aa167645d88a1a73a42995e",
    "number": 1,
    "createdAt": "2018-03-08T16:40:03.011Z",
    "updatedAt": "2018-03-08T16:40:03.081Z",
    "state": "finished",
    "location": "https://geo.data.gouv.fr/robots.txt",
    "options": {
      "noCache": true
    },
    "statusCode": 200
  },
  "links": [
    "5aa167645d88a1a73a42995e"
  ]
}

The check metadata is sent in the webhook along with an array of all the impacted links.
The links array includes all parent links which may be impacted by the change. For example, when analyzing a subtree of an index-of, the parent links will be included in links so that they can be notified of changes happening down the tree.
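A listening web service could handle this payload along the following lines. This is a sketch under assumptions: the port (8080), the 204 reply, and the names `parse_notification`, `WebhookHandler` and `serve` are illustrative only, and how webhook targets are registered with link-proxy is not covered here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_notification(raw_body):
    """Return (check, link_ids) from a webhook body shaped like the example above."""
    payload = json.loads(raw_body)
    check = payload["check"]
    # `links` holds the checked link plus any parent links that may be impacted.
    link_ids = payload.get("links", [])
    return check, link_ids


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            check, link_ids = parse_notification(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Body was not JSON or did not have the expected shape.
            self.send_response(400)
            self.end_headers()
            return
        for link_id in link_ids:
            # Placeholder reaction: a real consumer might refresh its own
            # records for each impacted link here.
            print(f"link {link_id} impacted by check #{check.get('number')}")
        # Assumption: any 2xx reply is acceptable to the sender.
        self.send_response(204)
        self.end_headers()


def serve(port=8080):
    """Run the receiver until interrupted; the port is an arbitrary choice."""
    HTTPServer(("localhost", port), WebhookHandler).serve_forever()
```

Iterating over `links` rather than reading only `check.linkId` is what lets a consumer react to parent links, such as an index-of page whose subtree changed.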