scraper/README.md
Simon Vieille b680003549
add option to get multiple results
2020-11-10 13:30:38 +01:00


Scraper
=======
[![Build Status](https://ci.gitnet.fr/buildStatus/icon?job=Gitnet%2Fscraper%2Fmaster)](https://ci.gitnet.fr/job/Gitnet/job/scraper/job/master/)
This project is a basic tool for scraping data from a website
using a CSS selector.
For example, to retrieve the number of releases of a project hosted on GitHub:
With CLI
---
```
node src/cli.js \
  --url https://github.com/foo/bar \
  --selector '.repository-content .numbers-summary li:nth-child(4) a' \
  --tags \
  --breaks \
  --spaces \
  --trim
```
...will show `XXX releases`.
For more help, run `node src/cli.js --help`.
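The `--tags`, `--breaks`, `--spaces` and `--trim` flags used above each apply one clean-up step to the extracted text. The sketch below approximates those steps with plain string operations, to show the kind of transformation to expect. It is not the tool's actual implementation: the helper names (`stripTags`, `removeBreaks`, `collapseSpaces`, `cleanText`), the regular expressions, and the order in which the steps run are all assumptions made for illustration.

```javascript
// Hypothetical helpers approximating the documented filters.
// They are NOT taken from the scraper's source code.

// tags: remove anything that looks like an HTML tag
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '')
}

// breaks: remove line breaks (\n, \r)
function removeBreaks(text) {
  return text.replace(/[\r\n]+/g, '')
}

// spaces: collapse runs of whitespace into one space, leaving breaks alone
function collapseSpaces(text) {
  return text.replace(/[^\S\r\n]{2,}/g, ' ')
}

// Apply all four steps; the order used here is an assumption
function cleanText(raw) {
  return collapseSpaces(removeBreaks(stripTags(raw))).trim()
}

console.log(cleanText('<a href="#">\n  12  releases\n</a>')) // prints "12 releases"
```

With every step enabled, a raw fragment such as an anchor element wrapped in indentation is reduced to its visible text.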
With code
---------
```
const scraper = require('deblan-scraper')

const options = {
  url: 'https://github.com/foo/bar',
  acceptAllStatus: false, // Optional, default is `false`
  method: 'GET', // Optional, default is `GET`
}

const isMultiple = false // `false` returns the first result, `true` returns an array of results
const selector = '.repository-content .numbers-summary li:nth-child(4) a'

const filters = {
  tags: null, // Removes tags. You can specify the tags to remove (separated by commas)
  breaks: null, // Removes breaks (\n, \r)
  spaces: null, // Replaces 2 successive spaces with 1, except breaks
  trim: null, // Strips whitespace from the beginning and end of the value
}

scraper(
  options,
  selector,
  filters,
  // Called with the scraped value (or an array of values if `isMultiple` is `true`)
  function(value) {
    console.log(value)
  },
  // Called when the request or the extraction fails
  function(error) {
    console.log(error)
  },
  isMultiple
)
```
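The callback-style call above can be wrapped in a Promise for use with `async`/`await`. The following is a minimal sketch, assuming only the argument order shown in the example (`options, selector, filters, onValue, onError, isMultiple`). `scrapeAsync` is a hypothetical helper, not part of the package, and `fakeScraper` is a stand-in that lets the sketch run without network access. In real use, pass the function returned by `require('deblan-scraper')` instead.

```javascript
// Hypothetical wrapper: turns a callback-style scraper into a Promise.
// `scrape` must accept (options, selector, filters, onValue, onError, isMultiple).
function scrapeAsync(scrape, options, selector, filters, isMultiple = false) {
  return new Promise((resolve, reject) => {
    scrape(options, selector, filters, resolve, reject, isMultiple)
  })
}

// Stand-in for require('deblan-scraper'), used for illustration only.
function fakeScraper(options, selector, filters, onValue, onError, isMultiple) {
  if (!options.url) {
    onError(new Error('missing url'))
    return
  }
  onValue(isMultiple ? ['12 releases'] : '12 releases')
}

scrapeAsync(fakeScraper, { url: 'https://github.com/foo/bar' }, 'a', { trim: null })
  .then((value) => console.log(value))
  .catch((error) => console.error(error))
```

The wrapper keeps `isMultiple` as the last parameter with a default of `false`, mirroring the example above where a single result is returned unless an array is requested.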
Installation
------------
Requirements:
* node >= 10
* yarn
```
$ git clone https://gitnet.fr/deblan/scraper.git
$ cd scraper
$ yarn
```