Crawler API

A Crawler is a service that takes a source URL, a traversal depth (the number of levels to follow), and a script that extracts data from a web page. Starting from the source URL, the service builds a list of child links by following every HREF, retrieves each page, and applies the script to the HTML document. The result is a set of JSON documents that can be loaded into a search engine for full-text search.
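
The traversal described above can be illustrated with a minimal sketch. This is not the service's implementation: the `crawl`, `extract`, and `PageParser` names, the in-memory `PAGES` dictionary standing in for real page retrieval, and the choice to extract only the page title are all assumptions made for illustration.

```python
import json
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the HREFs and the <title> text of one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(url, html):
    """Hypothetical extraction script: returns a JSON-ready dict and the page's HREFs."""
    parser = PageParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip()}, parser.hrefs


def crawl(source_url, depth, fetch):
    """Breadth-first traversal from source_url, following HREFs up to `depth` levels."""
    seen = {source_url}
    queue = deque([(source_url, 0)])
    documents = []
    while queue:
        url, level = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        doc, hrefs = extract(url, html)
        documents.append(doc)
        if level < depth:  # only expand links while below the depth limit
            for href in hrefs:
                child = urljoin(url, href)  # resolve relative links
                if child not in seen:
                    seen.add(child)
                    queue.append((child, level + 1))
    return documents


# Assumption: an in-memory stand-in for the web, so the sketch runs offline.
PAGES = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/c">C</a>',
    "http://example.test/b": "<title>Page B</title>",
    "http://example.test/c": "<title>Page C</title>",
}

docs = crawl("http://example.test/", 1, PAGES.get)
print(json.dumps(docs, indent=2))
```

With a depth of 1, the sketch visits the source page and its two direct children (Home, Page A, Page B) and stops before Page C. A real crawler would replace `PAGES.get` with page retrieval through a headless browser and run the user-supplied script in place of `extract`.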

Features

  • Asynchronous
  • Headless browser (parses JavaScript-enabled sites)

API Commands

  • PUT /:app/crawler/:name - Create a Job
  • POST /:app/crawler/:name - Run a Job
  • GET /:app/crawler/:name - Get a Job Document
  • DELETE /:app/crawler/:name - Delete a Job
  • GET /:app/crawler - List Crawler Jobs
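
As a rough sketch of how a client might address these commands, the snippet below maps each one to its HTTP method and path and builds (but does not send) the request. The base URL, the command keys, and the `build_request` helper are assumptions for illustration; request bodies and response formats are not covered here.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a hypothetical host; substitute the real service address.
BASE_URL = "http://localhost:8080"

# Method and path for each command listed above.
COMMANDS = {
    "create": ("PUT", "/{app}/crawler/{name}"),
    "run": ("POST", "/{app}/crawler/{name}"),
    "get": ("GET", "/{app}/crawler/{name}"),
    "delete": ("DELETE", "/{app}/crawler/{name}"),
    "list": ("GET", "/{app}/crawler"),
}


def build_request(command, app, name=None):
    """Build an unsent urllib Request for one crawler command (hypothetical helper)."""
    method, template = COMMANDS[command]
    if "{name}" in template and name is None:
        raise ValueError(f"command {command!r} requires a job name")
    path = template.format(app=quote(app, safe=""), name=quote(name or "", safe=""))
    return Request(BASE_URL + path, method=method)


for command in COMMANDS:
    req = build_request(command, "myapp", "news")
    print(f"{command:<7} {req.get_method():<7} {req.full_url}")
```

Sending a request is then a matter of passing it to `urllib.request.urlopen`, along with whatever payload the service expects for creating a job.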
