Scraper Resource

The scraper executor is a native capability compiled into the kdeps binary. It fetches a URL and returns the text content, with optional CSS selector filtering.

Configuration

yaml
run:
  scraper:
    url: "https://example.com"     # required
    selector: "article.content"    # optional CSS selector
    timeout: 30                    # seconds (default: 30)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to fetch |
| selector | string | no | | CSS selector to scope extraction |
| timeout | integer | no | 30 | Request timeout in seconds |

Output

| Key | Type | Description |
| --- | --- | --- |
| content | string | Extracted text (full body or selector match) |
| url | string | The URL that was fetched |
| status | integer | HTTP status code |
| json | string | Full result as a JSON string |

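Access individual fields with output('actionId').content, output('actionId').status, and so on, where actionId is the actionId of the scraper resource.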
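
As a minimal sketch of field access, a downstream resource can read several of the fields listed under Output in one prompt. The actionId values fetchPage and describePage are hypothetical and the model name is only a placeholder. The layout mirrors the two-resource pattern used in the examples on this page.

```yaml
metadata:
  actionId: fetchPage              # hypothetical actionId
run:
  scraper:
    url: "https://example.com"

---
metadata:
  actionId: describePage           # hypothetical actionId
  requires: [fetchPage]
run:
  chat:
    model: llama3.2:1b             # placeholder model
    # Reads the url, status, and content fields of the scraper result.
    prompt: |
      Source: {{ output('fetchPage').url }} (HTTP {{ output('fetchPage').status }})
      Content: {{ output('fetchPage').content }}
```

Including the status and url alongside the content gives the model (and anyone reading logs) some context about where the text came from and whether the fetch succeeded.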

Examples

Fetch a page and summarize

yaml
metadata:
  actionId: fetch
run:
  scraper:
    url: "{{ get('url') }}"

---
metadata:
  actionId: summarize
  requires: [fetch]
run:
  chat:
    model: llama3.2:1b
    prompt: "Summarize: {{ output('fetch').content }}"
  apiResponse:
    response: "{{ output('summarize') }}"

Extract with CSS selector

yaml
metadata:
  actionId: fetchArticle
run:
  scraper:
    url: "https://news.example.com/article"
    selector: "article.body"

Error Handling

Use onError to handle unreachable URLs gracefully:

yaml
run:
  scraper:
    url: "https://example.com"
  onError:
    action: continue
    fallback: ""
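
A sketch of how the error handler might be combined with the other scraper fields documented on this page, for a host that is slow or intermittently unavailable. The actionId fetchSlowSite is hypothetical, and whether a request timeout is routed through onError in the same way as an unreachable URL is an assumption, not something this page confirms.

```yaml
metadata:
  actionId: fetchSlowSite          # hypothetical actionId
run:
  scraper:
    url: "https://example.com"
    selector: "article.content"    # optional: scope extraction to one element
    timeout: 10                    # seconds; shorter than the default of 30
  onError:
    action: continue               # keep the workflow running on failure
    fallback: ""                   # assumed to stand in for the missing result
```

A shorter timeout paired with action: continue keeps one slow page from stalling the rest of the workflow, at the cost of downstream resources possibly receiving the empty fallback.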

Need more? For PDF extraction, OCR, and document types (.docx, .xlsx), install the component:

bash
kdeps registry install scraper

Released under the Apache 2.0 License.