Scraper Resource

The scraper executor is a native capability compiled into the kdeps binary. It fetches a URL and returns the text content, with optional CSS selector filtering.

Where it runs

The scraper runs in both workflow mode and agent mode. In workflow mode it executes as a DAG step. In agent mode, the workflow containing this resource runs as a single callable tool.

Configuration

yaml
# resources/fetch.yaml
scraper:
  url: "https://example.com"     # required
  selector: "article.content"    # optional CSS selector
  timeout: 30                    # seconds (default: 30)

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | yes | | URL to fetch |
| selector | string | no | | CSS selector to scope extraction |
| timeout | integer | no | 30 | Request timeout in seconds |

Output

| Key | Type | Description |
|---|---|---|
| content | string | Extracted text (full body or selector match) |
| url | string | The URL that was fetched |
| status | integer | HTTP status code |
| json | string | Full result as a JSON string |

Access a field with output('actionId').<key>, where actionId is the actionId of the scraper resource; for example, output('fetch').content.
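
As a rough sketch of this accessor pattern, the resource below reads three of the scraper's output keys in a single prompt. It assumes a scraper resource with actionId fetch already exists. The file name, the actionId report, and the prompt wording are hypothetical, and placing several expressions in one multi-line string is an assumption extrapolated from the single-expression examples on this page.

```yaml
# resources/report.yaml (hypothetical file name)
actionId: report
requires: [fetch]          # assumes a scraper resource with actionId "fetch"
chat:
  model: llama3.2:1b
  # Each expression reads one documented output key of the scraper
  prompt: |
    Source: {{ output('fetch').url }}
    HTTP status: {{ output('fetch').status }}
    Page text: {{ output('fetch').content }}
```

Passing the url and status alongside the content gives the downstream step enough context to tell an empty page apart from a failed request.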

Examples

Fetch a page and summarize

yaml
# resources/fetch.yaml
actionId: fetch
scraper:
  url: "{{ get('url') }}"

---
actionId: summarize
requires: [fetch]
chat:
  model: llama3.2:1b
  prompt: "Summarize: {{ output('fetch').content }}"
apiResponse:
  response: "{{ output('summarize') }}"

Extract with CSS selector

yaml
# resources/fetch-article.yaml
actionId: fetchArticle
scraper:
  url: "https://news.example.com/article"
  selector: "article.body"

Error Handling

Use onError to handle unreachable URLs gracefully:

yaml
# resources/example.yaml
scraper:
  url: "https://example.com"
onError:
  action: continue
  fallback: ""
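
One way this might fit into a pipeline is sketched below, combining onError with the documented timeout field and a dependent chat step. The file name, the actionIds, and the 10-second timeout are illustrative. How the fallback value is exposed to downstream resources is not specified on this page, so the assumption that output('fetch').content is empty after a failed fetch should be verified before relying on it.

```yaml
# resources/fetch-safe.yaml (hypothetical file name)
actionId: fetch
scraper:
  url: "{{ get('url') }}"
  timeout: 10            # shorter than the 30-second default
onError:
  action: continue       # keep the workflow running if the fetch fails
  fallback: ""

---
actionId: summarize
requires: [fetch]
chat:
  model: llama3.2:1b
  # Assumption: content is empty when the fallback was used
  prompt: "Summarize, or reply 'no content' if nothing follows: {{ output('fetch').content }}"
```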

Need more? For PDF extraction, OCR, and document types (.docx, .xlsx), install the component:

bash
kdeps registry install scraper

Released under the Apache 2.0 License.