Scraper Resource

The Scraper resource extracts text content from 15 source types — web pages, documents, spreadsheets, images, and structured data files — without requiring external services for most formats. It can be used as a primary resource or as an inline resource inside before / after blocks.

Basic Usage

```yaml
apiVersion: kdeps.io/v1
kind: Resource

metadata:
  actionId: scrapeWebPage
  name: Scrape Web Page

run:
  scraper:
    type: url
    source: "https://example.com"
    timeoutDuration: 30s
```

Configuration Options

| Option | Type | Description |
| --- | --- | --- |
| `type` | string | Required. Content type to scrape (see Supported Types). |
| `source` | string | Required. URL or file path to scrape. Supports expressions. |
| `timeoutDuration` | string | Maximum time for URL fetching (e.g. `30s`, `1m`). Default: `30s`. |
| `timeout` | string | Alias for `timeoutDuration`. |
| `ocr` | object | OCR options (only for `type: image`). |
| `ocr.language` | string | Tesseract language code (e.g. `eng`, `deu`). Default: `eng`. |

Supported Types

| Type | Description | External Dependency |
| --- | --- | --- |
| `url` | Fetches a web page and extracts its visible text | None |
| `pdf` | Extracts text from a PDF file | `pdftotext` (poppler-utils) preferred; falls back to raw scan |
| `word` | Extracts text from a `.docx` file | None |
| `excel` | Extracts cell values from a `.xlsx` file | None |
| `image` | Runs OCR on an image file | `tesseract` CLI required |
| `text` | Reads a plain-text file as-is | None |
| `html` | Reads a local HTML file and extracts visible text | None |
| `csv` | Reads a CSV file and returns rows as tab-separated text | None |
| `markdown` | Reads a Markdown file and returns plain text (markup stripped) | None |
| `pptx` | Extracts text from a PowerPoint `.pptx` file | None |
| `json` | Reads a JSON file and returns its pretty-printed content | None |
| `xml` | Reads a local XML file and extracts all text nodes | None |
| `odt` | Extracts text from an OpenDocument Text `.odt` file | None |
| `ods` | Extracts text from an OpenDocument Spreadsheet `.ods` file | None |
| `odp` | Extracts text from an OpenDocument Presentation `.odp` file | None |

Examples by Type

URL

Fetches a web page and strips HTML tags, scripts, and styles, returning plain visible text.

```yaml
run:
  scraper:
    type: url
    source: "https://example.com/page"
    timeoutDuration: 15s
```

PDF

Extracts text from a PDF file. Uses pdftotext (from poppler-utils) when available; otherwise falls back to a raw ASCII scan of the PDF binary.

```yaml
run:
  scraper:
    type: pdf
    source: /data/report.pdf
```

Word Document

Extracts text from a Word .docx file by parsing its internal XML.

```yaml
run:
  scraper:
    type: word
    source: /data/contract.docx
```

Excel Spreadsheet

Extracts cell values from an Excel .xlsx file. Each row is returned as a tab-separated line, with rows separated by newlines (tabs and newlines are preserved in the output).

```yaml
run:
  scraper:
    type: excel
    source: /data/budget.xlsx
```

Image OCR

Runs Tesseract OCR on an image to extract text. Requires the tesseract CLI to be installed.

```yaml
run:
  scraper:
    type: image
    source: /data/scanned-invoice.png
    ocr:
      language: eng     # Tesseract language code; default: eng
```

Supported image formats: PNG, JPEG, TIFF, BMP, and any other format that Tesseract accepts.
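
For non-English scans, set `ocr.language` to the matching Tesseract language code. A minimal sketch, assuming a German-language scan at a hypothetical path (the corresponding Tesseract language data, e.g. `tesseract-ocr-deu`, must be installed alongside the CLI):

```yaml
run:
  scraper:
    type: image
    source: /data/rechnung-scan.png   # hypothetical path to a German invoice scan
    ocr:
      language: deu                   # German; requires the deu traineddata pack
```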

Plain Text

Reads a plain-text file and returns its content as-is.

```yaml
run:
  scraper:
    type: text
    source: /data/notes.txt
```

HTML File

Reads a local HTML file and returns its visible text (scripts, styles, and tags removed).

```yaml
run:
  scraper:
    type: html
    source: /data/page.html
```

CSV

Reads a CSV file and returns each row as a tab-separated line.

```yaml
run:
  scraper:
    type: csv
    source: /data/records.csv
```

Markdown

Reads a Markdown file and returns plain text with most markup (headers, bold, links) stripped.

```yaml
run:
  scraper:
    type: markdown
    source: /data/README.md
```

PowerPoint

Extracts text from the slides of a .pptx file.

```yaml
run:
  scraper:
    type: pptx
    source: /data/presentation.pptx
```

JSON

Reads a JSON file, validates it, and returns its pretty-printed content.

```yaml
run:
  scraper:
    type: json
    source: /data/config.json
```

XML

Reads a local XML file and concatenates all text node content.

```yaml
run:
  scraper:
    type: xml
    source: /data/feed.xml
```

OpenDocument Text (ODT)

Extracts text from a LibreOffice/OpenOffice Writer .odt file.

```yaml
run:
  scraper:
    type: odt
    source: /data/document.odt
```

OpenDocument Spreadsheet (ODS)

Extracts cell text from a LibreOffice/OpenOffice Calc .ods file.

```yaml
run:
  scraper:
    type: ods
    source: /data/spreadsheet.ods
```

OpenDocument Presentation (ODP)

Extracts slide text from a LibreOffice/OpenOffice Impress .odp file.

```yaml
run:
  scraper:
    type: odp
    source: /data/slides.odp
```

Accessing the Result

The scraper stores its result under the resource's actionId. Use get() in downstream resources to access the extracted content.

```yaml
# Scrape the page
metadata:
  actionId: fetchPage
run:
  scraper:
    type: url
    source: "https://example.com"

---

# Use the content in an LLM prompt
metadata:
  actionId: summarize
  requires:
    - fetchPage
run:
  chat:
    model: llama3.2:1b
    prompt: "Summarize this page: {{ get('fetchPage') }}"
```

The result map returned by the scraper contains:

| Key | Type | Description |
| --- | --- | --- |
| `content` | string | The extracted text. |
| `source` | string | The evaluated source URL or path. |
| `type` | string | The scraper type used. |
| `success` | bool | `true` if extraction succeeded. |

Access individual fields with get('actionId', 'content') or the full map with get('actionId').

```yaml
run:
  expr:
    - set('pageText', get('fetchPage', 'content'))
    - set('didSucceed', get('fetchPage', 'success'))
```
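
Individual fields can also be interpolated straight into a downstream prompt. A minimal sketch, assuming the `fetchPage` resource from above:

```yaml
run:
  chat:
    model: llama3.2:1b
    prompt: "From {{ get('fetchPage', 'source') }}: {{ get('fetchPage', 'content') }}"
```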

Dynamic Sources with Expressions

The source field supports expressions, so you can build file paths or URLs at runtime.

```yaml
run:
  scraper:
    type: url
    source: "{{ get('baseUrl') }}/page/{{ get('pageId') }}"
```
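
The same applies to file paths. A sketch that assumes an upstream resource has set a `reportName` value (a hypothetical name used here for illustration):

```yaml
run:
  scraper:
    type: pdf
    source: "/data/{{ get('reportName') }}.pdf"
```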

Using Scraper as an Inline Resource

The scraper can be embedded inside before / after blocks of any resource:

```yaml
run:
  before:
    - scraper:
        type: text
        source: /data/prompt_prefix.txt
  chat:
    model: llama3.2:1b
    prompt: "Context loaded. Answer the query."
```

External Dependencies

| Type | Requirement |
| --- | --- |
| `image` | `tesseract` CLI (install: `apt install tesseract-ocr` or `brew install tesseract`) |
| `pdf` | `pdftotext` from poppler-utils (optional, but improves quality; install: `apt install poppler-utils` or `brew install poppler`) |

All other types use Go standard library only and have no external dependencies.


Error Handling

When scraping fails, the error is propagated to the engine: the run stops early and no output is stored unless `onError.action: continue` is configured. Use `onError` to control this behavior:

```yaml
run:
  scraper:
    type: url
    source: "https://example.com"
  onError:
    action: continue     # continue, fail (default), or retry
    fallback: ""         # Value to use when action is "continue"
```
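
With `action: continue`, the `fallback` value stands in for the missing result, so downstream resources can still proceed. A sketch using a placeholder string (the URL and wording are hypothetical):

```yaml
run:
  scraper:
    type: url
    source: "https://example.com/flaky-endpoint"
  onError:
    action: continue
    fallback: "No page content was available."
```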
