Shake-Based LLM Pipeline Framework
ShakingSpider rebuilds the established Arachnomancer extraction flow on top of Shake, so we can evaluate Shake as the orchestration layer for LLM-heavy pipelines.
By running a familiar domain through Shake rules, we quantify what the build graph buys us — reproducible runs, parallel rebuilds, and diffable artefacts — before applying the pattern elsewhere.
What it unlocks
- Deterministic orchestration. Shake’s dependency tracking keeps the frontier, fetches, classifications, and extractions in sync without hidden state.
- Fast experiment loops. Engineers tweak a prompt or model, rerun the build, and see only the relevant stages recompute.
- Cost control. Cached outputs keep API usage predictable, and the system flags long dependency chains that would trigger full rebuilds.
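The caching and scoped-recompute behaviour above falls out of ordinary Shake rules. A minimal sketch (file names like `urls.txt`, `prompt.txt`, and the `curl` fetch are our hypothetical stand-ins, not the real pipeline): because the fetch rule depends only on the URL list, editing the prompt invalidates the classification output without refetching any pages.

```haskell
import Development.Shake

main :: IO ()
main = shakeArgs shakeOptions{shakeFiles = "_build"} $ do
  want ["_build/classified.json"]

  -- Fetch stage: keyed only on the URL list (readFile' records the
  -- dependency), so prompt edits never trigger a refetch.
  "_build/page.html" %> \out -> do
    urls <- readFile' "urls.txt"
    cmd_ "curl -sS -o" [out, head (lines urls)]

  -- Classification stage: re-runs iff the fetched page or the prompt
  -- changed. The write here is a stand-in for the real LLM call.
  "_build/classified.json" %> \out -> do
    need ["_build/page.html", "prompt.txt"]
    writeFile' out "{\"kind\": \"navigational\"}"
```

Shake persists the dependency database under `shakeFiles`, which is also what makes long dependency chains visible: any edit near the root of the graph shows up as a wide rebuild in the next run.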
Pipeline anatomy
- Fetch batches. Shake materialises every URL fetch into a directory structure that mirrors the site, complete with sidecar metadata for proxies and request headers.
- LLM classification. Cached prompts tag each page as navigational or item content, feeding the next crawl frontier and keeping provenance for later audits.
- Selector validation. Downstream rules plug in selector-scoring hooks so extraction heuristics stay measurable and easy to extend.
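The selector-validation hook from the last bullet can be pictured as an ordinary scoring function wired into a downstream rule. This is a sketch under our own assumptions — `scoreSelector`, the substring match standing in for real CSS matching, and the file names are all hypothetical:

```haskell
import Development.Shake
import Data.List (isInfixOf)

-- Hypothetical hook: score = fraction of labelled pages a candidate
-- selector appears in. A real implementation would parse the HTML and
-- run the selector; substring matching keeps the sketch self-contained.
scoreSelector :: [String] -> String -> Double
scoreSelector pages sel =
  fromIntegral (length (filter (sel `isInfixOf`) pages))
    / fromIntegral (max 1 (length pages))

-- Downstream rule: depends on the fetched pages, emits one score per
-- selector so heuristic changes show up as diffs in the scores file.
selectorRules :: Rules ()
selectorRules =
  "_build/selector-scores.txt" %> \out -> do
    pages <- mapM readFile' ["_build/page1.html", "_build/page2.html"]
    let selectors = ["div.item", "span.price"]
    writeFile' out
      (unlines [s ++ " " ++ show (scoreSelector pages s) | s <- selectors])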
Next steps
We are packaging the framework for teams running large evaluation suites or agent fleets that need reliable, low-latency refreshes.
Curious how this could slot into your workflow?