Community-tier build of Knowledge Harvester. Gated Pro/Enterprise modules (src/processing, src/export, src/integrations) ship as license-required stubs rather than real implementations. Internal integrations, infrastructure, and non-public tooling are excluded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Knowledge Harvester
Knowledge Harvester is an internal knowledge-ops tool for collecting, normalizing, enriching, storing, and exploring automation workflows and related technical artifacts.
In its current form, this repository is not just a scraper. It is a combination of:
- a CLI pipeline runner
- a set of source-specific harvesters
- a Postgres/pgvector-backed artifact store
- an internal HTTP API for browsing and operating on harvested data
- a growing set of maintenance, graph, discovery, and autonomy utilities
What The Tool Actually Does
Today the codebase supports these major workflows:
- Harvest workflows and artifacts from many sources, including n8n, GitHub, Reddit, Activepieces, Windmill, Temporal, Airflow, Prefect, Dagster, LangGraph, ComfyUI, Dify, Flowise, Pipedream, Argo, Luigi, Tekton, GitHub Actions, Home Assistant, MLflow, dbt, Camunda, Kafka Connect, Camel, Terraform, Helm, Docker Compose, Kubernetes manifests, Ansible, CI configs, Dockerfiles, Jupyter notebooks, shell scripts, Makefiles, and YAML-defined GitHub searches.
- Normalize data into seven artifact families: workflow, code_pattern, infra_config, ai_ml_asset, api_spec, data_asset, and documentation.
- Run post-harvest enrichment such as classification, scoring, embeddings, packaging, guide generation, license detection, validation, test detection, trend enrichment, semantic deduplication, decay scoring, and understanding extraction.
- Build graph and discovery features such as recommendations, relations, clusters, bridge nodes, snapshots, coverage reports, and graph materialization.
- Expose the stored data and internal operations through a lightweight API for artifacts, collections, reviews, forks, analytics, webhooks, graphs, schedules, snapshots, discovery, feed/event streams, auto-refresh, and autonomy status.
What This Is Not
- It is not a polished public SaaS product.
- It is not a generic SDK.
- It is not safe to assume that npm start merely launches a server.
This repository is best understood as internal tooling for operating a harvested knowledge base.
Entry Points
There are two primary runtime modes.
1. CLI pipeline and operators
The CLI lives in src/index.js.
Important behavior:
- npm start runs node src/index.js
- if no command is provided, the CLI defaults to pipeline
- that default pipeline is the full 23-step workflow, not a lightweight no-op
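The default-command behavior described above can be pictured with a minimal sketch. The resolveCommand helper is hypothetical and is not taken from src/index.js; it assumes only that the command is the first argument after the script path and that pipeline is the fallback.

```javascript
// Hypothetical helper, not the actual src/index.js logic.
// Assumes the command name is the first argument after the script path.
function resolveCommand(argv) {
  const [command] = argv.slice(2);
  return command ?? "pipeline";
}

// `npm start` runs `node src/index.js` with no extra arguments,
// so it resolves to the full pipeline rather than a harmless default.
console.log(resolveCommand(["node", "src/index.js"]));          // pipeline
console.log(resolveCommand(["node", "src/index.js", "stats"])); // stats
```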
If you want a safe first command, start with one of:
node src/index.js stats
node src/index.js harvest --source n8n-community
node src/index.js classify --limit 20
2. Internal HTTP API
The API server lives in src/server.js.
There is currently no dedicated npm script for it. Start it directly:
node src/server.js
By default it listens on PORT=8011.
The API is intended for internal use. In development, API auth is permissive if KH_API_KEY is unset. In production mode, requests to /api/* require an x-api-key header, and the server returns 503 if KH_API_KEY is not configured.
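The auth rules above can be sketched as a small decision function. This is an illustration, not the code in src/server.js: checkApiAuth is a hypothetical name, the sketch assumes that production mode means NODE_ENV is set to production, and the 401 returned for a missing or wrong key is an assumption, since only the 503 case is documented here.

```javascript
// Hypothetical decision function; the real server middleware may differ.
// Assumption: "production mode" is signalled by NODE_ENV === "production".
// Assumption: a missing or mismatched key yields 401 (not documented above).
function checkApiAuth({ nodeEnv, configuredKey, path, headers = {} }) {
  // Only /api/* routes are gated; /health and /metrics pass through.
  if (!path.startsWith("/api/")) return 200;

  if (!configuredKey) {
    // KH_API_KEY unset: permissive in development, 503 in production.
    return nodeEnv === "production" ? 503 : 200;
  }

  return headers["x-api-key"] === configuredKey ? 200 : 401;
}

console.log(checkApiAuth({ nodeEnv: "development", path: "/api/stats" }));
console.log(checkApiAuth({ nodeEnv: "production", path: "/api/stats" }));
console.log(checkApiAuth({
  nodeEnv: "production",
  configuredKey: "example-key",
  path: "/api/stats",
  headers: { "x-api-key": "example-key" },
}));
```

A real deployment would likely prefer a constant-time comparison for the key; the plain equality check keeps the sketch short.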
Current CLI Capability Surface
The CLI currently exposes these command groups:
- Ingestion: harvest, migrate
- Core enrichment: classify, score, embed, package, guide
- Analysis: migrations, complexity, compose, facets, stats
- Artifact quality and search: license, validate, test-detect, freshness, relations, recommend, enrich-trends
- Higher-order artifact operations: dedup-semantic, decay, understand, graph-materialize
- Operational and autonomy utilities: schedule, snapshot, discover, coverage, compare, auto-refresh, sync-trends, pulse
- End-to-end orchestration: pipeline
The full pipeline currently runs 23 steps:
- migrate
- harvest
- classify
- score
- embed
- package
- guide
- complexity
- migrations
- compose
- monetize
- bundle
- license detection
- validation
- test detection
- relation building
- facet refresh
- trend enrichment
- semantic dedup
- decay prediction
- understanding extraction
- claim extraction
- graph materialization
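To make the ordering above concrete, here is a minimal sketch of a sequential step runner. It is not the pipeline implementation in src/index.js: runPipeline and the handler map are hypothetical, the handlers are synchronous placeholders (the real steps are presumably asynchronous), and stopping at the first failing step is an assumption.

```javascript
// Step names copied from the list above; handlers are placeholders.
const PIPELINE_STEPS = [
  "migrate", "harvest", "classify", "score", "embed", "package", "guide",
  "complexity", "migrations", "compose", "monetize", "bundle",
  "license detection", "validation", "test detection", "relation building",
  "facet refresh", "trend enrichment", "semantic dedup", "decay prediction",
  "understanding extraction", "claim extraction", "graph materialization",
];

// Runs handlers in list order and stops at the first one that throws.
// Stop-on-failure is an assumption, not documented behavior.
function runPipeline(handlers, steps = PIPELINE_STEPS) {
  const completed = [];
  for (const step of steps) {
    const handler = handlers[step];
    if (typeof handler !== "function") {
      throw new Error(`No handler registered for step: ${step}`);
    }
    try {
      handler();
    } catch (err) {
      return { completed, failedAt: step, error: err.message };
    }
    completed.push(step);
  }
  return { completed, failedAt: null, error: null };
}

// Usage with no-op handlers, one of which simulates a failure.
const handlers = Object.fromEntries(PIPELINE_STEPS.map((s) => [s, () => {}]));
handlers["embed"] = () => { throw new Error("embedding backend unreachable"); };
const result = runPipeline(handlers);
console.log(result.completed.length, result.failedAt); // 4 embed
```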
Current API Capability Surface
The API is broader than the old README implied. Current route families include:
- Health and metrics: /health, /health/detailed, /metrics
- Legacy workflow views: /api/workflows, /api/categories, /api/stats
- Harvest control and status: /api/harvest, /api/harvest/:runId/status
- Artifact CRUD and batch operations: /api/artifacts, /api/artifacts/batch, /api/artifacts/batch/export
- Artifact subresources: partial fetches, semantic search, at-risk listing, recommendations, export, scaffold, deploy manifest, reviews, forks, duplicates, canonical selection
- Marketplace-style views: packages, guides, bundles, trending artifacts
- Social and curation features: contributors, collections, reviews, forks
- Webhooks and GitHub webhook ingestion
- Graph and research endpoints: graph query/upsert, blueprints, research gaps
- Sourcing request orchestration: /api/sourcing/requests, /api/sourcing/requests/:id, /api/sourcing/requests/:id/dispatch
- Operational APIs: events, feed, schedules, snapshots, discovery, coverage, time compare, auto-refresh
- Autonomy dashboard endpoints: /api/autonomy/pulse, /api/autonomy/timeline
This is an internal backend, not just a harvest trigger.
Defined Sourcing Contract
The harvester now exposes a defined sourcing-request layer so other systems can ask for targeted research instead of directly guessing which harvesters to run.
This is the intended contract:
- A caller submits a sourcing request with a requester, role, domain, topic, and objective.
- Knowledge Harvester qualifies that request against its current source catalog and existing artifact coverage.
- The request stores recommended sources, unsupported requested sources, suggested categories, suggested artifact types, and current coverage.
- The caller can review the request first or dispatch it immediately.
- The resulting harvest runs are recorded and linked back to the sourcing request.
Current role-aware planning supports requests from roles such as cro, cpo, cmo, cfo, clo, chro, and coo. Those roles influence which YAML and built-in sources are recommended for a request.
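The source-matching part of qualification can be sketched roughly as follows. This is an assumption-laden illustration, not the harvester's qualification code: qualifySources is a hypothetical helper, the catalog is a hand-written subset of the YAML definitions, crm-exports is a made-up name standing in for an unknown source, and role-aware recommendations and coverage checks are not modeled.

```javascript
// Hypothetical helper: splits a caller's preferred sources into those found
// in a known catalog and those that are not. Role-based planning and
// coverage analysis are deliberately left out of this sketch.
function qualifySources(preferredSources, catalog) {
  const known = new Set(catalog);
  const recommended_sources = [];
  const unsupported_requested_sources = [];
  for (const name of preferredSources) {
    if (known.has(name)) {
      recommended_sources.push(name);
    } else {
      unsupported_requested_sources.push(name);
    }
  }
  return { recommended_sources, unsupported_requested_sources };
}

// Catalog subset for illustration only.
const catalog = ["sales-playbooks", "sales-enablement", "customer-success-playbooks"];

// "crm-exports" is an invented name used to show the unsupported path.
const qualification = qualifySources(["sales-playbooks", "crm-exports"], catalog);
console.log(qualification);
```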
Example request payload:
{
  "requester": "research-orchestrator",
  "requester_role": "cro",
  "domain": "revenue",
  "topic": "enterprise lead routing and sales handoff",
  "objective": "collect reusable guidance, workflows, policies, and enablement materials for revenue operations research",
  "research_questions": [
    "What artifacts describe lead qualification and routing?",
    "What sources cover sales handoff, onboarding, and customer-success transitions?"
  ],
  "preferred_sources": [
    "sales-playbooks",
    "sales-enablement",
    "customer-success-playbooks"
  ],
  "auto_dispatch": true
}
Primary endpoints:
- POST /api/sourcing/requests creates and qualifies a sourcing request
- GET /api/sourcing/requests lists sourcing requests
- GET /api/sourcing/requests/:id returns one sourcing request with qualification data
- POST /api/sourcing/requests/:id/dispatch dispatches the recommended or selected sources
The response qualification object is where callers should look for the harvester's judgment about fit and coverage. It includes:
- recommended_sources
- unsupported_requested_sources
- suggested_categories
- suggested_artifact_types
- current_coverage
- recommendation
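A caller might condense the qualification object before deciding whether to dispatch. The sketch below is caller-side and hypothetical: summarizeQualification is not part of the API, it assumes recommended_sources and unsupported_requested_sources are arrays of source names, and the sample values are invented. The shapes of the remaining fields are not documented here, so they are left untouched.

```javascript
// Hypothetical caller-side helper. Assumes the two source lists are arrays;
// anything else is treated as empty rather than guessed at.
function summarizeQualification(qualification) {
  const recommended = Array.isArray(qualification.recommended_sources)
    ? qualification.recommended_sources
    : [];
  const unsupported = Array.isArray(qualification.unsupported_requested_sources)
    ? qualification.unsupported_requested_sources
    : [];
  return {
    // Caller policy (an assumption): dispatch only if something is recommended.
    canDispatch: recommended.length > 0,
    recommended,
    unsupported,
    recommendation: qualification.recommendation ?? null,
  };
}

// Invented sample values, for illustration only.
const summary = summarizeQualification({
  recommended_sources: ["product-requirements", "release-checklists"],
  unsupported_requested_sources: [],
});
console.log(summary.canDispatch, summary.recommended.length);
```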
Data Model
The repository currently straddles two storage models:
- a legacy workflows table and related workflow-specific processing
- a more general artifacts table for the broader artifact taxonomy
That split explains why some commands and API endpoints are workflow-oriented while others are artifact-oriented.
Artifact Types
Current artifact families:
| Type | Description |
|---|---|
| workflow | Automation workflows, DAGs, orchestration definitions |
| code_pattern | Reusable code snippets, components, handlers, templates |
| infra_config | Terraform, Helm, K8s, Docker, CI, shell, Make-style infra and ops config |
| ai_ml_asset | Notebooks, training configs, model-adjacent artifacts |
| api_spec | OpenAPI, GraphQL, and related API contract artifacts |
| data_asset | SQL schemas, dbt-style assets, migrations, data definitions |
| documentation | ADRs, runbooks, operational and technical docs |
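Code that consumes artifacts may want to guard against unknown type values. The following is a minimal sketch under that assumption; ARTIFACT_TYPES and assertArtifactType are hypothetical names, not exports of this repository, and the list simply mirrors the seven families above.

```javascript
// The seven artifact families listed above, frozen to prevent mutation.
const ARTIFACT_TYPES = Object.freeze([
  "workflow",
  "code_pattern",
  "infra_config",
  "ai_ml_asset",
  "api_spec",
  "data_asset",
  "documentation",
]);

// Hypothetical guard: returns the value if it is a known artifact type,
// otherwise throws so bad input fails early.
function assertArtifactType(value) {
  if (!ARTIFACT_TYPES.includes(value)) {
    throw new TypeError(`Unknown artifact type: ${value}`);
  }
  return value;
}

console.log(assertArtifactType("api_spec")); // api_spec
```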
Source Coverage
The CLI currently advertises 37+ built-in sources, plus YAML-defined sources loaded from src/definitions.
The built-in source categories are:
- workflow/template/community sources
- code-search-driven workflow sources
- infrastructure/config harvesters
- notebook/script/config pattern harvesters
- YAML-defined GitHub code-search harvesters
Current YAML definitions in this repo:
- fastapi-patterns
- pytorch-training
- react-patterns
- openapi-specs
- asyncapi-specs
- grpc-protos
- graphql-schemas
- sql-schemas
- adrs
- runbooks
- security-policies
- compliance-controls
- support-playbooks
- product-requirements
- release-checklists
- vendor-assessments
- legal-contracts
- finance-controls
- employee-handbooks
- privacy-governance
- board-governance
- sales-playbooks
- sales-enablement
- customer-onboarding
- customer-success-playbooks
- hr-onboarding
- incident-postmortems
- mcp-servers
- open-policy-agent
- aws-step-functions
- google-workflows
- kestra-flows
- flyte-workflows
- pulumi-programs
- cloudformation-templates
Source Expansion Process
Future domain and source growth should follow a fixed repo process, not ad hoc additions.
Use docs/source-domain-expansion-process.md as the source of truth for:
- when to add a new source versus a new domain
- how to qualify legitimacy, fit, and discard rules
- how to check existing coverage before expanding
- how to keep bias and overrepresentation in check
- what implementation and verification steps are required
Supporting templates live here:
Important detail: YAML source names are the definition names above. They are invoked like this:
node src/index.js harvest --source fastapi-patterns
node src/index.js harvest --source react-patterns
Older examples that used a yaml- prefix, such as yaml-fastapi-patterns, were wrong; use the bare definition name.
Setup
Requirements
- Node.js 18+
- Docker, if you want the bundled Postgres + pgvector setup
- Ollama, for classification, scoring, embeddings, and several downstream enrichment steps
- GitHub token, if you want most GitHub-backed harvesters and all YAML-defined harvesters to do useful work
Basic local setup
git clone https://github.com/GozerAI/knowledge-harvester.git
cd knowledge-harvester
npm install
cp .env.example .env
docker compose up -d
npm run migrate
Safer first-run commands
Instead of immediately running the full pipeline, start with:
node src/index.js stats
node src/index.js harvest --source n8n-community
node src/index.js harvest --source fastapi-patterns
Only run this once the environment is ready:
npm run pipeline
Running The API
Start the server directly:
node src/server.js
Useful environment variables for the API:
- PORT - server port, defaults to 8011
- KH_API_KEY - required for /api/* in production mode
- KH_RATE_LIMIT - per-IP rate limit for /api/*, defaults to 60 requests per minute
- KH_WEBHOOK_SECRET - webhook dispatch signing secret
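As a rough illustration of how these variables and their defaults fit together, the sketch below reads them from an environment object. loadServerConfig is a hypothetical helper, not the code in src/server.js, and falling back to the default when a value is not a valid integer is an assumption.

```javascript
// Hypothetical helper; the real server may parse these differently.
function loadServerConfig(env = process.env) {
  // Parse an integer, falling back when the value is missing or invalid.
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value ?? "", 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    port: toInt(env.PORT, 8011),                      // documented default
    apiKey: env.KH_API_KEY || null,                   // no default
    rateLimitPerMinute: toInt(env.KH_RATE_LIMIT, 60), // documented default
    webhookSecret: env.KH_WEBHOOK_SECRET || null,     // no default
  };
}

console.log(loadServerConfig({}));
console.log(loadServerConfig({ PORT: "9000", KH_RATE_LIMIT: "120" }));
```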
Useful endpoints to verify the service:
GET /health
GET /health/detailed
GET /metrics
GET /api/stats
GET /api/artifacts
GET /api/autonomy/pulse
Configuration
Key runtime configuration comes from .env, src/config.js, and a few server-only environment variables.
Database and model settings
| Variable | Default in code | Notes |
|---|---|---|
| PG_HOST | localhost | Postgres host |
| PG_PORT | 5432 | Code fallback; .env.example and Docker Compose use 5435 on the host |
| PG_DATABASE | workflow_library | Database name |
| PG_USER | harvester | Database user |
| PG_PASSWORD | empty string | .env.example provides a local dev password |
| GITHUB_TOKEN | empty | Needed for most GitHub-backed harvesters and YAML-defined searches |
| REDDIT_CLIENT_ID | empty | Optional |
| REDDIT_CLIENT_SECRET | empty | Optional |
| OLLAMA_HOST | http://localhost:11434 | Ollama endpoint |
| OLLAMA_MODEL | qwen2.5:7b | Used for generation/classification-style tasks |
| OLLAMA_EMBED_MODEL | nomic-embed-text | Used for embeddings |
Port note
There is one subtle but important mismatch in the current runtime setup:
- application code defaults PG_PORT to 5432
- .env.example sets PG_PORT=5435
- Docker Compose maps host 5435 to container 5432
In practice, local setup works if you copy .env.example. If you rely on code defaults alone, the app expects a local Postgres instance on 5432.
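The mismatch is easier to see when the fallback is written out. The sketch below is not the contents of src/config.js; resolvePgConfig is a hypothetical helper that applies the code defaults from the table above to whatever environment it is given.

```javascript
// Hypothetical helper mirroring the documented code defaults.
function resolvePgConfig(env = process.env) {
  return {
    host: env.PG_HOST || "localhost",
    port: Number(env.PG_PORT || 5432),
    database: env.PG_DATABASE || "workflow_library",
    user: env.PG_USER || "harvester",
    password: env.PG_PASSWORD || "",
  };
}

// Without a .env file the app looks for Postgres on 5432.
console.log(resolvePgConfig({}).port);                  // 5432
// With .env.example copied, it matches the Docker Compose host port.
console.log(resolvePgConfig({ PG_PORT: "5435" }).port); // 5435
```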
Architecture Reality Check
The old README described a simpler project than the codebase now contains.
More accurate framing:
- the CLI is the orchestration layer
- harvesters are only one part of the system
- the repo now includes graph, quality, packaging, monetization, maintenance, and autonomy-style capabilities
- the internal API is substantial enough to treat as a first-class subsystem
- the project still contains legacy workflow-specific paths alongside the generalized artifact model
Practical Usage Examples
Harvest one source:
node src/index.js harvest --source n8n-community
Harvest a YAML-defined artifact family:
node src/index.js harvest --source openapi-specs
Run selected enrichment steps:
node src/index.js classify --limit 50
node src/index.js score --limit 100
node src/index.js embed --limit 100
node src/index.js relations --limit 200
Inspect the knowledge base:
node src/index.js stats
node src/index.js coverage
node src/index.js snapshot list
node src/index.js pulse
Run the internal API:
node src/server.js
Create a sourcing request for targeted research:
curl -X POST http://localhost:8011/api/sourcing/requests \
  -H "Content-Type: application/json" \
  -d '{
    "requester": "research-orchestrator",
    "requester_role": "cpo",
    "domain": "product",
    "topic": "onboarding and release readiness",
    "objective": "find source material for product operations research",
    "preferred_sources": ["product-requirements", "release-checklists"],
    "auto_dispatch": false
  }'
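The same request can be prepared from Node.js. This is a minimal sketch rather than a client shipped with the repository: buildSourcingRequest is a hypothetical helper that only assembles the URL and fetch options, the base URL assumes the default port, and the optional x-api-key header applies only when the server is configured with KH_API_KEY.

```javascript
// Hypothetical helper: builds fetch arguments without sending anything.
function buildSourcingRequest(payload, { baseUrl = "http://localhost:8011", apiKey } = {}) {
  const headers = { "Content-Type": "application/json" };
  if (apiKey) headers["x-api-key"] = apiKey;
  return {
    url: `${baseUrl}/api/sourcing/requests`,
    options: { method: "POST", headers, body: JSON.stringify(payload) },
  };
}

const { url, options } = buildSourcingRequest({
  requester: "research-orchestrator",
  requester_role: "cpo",
  domain: "product",
  topic: "onboarding and release readiness",
  objective: "find source material for product operations research",
  preferred_sources: ["product-requirements", "release-checklists"],
  auto_dispatch: false,
});
console.log(options.method, url);

// To send it (Node.js 18+ provides a global fetch), inside an async function:
//   const response = await fetch(url, options);
//   const body = await response.json();
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is dispatched.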
Testing
Run the full suite:
npm test
License
MIT. See LICENSE.