GozerAI 5c60135b54 chore(release): leak-clean public community edition Community-tier build of Knowledge Harvester. Gated Pro/Enterprise modules (src/processing, src/export, src/integrations) ship as license-required stubs rather than real implementations. Internal integrations, infrastructure, and non-public tooling are excluded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>		2026-07-03 03:38:40 -04:00
docs
src
tests
.dockerignore
.gitignore
docker-compose.yml
Dockerfile
LICENSE
LICENSING.md
NOTICE
package-lock.json
package.json
README.md
run.bat
test-shard.mjs

README.md

Knowledge Harvester

Knowledge Harvester is an internal knowledge-ops tool for collecting, normalizing, enriching, storing, and exploring automation workflows and related technical artifacts.

In its current form, this repository is not just a scraper. It is a combination of:

a CLI pipeline runner
a set of source-specific harvesters
a Postgres/pgvector-backed artifact store
an internal HTTP API for browsing and operating on harvested data
a growing set of maintenance, graph, discovery, and autonomy utilities

What The Tool Actually Does

Today the codebase supports these major workflows:

Harvest workflows and artifacts from many sources, including n8n, GitHub, Reddit, Activepieces, Windmill, Temporal, Airflow, Prefect, Dagster, LangGraph, ComfyUI, Dify, Flowise, Pipedream, Argo, Luigi, Tekton, GitHub Actions, Home Assistant, MLflow, dbt, Camunda, Kafka Connect, Camel, Terraform, Helm, Docker Compose, Kubernetes manifests, Ansible, CI configs, Dockerfiles, Jupyter notebooks, shell scripts, Makefiles, and YAML-defined GitHub searches.
Normalize data into seven artifact families: workflow, code_pattern, infra_config, ai_ml_asset, api_spec, data_asset, and documentation.
Run post-harvest enrichment such as classification, scoring, embeddings, packaging, guide generation, license detection, validation, test detection, trend enrichment, semantic deduplication, decay scoring, and understanding extraction.
Build graph and discovery features such as recommendations, relations, clusters, bridge nodes, snapshots, coverage reports, and graph materialization.
Expose the stored data and internal operations through a lightweight API for artifacts, collections, reviews, forks, analytics, webhooks, graphs, schedules, snapshots, discovery, feed/event streams, auto-refresh, and autonomy status.

What This Is Not

It is not a polished public SaaS product.
It is not a generic SDK.
It is not safe to assume npm start merely launches a server: it runs the CLI, which defaults to the full pipeline.

This repository is best understood as internal tooling for operating a harvested knowledge base.

Entry Points

There are two primary runtime modes.

1. CLI pipeline and operators

The CLI lives in src/index.js.

Important behavior:

npm start runs node src/index.js
if no command is provided, the CLI defaults to pipeline
that default pipeline is the full 23-step workflow, not a lightweight no-op
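
The default-command behavior can be pictured with a small sketch. `resolveCommand` is a hypothetical helper written for illustration, not the actual logic in src/index.js; it only shows why a bare `npm start` ends up running the full pipeline.

```javascript
// Hypothetical illustration of the documented default; not the real src/index.js code.
const DEFAULT_COMMAND = "pipeline";

// argv follows the Node convention: [nodePath, scriptPath, ...userArgs].
function resolveCommand(argv) {
  const [command] = argv.slice(2);
  // With no user-supplied command, fall back to the full pipeline.
  return command === undefined ? DEFAULT_COMMAND : command;
}

console.log(resolveCommand(["node", "src/index.js"])); // pipeline
console.log(resolveCommand(["node", "src/index.js", "stats"])); // stats
```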

If you want a safe first command, start with one of:

node src/index.js stats
node src/index.js harvest --source n8n-community
node src/index.js classify --limit 20

2. Internal HTTP API

The API server lives in src/server.js.

There is currently no dedicated npm script for it. Start it directly:

node src/server.js

By default it listens on port 8011; set PORT to override.

The API is intended for internal use. In development, requests are accepted without a key when KH_API_KEY is unset. In production mode, requests to /api/* must carry an x-api-key header, and the server returns 503 if KH_API_KEY is not configured.
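
As a rough sketch of those auth rules, the decision can be written as a pure function. The function name, its argument shape, the 401 status for a bad key, and the assumption that a configured key is also enforced in development are all illustrative guesses; only the 503 case and the permissive development case come from this README.

```javascript
// Sketch of the documented auth rules. The 401 status and the behavior when a key
// is configured in development are assumptions, not confirmed by the README.
function authorize({ path, production, configuredKey, requestKey }) {
  // Only /api/* routes are described as protected.
  if (!path.startsWith("/api/")) return 200;
  if (!configuredKey) {
    // Production refuses to serve /api/* without a configured key;
    // development is permissive.
    return production ? 503 : 200;
  }
  // Assumption: a configured key is compared with the x-api-key header value.
  return requestKey === configuredKey ? 200 : 401;
}

console.log(authorize({ path: "/api/stats", production: true })); // 503
console.log(authorize({ path: "/api/stats", production: false })); // 200
```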

Current CLI Capability Surface

The CLI currently exposes these command groups:

Ingestion: harvest, migrate
Core enrichment: classify, score, embed, package, guide
Analysis: migrations, complexity, compose, facets, stats
Artifact quality and search: license, validate, test-detect, freshness, relations, recommend, enrich-trends
Higher-order artifact operations: dedup-semantic, decay, understand, graph-materialize
Operational and autonomy utilities: schedule, snapshot, discover, coverage, compare, auto-refresh, sync-trends, pulse
End-to-end orchestration: pipeline

The full pipeline currently runs 23 steps:

migrate
harvest
classify
score
embed
package
guide
complexity
migrations
compose
monetize
bundle
license detection
validation
test detection
relation building
facet refresh
trend enrichment
semantic dedup
decay prediction
understanding extraction
claim extraction
graph materialization
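
To make the ordering concrete, here is a minimal sketch of a sequential runner over those steps. The step names are copied from the list above; `runPipeline`, the handler map, and the stop-on-first-failure behavior are assumptions for illustration and may not match how src/index.js orchestrates the real pipeline.

```javascript
// Step names are copied from the README list; the runner itself is hypothetical.
const PIPELINE_STEPS = [
  "migrate", "harvest", "classify", "score", "embed", "package", "guide",
  "complexity", "migrations", "compose", "monetize", "bundle",
  "license detection", "validation", "test detection", "relation building",
  "facet refresh", "trend enrichment", "semantic dedup", "decay prediction",
  "understanding extraction", "claim extraction", "graph materialization",
];

// Runs each step in order. `handlers` maps a step name to an async function
// supplied by the caller. Assumption: the run stops at the first failure.
async function runPipeline(handlers, steps = PIPELINE_STEPS) {
  const completed = [];
  for (const step of steps) {
    const handler = handlers[step];
    if (typeof handler !== "function") {
      throw new Error(`No handler registered for step "${step}"`);
    }
    await handler();
    completed.push(step);
  }
  return completed;
}
```

Passing a shorter `steps` array is one way to rehearse a subset of the pipeline before committing to all 23 steps.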

Current API Capability Surface

The API is broader than the old README implied. Current route families include:

Health and metrics: /health, /health/detailed, /metrics
Legacy workflow views: /api/workflows, /api/categories, /api/stats
Harvest control and status: /api/harvest, /api/harvest/:runId/status
Artifact CRUD and batch operations: /api/artifacts, /api/artifacts/batch, /api/artifacts/batch/export
Artifact subresources: partial fetches, semantic search, at-risk listing, recommendations, export, scaffold, deploy manifest, reviews, forks, duplicates, canonical selection
Marketplace-style views: packages, guides, bundles, trending artifacts
Social and curation features: contributors, collections, reviews, forks
Webhooks and GitHub webhook ingestion
Graph and research endpoints: graph query/upsert, blueprints, research gaps
Sourcing request orchestration: /api/sourcing/requests, /api/sourcing/requests/:id, /api/sourcing/requests/:id/dispatch
Operational APIs: events, feed, schedules, snapshots, discovery, coverage, time compare, auto-refresh
Autonomy dashboard endpoints: /api/autonomy/pulse, /api/autonomy/timeline

This is an internal backend, not just a harvest trigger.

Defined Sourcing Contract

The harvester now exposes a defined sourcing-request layer so other systems can ask for targeted research instead of directly guessing which harvesters to run.

This is the intended contract:

A caller submits a sourcing request with a requester, role, domain, topic, and objective.
Knowledge Harvester qualifies that request against its current source catalog and existing artifact coverage.
The request stores recommended sources, unsupported requested sources, suggested categories, suggested artifact types, and current coverage.
The caller can review the request first or dispatch it immediately.
The resulting harvest runs are recorded and linked back to the sourcing request.

Current role-aware planning supports requests from roles such as cro, cpo, cmo, cfo, clo, chro, and coo. Those roles influence which YAML and built-in sources are recommended for a request.

Example request payload:

{
  "requester": "research-orchestrator",
  "requester_role": "cro",
  "domain": "revenue",
  "topic": "enterprise lead routing and sales handoff",
  "objective": "collect reusable guidance, workflows, policies, and enablement materials for revenue operations research",
  "research_questions": [
    "What artifacts describe lead qualification and routing?",
    "What sources cover sales handoff, onboarding, and customer-success transitions?"
  ],
  "preferred_sources": [
    "sales-playbooks",
    "sales-enablement",
    "customer-success-playbooks"
  ],
  "auto_dispatch": true
}
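
Before sending a payload like the one above, a caller may want to check it locally. The following is a sketch of such a check. The five required fields follow the contract described in this section; the validation rules themselves, and the function name, are assumptions rather than the server's actual validation.

```javascript
// Required fields come from the sourcing contract; the rules below are an
// assumption about what a careful caller might check, not the server's logic.
const REQUIRED_FIELDS = ["requester", "requester_role", "domain", "topic", "objective"];

function validateSourcingRequest(payload) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    const value = payload[field];
    if (typeof value !== "string" || value.trim() === "") {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  // Optional list fields, when present, are expected to be arrays of strings.
  for (const field of ["research_questions", "preferred_sources"]) {
    const value = payload[field];
    const ok = Array.isArray(value) && value.every((v) => typeof v === "string");
    if (value !== undefined && !ok) {
      errors.push(`${field} must be an array of strings`);
    }
  }
  if (payload.auto_dispatch !== undefined && typeof payload.auto_dispatch !== "boolean") {
    errors.push("auto_dispatch must be a boolean");
  }
  return errors; // An empty array means the payload passed every check.
}
```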

Primary endpoints:

POST /api/sourcing/requests creates and qualifies a sourcing request
GET /api/sourcing/requests lists sourcing requests
GET /api/sourcing/requests/:id returns one sourcing request with qualification data
POST /api/sourcing/requests/:id/dispatch dispatches the recommended or selected sources
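
A thin client for these four endpoints might look like the sketch below. It relies on the global `fetch` and `URL` available in Node.js 18+. The helper names, the `action` keywords, and the assumption that responses are JSON are illustrative; the README does not specify whether dispatch accepts a request body, so any `body` is passed through unchanged.

```javascript
// Route table for the four documented sourcing endpoints.
const ROUTES = {
  create: () => ["POST", "/api/sourcing/requests"],
  list: () => ["GET", "/api/sourcing/requests"],
  get: (id) => ["GET", `/api/sourcing/requests/${encodeURIComponent(id)}`],
  dispatch: (id) => ["POST", `/api/sourcing/requests/${encodeURIComponent(id)}/dispatch`],
};

// Pure helper: builds the URL and fetch options without touching the network.
function buildSourcingCall(baseUrl, action, { id, body, apiKey } = {}) {
  const route = ROUTES[action];
  if (!route) throw new Error(`Unknown action: ${action}`);
  if ((action === "get" || action === "dispatch") && id === undefined) {
    throw new Error(`${action} requires an id`);
  }
  const [method, path] = route(id);
  const headers = {};
  if (apiKey) headers["x-api-key"] = apiKey;
  const init = { method, headers };
  if (body !== undefined) {
    headers["Content-Type"] = "application/json";
    init.body = JSON.stringify(body);
  }
  return { url: new URL(path, baseUrl).toString(), init };
}

// Sends the call. Assumption: the API answers with JSON on success.
async function callSourcing(baseUrl, action, options) {
  const { url, init } = buildSourcingCall(baseUrl, action, options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`${init.method} ${url} failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping URL construction separate from the network call makes the routing easy to test without a running server.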

The response qualification object is where callers should look for the harvester's judgment about fit and coverage. It includes:

recommended_sources
unsupported_requested_sources
suggested_categories
suggested_artifact_types
current_coverage
recommendation
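
One way a caller might consume that object is sketched below. Only the field names come from this README. The assumptions here are that the response exposes it under a `qualification` property, that the two source lists are arrays, and that "dispatch only when something was recommended" is a sensible caller-side rule.

```javascript
// Assumptions: `response.qualification` exists and the source lists are arrays.
// Only the field names are taken from the README.
function summarizeQualification(response) {
  const q = response.qualification ?? {};
  const recommended = Array.isArray(q.recommended_sources) ? q.recommended_sources : [];
  const unsupported = Array.isArray(q.unsupported_requested_sources)
    ? q.unsupported_requested_sources
    : [];
  return {
    recommendedCount: recommended.length,
    unsupportedCount: unsupported.length,
    // Caller-side heuristic: dispatching is only useful if a source was recommended.
    worthDispatching: recommended.length > 0,
    recommendation: q.recommendation ?? null,
  };
}
```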

Data Model

The repository currently straddles two storage models:

a legacy workflows table and related workflow-specific processing
a more general artifacts table for the broader artifact taxonomy

That split explains why some commands and API endpoints are workflow-oriented while others are artifact-oriented.

Artifact Types

Current artifact families:

| Type | Description |
| --- | --- |
| `workflow` | Automation workflows, DAGs, orchestration definitions |
| `code_pattern` | Reusable code snippets, components, handlers, templates |
| `infra_config` | Terraform, Helm, K8s, Docker, CI, shell, Make-style infra and ops config |
| `ai_ml_asset` | Notebooks, training configs, model-adjacent artifacts |
| `api_spec` | OpenAPI, GraphQL, and related API contract artifacts |
| `data_asset` | SQL schemas, dbt-style assets, migrations, data definitions |
| `documentation` | ADRs, runbooks, operational and technical docs |
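
Client code that filters or routes by artifact type can keep the seven families in one place. This is a minimal sketch; the constant and helper names are hypothetical, and only the type strings come from the table above.

```javascript
// The seven artifact families listed in the README.
const ARTIFACT_TYPES = Object.freeze([
  "workflow",
  "code_pattern",
  "infra_config",
  "ai_ml_asset",
  "api_spec",
  "data_asset",
  "documentation",
]);

// Returns true only for one of the documented type strings.
function isArtifactType(value) {
  return ARTIFACT_TYPES.includes(value);
}

console.log(isArtifactType("api_spec")); // true
console.log(isArtifactType("yaml")); // false
```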

Source Coverage

The CLI currently advertises 37+ built-in sources, plus YAML-defined sources loaded from src/definitions.

The built-in source categories are:

workflow/template/community sources
code-search-driven workflow sources
infrastructure/config harvesters
notebook/script/config pattern harvesters
YAML-defined GitHub code-search harvesters

Current YAML definitions in this repo:

fastapi-patterns
pytorch-training
react-patterns
openapi-specs
asyncapi-specs
grpc-protos
graphql-schemas
sql-schemas
adrs
runbooks
security-policies
compliance-controls
support-playbooks
product-requirements
release-checklists
vendor-assessments
legal-contracts
finance-controls
employee-handbooks
privacy-governance
board-governance
sales-playbooks
sales-enablement
customer-onboarding
customer-success-playbooks
hr-onboarding
incident-postmortems
mcp-servers
open-policy-agent
aws-step-functions
google-workflows
kestra-flows
flyte-workflows
pulumi-programs
cloudformation-templates

Source Expansion Process

Future domain and source growth should follow a fixed repo process, not ad hoc additions.

Use docs/source-domain-expansion-process.md as the source of truth for:

when to add a new source versus a new domain
how to qualify legitimacy, fit, and discard rules
how to check existing coverage before expanding
how to keep bias and overrepresentation in check
what implementation and verification steps are required

Supporting templates live here:

Important detail: YAML sources are invoked by their definition names, exactly as listed under Source Coverage and without any prefix:

node src/index.js harvest --source fastapi-patterns
node src/index.js harvest --source react-patterns

Earlier examples that used a yaml- prefix, such as yaml-fastapi-patterns, were wrong.
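
Scripts that launch harvests can guard against the outdated prefixed form. The helper below is a hypothetical sketch. It assumes no valid source name begins with `yaml-`, which holds for the definitions listed in this README but is not guaranteed by the CLI.

```javascript
// Hypothetical helper: builds the argument list for
// `node src/index.js harvest --source <name>` and rejects the outdated
// `yaml-` prefixed form. Assumption: no valid source name starts with "yaml-".
function harvestArgs(sourceName) {
  if (typeof sourceName !== "string" || sourceName === "") {
    throw new Error("sourceName must be a non-empty string");
  }
  if (sourceName.startsWith("yaml-")) {
    const suggestion = sourceName.slice("yaml-".length);
    throw new Error(`Use "${suggestion}" instead of "${sourceName}"`);
  }
  return ["src/index.js", "harvest", "--source", sourceName];
}

console.log(harvestArgs("fastapi-patterns").join(" "));
// src/index.js harvest --source fastapi-patterns
```

The returned array can be passed as the argument list to `spawn("node", args)` from `node:child_process`.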

Setup

Requirements

Node.js 18+
Docker, if you want the bundled Postgres + pgvector setup
Ollama, for classification, scoring, embeddings, and several downstream enrichment steps
GitHub token, if you want most GitHub-backed harvesters and all YAML-defined harvesters to do useful work

Basic local setup

git clone https://github.com/GozerAI/knowledge-harvester.git
cd knowledge-harvester
npm install
cp .env.example .env
docker compose up -d
npm run migrate

Safer first-run commands

Instead of immediately running the full pipeline, start with:

node src/index.js stats
node src/index.js harvest --source n8n-community
node src/index.js harvest --source fastapi-patterns

Only run this once the environment is ready:

npm run pipeline

Running The API

Start the server directly:

node src/server.js

Useful environment variables for the API:

PORT - server port, defaults to 8011
KH_API_KEY - required for /api/* in production mode
KH_RATE_LIMIT - per-IP rate limit for /api/*, defaults to 60 requests per minute
KH_WEBHOOK_SECRET - webhook dispatch signing secret
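
The sketch below shows how those variables and their documented defaults could be read into one object. `loadApiConfig` is a hypothetical helper, not the code in src/server.js, and falling back to the default on a non-numeric value is an assumption.

```javascript
// Hypothetical reader for the API environment variables listed above.
// Defaults (8011, 60) come from the README; the fallback on invalid
// numbers is an assumption.
function loadApiConfig(env = process.env) {
  const toPositiveInt = (value, fallback) => {
    const parsed = Number.parseInt(value ?? "", 10);
    return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
  };
  return {
    port: toPositiveInt(env.PORT, 8011),
    apiKey: env.KH_API_KEY || null,
    rateLimitPerMinute: toPositiveInt(env.KH_RATE_LIMIT, 60),
    webhookSecret: env.KH_WEBHOOK_SECRET || null,
  };
}

console.log(loadApiConfig({}).port); // 8011
```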

Useful endpoints to verify the service:

GET /health
GET /health/detailed
GET /metrics
GET /api/stats
GET /api/artifacts
GET /api/autonomy/pulse
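
A small smoke check over those endpoints could be scripted as follows. This is a sketch using the global `fetch` from Node.js 18+. The function names are hypothetical, and sending `x-api-key` on every request, including the health routes, is a simplification.

```javascript
// The verification endpoints listed in the README.
const VERIFY_PATHS = [
  "/health",
  "/health/detailed",
  "/metrics",
  "/api/stats",
  "/api/artifacts",
  "/api/autonomy/pulse",
];

// Pure helper: expands the paths against a base URL.
function verifyUrls(baseUrl = "http://localhost:8011") {
  return VERIFY_PATHS.map((path) => new URL(path, baseUrl).toString());
}

// Requests each endpoint in turn and records the HTTP status,
// or the error message if the request could not be made at all.
async function smokeCheck(baseUrl, apiKey) {
  const headers = apiKey ? { "x-api-key": apiKey } : {};
  const results = [];
  for (const url of verifyUrls(baseUrl)) {
    try {
      const response = await fetch(url, { headers });
      results.push({ url, status: response.status });
    } catch (error) {
      results.push({ url, status: null, error: error.message });
    }
  }
  return results;
}
```

With the server running, `smokeCheck().then(console.table)` prints one row per endpoint.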

Configuration

Key runtime configuration comes from .env, src/config.js, and a few server-only environment variables.

Database and model settings

| Variable | Default in code | Notes |
| --- | --- | --- |
| `PG_HOST` | `localhost` | Postgres host |
| `PG_PORT` | `5432` | Code fallback; `.env.example` and Docker Compose use `5435` on the host |
| `PG_DATABASE` | `workflow_library` | Database name |
| `PG_USER` | `harvester` | Database user |
| `PG_PASSWORD` | empty string | `.env.example` provides a local dev password |
| `GITHUB_TOKEN` | empty | Needed for most GitHub-backed harvesters and YAML-defined searches |
| `REDDIT_CLIENT_ID` | empty | Optional |
| `REDDIT_CLIENT_SECRET` | empty | Optional |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama endpoint |
| `OLLAMA_MODEL` | `qwen2.5:7b` | Used for generation/classification-style tasks |
| `OLLAMA_EMBED_MODEL` | `nomic-embed-text` | Used for embeddings |

Port note

There is one subtle but important mismatch in the current runtime setup:

application code defaults PG_PORT to 5432
.env.example sets PG_PORT=5435
Docker Compose maps host 5435 to container 5432

In practice, local setup works if you copy .env.example. If you rely on code defaults alone, the app expects a local Postgres instance on 5432.
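
The mismatch is easier to see in code. The sketch below mirrors the documented defaults. It is an illustration of the fallback behavior, not the contents of src/config.js, and `resolvePgConfig` is a hypothetical name.

```javascript
// Mirrors the documented code defaults; not the actual src/config.js.
function resolvePgConfig(env = process.env) {
  return {
    host: env.PG_HOST || "localhost",
    // Falls back to 5432 when PG_PORT is unset, which misses the
    // Docker Compose database published on host port 5435.
    port: Number.parseInt(env.PG_PORT || "5432", 10),
    database: env.PG_DATABASE || "workflow_library",
    user: env.PG_USER || "harvester",
    password: env.PG_PASSWORD || "",
  };
}

console.log(resolvePgConfig({}).port); // 5432
console.log(resolvePgConfig({ PG_PORT: "5435" }).port); // 5435
```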

Architecture Reality Check

The old README described a simpler project than the codebase now contains.

More accurate framing:

the CLI is the orchestration layer
harvesters are only one part of the system
the repo now includes graph, quality, packaging, monetization, maintenance, and autonomy-style capabilities
the internal API is substantial enough to treat as a first-class subsystem
the project still contains legacy workflow-specific paths alongside the generalized artifact model

Practical Usage Examples

Harvest one source:

node src/index.js harvest --source n8n-community

Harvest a YAML-defined artifact family:

node src/index.js harvest --source openapi-specs

Run selected enrichment steps:

node src/index.js classify --limit 50
node src/index.js score --limit 100
node src/index.js embed --limit 100
node src/index.js relations --limit 200

Inspect the knowledge base:

node src/index.js stats
node src/index.js coverage
node src/index.js snapshot list
node src/index.js pulse

Run the internal API:

node src/server.js

Create a sourcing request for targeted research:

curl -X POST http://localhost:8011/api/sourcing/requests \
  -H "Content-Type: application/json" \
  -d '{
    "requester": "research-orchestrator",
    "requester_role": "cpo",
    "domain": "product",
    "topic": "onboarding and release readiness",
    "objective": "find source material for product operations research",
    "preferred_sources": ["product-requirements", "release-checklists"],
    "auto_dispatch": false
  }'

Testing

Run the full suite:

npm test

License

MIT. See LICENSE.