Most organizational knowledge is trapped in prose: papers, contracts, reports, clinical notes, manuals, and meeting records. Search can retrieve a passage, but many applications need something stricter: entities with types, relationships with identifiers, events with time and place, and outputs that downstream software can validate.
Hyper-Extract is an Apache-2.0, Python 3.11+ framework and CLI for turning unstructured documents into persistent, strongly typed Knowledge Abstracts. It combines LLM structured output with reusable schemas, extraction methods, incremental updates, search, visualization, Obsidian export, and MCP access.
Its key idea is choice: not every document should become the same generic knowledge graph.
A typed model captures predictable fields; lists and sets collect repeated items without inventing relationships.
A graph links pairs of entities. A hypergraph can represent one relation involving several entities without forcing artificial pairwise edges.
Temporal, spatial, and spatio-temporal graphs support questions about when, where, and how knowledge changes.
The Three-Layer Architecture
Hyper-Extract separates what knowledge looks like from how it is extracted and which domain configuration is used.
1. Auto-Types
The project provides eight output families: Model, List, Set, Graph, Hypergraph, Temporal Graph, Spatial Graph, and Spatio-Temporal Graph. These are strongly typed rather than arbitrary JSON blobs, which makes validation and downstream integration more predictable.
A model suits fixed fields. A graph captures entities and pairwise relationships. A hypergraph represents a relationship connecting more than two entities. Temporal and spatial variants retain context that a plain graph can flatten away.
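The difference between a graph and a hypergraph is easier to see in data. The sketch below is a generic illustration, not Hyper-Extract's own classes: `Edge`, `Hyperedge`, and `to_pairwise` are hypothetical names, and the sale example is invented. It shows one four-party relation kept as a single unit, then flattened into pairwise edges.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Edge:
    """A pairwise relation between exactly two entities."""
    source: str
    relation: str
    target: str


@dataclass(frozen=True)
class Hyperedge:
    """One relation over any number of entities, each tagged with a role."""
    relation: str
    members: tuple[tuple[str, str], ...]  # (role, entity) pairs


def to_pairwise(hyperedge: Hyperedge) -> list[Edge]:
    """Flatten a hyperedge into pairwise edges; roles are dropped on the way."""
    entities = [entity for _role, entity in hyperedge.members]
    return [Edge(a, hyperedge.relation, b) for a, b in combinations(entities, 2)]


# Invented example data: one sale involving four participants.
sale = Hyperedge(
    relation="asset_sale",
    members=(
        ("buyer", "Acme"),
        ("seller", "Globex"),
        ("asset", "Plant 7"),
        ("jurisdiction", "Delaware"),
    ),
)

edges = to_pairwise(sale)
print(len(sale.members), "members in one hyperedge")
print(len(edges), "pairwise edges after flattening")
```

After flattening, nothing in the edge list says which entity was the buyer or that the six edges describe a single event. That lost context is what a hypergraph type is meant to keep.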
2. Extraction Methods
Hyper-Extract exposes multiple extraction engines, including approaches associated with GraphRAG, LightRAG, Hyper-RAG, KG-Gen, and others. The method determines how text is chunked, prompted, merged, and evolved into the selected structure.
This layer matters because extracting a small typed record is different from reconciling entities and relations across a long corpus.
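To make the chunk, extract, and merge steps concrete, here is a deliberately naive sketch. It does not reproduce any of the engines named above: `chunk_words`, `extract_entities`, and `extract_document` are hypothetical helpers, and a regular expression stands in for the LLM call.

```python
import re


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def extract_entities(chunk: str) -> set[str]:
    """Stand-in for an LLM call: treat capitalized words as entity names."""
    return set(re.findall(r"\b[A-Z][a-z]+\b", chunk))


def extract_document(text: str, size: int = 50, overlap: int = 10) -> set[str]:
    """Extract per chunk, then merge so duplicates from overlaps collapse."""
    merged: set[str] = set()
    for chunk in chunk_words(text, size, overlap):
        merged |= extract_entities(chunk)
    return merged


text = "Alice met Bob in Paris and then Bob joined Acme with Carol"
chunks = chunk_words(text, size=5, overlap=2)
entities = extract_document(text, size=5, overlap=2)
print(len(chunks), sorted(entities))
```

Merging names into a set is the easy case. Reconciling relations, aliases, and conflicting attributes across chunks is where real extraction methods differ.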
3. Templates
More than 80 YAML presets cover general, finance, legal, medical, traditional Chinese medicine, and industrial domains. A template declares language, output fields, identifiers, types, and other behavior. Teams can start without writing Python and later create a custom template for their own ontology.
Identifiers are especially important. An entity key such as name, or a relation key built from source, type, and target, controls whether new documents update existing knowledge or create duplicates.
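The effect of identifier choice can be illustrated with a small in-memory store. This is a sketch of the general idea under simple assumptions, not Hyper-Extract's merge logic: `KnowledgeStore` and its methods are hypothetical. Entities are keyed by `name`, and relations by the `(source, type, target)` triple.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeStore:
    """Toy store: records with the same key are merged, others are added."""
    entities: dict[str, dict] = field(default_factory=dict)
    relations: dict[tuple[str, str, str], dict] = field(default_factory=dict)

    def upsert_entity(self, record: dict) -> None:
        key = record["name"]  # entity identifier
        self.entities.setdefault(key, {}).update(record)

    def upsert_relation(self, record: dict) -> None:
        key = (record["source"], record["type"], record["target"])  # relation identifier
        self.relations.setdefault(key, {}).update(record)


store = KnowledgeStore()

# Same key: the second document enriches the existing entity.
store.upsert_entity({"name": "Marie Curie", "field": "physics"})
store.upsert_entity({"name": "Marie Curie", "born": 1867})

# Different spelling, different key: a duplicate entity appears.
store.upsert_entity({"name": "M. Curie", "field": "chemistry"})

# The same triple seen twice is stored once, with merged attributes.
triple = {"source": "Marie Curie", "type": "married_to", "target": "Pierre Curie"}
store.upsert_relation({**triple, "year": 1895})
store.upsert_relation({**triple, "place": "Sceaux"})

print(sorted(store.entities))
print(len(store.relations))
```

The duplicate `M. Curie` entry shows why a key such as `name` only works when names are normalized before merging.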
From Document to Knowledge Abstract
The CLI’s basic workflow is compact:
uv tool install hyperextract
he config init -k YOUR_OPENAI_API_KEY
he parse report.pdf -t general/academic_graph -o ./knowledge/ -l en
he search ./knowledge/ "What are the key findings?"
he show ./knowledge/
he parse reads the source and applies a template. The resulting Knowledge Abstract is persistent: more documents can be added later to expand and refine it. he search performs semantic retrieval, while he show opens an interactive visualization.
The Python API supports the same pattern:
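from pathlib import Path
from hyperextract import Template
# Hypothetical input file; any plain-text string can be passed to parse().
document_text = Path("biography.txt").read_text(encoding="utf-8")
ka = Template.create("general/biography_graph")
result = ka.parse(document_text)
result.show()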
Providers and Local Deployment
Hyper-Extract depends on an LLM’s structured-output capability, using JSON Schema or function calling. The README lists verified OpenAI, Anthropic, Alibaba Bailian, and local vLLM models. Embeddings use an OpenAI-compatible endpoint.
Anthropic models can perform extraction, but Anthropic does not provide an embeddings API, so semantic search must be paired with a separate, compatible embedding provider. For private deployments, the project documents running a local Qwen model alongside the bge-m3 embedding model through vLLM.
Local execution can keep source documents on premises, but “local” does not automatically mean secure. Model endpoints, logs, caches, output folders, and visualization services still require access controls and retention policies.
Incremental Knowledge Evolution
A useful knowledge base cannot be a one-time artifact. Hyper-Extract lets users feed additional documents into an existing Knowledge Abstract. Merge strategies and stable identifiers reconcile new extractions with accumulated state.
This is powerful but also difficult. Entity names vary, facts conflict, and later documents may supersede earlier claims. Teams should define policies for deduplication, confidence, source priority, temporal validity, and deletion. A growing graph is not necessarily a more accurate graph.
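One way to make such a policy explicit is to write it as a ranking function over claims. The sketch below is an assumption-laden illustration, not a Hyper-Extract merge strategy: `Claim`, `SOURCE_PRIORITY`, and `resolve` are hypothetical, and the rule "source priority first, then recency" is only one defensible choice.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    entity: str
    attribute: str
    value: str
    source: str   # kind of document the claim came from
    as_of: date   # date the claim was stated


# Hypothetical policy: higher number wins; unknown sources rank lowest.
SOURCE_PRIORITY = {"audited_filing": 2, "press_release": 1}


def rank(claim: Claim) -> tuple[int, date]:
    """Order claims by source priority first, then by recency."""
    return (SOURCE_PRIORITY.get(claim.source, 0), claim.as_of)


def resolve(claims: list[Claim]) -> dict[tuple[str, str], Claim]:
    """Keep the best-ranked claim for each (entity, attribute) pair."""
    current: dict[tuple[str, str], Claim] = {}
    for claim in claims:
        key = (claim.entity, claim.attribute)
        held = current.get(key)
        if held is None or rank(claim) > rank(held):
            current[key] = claim
    return current


# Invented example data.
claims = [
    Claim("Acme", "ceo", "Dana", "press_release", date(2024, 3, 1)),
    Claim("Acme", "ceo", "Lee", "audited_filing", date(2024, 1, 15)),
    Claim("Acme", "ceo", "Kim", "press_release", date(2024, 6, 1)),
]

winner = resolve(claims)[("Acme", "ceo")]
print(winner.value, winner.source, winner.as_of)
```

Here the older audited filing beats two newer press releases. A team that trusts recency more would swap the two elements of the ranking tuple. Either way, the full list of claims should be kept so the decision can be reviewed.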
Beyond Graphs: Hypergraphs and Context
Traditional knowledge graphs model relations as edges between two nodes. Some facts are inherently multi-party: a clinical intervention connects patient, treatment, clinician, condition, and outcome; a transaction connects buyer, seller, asset, jurisdiction, and time.
A hypergraph can preserve such a relation as one unit instead of decomposing it into pairwise edges that may lose meaning. Spatial and temporal graph types similarly make location and time first-class, enabling questions such as “Which event occurred before this decision?” or “Which entities interacted at this site?”
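Once time and place are stored as fields rather than buried in text, the two questions above become simple filters. The following sketch uses a hypothetical `Event` record and invented data. It does not use Hyper-Extract's temporal or spatial graph types.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    name: str
    when: date
    where: str
    participants: frozenset[str]


def events_before(events: list[Event], anchor_name: str) -> list[Event]:
    """Return events that happened strictly before the named event, oldest first."""
    by_name = {event.name: event for event in events}
    anchor = by_name[anchor_name]
    earlier = [event for event in events if event.when < anchor.when]
    return sorted(earlier, key=lambda event: event.when)


def entities_at(events: list[Event], site: str) -> set[str]:
    """Return every participant in any event recorded at the given site."""
    found: set[str] = set()
    for event in events:
        if event.where == site:
            found |= event.participants
    return found


# Invented example data.
events = [
    Event("shutdown decision", date(2024, 3, 5), "HQ", frozenset({"Board"})),
    Event("pump failure", date(2024, 3, 2), "Plant 7", frozenset({"Operator", "Pump P-101"})),
    Event("safety audit", date(2024, 2, 10), "Plant 7", frozenset({"Inspector", "Operator"})),
]

print([event.name for event in events_before(events, "shutdown decision")])
print(sorted(entities_at(events, "Plant 7")))
```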
Use this complexity only when queries need it. A Pydantic-style model or list is easier to validate and maintain when relationships are irrelevant.
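For comparison, a flat typed record needs very little machinery. The sketch below uses only the standard library's `dataclasses` instead of Pydantic, which would express the same checks more declaratively. `ContractSummary` and `parse_contract` are hypothetical names, and the fields are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContractSummary:
    title: str
    parties: tuple[str, ...]
    effective_date: date

    def __post_init__(self) -> None:
        # Reject records that have the right shape but unusable content.
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if len(self.parties) < 2:
            raise ValueError("a contract needs at least two parties")


def parse_contract(raw: dict) -> ContractSummary:
    """Convert a loosely typed dict (e.g. decoded JSON) into a checked record."""
    if not isinstance(raw["parties"], list):
        raise TypeError("parties must be a list of names")
    return ContractSummary(
        title=str(raw["title"]),
        parties=tuple(str(name) for name in raw["parties"]),
        effective_date=date.fromisoformat(raw["effective_date"]),
    )


good = parse_contract(
    {"title": "Supply Agreement", "parties": ["Acme", "Globex"], "effective_date": "2024-01-01"}
)
print(good)

try:
    parse_contract({"title": "NDA", "parties": ["Acme"], "effective_date": "2024-01-01"})
except ValueError as error:
    print("rejected:", error)
```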
Obsidian and MCP
Hyper-Extract can export graph knowledge into an Obsidian vault with Markdown notes and [[wikilinks]]. This turns machine-extracted structures into a navigable human workspace, although generated notes still require review.
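The general shape of such an export can be sketched in a few lines. This is not Hyper-Extract's exporter and makes no claim about its note layout: `render_notes` is a hypothetical helper that turns a tiny graph into one Markdown note per entity, with outgoing relations written as wikilinks.

```python
def render_notes(
    entities: dict[str, str],
    relations: list[tuple[str, str, str]],
) -> dict[str, str]:
    """Return {filename: markdown}, one note per entity.

    Assumes entity names are safe to use as filenames.
    """
    notes: dict[str, str] = {}
    for name, summary in entities.items():
        lines = [f"# {name}", "", summary]
        links = [
            f"- {relation} [[{target}]]"
            for source, relation, target in relations
            if source == name
        ]
        if links:
            lines += ["", "## Relations", *links]
        notes[f"{name}.md"] = "\n".join(lines) + "\n"
    return notes


# Invented example data.
entities = {
    "Marie Curie": "Physicist and chemist.",
    "Pierre Curie": "Physicist.",
}
relations = [("Marie Curie", "married_to", "Pierre Curie")]

notes = render_notes(entities, relations)
print(notes["Marie Curie.md"])
```

Writing each value to a file of the same name inside a vault folder would make the links clickable in Obsidian.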
The optional MCP server exposes read and export operations to compatible assistants:
pip install 'hyperextract[mcp]'
he-mcp
Its documented tools cover template listing, knowledge information, search, RAG-style questions, and Obsidian export. Read/export-only scope is a sensible boundary, but the server may expose sensitive derived knowledge; client permissions and filesystem scope still matter.
Where It Fits
Hyper-Extract is useful when output structure is a product requirement rather than an implementation detail:
- papers into concepts, authors, methods, and citations;
- earnings reports into companies, executives, metrics, risks, and relationships;
- contracts into clauses, parties, obligations, dates, and exceptions;
- clinical or scientific text into typed entities and time-aware events;
- industrial reports into equipment, failures, locations, and dependencies;
- personal research collections into linked Obsidian notes.
For simple document Q&A, a conventional chunk-and-vector RAG pipeline may be cheaper and easier. Hyper-Extract earns its complexity when applications need reusable typed knowledge, graph traversal, multi-document evolution, or explicit time and space.
Risks and Evaluation
Structured output guarantees shape, not truth. An LLM can return valid JSON containing an invented entity or incorrect relation. Before production use:
- Build a labeled evaluation set from representative documents.
- Measure entity and relation precision/recall separately.
- Preserve source references for every extracted claim.
- Review merges, conflicts, and deletions.
- Treat medical, legal, and financial templates as starting schemas, not professional validation.
- Estimate token, embedding, and reprocessing costs for long documents.
Also test provider compatibility. A model advertised as supporting structured output may differ in schema limits, reliability, context length, and function-calling behavior.
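The precision/recall item in the checklist above can be computed with plain set arithmetic once extracted and labeled items share the same identifiers. The sketch below is a minimal illustration with invented data, using exact matching. `precision_recall` is a hypothetical helper, and real evaluations usually need name normalization first.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Exact-match precision and recall; empty sets score 0.0."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented labels and predictions for one document.
gold_entities = {"Acme", "Globex", "Plant 7", "Dana"}
pred_entities = {"Acme", "Globex", "Plant 7", "Initech"}

gold_relations = {
    ("Acme", "acquired", "Plant 7"),
    ("Dana", "ceo_of", "Acme"),
}
pred_relations = {
    ("Acme", "acquired", "Plant 7"),
    ("Globex", "acquired", "Plant 7"),
    ("Dana", "works_at", "Acme"),
}

entity_p, entity_r = precision_recall(pred_entities, gold_entities)
relation_p, relation_r = precision_recall(pred_relations, gold_relations)

print(f"entities:  precision={entity_p:.2f} recall={entity_r:.2f}")
print(f"relations: precision={relation_p:.2f} recall={relation_r:.2f}")
```

In this made-up case the entity scores look acceptable while the relation scores are much worse, which is why the two should be reported separately.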
Final Take
Hyper-Extract provides a coherent path from messy prose to typed, persistent knowledge. Its strongest contribution is not “LLM to graph,” but the separation of reusable structures, extraction engines, and domain templates.
That architecture lets teams begin with one command, choose an output richer than flat chunks, evolve it across documents, inspect it visually, export it for humans, and query it from agents through MCP.
Start with the simplest auto-type that answers your questions, define stable identifiers, validate against labeled documents, and add graph, hypergraph, time, or space only when those semantics deliver measurable value.