Extract structured scientific knowledge from papers. Discovers papers from arXiv, PubMed, OpenAlex, and OSTI; you read each paper and produce the extraction JSON yourself.
A nanofiltration paper provides selective passage below a size threshold. A drug delivery paper requires selective transport to a target. Neither team knows the other exists — they publish in different journals, use different vocabulary, and will never cite each other.
This skill extracts structured provides/requires relationships from scientific papers, building a knowledge graph that surfaces these hidden cross-domain connections. Each paper you extract adds a node. The graph finds the bridges.
| Trigger | Action |
|---|---|
| `/discovery-extract` | Discover papers, extract, and save results |
| "Extract some papers" | Run the full pipeline (discover → extract → save) |
| "Submit my extractions" | Create a PR with your batch results |
| "Find papers from arXiv" | Discover from a specific source |
Every paper is decomposed into structured sections. The `cross_domain` section is where discovery happens: its `provides` and `requires` fields use abstract functional language (not domain jargon) so that a materials science paper can match a biology paper.
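As an illustration of that shared vocabulary, two papers from unrelated fields can end up with matching functional descriptions. The fragments below are hypothetical (invented values, field names following `references/prompt.txt`):

```python
import json

# Hypothetical extraction fragments illustrating abstract functional language.
# A nanofiltration paper's "provides" and a drug-delivery paper's "requires"
# describe the same operation in domain-neutral terms.
nanofiltration = {
    "provides": [{
        "operation": "selective passage of molecules below a size threshold",
        "description": "membrane rejects solutes above a fixed size cutoff",
        "performance": "high rejection at the stated cutoff",
        "conditions": "aqueous solution, ambient temperature",
    }]
}
drug_delivery = {
    "requires": [{
        "operation": "selective transport of molecules to a target",
        "description": "carrier must admit or release payload selectively",
        "reason": "off-target release reduces efficacy",
    }]
}
print(json.dumps(nanofiltration, indent=2))
```

Because both sides speak in operations rather than domain nouns, the graph can link the two papers even though neither cites the other.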
No external API keys or LLM calls are needed — you read the paper text and produce the structured JSON yourself. The bundled prompt (`references/prompt.txt`) is your extraction specification.
You will use:

- `python scripts/extract.py discover` — find new papers with abstracts
- `references/prompt.txt` — the full extraction format specification
- `python scripts/extract.py save` — normalize, validate, and save your results
- `gh` — the GitHub CLI, for submitting results

To discover papers:

```shell
python scripts/extract.py discover --count 5
```
This outputs a JSON array of papers (id, source, title, abstract) to stdout. Already-processed papers are automatically excluded.
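Assuming the output shape described above (a JSON array of objects with `id`, `source`, `title`, and `abstract`), the stdout can be parsed with the stdlib; the sample below uses an illustrative, hand-written payload:

```python
import json

# Illustrative sample of the discover output shape (values are invented).
sample_stdout = """[
  {"id": "arxiv:2401.00001", "source": "arxiv",
   "title": "Example Paper", "abstract": "..."}
]"""

papers = json.loads(sample_stdout)
for p in papers:
    # Each entry carries the fields needed for the extraction and save steps.
    print(f'{p["source"]}: {p["id"]} - {p["title"]}')
```

In practice the same parse would be applied to the command's captured stdout.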
To target a specific source:

```shell
python scripts/extract.py discover --source arxiv --count 5
python scripts/extract.py discover --source pmc --count 5
```
Read `references/prompt.txt` to understand the output format: it contains detailed rules, examples, and a self-check procedure.
For each paper from Step 1, produce a JSON object following the schema in `references/prompt.txt`. The paper's abstract is your input text. Write the JSON to a temporary file (e.g., `/tmp/result.json` or any local path).
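A minimal sketch of that step, using a hypothetical skeleton object (consult `references/prompt.txt` for the actual field nesting):

```python
import json
import os
import tempfile

# Hypothetical skeleton only; the real extraction must follow the
# full schema in references/prompt.txt.
result = {
    "paper_analysis": {"summary": "..."},
    "unsolved_tensions": [],
    "cross_domain": {"provides": [], "requires": [], "bridge_tags": []},
}

# Write the object to a temporary file for the save step.
path = os.path.join(tempfile.gettempdir(), "result.json")
with open(path, "w") as f:
    json.dump(result, f, indent=2)
print(path)
```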
Key requirements:
- `paper_analysis` (not `analysis`)
- `unsolved_tensions` entries must be objects with `{tension, constraint_class, why_it_matters, source_quote}`
- `provides` entries must be objects with `{operation, description, performance, conditions}`
- `requires` entries must be objects with `{operation, description, reason}`
- `bridge_tags` must be abstract functional descriptors, not domain nouns
- the `cross_domain` section is where discovery happens — invest effort here

Then save the result:

```shell
python scripts/extract.py save /tmp/result.json \
  --paper-id "arxiv:2401.00001" \
  --source arxiv \
  --title "Paper Title Here"
```
The save command normalizes format issues, validates, adds metadata, and saves
to ~/.discovery/data/batch/. It will report any validation warnings.
```shell
python scripts/extract.py validate ~/.discovery/data/batch/
```
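As a quick stdlib-only pre-flight check, the required top-level keys of each batch file can be verified before submitting. This is a sketch, not a replacement for the bundled validator or `references/schema.json`; the key set below is an assumption:

```python
import json

# Assumed required top-level keys; the authoritative rules live in
# scripts/extract.py validate and references/schema.json.
REQUIRED_TOP_LEVEL = {"paper_analysis", "cross_domain"}

def missing_keys(extraction: dict) -> set:
    """Return the required top-level keys absent from an extraction object."""
    return REQUIRED_TOP_LEVEL - set(extraction)

# Example: a file from ~/.discovery/data/batch/ could be checked with
#   missing_keys(json.load(open(path)))
print(missing_keys({"paper_analysis": {}}))
```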
After extracting a batch, submit results as a PR:
```shell
# Fork (first time only)
gh repo fork pcdeni/discovery-engine --clone=false

# Clone your fork
gh repo clone pcdeni/discovery-engine discovery-engine-submit
cd discovery-engine-submit

# Create branch and copy results
BRANCH="contrib/$(gh api user --jq .login)/$(date +%Y%m%d-%H%M%S)"
git checkout -b "$BRANCH"
cp ~/.discovery/data/batch/*.json submissions/
git add submissions/
git commit -m "Add extraction results"
git push -u origin "$BRANCH"

# Create PR
gh pr create --title "extraction: $(ls submissions/*.json | wc -l) papers" \
  --body "Extraction results from discovery-extract skill" \
  --repo pcdeni/discovery-engine
```
GitHub Actions CI validates submissions and auto-merges passing PRs.
| File | Purpose |
|---|---|
| `scripts/extract.py` | Paper discovery, normalization, validation, saving (Python stdlib only) |
| `references/prompt.txt` | The extraction format specification (444 lines) |
| `references/schema.json` | JSON schema for validation |