An autonomous research paper writing toolkit for Claude Code. From a topic prompt, it produces a publication-ready, journal-quality LaTeX paper through multi-agent orchestration — running for 1-4 hours (standard) or 3-8 hours (deep mode).
New here? See QUICKSTART.md for the 3-minute setup.
Visual overview? Open docs/pipeline-diagram.html in a browser.
Security model? See SECURITY.md. Additional docs: Architecture, Developer Guide, Venue Reference, Pipeline Reference, Scripts Reference.
Research Agent is a Claude Code workspace template and orchestration system that writes complete, journal-quality research papers autonomously. You provide a topic; the pipeline does the rest.
The core workflow is:
create-paper my-paper "LLM reasoning" --venue arxiv
cd my-paper
write-paper
That command sequence launches the multi-stage pipeline described below.
Everything produced is traced. Every paragraph in the final paper links back to its origin: which sources informed it, which agent wrote it, what feedback revised it, and what was cut and why.
| Mode | Research agents | Reference target | Estimated time | Estimated cost |
|---|---|---|---|---|
| Standard | 5 | 30-50 refs | 1-4 hours | ~$50 |
| Deep | 12 + 3 targeted-pass | 60-80 refs | 3-8 hours | ~$150 |
Required:
- Claude Code (`claude`) or Codex (`codex`)
- `pdflatex` and `latexmk`

Optional:

- `codex-bridge` (`npm i -g codex-bridge`) — adversarial AI review from OpenAI Codex throughout the pipeline
- `OPENROUTER_API_KEY` — required to build and query the knowledge graph (LightRAG via Gemini Flash and Qwen3 8B)
- `CORE_API_KEY` — free API key from CORE (200M+ institutional repository papers). Register at core.ac.uk for a free key.
- `NCBI_API_KEY` — free API key from NCBI. Increases the PubMed Central rate limit from 3/s to 10/s. Only useful for biomedical/clinical papers.

create-paper

Symlink the launcher scripts to somewhere on your PATH:
git clone https://github.com/ndcorder/research-agent
cd research-agent
ln -s $(pwd)/create-paper ~/.local/bin/create-paper
ln -s $(pwd)/write-paper ~/.local/bin/write-paper
ln -s $(pwd)/sync-papers ~/.local/bin/sync-papers
Paper projects symlink their runtime scaffold and scripts/ back to this repository's template. Claude projects use .claude/; Codex projects use .codex/ plus a root AGENTS.md symlink. When you update the template, every paper project sees the change instantly.
If you have existing projects created before the symlink migration, run sync-papers once to convert them:
sync-papers /path/to/your/papers
Safe to run multiple times. Projects already using symlinks are skipped.
create-paper <directory> [topic] [--venue <venue>] [--runtime <claude|codex>] [--deep]
Examples:
# Create a project and immediately offer to launch the pipeline
create-paper my-survey "A survey on LLM reasoning" --venue arxiv
# Create the project structure without starting (add topic later)
create-paper my-paper --venue neurips
# Codex-native project
create-paper my-paper "Protein structure prediction" --venue nature --runtime codex
# Deep mode: 12 agents, 60-80 refs, targeted second pass
create-paper my-paper "Protein structure prediction" --venue nature --deep
create-paper does the following in one command:
- Creates the project directory with `main.tex` formatted for the target venue
- Sets up the runtime scaffold (`.claude/` or `.codex/`)
- Writes `.paper.json` and `.venue.json`
- Vendors the scientific skills (`vendor/claude-scientific-skills`)
- Vendors Praxis (`vendor/praxis`) and installs Python dependencies
- Detects `codex-bridge` if it is installed

From inside the paper project:
# Method 1: the write-paper launcher script (recommended)
write-paper # uses topic from .paper.json
write-paper "new topic" # overrides topic
# Method 2: runtime-neutral command launcher
scripts/run-paper-command preview-pipeline
scripts/run-paper-command health
write-paper reads .paper.json and dispatches to the configured runtime. Claude projects use the Claude slash-command flow. Codex projects use codex exec with the project-local .codex/ instructions. See docs/CODEX.md for the Codex-specific workflow.
The pipeline writes a human-readable progress file you can watch from another terminal:
cat .paper-progress.txt # current stage, section word counts
ls research/ # research files as they appear
ls reviews/ # review files during QA
| Flag | Format | Citation style | Page limit |
|---|---|---|---|
| `generic` | Standard article (default) | natbib (`\citep`, `\citet`) | none |
| `ieee` | IEEEtran two-column | numeric | 8 pages |
| `acm` | acmart sigconf, double-blind | natbib | 12 pages |
| `neurips` | NeurIPS single-column | natbib | 9 pages |
| `nature` | Nature family (Results before Methods) | numeric superscript | none |
| `arxiv` | arXiv preprint, extended format | natbib | none |
| `apa` | APA 7th edition | apacite | none |
Each venue JSON also includes a writing_guide field with venue-specific tone, structure, citation density, figure count, and reviewer expectation guidance. Writing agents read the guide to match the conventions of the target venue.
Run /write-paper with the active runtime, or via the write-paper launcher. The pipeline reads .paper.json for topic, venue, depth, and runtime, then executes the shared stages sequentially.
After completing each stage or section, the pipeline writes .paper-state.json tracking exactly which stages and sections are done. If the session is interrupted, rerunning /write-paper reads this file and skips completed work. Partial stage recovery tracks sub-steps within stages (individual research agents, individual sections) so a crash mid-stage resumes from the last completed sub-step, not the start of the stage. .paper-progress.txt is updated at each checkpoint with a human-readable summary.
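The skip-completed-work logic above can be sketched as a small pure function. This is a minimal illustration only: the stage names and the `completed_stages` field are assumptions, not the actual `.paper-state.json` schema.

```python
import json

# Hypothetical stage list; the real pipeline's stage IDs may differ.
PIPELINE_STAGES = ["research", "cross-check", "source-audit", "planning",
                   "writing", "figures", "qa", "finalization"]

def stages_to_run(state_json: str) -> list[str]:
    """Return the stages still to execute, skipping ones recorded as done."""
    done = set(json.loads(state_json).get("completed_stages", []))
    return [s for s in PIPELINE_STAGES if s not in done]

state = '{"completed_stages": ["research", "cross-check"]}'
print(stages_to_run(state))  # resumes from "source-audit" onward
```

The same idea extends to sub-steps: store per-stage lists of completed agents or sections, and filter each stage's work items the same way.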
Goal: 30-50 verified references (standard) or 60-80 (deep mode) covering the field comprehensively.
Before spawning agents, the pipeline detects the paper's domain from the topic keywords (Biomedical, Chemistry, CS/AI, Physics, Materials Science, Ecology, Economics, Clinical, or General). This determines which scientific skill databases are prioritized.
Standard mode: 5 agents in parallel
| Agent | Task | Output file |
|---|---|---|
| Field Survey | 10-15 most influential papers, major research threads, recent breakthroughs | research/survey.md |
| Methodology Deep Dive | Major methodological approaches, state-of-the-art, standard benchmarks | research/methods.md |
| Empirical Evidence | Standard benchmarks, datasets, SotA results, reproducibility concerns | research/empirical.md |
| Theoretical Foundations | Formal definitions, theorems, connections to broader frameworks | research/theory.md |
| Gap Analysis | Research gaps, promising directions, proposed thesis and contribution | research/gaps.md |
The Gap Analysis agent (Agent 5) runs AFTER the first four complete, because it reads all their output.
Deep mode: 7 additional agents in parallel (with agents 1-4)
| Agent | Task | Output file |
|---|---|---|
| Recent Frontiers | Papers published 2024-2026 only, emerging trends | research/recent_frontiers.md |
| Negative Results | What didn't work, failed approaches, replication failures | research/negative_results.md |
| Cross-Disciplinary | Insights from adjacent fields | research/cross_disciplinary.md |
| Datasets & Reproducibility | All standard datasets, open-source implementations | research/datasets_reproducibility.md |
| Industry & Applied | Deployed systems, patents, gap between academic and production | research/industry_applied.md |
| Competing Hypotheses | Active scientific debates, schools of thought | research/competing_hypotheses.md |
| Intellectual Lineage | Seminal papers, how ideas evolved, paradigm shifts | research/intellectual_lineage.md |
Tool fallback chain: Every research agent is instructed to try tools in this order: domain-specific database skills (PubMed, arXiv, etc.) → Perplexity search → WebSearch → Firecrawl search → WebFetch on known URLs → research-lookup skill. Agents must try at least 3 different tools before giving up on a query.
Codex independent contribution: After all Claude research agents complete, Codex is called to independently suggest papers Claude may have missed, drawing on its own training data. Its suggestions are verified for existence before being passed to the bibliography builder.
Bibliography builder (haiku model): Reads all research files, extracts every cited paper, verifies each against CrossRef or search, generates BibTeX entries, and writes references.bib. If fewer than 25 references result (standard) or 50 (deep), additional targeted research agents are spawned for underrepresented areas.
Research log: After every tool call (success or failure), agents append an entry to research/log.md recording timestamp, agent name, tool used, exact query, result, and any URLs or DOIs found. This log is the complete provenance trail of all literature searches.
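A helper along these lines could format one such log entry. The fields match the prose description (timestamp, agent, tool, query, result, URLs/DOIs), but the exact markdown layout here is an illustrative assumption, not the pipeline's canonical format.

```python
from datetime import datetime, timezone

def log_entry(agent: str, tool: str, query: str, result: str, urls=()):
    """Format one research/log.md entry as a markdown list item."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [f"- {ts} | {agent} | {tool}",
             f"  query: {query}",
             f"  result: {result}"]
    if urls:  # URLs or DOIs found, if any
        lines.append("  urls: " + ", ".join(urls))
    return "\n".join(lines)
```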
Source extracts: For every cited paper, the agent creates research/sources/<bibtexkey>.md recording the access level (FULL-TEXT, ABSTRACT-ONLY, or METADATA-ONLY), the actual content that was read (verbatim or near-verbatim — not a summary), key findings, and provenance.
A standalone stage that reads every research file and calls Codex to cross-check each one's major claims: are they accurately represented? Are there contradicting studies or missing nuances? Are key papers overlooked?
If Codex identifies missing perspectives or inaccurate claims, a targeted follow-up research agent investigates those specific gaps. All Codex feedback and Claude's evaluations are logged to research/codex_cross_check.md.
This stage is non-skippable. The pipeline verifies the file exists before proceeding.
Goal: Ensure every cited source has a verifiable content snapshot before writing begins.
Phase 1 — Audit: Every BibTeX key is matched against research/sources/. Each paper is classified as FULL-TEXT, ABSTRACT-ONLY, or METADATA-ONLY based on its source extract file.
Phase 2 — Automated OA resolution: For each ABSTRACT-ONLY and METADATA-ONLY source, the pipeline attempts to find a free full-text copy, checking open-access sources such as the Semantic Scholar openAccessPdf field. Each successfully resolved paper has its source extract upgraded to FULL-TEXT with a new content snapshot.
Phase 3 — Human acquisition (human-in-the-loop): After automated resolution, if papers still lack full text, the pipeline pauses and presents a prioritized acquisition list sorted by how many times each paper is cited in the research files. In standard mode it presents the top 5; in deep mode it presents all abstract-only and metadata-only papers. The user can drop PDFs into attachments/ and reply "continue", or reply "skip" to proceed with abstract-level sources (which get flagged in the claims-evidence matrix).
Phase 4 — Coverage report: research/source_coverage.md is written with final counts by access level.
Immediately after Stage 1d, the pipeline attempts to build a LightRAG knowledge graph:
python scripts/knowledge.py build
This reads all research/sources/*.md files, extracts entities (papers, theories, methods, findings, authors) and relationships (cites, contradicts, supports, extends), and builds a queryable graph with semantic embeddings. The graph is stored in research/knowledge/ (gitignored).
If scripts/knowledge.py is not found or OPENROUTER_API_KEY is not set, this step is silently skipped. The pipeline works without the knowledge graph; agents fall back to reading research files directly.
With all research complete, the pipeline:
Defines the thesis and contribution in research/thesis.md — the specific problem, the novel contribution, and the key claims.
Determines paper structure from the topic type and venue. Nature-family papers get Results before Methods. IEEE and ACM papers get tighter sections to respect page limits. Survey papers use thematic analysis sections rather than IMRAD.
Creates the detailed outline in main.tex — all sections and subsections with % OUTLINE: comments containing the key arguments, planned citation keys, figure plans, and word targets per section.
Builds the Claims-Evidence Matrix in research/claims_matrix.md. Every major claim the paper will make is listed with: the type of evidence supporting it (experiment, citation, formal proof, or data), which section will present it, and whether the supporting sources are FULL-TEXT or ABSTRACT-ONLY. Claims backed only by abstract-level sources are flagged with a warning.
Codex stress-tests the claims-evidence matrix — checking whether evidence is actually sufficient for each claim, whether any claims are too strong for the evidence type, and whether obvious claims are missing.
Logs planning provenance — entries in research/provenance.jsonl for the thesis selection, section structure decisions, and each claim's evidence strategy.
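The abstract-only warning in the Claims-Evidence Matrix reduces to a simple check per claim. A sketch under assumed field names (the real matrix is a markdown file, not this dict shape):

```python
def flag_weak_claims(matrix):
    """Return IDs of claims whose supporting sources include no
    FULL-TEXT extract; these get the warning flag described above."""
    return [row["id"] for row in matrix
            if "FULL-TEXT" not in row["sources_access"]]

matrix = [
    {"id": "C1", "evidence": "citation", "sources_access": ["FULL-TEXT"]},
    {"id": "C2", "evidence": "citation", "sources_access": ["ABSTRACT-ONLY"]},
]
print(flag_weak_claims(matrix))  # -> ['C2']
```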
A standalone stage that calls codex_plan to challenge the paper plan before any writing begins:
Results go to research/codex_thesis_review.md. If Codex identifies structural problems, they are fixed in the outline before proceeding.
Only runs when depth is "deep". Now that the thesis and claims are finalized, three agents run a surgical second research pass:
| Agent | Task | Output file |
|---|---|---|
| Supporting Evidence | 3-5 papers providing direct evidence for each key claim | research/targeted_support.md |
| Counterarguments | Papers that contradict or challenge the claims, strongest reviewer objections | research/targeted_counter.md |
| Methodological Precedents | Papers using similar methods, known limitations, best practices | research/targeted_methods.md |
The bibliography builder runs again after to incorporate new references.
A standalone stage that searches for existing work making the same contribution before writing begins. This catches the worst case: investing hours in writing a paper that already exists.
The pipeline searches arXiv, Semantic Scholar, domain-specific databases, and (if available) calls Codex to check its own training data. Results are classified:
Results are written to research/novelty_check.md.
Each section gets its own dedicated Opus-tier agent. Writing is sequential (each section can reference prior ones); research and review run in parallel.
Deep mode per-section literature searches: Before each writing agent, a haiku-tier agent quickly searches for literature specific to that section's topics and writes a targeted reference list to research/section_lit_[section].md. The writing agent then reads this alongside the main research files.
Writing order and targets:
| Order | Section | Key guidance |
|---|---|---|
| 1 | Introduction | Broad context → specific problem → contribution → findings preview → paper organization. 8-12 citations. |
| 2 | Related Work | Organize thematically in 3-5 subsections. 3-5 papers per theme. Explicit positioning of this work. 15-20 citations. |
| 3 | Methods/Approach | Exhaustive: full detail for reproduction, math in equation/align environments, pseudocode if applicable, design rationale. |
| 4 | Results/Experiments | Setup → quantitative results (tables) → ablations → qualitative analysis. At least 2 booktabs tables. |
| 5 | Discussion | Interpret findings, compare with prior work, limitations (written honestly), broader implications, future work. |
| 6 | Conclusion | Concise. Restate problem, summarize approach, highlight key results with numbers. No new information. |
| 7 | Abstract | Written last, reads the full paper first. Specific, quantitative, self-contained. |
Codex-authored Limitations: After the Discussion section is written, Codex drafts the Limitations subsection from an adversarial perspective (methodological assumptions, data limitations, scope limitations, threats to validity). Claude evaluates each point, pushes back where warranted, and the Discussion agent integrates the agreed-upon content.
After each section: Claude assesses whether the section is substantively complete. If it is thin, missing citations, or leaving obvious gaps, an expansion agent (Opus tier) adds depth. Then Codex does a quick spot-check of the section for logic, evidence proportionality, and gaps. In deep mode, Codex also identifies substantive content gaps for a further expansion round.
Knowledge graph queries: If the knowledge graph exists, each writing agent queries it for section-specific evidence, checks for contradictions, and ensures comprehensive coverage using the CLI commands.
Provenance logging: After writing each paragraph, the agent appends a provenance entry to research/provenance.jsonl recording the paragraph target (section/pN), which source extracts informed it, which claims it supports, and a reasoning field explaining the writing choices.
Step 4a — Data-driven figures with Praxis (if available):
If vendor/praxis/scripts/ exists and attachments/ contains data files, a data analysis agent auto-detects the characterisation technique and runs Praxis analysis:
import sys
sys.path.insert(0, "vendor/praxis/scripts")
# apply_style and set_palette are provided by the vendored Praxis scripts
apply_style("<venue_style>")  # venue-matched figure style
set_palette("okabe_ito")      # colourblind-safe palette; technique-specific metrics follow
Figures are exported as PDF to figures/, scripts are saved to figures/scripts/ for reproducibility, and quantitative results are inserted into the Results and Methods sections of main.tex.
If Praxis is not installed but data files exist, the pipeline falls back to generic matplotlib using the matplotlib skill.
Step 4b — Structural figures and tables:
A visualization agent ensures the paper has:
Step 4c — Codex figure and claims audit:
Codex audits whether captions accurately describe what is shown, whether surrounding text claims match what the data actually shows, and whether there are misleading axis scales, cherry-picked comparisons, or missing error bars.
This stage loops until all quality criteria pass, up to 5 iterations (standard) or 8 (deep mode). Stale review files are deleted before each iteration so reviewers evaluate the latest version.
Step 5a — Parallel review (3 agents):
| Reviewer | Focus |
|---|---|
| Technical Reviewer | Claims supported by evidence, methodology sound and reproducible, results properly analyzed, argument coherent from Introduction to Conclusion |
| Writing Quality Reviewer | No bullet points in body text, every paragraph complete, transitions smooth, terminology consistent, tense correct, section word counts |
| Citation and Completeness Reviewer | All citation keys exist in .bib, uncited claims flagged, no placeholder text, all cross-references resolve, LaTeX compilation |
Step 5a-ii — Codex adversarial review (parallel with agents):
While the 3 review agents run, Codex is called directly as a 4th adversarial reviewer:
You are an adversarial peer reviewer. Find the weakest points: claims that exceed the evidence, logical gaps in the argument chain, methodological shortcuts, missing baselines or unfair comparisons, conclusions that don't follow from results.
Result goes to reviews/codex_adversarial.md. The pipeline verifies all 4 review files exist before proceeding.
Step 5b — Synthesize: All review files are read and a prioritized fix list is built (CRITICAL first, then MAJOR, word count shortfalls, missing citations).
Step 5c — Revision agent (Opus tier): Fixes every critical and major issue, expands thin sections with substantive content, adds verified citations, and compiles. Content that should be removed is cut, with the removed text archived in provenance/cuts/.
Step 5d — Quality gate: All criteria in the table below must pass. If any fail, loop back to Step 5a. In deep mode, after the final passing iteration, Codex does one additional deep review looking for subtle issues that earlier reviews missed.
Quality gate criteria:
| Criterion | Requirement |
|---|---|
| Sections substantively complete | No obvious gaps, thin arguments, or missing depth |
| Claims-Evidence Matrix | Every claim has status "Supported" |
| References in .bib | 25+ verified entries |
| Placeholder text | Zero (no TODO/TBD/FIXME/lipsum) |
| LaTeX compilation | No errors |
| Tables | 2+ with booktabs |
| Cross-references | All \ref{} resolve |
| Citation keys | All \citep{}/\citet{} exist in .bib |
| Body text | Full paragraphs only, no bullet points |
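Two of the gate criteria (citation keys and placeholder text) are mechanically checkable. A minimal sketch, assuming standard natbib citations and `@type{key,` BibTeX entries; it deliberately ignores edge cases like citations with optional arguments:

```python
import re

def missing_citation_keys(tex: str, bib: str) -> list[str]:
    """Return \\citep/\\citet keys used in the .tex source that have
    no matching entry in the .bib file."""
    used = set()
    for m in re.finditer(r"\\cite[pt]\{([^}]*)\}", tex):
        used.update(k.strip() for k in m.group(1).split(","))
    defined = set(re.findall(r"@\w+\{([^,]+),", bib))
    return sorted(used - defined)

tex = r"As shown \citep{smith2024, jones2023} and \citet{lee2025}."
bib = "@article{smith2024,\n title={X}}\n@misc{lee2025,\n title={Y}}"
print(missing_citation_keys(tex, bib))  # -> ['jones2023']
```

A placeholder check is analogous: scan the body text for `TODO`, `TBD`, `FIXME`, or `lipsum` and fail the gate on any hit.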
After the QA loop exits (all criteria met), two final audits run in parallel:
Consistency Checker: Finds and fixes notation inconsistencies, terminology drift, abbreviations defined twice or used before definition, tense inconsistencies, and reference format inconsistencies (Figure vs Fig., Section vs Sec.).
Claims Auditor: Flags overclaims — "novel"/"first" without evidence, "significantly" without a statistical test, "prove"/"demonstrate" with only experiments, unsupported factual claims, generalizations from limited experiments. Critical and major overclaims are softened in main.tex.
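A crude first pass over the overclaim categories can be done with pattern matching before any model judgment is applied. The patterns below are an illustrative subset, not the auditor's actual rules:

```python
import re

# Illustrative subset of the overclaim triggers described above.
OVERCLAIM_PATTERNS = {
    "novelty": r"\b(novel|first)\b",        # needs supporting evidence
    "statistics": r"\bsignificantly\b",      # needs a statistical test
    "proof": r"\bprove[sd]?\b",              # suspect if only experiments exist
}

def flag_overclaims(text: str):
    """Return (category, matched word) pairs for a human or model to triage."""
    hits = []
    for label, pat in OVERCLAIM_PATTERNS.items():
        for m in re.finditer(pat, text, flags=re.IGNORECASE):
            hits.append((label, m.group(0)))
    return hits
```

Each hit still needs contextual judgment (e.g. "significantly" may be backed by a test two sentences later), which is why the flagged spans go to a reviewing agent rather than being edited automatically.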
After both complete, a Reproducibility Checker (haiku tier) verifies that Methods includes all hyperparameters, training details, compute resources, dataset descriptions, evaluation metric definitions, and random seed/variance information.
A reference validation agent (haiku tier) verifies every BibTeX entry is a real publication:
- Checks each entry in references.bib against external records (CrossRef or search)
- Removes unverifiable entries from references.bib and their citations from main.tex

In parallel, Codex independently verifies a random sample of 10-15 references. If Codex flags a reference that Claude's validator marked as verified, both findings are investigated further.
This stage is non-negotiable. Fabricated references are the primary risk in AI-assisted writing.
Codex's codex_risk_radar tool assesses the complete manuscript across five dimensions:
| Dimension | Description |
|---|---|
| Scientific risk | Claims that could be proven wrong or are unfalsifiable |
| Ethical risk | Data handling, consent, bias, or dual-use concerns |
| Reputational risk | Anything that could embarrass the authors under scrutiny |
| Reproducibility risk | Whether an independent team could reproduce the results |
| Novelty risk | Whether the contribution is incremental enough for desk rejection |
Each dimension is rated LOW / MEDIUM / HIGH. HIGH-risk items must be addressed before finalization. MEDIUM-risk items are flagged for the user.
Final polish (Opus tier): Reviews the complete manuscript end-to-end: abstract accuracy, title specificity, introduction's paper-organization paragraph matching actual sections, conclusion referencing actual results with numbers, redundancies removed, formatting consistent.
Lay summary (haiku tier): Generates a 200+ word plain-language summary (high school reading level) and a 2-3 sentence elevator pitch, written to research/summaries.md. If the venue requires a lay summary (Nature, medical journals), it is added to main.tex.
De-AI polish (Opus tier): Scans for and removes AI writing patterns across seven categories: filler phrases, AI vocabulary, formulaic transitions, redundant phrasing, empty emphasis, em dashes, and structural tells.
Provenance report (haiku tier): Reads the complete research/provenance.jsonl and generates research/provenance_report.md — a human-readable report of every action taken, organized by section and stage, with paragraph-level histories.
Archive: The /archive command runs automatically to bundle all artifacts into a browsable archive/ directory with an indexed README.md.
Codex collaboration stats: codex_stats is called to report how the two AI systems collaborated throughout the pipeline.
After the pipeline completes, /auto runs additional autonomous improvement iterations on the finished paper. It does not re-run the pipeline — it improves what exists.
/auto # run 1 iteration
/auto 3 # run 3 iterations
/auto --continue # resume from last completed iteration
Each iteration runs a 4-phase cycle. Like /write-paper, /auto uses the split-phase pattern: each phase reads its instructions from a dedicated file in pipeline/ (auto-phase-1-assessment.md through auto-phase-4-verification.md) to avoid context degradation in long sessions.
| Agent | Focus |
|---|---|
| Depth and Evidence | Which arguments are thin, which claims need stronger evidence, logical leaps, redundant paragraphs that should be cut |
| Structure and Flow | Argument flow between sections, disproportionate sections, reordering opportunities, front-loading of key insights |
| Competitive Positioning | Differentiation from closest prior work, fairness of comparisons, missing recent papers (2024-2026), field familiarity |
| Writing and Polish | Em dashes, AI writing patterns, sentence length variety, vague statements, over- or under-claiming |
Codex also runs in parallel (if available) to identify the 5 highest-impact improvements with fresh eyes.
All assessment files are read, similar findings are deduplicated, and the top 5 actions for this iteration are selected by impact. Actions are a deliberate mix of strengthening, cutting, structural improvements, and polish. If 3 or fewer meaningful actions are found across all reviewers, early stop is triggered immediately.
The action plan is written to reviews/auto_iter[N]_plan.md with each action typed, targeted to a specific location, and annotated with whether research is needed.
Research phase (max 3 queries): If any selected action requires new evidence, a targeted research agent runs up to 3 searches.
Revision phase (Opus tier): Executes every action in the plan. For cuts, the removed text is saved to provenance/cuts/[section]-[pN]-auto[N].tex before deletion. After all changes, LaTeX is compiled and errors are fixed.
A lightweight verification agent confirms:
- All citation keys exist in the .bib file
- All \ref{} cross-references resolve

If changes_made < 3 after an iteration, the loop stops. The paper has stabilized; additional iterations would introduce noise rather than improvement. This is logged explicitly as a success condition.
During /auto iterations:

- The thesis does not change (research/thesis.md is fixed)
- Every action is logged to research/provenance.jsonl with iteration: N
- All cut content is archived to provenance/cuts/

Every word in the final paper is traceable. The provenance system consists of two components: the machine-readable ledger and the human-readable report.
research/provenance.jsonl is an append-only log of every action taken during both the initial pipeline and all /auto iterations. One JSON object per line.
Entry schema:
{
"ts": "2026-03-24T14:32:11Z",
"stage": "3",
"agent": "section-writing-methods",
"action": "write",
"target": "methods/p3",
"reasoning": "Establishes the core architectural choice — why attention over convolution — citing smith2024 for the theoretical motivation and jones2023 for empirical evidence of the trade-off in this regime.",
"sources": ["smith2024", "jones2023"],
"claims": ["C2"],
"feedback_ref": null,
"diff_summary": null,
"iteration": 0
}
Required fields: ts, stage, agent, action, target, reasoning
Conditional fields:
- sources — BibTeX keys informing this action (required for write, add, expand)
- claims — claim IDs from the claims matrix this action supports
- feedback_ref — pointer to the review feedback that triggered this action (required for revise and cut during QA)
- diff_summary — one-line description of what changed (required for revise, cut, expand)
- archived_to — path where cut content is saved (required for cut)
- iteration — 0 for the initial pipeline, 1+ for /auto iterations

Actions: write, revise, cut, add, expand, reorder, research, plan
Paragraph targeting: [section]/p[N] (e.g., introduction/p1, methods/p5). For subsections: methods/training-procedure/p2. For splits: methods/p3a and methods/p3b.
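The required/conditional field rules above can be expressed as a small validator. A sketch: the mapping mirrors the rules as written (the "during QA" qualifier for feedback_ref is simplified to "always for revise and cut"):

```python
REQUIRED = {"ts", "stage", "agent", "action", "target", "reasoning"}

# Conditional fields per action, per the rules above (simplified).
CONDITIONAL = {
    "write": {"sources"},
    "add": {"sources"},
    "expand": {"sources", "diff_summary"},
    "revise": {"feedback_ref", "diff_summary"},
    "cut": {"feedback_ref", "diff_summary", "archived_to"},
}

def validate_entry(entry: dict) -> list[str]:
    """Return the names of fields missing from one provenance.jsonl entry."""
    needed = REQUIRED | CONDITIONAL.get(entry.get("action", ""), set())
    return sorted(needed - entry.keys())
```

Running this over every line of the ledger is one way `/provenance gaps` could detect malformed or incomplete entries.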
Whenever content is removed from main.tex, it is first saved to provenance/cuts/[section]-[pN]-[context].tex. The provenance entry for the cut records the archived_to path. Content is never deleted without archiving.
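The archive-before-delete step might look like this. A minimal sketch of the naming convention described above; the real pipeline's implementation may differ:

```python
from pathlib import Path

def archive_cut(section: str, paragraph: str, context: str, text: str,
                root: Path = Path("provenance/cuts")) -> Path:
    """Save removed text to provenance/cuts/[section]-[pN]-[context].tex
    and return the path to record in the entry's archived_to field."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{section}-{paragraph}-{context}.tex"
    path.write_text(text)
    return path
```

The returned path goes into the cut's provenance entry, so the ledger and the archive stay in sync.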
Query the provenance ledger interactively with the /provenance command:
| Query form | What it returns |
|---|---|
| /provenance | Full summary: total actions by type, coverage (which paragraphs are traced), source utilization |
| /provenance methods | Provenance for every paragraph in the Methods section: who wrote it, sources, reasoning, revision history |
| /provenance trace C3 | Full chain for claim C3: which paragraphs support it, what sources provide evidence, writing reasoning, any revisions |
| /provenance history methods/p3 | Complete history of one paragraph: original writing, every revision with feedback reference |
| /provenance sources smith2024 | Every place smith2024 was used: which paragraphs, what content, whether for claims, background, or methodology |
| /provenance gaps | Paragraphs with no provenance entry, claims with no linked provenance, sources never referenced in provenance |
| /provenance timeline | Chronological view of all actions; add a stage name to filter |
At the end of Stage 6 and after each /auto run, research/provenance_report.md is generated as a human-readable summary of the entire provenance ledger, organized by section with paragraph-level histories, a cuts archive section, and a list of any untraced content.
An optional LightRAG-based knowledge graph built from source extracts. It enables semantic search across all sources, contradiction detection, and claim-level evidence queries.
export OPENROUTER_API_KEY=your-key
python scripts/knowledge.py build
Reads all research/sources/*.md files, extracts entities and relationships using Gemini Flash (via OpenRouter), and builds a queryable graph with Qwen3 8B semantic embeddings. Stored in research/knowledge/ (gitignored — rebuilds from sources).
The pipeline builds the graph automatically after Stage 1d if the prerequisites are met.
python scripts/knowledge.py query "how do transformer architectures handle long sequences?"
python scripts/knowledge.py contradictions
python scripts/knowledge.py evidence-for "attention is more parameter-efficient than convolution"
python scripts/knowledge.py evidence-against "scaling laws hold at all model sizes"
python scripts/knowledge.py entities
python scripts/knowledge.py relationships "attention mechanism"
| Command | Output |
|---|---|
| query "question" | Freeform semantic search; synthesized answer with source citations |
| contradictions | Conflicting claims across sources; saved to research/knowledge_contradictions.md |
| evidence-for "claim" | Sources and findings supporting the claim |
| evidence-against "claim" | Sources and findings challenging or contradicting the claim |
| entities | All extracted concepts, theories, methods, papers, authors grouped by type |
| relationships "entity" | How a concept connects to other entities in the graph |
The /knowledge slash command wraps these operations interactively. If the graph does not exist, it offers to build it.
| Command | Description |
|---|---|
| /write-paper \<topic\> | Full autonomous pipeline: research → outline → writing → figures → QA → finalization. 1-4 hours standard, 3-8 hours deep. |
| /auto [N] | Run N improvement iterations on a completed paper (default 1). Use --continue to resume. |
| /preview-pipeline | Dry run of /write-paper: shows each stage, what will RUN vs SKIP, model selections, time estimate. Executes nothing. |
| /status | Progress dashboard: word counts per section vs targets, reference count, figure count, pipeline stage, LaTeX compilation status. Also writes .paper-progress.txt. |
| Command | Description |
|---|---|
| /search-literature \<query\> | Find relevant papers for a topic using the domain-appropriate skill databases and tool fallback chain. |
| /add-citation \<DOI or title\> | Add a properly formatted BibTeX entry to references.bib after verifying the paper exists. |
| /ingest-papers | Import PDFs from attachments/, extract metadata and content snapshots, generate BibTeX entries, write source extracts. |
| /cite-network | Analyze citation patterns: distribution by section and year, temporal coverage, venue diversity, author diversity, orphan detection, gap identification with suggested additions. |
| /ask \<question\> | Query research artifacts to answer questions. Searches research/sources/, research/, reviews/, main.tex, references.bib, and research/log.md, providing the provenance trail for each answer. |
| /knowledge [operation] | Interact with the LightRAG knowledge graph: query, contradictions, evidence-for, evidence-against, entities, relationships. Builds the graph if needed. |
| /export-sources | Export source extracts and references to the shared knowledge base (~/.research-agent/shared-sources/) for reuse across papers. |
| /import-sources [topic] | Import relevant sources from the shared knowledge base into the current paper. Uses the .paper.json topic if omitted. |
| /audit-sources | Retroactive source coverage audit: classifies all references by access level, attempts OA resolution for abstract-only sources, generates an acquisition list. Standalone version of Stage 1d. |
| Command | Description |
|---|---|
| `/init <topic>` | Quick single-pass paper generation without the full multi-stage pipeline. Useful for drafts or shorter documents. |
| `/outline <section>` | Generate a structured outline for a specific section with key arguments, planned citations, and subsection organization. |
| `/revise-section <section>` | Rewrite a section based on provided feedback, maintaining consistency with the rest of the paper. |
| Command | Description |
|---|---|
| `/review` | Comprehensive manuscript quality review covering technical soundness, writing quality, and citation completeness. |
| `/check-consistency` | Find and fix notation inconsistencies, terminology drift, abbreviations used before definition or defined twice, and reference format inconsistencies. |
| `/audit-claims` | Flag overclaims — "novel"/"first" without evidence, "significantly" without statistical tests, "prove" based only on experiments, unsupported factual claims. |
| `/check-citations` | Verify every citation via the CrossRef API or search, fix metadata mismatches, remove fabricated entries, and attempt OA resolution for newly verified papers. |
| `/novelty-check [contribution]` | Verify the paper's contribution hasn't been published. Uses multiple databases plus Codex cross-model verification. Returns NOVEL, PARTIALLY NOVEL, or NOT NOVEL. |
| `/de-ai-polish [section]` | Remove AI writing patterns across 7 categories: filler phrases, AI vocabulary, formulaic transitions, redundant phrasing, empty emphasis, em dashes, and structural tells. |
| `/reproducibility-checklist` | Check Methods completeness against a structured checklist (general scientific, ML-specific if applicable, ethical considerations). Reports YES/NO/N/A per item with section references. |
| `/codex-review [section]` | On-demand adversarial review from OpenAI Codex via `codex_plan`, `codex_review`, and `codex_ask`. Requires codex-bridge. |
| `/codex-telemetry [export]` | Analyze Codex interaction patterns: agreement rates, tool usage breakdown, disagreement hotspots, timeline. Use `export` to write the report to a file. |
| `/health` | Diagnose pipeline prerequisites and optional integrations (LaTeX, API keys, knowledge graph, Codex, Praxis). Reports status, detail, and impact for each check. |
| `/compile` | Compile LaTeX to PDF via `latexmk -pdf -interaction=nonstopmode main.tex` and report errors. |
| Command | Description |
|---|---|
| `/analyze-data <file>` | Statistical analysis on datasets in `attachments/`; generates publication figures via matplotlib or Praxis. |
| `/praxis-analyze [file or technique]` | Technique-specific analysis via Praxis (auto-detects from data): XRD, DSC, TGA, FTIR, Raman, XPS, EIS, mechanical testing, VSM, UV-Vis, BET, and more. Venue-matched journal figure styles. |
| Command | Description |
|---|---|
| `/provenance [mode]` | Query the provenance ledger. Modes: `summary` (default), a section name, `trace <claim-id>`, `history <target>`, `sources <bibtex-key>`, `gaps`, `timeline [stage]`. |
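The ledger behind `/provenance` is an append-only JSONL file (`research/provenance.jsonl`). A minimal sketch of how a `trace <claim-id>` query could be answered; the record fields used here (`claim_id`, `stage`) are illustrative assumptions, not the pipeline's actual schema:

```python
import json
from pathlib import Path

def trace_claim(ledger_path: str, claim_id: str) -> list[dict]:
    """Return every ledger record mentioning a claim, oldest first.

    The ledger is append-only, so file order is already chronological.
    """
    records = []
    for line in Path(ledger_path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines between appends
        record = json.loads(line)
        if record.get("claim_id") == claim_id:
            records.append(record)
    return records
```

Because each line is an independent JSON object, a query never needs to parse the whole file as one document, which is what makes the append-only design cheap.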
| Command | Description |
|---|---|
| `/lay-summary` | Generate a 200+ word plain-language summary, a 2-3 sentence elevator pitch, and (where required by venue) a lay summary for inclusion in the manuscript. |
| `/archive` | Bundle all research artifacts into a browsable `archive/` directory with a README index. Auto-runs at the end of `/write-paper`. |
| `/prepare-submission` | Generate submission package: anonymized version (for blind-review venues), camera-ready version, cover letter, response to reviewers (if reviews exist), and a submission checklist. |
| `/respond-to-reviewers` | Generate a structured point-by-point response to peer reviewer comments with tracked changes in the manuscript. |
| `/make-slides` | Generate a presentation slide deck from the paper (structured markdown with speaker notes, calibrated to venue talk length). |
| `/make-blog-post` | Generate a 1500-3000 word blog post explaining the paper for a general technical audience. |
| `/make-deliverables` | Generate all deliverables in parallel: lay summary, slide deck, and blog post (3 simultaneous agents). |
| `/prisma-flowchart` | Generate a PRISMA 2020 flowchart from the research log and add it to the manuscript. |
| `/clean` | `latexmk -c` to remove LaTeX build artifacts. Add the `all` argument to also remove `research/`, `reviews/`, `archive/`, and pipeline state files (never removes `main.tex`, `references.bib`, `figures/`, `attachments/`, `.paper.json`, `.venue.json`). |
Install with `npm i -g codex-bridge`. When present, `create-paper` auto-configures it via `codex-bridge init`. All integration is graceful — if not installed, every step that uses it is silently skipped.
Codex (OpenAI) contributes at 11 points in the pipeline:
| Pipeline point | What Codex does |
|---|---|
| Stage 1 (after research agents) | Independent literature contribution — papers Claude may have missed |
| Stage 1c | Cross-checks every research file for inaccurate representations or missing nuances |
| Stage 2 (claims-evidence matrix) | Challenges whether evidence actually supports each claim |
| Stage 2b | Stress-tests the contribution statement and argument structure |
| Stage 3 (Limitations) | Drafts the Limitations subsection from an adversarial perspective |
| Stage 3 (each section) | Quick spot-check after each writing agent; in deep mode, also identifies content gaps |
| Stage 4c | Audits whether figures and surrounding text claims match |
| Stage 5 | Adversarial peer review as a 4th parallel reviewer |
| Post-QA | Independent reference verification of a random sample |
| Post-QA | Risk radar assessment across 5 dimensions |
| Stage 6 | Collaboration statistics report |
Deliberation protocol: Codex feedback is never blindly accepted. For every Codex interaction, Claude evaluates each point as AGREE, PARTIALLY AGREE, or DISAGREE, with explicit reasoning. On DISAGREE, Claude sends a rebuttal with specific counterarguments. Codex gets one response. If still unresolved, both perspectives are logged in reviews/codex_deliberation_log.md for the user to decide. Neither side silently wins.
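The deliberation protocol above is essentially a one-round negotiation loop. A hedged sketch of its control flow; the `Verdict` type, function names, and log format are invented for illustration and are not the pipeline's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    point: str
    stance: str        # "AGREE" | "PARTIALLY AGREE" | "DISAGREE"
    reasoning: str
    resolved: bool = True

def deliberate(verdicts, send_rebuttal, log):
    """One-round deliberation over Codex feedback points.

    send_rebuttal(point, reasoning) -> True if Codex concedes.
    Unresolved disagreements are logged for the user, never dropped.
    """
    accepted = []
    for v in verdicts:
        if v.stance in ("AGREE", "PARTIALLY AGREE"):
            accepted.append(v.point)
        else:
            # DISAGREE: exactly one rebuttal, exactly one response
            if send_rebuttal(v.point, v.reasoning):
                accepted.append(v.point)
            else:
                v.resolved = False
                log.append(f"UNRESOLVED: {v.point} -- {v.reasoning}")
    return accepted
```

The key property is the last branch: a standoff produces a log entry rather than a silent winner, matching the "neither side silently wins" rule.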
Auto-cloned as a git submodule at vendor/praxis/ by create-paper. Provides:
Figure styles are matched to the target venue via `.venue.json`. The pipeline auto-detects data files in `attachments/` and uses Praxis at Stage 4. Analysis scripts are saved to `figures/scripts/` for reproducibility. Install dependencies with `pip install -r vendor/praxis/requirements.txt`.
Cloned as a git submodule at `vendor/claude-scientific-skills/` and symlinked to `.claude/skills/`. Skills are markdown files that agents read to guide their behavior. Used throughout the pipeline by naming them in agent prompts (e.g., "invoke the scientific-writing skill").
Skill categories include:

- Writing and review: `scientific-writing`, `peer-review`, `scientific-critical-thinking`, `citation-management`
- Literature databases: `pubmed-database`, `arxiv-database`, `openalex-database`, `chembl-database`, `uniprot-database`, `clinicaltrials-database`, and 30+ more
- Analysis and ML: `statistical-analysis`, `exploratory-data-analysis`, `scikit-learn`, `transformers`, `pytorch-lightning`
- Visualization: `matplotlib`, `seaborn`, `plotly`, `scientific-visualization`
- Domain toolkits: `rdkit`, `biopython`, `scanpy`, `pymatgen`, `qiskit`, and 100+ more

At the start of Stage 1, the pipeline analyzes the topic to detect its domain. Domain detection matches indicator keywords:
| Domain | Indicators | Priority databases |
|---|---|---|
| Biomedical / Life Sciences | gene, protein, cell, disease, clinical, drug, genomic, cancer | PubMed, bioRxiv, UniProt, KEGG, Reactome, PDB |
| Chemistry / Drug Discovery | molecule, compound, synthesis, binding, SMILES, ADMET | PubChem, ChEMBL, DrugBank, ZINC, BindingDB |
| Computer Science / AI/ML | neural, network, model, training, transformer, LLM, NLP | arXiv, HuggingFace |
| Physics / Quantum | quantum, particle, field, relativity, optics | arXiv, NIST |
| Materials Science | crystal, material, alloy, polymer, semiconductor | arXiv, Materials Project |
| Ecology / Geospatial | ecology, climate, satellite, geographic, biodiversity | GBIF, geospatial databases |
| Economics / Finance | market, economic, stock, GDP | FRED, Alpha Vantage, EDGAR |
| Clinical / Medical | patient, treatment, trial, diagnosis, hospital | ClinicalTrials.gov, FDA databases |
| General (default) | (all others) | arXiv, OpenAlex |
All domains also receive universal skills: `scientific-writing`, `citation-management`, `peer-review`, `statistical-analysis`, `scientific-visualization`, `matplotlib`, `seaborn`, `plotly`.
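The keyword matching described above can be sketched as a simple scoring pass over the topic string. The dictionary below covers only a subset of the table's domains, and the key names are illustrative, not the pipeline's internal identifiers:

```python
DOMAIN_KEYWORDS = {
    "biomedical": ["gene", "protein", "cell", "disease",
                   "clinical", "drug", "genomic", "cancer"],
    "chemistry":  ["molecule", "compound", "synthesis",
                   "binding", "smiles", "admet"],
    "cs_ai_ml":   ["neural", "network", "model", "training",
                   "transformer", "llm", "nlp"],
    "physics":    ["quantum", "particle", "field", "relativity", "optics"],
    "materials":  ["crystal", "material", "alloy", "polymer", "semiconductor"],
}

def detect_domain(topic: str) -> str:
    """Pick the domain whose indicator keywords match the topic most often."""
    text = topic.lower()
    scores = {
        domain: sum(1 for kw in keywords if kw in text)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No indicator matched at all: fall through to the default domain
    return best if scores[best] > 0 else "general"
```

A topic that matches nothing falls through to "general", mirroring the default row in the table.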
These rules are enforced across all agents in the pipeline. Violations are caught in QA review.
- Citations: `\citep{key}` for parenthetical and `\citet{key}` for narrative citations. Keys use the `firstauthorlastnameYear` format.
- Figures: save to `figures/`, reference with `\includegraphics{figures/filename}`. Prefer vector formats (PDF, EPS); use PNG only for photographs.
- Tables: use the `booktabs` package (`\toprule`, `\midrule`, `\bottomrule`). Minimum 2 tables.
- Cross-references: use `\label{}` and `\ref{}` consistently. All `\ref{}` calls must resolve.
- Placeholders: remove all `\lipsum`, TODO, TBD, and FIXME markers before finalizing.
- Claims: track every claim in `research/claims_matrix.md`. Every claim must reach "Supported" status to pass QA.
- Provenance: log to `research/provenance.jsonl`.
- Venue-aware length: if `.venue.json` has a page limit, scale section word targets proportionally.

Three model tiers with 1M context windows are used throughout the pipeline. The `[1m]` suffix is required: the shorthands "opus" and "sonnet" resolve to standard-context models, not the 1M variants.
| Tier | Model ID | Used for |
|---|---|---|
| Opus 1M | `claude-opus-4-6[1m]` | Writing agents, revision agents, expansion agents, gap analysis, de-AI polish, final polish — anything requiring deep reasoning, synthesis, or prose quality |
| Sonnet 1M | `claude-sonnet-4-6[1m]` | Research agents, review agents, data analysis, figures, assessment agents in `/auto`, verification agents — tasks requiring tool use, search, and structured evaluation |
| Haiku | `haiku` | Bibliography building, reference validation, lay summary, reproducibility checklist, per-section literature searches in deep mode, provenance report generation — mechanical lookup and formatting tasks |
The 1M context windows allow agents to read entire manuscripts, all research files, full bibliographies, and all review feedback without hitting context limits.
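Tier selection reduces to a task-to-tier lookup plus a tier-to-model mapping. A sketch with invented task names, showing why the `[1m]` suffix belongs in the mapping itself rather than being appended ad hoc:

```python
TIER_MODELS = {
    "opus":   "claude-opus-4-6[1m]",
    "sonnet": "claude-sonnet-4-6[1m]",
    "haiku":  "haiku",
}

# Task names here are illustrative, not the pipeline's real task IDs.
TASK_TIERS = {
    "writing":      "opus",
    "revision":     "opus",
    "research":     "sonnet",
    "review":       "sonnet",
    "bibliography": "haiku",
    "lay_summary":  "haiku",
}

def model_for(task: str) -> str:
    """Resolve a pipeline task to its full model ID.

    Keeping the [1m] suffix inside TIER_MODELS means no call site can
    accidentally spawn a standard-context "opus" or "sonnet" variant.
    """
    tier = TASK_TIERS.get(task, "sonnet")  # sonnet as a safe default
    return TIER_MODELS[tier]
```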
research-agent/ (this repository)
├── create-paper Bash script: stamps out new paper projects
├── write-paper Bash script: launcher for the autonomous pipeline
├── sync-papers Bash script: migrate/update existing projects to use symlinks
├── template/
│ ├── claude/
│ │ ├── CLAUDE.md Project instructions, writing rules, command reference
│ │ ├── settings.local.json Tool permissions for autonomous operation
│ │ ├── commands/ All 35 slash commands (symlinked into each paper project)
│ │ └── pipeline/ Stage-specific instructions (read on-demand per stage/phase)
│ │ ├── stage-1-research.md through stage-6-finalization.md
│ │ ├── auto-phase-1-assessment.md through auto-phase-4-verification.md
│ │ └── shared-protocols.md
│ ├── scripts/ Utility scripts (knowledge.py, etc.)
│ ├── venues/ Venue configuration JSON files
│ │ ├── generic.json
│ │ ├── ieee.json
│ │ ├── acm.json
│ │ ├── neurips.json
│ │ ├── nature.json
│ │ ├── arxiv.json
│ │ └── apa.json
│ ├── main.tex LaTeX template (overwritten by venue-specific generation)
│ ├── references.bib Empty bibliography template
│ └── gitignore Standard gitignore for paper projects
├── tests/ Test suite (run_all.sh, test_structure.sh, test_prompts.sh, test_schema.py)
├── .github/workflows/ci.yml CI via GitHub Actions
└── vendor/ External dependencies (submodules)
Generated paper project structure:
my-paper/
├── main.tex LaTeX document, venue-formatted
├── references.bib BibTeX bibliography
├── .paper.json Paper topic, venue, authors, depth, config
├── .venue.json Venue formatting rules (copied from template)
├── .paper-state.json Pipeline checkpoint state (created at runtime)
├── .paper-progress.txt Human-readable progress monitor (tail from another terminal)
├── figures/ Generated figures (PDF, PNG)
│ └── scripts/ Figure generation scripts (for reproducibility)
├── attachments/ User-provided PDFs, datasets, supplementary materials
├── research/ Literature research outputs (created by pipeline)
│ ├── sources/ Raw source extracts per cited paper (<bibtexkey>.md)
│ ├── knowledge/ LightRAG knowledge graph (gitignored, rebuilds from sources)
│ ├── log.md Complete research provenance log
│ ├── provenance.jsonl Machine-readable provenance ledger (append-only)
│ ├── provenance_report.md Human-readable provenance summary (generated at end)
│ ├── survey.md Field survey (Stage 1, Agent 1)
│ ├── methods.md Methodology deep dive (Stage 1, Agent 2)
│ ├── empirical.md Empirical evidence (Stage 1, Agent 3)
│ ├── theory.md Theoretical foundations (Stage 1, Agent 4)
│ ├── gaps.md Gap analysis and thesis proposal (Stage 1, Agent 5)
│ ├── thesis.md Thesis and contribution statement (Stage 2)
│ ├── claims_matrix.md Claims-evidence matrix (Stage 2)
│ ├── novelty_check.md Novelty verification report (Stage 2d)
│ ├── source_coverage.md Source access level audit (Stage 1d)
│ ├── codex_cross_check.md Codex research cross-check (Stage 1c)
│ ├── codex_thesis_review.md Codex thesis stress-test (Stage 2b)
│ ├── reference_validation.md Reference verification report (Post-QA)
│ ├── reproducibility_checklist.md Reproducibility checklist (Post-QA)
│ └── summaries.md Lay summary and elevator pitch (Stage 6)
├── reviews/ Review feedback (created during QA)
│ ├── technical.md Technical reviewer output
│ ├── writing.md Writing quality reviewer output
│ ├── completeness.md Citation and completeness reviewer output
│ ├── codex_adversarial.md Codex adversarial review
│ ├── codex_risk_radar.md Codex risk radar assessment
│ ├── consistency.md Post-QA consistency checker output
│ ├── claims_audit.md Post-QA claims auditor output
│ └── codex_deliberation_log.md All Claude-Codex deliberations
├── provenance/
│ └── cuts/ Archived text from all content that was cut
├── archive/ Browsable research archive (created by /archive)
│ └── README.md Indexed guide to all archive contents
├── submission/ Submission package (created by /prepare-submission)
├── vendor/
│ ├── claude-scientific-skills/ 177 scientific skills (git submodule)
│ └── praxis/ Scientific data analysis toolkit (git submodule)
├── .claude/ Claude runtime scaffold (Claude projects)
│ ├── CLAUDE.md
│ ├── settings.local.json
│ ├── commands/
│ ├── pipeline/
│ └── skills/ -> vendor/
└── .codex/ Codex runtime scaffold (Codex projects)
├── AGENTS.md
├── commands/
├── pipeline/
└── skills/ -> vendor/
### `.paper.json`

Created by `create-paper`, read by the pipeline and most commands.
{
"topic": "A survey on large language model reasoning",
"venue": "arxiv",
"depth": "standard",
"runtime": "claude",
"model": "claude-opus-4-6",
"max_revisions": 3,
"email": "[email protected]",
"oa_resolution": {
"unpaywall": true,
"openalex": true,
"semantic_scholar": true,
"core": true,
"pubmed_central": "auto",
"web_search": true,
"repository_search": true
},
"authors": [
{
"name": "First Author",
"affiliation": "Department, University",
"email": "[email protected]",
"orcid": "0000-0000-0000-0000"
}
],
"keywords": ["large language models", "reasoning", "chain-of-thought"],
"funding": "Grant XYZ from Funding Body",
"conflicts": "None declared",
"data_availability": "Available at https://github.com/...",
"code_availability": "https://github.com/..."
}
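A consumer of this file might load it as follows. This is a sketch: the default values are assumptions based on the documented field descriptions, and treating `topic` as the only required field is likewise an assumption, not a documented constraint:

```python
import json
from pathlib import Path

# Assumed defaults for optional fields (see field table).
DEFAULTS = {"venue": "generic", "depth": "standard", "max_revisions": 3}

def load_paper_config(path: str = ".paper.json") -> dict:
    """Load .paper.json, filling defaults for optional fields."""
    config = {**DEFAULTS, **json.loads(Path(path).read_text())}
    if "topic" not in config:
        raise ValueError(f"{path} must define a topic")
    return config
```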
| Field | Description |
|---|---|
| `topic` | Paper topic, used to seed the research pipeline and all agent prompts |
| `venue` | Target venue: `generic`, `ieee`, `acm`, `neurips`, `nature`, `arxiv`, `apa` |
| `depth` | `"standard"` (5 agents, 30-50 refs, 1-4 hrs) or `"deep"` (12 agents, 60-80 refs, 3-8 hrs) |
| `runtime` | Active harness: `claude` or `codex` |
| `model` | Base model (1M context variants are added automatically per tier) |
| `max_revisions` | Maximum QA iterations (overridden by depth: 5 for standard, 8 for deep) |
| `email` | Email for Unpaywall API auth and OpenAlex rate-limit boost. Also read from the `UNPAYWALL_EMAIL` env var. |
| `oa_resolution` | Per-API toggles for the OA resolution chain (see below). All default to `true`. |
| `authors` | Author list for cover letter and author block |
| `keywords` | Keywords for submission metadata |
| `funding` | Funding acknowledgment text |
| `conflicts` | Conflicts of interest declaration |
| `data_availability` | Data availability statement for submission |
| `code_availability` | Code availability statement and URL |
### `oa_resolution` sub-object

Controls which APIs are tried during source acquisition (Stage 1d), `/audit-sources`, and `/check-citations`. The pipeline tries each enabled API in order and stops on the first successful PDF download.
| Key | Default | Description |
|---|---|---|
| `unpaywall` | `true` | Unpaywall — ~30M OA articles. Requires the `email` field or `UNPAYWALL_EMAIL` env var. |
| `openalex` | `true` | OpenAlex — 250M+ works, no key needed. Also extracts abstracts. |
| `semantic_scholar` | `true` | Semantic Scholar — checks the `openAccessPdf` field. |
| `core` | `true` | CORE — 200M+ institutional repository papers. Requires the `CORE_API_KEY` env var (free). |
| `pubmed_central` | `"auto"` | PubMed Central — biomedical full text. `"auto"` enables only for biomedical/clinical domains. Set `true` to force on, `false` to disable. Optional: the `NCBI_API_KEY` env var increases the rate limit. |
| `web_search` | `true` | Firecrawl search for PDFs (`filetype:pdf`). |
| `repository_search` | `true` | Search ResearchGate, Academia.edu, SSRN. |
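The stop-on-first-hit chain is straightforward to model. A sketch with invented resolver callables, showing how the per-API toggles and the `"auto"` value for `pubmed_central` could interact; the real pipeline's interfaces will differ:

```python
def resolve_open_access(doi, toggles, resolvers, domain="general"):
    """Try each enabled OA resolver in order; stop at the first PDF.

    `resolvers` is an ordered list of (name, fn) pairs, where fn(doi)
    returns a PDF URL or None. An "auto" toggle enables the resolver
    only for biomedical/clinical domains.
    """
    for name, fn in resolvers:
        enabled = toggles.get(name, True)  # toggles default to enabled
        if enabled == "auto":
            enabled = domain in ("biomedical", "clinical")
        if not enabled:
            continue
        url = fn(doi)
        if url:
            return name, url  # first hit wins; remaining APIs are skipped
    return None, None
```

Because the chain short-circuits, putting the cheapest or most reliable resolvers first minimizes API calls per reference.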
| Variable | Required | Description |
|---|---|---|
| `UNPAYWALL_EMAIL` | For Unpaywall | Alternative to `email` in `.paper.json`. Any valid email address. |
| `CORE_API_KEY` | For CORE | Free API key from core.ac.uk/services/api. CORE is skipped if not set. |
| `NCBI_API_KEY` | No | Free key from NCBI. Increases the PubMed rate limit from 3/s to 10/s. |
| `OPENROUTER_API_KEY` | For knowledge graph | Required for the LightRAG knowledge graph. The graph is skipped if not set. |
### `.venue.json`

Copied from the template `venues/` directory. Read by the pipeline, writing agents, and Praxis for figure styles.
{
"name": "Generic Journal",
"id": "generic",
"documentclass": "\\documentclass[12pt]{article}",
"packages": ["\\usepackage{natbib}", "..."],
"bibliography_style": "plainnat",
"citation_style": "natbib",
"citation_commands": ["\\citep{}", "\\citet{}"],
"page_limit": null,
"abstract_word_limit": 300,
"blind_review": false,
"sections": ["Introduction", "Related Work", "Methods", "Results", "Discussion", "Conclusion"],
"notes": "Standard journal format with natbib citations. No page limit."
}
The pipeline reads `page_limit` to scale section word targets proportionally. It reads `blind_review` to know whether `/prepare-submission` should anonymize the author block. It reads `sections` to determine the initial section order in `main.tex`.
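Proportional scaling of word targets can be sketched as below. The 12-page baseline is an illustrative assumption, not a documented constant:

```python
def scale_word_targets(targets: dict, page_limit, baseline_pages: int = 12) -> dict:
    """Scale per-section word targets to fit a venue page limit.

    `targets` maps section name -> default word count. A null
    page_limit (as in generic.json) leaves targets unchanged.
    """
    if page_limit is None:
        return dict(targets)  # no limit: keep defaults
    factor = page_limit / baseline_pages
    return {section: round(words * factor) for section, words in targets.items()}
```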
### `.paper-state.json`

Written after every stage and read at startup to resume. Do not edit manually.
{
"topic": "...",
"venue": "generic",
"started_at": "2026-03-24T10:00:00Z",
"current_stage": "writing",
"stages": {
"research": { "done": true, "completed_at": "...", "notes": "45 refs found" },
"codex_cross_check": { "done": true, "completed_at": "..." },
"source_acquisition": { "done": true, "full_text": 28, "abstract_only": 12, "metadata_only": 5 },
"knowledge_graph": { "done": true, "entities": 347, "relationships": 891 },
"outline": { "done": true, "completed_at": "..." },
"codex_thesis": { "done": true, "completed_at": "..." },
"novelty_check": { "done": true, "status": "NOVEL" },
"writing": {
"done": false,
"sections": {
"introduction": { "done": true, "words": 1250 },
"related_work": { "done": true, "words": 2100 },
"methods": { "done": false, "words": 0 }
}
},
"figures": { "done": false },
"qa": { "done": false },
"qa_iteration": 2,
"codex_risk_radar": { "done": false },
"finalization": { "done": false },
"auto_iterations": {
"completed": 0,
"requested": 0,
"history": []
}
}
}
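Resume-on-restart reduces to scanning the stage list for the first entry without `"done": true`. A sketch using a simplified stage order; the real state file tracks more stages (such as `codex_cross_check`) and per-section writing progress:

```python
import json
from pathlib import Path

# Simplified; the actual pipeline has additional intermediate stages.
STAGE_ORDER = ["research", "source_acquisition", "knowledge_graph",
               "outline", "writing", "figures", "qa", "finalization"]

def next_stage(state_path: str = ".paper-state.json") -> str:
    """Return the first incomplete stage, for resume-after-interrupt.

    A fresh project (no state file yet) starts at the first stage.
    """
    path = Path(state_path)
    if not path.exists():
        return STAGE_ORDER[0]
    stages = json.loads(path.read_text()).get("stages", {})
    for stage in STAGE_ORDER:
        if not stages.get(stage, {}).get("done", False):
            return stage
    return "complete"
```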
### `.paper-progress.txt`

Human-readable progress file updated at each stage. Intended for monitoring from a second terminal:
```shell
watch cat .paper-progress.txt
```
Updated by /status and at each pipeline checkpoint.
The template files in `template/claude/commands/` define what each slash command does. Editing them changes the behavior of the pipeline in all new paper projects.
Key files for contributors:
- `template/claude/CLAUDE.md` — project rules and command reference (symlinked into each paper project)
- `template/claude/commands/write-paper.md` — the full pipeline definition (~1500 lines)
- `template/claude/commands/auto.md` — the `/auto` improvement loop
- `template/claude/commands/provenance.md` — the provenance query command
- `create-paper` — the project scaffolding script
- `write-paper` — the pipeline launcher script
- `template/venues/` — venue configuration files

Run the test suite with `tests/run_all.sh`. Individual tests: `tests/test_structure.sh` (project structure validation), `tests/test_prompts.sh` (prompt consistency checks), `tests/test_schema.py` (JSON schema validation). CI runs automatically via GitHub Actions (`.github/workflows/ci.yml`).
MIT