Terpene biosynthesis

Designing a microbial host for terpene biomanufacturing

One natural-language query, four federated knowledge graphs, an integrated answer that would otherwise take days of cross-portal search.

Key finding

The query surfaced more than 20 datasets spanning seven plant species and three microbial hosts, along with a candidate enzyme panel of about 15 genes anchored in both pathway and expression evidence: usable starting material for a synthetic biology team within minutes.

Graphs queried

PRO-KN, Gene Expression Atlas OKN, SPOKE-GeneLab, and the NDE/WOBD metadata layer.

WOBD contribution

Connects candidate enzyme parts to plant evidence, microbial host context, and supporting datasets.

Audit trail

Genes, organisms, datasets, and graph joins remain traceable to source records for follow-up review.

The question

A team of synthetic biologists is designing a microbial host for sustainable biomanufacturing of a high-value terpene. They need plant and microbial enzymes that drive terpene biosynthesis, and the experimental datasets — RNA-seq, proteomics, fermentation studies — that justify specific gene parts and predict how a production strain will behave.

Answering that well requires three different kinds of evidence at once: pathway-level gene relationships, organism-specific expression data across plant and microbial systems, and curated dataset metadata that ties each candidate gene to the experiments that justify it. Without an OKN-backed federation, gathering that evidence means days of fragmented searching across pathway portals, GEO, ArrayExpress, and species-specific resources, with the cross-organism synthesis still done by hand.

One query, federated answer

Natural-language query

Which plant and microbial genes drive terpene biosynthesis, and which datasets support engineering a microbial host?

Unified MCP server federates four KGs

  • prokn — pathway and gene-part discovery
  • gene-expression-atlas-okn — differential expression across organisms
  • spoke-genelab — organism-specific studies and assays
  • nde — NDE/WOBD dataset metadata layer
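The fan-out described above can be pictured with a minimal sketch. It assumes each graph is reachable through a simple callable client; the `GraphClient` signature, the `federate` helper, and the stub clients are hypothetical illustrations, not the actual MCP server interface. Only the graph names and roles come from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Graph names and roles as listed above.
GRAPH_ROLES: Dict[str, str] = {
    "prokn": "pathway and gene-part discovery",
    "gene-expression-atlas-okn": "differential expression across organisms",
    "spoke-genelab": "organism-specific studies and assays",
    "nde": "NDE/WOBD dataset metadata layer",
}


@dataclass(frozen=True)
class Hit:
    graph: str   # which KG produced the record (provenance)
    role: str    # what that KG contributes to the answer
    record: str  # opaque payload returned by the client


# Hypothetical client signature: query text in, raw records out.
GraphClient = Callable[[str], List[str]]


def federate(query: str, clients: Dict[str, GraphClient]) -> List[Hit]:
    """Send one query to every configured graph and tag each result with its source."""
    hits: List[Hit] = []
    for graph, role in GRAPH_ROLES.items():
        client = clients.get(graph)
        if client is None:
            continue  # graph not configured: skip it rather than fail the whole query
        for record in client(query):
            hits.append(Hit(graph, role, record))
    return hits


def stub(name: str) -> GraphClient:
    """Placeholder client that returns a single dummy record (not real data)."""
    return lambda query: [f"{name}: placeholder result"]


stub_clients = {name: stub(name) for name in GRAPH_ROLES}
hits = federate(
    "Which plant and microbial genes drive terpene biosynthesis?",
    stub_clients,
)
```

Tagging every hit with its source graph at collection time is what keeps the integrated answer traceable later.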

Integrated answer

Candidate gene panel anchored across plant and microbial organisms, with the experimental datasets that justify each choice — every result traceable to its source graph.

  • 4 federated KGs
  • 20+ datasets surfaced
  • ~15 candidate genes anchored
  • 7 plant species and 3 microbial hosts

What the chat looks like

Excerpt from a conversation that produced these findings:

User: I'm designing a microbial host for terpene biomanufacturing. Find candidate enzymes and the experimental datasets that justify them, biased toward plant and microbial systems rather than the human-heavy default.

A: Querying prokn, gene-expression-atlas-okn, spoke-genelab, nde…

Pulled candidate parts from pathway and expression KGs, then leaned on the NDE/WOBD metadata layer for the supporting experimental datasets.

  • Precursor-supply panel (gene-expression-atlas-okn): HMGR, DXS, DXR, HDR, IDI1/2, ERG20/FDPS, GGPS1, PSY — consistently present across plant and microbial systems.
  • Plant evidence (NDE): Artemisia argyi transcriptome (GSE102404), four Taxus paclitaxel-biosynthesis studies, Crocus apocarotenoids (GSE103181), maize and Isodon diterpenes.
  • Microbial host engineering (NDE): E. coli IPP/FPP toxicity (GSE102672, GSE29267, GSE30403), IspG/IspH balancing (GSE84255), taxadiene-producing yeast adaptation (GSE225783).
  • Cross-graph integration: joins on NCBI_Gene, GeneSymbol, and UBERON link organism-specific pathway evidence to organism-spanning expression data.

Recommended starting panel: a core precursor-supply module, a product-branch module chosen by target class (taxoid, defense diterpene, or carotenoid), and a host-hardening module derived from the microbial toxicity datasets.

Illustrative excerpt; phrasing is paraphrased and lightly edited from the upstream analysis. Live results may vary as graphs are updated.
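The accessions named in the excerpt can be kept as a small lookup table so that each module of the starting panel points back at its supporting datasets. The sketch below is one possible layout, not the format used by the upstream analysis; the `by_evidence_class` helper and the "plant" / "microbial" labels are assumptions introduced for illustration. Only entries with an explicit accession in the excerpt are included.

```python
from typing import Dict, List, Tuple

# (accession, description, evidence class), copied from the excerpt above.
# The evidence-class labels are an illustrative assumption.
DATASETS: List[Tuple[str, str, str]] = [
    ("GSE102404", "Artemisia argyi transcriptome", "plant"),
    ("GSE103181", "Crocus apocarotenoids", "plant"),
    ("GSE102672", "E. coli IPP/FPP toxicity", "microbial"),
    ("GSE29267", "E. coli IPP/FPP toxicity", "microbial"),
    ("GSE30403", "E. coli IPP/FPP toxicity", "microbial"),
    ("GSE84255", "IspG/IspH balancing", "microbial"),
    ("GSE225783", "taxadiene-producing yeast adaptation", "microbial"),
]


def by_evidence_class(rows: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group accessions by evidence class, preserving input order."""
    grouped: Dict[str, List[str]] = {}
    for accession, _description, evidence_class in rows:
        grouped.setdefault(evidence_class, []).append(accession)
    return grouped


grouped = by_evidence_class(DATASETS)
# The microbial group is the natural input for a host-hardening module.
host_hardening_sources = grouped["microbial"]
```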

What this query unlocks

  • Cross-graph synthesis on shared identifiers. Joins on NCBI_Gene, GeneSymbol, and UBERON link organism-specific pathway evidence to organism-spanning expression data.
  • Pathway and dataset discovery in a single step. The NDE/WOBD metadata graph surfaces real experiments — Artemisia transcriptomes, Taxus paclitaxel biosynthesis, E. coli precursor toxicity, taxadiene-producing yeast — that justify each candidate gene.
  • Plant and microbial scope from one ask. There is no need to query plant pathway resources, microbial host-engineering datasets, and tolerance studies separately, or to stitch the results together by hand.
  • Provenance preserved. Every gene, dataset, and edge in the answer traces back to a specific KG and source record — auditable, reproducible, and re-runnable.
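The first and last points above, joining on a shared identifier while keeping provenance, can be sketched as follows. This is a minimal illustration under stated assumptions: the `Record` layout and `join_on_gene` helper are hypothetical, and the gene keys and record identifiers are placeholders rather than real NCBI Gene IDs or source records.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    graph: str      # source KG, kept so the answer stays auditable
    source_id: str  # record identifier within that KG (placeholder)
    ncbi_gene: str  # shared join key (placeholder value, not a real ID)
    detail: str     # what this record says about the gene


def join_on_gene(records: List[Record]) -> Dict[str, List[Record]]:
    """Group records by shared gene key; keep genes seen in more than one graph."""
    by_gene: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        by_gene[rec.ncbi_gene].append(rec)
    return {
        gene: recs
        for gene, recs in by_gene.items()
        if len({r.graph for r in recs}) > 1
    }


records = [
    Record("prokn", "rec-1", "GENE_A", "pathway membership"),
    Record("gene-expression-atlas-okn", "rec-2", "GENE_A", "differential expression"),
    Record("spoke-genelab", "rec-3", "GENE_B", "organism-specific assay"),
]
joined = join_on_gene(records)
```

Because each joined entry still carries its `graph` and `source_id`, any gene in the result can be followed back to the records that support it.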

Why this matters — translational research

Commercial fermentation products such as antimalarials, biofuels, fragrances, and cannabinoids all start from a similar question: which gene parts to express, in which host, with what tolerance modules. For a synthetic biology team, the discovery phase has traditionally meant a months-long literature review. Federated KG access can compress the initial survey into a single chat session, with the supporting datasets surfaced alongside the gene panel.

The same pattern (pathway evidence, organism-spanning expression coverage, curated dataset metadata, and microbial host-engineering data) should carry over to other terpene classes and other microbial host platforms. As more dataset metadata sources are ingested into the federation, the same query will also reach further into clinical trial registries, raw-omics archives, and chemical-space databases that today live in separate silos.

Read the full analysis

Specific datasets, gene panels, ranking criteria, and the proposed engineering workflow.