Data#
This section summarizes the datasets used by SynPlanner and how to obtain them.
Overview of datasets#
SynPlanner operates on reaction and molecule data stored in several formats.
| Data type | Description |
|---|---|
| Reactions | Reactions can be loaded and stored as a list of reaction SMILES in a text file (.smi) or as an RDF file (.rdf) |
| Molecules | Molecules can be loaded and stored as a list of molecule SMILES in a text file (.smi) or as an SDF file (.sdf) |
| Reaction rules | Reaction rules are stored as TSV (.tsv, preferred) or as a pickled list (.pickle, legacy) |
| Retrosynthetic models | Retrosynthetic models (neural networks) can be loaded and stored as serialized PyTorch models (.ckpt) |
| Retrosynthetic routes | Retrosynthetic routes can be visualized and stored as HTML files (.html) or stored as JSON files (.json) |
Note
Reaction and molecule file formats are parsed and recognized automatically by SynPlanner from file extensions.
Be sure to store the data with the correct extension.
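The extension-based recognition described in the note can be pictured with a small sketch. The mapping below only restates the formats listed in the table above; the function name `detect_format` and the dictionary `KNOWN_FORMATS` are hypothetical illustrations, not SynPlanner's actual parser.

```python
from pathlib import Path

# Extension -> format description, restating the overview table.
# Illustrative assumption only: this is not SynPlanner's internal registry.
KNOWN_FORMATS = {
    ".smi": "SMILES list (reactions or molecules)",
    ".rdf": "RDF file (reactions)",
    ".sdf": "SDF file (molecules)",
    ".tsv": "TSV table (e.g. reaction rules)",
    ".pickle": "pickled list (reaction rules, legacy)",
    ".ckpt": "PyTorch checkpoint (retrosynthetic models)",
    ".html": "HTML (route visualization)",
    ".json": "JSON (routes)",
}


def detect_format(path: str) -> str:
    """Return a format description based only on the file extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the last suffix
    if suffix not in KNOWN_FORMATS:
        raise ValueError(f"Unrecognized extension {suffix!r} for file {path!r}")
    return KNOWN_FORMATS[suffix]
```

A file saved as `reactions.txt` would be rejected by a check like this even if its content is valid SMILES, which is why the correct extension matters.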
Data repository structure#
Data is hosted on Hugging Face in a component-based structure:
SynPlanner-data/
├── policy/                          # Reaction rules + policy network weights
│   └── {architecture}/
│       └── {rules_version}/
│           ├── reaction_rules.tsv
│           ├── pipeline.yaml
│           └── {weights_version}/
│               ├── ranking_policy.ckpt
│               └── filtering_policy.ckpt
├── value/                           # Value network weights
│   └── {architecture}/
│       └── {version}/
│           ├── value_network.ckpt
│           └── meta.yaml
├── building_blocks/                 # Building block sets
│   └── {name}/
│       ├── building_blocks.tsv
│       └── meta.yaml
├── reaction_data/                   # Raw → standardized → filtered pipeline
│   └── {source}/
│       ├── raw/
│       ├── standardized/{YYYY-MM-DD}/
│       └── filtered/{YYYY-MM-DD}/
├── training_data/                   # Per-network training inputs
│   ├── ranking_policy/{YYYY-MM-DD}/
│   ├── filtering_policy/{YYYY-MM-DD}/
│   └── value_network/{YYYY-MM-DD}/
├── presets/                         # Ready-to-use preset definitions
│   └── {name}.yaml
└── benchmarks/                      # Benchmark target sets (downloaded separately)
    └── sascore/
Versioning: model components (policy/, value/) are versioned by directory name
(architecture family, rules version, weights version). Pipeline data (reaction_data/,
training_data/) is versioned by processing date (YYYY-MM-DD).
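As a rough illustration of how the two versioning schemes translate into repository paths, the sketch below builds the policy file paths from the three directory components and picks the most recent date-stamped snapshot. Both helpers (`policy_paths`, `latest_snapshot`) are hypothetical and written only against the layout shown above; they are not part of SynPlanner.

```python
from datetime import date
from pathlib import PurePosixPath


def policy_paths(architecture: str, rules_version: str, weights_version: str) -> dict:
    """Build repository-relative paths for the policy/ component."""
    rules_dir = PurePosixPath("policy") / architecture / rules_version
    weights_dir = rules_dir / weights_version
    return {
        "reaction_rules": str(rules_dir / "reaction_rules.tsv"),
        "pipeline": str(rules_dir / "pipeline.yaml"),
        "ranking_policy": str(weights_dir / "ranking_policy.ckpt"),
        "filtering_policy": str(weights_dir / "filtering_policy.ckpt"),
    }


def latest_snapshot(dir_names: list) -> str:
    """Return the most recent YYYY-MM-DD directory name, ignoring other entries."""
    dated = []
    for name in dir_names:
        try:
            dated.append((date.fromisoformat(name.rstrip("/")), name))
        except ValueError:
            continue  # not a date-stamped directory
    if not dated:
        raise ValueError("no YYYY-MM-DD directories found")
    return max(dated)[1]
```

For example, `policy_paths("supervised_gcn", "v1", "v1")` yields `policy/supervised_gcn/v1/v1/ranking_policy.ckpt` for the ranking policy weights.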
Data sources and bundles#
- reaction_rules.tsv: 24k reaction rules in SMARTS format (TSV)
- v1/ranking_policy.ckpt: ranking policy network trained on filtered USPTO and corresponding rules
- v1/filtering_policy.ckpt: filtering policy network trained on ChEMBL molecules and corresponding rules
- pipeline.yaml: full reproducibility manifest (standardization, filtration, extraction, training configs)
- value_network.ckpt: value network trained from planning simulations on ChEMBL targets
- building_blocks.tsv: 186k standardized building blocks (eMolecules + Sigma Aldrich)
- raw/uspto_full_mapped.smi.zip: original USPTO dataset (1.48M reactions, compressed)
- standardized/2024-12-31/: standardized reactions + config + errors
- filtered/2024-12-31/: filtered reactions + config + errors
- filtering_policy/2024-12-31/molecules_for_training.smi.zip: ChEMBL molecules for filtering policy training
- value_network/2024-12-31/targets_for_training.smi.zip: ChEMBL targets for value network tuning
Download data#
Data download#
Use the built-in downloader to fetch pre-trained models, reaction rules, and building blocks from Hugging Face.
Preset download (recommended)#
Download a ready-to-use preset with all components needed for retrosynthetic planning:
synplan download_preset --preset synplanner-article --save_to synplan_data
This downloads the synplanner-article preset, which includes:
- Reaction rules (TSV): policy/supervised_gcn/v1/reaction_rules.tsv
- Ranking policy weights: policy/supervised_gcn/v1/v1/ranking_policy.ckpt
- Filtering policy weights: policy/supervised_gcn/v1/v1/filtering_policy.ckpt
- Value network weights: value/supervised_gcn/v1/value_network.ckpt
- Building blocks: building_blocks/emolecules-salt-ln/building_blocks.tsv
Python API:
from synplan.utils.loading import download_preset
paths = download_preset("synplanner-article", save_to="synplan_data")
rules_path = paths["reaction_rules"]
policy_path = paths["ranking_policy"]
bb_path = paths["building_blocks"]
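Before passing the returned paths on to planning code, it can help to confirm that the expected files are actually on disk. The sketch below assumes only what the snippet above shows: that `download_preset` returns a mapping containing the keys `reaction_rules`, `ranking_policy`, and `building_blocks`. The helper `check_preset_paths` is hypothetical and not part of SynPlanner.

```python
from pathlib import Path
from typing import Any, Dict, Mapping

# Keys used in the Python API example above; other keys may exist but are not assumed.
REQUIRED_KEYS = ("reaction_rules", "ranking_policy", "building_blocks")


def check_preset_paths(paths: Mapping[str, Any]) -> Dict[str, Path]:
    """Verify that the required preset entries exist and point to files."""
    missing_keys = [key for key in REQUIRED_KEYS if key not in paths]
    if missing_keys:
        raise KeyError(f"Preset is missing entries: {missing_keys}")

    resolved = {key: Path(paths[key]) for key in REQUIRED_KEYS}
    missing_files = [str(p) for p in resolved.values() if not p.is_file()]
    if missing_files:
        raise FileNotFoundError(f"Files not found on disk: {missing_files}")
    return resolved
```

A possible use is `checked = check_preset_paths(paths)` right after the download call, so that a partial or interrupted download fails early with a clear message.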
Details#
For a full list of datasets and descriptions, see Data.
Download from Hugging Face (browse)#
New repository: Hugging Face – SynPlanner-data
- policy/
- value/
- building_blocks/
- reaction_data/
- benchmarks/
- presets/

Legacy repository: Hugging Face – SynPlanner (flat structure, deprecated)