Data Curation#
This tutorial demonstrates how to prepare reaction data (atom-atom mapping, standardization, and filtration) before extracting reaction rules and training retrosynthetic models in SynPlanner.
Basic recommendations#
1. Always do reaction data filtration
Reaction data filtration is a crucial step in the reaction data curation pipeline. It ensures that the extracted reaction rules are valid and that the downstream code runs correctly. Filter the reaction data before extracting reaction rules and training retrosynthetic models.
2. Input and output reaction representation can be different after filtration
The current version of the reaction data filtration protocol in SynPlanner includes some functions for additional standardization of input reactions. This is why sometimes the output reaction SMILES, after it passes all the reaction filters, may not exactly match the input reaction SMILES.
1. Set up input and output data locations#
The SynPlanner input data will be downloaded from the HuggingFace repository to the specified directory.
[ ]:
import os
import shutil
from pathlib import Path

from synplan.utils.loading import download_unpack_data

# download reaction data from new repo
data_folder = Path("synplan_data").resolve()
original_data_path = download_unpack_data(
    filename="uspto_full_mapped.smi.zip",
    subfolder="reaction_data/uspto/raw",
    save_to=data_folder,
)

# results folder
results_folder = Path("tutorial_results").resolve()
results_folder.mkdir(exist_ok=True)
shutil.copy(original_data_path, results_folder.joinpath("uspto_original.smi"))

# output_data
mapped_data_path = results_folder.joinpath("uspto_mapped.smi")
standardized_data_path = results_folder.joinpath("uspto_standardized.smi")
filtered_data_path = results_folder.joinpath("uspto_filtered.smi")
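Before starting the long-running steps, it can help to check how many reactions a file contains. The sketch below assumes one reaction SMILES per line; `count_reaction_lines` and `count_reactions` are hypothetical helpers written for this tutorial and are not part of SynPlanner.

```python
from pathlib import Path


def count_reaction_lines(lines):
    """Count non-empty lines in an iterable of text lines."""
    return sum(1 for line in lines if line.strip())


def count_reactions(smi_path):
    """Count reactions in a SMILES file, assuming one reaction per line."""
    with Path(smi_path).open(encoding="utf-8") as handle:
        return count_reaction_lines(handle)


# In the notebook, after the cell above has run, one could call:
# count_reactions(results_folder.joinpath("uspto_original.smi"))
```

The same helper can be reused after each later step to see how many reactions were written to the mapped, standardized, and filtered files.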
2. Reaction atom-atom mapping#
Atom-atom mapping establishes the correspondence between atoms in reactants and products. This is a prerequisite for downstream steps such as reaction rule extraction.
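To make the idea concrete, the sketch below reads the atom map numbers from a mapped reaction SMILES using only the standard library. The mapped esterification string is hand-written for illustration (it is not output from SynPlanner), and `atom_map_numbers` is a hypothetical helper that assumes map numbers appear in the usual `[atom:number]` form.

```python
import re

# Matches the ":<number>]" suffix of a mapped atom such as "[CH3:1]".
_ATOM_MAP = re.compile(r":(\d+)\]")


def atom_map_numbers(reaction_smiles):
    """Return (reactant_maps, product_maps) as sets of integers."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>reagents>products'")
    reactants, _reagents, products = parts
    reactant_maps = {int(n) for n in _ATOM_MAP.findall(reactants)}
    product_maps = {int(n) for n in _ATOM_MAP.findall(products)}
    return reactant_maps, product_maps


# Hand-written mapping of acetic acid + ethanol -> ethyl acetate + water.
mapped = (
    "[CH3:1][C:2](=[O:3])[OH:4].[OH:5][CH2:6][CH3:7]"
    ">>[CH3:1][C:2](=[O:3])[O:5][CH2:6][CH3:7].[OH2:4]"
)
reactant_maps, product_maps = atom_map_numbers(mapped)

# Atoms that carry the same number on both sides are tracked through the reaction.
print(sorted(reactant_maps & product_maps))
```

Here every mapped atom in the reactants has a counterpart in the products, which is the correspondence that rule extraction relies on.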
SynPlanner provides GPU-accelerated reaction mapping via a neural attention model (chytorch). The map_reactions_from_file function processes a SMILES file in streaming batches through a three-stage pipeline (parse → GPU inference → map + write).
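The streaming idea can be sketched in plain Python. This is only an illustration of chunked, batched processing and not SynPlanner's implementation: `iter_chunks`, `process_stream`, and the `infer` callable are hypothetical, and the assumption that a large chunk of lines is read first and then split into smaller inference batches is a guess at how `chunk_size` and `batch_size` relate.

```python
from itertools import islice


def iter_chunks(items, chunk_size):
    """Yield lists of up to chunk_size items without loading the whole input."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_stream(lines, chunk_size, batch_size, infer):
    """Parse a chunk, run inference batch by batch, and yield results as they come."""
    for chunk in iter_chunks(lines, chunk_size):
        # Stage 1: parse (here: strip whitespace and drop empty lines).
        parsed = [line.strip() for line in chunk if line.strip()]
        # Stage 2: run the (placeholder) model on small batches.
        for batch in iter_chunks(parsed, batch_size):
            # Stage 3: hand each result on to the caller, e.g. for writing.
            yield from infer(batch)
```

Because both functions are generators, memory use is bounded by the chunk size rather than by the size of the input file.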
[ ]:
from synplan.chem.data.mapping import MappingConfig, map_reactions_from_file

mapping_config = MappingConfig(
    batch_size=16,
    chunk_size=5000,
)

map_reactions_from_file(
    config=mapping_config,
    input_reaction_data_path=original_data_path,
    mapped_reaction_data_path=mapped_data_path,
    silent=False,
)
Note
For mapping a single reaction programmatically, use map_reaction from the same module:
from synplan.chem.data.mapping import map_reaction
from chython import smiles
rxn = smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
map_reaction(rxn)
3. Reaction standardization#
The reaction data standardization protocol includes the standardization of individual molecules (reagents, reactants, and products) and the standardization of reactions (e.g. reaction equation balancing).
More details about the reaction standardization protocol in SynPlanner can be found in the official documentation.
Note
In this tutorial, the input data are already standardized by a slightly different protocol. It omits the major tautomer selection done by the ChemAxon standardizer.
Standardization configuration#
The next step is to configure the reaction standardization process using the ReactionStandardizationConfig class, which specifies the standardizers to apply and their settings.
More details about the reaction standardization configuration in SynPlanner can be found in the official documentation.
[ ]:
from synplan.utils.logging import init_logger

# Initialize before importing standardizing
logger, log_file_path = init_logger(
    name="synplan",
    console_level="ERROR",
    file_level="INFO",
)

from synplan.chem.data.standardizing import (
    ReactionStandardizationConfig,  # the main config class
    standardize_reactions_from_file,  # reaction standardization function
    # reaction standardizers
    KekuleFormConfig,
    CheckValenceConfig,
    ImplicifyHydrogensConfig,
    CheckIsotopesConfig,
    AromaticFormConfig,
    MappingFixConfig,
    UnchangedPartsConfig,
    DuplicateReactionConfig,
)

# specify the list of applied reaction standardizers
standardization_config = ReactionStandardizationConfig(
    kekule_form_config=KekuleFormConfig(),
    check_valence_config=CheckValenceConfig(),
    implicify_hydrogens_config=ImplicifyHydrogensConfig(),
    check_isotopes_config=CheckIsotopesConfig(),
    aromatic_form_config=AromaticFormConfig(),
    mapping_fix_config=MappingFixConfig(),
    unchanged_parts_config=UnchangedPartsConfig(),
    duplicate_reaction_config=DuplicateReactionConfig(),
)
Note
A standardizer is activated when its config (..._config) is passed to ReactionStandardizationConfig (see above).
This means you can apply only the standardizers you need. For example, to kekulize only, pass a single config to ReactionStandardizationConfig:
standardization_config = ReactionStandardizationConfig(
    kekule_form_config=KekuleFormConfig(),
)
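The "a step runs only when its config is given" pattern can be illustrated with plain dataclasses. This is a generic sketch and not SynPlanner's code: `KekulizeStep`, `DeduplicateStep`, `PipelineConfig`, and `active_steps` are hypothetical names, and the assumption is simply that an omitted config defaults to `None` and is skipped.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class KekulizeStep:
    """Hypothetical settings object for one pipeline step."""


@dataclass
class DeduplicateStep:
    """Hypothetical settings object for another pipeline step."""


@dataclass
class PipelineConfig:
    """A step is considered active only when its settings object is provided."""

    kekulize: Optional[KekulizeStep] = None
    deduplicate: Optional[DeduplicateStep] = None

    def active_steps(self):
        """Return the names of all steps whose settings are not None."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# Only the kekulize step is configured, so only it is reported as active.
print(PipelineConfig(kekulize=KekulizeStep()).active_steps())
```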
Running standardization#
Once the standardization configuration is in place, we can apply these standardizers to the mapped reaction data:
[ ]:
standardize_reactions_from_file(
    config=standardization_config,
    input_reaction_data_path=mapped_data_path,  # mapped input data
    standardized_reaction_data_path=standardized_data_path,  # standardized output data
    silent=False,
    num_cpus=4,
    batch_size=100,
    worker_log_level="INFO",
    log_file_path=log_file_path,
)
4. Reaction filtration#
In SynPlanner, reaction data filtration is a crucial step to ensure the validity of reaction rules used in retrosynthetic planning.
More details about the reaction filtration protocol in SynPlanner can be found in the official documentation.
Filtration configuration#
The next step is to configure the reaction filtration process using the ReactionFilterConfig class, which specifies the filters to apply and their settings.
More details about the reaction filtration configuration in SynPlanner can be found in the official documentation.
[4]:
from synplan.chem.data.filtering import (
    ReactionFilterConfig,  # the main config class
    filter_reactions_from_file,  # reaction filtration function
    # reaction filters:
    CCRingBreakingConfig,
    WrongCHBreakingConfig,
    CCsp3BreakingConfig,
    DynamicBondsConfig,
    MultiCenterConfig,
    NoReactionConfig,
)

# specify the list of applied reaction filters
filtration_config = ReactionFilterConfig(
    dynamic_bonds_config=DynamicBondsConfig(
        min_bonds_number=1,  # minimum number of dynamic bonds for a reaction
        max_bonds_number=6,  # maximum number of dynamic bonds for a reaction
    ),
    no_reaction_config=NoReactionConfig(),
    multi_center_config=MultiCenterConfig(),
    wrong_ch_breaking_config=WrongCHBreakingConfig(),
    cc_sp3_breaking_config=CCsp3BreakingConfig(),
    cc_ring_breaking_config=CCRingBreakingConfig(),
)
Note
A filter is activated when its config (..._config) is passed to ReactionFilterConfig (see above).
Running filtration#
Once the filtration configuration is in place, we can apply these filters to the standardized reaction data:
[5]:
filter_reactions_from_file(
    config=filtration_config,
    input_reaction_data_path=standardized_data_path,  # standardized input data
    filtered_reaction_data_path=filtered_data_path,  # filtered output data
    num_cpus=4,
    batch_size=100,
)
Number of reactions processed: 1314804 [1:38:05]
Initial number of reactions: 1314804
Removed number of reactions: 295500
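The two counts reported above are enough to work out how much of the dataset survives filtration. The short calculation below uses only those numbers; the variable names are chosen here for illustration.

```python
# Counts copied from the filtration output above.
initial_reactions = 1_314_804
removed_reactions = 295_500

retained_reactions = initial_reactions - removed_reactions
removed_fraction = removed_reactions / initial_reactions

print(f"Retained reactions: {retained_reactions} ({1 - removed_fraction:.1%})")
print(f"Removed reactions: {removed_reactions} ({removed_fraction:.1%})")
```

For this run, 1,019,304 reactions (about 77.5%) pass all filters and roughly 22.5% are removed.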
Results#
If the tutorial runs successfully, the results folder will contain four reaction data files:
original reaction data
mapped reaction data
standardized reaction data
filtered reaction data
[6]:
sorted(Path(results_folder).iterdir(), key=os.path.getmtime, reverse=False)
[6]:
[PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_original.smi'),
PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_standardized.smi'),
PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_filtered.smi')]
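To confirm that every step wrote its output, the expected file names can be checked explicitly. This is a minimal sketch: `missing_outputs` is a hypothetical helper, and the file names are the ones defined in the first cell of this tutorial.

```python
from pathlib import Path

# File names taken from the paths defined at the start of the tutorial.
EXPECTED_FILES = (
    "uspto_original.smi",
    "uspto_mapped.smi",
    "uspto_standardized.smi",
    "uspto_filtered.smi",
)


def missing_outputs(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in the folder."""
    folder = Path(folder)
    return [name for name in expected if not folder.joinpath(name).is_file()]


# In the notebook, after all steps have run, one could call:
# missing_outputs(results_folder)
```

An empty list means all four files exist; any name that is returned points to a step that did not run or did not finish.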