Reaction Rules Extraction#

This tutorial demonstrates how to extract reaction rules from reaction data in SynPlanner.

Basic recommendations#

  1. The specificity of extracted reaction rules can be adjusted by the configuration of the extraction protocol.

  2. The extracted reaction rules are stored as a TSV file with reaction SMARTS, popularity, and reaction indices columns.
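
As a rough illustration of that layout, the sketch below parses a small in-memory TSV with the standard `csv` module. The column names (`rule_smarts`, `popularity`, `reaction_indices`), the comma-separated index format, and the sample rows are assumptions made for this example, not the confirmed SynPlanner output format; check the header of your own file before relying on them.

```python
import csv
import io

# Hypothetical sample: column names and values are assumptions, not real SynPlanner output.
SAMPLE_TSV = (
    "rule_smarts\tpopularity\treaction_indices\n"
    "[C:1]-[O:2]>>[C:1].[O:2]\t5\t0,3,7,8,9\n"
    "[C:1]-[N:2]>>[C:1].[N:2]\t3\t1,2,4\n"
)


def read_rules_tsv(handle):
    """Parse a tab-separated rules table into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append(
            {
                "rule_smarts": row["rule_smarts"],
                "popularity": int(row["popularity"]),
                "reaction_indices": [int(i) for i in row["reaction_indices"].split(",")],
            }
        )
    return rows


rules = read_rules_tsv(io.StringIO(SAMPLE_TSV))
for rule in rules:
    print(rule["popularity"], rule["rule_smarts"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object and adjust the column names to match its header.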

1. Set up input and output data locations#

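The input for this tutorial is the filtered reaction data produced by the Data Curation tutorial, so no download from the HuggingFace repository is needed. To use your own data, replace the input path below.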

[ ]:
import os
from pathlib import Path

# Uses outputs from Data Curation tutorial (no HF download needed)
# results folder
results_folder = Path("tutorial_results").resolve()
results_folder.mkdir(exist_ok=True)

# input data: use the filtered data from the Data Curation tutorial, or replace with custom data
filtered_data_path = results_folder.joinpath("uspto_filtered.smi")

# output_data
reaction_rules_path = results_folder.joinpath("uspto_reaction_rules.tsv")
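
Because extraction on a large dataset can run for a long time, it may be worth checking that the input file exists before starting. The helper below is a minimal sketch using only `pathlib`; `require_file` is a hypothetical name introduced here, not a SynPlanner function, and the demo file is a throwaway created in a temporary directory.

```python
import tempfile
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return `path` if it is a non-empty regular file, otherwise raise."""
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Input file is empty: {path}")
    return path


# Small demonstration on a temporary file (placeholder content, not real reaction data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.smi"
    demo_path.write_text("CCO>>CC=O\n")
    checked_name = require_file(demo_path).name

    try:
        require_file(Path(tmp) / "missing.smi")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(checked_name, missing_detected)
```

In this tutorial the equivalent call would be `require_file(filtered_data_path)`.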

2. Reaction rules extraction#

The reaction rule extraction protocol in SynPlanner includes several steps, such as reaction center identification and specification and reaction rule validation.

More details about the reaction rule extraction protocol in SynPlanner can be found in the official documentation.

Extraction configuration#

The next step is to configure the reaction rule extraction process using the RuleExtractionConfig class in SynPlanner, which collects the parameters and settings that control how rules are extracted.

More details about the reaction rule extraction configuration in SynPlanner can be found in the official documentation.

[ ]:
from synplan.utils.config import RuleExtractionConfig
from synplan.chem.reaction_rules.extraction import extract_rules_from_reactions

# Functional group list from: Coley, Connor W., JCIM., 59.6 (2019): 2529-2537.
functional_groups = [
    '[O,S;h0]=C[O,Cl,I,Br,F]',                 # carboxylic acid / acyl halide
    '[O,S;h0]=CN',                             # amide/thioamide
    'S(O)(O)[Cl]',                             # sulfonyl chloride
    'B(O)O',                                   # boronic acid/ester
    '[Si](C)(C)C',                             # trialkyl silane
    '[Si](OC)(OC)(OC)',                        # trialkoxy silane, default to methyl
    '[N;H0;$(N-[#6]);D2]-,=[N;D2]-,=[N;D1]',   # azide
    'O=C1N([Br,I,F,Cl])C(=O)CC1',              # NBS brominating agent
    'Cc1ccc(S(=O)(=O)O)cc1',                   # tosyl
    'CC(C)(C)OC(=O)[N]',                       # N(Boc)
    '[C;h3][C;h0]([C;h3])([C;h3])O',           # tert-butyl group on oxygen
    '[C,N]=[C,N]',                             # alkene/imine
    '[C,N]#[C,N]',                             # alkyne/nitrile
    'C=C-[A]',                                 # adjacent to alkene
    'C#C-[A]',                                 # adjacent to alkyne
    'O=C-[A]',                                 # adjacent to carbonyl
    'O=C([C;h3])-[A]',                         # adjacent to methyl ketone
    'O=C([O,N])-[A]',                          # adjacent to carboxylic acid/amide/ester
    'ClS(Cl)=O',                               # thionyl chloride
    '[Mg,Li,Zn,Sn][Br,Cl,I,F]',                # Grignard/metal (non-dissociated)
    'S(O)(O)',                                 # SO2 group
    'N~N',                                     # diazo
    '[C;a]:[N,S,O;a]',                         # adjacent to heteroatom in aromatic ring
    '[N,S,O;a]:[C;a]:[C;a]',                   # two steps away from heteroatom in aromatic ring
    '[B,C](F)(F)F',                            # CF3, BF3 should have the F3 included
]

extraction_config = RuleExtractionConfig(
    min_popularity=3,
    environment_atom_count=1,
    multicenter_rules=True,
    include_rings=False,
    keep_leaving_groups=True,
    keep_incoming_groups=False,
    keep_reagents=False,
    include_func_groups=True,
    func_groups_list=functional_groups,
    atom_info_retention={
        "reaction_center": {
            "neighbors": True,  # retains information about neighboring atoms to the reaction center
            "implicit_hydrogens": False,  # includes data on implicit hydrogen atoms attached to the reaction center
            "ring_sizes": False,  # keeps information about the sizes of rings that reaction center atoms are part of
        },
        "environment": {
            "neighbors": False,  # retains information about neighboring atoms to the atoms in the environment of the reaction center
            "implicit_hydrogens": False,  # includes data on implicit hydrogen atoms attached to atoms in the environment
            "ring_sizes": False,  # keeps information about the sizes of rings that environment atoms are part of
        },
    },
)
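
To make `min_popularity` concrete, the toy sketch below counts how often each rule occurs and keeps only those that reach the threshold. It assumes that popularity means the number of reactions a rule was extracted from; the single-letter rule labels and the helper `filter_by_popularity` are hypothetical and are not part of SynPlanner.

```python
from collections import Counter


def filter_by_popularity(rules_per_reaction, min_popularity):
    """Keep rules seen in at least `min_popularity` reactions, most popular first."""
    counts = Counter(rules_per_reaction)
    return [(rule, n) for rule, n in counts.most_common() if n >= min_popularity]


# One hypothetical rule label per reaction.
extracted = ["A", "B", "A", "C", "A", "B", "B", "D"]
popular = filter_by_popularity(extracted, min_popularity=3)
print(popular)
```

Here rules `C` and `D` each occur once and are dropped, which mirrors how a higher `min_popularity` trades rule coverage for a smaller, more general rule set.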

Running extraction#

After configuring the rule extraction settings in SynPlanner, the next step is to apply these configurations to extract reaction rules from the reaction data. This is achieved using the extract_rules_from_reactions function.

[3]:
extract_rules_from_reactions(
    config=extraction_config,  # the configuration settings for rule extraction
    reaction_data_path=filtered_data_path,  # path to the reaction data file
    reaction_rules_path=reaction_rules_path,  # path to the TSV file where the extracted reaction rules will be stored
    num_cpus=4,  # number of CPUs used for extraction
    batch_size=100,  # number of reactions per batch
)
Number of reactions processed: 1019304 [1:28:25]
Number of extracted reaction rules: 34925
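
For planning runs on other datasets, the log above can be turned into a rough throughput estimate. The sketch below assumes the bracketed value is elapsed time in `h:mm:ss` format; the numbers are copied from the output shown above, and the timing applies only to that run with `num_cpus=4`.

```python
import re

# Values copied from the log output of the run shown above.
LOG_LINE = "Number of reactions processed: 1019304 [1:28:25]"
N_RULES = 34925

match = re.search(r"processed: (\d+) \[(\d+):(\d+):(\d+)\]", LOG_LINE)
n_reactions = int(match.group(1))
hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
elapsed_s = hours * 3600 + minutes * 60 + seconds

rate = n_reactions / elapsed_s  # reactions processed per second
reactions_per_rule = n_reactions / N_RULES  # processed reactions per retained rule

print(f"{elapsed_s} s elapsed, {rate:.0f} reactions/s, "
      f"{reactions_per_rule:.1f} reactions per rule")
```

The reactions-per-rule ratio is only a coarse indicator, since rules below `min_popularity` are not counted among the extracted rules.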

The extracted reaction rules can be loaded and visually inspected (the reaction rules are sorted by popularity):

[ ]:
from synplan.utils.loading import load_reaction_rules

reaction_rules_list = load_reaction_rules(reaction_rules_path)
[ ]:
str(reaction_rules_list[-1])
[ ]:
len(reaction_rules_list)
[5]:
reaction_rules_list[0]
[5]:
../_images/user_guide_03_Rules_Extraction_13_0.svg
[6]:
reaction_rules_list[1]
[6]:
../_images/user_guide_03_Rules_Extraction_14_0.svg
[7]:
reaction_rules_list[100]
[7]:
../_images/user_guide_03_Rules_Extraction_15_0.svg

Results#

If the tutorial runs successfully, the results folder will contain the three reaction data files from the Data Curation tutorial together with the extracted reaction rules:

  • original reaction data

  • standardized reaction data

  • filtered reaction data

  • extracted reaction rules

[8]:
sorted(Path(results_folder).iterdir(), key=os.path.getmtime, reverse=False)
[8]:
[PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_original.smi'),
 PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_standardized.smi'),
 PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_filtered.smi'),
 PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_reaction_rules.pickle')]