Clustering of Retrosynthesis Routes#

In retrosynthetic analysis, multiple synthetic routes can be generated for a single target molecule. Clustering these routes helps to group together pathways that share similar high-level strategies, making it easier to identify fundamentally different approaches and prioritize the most promising ones. This step enables chemists to focus on unique synthetic ideas rather than redundant variations.

This tutorial demonstrates how to cluster the Retrosynthesis Routes by unique strategic bond patterns. At this stage you should have a set of solved retrosynthetic routes (either from running the planning step here or from loading previously saved results).

1. Setup paths#

Point to the existing data folder. No download needed — assumes data is already present in tutorials/synplan_data/.

[ ]:
from pathlib import Path
from synplan.utils.loading import download_preset

# Download preset data (or use already downloaded)
paths = download_preset("synplanner-article", save_to="synplan_data")

# input data
ranking_policy_network = paths["ranking_policy"]
reaction_rules_path = paths["reaction_rules"]
building_blocks_path = paths["building_blocks"]

# output folder (shared with other tutorials)
results_folder = Path("tutorial_results").resolve()
results_folder.mkdir(exist_ok=True)

2. Retrosynthetic planning#

Run a short example to produce routes, or skip if you already have routes_*.json/csv.

[2]:
from synplan.chem.utils import mol_from_smiles

# let's take capivasertib used as anti-cancer medication for the treatment
# of breast cancer and approved by FDA in 2023
example_smiles = "NC1(C(=O)N[C@@H](CCO)c2ccc(Cl)cc2)CCN(c2nc[nH]c3nccc2-3)CC1"

target_molecule = mol_from_smiles(
    example_smiles,
    clean2d=True,
    standardize=True,
    clean_stereo=True
    )
[3]:
target_molecule
[3]:
../_images/user_guide_06_Clustering_5_0.svg

Run example planning (optional)

[ ]:
from synplan.mcts.tree import Tree
from synplan.utils.config import TreeConfig, RolloutEvaluationConfig
from synplan.utils.loading import load_building_blocks, load_reaction_rules, load_policy_function, load_evaluation_function
from synplan.route_quality.scorer import ProtectionRouteScorer

building_blocks = load_building_blocks(building_blocks_path, standardize=False)
reaction_rules = load_reaction_rules(reaction_rules_path)

policy_network = load_policy_function(weights_path=ranking_policy_network)

tree_config = TreeConfig(
    search_strategy="expansion_first",
    max_iterations=300,
    max_time=120,
    max_depth=9,
    min_mol_size=1,
    init_node_value=0.5,
    ucb_type="uct",
    c_ucb=0.1,
)

eval_config = RolloutEvaluationConfig(
    policy_network=policy_network,
    reaction_rules=reaction_rules,
    building_blocks=building_blocks,
    min_mol_size=tree_config.min_mol_size,
    max_depth=tree_config.max_depth,
    normalize=True,
)
evaluation_function = load_evaluation_function(eval_config)

# Protection-aware route scorer (uses bundled SMARTS and incompatibility matrix)
route_scorer = ProtectionRouteScorer.from_config()

tree = Tree(
    target=target_molecule,
    config=tree_config,
    reaction_rules=reaction_rules,
    building_blocks=building_blocks,
    expansion_function=policy_network,
    evaluation_function=evaluation_function,
    route_scorer=route_scorer,
)

tree_solved = False
for solved, node_id in tree:
    if solved:
        tree_solved = True
tree

Export planned routes as JSON (AiZynthFinder‑compatible) or CSV for later clustering.

[ ]:
from synplan.chem.reaction_routes.io import export_tree_to_json, export_tree_to_csv

export_tree_to_json(tree, results_folder / "clustering_routes.json")
export_tree_to_csv(tree, results_folder / "clustering_routes.csv")

3. Load routes#

Input formats

  • CSV: rows of (route_id, step_id, reaction_smiles)

  • JSON: route tree with mol/reaction nodes

Both can be converted to routes_dict (route_id → {step_id → ReactionContainer}).

[7]:
from synplan.chem.reaction_routes.io import read_routes_csv, read_routes_json, make_json

Load from JSON

[ ]:
json_path = results_folder / "clustering_routes.json"
routes_data = read_routes_json(file_path=json_path, to_dict=True)

# or from csv
# csv_path = results_folder / "clustering_routes.csv"
# routes_data = read_routes_csv(csv_path)
[9]:
routes_json = make_json(routes_data)

4. Build RouteCGR and SB‑CGR#

We convert each multi‑step route into compact graph objects:

  • CGR: overlay reactants/products for one step using atom mapping (formed/broken bonds).

  • RouteCGR: fold all step CGRs into one graph for the whole route.

  • SB‑CGR: keep only atoms of the target and their dynamic bonds (strategic bonds).

Why: SB‑CGR captures the core disconnections used to make the target and ignores non‑strategic details.

[10]:
from synplan.chem.reaction_routes.route_cgr import *
from synplan.chem.reaction_routes.clustering import *
from synplan.chem.reaction_routes.visualisation import cgr_display
from IPython.display import display, HTML, SVG
from synplan.utils.visualisation import get_route_svg_from_json

compose_all_route_cgrs builds RouteCGRs from a Tree or from a routes_dict (loaded from CSV/JSON).

[11]:
all_route_cgrs = compose_all_route_cgrs(routes_data)
# or
# all_route_cgrs = compose_all_route_cgrs(tree)
[13]:
i = 1
for route_id, route_cgr in all_route_cgrs.items():
    print(route_id)
    cgr_prods = [route_cgr.substructure(c) for c in route_cgr.connected_components]
    target_cgr = cgr_prods[0]
    display(SVG(cgr_display(target_cgr)))
    display(SVG(get_route_svg_from_json(routes_json, route_id)))
    # or
    # display(SVG(get_route_svg(tree, route_id))) # Currently the pathway from the serialized tree can not be depicted
    if i >= 6:
        break
    i += 1
71
../_images/user_guide_06_Clustering_20_1.svg
../_images/user_guide_06_Clustering_20_2.svg
72
../_images/user_guide_06_Clustering_20_4.svg
../_images/user_guide_06_Clustering_20_5.svg
74
../_images/user_guide_06_Clustering_20_7.svg
../_images/user_guide_06_Clustering_20_8.svg
201
../_images/user_guide_06_Clustering_20_10.svg
../_images/user_guide_06_Clustering_20_11.svg
202
../_images/user_guide_06_Clustering_20_13.svg
../_images/user_guide_06_Clustering_20_14.svg
204
../_images/user_guide_06_Clustering_20_16.svg
../_images/user_guide_06_Clustering_20_17.svg
[14]:
all_reduced_route_cgrs = compose_all_sb_cgrs(all_route_cgrs)
[15]:
i = 1
for route_id, route_cgr in all_reduced_route_cgrs.items():
    print(route_id)
    cgr_prods = [route_cgr.substructure(c) for c in route_cgr.connected_components]
    target_cgr = cgr_prods[0]
    display(SVG(cgr_display(target_cgr)))
    display(SVG(get_route_svg_from_json(routes_json, route_id)))
    # or
    # display(SVG(get_route_svg(tree, route_id)))
    if i >= 3:
        break
    i += 1
71
../_images/user_guide_06_Clustering_22_1.svg
../_images/user_guide_06_Clustering_22_2.svg
72
../_images/user_guide_06_Clustering_22_4.svg
../_images/user_guide_06_Clustering_22_5.svg
74
../_images/user_guide_06_Clustering_22_7.svg
../_images/user_guide_06_Clustering_22_8.svg

5. Cluster routes#

How it works

  • Cluster by SB‑CGR: routes with identical SB‑CGR end up together (same strategic bonds).

  • use_strat=False (default here): compare SB‑CGR graph signatures; robust to atom mapping.

  • Output: dict keyed by NSB.index (e.g., 3.2) with route IDs and a representative SB‑CGR.

[16]:
# use_strat: if True, clustering will use the CGRContainer’s structural signature
#            to ensure that routes which are chemically identical but differ only
#            in their atom mappings are grouped together instead of split apart
clusters = cluster_routes(all_reduced_route_cgrs, use_strat=False)
[18]:
len(clusters)
[18]:
4
[19]:
clusters.keys()
[19]:
dict_keys(['2.1', '3.1', '4.1', '4.2'])

Cluster report (HTML)#

For any cluster ID (e.g., 2.1), generate a compact HTML summary:

  • target SMILES

  • cluster index and size

  • SB‑CGR (strategic bonds)

  • each route: steps, optional score, SVG and reaction SMILES

[21]:
cluster_index = '3.1'
if cluster_index in clusters.keys():
    # display(HTML(routes_clustering_report(tree, clusters, cluster_index,
    #                      all_reduced_route_cgrs)))
    # or
    display(HTML(routes_clustering_report(routes_json, clusters, cluster_index,
                         all_reduced_route_cgrs)))
else:
    print(f"Cluster {cluster_index} not found in the clustering results.")

Cluster 3.1 Routes Report

Retrosynthetic Routes Report - Cluster 3.1

Target Molecule: c1cc(ccc1Cl)C(NC(C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)=O)CCO
Group index: 3.1
Size of Cluster: 2 routes
Identified Strategic Bonds
N O N O Cl N N N N N O N O Cl N N N N
Target Molecule Molecule Not In Stock Molecule In Stock
Route 3221 — 9 steps
OH O O O Cl NH4 + O O O Cl O O N NH O O O OH Cl O O NH2 O O N NH O O NH O O Cl O I O O N NH O O NH Cl O OH O O O N NH O O NH O O Cl O O O N NH O O Cl O NH2 NH NH2 O NH O O Cl N N N Cl O O NH N NH N N NH2 O Cl Cl NH N NH N N NH2 O OH
Step 1: C(O)CCC.c1cc(ccc1C(CC(=O)OC)=O)Cl>>CO.c1cc(ccc1C(=O)CC(=O)OCCCC)Cl
Step 2: [NH4+].c1cc(ccc1C(=O)CC(=O)OCCCC)Cl>>O.c1cc(ccc1Cl)C(CC(=O)OCCCC)N
Step 3: CC(C)(C)OC(=O)N1CCC(CC1)(NC(OCc2ccccc2)=O)C(=O)O.c1cc(ccc1Cl)C(CC(=O)OCCCC)N>>CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCCCC)=O)c3ccc(cc3)Cl)=O.O
Step 4: CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCCCC)=O)c3ccc(cc3)Cl)=O>>C(CC)C.CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(c3ccc(cc3)Cl)CC(=O)O)=O
Step 5: CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(c3ccc(cc3)Cl)CC(=O)O)=O.CCI>>CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCC)=O)c3ccc(cc3)Cl)=O.I
Step 6: CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCC)=O)c3ccc(cc3)Cl)=O>>CC(C)(C)OC(=O)N1CCC(CC1)(C(NC(CC(OCC)=O)c2ccc(cc2)Cl)=O)N.c1ccccc1COC=O
Step 7: CC(C)(C)OC(=O)N1CCC(CC1)(C(NC(CC(OCC)=O)c2ccc(cc2)Cl)=O)N>>C1CNCCC1(N)C(=O)NC(CC(OCC)=O)c2ccc(cc2)Cl.CC(C)(C)OC=O
Step 8: C1CNCCC1(N)C(=O)NC(CC(OCC)=O)c2ccc(cc2)Cl.c1cnc2c1c(ncn2)Cl>>Cl.c1cc(ccc1C(CC(OCC)=O)NC(C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)=O)Cl
Step 9: c1cc(ccc1C(CC(OCC)=O)NC(C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)=O)Cl>>c1cc(ccc1Cl)C(NC(C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)=O)CCO
Route 3224 — 9 steps
OH O O O Cl NH4 + O O O Cl O O N NH O O O OH Cl O O NH2 O O N NH O O NH O O Cl O I O O N NH O O NH Cl O OH O O O N NH O O NH O O Cl O O O N NH O O Cl O NH2 NH NH2 O NH O O Cl N N N Cl O O NH N NH N N NH2 O Cl Cl NH N NH N N NH2 O OH
Step 1: C(O)CCC.c1cc(ccc1C(CC(OCC)=O)=O)Cl>>CCO.c1cc(ccc1C(=O)CC(=O)OCCCC)Cl
Step 2: [NH4+].c1cc(ccc1C(=O)CC(=O)OCCCC)Cl>>O.c1cc(ccc1Cl)C(CC(=O)OCCCC)N
Step 3: CC(C)(C)OC(=O)N1CCC(CC1)(NC(OCc2ccccc2)=O)C(=O)O.c1cc(ccc1Cl)C(CC(=O)OCCCC)N>>CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCCCC)=O)c3ccc(cc3)Cl)=O.O
Step 4: CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCCCC)=O)c3ccc(cc3)Cl)=O>>C(CC)C.CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(c3ccc(cc3)Cl)CC(=O)O)=O
Step 5: CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(c3ccc(cc3)Cl)CC(=O)O)=O.CCI>>CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCC)=O)c3ccc(cc3)Cl)=O.I
Step 6: CC(C)(C)OC(=O)N1CCC(CC1)(NC(=O)OCc2ccccc2)C(NC(CC(OCC)=O)c3ccc(cc3)Cl)=O>>CC(C)(C)OC(=O)N1CCC(CC1)(C(NC(CC(OCC)=O)c2ccc(cc2)Cl)=O)N.c1ccccc1COC=O
Step 7: CC(C)(C)OC(=O)N1CCC(CC1)(C(NC(CC(OCC)=O)c2ccc(cc2)Cl)=O)N>>C1CNCCC1(N)C(=O)NC(CC(OCC)=O)c2ccc(cc2)Cl.CC(C)(C)OC=O
Step 8: C1CNCCC1(N)C(=O)NC(CC(OCC)=O)c2ccc(cc2)Cl.c1cnc2c1c(ncn2)Cl>>Cl.c1cc(ccc1C(CC(OCC)=O)NC(C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)=O)Cl
Step 9: c1cc(ccc1C(CC(OCC)=O)NC(C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)=O)Cl>>c1cc(ccc1Cl)C(NC(C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)=O)CCO

6. Subclustering#

Refines each main cluster by abstracting non‑strategic details.

What happens

  • Replace leaving/protecting groups in RouteCGR by generic X‑labels (synthon).

  • Build a “pseudo‑reaction” between labeled building blocks and target.

  • Collect and tabulate leaving groups per position across routes.

Why: routes may share the same SB‑CGR but differ in tactics (e.g., different leaving groups or protections). Subclustering separates these variants.

[22]:
all_subclusters = subcluster_all_clusters(clusters, all_reduced_route_cgrs, all_route_cgrs)
[ ]:
cluster_index = '2.1'
# Subcluster keys are strings formatted as "{lg_sizes}_{index}", e.g. "0_1", "1_1"
# Let's inspect available subclusters for this cluster
if cluster_index in all_subclusters:
    print(f"Available subclusters for {cluster_index}: {list(all_subclusters[cluster_index].keys())}")
    # Pick the first available subcluster
    first_key = next(iter(all_subclusters[cluster_index]))
    subgroup = all_subclusters[cluster_index][first_key]
    display(HTML(routes_subclustering_report(routes_json, subgroup, cluster_index, first_key, all_reduced_route_cgrs, aam=False)))
else:
    subgroup = None
    print(f"Cluster {cluster_index} not found in the subclustering results.")

Post‑processing (experimental)#

  • Remove leaving‑group columns that are constant across all routes in a subgroup.

  • Merge routes with identical leaving‑group sets.

  • Merge symmetric pseudo‑reactions that differ only by atom mapping.

[ ]:
if subgroup and len(subgroup['routes_data']) > 1:
    new_subgroup = post_process_subgroup(subgroup)
    display(HTML(routes_subclustering_report(routes_json, new_subgroup, cluster_index, first_key, all_reduced_route_cgrs, if_lg_group=True)))

References & further reading#

  • SynPlanner docs: official documentation

  • Concepts used here: CGR, RouteCGR, SB‑CGR, strategic bond patterns (see paper).