Coming from RDKit#

This tutorial bridges the gap between RDKit and chython.

Chython is a pure-Python cheminformatics library with a Pythonic API, built-in atom-to-atom mapping, and the Condensed Graph of Reaction (CGR) representation. Critically, chython has built-in RDKit interoperability, so you can freely convert molecules between the two libraries and use them together.

This tutorial assumes you are comfortable with Python and RDKit. We won’t explain basic cheminformatics concepts; instead, we focus on the differences and the conversion workflow.

[1]:
from chython import smiles, smarts, MoleculeContainer, ReactionContainer

1. The Big Differences#

1.1 Atom Indexing: Arbitrary Keys vs 0-Based Indices#

In RDKit, atoms are accessed by a 0-based contiguous index: mol.GetAtomWithIdx(0), mol.GetAtomWithIdx(1), etc.

In chython, atoms have arbitrary integer keys that typically start from 1 and may have gaps. This is the single biggest source of confusion for RDKit users.

[2]:
mol = smiles('CCO')

# Iterate over atoms: yields (key, atom) pairs
for n, atom in mol.atoms():
    print(f'Key: {n}, Element: {atom.atomic_symbol}, Charge: {atom.charge}')

# Access a specific atom by key
# RDKit: mol.GetAtomWithIdx(0)
# Chython: mol.atom(1)  - NOT mol.atom(0)!
print(f'\nAtom with key 1: {mol.atom(1).atomic_symbol}')
Key: 1, Element: C, Charge: 0
Key: 2, Element: C, Charge: 0
Key: 3, Element: O, Charge: 0

Atom with key 1: C
[3]:
# Keys can have gaps after operations like substructure extraction
toluene = smiles('Cc1ccccc1')
sub = toluene.substructure([1, 3, 5, 7])  # extract some atoms
print('Atom keys in substructure:', list(sub.atoms_numbers))
# Note: keys are preserved from the original molecule, not renumbered 0..N
Atom keys in substructure: [1, 3, 5, 7]
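
Because keys can have gaps, a loop over range(n), natural with RDKit's 0-based indices, can ask for keys that do not exist. The sketch below shows the pitfall in plain Python. It does not call chython: the dict is a hypothetical stand-in for a molecule whose atom keys are 1, 3, 5 and 7, as in the substructure above.

```python
# Hypothetical stand-in: atom key -> element symbol, mirroring the keys above.
atoms = {1: 'C', 3: 'C', 5: 'C', 7: 'C'}

# RDKit habit: assume contiguous 0-based indices.
missing = []
for i in range(len(atoms)):
    if i not in atoms:
        missing.append(i)   # a lookup by this key would raise KeyError

# Safer: iterate over the keys that actually exist.
visited = [key for key, symbol in atoms.items()]

# If a 0-based position is needed elsewhere, build the lookup explicitly.
position_of = {key: pos for pos, key in enumerate(atoms)}

print(missing)       # [0, 2]
print(visited)       # [1, 3, 5, 7]
print(position_of)   # {1: 0, 3: 1, 5: 2, 7: 3}
```

With a real molecule, the same idea applies to the (key, atom) pairs yielded by mol.atoms().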

1.2 Mutability: Edit Molecules In-Place#

In RDKit, molecule editing is done through RWMol objects.

In chython, MoleculeContainer is mutable - you can add/remove atoms and bonds directly on the molecule object.

[4]:
mol = smiles('C')
print('Before:', str(mol))

# Add an oxygen atom and connect it
n = list(mol.atoms_numbers)[0]  # get the carbon's key
o_key = mol.add_atom('O')       # returns the key of the new atom
mol.add_bond(n, o_key, 1)       # single bond
print('After adding O:', str(mol))

# You can also use transactions for safe multi-step edits
mol2 = smiles('CCO')
try:
    with mol2:  # start transaction
        mol2.delete_atom(list(mol2.atoms_numbers)[-1])  # remove last atom
        print('Inside transaction:', str(mol2))
    print('After commit:', str(mol2))
except Exception:
    print('Transaction rolled back')
Before: C
After adding O: CO
Inside transaction: CC
After commit: CC
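
The commit-or-roll-back behaviour of a with block can be pictured with a toy context manager. This is only an illustration of the pattern, assuming edits are rolled back when an exception escapes the block. ToyGraph is hypothetical and says nothing about how chython implements its transactions.

```python
import copy

class ToyGraph:
    """Hypothetical stand-in for a molecule; not chython's implementation."""

    def __init__(self, atoms):
        self.atoms = dict(atoms)      # atom key -> element symbol
        self._snapshot = None

    def __enter__(self):
        # Remember the state at the start of the transaction.
        self._snapshot = copy.deepcopy(self.atoms)
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.atoms = self._snapshot   # failure: restore the snapshot
        self._snapshot = None
        return False                      # never swallow the exception

g = ToyGraph({1: 'C', 2: 'C', 3: 'O'})

# A failing block leaves the graph untouched.
try:
    with g:
        del g.atoms[3]
        raise ValueError('simulated failure mid-edit')
except ValueError:
    pass
after_failure = dict(g.atoms)

# A block that finishes normally keeps its edits.
with g:
    del g.atoms[3]
after_success = dict(g.atoms)

print(after_failure)  # {1: 'C', 2: 'C', 3: 'O'}
print(after_success)  # {1: 'C', 2: 'C'}
```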

1.3 Hashability and Equality#

In RDKit, molecule comparison and deduplication is typically done via canonical SMILES strings:

# RDKit
seen = set()
seen.add(Chem.MolToSmiles(mol))  # convert to string for hashing

Chython molecules are directly hashable and comparable. You can use them in sets, as dict keys, and compare with ==.

[5]:
mol1 = smiles('CCO')  # ethanol
mol2 = smiles('OCC')  # ethanol, different SMILES, same molecule

# Direct equality comparison
print('mol1 == mol2:', mol1 == mol2)

# Use in sets and dicts
mol_set = {mol1, mol2}
print('Set size:', len(mol_set))  # 1, because they represent the same molecule

mol_dict = {mol1: 'ethanol'}
print('Dict lookup:', mol_dict[mol2])  # works!

# Note: aromatic vs Kekule forms are NOT equal until both are in the same form
aromatic = smiles('c1ccccc1')
kekule = smiles('C1=CC=CC=C1')
print('\naromatic == kekule:', aromatic == kekule)  # False!
kekule.thiele()  # aromatize the Kekule form
print('After thiele():  ', aromatic == kekule)  # True
mol1 == mol2: True
Set size: 1
Dict lookup: ethanol

aromatic == kekule: False
After thiele():   True

Warning: Because molecules are mutable AND hashable, modifying a molecule after putting it in a set or dict will break the container. Always work with copies if you need to mutate molecules that are stored in hash-based collections.
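
The failure mode is easy to reproduce with any object whose hash depends on mutable state. The sketch below uses a hypothetical ToyMol class in plain Python (it is not a chython class) to show a set losing track of an element after mutation, and the copy-first alternative.

```python
class ToyMol:
    """Hypothetical stand-in: hash and equality depend on mutable content."""

    def __init__(self, atoms):
        self.atoms = list(atoms)

    def __eq__(self, other):
        return isinstance(other, ToyMol) and self.atoms == other.atoms

    def __hash__(self):
        return hash(tuple(self.atoms))

stored = ToyMol(['C', 'C', 'O'])
seen = {stored}
found_before = stored in seen     # True: the hash matches the stored entry

stored.atoms.append('N')          # mutate AFTER insertion
found_after = stored in seen      # False: the set still files it under the old hash

# Safer: leave the stored object alone and mutate a copy instead.
original = ToyMol(['C', 'C', 'O'])
seen_safe = {original}
edited = ToyMol(original.atoms)   # independent copy of the content
edited.atoms.append('N')
still_found = original in seen_safe

print(found_before, found_after, still_found)  # True False True
```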

1.4 SMILES: str(mol) Instead of Chem.MolToSmiles(mol)#

Chython’s canonical SMILES is simply str(mol); no dedicated conversion function is needed.

[6]:
mol = smiles('OCC(=O)O')

# RDKit: Chem.MolToSmiles(mol)
# Chython:
print('Canonical SMILES:', str(mol))

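# Format specs select non-default SMILES output
# Caution: with 'A' the output below carries no map numbers; check chython's
# format-spec documentation for the flag that emits atom mapping.
print('With atom mapping:', f'{mol:A}')
print('With hydrogens:', f'{mol:h}')      # explicit hydrogens in SMILES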
Canonical SMILES: C(O)C(=O)O
With atom mapping: C(O)C(=O)O
With hydrogens: [CH2]([OH])[C](=[O])[OH]

1.5 Aromaticity: thiele() / kekule() Instead of SetAromaticity() / Kekulize()#

| RDKit | Chython | Notes |
| --- | --- | --- |
| Chem.SetAromaticity(mol) | mol.thiele() | Named after Thiele’s theory of partial valences |
| Chem.Kekulize(mol) | mol.kekule() | Same concept, different name |

In chython, if you parse a Kekule SMILES, it stays in Kekule form until you explicitly call thiele().

[7]:
# Already aromatic from SMILES
aromatic = smiles('c1ccccc1')
print('Aromatic SMILES:', str(aromatic))

# Kekulize it
aromatic.kekule()
print('Kekule SMILES:', str(aromatic))

# Re-aromatize
aromatic.thiele()
print('Re-aromatized:', str(aromatic))
Aromatic SMILES: c1ccccc1
Kekule SMILES: C1=CC=CC=C1
Re-aromatized: c1ccccc1

1.6 Substructure Search: Operator Overloading#

RDKit: mol.HasSubstructMatch(query) returns True/False.

Chython: query < mol (reads as “query is a substructure of mol”). This is the < operator overloaded for chemical meaning.

[8]:
benzene = smiles('c1ccccc1')
toluene = smiles('Cc1ccccc1')
pyridine = smiles('c1ccncc1')

# Substructure check
# RDKit: toluene_rdkit.HasSubstructMatch(benzene_rdkit)
# Chython:
print('benzene < toluene:', benzene < toluene)   # True: benzene is a substructure of toluene
print('toluene < benzene:', toluene < benzene)   # False: toluene is NOT a substructure of benzene
print('benzene < pyridine:', benzene < pyridine) # False: pyridine has nitrogen
benzene < toluene: True
toluene < benzene: False
benzene < pyridine: False
[9]:
# Get atom mappings (equivalent to RDKit's GetSubstructMatches)
# RDKit returns tuples of 0-based indices
# Chython returns dicts of {query_key: mol_key}
query = smiles('c1ccccc1')
target = smiles('Cc1ccccc1')

for mapping in query.get_mapping(target):
    print('Mapping:', mapping)
    break  # just show the first one
Mapping: {1: 7, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
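
Code written for RDKit often expects a match as a tuple of 0-based target indices, ordered by query atom. A key-based mapping like the one above can be converted with a small helper. This is a sketch in plain Python: mapping_to_positions is a hypothetical helper, and it assumes that an atom's 0-based position is its place in the key iteration order, which the source does not confirm for molecules produced by to_rdkit().

```python
def mapping_to_positions(mapping, query_keys, target_keys):
    """Convert {query_key: target_key} into a tuple of 0-based target positions.

    The tuple is ordered by query atom position, i.e. the shape of an
    RDKit-style match. Positions are assumed to follow key iteration order.
    """
    target_pos = {key: pos for pos, key in enumerate(target_keys)}
    return tuple(target_pos[mapping[q]] for q in query_keys)

# Values copied from the output above (benzene query, toluene target).
mapping = {1: 7, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
query_keys = [1, 2, 3, 4, 5, 6]        # stand-in for the query's atom keys
target_keys = [1, 2, 3, 4, 5, 6, 7]    # stand-in for the target's atom keys

positions = mapping_to_positions(mapping, query_keys, target_keys)
print(positions)  # (6, 1, 2, 3, 4, 5)
```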

2. Converting Between RDKit and Chython#

This is the key feature that lets you use both libraries together. Chython provides MoleculeContainer.from_rdkit() and mol.to_rdkit() for seamless conversion.

[10]:
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

2.1 RDKit to Chython#

[11]:
# Parse with RDKit
rdkit_mol = Chem.MolFromSmiles('c1ccc(CC(=O)O)cc1')

# Convert to chython
chython_mol = MoleculeContainer.from_rdkit(rdkit_mol)

print('Type:', type(chython_mol).__name__)
print('Canonical SMILES:', str(chython_mol))
chython_mol
Type: MoleculeContainer
Canonical SMILES: c1ccccc1CC(=O)O
[11]:
(SVG depiction of the converted molecule, phenylacetic acid)

2.2 Chython to RDKit#

[12]:
# Parse with chython
chython_mol = smiles('c1ccc(CC(=O)O)cc1')

# Convert to RDKit
rdkit_mol = chython_mol.to_rdkit()

print('Type:', type(rdkit_mol).__name__)
print('RDKit SMILES:', Chem.MolToSmiles(rdkit_mol))
Type: RWMol
RDKit SMILES: [cH:1]1[cH:2][cH:3][c:4]([CH2:5][C:6](=[O:7])[OH:8])[cH:9][cH:10]1
[13]:
# to_rdkit() options:
# keep_mapping=True (default) - preserves atom map numbers
# keep_hydrogens=True (default) - preserves implicit hydrogen counts

rdkit_no_map = chython_mol.to_rdkit(keep_mapping=False)
print('Without atom mapping:', Chem.MolToSmiles(rdkit_no_map))
Without atom mapping: O=C(O)Cc1ccccc1

2.3 Round-Trip: Stereochemistry and Conformers#

The conversion preserves stereochemistry (tetrahedral and cis/trans) and conformer data.

[14]:
# Stereochemistry round-trip
rdkit_stereo = Chem.MolFromSmiles('C/C=C/C[C@@H](O)F')
chython_stereo = MoleculeContainer.from_rdkit(rdkit_stereo)
print('Chython SMILES:', str(chython_stereo))

# Convert back
rdkit_back = chython_stereo.to_rdkit()
print('RDKit SMILES:  ', Chem.MolToSmiles(rdkit_back))
Chython SMILES: O[C@H](C/C=C/C)F
RDKit SMILES:   [CH3:1]/[CH:2]=[CH:3]/[CH2:4][C@@H:5]([OH:6])[F:7]

2.4 Practical Workflow: Use RDKit Descriptors on Chython Molecules#

[15]:
# Start in chython for substructure analysis
aspirin = smiles('CC(=O)Oc1ccccc1C(=O)O')

# Note: chython-synplan supports recursive SMARTS ($()) and the & operator,
# but does NOT support the RDKit-style X (total connectivity) primitive.
# Use chython's supported primitives instead: D (degree), h (implicit H count), etc.
# RDKit SMARTS: '[CX3](=O)[OX2H1]'
# Chython SMARTS: '[C;D3](=[O])[O;h1]'
carboxylic_acid = smarts('[C;D3](=[O])[O;h1]')

print('Has COOH group:', carboxylic_acid < aspirin)

# Convert to RDKit for descriptors
rdkit_aspirin = aspirin.to_rdkit()
print('Exact mass:', Descriptors.ExactMolWt(rdkit_aspirin))  # monoisotopic mass
print('LogP:', Descriptors.MolLogP(rdkit_aspirin))
print('TPSA:', Descriptors.TPSA(rdkit_aspirin))
Has COOH group: True
Exact mass: 180.042258736
LogP: 1.3101
TPSA: 63.60000000000001
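
Descriptors.ExactMolWt returns the monoisotopic mass, which is why the value above is 180.042 rather than aspirin's average molecular weight of about 180.16 (RDKit reports the latter through Descriptors.MolWt). The arithmetic for aspirin, C9H8O4, can be checked by hand. The isotope masses and atomic weights below are rounded reference values typed in for illustration, not numbers read from RDKit.

```python
# Aspirin, C9H8O4
formula = {'C': 9, 'H': 8, 'O': 4}

# Mass of the most abundant isotope of each element (12C, 1H, 16O), in Da.
monoisotopic = {'C': 12.0, 'H': 1.00782503207, 'O': 15.99491461956}
# Abridged standard atomic weights (isotope-averaged), in g/mol.
average = {'C': 12.011, 'H': 1.008, 'O': 15.999}

exact_mass = sum(n * monoisotopic[el] for el, n in formula.items())
average_mw = sum(n * average[el] for el, n in formula.items())

print(f'{exact_mass:.4f}')  # 180.0423
print(f'{average_mw:.3f}')  # 180.159
```
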
[16]:
# Use RDKit fingerprints on chython molecules
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import numpy as np

mol1 = smiles('c1ccccc1CC(=O)O')  # phenylacetic acid
mol2 = smiles('c1ccccc1CCC(=O)O')  # hydrocinnamic acid

# --- RDKit Morgan fingerprints ---
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp1 = fpgen.GetFingerprint(mol1.to_rdkit())
fp2 = fpgen.GetFingerprint(mol2.to_rdkit())

rdkit_sim = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f'RDKit Morgan Tanimoto:   {rdkit_sim:.3f}')

# --- Chython Morgan fingerprints ---
# Default: min_radius=1, max_radius=4, length=2048
cfp1 = mol1.morgan_fingerprint(max_radius=4, length=2048)
cfp2 = mol2.morgan_fingerprint(max_radius=4, length=2048)

# Tanimoto on binary fingerprints: |A ∩ B| / |A ∪ B|
intersection = np.sum(cfp1 & cfp2)
union = np.sum(cfp1 | cfp2)
chython_sim = intersection / union if union else 0.0
print(f'Chython Morgan Tanimoto: {chython_sim:.3f}')

print(f'\nNote: values differ because the two implementations use different atom invariants and radii.')
RDKit Morgan Tanimoto:   0.565
Chython Morgan Tanimoto: 0.655

Note: values differ because the two implementations use different atom invariants and radii.

Chython also has its own Morgan fingerprint implementation:

fp = mol.morgan_fingerprint(min_radius=1, max_radius=4, length=1024)

For compatibility with existing RDKit-based workflows (trained models, similarity databases), you may prefer to convert to RDKit and use its fingerprints.
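
The Tanimoto formula used above, |A ∩ B| / |A ∪ B|, does not depend on either library. A minimal sketch on sets of on-bit indices follows. The tanimoto helper and the bit sets are hypothetical examples, not fingerprints computed from the molecules above.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Two empty fingerprints share nothing; report 0.0 instead of dividing by zero.
    return len(a & b) / union if union else 0.0

fp_a = {1, 2, 3}
fp_b = {2, 3, 4}

print(tanimoto(fp_a, fp_b))    # 0.5 -> 2 shared bits out of 4 distinct bits
print(tanimoto(fp_a, fp_a))    # 1.0
print(tanimoto(set(), set()))  # 0.0
```

Scores are only comparable when both fingerprints come from the same implementation with the same settings, which is why the RDKit and chython values above differ.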

2.5 Convenience Function#

For quick conversions, chython also provides a module-level from_rdkit() function:

[17]:
from chython import from_rdkit

rdkit_mol = Chem.MolFromSmiles('CCN')
chython_mol = from_rdkit(rdkit_mol)
print(str(chython_mol))
CCN

3. Operation Cheat Sheet#

| Operation | RDKit | Chython |
| --- | --- | --- |
| Parse SMILES | Chem.MolFromSmiles('CCO') | smiles('CCO') |
| Canonical SMILES | Chem.MolToSmiles(mol) | str(mol) |
| Parse SMARTS | Chem.MolFromSmarts('[#6]') | smarts('[#6]') |
| Read SDF | Chem.SDMolSupplier('f.sdf') | SDFRead('f.sdf') (context manager) |
| Write SDF | Chem.SDWriter('out.sdf') | SDFWrite('out.sdf') (context manager) |
| Substructure search | mol.HasSubstructMatch(query) | query < mol |
| Get matches | mol.GetSubstructMatches(query) | list(query.get_mapping(mol)) |
| Aromatize | Chem.SetAromaticity(mol) | mol.thiele() |
| Kekulize | Chem.Kekulize(mol) | mol.kekule() |
| Sanitize | Chem.SanitizeMol(mol) | mol.canonicalize() |
| Add Hs | Chem.AddHs(mol) | mol.explicify_hydrogens() |
| Remove Hs | Chem.RemoveHs(mol) | mol.implicify_hydrogens() |
| Neutralize | rdMolStandardize.Uncharger() | mol.neutralize() |
| Atom access | mol.GetAtomWithIdx(0) (0-based) | mol.atom(n) (arbitrary int key) |
| Atom iteration | for atom in mol.GetAtoms() | for n, atom in mol.atoms() |
| Bond between atoms | mol.GetBondBetweenAtoms(i, j) | mol.bond(i, j) |
| Molecular hash | Via canonical SMILES | hash(mol) |
| Equality check | Chem.MolToSmiles(a) == Chem.MolToSmiles(b) | a == b |
| Combine fragments | Chem.CombineMols(a, b) | a \| b |
| Split fragments | Chem.GetMolFrags(mol, asMols=True) | mol.split() |
| 2D coordinates | AllChem.Compute2DCoords(mol) | mol.clean2d() |
| Check valence | Automatic in SanitizeMol | mol.check_valence() (returns list) |
| Depict | Draw.MolToImage(mol) | mol.depict() or just mol in notebook |
| Run reaction | rxn.RunReactants((mol,)) | reactor((mol,)) (returns generator) |
| Atom-atom mapping | External tools or manual | reaction.reset_mapping() (built-in) |
| Reaction SMILES | AllChem.ReactionToSmiles(rxn) | str(reaction) |
| RDKit convert | | mol.to_rdkit() / MoleculeContainer.from_rdkit(rdkit_mol) |

4. Chython’s Design Focus#

Operator overloading for substructure#

query < mol is concise and readable once you learn the convention.

Direct hashability and equality#

mol1 == mol2, hash(mol), molecules in set() and dict - no need for explicit canonical SMILES conversion.

Built-in atom-to-atom mapping#

Chython includes a neural network-based atom-to-atom mapper accessible via reaction.reset_mapping().

Condensed Graph of Reaction (CGR)#

CGR overlays reactants and products into a single graph where changed bonds carry both their “before” and “after” bond orders. This is a powerful representation for reaction analysis. SynPlanner is built around this concept.

Pythonic API#

  • str(mol) for canonical SMILES

  • hash(mol) for hashing

  • mol1 == mol2 for equality

  • mol1 | mol2 for combining fragments

  • mol.split() for disconnected components

  • Context manager (with mol:) for transactional edits

5. Key Differences in Scope#

The two libraries have different design goals and focus areas:

  • Fingerprints and similarity: RDKit provides many fingerprint types (Morgan, MACCS, atom pairs, topological torsions). Chython has Morgan fingerprints; for other types, convert with .to_rdkit().

  • 3D conformer generation: RDKit includes EmbedMolecule() with ETKDG. Chython delegates 3D tasks to RDKit or CDPKit.

  • Descriptors: RDKit has hundreds of molecular descriptors. Chython provides basic properties (molecular_mass, brutto_formula); for a full descriptor suite, convert to RDKit.

  • SMARTS: The chython-synplan package supports recursive SMARTS ($()) and the & logic operator. Some RDKit-specific primitives like X (total connectivity) are not available — use D (degree) and h (implicit H count) instead.

  • Reactions and CGR: Chython provides built-in atom-to-atom mapping, the CGR representation, and reaction analysis tools. These are its core focus areas.

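  • Mutability and hashing: Chython molecules are mutable and are hashed and compared by structure. RDKit molecules are edited through RWMol and offer no structure-based hash or equality, so deduplication there usually goes through canonical SMILES.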

6. Practical Tip: Using Both Libraries Together#

The recommended workflow for projects using SynPlanner is to use both libraries, converting between them as needed:

  • Use chython for reaction analysis, CGR operations, atom-to-atom mapping, substructure search with operator syntax, molecule hashing/deduplication, and SMARTS-based queries (including recursive SMARTS in chython-synplan).

  • Use RDKit for fingerprint diversity, 3D conformers, molecular descriptors, and integration with tools that expect RDKit objects.

  • Convert freely using from_rdkit() and .to_rdkit().

[18]:
# Example: Combined workflow
# 1. Parse and deduplicate with chython (hashable molecules)
# Kekule benzene stays distinct from aromatic benzene unless thiele() is called (see 1.3)
smiles_list = ['c1ccccc1', 'C1=CC=CC=C1', 'c1ccccc1', 'CCO', 'OCC']
unique_mols = {smiles(s) for s in smiles_list}
print(f'Input: {len(smiles_list)} SMILES -> {len(unique_mols)} unique molecules')

# 2. Use chython for substructure filtering
aromatic_query = smarts('[a]')
aromatic_mols = [m for m in unique_mols if aromatic_query < m]
print(f'Aromatic molecules: {len(aromatic_mols)}')

# 3. Convert to RDKit for descriptor calculation
for mol in unique_mols:
    rdmol = mol.to_rdkit()
    mw = Descriptors.ExactMolWt(rdmol)  # monoisotopic exact mass, not average MW
    logp = Descriptors.MolLogP(rdmol)
    print(f'  {str(mol):20s}  MW={mw:.2f}  LogP={logp:.2f}')
Input: 5 SMILES -> 3 unique molecules
Aromatic molecules: 1
  CCO                   MW=46.04  LogP=-0.00
  c1ccccc1              MW=78.05  LogP=1.69
  C1=CC=CC=C1           MW=78.05  LogP=1.69
[19]:
# Example: Batch processing with RDKit parsing and chython deduplication
raw_smiles = ['c1ccccc1', 'C1=CC=CC=C1', 'c1ccc(O)cc1', 'Oc1ccccc1']

# Parse with RDKit (e.g., to use its error handling)
rdkit_mols = [Chem.MolFromSmiles(s) for s in raw_smiles]
rdkit_mols = [m for m in rdkit_mols if m is not None]  # filter failures

# Convert to chython for deduplication
chython_mols = {MoleculeContainer.from_rdkit(m) for m in rdkit_mols}
print(f'{len(raw_smiles)} SMILES -> {len(rdkit_mols)} valid -> {len(chython_mols)} unique')

for mol in chython_mols:
    print(f'  {str(mol)}')
4 SMILES -> 4 valid -> 2 unique
  c1cc(ccc1)O
  c1ccccc1
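
The two deduplication examples differ only in how molecules are parsed and whether they are normalized before hashing. That pattern can be factored into a small helper. The sketch below is library-agnostic: dedupe, parse and normalize are hypothetical names, and the demonstration uses a toy parser (sorted atom letters), not a chemical one.

```python
def dedupe(smiles_list, parse, normalize=None):
    """Return unique parsed molecules in first-seen order.

    parse: callable turning a SMILES string into a hashable object.
    normalize: optional callable applied to each object BEFORE it is hashed
               (called for its side effect, e.g. in-place aromatization).
    """
    unique = {}
    for s in smiles_list:
        mol = parse(s)
        if normalize is not None:
            normalize(mol)
        unique.setdefault(mol, s)   # keep the first SMILES seen per molecule
    return list(unique)

# Toy parser for illustration only: the sorted atom letters of the string.
def toy_parse(s):
    return tuple(sorted(s))

result = dedupe(['CCO', 'OCC', 'CCN'], parse=toy_parse)
print(result)  # [('C', 'C', 'O'), ('C', 'C', 'N')]
```

With chython, the call would presumably look like dedupe(smiles_list, parse=smiles, normalize=lambda m: m.thiele()), which, going by section 1.3, should merge Kekule and aromatic benzene. That call is not run here.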

Summary#

The key takeaways for RDKit users:

  1. Atoms have integer keys (not 0-based indices) - use mol.atom(n) and for n, atom in mol.atoms().

  2. Molecules are mutable and hashable - great for deduplication, but be careful with mutation in hash-based collections.

  3. str(mol) gives canonical SMILES - no need for Chem.MolToSmiles(mol).

  4. query < mol for substructure search - reads as “query is a substructure of mol”.

  5. thiele() = aromatize, kekule() = kekulize - same concept, different names.

  6. mol.to_rdkit() and MoleculeContainer.from_rdkit(rdkit_mol) - convert freely between the two.

  7. CGR - chython’s representation for encoding reaction transformations as single graphs.