IIT Mandi - GNN molecular olfaction
Full implementation deep-dive on the molecular olfaction GNN - the chemistry data pipeline with RDKit, the message-passing architecture in PyTorch Geometric, the training loop, the experiments that did not work, and the lessons about GNNs that hold up two years later.
The problem
A molecule's smell is an emergent property of its 3D structure and chemistry. Cinnamaldehyde smells like cinnamon. Vanillin smells like vanilla. The chemistry of smell maps molecular structure to a high-dimensional space of perceptual descriptors - fruity, citrus, woody, floral, musky, and so on, around 100 descriptors in the standard datasets.
The task we were working on - given a molecule (represented as a graph of atoms and bonds), predict the probability of each odor descriptor. Multi-label classification with around 100 outputs. The lab was reproducing parts of recent published work and trying to extend it to a different dataset.
The data pipeline
The input is a SMILES string. SMILES (Simplified Molecular Input Line Entry System) is a text representation of a molecule - "CC(=O)Oc1ccccc1C(=O)O" is aspirin. The first job is to parse SMILES into a graph and extract features.
I used RDKit, the standard cheminformatics library in Python. The flow -
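    from rdkit import Chem

    smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        raise ValueError(f"could not parse SMILES: {smiles}")

    # extract_node_features / extract_edge_features are the project's own helpers.
    # Per atom: atom.GetAtomicNum(), GetFormalCharge(), GetHybridization()...
    node_features = [extract_node_features(atom) for atom in mol.GetAtoms()]
    # Per bond: bond.GetBondType(), GetIsAromatic(), IsInRing()...
    edge_features = [extract_edge_features(bond) for bond in mol.GetBonds()]

Node features per atom -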
- Atomic number, one-hot encoded across the elements that appear in the dataset (mostly C, H, N, O, S, F, Cl, Br, P with a few rare ones).
- Formal charge.
- Hybridization (sp, sp2, sp3, etc.), one-hot.
- Number of attached hydrogens.
- Aromaticity (boolean).
- Ring membership (boolean).
- Chirality (R, S, none).
Edge features per bond -
- Bond type (single, double, triple, aromatic), one-hot.
- Conjugation (boolean).
- Ring membership (boolean).
- Stereochemistry where defined.
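The two feature lists above can be assembled into fixed-length vectors roughly as follows. This is a dependency-free sketch, not the project's code: `AtomInfo`, `one_hot`, `atom_features` and the three vocabularies are hypothetical names chosen for illustration, and the extra "unseen" slot in each one-hot is an assumption about how rare elements might be handled. In the real pipeline the raw values would be read off RDKit atom objects.

```python
from dataclasses import dataclass

# Hypothetical vocabularies; the real ones depend on the dataset.
ELEMENTS = [6, 1, 7, 8, 16, 9, 17, 35, 15]  # C, H, N, O, S, F, Cl, Br, P
HYBRIDIZATIONS = ["SP", "SP2", "SP3"]
CHIRALITY = ["R", "S", "none"]


def one_hot(value, vocabulary):
    """One-hot encode `value`, with a final slot for anything unseen."""
    vec = [0.0] * (len(vocabulary) + 1)
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vec[index] = 1.0
    return vec


@dataclass
class AtomInfo:
    """Plain stand-in for the values read off an RDKit atom."""
    atomic_num: int
    formal_charge: int
    hybridization: str
    num_hs: int
    is_aromatic: bool
    in_ring: bool
    chirality: str = "none"


def atom_features(atom: AtomInfo) -> list:
    """Concatenate the per-atom features into one flat vector."""
    return (
        one_hot(atom.atomic_num, ELEMENTS)
        + [float(atom.formal_charge)]
        + one_hot(atom.hybridization, HYBRIDIZATIONS)
        + [float(atom.num_hs)]
        + [float(atom.is_aromatic), float(atom.in_ring)]
        + one_hot(atom.chirality, CHIRALITY)
    )


# Aromatic ring carbon carrying one hydrogen, as in benzene.
carbon = AtomInfo(6, 0, "SP2", 1, True, True)
vector = atom_features(carbon)
print(len(vector), vector[:3])
```

Bond features follow the same pattern with a bond-type vocabulary and the conjugation and ring flags.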
I packed these into a PyTorch Geometric Data object - x for node features, edge_index for the (source, dest) pairs, edge_attr for edge features, y for the multi-label target vector.
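One detail worth spelling out about `edge_index`: PyTorch Geometric stores it as a 2 x num_edges array of directed pairs, so an undirected bond is normally listed once in each direction, with its features duplicated in `edge_attr`. A minimal sketch with plain lists in place of tensors; `bonds_to_edge_index` is a hypothetical helper, and the four-slot bond-type vector is only an illustration.

```python
def bonds_to_edge_index(bonds, bond_features):
    """Expand undirected bonds into directed (source, dest) pairs.

    bonds:         list of (i, j) atom-index pairs, one per bond
    bond_features: list of feature vectors, one per bond
    Returns (edge_index, edge_attr) where edge_index is [sources, dests].
    """
    sources, dests, edge_attr = [], [], []
    for (i, j), feats in zip(bonds, bond_features):
        # Each bond appears twice so messages can flow both ways.
        sources += [i, j]
        dests += [j, i]
        edge_attr += [feats, feats]
    return [sources, dests], edge_attr


# Heavy atoms of ethanol, SMILES "CCO": bonds C0-C1 and C1-O2, both single.
single = [1.0, 0.0, 0.0, 0.0]
edge_index, edge_attr = bonds_to_edge_index([(0, 1), (1, 2)], [single, single])
print(edge_index)
```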
A dataset class wrapped the whole thing and cached the processed graphs on disk so I did not re-parse SMILES every epoch. With a few thousand molecules this saved several minutes per training run.
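The caching idea can be sketched without any graph library. This is an assumption-heavy illustration, not the dataset class from the project: `load_or_build` and `fake_build` are hypothetical names, and keying the cache file on a hash of the SMILES list is one possible design, chosen so that a changed dataset triggers a rebuild.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def load_or_build(smiles_list, build_graph, cache_dir):
    """Return processed graphs, reusing a pickle on disk when one exists.

    `build_graph` is whatever turns one SMILES string into a graph object.
    """
    key = hashlib.sha256("\n".join(smiles_list).encode("utf-8")).hexdigest()[:16]
    cache_path = Path(cache_dir) / f"graphs_{key}.pkl"
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    graphs = [build_graph(s) for s in smiles_list]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(graphs, fh)
    return graphs


# Stand-in for the expensive SMILES -> graph step; records each call.
calls = []

def fake_build(smiles):
    calls.append(smiles)
    return {"smiles": smiles}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(["CCO", "CC=O"], fake_build, tmp)   # parses and writes
    second = load_or_build(["CCO", "CC=O"], fake_build, tmp)  # reads the cache

print(len(calls))
```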
The model
The model was a standard message-passing GNN with three layers. Pseudocode-ish for the forward pass -
    import torch.nn.functional as F
    from torch_geometric.nn import global_mean_pool

    def forward(self, x, edge_index, edge_attr, batch):
        h = self.atom_embed(x)                 # initial node embeddings
        for layer in self.mp_layers:
            h = layer(h, edge_index, edge_attr)
            h = F.relu(h)
            h = self.dropout(h)
        h_graph = global_mean_pool(h, batch)   # mean over nodes in each graph
        out = self.mlp_head(h_graph)           # per-descriptor logits
        return out

Each message-passing layer followed the standard scheme - for every node, aggregate the feature vectors of its neighbors (weighted by edge features), apply a learned transformation, add the result back to the node's own features. After three layers, each node "knows" about its 3-hop neighborhood, which is plenty for molecules where most behavior is captured by local chemistry.
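To make the aggregate, transform, add-back step and the 3-hop claim concrete, here is a toy version in plain Python. It is a sketch under simplifying assumptions: edge features are reduced to one scalar weight per edge, the learned transformation is replaced by an identity function, and `message_passing_layer` is a hypothetical name rather than a PyTorch Geometric class.

```python
def message_passing_layer(h, edge_index, edge_weight, transform):
    """One round of: aggregate weighted neighbor features, transform, add back.

    h:           list of node feature vectors
    edge_index:  [sources, dests], directed edges
    edge_weight: one scalar per edge (a stand-in for edge-feature weighting)
    transform:   function applied to the aggregated vector (the learned part)
    """
    dim = len(h[0])
    aggregated = [[0.0] * dim for _ in h]
    for src, dst, w in zip(edge_index[0], edge_index[1], edge_weight):
        for k in range(dim):
            aggregated[dst][k] += w * h[src][k]
    # Residual update: new features = old features + transformed messages.
    return [
        [own + msg for own, msg in zip(h[i], transform(aggregated[i]))]
        for i in range(len(h))
    ]


# Path graph 0 - 1 - 2 - 3; only node 0 starts with a non-zero feature.
edge_index = [[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]]
weights = [1.0] * 6
h = [[1.0], [0.0], [0.0], [0.0]]
identity = lambda v: v

for layer in range(3):
    h = message_passing_layer(h, edge_index, weights, identity)
    reached = [i for i, vec in enumerate(h) if vec[0] != 0.0]
    print(f"after layer {layer + 1}: signal has reached nodes {reached}")
```

Each pass moves the signal from node 0 exactly one bond further along the chain, which is the sense in which K layers give a K-hop receptive field.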
I experimented with three message-passing variants - GCN (graph convolutional, simplest), GIN (graph isomorphism network, more expressive), and GAT (graph attention, learns weighting). At our dataset scale, all three landed within 1-2 percent of each other on ROC-AUC. The literature says GIN is provably more expressive than GCN - in our regime, it did not matter.
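The expressiveness argument is usually illustrated with neighbor multisets: sum aggregation (which GIN uses) separates neighborhoods that mean-style aggregation (roughly what GCN's normalisation does) collapses to the same value. A toy sketch with scalar features; the two helper functions are hypothetical and ignore everything else the real layers do.

```python
def mean_aggregate(neighbor_features):
    return sum(neighbor_features) / len(neighbor_features)


def sum_aggregate(neighbor_features):
    return sum(neighbor_features)


# Two different neighborhoods with identical neighbor features:
# one neighbor versus three, e.g. a carbon bonded to one oxygen or to three.
one_neighbor = [1.0]
three_neighbors = [1.0, 1.0, 1.0]

mean_same = mean_aggregate(one_neighbor) == mean_aggregate(three_neighbors)
sum_same = sum_aggregate(one_neighbor) == sum_aggregate(three_neighbors)

print("mean tells them apart:", not mean_same)
print("sum tells them apart: ", not sum_same)
```

Whether that extra discriminative power shows up in a metric depends on the data, which matches what happened here.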
Hidden dim 128, three layers, dropout 0.3 on the MLP head. The MLP head was two hidden layers (128, 64) and an output layer of 100 logits (one per descriptor).
Training
Loss - binary cross-entropy with logits, per-descriptor, summed. Multi-label setup, no softmax.
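For reference, the loss for one molecule written out in plain Python. This is a sketch of the standard numerically stable formulation, not the training code; `bce_with_logits` and `multilabel_loss` are hypothetical names, and the three-descriptor example is made up. In PyTorch the same thing would typically be `BCEWithLogitsLoss` with a sum reduction.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit.

    Equivalent to -[t * log(sigmoid(z)) + (1 - t) * log(1 - sigmoid(z))].
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multilabel_loss(logits, targets):
    """Sum the independent per-descriptor losses for one molecule."""
    return sum(bce_with_logits(z, t) for z, t in zip(logits, targets))


# Three descriptors, say fruity / woody / musky. Labels are independent,
# so several can be 1 at once and no softmax is involved.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
loss = multilabel_loss(logits, targets)
print(round(loss, 4))
```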
Optimizer - AdamW with weight decay 1e-4, learning rate 1e-3 with cosine annealing over 100 epochs.
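The learning-rate curve this implies can be written in closed form. A minimal sketch, assuming a single cosine cycle with no warm restarts and a floor of zero (the `min_lr` value is my assumption; the post does not state it). `cosine_lr` is a hypothetical helper, not a library function.

```python
import math


def cosine_lr(epoch, base_lr=1e-3, total_epochs=100, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing without restarts."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [cosine_lr(e) for e in (0, 25, 50, 75, 100)]
print(schedule)
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the floor at the final epoch.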
Batch size - 64 graphs per batch. PyG handles batching cleverly by combining the graphs into one large disconnected graph with a batch vector pointing each node to its origin graph. Mean pooling per graph then uses the batch vector.
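The batching trick is easier to see on two tiny graphs. The sketch below mimics the idea with plain lists: node indices of later graphs are shifted by an offset, and the batch vector records which graph each node came from. `collate` and `mean_pool` are hypothetical stand-ins for what PyTorch Geometric's data loader and `global_mean_pool` do on tensors.

```python
def collate(graphs):
    """Merge small graphs into one disconnected graph plus a batch vector.

    Each graph is a dict with "x" (node feature vectors) and
    "edge_index" ([sources, dests] using graph-local node indices).
    """
    x, sources, dests, batch = [], [], [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        x += g["x"]
        # Shift node indices so they stay unique in the merged graph.
        sources += [s + offset for s in g["edge_index"][0]]
        dests += [d + offset for d in g["edge_index"][1]]
        batch += [graph_id] * len(g["x"])
        offset += len(g["x"])
    return x, [sources, dests], batch


def mean_pool(x, batch, num_graphs):
    """Average node vectors per graph, using the batch vector as the key."""
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for vec, g in zip(x, batch):
        counts[g] += 1
        for k in range(dim):
            sums[g][k] += vec[k]
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]


g1 = {"x": [[1.0], [3.0]], "edge_index": [[0, 1], [1, 0]]}
g2 = {"x": [[2.0], [4.0], [6.0]], "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]}
x, edge_index, batch = collate([g1, g2])
pooled = mean_pool(x, batch, num_graphs=2)
print(batch, edge_index, pooled)
```

Because no edge crosses between the two components, message passing on the merged graph gives the same result as running each molecule separately.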
Eval - 80/10/10 train/val/test split. Stratification on label co-occurrence was tried, did not move results. Eval metric was per-descriptor ROC-AUC averaged across descriptors.
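The metric can be sketched in a few lines. One assumption to flag: with around 100 descriptors and a small test split, some descriptors may have no positives at all, where ROC-AUC is undefined; the sketch skips those, which is one common choice and not necessarily what the project did. `roc_auc` and `macro_roc_auc` are hypothetical helpers, and the scores and labels are made up.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(score_matrix, label_matrix):
    """Average per-descriptor AUC, skipping descriptors where it is undefined."""
    num_descriptors = len(label_matrix[0])
    aucs = []
    for d in range(num_descriptors):
        auc = roc_auc([row[d] for row in score_matrix],
                      [row[d] for row in label_matrix])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Four molecules, three descriptors. The last descriptor has no positives
# in this split, so it is skipped rather than counted.
scores = [[0.9, 0.2, 0.1], [0.8, 0.6, 0.3], [0.3, 0.7, 0.2], [0.1, 0.4, 0.4]]
labels = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
macro = macro_roc_auc(scores, labels)
print(macro)
```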
Early stopping on validation ROC-AUC with patience 15. Best checkpoint saved.
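The stopping rule itself is a small piece of bookkeeping. A minimal sketch, assuming "improvement" means strictly higher validation ROC-AUC; `EarlyStopping` is a hypothetical class, the validation curve is made up, and the patience is shortened to 3 so the example is short.

```python
class EarlyStopping:
    """Track the best validation score and stop after `patience` flat epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def step(self, epoch, score):
        """Record one epoch. Returns True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch  # this is where a checkpoint would be saved
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation curve: improves, then plateaus.
val_auc = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(val_auc):
    if stopper.step(epoch, score):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)
```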
The results
I will not put exact numbers because I do not want to misremember them two years later. The shape - my model reproduced the published baseline ROC-AUC within a percentage point. That was the goal for the reproduction phase. We did not get to a meaningful extension before I rotated off.
The experiments that did not work
I tried a number of extensions in the last month. None gave a clear win.
- Attention pooling instead of mean pooling. Marginal change, within noise.
- Deeper networks (5, 7 layers). Started to over-smooth - node features became too similar across the graph, hurt accuracy.
- Larger hidden dim (256, 512). Helped marginally but increased overfitting on our small dataset.
- Learned edge embeddings (more expressive edge features). Marginal.
- Auxiliary tasks (predict molecular weight, logP). Did not help main task.
The pattern - at our dataset scale (a few thousand molecules), the model was not the bottleneck. The dataset was. The literature pointed the same way - more recent olfaction papers focused on dataset construction and active learning, not architecture changes.
The chemistry sidebar
I had not taken chemistry seriously since high school. Working on this project, I had to learn enough chemistry to talk to the PhD student about why certain functional groups predict certain smells (esters are often fruity, sulfur compounds often go onion or rotten, aromatic rings often go floral or sweet). I never became a chemist. I did learn enough to read a paper without getting stuck on terminology.
The lesson - if you work on a domain-specific ML problem, you have to learn the domain. Not all of it, but enough to read the literature and have a conversation with the domain expert without slowing them down. I did not always do this well that summer. Two years later, doing it well is non-negotiable for me on any domain-specific work.
What I would do differently
I would have spent more time on the dataset. I treated the dataset as a given and optimized the model. The PhD student kept pointing this out - "the model is fine, are the labels right?" I would have run more analysis on label quality, descriptor co-occurrence, and per-molecule disagreement among human raters before touching the model again.
I would have built a simpler baseline first. I went straight to a GNN because the project was a GNN project. A logistic regression on Morgan fingerprints (a classic chemistry-inspired bag-of-features representation) would have been a 30-minute baseline that told me how much the GNN was actually helping. I built that baseline in week 6. Should have been week 1.
What this taught me about GNNs
The mental model that stuck -
- A GNN is repeated local aggregation. Each layer lets every node see one hop further out. After K layers, each node summarizes its K-hop neighborhood.
- Mean pooling collapses the graph into a fixed-size vector. This is the bridge between "graph" and "MLP head".
- The choice of message-passing scheme (GCN vs GIN vs GAT) matters less than people claim, at most scales.
- The biggest design choice is the features. Choose features that capture what you think matters about your nodes and edges.
- Over-smoothing is real. Deep GNNs make all nodes look the same. Three to five layers is usually the sweet spot.
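Over-smoothing can be reproduced without training anything. The sketch below is a toy, assuming pure mean aggregation with no learned weights or nonlinearity: each pass replaces a node's value with the average over itself and its neighbors, and the spread between nodes shrinks with depth. `mean_aggregate_step` and `spread` are hypothetical helpers.

```python
def mean_aggregate_step(values, neighbors):
    """Replace each node value with the mean over itself and its neighbors."""
    return [
        (values[i] + sum(values[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(values))
    ]


def spread(values):
    """Gap between the largest and smallest node value."""
    return max(values) - min(values)


# Six-atom ring (benzene-like) with a distinct starting value on every node.
neighbors = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

spreads = [spread(values)]
for _ in range(7):
    values = mean_aggregate_step(values, neighbors)
    spreads.append(spread(values))

print([round(s, 3) for s in spreads])
```

Real layers have learned weights and residual connections that slow this down, but the direction of the effect is the same.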
That mental model has transferred. When I read GNN papers now, I evaluate them with these axes in mind, and I can usually predict whether a proposed architecture change will matter at the dataset's scale.
What this taught me about research
Most ideas do not work. That is the headline. You will try ten things and one will be a clear win, two will be ambiguous, seven will not move the needle. The research workflow is built around managing that ratio - reading widely so you pick better ideas to try, running experiments cheaply so you can try more, evaluating honestly so you do not fool yourself when a result is noise.
I am an engineer, not a researcher. But the discipline of "most of your ideas will not work, evaluate honestly, move on quickly" transferred directly into engineering. Most of the optimizations I tried in the deck pipeline rewrite did not work. The eval set told me honestly, I moved on. That habit started at IIT Mandi.
Two years on, the GNN code is stale. The habits are not.
Learn more
- RDKit docs
- Distill - understanding GNNs (article)
- PyTorch docs