IIT Mandi - GNN molecular olfaction
Implemented a message-passing GNN in PyTorch Geometric to predict odor descriptors from molecular graphs, reproducing the published baseline on a public olfaction dataset.
I built a graph neural network for molecular olfaction at IIT Mandi in summer 2024. The task: predict perceptual odor descriptors (fruity, woody, citrus, etc.) from a molecule's structure. The model: a 3-layer message-passing GNN that took the molecular graph and output multi-label probabilities. The result: I reproduced the published baseline and did not extend it.
The pipeline started with SMILES strings (a compact text representation of molecules, such as "CC(=O)Oc1ccccc1C(=O)O" for aspirin). RDKit parsed each SMILES into a molecule object. I extracted node features per atom (element, charge, hybridization, aromaticity, hydrogen count) and edge features per bond (single/double/triple/aromatic, ring membership, conjugation). Those features, together with the bond connectivity, became a PyTorch Geometric Data object.
The model architecture: 3 message-passing layers with hidden dim 128 and GCN-style aggregation (plus some experiments with GIN and GAT), mean pooling over the node embeddings to get a graph-level representation, and a small MLP head for the multi-label output. I trained with AdamW and BCE-with-logits loss, and used per-descriptor ROC-AUC as the eval metric.
Two technical lessons stood out. First, at this scale the choice of node features mattered more than the choice of message-passing scheme: GCN, GIN, and GAT all landed within 1-2 percent of each other, while better features moved results more. Second, the dataset was small (a few thousand molecules), so regularization was critical: dropout of 0.3 on the MLP head, weight decay in AdamW, and early stopping on validation ROC-AUC.
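The early-stopping logic can be sketched in plain Python. Everything here is illustrative: the `EarlyStopping` class, the patience value, and the validation scores are made up, and averaging per-descriptor ROC-AUC into one number is an assumption about how the stopping signal was formed. The small `roc_auc` helper is a dependency-free stand-in for a library routine such as scikit-learn's `roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when only one class is present,
    because the metric is undefined in that case.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(true_by_descriptor, score_by_descriptor):
    """Average ROC-AUC over descriptors, skipping undefined ones."""
    aucs = [roc_auc(t, s) for t, s in zip(true_by_descriptor, score_by_descriptor)]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)


class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, score, epoch):
        """Record a validation score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation scores standing in for per-epoch mean ROC-AUC.
val_scores = [0.70, 0.74, 0.78, 0.77, 0.775, 0.76, 0.79]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(val_scores):
    if stopper.step(score, epoch):
        stopped_at = epoch
        break
```

With these scores, training stops at epoch 5 and the best checkpoint is from epoch 2. The later 0.79 is never reached, which is the usual trade-off of a short patience on a noisy metric.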
What did not work: I tried a few extensions (attention pooling, deeper networks, learned edge embeddings) without a clear win. At the dataset scale we had, the gains were indistinguishable from noise. With a larger dataset they might have mattered. We did not have a larger dataset.
The honest summary in an interview: "I implemented a standard message-passing GNN that reproduced the published baseline. I learned the PyTorch Geometric ecosystem, how to build a chemistry data pipeline, and how to defend my choices in a research setting. I did not invent a new architecture and I would not claim I did."
Learn more
- Docs: RDKit docs
- Article: Distill - understanding GNNs