Liu, Shengchao and Nie, Weili and Wang, Chengpeng and Lu, Jiarui and Qiao, Zhuoran and Liu, Ling and Tang, Jian and Xiao, Chaowei and Anandkumar, Anima (2022) Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. (Unpublished) https://resolver.caltech.edu/CaltechAUTHORS:20230316-153807969
PDF (Submitted Version), 19MB. See Usage Policy.
Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20230316-153807969
Abstract
There is increasing adoption of artificial intelligence in drug discovery. However, existing works mainly apply machine learning to the chemical structures of molecules and ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions, and predict complex biological activities. We present a multi-modal molecule structure-text model, MoleculeSTM, that jointly learns molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct the largest multi-modal dataset to date, namely PubChemSTM, with over 280K chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions: structure-text retrieval and molecule editing. MoleculeSTM possesses two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM achieves state-of-the-art generalization to novel biochemical concepts across various benchmarks.
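The abstract does not spell out the contrastive objective, so the following is only an illustrative sketch of what jointly learning structure and text embeddings can look like. It assumes a symmetric InfoNCE-style loss over paired embeddings and nearest-neighbor lookup for zero-shot retrieval; the function names, the temperature value, and the toy two-dimensional embeddings are hypothetical and are not taken from MoleculeSTM itself.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(struct_embs, text_embs):
    """Cosine similarity between every structure embedding and every text embedding."""
    s = [l2_normalize(v) for v in struct_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(si, tj)) for tj in t] for si in s]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_info_nce(struct_embs, text_embs, temperature=0.1):
    """Assumed objective: pull matched (structure, text) pairs together and
    push mismatched pairs apart, in both retrieval directions."""
    sims = similarity_matrix(struct_embs, text_embs)
    logits = [[x / temperature for x in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def retrieve(struct_emb, text_embs):
    """Zero-shot retrieval: index of the text embedding closest to the structure."""
    sims = similarity_matrix([struct_emb], text_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy embeddings standing in for encoder outputs (hypothetical values).
structures = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(structures, aligned_texts)
loss_shuffled = symmetric_info_nce(structures, shuffled_texts)
best_match = retrieve([1.0, 0.1], shuffled_texts)
```

In this toy setup the loss is small when each structure embedding lines up with its own description and large when the pairs are shuffled, which is the signal a contrastive objective of this kind trains on. Retrieval then reduces to picking the most similar text embedding for a given structure.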
| Item Type: | Report or Paper (Discussion Paper) |
|---|---|
| Record Number: | CaltechAUTHORS:20230316-153807969 |
| Persistent URL: | https://resolver.caltech.edu/CaltechAUTHORS:20230316-153807969 |
| Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided. |
| ID Code: | 120091 |
| Collection: | CaltechAUTHORS |
| Deposited By: | George Porter |
| Deposited On: | 16 Mar 2023 19:00 |
| Last Modified: | 16 Mar 2023 19:00 |