Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

Liu, Shengchao and Nie, Weili and Wang, Chengpeng and Lu, Jiarui and Qiao, Zhuoran and Liu, Ling and Tang, Jian and Xiao, Chaowei and Anandkumar, Anima (2022) Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. (Unpublished) https://resolver.caltech.edu/CaltechAUTHORS:20230316-153807969

Abstract

Artificial intelligence is increasingly adopted in drug discovery. However, existing machine learning approaches mainly use the chemical structures of molecules and ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions, and predict complex biological activities. We present a multi-modal molecule structure-text model, MoleculeSTM, which jointly learns molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct the largest multi-modal dataset to date, namely PubChemSTM, with over 280K chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions: structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM achieves state-of-the-art generalization to novel biochemical concepts across various benchmarks.
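
The abstract names a contrastive learning strategy and zero-shot structure-text retrieval but does not spell out the objective. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over paired structure and text embeddings, which is a common choice for this kind of model and may differ from what the paper actually uses. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the MoleculeSTM code.

```python
import math

# Illustrative sketch only: the loss form (symmetric InfoNCE), the
# temperature, and all names below are assumptions, not the paper's code.

def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(struct_embs, text_embs):
    """Cosine similarity between every structure and every text embedding."""
    s = [l2_normalize(v) for v in struct_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(si, tj)) for tj in t] for si in s]

def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(struct_embs, text_embs, temperature=0.1):
    """Average of structure-to-text and text-to-structure contrastive losses.

    Pair i (struct_embs[i], text_embs[i]) is the positive; all other
    combinations in the batch act as negatives.
    """
    sim = similarity_matrix(struct_embs, text_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))

def retrieve(struct_emb, candidate_text_embs):
    """Zero-shot retrieval: index of the text most similar to the structure."""
    sims = similarity_matrix([struct_emb], candidate_text_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)

if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    structures = [[1.0, 0.0], [0.0, 1.0]]
    texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
    texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_info_nce(structures, texts_aligned))
    print(symmetric_info_nce(structures, texts_swapped))
    print(retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]]))
```

In this toy setup the loss is small when each structure embedding points the same way as its paired text embedding and large when the pairs are swapped. Retrieval uses the same similarity scores to rank candidate descriptions.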


Item Type: Report or Paper (Discussion Paper)
Related URLs:
  URL                               URL Type   Description
  http://arxiv.org/abs/2212.10789   arXiv      Discussion Paper
ORCID:
  Author               ORCID
  Liu, Shengchao       0000-0003-2030-2367
  Qiao, Zhuoran        0000-0002-5704-7331
  Xiao, Chaowei        0000-0002-7043-4926
  Anandkumar, Anima    0000-0002-6974-6797
Record Number: CaltechAUTHORS:20230316-153807969
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20230316-153807969
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 120091
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 16 Mar 2023 19:00
Last Modified: 16 Mar 2023 19:00