Webinar: Powering Molecular Transformers with High Quality Data
In digital chemistry, machine learning applications have a single point of failure: data quality.
The primary impediment to the widespread and successful application of machine learning in day-to-day laboratory operations is data quality. Machine learning has stringent requirements, and incorrect data can degrade the performance of trained prediction models and, ultimately, the end-user experience.
To effectively train prediction models, historical chemical data must meet extremely broad and stringent quality standards. To begin with, the data must be correct, properly labeled, and deduplicated. Furthermore, complex inference tasks, such as the prediction of retrosynthetic routes or experimental procedures, require not just more data, but data that is more diverse and detailed, free of bias across the entire range of inputs for which the predictive models are being developed.
IBM Research has been developing data-driven chemistry solutions based on language models for over four years. The team has relied on chemical reaction records derived from patents and has come to appreciate both the benefits and drawbacks of these freely accessible databases. Science of Synthesis (sos.thieme.com) and Synfacts (both maintained by Thieme) provide an unprecedented level of human curation among commercially available datasets, establishing them as the gold standard for chemical reaction records.
During this webinar, the IBM Research and Thieme teams will disclose the outcome of their collaborations. The teams will compare the performance of language models trained on the highest-quality commercially available datasets (Science of Synthesis and Synfacts) to that of publicly available patent reaction records, with a specific focus on retrosynthetic and chemical prediction tasks.
Seven eminent synthetic chemistry experts from China, Germany, New Zealand, Switzerland, and the USA (Prof. Margaret Brimble, Prof. Cristina Nevado, Prof. Alois Fürstner, Prof. Karl Gademann, Prof. Ang Li, Prof. Richmond Sarpong, and Prof. Dirk Trauner), together with their groups, provided insightful feedback to IBM and Thieme during this collaboration, creating a unique forum for exchange between machine learning experts and the synthetic organic chemistry community. Their valuable work will also be illustrated.
Min. 00:00-00:05 – Welcome/Introduction
Min. 00:05-00:15 – SOS/SF: high quality chemical knowledge records
Min. 00:15-00:25 – Chemistry and Language: basics of the AI models
Min. 00:25-00:45 – Training models with Science of Synthesis and Synfacts: data, models and performance
Min. 00:45-01:00 – Panel + Q&A
Teodoro Laino is a distinguished research scientist at IBM Research — Europe, Switzerland. He received his PhD from the University of Pisa and Scuola Normale Superiore di Pisa, working in the group of Michele Parrinello on the development of QM/MM algorithms for quantum chemical applications. Teo joined IBM Research in 2008, working on materials modelling and high performance computing. Since 2017, Teo has led the IBM Research activities in the space of AI and ML for Chemistry and Materials.
Alain Vaucher is a research scientist at IBM Research — Europe, Switzerland. He studied Chemistry at ETH Zurich and completed his PhD in the group of Markus Reiher, also at ETH Zurich, where his research concentrated on interactive approaches for quantum chemistry. After his PhD, Alain devoted his research to applications of artificial intelligence in chemistry, first at BenevolentAI, and then at IBM Research.
Philippe Schwaller is a post-doctoral researcher at IBM Research — Europe, Switzerland. He received a bachelor’s and master’s degree in Materials Science and Engineering from EPFL and an MPhil in Physics at the University of Cambridge. Philippe joined IBM in 2017 and completed his PhD in Chemistry and Molecular Sciences at the University of Bern in 2021, working on machine learning for accelerating the discovery and synthesis of novel molecules and materials. In February 2022, Philippe will join the Institute of Chemical Sciences and Engineering at EPFL as a Tenure Track Assistant Professor.
Fiona Shortt de Hernandez has a PhD in organic chemistry, completed under the supervision of Professor M. S. Baird at the University of Wales, Bangor, UK and Professor John R. Falck at the University of Texas Southwestern Medical Center at Dallas, USA. She also holds an MBA from the University of Warwick – Warwick Business School, UK. Her extensive publishing experience includes work on digital chemistry databases at Chapman and Hall Ltd., London (now CRC Press, Taylor and Francis). She is currently at Thieme Chemistry as Senior Director, Product Management, Strategic Partnerships and Science of Synthesis/Head of Operations, Thieme China.
Sascha Hausberg received his PhD in inorganic chemistry from the University of Heidelberg in 2011. After a brief postdoc period in Heidelberg, he started a career in software development with the cheminformatics company InfoChem GmbH in Munich. From 2016 onwards, he was involved in project management, and he joined the executive team as Prokurist in 2017. He was promoted to interim managing director in 2019. Sascha joined Thieme Chemistry as Director eProducts and RÖMPP in May 2021, taking over responsibility for the RÖMPP online encyclopedia and bringing his cheminformatics expertise to the development of the Thieme Chemistry portfolio.
Interested in using authoritative Science of Synthesis data?