Automating Scientific Discovery with Open Data

Andrew White






FutureHouse
CZI Open Science
October 2025

FutureHouse Structure

  • Non-profit founded in 2023
  • Funded primarily by Eric Schmidt
  • Based in San Francisco
  • 25 employees

Science is changing independent of AI


arXiv.org; 10.6084/m9.figshare.17064419.v3

Number of Researchers is Growing

International R&D spending

PhD Researchers


NSF - https://ncses.nsf.gov/pubs/nsf24332; UNESCO UISI SDG9

Intellectual bottlenecks are growing


📝 Increasing paper count (≈10M per year)

🧬 Larger data sets from cheaper experiments (genome at $200 per person, $1 / GB of sequencing)

🔍 95% decline in disruptive papers since 1980

Park, M. et al. Nature 613, 138-144 (2023); Scannell, J.W. et al. Nat. Rev. Drug Discov. 11, 191–200 (2012); Deloitte 2025: Pharma innovation returns.

FutureHouse Mission


Accelerate Scientific Discovery

What is an agent?

Agent: trained, makes decisions

Environment: untrained, has tools, state
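
The split above can be sketched in a few lines of Python. This is a toy illustration only, not FutureHouse code: the `Environment` class, the `toy_agent` function, and the two lambda tools are hypothetical names, and a fixed rule stands in for the trained policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Environment:
    """Untrained side: holds state and exposes tools (hypothetical names)."""
    tools: Dict[str, Callable[[str], str]]
    state: List[str] = field(default_factory=list)

    def step(self, tool_name: str, argument: str) -> str:
        # Run the chosen tool and record its observation in the state.
        observation = self.tools[tool_name](argument)
        self.state.append(observation)
        return observation


def toy_agent(task: str, state: List[str]) -> Tuple[str, str]:
    """Stand-in for the trained side: a fixed rule, not a learned policy."""
    if not state:
        return "search", task
    if len(state) == 1:
        return "summarize", state[-1]
    return "done", ""


def run_episode(task: str, env: Environment, max_steps: int = 5) -> List[str]:
    # The agent decides; the environment executes and updates its state.
    for _ in range(max_steps):
        tool_name, argument = toy_agent(task, env.state)
        if tool_name == "done":
            break
        env.step(tool_name, argument)
    return env.state


env = Environment(tools={
    "search": lambda query: f"papers about {query}",
    "summarize": lambda text: f"summary of {text}",
})
trajectory = run_episode("PD-L1 binders", env)
```

In this sketch only `toy_agent` would be replaced by a trained model; the tools and the state bookkeeping stay fixed.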

Protein Design Environment

  • Protein design with 5 existing deep learning models
  • Molecular dynamics, bioinformatics, literature research agent
  • Input: "design 92 binders for PD-L1"

Wet lab validation

Learning vs Frontier Models

Crows

Name                      Environment                       Key Tools
Crow/PaperQA              Literature Research               Search, Citation Traversal
ProteinCrow               Designing novel proteins          AlphaFold2, Molecular Dynamics
ChemCrow/Phoenix          Designing new molecules           Retrosynthesis, self-driving robotic lab
Data analysis crow/Finch  Generating discoveries from data  bioinformatics tools, code, file system

Agent vs ML Model

Modify surface residues of IL-10 to increase expression and solubility in E. coli without disrupting dimerization or receptor interaction.


Automating research of scientific literature


Language agents achieve superhuman synthesis of scientific knowledge

Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, Andrew D. White arXiv:2409.13740, 2024

Evaluating

Overexpression studies of PRMT4 in SW480 UPF1 knockout cells show that which arginine residue in PRMT4 is important for asymmetric di-methylation of UPF1 R433?

Better at answering questions than PhD biology experts

Improving over time

Better than human-written Wikipedia articles

Can be used to check for precedent and disagreement in literature

FutureHouse Platform

  • Free, with rate limits
  • API - can be incorporated into your pipeline/agents
  • Majority of code is open source
  • Major updates coming soon!
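
Because the platform is rate limited, a pipeline that calls it will likely want to retry politely. The following is a minimal sketch of exponential backoff, assuming a hypothetical `submit` callable and a hypothetical `RateLimited` exception; neither is taken from the FutureHouse client, whose actual interface is not shown on these slides.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error a client might raise when a rate limit is hit."""


def submit_with_backoff(submit: Callable[[str], str], query: str,
                        max_attempts: int = 5, base_delay: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Call submit(query), doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return submit(query)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Demo with a fake submit function that is rate limited twice, then succeeds.
calls = {"count": 0}


def flaky_submit(query: str) -> str:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited()
    return f"answer to: {query}"


delays = []  # record the waits instead of actually sleeping
result = submit_with_backoff(flaky_submit, "example question",
                             sleep=delays.append)
```

Passing `sleep` as a parameter keeps the demo instant and makes the retry schedule easy to inspect.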

API

Literature Research Agent Scale
  • Tasks per minute: 300
  • Research papers: 100,000,000
  • Wiki page for all diseases: every 7 hours
  • All arXiv papers per week: 30,000 papers / month
  • Check for contradictions: 6.3M papers / year
  • All Wikipedia: every 3 weeks
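
To give a feel for the quoted throughput, the arithmetic can be worked through as below. This is a rough sketch that assumes one task per paper, which the slide does not state; real jobs may need many tasks per item, so the results are lower bounds on wall-clock time rather than the slide's own figures.

```python
TASKS_PER_MINUTE = 300  # figure quoted on the slide


def minutes_to_process(n_tasks: int,
                       tasks_per_minute: int = TASKS_PER_MINUTE) -> float:
    """Minutes needed for n_tasks at a constant rate."""
    return n_tasks / tasks_per_minute


# Raw capacity per day at the quoted rate.
tasks_per_day = TASKS_PER_MINUTE * 60 * 24

# Assumption: one task per paper (not stated on the slide).
arxiv_month_minutes = minutes_to_process(30_000)
contradiction_days = minutes_to_process(6_300_000) / (60 * 24)

print(tasks_per_day, arxiv_month_minutes, round(contradiction_days, 1))
```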

Model intelligence will continue to increase

Complete cycle of disease to mechanism to target to drug

ROBIN: A Multi-Agent System for Automating Scientific Discovery

Ali Essam Ghareeb*, Benjamin Chang*, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White†, Michaela M. Hinks‡, Samuel G. Rodriques

Conclusions

  1. AI builds on open data, but that is breaking down
  2. Open source and publishers are not ready for AI
  3. The way we do science is changing rapidly

Questions