Tags: Embeddings · Knowledge Graph · Self-Supervised Learning · Science of Science · Barlow Twins · VICReg · PubMed · OpenAlex · Python · Neo4j

Scientific Knowledge Mapping

Semantic Cartography of Biomedical Science

Using novel self-supervised learning to map the entire landscape of biomedical knowledge beyond citations and impact factors


Scientific Knowledge Mapping: Revealing the Hidden Connections in Science

Science is built on connections. Every discovery links to previous work, inspires new questions, and opens unexpected avenues. Yet, these connections often remain hidden in the silos of individual papers, databases, and researcher expertise.

Scientific Knowledge Mapping aims to make these connections visible, searchable, and actionable by creating a semantic atlas of all biomedical knowledge.

The Challenge: A Fragmented Scientific Landscape

Modern biological research is scattered across an impossibly complex ecosystem:

  • 📄 35+ million papers in PubMed alone
  • 🗄️ Hundreds of specialized databases (UniProt, KEGG, Gene Ontology, etc.)
  • 🧪 Proprietary datasets locked in individual laboratories
  • 🧠 Tacit knowledge residing only in researchers' minds

The Hidden Cost

This fragmentation has real consequences:

Duplicate Experiments: Researchers unknowingly repeat work already done elsewhere, wasting time and resources.

Missed Connections: Revolutionary insights often emerge from connecting distant fields, but these bridges remain invisible.

Slow Hypothesis Generation: Scientists spend countless hours searching for relevant prior work instead of doing science.

Research Waste: A widely cited estimate (Chalmers & Glasziou, 2009) puts avoidable waste at roughly 85% of biomedical research investment; poor knowledge integration is one contributor.

Our Approach: Semantic Understanding Through Self-Supervised Learning

We're creating the first comprehensive semantic map of life sciences literature using a fundamentally new approach to embedding generation.

Beyond Traditional Methods: Learning What Makes Science "Science"

Traditional embedding models rely on contrastive learning—showing the model pairs of similar and dissimilar examples and asking it to distinguish between them. This is like teaching someone what a dog is by showing them dogs and not-dogs.

We're taking inspiration from neuroscience instead.

The Barlow Twins Principle

In the brain, neurons learn efficient representations by reducing redundancy: different neurons come to encode different, statistically independent features of the world. This efficient-coding principle, proposed by the neuroscientist Horace Barlow, is what gives the Barlow Twins method its name and its core idea.

We adapt this principle to machine learning, building on two foundational papers:

  • Barlow Twins (Zbontar et al., 2021): Self-supervised learning through redundancy reduction
  • VICReg (Bardes et al., 2021): Variance-Invariance-Covariance Regularization for robust representations
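To make the redundancy-reduction idea concrete, here is a minimal NumPy sketch of the Barlow Twins objective as described in Zbontar et al. (2021). This is an illustration, not the project's actual training code: the loss pushes the cross-correlation matrix between two views of the same batch toward the identity, so views agree (diagonal → 1) while embedding dimensions decorrelate (off-diagonal → 0).

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins objective on two (N, D) batches of embeddings,
    one per augmented view of the same N documents."""
    n = z1.shape[0]
    # Standardize each dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = (z1.T @ z2) / n                                   # (D, D) cross-correlation
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)             # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)   # redundancy-reduction term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
# Identical views: the invariance term vanishes; only residual
# cross-dimension correlation contributes to the loss.
print(barlow_twins_loss(z, z))
```

Note there are no negative pairs anywhere: the decorrelation term alone prevents the collapse that contrastive methods avoid with negative sampling.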

The Key Innovation

Instead of telling our model "these papers are similar, those are different," we let it discover the fundamental features that distinguish different types of scientific knowledge. Each dimension of our embedding space learns to encode a meaningful aspect of scientific content—methodology, disease focus, molecular mechanisms, experimental approaches.

This is a more holistic approach to understanding language and meaning, avoiding the biases introduced by manual positive/negative sampling and letting the mathematics reveal the natural structure of scientific knowledge.

Current Reality: A Work in Progress

Transparency in Science: Our preliminary benchmarking shows that classical negative sampling methods still outperform our approach on standard downstream tasks. We're iterating on the architecture and training procedure to close this gap while preserving the theoretical advantages of redundancy reduction.

This is cutting-edge research—messy, uncertain, but potentially transformative.

The Semantic Atlas of Science

Using our embeddings, we're mapping the entire landscape of biomedical knowledge from PubMed and OpenAlex. This creates a semantic space where:

Every paper has coordinates based on its meaning, not its citations or impact factor.

Similar concepts cluster together regardless of terminology, journal, or field.

Distances reflect semantic relationships between ideas, approaches, and discoveries.
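"Distance" in such a space is typically measured as cosine similarity between embedding vectors. A toy sketch with made-up 4-dimensional vectors (real document embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings: 1 = same direction,
    ~0 = orthogonal (semantically unrelated)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative "paper embeddings" only; not outputs of any real model.
crispr_screen  = np.array([0.9, 0.1, 0.0, 0.2])
gene_editing   = np.array([0.8, 0.2, 0.1, 0.1])
ecology_survey = np.array([0.0, 0.1, 0.9, 0.3])

print(cosine_similarity(crispr_screen, gene_editing))    # high: related topics
print(cosine_similarity(crispr_screen, ecology_survey))  # low: distant topics
```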

Reimagining Scientific Geography

In this new map, we can visualize science as having "countries" (disciplines), "cities" (research topics), and "landmarks" (seminal works). But unlike traditional bibliometric maps:

Journals become regions defined by the semantic space they cover, revealing overlap and unique niches.

Institutions appear as clouds showing their collective research focus and how it evolves over time.

Researchers are represented by the average embedding of their work, making expertise truly searchable.

Funding programs map to territories they support, revealing gaps and redundancies.
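The researcher representation above can be sketched in a few lines: average a researcher's paper embeddings into a profile, then rank profiles against a query embedding. Names and vectors here are entirely hypothetical.

```python
import numpy as np

def researcher_profile(paper_embeddings):
    """A researcher's position on the map: the mean of their papers' embeddings."""
    return np.mean(paper_embeddings, axis=0)

def rank_experts(query, profiles):
    """Rank researchers by cosine similarity of profile to a query embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(profiles, key=lambda name: cos(query, profiles[name]), reverse=True)

profiles = {
    "dr_immunology": researcher_profile(np.array([[0.9, 0.1, 0.0],
                                                  [0.8, 0.2, 0.1]])),
    "dr_neuro":      researcher_profile(np.array([[0.1, 0.9, 0.1],
                                                  [0.0, 0.8, 0.2]])),
}
query = np.array([0.85, 0.15, 0.05])   # e.g. embedding of an immunology proposal
print(rank_experts(query, profiles))
```

In practice one might weight recent papers more heavily, but the mean is the simplest baseline.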

Breaking Free from Citation Bias

Traditional science mapping relies on citations—who cites whom. But citations are:

  • Endogamous: People cite their own field, missing cross-disciplinary connections
  • Impact-biased: High-impact journals get cited more, regardless of semantic relevance
  • Slow: It takes years for citation networks to form
  • Incomplete: Fundamental work in obscure journals may never be "discovered"

Our semantic approach gives equal weight to all work, letting the content speak for itself.

Transformative Applications

1. Unbiased Peer Review Assignment

Match papers and grant proposals to the most semantically relevant reviewers, not just those in the same citation network. Find the perfect referee even if they've never worked in the "same field" by traditional definitions.

2. Bias Detection in Science

Reveal systematic gaps in funding, publication, or attention. Which research areas are semantically connected but have zero cross-citations? Where do funding programs overlap unnecessarily? Which perspectives are systematically excluded?
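The first of those questions can be phrased directly as a query over the map: find paper pairs that sit close together semantically but are never linked by citation. A toy sketch (embeddings and citation pairs are illustrative, and the 0.8 threshold is an arbitrary assumption):

```python
import numpy as np

def missing_bridges(embeddings, citations, threshold=0.8):
    """Pairs of papers with cosine similarity above `threshold`
    but no citation link in either direction."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ids = list(embeddings)
    bridges = []
    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            linked = (p, q) in citations or (q, p) in citations
            if not linked and cos(embeddings[p], embeddings[q]) > threshold:
                bridges.append((p, q))
    return bridges

# Toy data: two near-identical papers in different fields, never cross-cited.
embeddings = {
    "onco_paper":    np.array([0.90, 0.10, 0.20]),
    "neuro_paper":   np.array([0.85, 0.15, 0.25]),
    "ecology_paper": np.array([0.00, 0.90, 0.10]),
}
citations = {("onco_paper", "ecology_paper")}
print(missing_bridges(embeddings, citations))  # the oncology-neuro pair surfaces
```

At corpus scale the O(n²) pair loop would be replaced by an approximate nearest-neighbour index, but the query itself stays this simple.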

3. Evolution of Knowledge Tracking

Follow how concepts emerge, merge, split, and transform over time. See the birth of new fields before the terminology even stabilizes. Identify paradigm shifts as they happen.

4. Cross-Domain Discovery

Find hidden connections between distant fields. The treatment developed in oncology that could revolutionize neurodegenerative disease. The statistical method from ecology that solves a proteomics problem. These connections exist—we just need to see them.

5. Integration with SourceData-NLP

By connecting our semantic knowledge map with SourceData-NLP's entity-level understanding, we create a powerful discovery engine:

  • Drug repurposing: Find compounds studied in one disease that might work in semantically related conditions
  • Protein interaction prediction: Identify likely interactions based on semantic context of existing literature
  • Hypothesis generation: Suggest novel experiments by bridging semantic gaps in the knowledge graph

6. Research Strategy Optimization

Help institutions and funders understand their research portfolio not through administrative categories, but through the actual semantic space they occupy. Identify genuine innovation vs. incremental work.

The Technical Architecture

Our pipeline processes millions of papers through several stages:

1. Document Processing: Extract full text, abstracts, and metadata from PubMed and OpenAlex

2. Embedding Generation: Transform each document into a dense vector using our Barlow Twins-inspired model

3. Knowledge Graph Construction: Connect documents through semantic similarity, entity co-occurrence (from SourceData-NLP), and citation relationships

4. Dimensional Analysis: Interpret what each embedding dimension encodes through systematic perturbation studies

5. Interactive Exploration: Enable researchers to navigate the knowledge space through natural language queries and visual interfaces
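Stage 4 is the least standard of these, so a sketch may help. One simple perturbation probe: zero out a single embedding dimension everywhere and see how a paper's nearest neighbours change; a dimension whose removal reshuffles the neighbourhood carries meaningful signal for that paper. All values below are toy data, not trained embeddings.

```python
import numpy as np

def nearest(query, corpus, k=2):
    """Return the k nearest papers to `query` by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sorted(corpus, key=lambda pid: cos(query, corpus[pid]), reverse=True)[:k]

def perturb_dimension(query, corpus, dim, k=2):
    """Zero out one dimension in query and corpus, then recompute neighbours.
    A large change in the neighbour list suggests `dim` encodes a feature
    that matters for this query's semantics."""
    zap = lambda v: np.where(np.arange(len(v)) == dim, 0.0, v)
    return nearest(zap(query), {p: zap(v) for p, v in corpus.items()}, k)

corpus = {
    "methods_paper": np.array([1.0, 0.0, 0.1]),
    "disease_paper": np.array([0.1, 1.0, 0.1]),
    "mixed_paper":   np.array([0.7, 0.7, 0.1]),
}
query = np.array([0.9, 0.2, 0.1])
print(nearest(query, corpus))                    # baseline neighbourhood
print(perturb_dimension(query, corpus, dim=0))   # dim 0 removed: ranking shifts
```

Repeating this over many queries and dimensions yields a profile of what each dimension contributes, which is the interpretability the pipeline aims for.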

What Makes This Different

Semantic-First: Content matters more than citations or venue prestige

Unbiased: No manual curation of "similar" and "dissimilar" examples

Comprehensive: Entire biomedical literature, not just recent or high-impact papers

Interpretable: Dimensions have meaning, not just mathematical correlations

Integrated: Connects with entity-level knowledge from SourceData-NLP

Open: Methods, models, and tools will be freely available

Evolving: The map grows and improves as new research is published

Current Status & Future Directions

This work is actively ongoing at EMBL Heidelberg. We're:

  • Refining the embedding model architecture to improve benchmark performance
  • Scaling to the full PubMed and OpenAlex corpus (35M+ papers)
  • Developing interactive visualization tools for exploring the knowledge space
  • Building the integration layer with SourceData-NLP
  • Conducting user studies with researchers, funders, and publishers

Publication Timeline: We're preparing an application paper for submission in Q1 2026, which will introduce the methodology, benchmark results, and initial applications.

An Endless World of Applications

Scientific Knowledge Mapping opens doors we're only beginning to imagine:

  • Personalized literature recommendations that understand your research trajectory
  • Automated research gap analysis for funding agencies
  • Predictive models of which fields are likely to merge or spawn new subdisciplines
  • Equity analysis revealing which communities and perspectives are systematically excluded
  • Education tools that help students navigate the conceptual landscape of their field
  • Science policy insights grounded in the actual structure of knowledge, not administrative categories

The map is not the destination—it's the tool that helps us navigate more wisely, discover more readily, and connect more meaningfully.

Join the Journey

This project represents a fundamental rethinking of how we organize and navigate scientific knowledge. While the work is still in progress, we're committed to transparency about both successes and challenges.

The goal isn't just to build a better search engine—it's to reveal the deep structure of scientific understanding and make those insights actionable for discovery.

Stay Updated

  • Project Status: In development at EMBL Heidelberg
  • Expected Publication: Q1 2026
  • Code & Models: Will be open-sourced upon publication

The map of science should reflect the geography of ideas, not the sociology of citations. We're drawing that map, one embedding at a time.