Clustering More Than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches


This project aims to provide a highly accurate interactive map of medical research that can be easily used by both technical and non-technical users. Most science maps in use today are small in scale and have not been validated. Accurate decisions require high-quality, high-coverage data; well-defined and tested data analysis workflows; and a resulting representation that matches the visual perception and cognitive processing capabilities of human users.

Phase I of this project compares and determines the relative accuracies of maps of medical research based on commonly used text-based and citation-based similarity measures at a scale of over two million documents.


The project is led by SciTech Strategies Inc. in collaboration with the Cyberinfrastructure for Network Science Center at Indiana University. There are subcontracts to different researchers and one company. The full team comprises:

  • Kevin W. Boyack, Richard Klavans, SciTech Strategies Inc.
  • Katy Börner, Russell J. Duhon, Nianli Ma, Indiana University
  • Bob Schijvenaars, Aaron Sorensen, Collexis Holdings Inc.
  • André Skupin, San Diego State University

The following people, although not part of the formal team, will also contribute to the project:

  • Edmund Talley, National Institutes of Health
  • Dave Newman, University of California, Irvine

Please cite as: Boyack, Kevin W., David Newman, Russell Jackson Duhon, Richard Klavans, Michael Patek, Joseph R. Biberstine, Bob Schijvenaars, André Skupin, Nianli Ma, and Katy Börner. 2011. "Clustering More Than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches". PLoS ONE 6(3): 1-11.


All work will be documented in real time, at a level of detail that supports exact replication. All subcontractors will have access to this documentation as well as to intermediate data results. While the Scopus data cannot be made available, all Medline-based derivative data will be made freely available from this page. Data compilation and statistics are documented in STS-Documentation.pdf.

Raw data

List of PMIDs sts-pmids.txt.gz (4.9MB)
List of stop words sts-stop-words.txt.gz
List of PMIDs with titles and abstracts pmid-title-abstr1.txt.gz (538MB)
List of PMIDs with titles and abstracts pmid-title-abstr2.txt.gz (528MB)
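
The raw data files above are gzip-compressed text, so they can be streamed without unpacking them to disk first. The following is a minimal sketch for the PMID list only. It assumes one integer PMID per line, which is an assumption rather than a documented layout; the actual file formats are described in STS-Documentation.pdf and should be checked there. The function names `read_pmids` and `count_pmids` are hypothetical helpers introduced for illustration.

```python
import gzip
from typing import Iterator


def read_pmids(path: str) -> Iterator[int]:
    """Yield PMIDs from a gzip-compressed text file, one at a time.

    ASSUMPTION: the file holds one integer PMID per line. Verify the
    real layout against STS-Documentation.pdf before relying on this.
    """
    # "rt" opens the compressed stream in text mode, so lines are str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            token = line.strip()
            if token:  # skip blank lines
                yield int(token)


def count_pmids(path: str) -> int:
    """Count PMIDs without loading the whole file into memory."""
    return sum(1 for _ in read_pmids(path))
```

Under that assumption, calling `count_pmids("sts-pmids.txt.gz")` on a downloaded copy would report the number of documents in the corpus. Streaming line by line matters more for the larger title and abstract files, which are over 500MB each when compressed.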

Analysis Input Data

Title/Abstract term adjacency list sts-text-adj.gz (642MB)
MeSH adjacency list sts-mesh-adj.gz (98MB)

Term Frequency Data

Title/Abstract term frequencies sts-text-freq.gz (17MB)
MeSH term frequencies sts-mesh-freq.gz (211KB)

Analysis results will also be made available from this site. They will comprise:

Analysis Result Data

Linkage-based analysis

Co-citation sts-cocite-sim.gz sts-cocite-clust.gz (56MB)
Bibliographic coupling sts-bibcoup-topn.sim.gz (121MB) sts-bibcoup-clust.gz (9.1MB)
Direct citation sts-directcit-topn.sim.gz (67MB) sts-direct-clust.gz (8.8MB)

Title/Abstract term analysis

Co-occurrence sts-TA-co-topn.sim.gz (212MB) sts-TA-co-clust.gz (8.4MB)
LSA sts-TA-lsa-topn.sim.gz (194MB) sts-TA-lsa-clust.gz (8.9MB)
Topic model (UCI) sts-TA-topics-uci.sim.gz (117MB) sts-TA-topics-clust.gz (9.4MB)
Collexis sts-TA-collx-topn.sim.gz (146MB) sts-TA-collx-clust.gz (9.3MB)

MeSH analysis

Co-occurrence sts-mesh-co.sim.gz (155MB) sts-mesh-co-clust.gz (9.5MB)
LSA sts-mesh-lsa-topn.sim.gz (198MB) sts-mesh-lsa-clust.gz (10MB)
Self-organizing maps (SOM) sts-mesh-som.sim sts-mesh-som.clust.gz (9.4MB)
Collexis sts-mesh-collx.sim.gz (149MB) sts-mesh-collx-clust.gz (9.3MB)

Other analysis

NCBI related records data sts-ncbi-topn.sim.gz (115MB) sts-ncbi-clust.gz (9.4MB)


Bib coupling coherence result bc-lev1-coh.gz (387KB)
Co-citation coherence result cc-lev1-coh.gz (379KB)
Direct citation coherence result dc-lev1-coh.gz (590KB)
Co-word MeSH coherence result co-mesh-lev1-coh.gz (290KB)
LSA MeSH coherence result lsa-mesh-lev1_coh.gz (294KB)
SOM MeSH coherence result som-mesh-lev1-coh.gz (346KB)
Collexis MeSH coherence result collx-mesh-lev1-coh.gz (314KB)
Co-word TA coherence result co-ta-lev1-coh.gz (251KB)
LSA TA coherence result lsa-ta-lev1-coh.gz (280KB)
NCBI coherence result ncbi-lev1-coh.gz (340KB)
Collexis TA coherence result collx-ta-lev1-coh.gz (340KB)
Topics TA coherence result topic-ta-lev1.gz (284KB)


This project is funded by NIH SBIR Contract HHSN268200900053C.

Indiana University's Big Red supercomputer used in this study is supported by the National Science Foundation under Grant No. ACI-0338618l, OCI-0451237, OCI-0535258, and OCI-0504075. This research was supported in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative of Indiana University is supported in part by Lilly Endowment, Inc. This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
