Clustering More Than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Abstract:

This project aims to provide a highly accurate interactive map of medical research that can be easily used by both technical and non-technical users. Most science maps in use today are small in scale and have not been validated. Accurate decisions require high-quality, high-coverage data; well-defined and tested data analysis workflows; and a resulting representation that matches the visual perception and cognitive processing capabilities of human users.

Phase I of this project compares and determines the relative accuracies of maps of medical research based on commonly used text-based and citation-based similarity measures at a scale of over two million documents.

Team

The project is led by SciTech Strategies Inc. in collaboration with the Cyberinfrastructure for Network Science Center at Indiana University. There are subcontracts to several researchers and one company. The full team comprises:

Kevin W. Boyack, Richard Klavans, SciTech Strategies Inc.
Katy Börner, Russell J. Duhon, Nianli Ma, Indiana University
Bob Schijvenaars, Aaron Sorensen, Collexis Holdings Inc.
André Skupin, San Diego State University

The following people, although not part of the formal team, will also contribute to the project.

Edmund Talley, National Institutes of Health
Dave Newman, University of California, Irvine

Please cite as: Boyack, Kevin W., David Newman, Russell Jackson Duhon, Richard Klavans, Michael Patek, Joseph R. Biberstine, Bob Schijvenaars, André Skupin, Nianli Ma, and Katy Börner. 2011. "Clustering More Than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches". PLoS ONE 6(3): 1-11.

Datasets

All work will be documented in real time and at a level of detail that supports exact replication. All subcontractors will have access to this documentation as well as to intermediate data results. While the Scopus data cannot be made available, all Medline-based derivative data will be made freely available from this page. Data compilation and statistics are documented in STS-Documentation.pdf.

Raw data

List of PMIDs	sts-pmids.txt.gz (4.9MB)
List of stop words	sts-stop-words.txt.gz
List of PMIDs with titles and abstracts	pmid-title-abstr1.txt.gz (538MB)
List of PMIDs with titles and abstracts	pmid-title-abstr2.txt.gz (528MB)

Analysis Input Data

Title/Abstract term adjacency list	sts-text-adj.gz (642MB)
MeSH adjacency list	sts-mesh-adj.gz (98MB)

Term Frequency Data

Title/Abstract term frequencies	sts-text-freq.gz (17MB)
MeSH term frequencies	sts-mesh-freq.gz (211KB)
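
The frequency files are gzip-compressed and can be read as a stream without unpacking them to disk. The following is a minimal sketch only: the line layout (one term and one integer count per line, separated by a tab) is an assumption and should be checked against STS-Documentation.pdf, and the function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def parse_term_frequencies(lines: Iterable[str]) -> Counter:
    """Parse 'term<TAB>count' lines into a Counter.

    ASSUMPTION: one term and one integer count per line, tab-separated.
    The real layout of the sts-*-freq.gz files may differ.
    """
    freqs: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        term, _, count = line.partition("\t")
        freqs[term] += int(count)
    return freqs


def load_term_frequencies(path: str) -> Counter:
    """Stream a gzip-compressed frequency file line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_term_frequencies(handle)
```

If the assumed layout holds, load_term_frequencies("sts-mesh-freq.gz").most_common(10) would list the ten most frequent MeSH terms.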

Analysis results will also be made available from this site. They will comprise:

Analysis Result Data

Linkage-based analysis
Co-citation	sts-cocite-sim.gz	sts-cocite-clust.gz (56MB)
Bibliographic coupling	sts-bibcoup-topn.sim.gz (121MB)	sts-bibcoup-clust.gz (9.1MB)
Direct citation	sts-directcit-topn.sim.gz (67MB)	sts-direct-clust.gz (8.8MB)

Title/Abstract term analysis
Co-occurrence	sts-TA-co-topn.sim.gz (212MB)	sts-TA-co-clust.gz (8.4MB)
LSA	sts-TA-lsa-topn.sim.gz (194MB)	sts-TA-lsa-clust.gz (8.9MB)
Topic model (UCI)	sts-TA-topics-uci.sim.gz (117MB)	sts-TA-topics-clust.gz (9.4MB)
Collexis	sts-TA-collx-topn.sim.gz (146MB)	sts-TA-collx-clust.gz (9.3MB)

MeSH analysis
Co-occurrence	sts-mesh-co.sim.gz (155MB)	sts-mesh-co-clust.gz (9.5MB)
LSA	sts-mesh-lsa-topn.sim.gz (198MB)	sts-mesh-lsa-clust.gz (10MB)
Self-organizing maps (SOM)	sts-mesh-som.sim	sts-mesh-som.clust.gz (9.4MB)
Collexis	sts-mesh-collx.sim.gz (149MB)	sts-mesh-collx-clust.gz (9.3MB)

Other analysis
NCBI related records data	sts-ncbi-topn.sim.gz (115MB)	sts-ncbi-clust.gz (9.4MB)
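
Several of the similarity files above carry "topn" in their names, which suggests that only the strongest similarities per document are kept. As an illustration of that idea, here is a generic sketch that scores documents by cosine similarity over term counts and keeps the n highest-scoring neighbours of each one. It is not the project's pipeline: the choice of cosine, the tie-breaking rule, and all function names are assumptions, and the brute-force pairwise loop is only workable for small examples, not for two million documents.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def top_n_similar(
    docs: Dict[str, Counter], n: int
) -> Dict[str, List[Tuple[str, float]]]:
    """For each document, keep its n most similar other documents.

    Pairs with zero similarity are dropped. Ties are broken by
    document id (an arbitrary choice made for this sketch).
    """
    result: Dict[str, List[Tuple[str, float]]] = {}
    for doc_id, vec in docs.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in docs.items()
            if other != doc_id
        ]
        scored = [(other, s) for other, s in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[doc_id] = scored[:n]
    return result
```

Here docs maps a document identifier such as a PMID to a Counter of term counts; a real analysis at this scale would rely on an inverted index or sparse matrix operations rather than comparing all pairs.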

Validation
Bib coupling coherence result	bc-lev1-coh.gz (387KB)
Co-citation coherence result	cc-lev1-coh.gz (379KB)
Direct citation coherence result	dc-lev1-coh.gz (590KB)
Co-word MeSH coherence result	co-mesh-lev1-coh.gz (290KB)
LSA MeSH coherence result	lsa-mesh-lev1_coh.gz (294KB)
SOM MeSH coherence result	som-mesh-lev1-coh.gz (346KB)
Collexis MeSH coherence result	collx-mesh-lev1-coh.gz (314KB)
Co-word TA coherence result	co-ta-lev1-coh.gz (251KB)
LSA TA coherence result	lsa-ta-lev1-coh.gz (280KB)
NCBI coherence result	ncbi-lev1-coh.gz (340KB)
Collexis TA coherence result	collx-ta-lev1-coh.gz (340KB)
Topics TA coherence result	topic-ta-lev1.gz (284KB)

Acknowledgements

This project is funded by NIH SBIR Contract HHSN268200900053C.

Indiana University's Big Red supercomputer used in this study is supported by the National Science Foundation under Grant Nos. ACI-0338618l, OCI-0451237, OCI-0535258, and OCI-0504075. This research was supported in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative of Indiana University is supported in part by Lilly Endowment, Inc. This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).