with Tom Dean and Ed Boyden
How can one gain a working understanding of an unfamiliar field as quickly as possible? While some researchers have developed personal strategies for learning the structure of entire fields, better web-based tools could make such learning far easier. The citation graph of the scientific literature reveals community and topic structure that can be extracted automatically, yet no full citation graph of the literature is publicly available. We are therefore exploring whether meaningful information about scientific semantics and topic structure can be inferred purely from publicly available data, such as paper abstracts.
As an example of what we are looking for, a simple PubMed search for "neuroscience" returned around 150,000 papers. We wrote a short script to extract the abstract for each of the corresponding PMIDs, then ran these abstracts through Word2Vec and LDA topic modeling, both made straightforward by the Python package Gensim.
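One way to implement this extraction step is with Biopython's Entrez module (our choice of Biopython here is an assumption; the original script is not shown). The sketch below searches PubMed for a term, then fetches abstracts in batches, since NCBI requests should be kept to a reasonable size:

```python
def batched(items, size):
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_abstracts(term, email, retmax=1000):
    """Fetch PubMed abstracts matching `term` (a sketch, not the original script)."""
    # Biopython is imported lazily so the chunking helper above
    # remains usable even where Biopython is not installed.
    from Bio import Entrez, Medline
    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    pmids = Entrez.read(handle)["IdList"]
    handle.close()
    abstracts = []
    for chunk in batched(pmids, 200):
        handle = Entrez.efetch(db="pubmed", id=",".join(chunk),
                               rettype="medline", retmode="text")
        for rec in Medline.parse(handle):
            if rec.get("AB"):  # "AB" is the Medline abstract field
                abstracts.append(rec["AB"])
        handle.close()
    return abstracts
```

In practice one would also respect NCBI rate limits and cache results to disk, since re-fetching ~150,000 abstracts is slow.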
Word2Vec analysis of even this relatively small corpus extracts meaningful semantic categories of technical words, such as the names of neurotransmitters.
Principal Components Analysis on the Word2Vec-generated vectors reveals clusters that correspond to semantically reasonable topics.
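The projection itself is a standard centered-SVD PCA; a minimal NumPy version (using random vectors as a stand-in for the trained Word2Vec matrix) looks like:

```python
import numpy as np

def pca_project(X, k=2):
    """Project rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center the vectors
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # (n_words, k) coordinates

# Stand-in for a (vocabulary x dimensions) Word2Vec matrix.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(40, 100))
coords = pca_project(word_vectors)               # 2-D coordinates for plotting
```

Plotting `coords` with each point labeled by its word is how the clusters described above become visible.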
LDA topic modeling also reveals coherent research topics, like Alzheimer's disease, circadian clocks, the dopamine system and gap junctions.
Future work will explore more sophisticated language models applied to larger corpora of scientific text. These language models could become useful components of future scientific search, sharing, and analysis tools such as Beagle.