Quantifying textual similarities across scientific research
communities

Duffee, Boyd

Authors

Boyd Duffee



Abstract

There are well-established approaches to text mining collections of documents and to understanding the network of citations between academic papers, but few studies have examined the textual content of the papers that constitute a citation network. A document corpus was obtained from the arXiv repository, selected from papers relating to the subject of Dark Matter, and a citation network was created from the data held by NASA’s Astrophysics Data System on those papers, their citations and references.
I use the Louvain community-finding algorithm on the Dark Matter network to identify groups of papers with a higher density of citations, and compare the textual similarity between papers in the Dark Matter corpus using the Vector Space Model of document representation and the cosine similarity function. Pairs of papers within a citation community were found to have a higher similarity to each other than to papers in other citation communities. This implies that content is associated with structure in scientific citation networks, which opens avenues for research on finding ground truth for network communities using advanced Text Mining techniques, such as Topic Modelling.
It was found that the titles of the papers in a citation network community were a good means of identifying the community. The power law exponent of the degree distribution was found to be 2.3, lower than results reported for other citation networks. The selection of papers based on a single subject, rather than based on a journal or category, is suggested as the reason for this lower value. It was also found that the degree pair correlation of the citation network classifies it as a disassortative network with a cut-off value at degree kc = 30. The textual similarity of documents decreases linearly with age over a 15-year timespan.
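The within-community versus between-community comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the code used in the thesis. It assumes a plain bag-of-words term-frequency vector for each document, whereas the thesis may weight terms differently. The function names, the four toy documents and the hard-coded community labels are all hypothetical; in the study the labels would come from the Louvain algorithm run on the citation network.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Bag-of-words term-frequency vector (an assumed, simplified weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarities(docs, communities):
    """Return (mean within-community, mean between-community) similarity."""
    vectors = {name: term_vector(text) for name, text in docs.items()}
    within, between = [], []
    for p, q in combinations(sorted(docs), 2):
        sim = cosine_similarity(vectors[p], vectors[q])
        if communities[p] == communities[q]:
            within.append(sim)
        else:
            between.append(sim)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within), mean(between)


# Hypothetical toy corpus; the texts and labels are invented for illustration.
docs = {
    "paper_a": "wimp direct detection cross section",
    "paper_b": "wimp detection limits cross section",
    "paper_c": "galaxy rotation curves halo profile",
    "paper_d": "halo profile simulations galaxy",
}
# Assumed community labels, standing in for Louvain output.
communities = {"paper_a": 0, "paper_b": 0, "paper_c": 1, "paper_d": 1}

within, between = mean_similarities(docs, communities)
print(f"within={within:.3f} between={between:.3f}")
```

In this toy corpus the two communities share no vocabulary, so the between-community mean is zero. Real corpora overlap far more, and the difference between the two means would need a statistical test rather than a direct comparison.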

Publicly Available Date Mar 29, 2024
