Duffee, Boyd (2018) Quantifying textual similarities across scientific research communities. Doctoral thesis, Keele University.

Abstract

There are well-established approaches to text mining collections of documents and to understanding the network of citations between academic papers, but few studies have examined the textual content of the papers that constitute a citation network. A document corpus was obtained from the arXiv repository, selected from papers relating to the subject of Dark Matter, and a citation network was created from the data held by NASA’s Astrophysics Data System on those papers, their citations and references.
I use the Louvain community-finding algorithm on the Dark Matter network to identify groups of papers with a higher density of citations, and compare the textual similarity between papers in the Dark Matter corpus using the Vector Space Model of document representation and the cosine similarity function. Pairs of papers within the same citation community were found to be more similar to each other than to papers in other citation communities. This implies that content is associated with structure in scientific citation networks, which opens avenues for research on finding ground truth for network communities using advanced text mining techniques such as Topic Modelling.
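The comparison described above can be illustrated with a minimal sketch. It is not the thesis code: it assumes raw term-frequency vectors (the abstract does not say which term weighting was used), it takes the community labels as given (for example, from a Louvain run), and the toy documents, labels and function names are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Represent a document as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity_by_community(docs, community):
    """Return (mean within-community, mean between-community) similarity.

    docs: paper id -> text; community: paper id -> community label.
    """
    vectors = {pid: tf_vector(text) for pid, text in docs.items()}
    within, between = [], []
    for i, j in combinations(sorted(docs), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if community[i] == community[j]:
            within.append(sim)
        else:
            between.append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(within), mean(between)


# Hypothetical toy corpus and community labels, for illustration only.
toy_docs = {
    "p1": "dark matter halo density profile",
    "p2": "halo density profile simulations",
    "p3": "axion detection cavity experiment",
    "p4": "cavity experiment axion search",
}
toy_community = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}

within_mean, between_mean = mean_similarity_by_community(toy_docs, toy_community)
print(within_mean, between_mean)
```

On a real corpus the vectors would normally be built from full paper text after stop-word removal, and the two means would be compared with an appropriate statistical test rather than by inspection.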
It was found that the titles of papers in a citation network community were a good means of identifying the community. The power law exponent of the degree distribution was found to be 2.3, lower than results reported for other citation networks. The selection of papers based on a single subject, rather than on a journal or category, is suggested as the reason for this lower value. It was also found that the degree pair correlation of the citation network classifies it as a disassortative network with a cut-off value at degree kc = 30. The textual similarity of documents decreases linearly with age over a 15-year timespan.
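As an illustration of how a power law exponent can be estimated from data, the sketch below uses the standard maximum-likelihood estimator for a continuous power law with a fixed lower bound. The abstract does not state which estimation method the thesis used, so this is an assumption, as are the function names and the synthetic sample. Node degrees are integers, so a discrete estimator or a correction would be needed for real degree data, particularly when the lower bound is small.

```python
import math
import random


def powerlaw_exponent_mle(values, x_min):
    """Maximum-likelihood exponent for p(x) ~ x**(-alpha), x >= x_min.

    Uses the continuous-data estimator alpha = 1 + n / sum(ln(x / x_min)).
    """
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no values at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all values equal x_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, x_min, n, seed=0):
    """Draw n synthetic continuous power-law samples by inverse transform."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic data only: check that the estimator recovers a known exponent.
synthetic = sample_powerlaw(alpha=2.3, x_min=1.0, n=20000)
estimate = powerlaw_exponent_mle(synthetic, x_min=1.0)
print(round(estimate, 2))
```

The choice of lower bound strongly affects the estimate, so in practice it is usually selected by a goodness-of-fit criterion rather than fixed by hand.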

Item Type: Thesis (Doctoral)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Natural Sciences > School of Computing and Mathematics
Contributors: Rugg, Gordon (Thesis advisor)
Depositing User: Lisa Bailey
Date Deposited: 30 Jul 2018 10:43
Last Modified: 08 Oct 2020 14:35
URI: https://eprints.keele.ac.uk/id/eprint/5174
