Al-Janabi, Mohammed Fadhil Zamil ORCID: 0000-0003-1630-1900 (2018) Detection of suspicious URLs in online social networks using supervised machine learning algorithms. Doctoral thesis, Keele University.

Abstract

This thesis proposes several supervised machine learning classification models built to detect the distribution of malicious content in online social networks (OSNs). The main focus was on ensemble learning algorithms such as Random Forest, gradient boosting trees, extra trees, and XGBoost. The features used to identify social network posts containing malicious URLs were derived from several sources, such as domain WHOIS records, web page content, URL lexical and redirection data, and Twitter metadata.
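As a rough illustration of what URL lexical features can look like, the sketch below derives a handful of simple counts from a URL string using only the Python standard library. The function name `lexical_features` and the particular features chosen are assumptions made for this example; the abstract does not list the feature set actually used in the thesis.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Return a few simple lexical features of a URL.

    Hypothetical feature set for illustration only; not the thesis's own.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Count non-empty path segments, e.g. "/a/b" has two.
        "num_path_segments": len([s for s in parts.path.split("/") if s]),
        "has_query": bool(parts.query),
        "uses_https": parts.scheme == "https",
    }


features = lexical_features("http://example.com/a/b?id=42")
```

Feature dictionaries of this kind could then be combined with WHOIS, page content, redirection and Twitter metadata features before being passed to a classifier.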

The thesis describes a systematic analysis of the hyper-parameters of tree-based models. The impact of key parameters, such as the number of trees, the depth of trees and the minimum size of leaf nodes, on classification performance was assessed. The results show that controlling the complexity of Random Forest classifiers applied to social media spam is essential to avoid overfitting and optimise performance. Model complexity could be reduced by removing uninformative features, since the complexity they add outweighs their contribution to the model's decisions.
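A hyper-parameter study of this kind might be sketched as follows with scikit-learn. This is a minimal example under stated assumptions: the data is synthetic (generated with `make_classification`), the grid values are arbitrary, and the F1 score is used as the metric purely for illustration. None of these choices are taken from the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 20 features, only 5 informative,
# so most features are uninformative noise.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# The three complexity-related parameters named in the abstract.
param_grid = {
    "n_estimators": [10, 50],      # number of trees
    "max_depth": [3, None],        # depth of trees (None = unlimited)
    "min_samples_leaf": [1, 5],    # minimum size of leaf nodes
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Inspecting `search.cv_results_` shows how the cross-validated score changes across the grid, and `search.best_estimator_.feature_importances_` gives one possible starting point for identifying uninformative features.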

Moreover, two model-combining methods, voting and stacking, were tested. Both have advantages and disadvantages; in general, however, they appear to provide a statistically significant improvement over the best-performing single model. The critical benefit of applying the stacking method to automate the model selection process is that it effectively gives more weight to top-performing models and is less affected by weak ones.
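The difference between the two combining methods can be sketched with scikit-learn's `VotingClassifier` and `StackingClassifier`. The base models, the logistic-regression meta-learner, the synthetic data and all parameter values below are assumptions chosen for a compact example; the abstract does not specify the configuration used in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split into train and test sets.
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=30, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=30, random_state=0)),
]

# Soft voting: average the predicted class probabilities of all base models.
voting = VotingClassifier(estimators=base_models, voting="soft")

# Stacking: a meta-learner is trained on cross-validated base-model
# predictions, so it can learn how much to trust each base model.
stacking = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(), cv=3
)

voting_acc = voting.fit(X_train, y_train).score(X_test, y_test)
stacking_acc = stacking.fit(X_train, y_train).score(X_test, y_test)
```

In this sketch voting treats every base model equally, whereas the stacking meta-learner fits its own weights; its coefficients (`stacking.final_estimator_.coef_`) give a rough indication of how much each base model contributes.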

Finally, 'SuspectRate', an online malicious URL detection system, was built to offer a service that assigns a probability of being suspicious to tweets with attached URLs. A key feature of this system is that it can dynamically retrain and expand its current models.

Item Type: Thesis (Doctoral)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Natural Sciences > School of Computing and Maths
Depositing User: Lisa Bailey
Date Deposited: 12 Dec 2018 10:45
Last Modified: 12 Dec 2018 10:45
URI: http://eprints.keele.ac.uk/id/eprint/5581
