Phishing URLs Detection Using Naives Baiyes, Random Forest and LightGBM Algorithms

Cik Feresa Mohd Foozy; Muhammad Amir Izaan Anuar; Andi Maslan; Husaini Aza Mohd Adam; Hairulnizam Mahdin

doi:10.18517/ijods.5.1.56-63.2024



Cik Feresa Mohd Foozy ⁽¹⁾, Muhammad Amir Izaan Anuar ⁽²⁾, Andi Maslan ⁽³⁾, Husaini Aza Mohd Adam ⁽⁴⁾, Hairulnizam Mahdin ⁽⁵⁾

(1) Institut Kejuruteraan Integrasi, Pusat Kecemerlangan Industri-Rail (ICoE-Rel), Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia

(2) Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia

(3) Faculty of Engineering and Computer Science, Putera Batam University, Batam, Indonesia

(4) Kolej Komuniti Seberang Jaya, Permatang Pauh, Pulau Pinang, Malaysia

(5) Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia


How to cite (IJASEIT):

[1] C. F. M. Foozy, M. A. I. Anuar, A. Maslan, H. A. M. Adam, and H. Mahdin, “Phishing URLs Detection Using Naives Baiyes, Random Forest and LightGBM Algorithms”, Int. J. Data. Science., vol. 5, no. 1, pp. 56–63, Jun. 2024.


In response to the increasing complexity of phishing attacks, particularly in Malaysia, this study compares the accuracy and precision of three machine learning algorithms, Naïve Bayes, Random Forest, and LightGBM, in detecting phishing URLs (Uniform Resource Locators). The research follows a four-stage methodology of data collection, preprocessing, feature selection, and classification. The objectives are to identify phishing attack features from the dataset, to compare three classification algorithms (Naïve Bayes, Random Forest, and Light Gradient Boosting Machine (LightGBM)), and to evaluate the resulting models in terms of accuracy and precision. Through this comparative analysis, the study seeks to develop a phishing detection model and to identify suitable features and classification algorithms for the dataset. The reported accuracy is 94.24% for Naïve Bayes, 94.80% for Random Forest, and 95.00% for LightGBM.
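The two evaluation measures named in the abstract, accuracy and precision, can be illustrated with a minimal Python sketch. It is not the authors' code. The function names (`confusion_counts`, `accuracy`, `precision`) and the toy labels are hypothetical, and the sketch assumes that label 1 marks a phishing URL and label 0 a legitimate one.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating `positive` as the phishing class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # phishing URL correctly flagged
        elif pred == positive:
            fp += 1  # legitimate URL wrongly flagged
        elif truth == positive:
            fn += 1  # phishing URL missed
        else:
            tn += 1  # legitimate URL correctly passed
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all URLs that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of URLs flagged as phishing that really are phishing."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels for illustration only; these are not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(f"accuracy:  {accuracy(y_true, y_pred):.2%}")
print(f"precision: {precision(y_true, y_pred):.2%}")
```

In a comparison like the one described, the same two functions would be applied to the held-out predictions of each classifier so that the reported percentages are directly comparable.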


Attribution-ShareAlike 4.0 International License
https://creativecommons.org/licenses/by-sa/4.0/