Evaluating Machine Learning Techniques for Web Robot Detection
DOI:
https://doi.org/10.47611/jsrhs.v13i2.6794
Keywords:
web robot detection, machine learning, artificial intelligence
Abstract
A web robot, also known as a web crawler, is an automated script that systematically browses the World Wide Web. Search engines commonly use web robots to index web pages, gather information, and update their databases. Web robots follow hyperlinks from one web page to another, collecting data for purposes such as indexing, data mining, and website monitoring. Web robot detection is the process of distinguishing human users from web robots, which are a major source of web traffic, on websites. This process is vital to prevent malicious web robots from degrading web servers’ traffic and compromising their users’ privacy. Companies such as Imperva, Inc. use machine learning models to identify malicious bots and prevent them from gaining unauthorized access to an organization’s server. To understand the benefit that machine learning models bring to web robot detection, we used previously published server log data to construct machine learning models in Orange (a data mining software package) and Python that can distinguish between malicious and benign web robots. We evaluated the performance of three well-known machine learning algorithms: k-nearest neighbors (kNN), neural network, and decision tree. Based on our experimental results, the neural network achieves the highest precision and the lowest false-positive and false-negative percentages for web robots. However, the neural network requires more time to generate the desired output.
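The evaluation criteria named in the abstract (precision, false-positive percentage, and false-negative percentage) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `evaluate`, the label strings, and the toy predictions are hypothetical, and the rate definitions (false positives over all benign sessions, false negatives over all malicious sessions) are one common convention that the paper itself may define differently.

```python
from collections import Counter


def evaluate(y_true, y_pred, positive="malicious"):
    """Compute precision, false-positive rate, and false-negative rate.

    Assumption: "positive" means a session labeled as a malicious robot.
    """
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            # Predicted malicious: correct (tp) or a false alarm (fp).
            counts["tp" if truth == positive else "fp"] += 1
        else:
            # Predicted benign: a missed robot (fn) or correct (tn).
            counts["fn" if truth == positive else "tn"] += 1

    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Hypothetical ground truth and classifier output for eight sessions.
y_true = ["malicious", "malicious", "malicious", "benign",
          "benign", "benign", "benign", "malicious"]
y_pred = ["malicious", "malicious", "benign", "benign",
          "malicious", "benign", "benign", "malicious"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

Running the same function on the predictions of each candidate model (kNN, neural network, decision tree) would give a like-for-like comparison on these three measures.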
References or Bibliography
- Brownlee, J. (2020, November 7). How to Identify Overfitting Machine Learning Models in Scikit-Learn. Machine Learning Mastery. Retrieved February 25, 2024, from https://machinelearningmastery.com/overfitting-machine-learning-models/
- Gomede, E. (2024, January 6). Understanding the Causes of Overfitting: A Mathematical Perspective. Medium. Retrieved February 25, 2024, from https://medium.com/aimonks/understanding-the-causes-of-overfitting-a-mathematical-perspective-09af234e9ce4
- Gross, K. (2020, April 6). Tree-Based Models: How They Work (In Plain English!). Dataiku. Retrieved February 25, 2024, from https://blog.dataiku.com/tree-based-models-how-they-work-in-plain-english
- Intuitive Machine Learning. (2020, April 26). K Nearest Neighbors | Intuitive explained | Machine Learning Basics [Video]. YouTube. https://www.youtube.com/watch?v=0p0o5cmgLdE
- Lagopoulos, A., & Tsoumakas, G. (2019). Web robot detection - Server logs [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3477932
- Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN Model-Based Approach in Classification. In R. Meersman, Z. Tari, & D. C. Schmidt (Eds.), On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE. OTM 2003. Lecture Notes in Computer Science, vol 2888. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39964-3_62
- Benardos, P. G., & Vosniakos, G.-C. (2007). Optimizing feedforward artificial neural network architecture. Engineering Applications of Artificial Intelligence, 20(3), 365-382. https://doi.org/10.1016/j.engappai.2006.06.005
- Tangirala, S. (2020). Evaluating the impact of GINI index and information gain on classification using decision tree classifier algorithm. International Journal of Advanced Computer Science and Applications, 11(2). http://dx.doi.org/10.14569/IJACSA.2020.0110277
- Srivastava, N., et al. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958. https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
- Handelman, G. S., Kok, H. K., Chandra, R. V., Razavi, A. H., Huang, S., Brooks, M., Lee, M. J., & Asadi, H. (2019). Peering Into the Black Box of Artificial Intelligence: Evaluation Metrics of Machine Learning Methods. AJR. American Journal of Roentgenology, 212(1), 38–43. https://doi.org/10.2214/AJR.18.20224
Copyright (c) 2024 Jayan Sirikonda, Mahdieh Zabihimayvan
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.