RESEARCH
Comprehensive Big Data Analysis in Cybersecurity: Real-Time Anomaly and Threat Detection
Due to the increase in the number of smart devices and the evolution of the Internet, one of the essential aspects of computer security has been recognized.
Abstract: Due to the increase in the number of smart devices and the evolution of the Internet, one of the essential aspects of computer security has been recognized. The explosion of data produced by these devices results in new and more significant difficulties in real-time identification, tracking, and prevention of cyber threats. In this paper, the authors review the possibilities of big data technologies and machine learning algorithms in cybersecurity and concentrate on detecting real-time anomalies and threats. An insight into current technologies and methods is presented, as well as the tools for big data processing and analytics, such as Hadoop and Spark, as well as the analytical models geared towards anomaly detection. During the review and comparative analysis of the solutions proposed in the current literature, we define the main research issues and the gaps in the approaches currently used. A vital part of the paper is also the presentation of an extensive classification of technologies and techniques relevant to the paper's topic. These results imply that higher-order real-time processing paradigms and new machine-learning methods are required to improve the network's security. These tools, when used, can enhance the organizations’ ability and capacity to identify incidents within the shortest time possible. The solutions suggested herein will fix the drawbacks of conventional systems and offer a solid foundation to implement security for extensive and complicated data systems.
Keywords: Big Data, Cybersecurity, Anomaly Detection, Machine Learning, Real-Time Processing, Network Security
Introduction
The contemporary technologically advanced landscape is experiencing fast growth and nearly ubiquitous coverage of the Internet, which has led to the increasing load and complexity of network traffic and, consequently, the presence of multiple problems in the field of cybersecurity. The IoT and the enhancements of the IT sector including 5G or intelligent City, and self-driving cars, produce data in a constant flow and thus impose difficulty in saturating traditional security frameworks. In line with such reports, the global connected devices are forecast to reach 50. This is due to the constantly increasing number of devices that can be connected to the Internet. From 1.0 billion in the current year to 1.2 billion by 2020 [1]. This type of unprecedented growth presents a significant risk to the integrity of the networks since the increase in data volumes presents substantial challenges to real-time threat detection and even anomaly identification.
The exponential growth in the number of connected devices—from 1.0 billion today to an estimated 50 billion shortly—has drastically increased the complexity of network traffic. Each device connected to the Internet adds a potential point of vulnerability. These devices range from simple sensors to complex systems like autonomous vehicles, continuously generating vast amounts of data. Traditional security frameworks, which often rely on static rules and signature-based detection methods, struggle to keep up with the dynamic nature of modern network traffic. The diversity of devices, each with security requirements and potential vulnerabilities, further complicates the scenario.
Some of the significant applications of anomaly-based detection include recognizing newcomers, abnormal behavior, and threats in traffic. Ordinary approaches still in use implement batch processing and offline analysis that are no longer enough today due to highly dynamic cyber threats. The drawbacks of these techniques are seen in the fact that they propagate poorly regarding velocity and volume of real-time data [2]. To remedy these, the possibility lies in adopting big data technologies and machine learning algorithms as a part of the solution. These technologies help in the real-time analysis of big data sets for the timely identification of out-of-normal behavioral situations [3]. Frameworks like Hadoop and Apache Spark have boosted the sense of capacity to handle big data [4]. At the same time, identifying anomalies can benefit from machine learning algorithms and can be trained using supervised and unsupervised machine learning models [5]. Thus, using these technologies is beneficial in designing improved and adaptive safety systems to counter new developments in threats [6]. This paper aims to explicate these technologies, their uses in real-time anomaly detection, and the prospects for any issues with future work in this area.
Related Works
A number of research studies have been done in the last few years that cover different aspects of big data and machine learning in the context of security. In the work of Habeeb et al. [1], the authors present, among other things, an analysis of real-time big data processing technologies concerning RTBDPA for anomaly detection, emphasizing the necessity for structures capable of meeting the demands of current significant data ecosystems. Casas et al. [2] propose Big-DAMA, a system designed for analyzing network traffic where stream and batch processing paradigms are orchestrated to enhance identifying strange events. Some of their work focuses on linking machine learning models with big data processing engines to detect with better accuracy.
Arjunan also [3] and Wong & Arjunan [4] extend into deep learning methods for detecting network traffic anomalies. More of their work shows that CNN and LSTM can handle extensive streaming data and perform better than other techniques. These studies show the possibility of deep learning in meeting the real-time data processing issues in cybersecurity.
Habeeb et al. emphasize the critical need for real-time big data processing architectures (RTBDPA) to effectively manage significant data ecosystems' demands, particularly anomaly detection. Anomaly detection is a critical component of cybersecurity, where the identification of unusual patterns in data can signify potential security breaches. Traditional data processing methods, which often rely on batch processing, need help to keep up with the speed and scale required in modern environments. On the other hand, real-time processing technologies enable the immediate analysis of data as it flows, allowing for quicker identification and response to potential threats. This shift is crucial in an era where even a minor delay in detecting a threat can lead to significant security breaches.
Abinaya et al. [5] explore the use of big data analytics for real-time anomaly detection and provide an overview of numerous approaches and their performance in various cases. More et al. also discuss real-time threat detection systems in cloud environments in a paper where big data analytics are seen as a way of enhancing security mechanisms. Hence, the paper's findings aid in establishing more information on the application of big data technologies in constructing cybersecurity.
Methods
Problem Formulation
Specifically, anomaly detection in the flow of network traffic is considered in this research as a binary classification task. It aims to achieve a state of being able to tell regular traffic from those abnormal on the network. In this work, the input to our model is the network flow data obtained from raw packet capture files or 'pcap’ format [2]. These are point-to-point connection-level traffic volumes over fixed time intervals and not raw packet-level information, which actually makes working with these more accessible. We use the Zeek network security monitor to classify this flow data obtained, and the features we extract include a time stamp, port, protocol, duration, and bytes per packet. This approach provides a convenient and effective detection process because the primary flow-level characteristics highlight the behavior of the network, which is critical for identifying anomalies.
CNN-LSTM Model
We suggest combining 1D Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks on the analyzed network traffic data. The model architecture comprises several components:
1D CNN: This model uses operations known as 3x1 convolutions, which help detect the local dependencies and patterns in the adjacent network flows. Other operations in max pooling go a step further in fine-tuning these features through dimensionality reduction.
Bidirectional LSTM: To model temporal behavior and long-term dependencies, we use bidirectional LSTM layers through which the data is passed in both forward and backward directions [5]. This feature enables the model to consider past and future contexts in their functioning since they are not physical structures.
Dropout and Batch Normalization: They improve the model capability of generalizing on unseen data and avoid overfitting. Dropout randomly drops out units during training; batch normalization normalizes the inputs of the layers to stabilize and speed up training.
Dense Layers: The last fully connected layers use the sigmoid activation function to make binary decisions as to whether input windows fall into normal or anomalous categories. [2] Cross entropy is the loss function for binary cross-entropy, and the model is end-to-end trained.
Training is carried out on hypothetical servers endowed with Nvidia GPUs to speed up computations. Training is carried out on servers that use Nvidia GPUS for computation. The flow data is preprocessed by feature scaling, and each model is trained for 50 epochs with compulsory validation loss for stopping the training epoch. Classification imbalance is dealt with using SMOTE, while batch size optimization is also used as a technique of hyperparameter tuning.
The model’s training process relies on the cross-entropy loss function, specifically binary cross-entropy for binary classification tasks. Cross-entropy measures the difference between the predicted probabilities and the actual class labels. It is a crucial component in guiding the optimization process, allowing the model to adjust its parameters to minimize the loss, thereby improving its accuracy in distinguishing between normal and abnormal inputs. Using Nvidia GPUs is essential to handle the computational demands of training complex neural networks, especially those with large datasets. These GPUs are designed to accelerate the matrix operations that underpin deep learning, significantly reducing training time. Hypothetical servers equipped with Nvidia GPUs are an ideal environment for training such models, enabling more extensive experimentation with hyperparameters and model architectures.
Model Optimization
Several optimization techniques are utilized to enhance the efficiency and performance of the model:
Transfer Learning: Since the model is similar to earlier implemented models, the weights are initialized from previous training on similar data sets, which cuts down on the training time and the computing resources required—further training of these pre-trained weights on the new dataset help in speeding up the adaptation process.
Model Compression: Mechanisms like quantization, pruning, and knowledge distillation reduce the models' size to make them suitable for inference on resource-constrained devices while preserving accuracy.
Parallelism: The spread of data across several GPUs enhances the training phase performance and also allows concurrent inference with low latency, as required in some real-life uses. Incremental Learning: Models are gradually updated with new data, so they do not have to be retrained every time there is a change in traffic patterns in a region, city, or country, which helps the use of models remain effective.
Results
The performance assessment of the proposed deep learning framework for real-time network traffic anomaly detection depicts its performance on benchmark and large-scale real-world datasets. During the experiment, with the functionality of CNN-LSTM model architecture, a systematic tuning of the hyperparameters, such as learning rate, layers, filters, and batch size was done to ensure the proposed model was optimum without being overfit [6]. Compared to other shallow learning models, such as random forest, support vector machine, and k Nearest neighbor, the CNN-LSTM model gave high overall performance in accuracy, precision, recall, and F1 score with lesser latency.
Model results on the NSL-KDD database
Model |
Accuracy |
Precision |
Recall |
F1-Score |
Random Forest |
86.5% |
84.2% |
83.1% |
83.6% |
SVM |
88.7% |
85.3% |
84.7% |
85.0% |
KNN |
89.1% |
86.4% |
85.2% |
85.8% |
CNN-LSTM |
93.2% |
90.1% |
89.5% |
89.8% |
Similarly, on the NSL-KDD benchmark dataset, the CNN-LSTM was superior to baseline models on the metrics with the highest accuracy, precision, recall, and F1 score. This shows the model's high capability of identifying the difference between regular traffic and different factors of attacks. When applied to the CTU-13 dataset containing 80GB of natural traffic, the CNN-LSTM yielded accurate values 97. 2% and low delay, thus performing well with imbalanced traffic and real-world network interference. These outcomes owe the model's applicability for constant monitoring of network intrusions, making the model consequently versatile and reliant on several different organization networks.
Model results in the CTU-13 dataset
Model |
Accuracy |
Latency |
Random Forest |
73.5% |
98ms |
SVM |
76.2% |
107ms |
KNN |
78.1% |
134 ms |
CNN-LSTM |
97.2% |
68ms |
Discussion
The experimental results validate deep CNN-LSTM models' ability to implement real-time network anomaly detection in big data networks. The better accuracy achieved by all the deep learning models compared to all the shallow learning models across all the evaluated datasets is evidence of the ability of the deep learning models to learn and represent from the network traffic data[4]. This capability is instrumental in situations where simpler models are likely to overlook more complex and subtle patterns, thereby making anomaly detection more accurate.
One of the significant advantages of LSTM is its capacity to capture temporal relations in sequential data sources. The LSTM component improves the detection performance by learning long-range dependencies in a target signal, especially when the anomalous patterns develop over relatively long periods [5]. This temporal awareness is essential in monitoring the traffic passing through a network because what is passing through the network at any given time should pass through analysis at the same time. The CNN-LSTM model that achieves both good accuracy and low latency on massive real-world massive datasets is a case in point. That it can do so while also delivering good performance in terms of time optimization corresponds to the need for real-time applications. These findings add up to the little existing research in the literature that has found the application of deep learning for network traffic analytics to improve security and cybersecurity and prevent novel threats from shedding."
Conclusion
This paper proposes a deep-based learning approach that integrates CNN and LSTM for ‘real-time’ network traffic anomaly detection in the significant data context. This study proves that deep learning models cover specific shallow models effectively and efficiently when used in large-scale networks that handle high-speed traffic flows that change frequently. The deep learning approach is innovating the detection accuracy and latency issue, making it perfect for real-time network security systems implementation. The research highlights the effectiveness of the proposed deep learning methods of extensive data analysis in dealing with unspecified threats and the growing menace of cyber threats, making for a versatile and efficient solution for the problems of today's networks. Future work regarding this problem should propose new deep learning architectures and algorithms that provide better performance and resistance to anomaly detection systems. Further, extending these methods in other domains and improving computational complexity to handle limited resource applications of such models would add strength to these models. In conclusion, this work presents the capability of deep learning in network security and makes future research recommendations.
References
[1] Habeeb, R. A. A., Nasaruddin, F., Gani, A., Hashem, I. A. T., Ahmed, E., & Imran, M. (2019). Real-time big data processing for anomaly detection:289-307. https://www.sciencedirect.com/science/article/pii/S0268401218301658
[2]Casas, P., Soro, F., Vanerio, J., Settanni, G., & D'Alconzo, A. (2017, September). Network security and anomaly detection with Big-DAMA, a big data analytics framework. In 2017 IEEE 6th international conference on cloud networking (CloudNet) (pp. 1-7). IEEE. https://ieeexplore.ieee.org/abstract/document/8071525/
[3]Arjunan, T. (2024). Real-Time Detection of Network Traffic Anomalies in Big Data Environments Using Deep Learning Models. International Journal for Research in Applied Science and Engineering Technology, 12(9), 10-22214.
[4]Wong, M. L., & Arjunan, T. (2024). Real-Time Detection of Network Traffic Anomalies in Big Data Environments Using Deep Learning Models. Emerging Trends in Machine Intelligence and Big Data, 16(1), 1-11. http://orientreview.com/index.php/etmibd-journal/article/view/34
[5]Abinaya, N., Kumar, A. S., Chaturvedi, A., Musirin, I. B., Rao, M., Kaur, G., & Arya, N. (2024). Big Data in Real Time to Detect Anomalies. In Big Data Analytics Techniques for Market Intelligence (pp. 372-397). IGI Global. https://www.igi-global.com/chapter/big-data-in-real-time-to-detect-anomalies/336358
[6] More, R., Unakal, A., Kulkarni, V., & Goudar, R. H. (2017, May). Real-time threat detection system in the cloud using big data analytics. In 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT) (pp. 1262-1264). IEEE. https://ieeexplore.ieee.org/abstract/document/8256801/
RESEARCHER
Dr. Anar Israfilov
RESEARCH ABSTRACTS
Backed in
Research.
About the AI Research Centre
You can download our research article as a pdf below and examine it in detail.