RESEARCH
End-to-end analysis of big data in cybersecurity: Detecting anomalies and threats in real time
The global expansion of the Internet and the vast number of connected devices have opened the door to numerous fraudulent activities.
Materials and methods: The article takes an integrated approach to the research problem, combining several research methods. The general methods used are systematic, quantitative and qualitative analysis, synthesis, formal logic and theoretical generalization.
Research results: It is shown that a real-time threat and anomaly detection system must meet special requirements, above all the need for constant modernization in an ever-changing landscape of cyber threats and anomalies. Building an optimal threat and anomaly detection system requires a system model that accounts for as complete a list of anomaly and threat cases as possible and that can also anticipate zero-day attacks.
Conclusion: The study proposes end-to-end, data-driven big data analysis based on the Deep Deterministic Policy Gradient (DDPG) algorithm. This algorithm learns the Q-function and the policy simultaneously. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy. It is well suited to environments with continuous action spaces (real-time operation on the Internet).
Keywords: big data, real-time threat and anomaly detection, neural networks, end-to-end data analysis, data analytics
Introduction
Cybersecurity involves a whole range of measures. The most important of them is network security analytics, because it is aimed directly at detecting anomalies and threats in the network.
Big data analytics, in turn, is an essential element of any cybersecurity solution. High-speed, large-volume data from various sources must be processed quickly in order to detect threats and anomalies promptly and to identify models of such threats and anomalies.
At the same time, research on detecting anomalies and threats to network security shows that existing approaches are not effective enough, especially when anomalies and threats must be detected in real time [7, 8]. Although a sufficient number of approaches to big data analysis have been developed over the past few years, applying them in the field of cybersecurity is problematic. Existing approaches do not take into account such important aspects as zero-day attack detection; real-time threat and anomaly analysis; data exchange between threat detection systems; data processing with limited resources; and time series analysis to detect threats and anomalies.
The inefficiency of existing approaches to detecting anomalies and threats to network security is primarily due to the huge amounts of data accumulated through Internet-connected devices. There is therefore an objective need for an approach focused on processing big data in real time and detecting threats and anomalies in networks.
Given the above, the analysis of modern technologies for real-time big data processing aimed at detecting threats and anomalies is of particular relevance to this study.
Materials and methods
The article takes an integrated approach to the research problem, combining several research methods. The general methods used are systematic, quantitative and qualitative analysis, synthesis, formal logic and theoretical generalization.
The reliability of the results is supported by the sufficient amount of material analyzed on the problem, the use of methods adequate to the tasks set, and the use of modern methods of analysis. The validity of the scientific conclusions and provisions is confirmed by the results of the research. The conclusions objectively and fully reflect the results obtained.
The research addresses such relevant topics as neural network learning algorithms for identifying zero-day attacks and unknown types of attacks and threats. The author analyzes the research in this field and evaluates it critically, including the advantages and disadvantages of existing models and algorithms for end-to-end analysis of big data in cybersecurity when detecting threats and anomalies in real time.
Results
The ability to quickly identify threats and anomalies in data is becoming an objective necessity for ensuring reliable security and operational integrity. For this purpose, threat and anomaly detection systems are being developed that process data and identify violations or deviations that differ significantly from established standards. Real-time detection places higher requirements on such systems, because threats and anomalies must be responded to instantly as they arise.
In the most general sense, detection of anomalies and threats involves continuous analysis of data arrays in search of data that do not correspond to, or deviate from, standard patterns [3, 4]. Such deviations may be associated, for example, with unusual spikes in network traffic, which indicate cyber-attacks, or with anomalies in financial transactions, which suggest fraud.
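The idea of flagging a traffic spike as a deviation from a standard pattern can be illustrated with a minimal sketch. It assumes a rolling z-score over a window of recent samples; the window size, the threshold and the sample readings are hypothetical and are not taken from the cited studies.

```python
from collections import deque
from statistics import mean, stdev


def detect_spikes(samples, window=10, threshold=3.0):
    """Return indices of samples that deviate sharply from the recent baseline.

    `window` and `threshold` are illustrative defaults, not values from the study.
    """
    history = deque(maxlen=window)  # most recent samples treated as normal
    spikes = []
    for index, value in enumerate(samples):
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            # Skip the test when the baseline is perfectly flat (zero spread).
            if spread > 0 and abs(value - baseline) / spread > threshold:
                spikes.append(index)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return spikes


# Hypothetical packets-per-second readings with one burst at index 12.
traffic = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 950, 100, 102]
print(detect_spikes(traffic))
```

Because each sample is compared only with a short window of earlier samples, the check can run as the data arrive, which matches the real-time requirement discussed above. A production system would need a more robust baseline than this sketch provides.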
From a cybersecurity perspective, real-time threat and anomaly detection is aimed at identifying potential threats and anomalies (including data leakage, unauthorized access to data, network intrusions and malicious actions) before they escalate into serious security incidents.
For example, network monitoring helps to detect threats originating from various elements of the network infrastructure, such as substitution or duplication of a MAC address or IP address, viruses, and anomalies in network connection speed.
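As an illustration of the address checks mentioned above, the following sketch reports IP addresses claimed by more than one MAC address and MAC addresses that appear with more than one IP address. The function name and the sample observations are hypothetical; collecting the (IP, MAC) pairs from the network is assumed to happen elsewhere.

```python
from collections import defaultdict


def find_address_conflicts(observations):
    """Report IPs claimed by several MACs and MACs seen with several IPs.

    `observations` is an iterable of (ip, mac) pairs, for example parsed from
    ARP traffic; the parsing step is outside this sketch.
    """
    macs_by_ip = defaultdict(set)
    ips_by_mac = defaultdict(set)
    for ip, mac in observations:
        mac = mac.lower()  # MAC addresses are case-insensitive
        macs_by_ip[ip].add(mac)
        ips_by_mac[mac].add(ip)
    duplicated_ips = {ip: sorted(m) for ip, m in macs_by_ip.items() if len(m) > 1}
    shared_macs = {mac: sorted(i) for mac, i in ips_by_mac.items() if len(i) > 1}
    return duplicated_ips, shared_macs


# Hypothetical observations: 10.0.0.1 is claimed by two different MACs.
observed = [
    ("10.0.0.1", "AA:BB:CC:00:00:01"),
    ("10.0.0.2", "aa:bb:cc:00:00:02"),
    ("10.0.0.1", "aa:bb:cc:00:00:03"),
]
duplicated_ips, shared_macs = find_address_conflicts(observed)
print(duplicated_ips)
print(shared_macs)
```

A conflict found this way is only a candidate for review: a single MAC address with several IP addresses can be legitimate, whereas one IP address claimed by several MAC addresses is a more typical sign of substitution.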
Modern research reduces the disadvantages of existing threat and anomaly detection methods to a high rate of false alarms triggered by behavior unknown to the detection system [4]. In the specialized literature, threats and anomalies are differentiated by categories (Table 1).
Table 1
Categories and descriptions of anomalies and threats to network security
Categories of anomalies and threats | Description of anomalies and threats
Point-based | An individual data instance that does not conform to the norm with respect to the rest of the data
Contextual | A data instance that is abnormal in a certain context
Collective | A set of related data instances that is abnormal with respect to the entire dataset
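The contextual category in Table 1 can be illustrated with a short sketch in which the same value is normal in one context and abnormal in another. The context labels (hours of the day), the login counts and the threshold are hypothetical and serve only to show the idea of comparing a reading with the history of its own context.

```python
from statistics import mean, stdev


def contextual_anomalies(history, current, threshold=3.0):
    """Flag readings that are unusual for their own context.

    `history` maps a context label (here, an hour of the day) to past values;
    `current` maps the same labels to the reading being checked.
    """
    flagged = []
    for context, value in current.items():
        past = history[context]
        spread = stdev(past)
        if spread > 0 and abs(value - mean(past)) / spread > threshold:
            flagged.append(context)
    return flagged


# Hypothetical login counts: busy at 14:00, quiet at 03:00.
history = {"03:00": [2, 3, 1, 2, 3, 2], "14:00": [80, 85, 78, 82, 90, 84]}
current = {"03:00": 80, "14:00": 83}
print(contextual_anomalies(history, current))
```

A count of about 80 logins is ordinary at 14:00 but far outside the 03:00 history, so only the night-time reading is flagged. A single global threshold would treat both readings alike.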
The categories of anomalies and threats to network security presented in Table 1 are standard. In addition to these standard categories, there are threats and attacks aimed specifically at cloud servers (Table 2).
Table 2
Categories and characteristics of anomalies and threats to network security aimed at cloud servers
Categories of anomalies and threats | Characteristics of anomalies
Speculative | Atypical execution of hardware performance optimization features, for example, execution that does not follow the order expected by hardware prediction and caching
Side-channel | Timing attacks that leak the secret key of symmetric-key ciphers or the private key of public-key ciphers
Software | Buffer overflow (the written data exceeds the size of the allocated buffer)
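How execution time can leak secret material, as in the side-channel category of Table 2, can be shown with a common textbook illustration that is not taken from the source: a comparison that stops at the first mismatching byte. The function names are hypothetical; `hmac.compare_digest` is the Python standard library routine designed to avoid this kind of leak.

```python
import hmac


def leaky_equals(secret: bytes, guess: bytes) -> bool:
    """Byte-by-byte comparison that stops at the first mismatch.

    The running time depends on how many leading bytes match, which is the
    kind of signal a timing attack measures.
    """
    if len(secret) != len(guess):
        return False
    for expected, actual in zip(secret, guess):
        if expected != actual:
            return False
    return True


def constant_time_equals(secret: bytes, guess: bytes) -> bool:
    """Comparison whose duration does not depend on where the inputs differ."""
    return hmac.compare_digest(secret, guess)


# Hypothetical key material, for illustration only.
key = b"0123456789abcdef"
print(leaky_equals(key, b"0123456789abcdeX"), constant_time_equals(key, key))
```

Both functions return the same answers; they differ only in how much their running time reveals. Real side-channel attacks on ciphers are considerably more involved than this comparison example.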
The specialized literature names high-performance computing [2] and supervised neural networks [1, 5, 10] among the solutions for minimizing the risks of anomalies and threats aimed at cloud servers. High-performance computing is aimed at monitoring system performance and debugging software, for example, when detecting malware. Trained neural networks, in turn, can detect malware and intrusions based on their built-in learning algorithms.
The main disadvantage of using supervised deep learning to detect threats and attacks is that the neural network models are trained on available attack examples. As a result, zero-day attacks, which cannot be included in the training dataset, cannot be detected by the neural network in a timely manner. High-performance computing also has a drawback: it cannot account for all unknown anomalies and threats affecting system performance.
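The limitation of supervised models with respect to unseen attacks can be sketched as follows. For brevity the sketch uses a nearest-centroid classifier instead of a deep neural network, and a simple distance cutoff as a stand-in for novelty detection; the feature vectors, class names and radius are hypothetical, and this is not the method proposed in the article.

```python
from math import dist

# Hypothetical 2-D feature vectors (e.g. scaled packet rate and payload size).
TRAINING = {
    "normal": [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)],
    "port_scan": [(5.0, 0.2), (5.2, 0.3), (4.8, 0.1)],
}


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


CENTROIDS = {label: centroid(points) for label, points in TRAINING.items()}


def supervised_label(sample):
    """Nearest-centroid classifier: always answers with a label it was trained on."""
    return min(CENTROIDS, key=lambda label: dist(sample, CENTROIDS[label]))


def is_novel(sample, radius=1.5):
    """Flag samples far from every known class; `radius` is an illustrative cutoff."""
    return all(dist(sample, c) > radius for c in CENTROIDS.values())


unseen_attack = (1.0, 9.0)  # pattern absent from the training data
print(supervised_label(unseen_attack), is_novel(unseen_attack))
```

The classifier assigns the unseen pattern to the closest known class, here "normal", because it has no way to answer "unknown". The distance check flags the same sample as lying outside everything seen in training.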
At the same time, cyber-attack statistics (Table 3) show the need for decisive measures aimed at the timely detection of anomalies and threats in real time.
Table 3
Cyber-attack statistics for 2023 [13, 14, 15]
Indicator | Value
Frequency of cyber-attacks | Once every 39 seconds
Number of victims of cyber-attacks | 800,000 people annually
Approximate losses from cyber-attacks for companies in various countries of the world | More than $17,000 every minute
Most common causes of data leakage | Malware infection
As can be seen from the data presented in Table 3, the damage from cyber-attacks is enormous. At the same time, the main cause of cyber-attacks on big data is malicious software. Such software is designed to take control of the victim's computer infrastructure or disrupt its operation. Posing as harmless files or links, these programs trick users into downloading them, thus giving outsiders access not only to the victim's computer but also to the entire network of the organization.
Figure 1 – Cumulative number of malware detections worldwide from 2015 to 2021 [13, 14, 15]
As can be seen from the data presented in Figure 1, the total number of newly detected malware programs worldwide amounted to 153 million in 2015 and exceeded 700 million in 2021. Ransomware has become one of the most widespread and fastest growing threats to individuals and organizations around the world. In addition to malware infection, the most common types of attacks are attacks on the Internet of Things (IoT) and distributed denial of service (DDoS) attacks, the number of which is growing uncontrollably (Figure 2).
Figure 2 – Cumulative number of IoT hacking and DDoS attacks detected from 2020 to 2022 [13, 14, 15]
According to Statista, a German company specializing in market and consumer data, in 2022 attackers carried out more than 1.9 billion attacks on Internet of Things (IoT) devices, compared with just 639 million in 2020. Most IoT network attacks occur through the Telnet protocol, an interface that facilitates remote connection to a server or device.
Like many other types of threats and attacks, real-time threats to digital security have grown significantly in activity since the start of the COVID-19 pandemic, and this activity increases every year.
As can be seen from the data presented in Figure 2, the number of global DDoS attacks reached 15 million by the beginning of 2023.
Attacks on IoT devices have tripled in the last three years. According to Statista, 90% of remote code execution attacks are related to crypto mining, and 1 out of every 36 mobile devices, including phones and tablets, contains a high-risk application.
As the number of real-time cyber threats increases every year, it is only natural that the cybersecurity market is also expanding. Statista experts expect the global cybersecurity market to reach 366.1 billion dollars by 2028, up from an estimated 172.32 billion dollars in 2023, growing every year (Figure 3).
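The growth implied by the two market estimates can be checked with a short calculation. It assumes, purely for illustration, that the market grows at a constant compound rate over the five years from 2023 to 2028; the source gives only the two endpoint values.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Average yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text, in billions of US dollars.
market_2023 = 172.32
market_2028 = 366.1

cagr = compound_annual_growth_rate(market_2023, market_2028, years=2028 - 2023)
print(f"Implied annual growth: {cagr:.1%}")
```

Under this assumption the market would more than double over the period, which corresponds to growth of roughly 16% per year.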