D.A. Zuev – Bachelor, Beijing Institute of Technology
A.P. Kalistratov – Post-graduate Student, Master, Bauman Moscow State Technical University
G.I. Afanasyev – Ph. D. (Eng.), Associate Professor, Department «Information Processing and Control Systems», Bauman Moscow State Technical University
P.S. Semkin – Associate Professor, Bauman Moscow State Technical University
Massive data analysis is often performed in a streaming manner, where data are continuously generated at a high rate in the form of data streams. Processing data streams attracts much attention in recent years because of requirements in real-time big data analysis, such as neural network based methods and classification based approaches. An important aspect in data stream processing: estimating cardinality for large-scale data streams over sliding windows. The cardinality of a data stream S is the number of distinct items in S. Counting the cardinality of a data stream plays an important role in many applications, such as database query, computer network monitoring, and anomaly detection. The problem of estimating online the number of distinct elements in a massive data stream has several interesting applications in the fields of traffic engineering and network security. In fact, some attacks, such as worm propagation or Denial of Service, can be detected by supervising the number of active flows. The problem is estimation of the cardinality of the data stream over the given a sliding window of size W and a a massive data stream S. Algorithms must be scalable, operate fast, using a small limited memory. Most of algorithms are designed to perform on a static set of data. It means that they can simply estimate the total number of flows for a given fixed dataset. So, they cannot be applied to an infinite data stream. Moreover, for time series data, we are mostly interested in data in the near past (few minutes ago). A natural way to adapt these algorithms to data stream is to use the sliding window model. The main advantage of the Sliding HyperLogLog algorithm consists of its capacity to cope with a data stream of any duration. Moreover, it can answer a query with a variable duration w (bounded by the time window W). One can notice that the error on the estimation of the number of flows is most often within the theoretical bound of 3.25%. So, the experiments agree with the result that adding the sliding window mechanism does not affect the accuracy of the original HyperLogLog algorithm.
- Gorshkov N.A, Denisov V.S. Analiz soobshhenij soczial'noj seti Twitter s ispol'zovaniem sistem obrabotki potokovy'x danny'x Apache Spark i Apache Storm // International Journal of Open Information Technologies. 2016. № 11. S. 1−11.
- Samarev R.S. Obzor sostoyaniya oblasti potokovoj obrabotki danny'x // Trudy' ISP RAN. 2017. 29:1. 231−260.
- Reyes-Ortiz J.L., Oneto L., Anguita D. Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf // INNS Conference on Big Data Program. San Francisco. 8−10 August 2015. P. 121−130.
- Whang K.-Y., Vander-Zanden B.T., Taylor H.M. A linear-time probabilistic counting algorithm for database applications // ACM Transactions on Database Systems. 1990. 15: 208−229.
- Durand M. & Flajolet P. Loglog counting of large cardinalities. Lecture Notes in Comput. Sci. 2003. 2832. 605−617.
- Zuev D.A., Kalistratov A.P. Primenenie algoritma Flazholeta-Martina dlya potokovoj obrabotki danny'x // Mezhdunar. konf. «Inzhiniring & Telekommunikaczii» (En&T). 2017.
- Jingsong Shan, Jianxin Luo, Guiqiang Ni, Zhaofeng Wu, Weiwei Duan. CVS: Fast cardinality estimation for large-scale data streams over sliding windows // Neurocomputing. 2016. 194: 107−116.
- Flajolet P., Martin G.N. Probabilistic counting algorithms for data base applications // Journal of Computer and System Sciences. 1985. 31(2):182−209.
- Potapov V.P., Popov S.E., Kosty'lev M.A. Primenenie massovo-parallel'ny'x texnologij dlya organizaczii potokovoj obrabotki radarny'x danny'x // Vestnik NGU. Seriya: Informaczionny'e texnologii. 2016. № 3. S. 69−80.
- Apache Spark Streaming // Apache Spark Documentation. URL: http://spark.apache.org/streaming (data obrashheniya 20.10.2017).
- Flajolet P., Fusy E., Gandouet O., Meunier F. HyperLogLog: the analysis of a nearoptimal cardinality estimation algorithm // Proceedings of Conference on Analysis of Algorithms. 2007. V. AH of Discrete Mathematics and Theoretical Computer Science Proceedings (DMTCS). P. 127−146.
- Chabchoub Y., Hébrail G. Sliding hyperloglog: Estimating cardinality in a data stream over a sliding window // IEEE International Conference on Data Mining Workshops. Sydney(Australia). December 2010.
- Fusy E., Giroire F. Estmating the number of active flows in a data stream over a sliding window // Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth Workshop on Analytic Algorithmics and Combinatorics (ANALCO). January 2007.
- Datar M., Gionis A., Indyk P., Motwani R. Maintaining stream statistics over sliding windows // SIAM Journal on Computing. 2002. 31. 6. 17941813.
- Heule M. Nunkesser, Hall A. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. EDBT. 2013.