Yu.A. Grigorev – Dr.Sc.(Eng.), Professor, Department «Information Processing and Control Systems», Bauman Moscow State Technical University
A.V. Burdakov – Ph.D.(Eng.), Head of Program on the Development and Implementation of Automated Information Systems, Company «Information Technologies for Epidemiology» (Moscow)
V.A. Proletarskaya – Post-graduate Student, Department «Information Processing and Control Systems», Bauman Moscow State Technical University
A.I. Ustimov – Student, Department «Information Processing and Control Systems», Bauman Moscow State Technical University
Data Warehouses (DWH) have gained widespread application in Big Data processing. The DWH implementation approach has changed over time. Initially, systems were based on three approaches: MOLAP, ROLAP and HOLAP. The introduction of NoSQL databases changed the DWH implementation strategy, because NoSQL systems are open-source, highly scalable (up to a few thousand nodes) and reliable (due to multiple replication of database records), while requiring only inexpensive nodes. The next step in DWH development was the use of MapReduce-like platforms, such as Hadoop.
Several software projects on top of Hadoop have emerged, such as Hive. A number of new DWH access methods have been proposed that preserve the benefits of Hadoop's high scalability, simplify query programming and ensure prompt execution. The following four methods, based on caching the dimension tables in each node's RAM, were developed: MFRJ, MRIJ, MRIJ on RCFile, and MRIJ with a large number of dimensions.
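The idea behind caching dimension tables in RAM can be illustrated with a minimal sketch of a map-side star join in the spirit of MFRJ. This is an illustrative assumption, not the authors' implementation: the dimension tables, field names and records below are hypothetical.

```python
# Illustrative sketch of a map-side star join: every node keeps the small
# dimension tables in RAM, so each fact record is joined locally in the
# Map phase and no fact-table records are duplicated across nodes.

# Hypothetical in-memory dimension tables: surrogate key -> attributes.
DIM_DATE = {1: {"year": 2017}, 2: {"year": 2018}}
DIM_SHOP = {10: {"city": "Moscow"}, 11: {"city": "Brno"}}

def map_star_join(fact_record):
    """Map phase: resolve each foreign key against the cached dimensions."""
    date = DIM_DATE.get(fact_record["date_id"])
    shop = DIM_SHOP.get(fact_record["shop_id"])
    if date is None or shop is None:
        return None  # record filtered out by the dimension predicates
    # Emit (group-by key, measure) for the Shuffle/Reduce phases.
    return ((date["year"], shop["city"]), fact_record["amount"])

facts = [
    {"date_id": 1, "shop_id": 10, "amount": 5.0},
    {"date_id": 1, "shop_id": 10, "amount": 7.0},
    {"date_id": 2, "shop_id": 11, "amount": 3.0},
]

# Reduce phase: aggregate measures per group key.
totals = {}
for rec in facts:
    out = map_star_join(rec)
    if out:
        key, amount = out
        totals[key] = totals.get(key, 0.0) + amount

print(totals)  # {(2017, 'Moscow'): 12.0, (2018, 'Brno'): 3.0}
```

Because the join completes inside Map, only the already-aggregated key/measure pairs reach the Shuffle phase, which is what makes the approach attractive for star-schema queries.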
This paper reviews the MFRJ DWH access method and analyzes MapReduce processing. Based on the results of this analysis, an analytical model for evaluating DWH query execution time was developed. The proposed model belongs to the class of cost-based models and has the following advantages over existing approaches: 1) it models execution of the MFRJ algorithm, which does not require duplication of DWH records; 2) it does not require manually assigned weights for cost evaluation; 3) it calculates average values of the memory and time characteristics of n-dimensional DWH query execution (taking into account the disk, processor and network components); 4) it is quite detailed, accounts for the specifics of the Map, Shuffle and Reduce phases, and is tuned through a large number of parameters that affect the performance characteristics; 5) it includes a parameter calibration procedure.
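The general shape of such a cost-based estimate can be sketched as follows. The formulas, parameter names and the max-of-resources bound below are schematic assumptions for illustration only; they are not the paper's actual equations.

```python
# Schematic cost-based estimate: total query time as the sum of the
# Map, Shuffle and Reduce phase times, each bounded by its slowest
# resource (disk, CPU or network). All parameters are illustrative.

def query_time(fact_bytes, shuffle_bytes, result_bytes, nodes,
               mu_read, mu_net, mu_write, tau, ops_per_record, records):
    """Estimate query time (seconds) on a cluster of `nodes` nodes.

    mu_read / mu_net / mu_write are calibrated throughputs (bytes/s),
    tau is the cost of one short logical operation, and ops_per_record
    is the number of such operations per fact record in the Map phase.
    """
    t_map = max(fact_bytes / (mu_read * nodes),              # HDFS read
                records * ops_per_record * tau / nodes)      # CPU work
    t_shuffle = shuffle_bytes / (mu_net * nodes)             # network
    t_reduce = result_bytes / (mu_write * nodes)             # HDFS write
    return t_map + t_shuffle + t_reduce
```

With fixed data volumes the estimate decreases as nodes are added, which mirrors the scalability behaviour the model is meant to forecast at the design stage.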
The model parameters that can be directly measured were calibrated in experiments. The calibration yielded performance estimates for the following resources: HDFS (read and write – μDR, μDW), LocalFS (read and write – μdR, μdW), switch ports (input and output – μPR, μPW), the switch (μN1) and the processor (short logical operation of an algorithm, SLOA – τ). The boundary values of these parameters were determined and estimated with the least squares method based on peak performance measurements.
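A least-squares fit of one such throughput parameter can be illustrated with a toy example. The linear model time = volume / μ and the sample measurements below are hypothetical assumptions, not the paper's data.

```python
# Hypothetical calibration of a throughput parameter (e.g. an HDFS read
# rate): assuming time = volume / mu, the slope 1/mu of the line through
# the origin is fitted by least squares over (volume, time) measurements.

def calibrate_rate(volumes_bytes, times_sec):
    """Least-squares slope through the origin, returned as a rate mu."""
    num = sum(v * t for v, t in zip(volumes_bytes, times_sec))
    den = sum(v * v for v in volumes_bytes)
    return den / num  # mu = 1 / slope, in bytes per second

# Noisy measurements around a true rate of 100 MB/s.
volumes = [1e8, 2e8, 4e8, 8e8]
times = [1.02, 1.97, 4.05, 7.98]
mu = calibrate_rate(volumes, times)
print(round(mu / 1e6, 1))  # -> 100.0 (MB/s)
```

The same fit, repeated per resource, would yield the μ and τ values that parameterize the time model.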
The adequacy of the developed model was analyzed. The experimental and modelling results indicate that modelling accuracy increases as the number of nodes grows. Across the experiments with 3, 6 and 9 nodes, the error exceeded 20% for 8 of 18 queries; on the clusters with 6 and 9 nodes it exceeded 20% for only 1 of 12 queries, and on 9 nodes it was below 20% for all 6 queries. This accuracy level is considered acceptable for a forecasting model used at the information system design stage.
- Anureet Kaur. Big Data: A Review of Challenges, Tools and Techniques // IJSRSET. 2016. V. 2. № 2. P. 1090−1093.
- Jerzy Duda. Business intelligence and NoSQL databases // Information Systems in Management. 2012. V. 1 (1). P. 25−37.
- W.H. Inmon. Building the Data Warehouse. 4th ed. Wiley Publishing, Inc. 2005. 576 p.
- Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, Inc. New York, NY, USA. 2009. 458 p.
- Aleš Hejmalíček. Hadoop as an Extension of the Enterprise Data Warehouse. Masaryk University, Faculty of Informatics, Brno. 2015.
- E. Redmond and J.R. Wilson. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. Pragmatic Bookshelf. 2012.
- Sadalage P., Fowler M. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional. 2012.
- Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters // Sixth Symposium on Operating System Design and Implementation (OSDI'04). San Francisco, CA. 2004.
- Feng Li, Beng Chin Ooi, M. Tamer Özsu, Sai Wu. Distributed data management using MapReduce // Journal ACM Computing Surveys (CSUR). January 2014. V. 46. № 3. Article № 31.
- Yin Huai, Ashutosh Chauhan, Alan Gates et al. Major Technical Advancements in Apache Hive // Proceedings of the ACM SIGMOD International Conference on Management of Data. 2014.
- Zhou G., Zhu Y., Wang G. Cache Conscious Star-Join in MapReduce Environments // Proceedings of the International Workshop on Cloud Intelligence (Cloud-I '13). 26 August 2013.
- Jaqueline Joice Brito, Thiago Mosqueiro, Ricardo Rodrigues Ciferri, Cristina Dutra de Aguiar Ciferri. Faster cloud Star Joins with reduced disk spill and network communication // International Conference on Computational Science (ICCS). 2016. V. 80. P. 74−85.
- Pavlo A., Paulson E., Rasin A., Abadi D.J., DeWitt D.J., Madden S.R., Stonebraker M. A comparison of approaches to large-scale data analysis // Proceedings 35th SIGMOD International Conference on Management of Data. New York: ACM Press. 2009. P. 165−178.
- A.V. Burdakov, U.A. Grigorev, A.D. Ploutenko. Comparison of table join execution time for parallel DBMS and MapReduce // Proceedings of Software Engineering / 811: Parallel and Distributed Computing and Networks / 816: Artificial Intelligence and Applications (18 March 2014, Innsbruck, Austria). ACTA Press. 2014.
- H.V. Simhadri. Program-Centric Cost Models for Locality and Parallelism. PhD thesis. CMU. 2013.
- K. Palla. A comparative analysis of join algorithms using the Hadoop Map/Reduce framework. Master’s thesis, University of Edinburgh. 2009.
- Afrati F.N., Ullman J.D. Optimizing joins in a map-reduce environment // Proceedings of the 13th EDBT. 2010.
- Wu S., Li F., Mehrotra S., Ooi B.C. Query optimization for massively parallel data processing // Proc. 2nd ACM Symp. on Cloud Computing. 2011. P. 12:1–12:13.
- Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, Jeffrey D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. CoRR, abs/1206.4377. 2012.
- Howard Karloff, Siddharth Suri, Sergei Vassilvitskii. A model of computation for MapReduce // Proceedings of Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2010. Philadelphia, PA (USA). Society for Industrial and Applied Mathematics. P. 938−948.
- Koutris P., Suciu D. Parallel evaluation of conjunctive queries // Proc. 30th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. New York: ACM. 2011. P. 223−234.
- Yufei Tao, Wenqing Lin, Xiaokui Xiao. Minimal mapreduce algorithms // Proceedings ACM SIGMOD International Conference on Management of Data (SIGMOD). New York (USA): ACM. 2013. P. 529−540.
- Li B., Mazur E., Diao Y., McGregor A., Shenoy P.J. A platform for scalable one-pass analytics using MapReduce // Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD). 2011. P. 985−996.
- Grigor'ev Yu.A., Plutenko A.D., Pluzhnikov V.L., Ermakov E.Yu., Czvyashhenko E.V., Proletarskaya V.A. Theory and Practice of Analyzing Parallel Database Systems [Teoriya i praktika analiza parallel'nykh sistem baz dannykh]. Vladivostok: Dal'nauka. 2015. 336 p. (in Russian).
- T. White. Hadoop: The Definitive Guide. 4th ed. O'Reilly Media. 2015.
- Digital Ocean. https://www.digitalocean.com.
- H. Abdi. The method of least squares // Encyclopedia of Measurement and Statistics / N. Salkind (ed.). Thousand Oaks, CA (USA): Sage. 2007.
- Grigoriev Y.A., Proletarskaya V.A., Ermakov E.Y., Ermakov O.Y. Efficiency Analysis of the access method with the cascading Bloom filter to the data warehouse on the parallel computing platform // Journal of Physics: Conference Series. IOP Publishing. 2017. V. 913. № 1. P. 1−10.