Hadoop in Data Warehousing

by Alexey Grigorev

INFO-H-419: Data Warehouses project

Hadoop in Data Warehousing

by Alexey Grigorev





Hadoop: In this Presentation

  1. Introduction
  2. Origins
  3. MapReduce
  4. Hadoop as MapReduce Implementation
  5. Data Warehouse on Hadoop
  6. Hadoop and Data Warehousing
  7. Conclusions

Why?

MapReduce: Origins

MapReduce: Origins

MapReduce: Origins

MapReduce: Origins

MapReduce

MapReduce Stages

each MapReduce Job is executed in 3 stages

MapReduce Stages

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.

MapReduce Example

			def map(String input_key, String doc):
			  for each word w in doc:
			    EmitIntermediate(w, 1)
			def reduce(String output_key, Iterator output_vals):
			  int res = 0
			  for each v in output_vals:
			    res += v
			  Emit(res)
		

MapReduce Example

MapReduce Example: Result

Hadoop

http://flickr.com/photos/erikeldridge/3614786392/

Hadoop

... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop

Hadoop Cluster: Terminology

Hadoop

http://escience.washington.edu/get-help-now/what-hadoop

Fault-Tolerance $\approx$ Load-Balancing

Advantages

Disadvantages

Disadvantages

[Abouzeid, Azza et al 2009]

Hadoop as a Data Warehouse

Cheetah

Cheetah

Cheetah

Hive

HiveQL

Tables
	  	STATUS UPDATE(user id int, status string, ds string)
			PROFILES(userid int, school string, gender int)
	  
	  	LOAD DATA LOCAL INPATH 'logs/status_updates'
			INTO TABLE status_updates
			PARTITION (ds='2009-03-20')
	  

HiveQL

			FROM
			(SELECT a.status, b.school, g.gender
			 FROM status_updates a JOIN profiles b
			 ON (a.userid = b.userid and a.ds = '2009-03-20') subq1
			INSERT OVERWRITE TABLE gender_summary
			PARTITION (ds='2009-03-20')
			SELECT subq1.gender, count(1)
			GROUP BY subq1.gender
			INSERT OVERWRITE TABLE school_summary
			PARTITION (ds='2009-03-20')
			SELECT subq.school, count(1)
			GROUP BY subq1.school
	  

HiveQL

			FROM
			(SELECT a.status, b.school, g.gender
			 FROM status_updates a JOIN profiles b
			 ON (a.userid = b.userid and a.ds = '2009-03-20') subq1
			INSERT OVERWRITE TABLE gender_summary
			PARTITION (ds='2009-03-20')
			SELECT subq1.gender, count(1)
			GROUP BY subq1.gender
			INSERT OVERWRITE TABLE school_summary
			PARTITION (ds='2009-03-20')
			SELECT subq.school, count(1)
			GROUP BY subq1.school
	  

HiveQL

			REDUCE subq2.school, subq2.meme, subq2.cnt
			USING 'top10.py' AS (school, meme, cnt)
			FROM (
				SELECT subq1.school, subq1.meme, count(1) as cnt
				FROM
				(MAP b.school, a.status
					USING 'meme_extractor.py'
					AS (school, meme)
					FROM status_update a JOIN profiles b
					ON (a.userid = b.userid)) subq1
				GROUP BY subq1.school, subq1.meme
				DISTRIBURE BY school, meme
				SORT BY school, meme, cnt desc)
			) subq2
	  

Hadoop + Data Warehouse

http://www.flickr.com/photos/mrflip/5150336351/in/photostream/

Hadoop + Data Warehouse

ETL

ETL: examples

Active Storage

Active Storage - 2

Analytical Sandbox $\Rightarrow$

http://www.flickr.com/photos/pasukaru76/9824401426/
http://www.flickr.com/photos/pasukaru76/4977447932/

Analytical Sandbox

Conclusions

References

  1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf]
  2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
  3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. [pdf]
  4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata)
  5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf]
  6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]

References

  1. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
  2. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
  3. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
  4. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [pdf]
  5. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
  6. Apache Hadoop project home page, url: [link].
  7. Apache HBase home page, [link].
  8. Apache Mahout home page, [link].
  9. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
  10. "The Impact of Data Temperature on the Data Warehouse." whitepaper by Terradata (2012). [pdf]
  11. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]

Thank you

 Prepared with Shower