MapReduce Interview Questions and Answers
What exactly is MapReduce?
MapReduce is the system used to process data in a Hadoop cluster. It consists of two phases, Map and then Reduce, with a stage known as the shuffle and sort between them. Each Map task operates on a discrete portion of the overall dataset, typically one HDFS block. Where possible, each node processes data stored on that node. After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase.
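The flow described above (Map, then shuffle and sort, then Reduce) can be illustrated with a small in-memory word count. This is only a single-process sketch of the idea, not how Hadoop itself is implemented: the function names (map_phase, shuffle_and_sort, reduce_phase, run_job) are made up for illustration, and each inner list of lines merely stands in for one HDFS block.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for line in split:
        for word in line.split():
            yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group intermediate values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values collected for one key into one result."""
    return key, sum(values)


def run_job(splits):
    """Run every Map first, then shuffle/sort, then Reduce each key group."""
    intermediate = []
    for split in splits:  # each split stands in for one HDFS block (assumption)
        intermediate.extend(map_phase(split))
    return [reduce_phase(key, values)
            for key, values in shuffle_and_sort(intermediate)]


if __name__ == "__main__":
    example_splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
    print(run_job(example_splits))
```

In a real cluster the Map calls run in parallel on different nodes and the grouped keys are partitioned across several reducers; the sketch keeps everything in one process to show only the order of the stages.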
Can I write a MapReduce program with any language other than Java?
Yes. MapReduce programs can be written in many programming languages, including Java, R, C++, and scripting languages such as Python and PHP. Any language that can read from stdin, write to stdout, and parse tab and newline characters should work. Hadoop streaming (a Hadoop utility) allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
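As a rough illustration of the stdin/stdout contract, here is a minimal word-count sketch in Python that could serve as both mapper and reducer for Hadoop streaming. The script name (wc_streaming.py) and the "map"/"reduce" command-line argument are assumptions made for this example. It relies on the streaming convention that each record is a line of the form key, tab, value, and that the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word found in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical entry point: run as "wc_streaming.py map" or "wc_streaming.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

The same script can be tried locally without a cluster by piping a text file through the map step, then sort, then the reduce step, which mimics the shuffle and sort. On a cluster, the two commands would be passed to Hadoop streaming through its -mapper and -reducer options.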
Is the MapReduce infrastructure on the BDA open source?
Yes, the core Hadoop HDFS storage and MapReduce compute infrastructure is 100% open source.
Since $HADOOP_HOME is deprecated on CDH 4.1.2 / BDA V2.0.1, what environment variable should be used?
On BDA V2.0.1 with CDH 4.1.2, use $HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce.
What is the impact of shutting down a server for maintenance on MapReduce jobs?
In the general case, for a non-critical server (i.e. not node 1, 2, or 3), the MapReduce framework should reschedule that server's tasks on other nodes, and HDFS should serve the data from replicas held elsewhere. There should be no noticeable impact.
Can standard R code be translated into MapReduce?
ORCH V2.0 can automatically generate Hive queries for R language constructs to aid in data analysis and data preparation. The Hive queries are in turn executed as MapReduce code. This is accomplished through the ore API (ore.connect(type="HIVE")).