Laboratory Assignment: MapReduce with Hadoop



Laboratory Assignment

MapReduce Paradigm in a Virtual Grid Environment

1. Objective

MapReduce allows relatively fast and easy processing of very large datasets on a cluster of commodity machines. In this assignment we will become more familiar with the MapReduce paradigm and its open source Java implementation, Hadoop. This document walks students through the process of setting up and executing a MapReduce task over the Netflix Prize dataset on a single node. Finally, the student will be given the task of implementing their own MapReduce job and analyzing its execution.

1.1 Learning Objectives

• Understanding the Netflix Prize dataset

• MapReduce code example

• Compiling and executing a MapReduce task

2. Resources

1. MapReduce: Simplified Data Processing on Large Clusters

2. Important Concepts

3. Hadoop MapReduce Implementation

4. Hadoop command line options

5. Hadoop Job Partitioning (more pertinent to the project)

6. Hadoop Debugging

7. Hadoop APIs

3. Example

This section of the document details the Netflix Prize dataset and presents an example MapReduce task to calculate the average movie rating per user in the dataset.

3.1 Dataset

Before running any MapReduce tasks we need to understand the format of the dataset and load it onto the image. The format of the Netflix Prize dataset is described in its accompanying README. The description is as follows:

The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

CustomerID,Rating,Date

- MovieIDs range from 1 to 17770 sequentially.

- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.

- Ratings are on a five star (integral) scale from 1 to 5.

- Dates have the format YYYY-MM-DD.

For our purposes we use only the Customer ID and Rating. The customer ID will be used to identify unique ratings, and the rating is the element of the dataset we will be aggregating via map and reduce tasks.

Your instructor will give directions on where to access the (potentially truncated or aggregated) dataset.

If the dataset has been aggregated, the movie ratings will not be split into one file per movie. MovieIDs are still present in the data file: a line of the form "MovieID:" marks the start of a movie's ratings, and every rating line that follows (until the next such marker) belongs to that movie:

...

1447274,5,2005-11-15

709867,4,2005-12-22

33:

1623180,5,2005-07-11

282486,3,2005-07-12

1987434,4,2005-07-13

1447783,5,2004-06-25

2616301,5,2005-07-22

...
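The parsing logic implied by this aggregated format can be sketched in plain Java. This is an illustration only; the class and method names (RatingParser, parseRatings) are our own and not part of any provided assignment code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a parser for the aggregated Netflix data format.
// A line ending in ':' announces a new MovieID; every other line
// is "CustomerID,Rating,Date" and belongs to the current movie.
public class RatingParser {

    // One parsed rating record.
    public static class Rating {
        public final int movieId;
        public final int customerId;
        public final int stars;
        public final String date;

        Rating(int movieId, int customerId, int stars, String date) {
            this.movieId = movieId;
            this.customerId = customerId;
            this.stars = stars;
            this.date = date;
        }
    }

    public static List<Rating> parseRatings(List<String> lines) {
        List<Rating> out = new ArrayList<>();
        int currentMovie = -1;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.endsWith(":")) {
                // e.g. "33:" -> the ratings that follow belong to movie 33
                currentMovie = Integer.parseInt(line.substring(0, line.length() - 1));
            } else {
                String[] parts = line.split(",");
                out.add(new Rating(currentMovie,
                                   Integer.parseInt(parts[0]),
                                   Integer.parseInt(parts[1]),
                                   parts[2]));
            }
        }
        return out;
    }
}
```

Note that the parser must carry the current MovieID across lines; in a real Hadoop job the input split boundaries complicate this, which is one reason the example task keys on CustomerID rather than MovieID.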

3.2 Average User Rating

We will be using Hadoop’s implementation of MapReduce to find the average of each user’s movie ratings in the dataset.

To compute the average rating per user, two quantities must be accumulated for each user: the sum of that user's ratings and the number of ratings. This is not as straightforward as it may seem. Because the map tasks work in parallel on separate chunks of the ratings file, no map task has higher-level knowledge of which ratings and users have already been emitted, so the simplest possible emit must be chosen that still allows the reducer to aggregate per user. To accomplish this, for every rating there will be an emit keyed on the UserID, consisting of a two-element list that contains the rating value and a count. Because the map task sees each rating independently, the count is always emitted as 1.

High level emit:

CSV data file line -> <UserID, (Rating, 1)>

Input:

33:

1623180,5,2005-07-11

282486,3,2005-07-12

1987434,4,2005-07-13

Emit:

<1623180, (5, 1)>

<282486, (3, 1)>

<1987434, (4, 1)>
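The map logic just described can be sketched in plain Java, leaving out the Hadoop Mapper boilerplate. The class and method names here (AverageRatingMap, mapLine) are illustrative, not part of the Hadoop API or any provided code:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch of the map logic: one "CustomerID,Rating,Date" line becomes
// an emit of <CustomerID, (Rating, 1)>. Movie-marker lines ("MovieID:")
// produce no emit, since the average is per user, not per movie.
public class AverageRatingMap {

    // Returns null for marker/blank lines, otherwise <userId, [rating, count]>.
    public static Map.Entry<Integer, int[]> mapLine(String line) {
        line = line.trim();
        if (line.isEmpty() || line.endsWith(":")) return null;
        String[] parts = line.split(",");
        int userId = Integer.parseInt(parts[0]);
        int rating = Integer.parseInt(parts[1]);
        // The count is always 1: the mapper sees each rating in isolation.
        return new SimpleEntry<>(userId, new int[] { rating, 1 });
    }
}
```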

The reduce task then iterates over the list of (rating, count) pairs collected for each user and aggregates the rating sum and the count. Its final operation is to divide the aggregated rating sum by the count to obtain the average, and to emit the average rating for the user.

High level emit:

<UserID, list of (Rating, 1) pairs> -> <UserID, AverageRating>

Input (illustrative values):

<1623180, [(5, 1), (3, 1)]>

Emit:

<1623180, 4.0>
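The reduce logic can likewise be sketched without the Hadoop Reducer boilerplate. Again, the names (AverageRatingReduce, reduceUser) are our own illustration:

```java
import java.util.List;

// Sketch of the reduce logic: for one user, sum the ratings and the
// counts across all (rating, count) pairs, then divide to get the average.
public class AverageRatingReduce {

    public static double reduceUser(List<int[]> ratingCountPairs) {
        int ratingSum = 0;
        int count = 0;
        for (int[] pair : ratingCountPairs) {
            ratingSum += pair[0]; // accumulate the rating values
            count += pair[1];     // accumulate the counts (1 per map emit)
        }
        // Final step: emit the average rating for this user.
        return (double) ratingSum / count;
    }
}
```

Summing (rating, count) pairs rather than averaging directly is what makes the operation associative, which is also why the same function could serve as a Hadoop combiner for the sum-and-count stage.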
