If you have any questions regarding the assignment, please post them to the "Discussion" section on Canvas so everyone can get help faster. Thanks!

Part One: Hadoop Setup for MapReduce

Deliverables:
For Part I, you need to write a report on what you have done and provide a few typical screenshots (marked in red): the Java install, the word count output (just part of it, since it is very long), the server monitoring pages, and the bin/hdfs operations. All you have to do is follow the instructions, copying and pasting the commands.

This part is to install and set up the Hadoop environment. The following instructions have been tested on the DigitalOcean Ubuntu 16.04.3 x64 image. If you have a local Linux machine, you can carry out the project on your local Linux. You can also carry out the project using Amazon EC2, DigitalOcean, or Google Cloud.

If you do not have a DigitalOcean account, you may use this link to create an account. You are supposed to get a $10 credit, though it is not guaranteed. After this, create an Ubuntu Droplet. You can choose any size, but the faster the better (I used an 8 GB / 4 CPU machine).

It is very important that you do not connect to the instance through the browser-based console. If you are using Windows, please use PuTTY. If you are using Mac, please refer to this post. Digital Ocean Linux is not required for this project, but it may be easier than AWS, since some students have encountered authentication issues with AWS's Linux instances (based on experience from previous classes).

Attention: DON'T forget to destroy the DigitalOcean Droplet (this is what they call the Linux image) if you are not going to use it for a while. It keeps accruing fees as long as you have any Droplets running.

The codes and operations:

Install Java & Hadoop:

sudo apt-get update
sudo apt-get install default-jdk
java -version   (screenshot)
cd /usr/local
wget [the hadoop-2.8.1.tar.gz download URL]
tar -xzvf hadoop-2.8.1.tar.gz
mv hadoop-2.8.1 hadoop

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh to edit:
...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
...
source /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Standalone Hadoop Operation:

cd hadoop
mkdir input
cd input
wget [first sample text URL]
wget [second sample text URL]   ### try more times if you encounter trouble getting the files
cd ..
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount input output
cat output/*   (screenshot; if you see some weird strings, don't worry, you are getting the right results)

Pseudo-Distributed Operation:

nano etc/hadoop/core-site.xml to add:
...
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
...

sudo nano etc/hadoop/hdfs-site.xml:
...
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
...

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
export HADOOP_PREFIX=/usr/local/hadoop
ssh localhost
exit

nano $HOME/.bashrc to add the following to the end:
...
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
  hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
...

source ~/.bashrc
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh

By typing the command "jps", you should be able to see ResourceManager, DataNode, NameNode, Jps, NodeManager, and SecondaryNameNode. Check that your server is running by opening the monitoring page in a browser. (screenshot)

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/itis6320
bin/hdfs dfs -put input/* /user/itis6320
bin/hdfs dfs -ls /user/itis6320
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /user/itis6320 output
bin/hdfs dfs -ls output
bin/hdfs dfs -rm output/*   (screenshot)
sbin/stop-dfs.sh

mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
sudo nano etc/hadoop/mapred-site.xml
...
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
...

sudo nano etc/hadoop/yarn-site.xml
...
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
...

sbin/start-dfs.sh
sbin/start-yarn.sh   (screenshot)

YOU WILL BE ABLE TO RUN YOUR MAPREDUCE JOB

sbin/stop-dfs.sh
sbin/stop-yarn.sh

Part Two: Movie Similarities

For this part, we will process a large corpus of movie ratings to provide recommendations. When you're done, your program will help you decide what to watch on Netflix tonight. For each pair of movies in the data set, you will compute their statistical correlation and cosine similarity (see this blog for a discussion of these and other potential similarity metrics). Since this isn't a statistics class, the calculation of the similarity metrics for Python and Java is provided, but you need to supply it with the correct inputs.

For this section of the assignment, we have two input data sets: a small set for testing on your local machine or on DigitalOcean, and a large set for running on Amazon's cloud (attention: DigitalOcean's disk is too small for the large set). More info about the data set can be found here. For both data sets, you will find two input files:

movies.csv or movies.dat contains identification and genre information for each movie. Lines are of the form:
MovieID,Movie Title,Genre
0068646,The Godfather (1972),Crime|Drama
0181875,Almost Famous (2000),Drama|Music

ratings.csv or ratings.dat contains a series of individual user movie ratings. Lines are of the form:
UserID,MovieID,Rating,Timestamp
120,0068646,10,1365448727
374,0181875,9,1374863640

In addition to these two input files, your program should take a few additional arguments (a sketch of one way to declare them with mrjob follows this list):

-m [movie title]: The title of a movie for which we'd like to see similar titles. You should be able to accept multiple movie titles with more than one -m argument.

-k [number of items]: For each of the movies specified using -m, this specifies how many of the top matches we'd like to see. In other words, running with "-m The Godfather (1972) -k 10" would be asking for "the top ten movie recommendations if you liked The Godfather." (Default 25)

-l [similarity lower bound]: When computing movie similarity metrics, we'll produce a floating-point value in the range [-1, 1]. This input says to ignore any movie pairings whose similarity metric is below this value. (Default 0.4)

-p [minimum rating pairs]: When computing similarity metrics, ignore any pair of movies that don't have at least this many shared ratings. (Default 3)
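The assignment does not prescribe how these arguments are read. If you are using mrjob, one possibility is to declare them as pass-through options. This is only a minimal sketch; the class name, long option names, and dests below are mine, not part of the assignment, and older mrjob releases use configure_options()/add_passthru_option() instead of configure_args()/add_passthru_arg().

# Sketch only: declaring -m/-k/-l/-p as mrjob pass-through options.
from mrjob.job import MRJob

class MovieSimilarities(MRJob):

    def configure_args(self):
        super(MovieSimilarities, self).configure_args()
        # -m may be repeated, so collect every requested title into a list
        self.add_passthru_arg('-m', '--movie', action='append', default=[],
                              help='movie title to find similar titles for')
        self.add_passthru_arg('-k', '--top-k', type=int, default=25,
                              help='how many top matches to report per movie')
        self.add_passthru_arg('-l', '--lower-bound', type=float, default=0.4,
                              help='ignore pairs whose blended similarity is below this')
        self.add_passthru_arg('-p', '--min-pairs', type=int, default=3,
                              help='ignore pairs with fewer shared ratings than this')

Inside any mapper or reducer you can then read self.options.movie, self.options.top_k, and so on. If a short flag ever collides with one of mrjob's own switches, keep only the long form.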
Please don't attempt to filter down to the movies specified via -m until the final step. I want you to compute the similarities for all movies; the -m argument is there only to reduce the output size and make reading (and grading) the output easier. For the other arguments (-k, -l, and -p), you may filter whenever you want.

Output

Since we're computing two similarity metrics, we'll need to combine them into a single similarity value somehow. For your submission, you should blend the values together, using 50% of each. That is, your final value for a pair of movies is 0.5 * the statistical correlation + 0.5 * the cosine correlation for the pair.

For each movie selected with -m, sort the matches from largest to smallest by their blended similarity metric, outputting only the top K (-k) most similar movies that have at least the minimum blended similarity score (-l). For movies meeting this criterion, you should output:

The name of the movie for which we want similar titles (specified via -m).
The name of a similar movie.
The blended similarity metric that these two movies share.
The statistical correlation that these two movies share.
The cosine correlation that these two movies share.
The number of ratings this pair of movies had in common.

You may format your output however you like, as long as the values are in the correct order and I can reasonably make sense of it by looking at it briefly.

Steps

You may structure your sequence of map/reduce tasks however you choose. However, I recommend the following sequence of steps:

Join the input files: Initially, you have two input files (ratings.dat and movies.dat). You'll get most of the important info from ratings.dat, but it only has movie IDs rather than movie names. In the first step, you can assign names to the rated movies and drop the movie ID, so that you can refer to movies by their names going forward. You probably want your reducer's output to be key: user id, value: (movie title, rating), which you will use as the input of the next map/reduce step. Hint: the ratings file has 4 items on each line, while the movies file has 3; this difference can be used to tell the two inputs apart.

Produce movie rating pairs: Next, you want to organize the movies into pairs, recording the ratings of each when a user has rated both movies (i.e., for each user, you create a pair for every two movies that user rated). This gives you the vectors to use for the similarity metrics. For example, suppose we have three users, Alice, Bob, and Charlie: Alice has rated Almost Famous a 10, The Godfather a 9, and Anchorman a 4. Bob has rated Almost Famous a 7 and Anchorman a 10. Charlie has rated The Godfather a 10 and Anchorman an 8. You would end up with records that look like:

Key: (Almost Famous, The Godfather)   Values: (10, 9)
Key: (Almost Famous, Anchorman)       Values: (10, 4), (7, 10)
Key: (Anchorman, The Godfather)       Values: (4, 9), (8, 10)

Protip: You'll want to ensure that the keys you output are consistent for a pair of movies, for example by putting them in alphabetical order. Otherwise, you run the risk of having two keys for a pair of movies, e.g., (Anchorman, The Godfather) and (The Godfather, Anchorman). Having more than one key for a pair is bad, as they will be treated independently (and probably sent to different machines for processing).

Compute the similarity scores: Given keys that tell you a pair of movie names and values that contain a sequence of pairs, each corresponding to how one user rated that pair, you now have the information you need to compute the similarity scores for that pair of movies. You will need to organize the data before the calculation. For example, if the movies "Anchorman" and "The Godfather" have 5 rating pairs: (4, 9), (8, 6), (4, 9), (8, 4), (7, 9), the input for the calculation will be [4, 8, 4, 8, 7] and [9, 6, 9, 4, 9]. Then you put them into the calculation script.

In Python: the statistical calculation is provided as a script (a stand-in sketch is shown below); you may want to add the calculation process to your own job. In Java: for two lists of numbers r1 and r2, you can refer here for statistical correlation and here for cosine correlation.
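The Python calculation script from the original handout is not reproduced in this copy. As a stand-in, here is a minimal sketch, assuming two equal-length plain-Python rating lists r1 and r2; the function names are mine, not part of the assignment.

import math

def pearson(r1, r2):
    # Statistical (Pearson) correlation of two equal-length rating lists.
    n = len(r1)
    mean1 = sum(r1) / float(n)
    mean2 = sum(r2) / float(n)
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(r1, r2))
    sd1 = math.sqrt(sum((a - mean1) ** 2 for a in r1))
    sd2 = math.sqrt(sum((b - mean2) ** 2 for b in r2))
    if sd1 == 0 or sd2 == 0:
        return 0.0   # undefined when one movie's shared ratings are all identical
    return cov / (sd1 * sd2)

def cosine(r1, r2):
    # Cosine similarity of the two rating vectors.
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def blended(r1, r2):
    # The 50/50 blend required by the assignment.
    return 0.5 * pearson(r1, r2) + 0.5 * cosine(r1, r2)

For the worked example above, pearson([4, 8, 4, 8, 7], [9, 6, 9, 4, 9]) is roughly -0.76 and cosine(...) is roughly 0.87, so the blended value comes out close to 0.05.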
Filter and format the output: Filter out the movies that weren't specified with the -m flag and sort the output by the similarity metric so that it conforms to the desired output format.

Running it

For the small data set, you can run it locally. It will probably take a few minutes to complete. You can run over the large data set locally too, if you want, but it will take at least a few hours. Instead, let's farm it out to the Amazon EC2 platform.

Specifically, if you are writing your map/reduce job in Python, the "mrjob" module we used in the warm-up project will be helpful. The tutorial for the package can be found here; it comes with links for running this on Amazon EMR or Google Cloud. To run on Amazon, you'll need to tell MRJob to use "emr" as its runner. You'll also need to give it some basic configuration information. Edit your mrjob.conf with the following contents (more information here):

runners:
  inline:
    base_tmp_dir: /local
  emr:
    core_instance_type: t2.micro      ### use a faster instance if you get education credit from Amazon (refer here for more instance info)
    num_core_instances: 4             ### use more instances if you want it to be solved faster
    aws_access_key_id: [your access key]
    aws_secret_access_key: [your secret key]
    aws_region: us-east-2             ### can be another available region

Attention: YAML doesn't allow tabs, so only use spaces in mrjob.conf. By using m3.xlarge and 10 instances, you should be able to solve the large set in a few hours.

Instead of using security key sets, you can also use AWS's IAM to configure your account. Now, you should be able to invoke MRJob with "--conf-path mrjob.conf -r emr" to run your code in the cloud.

Deliverables:
For Part II, you need to include the following information in your report:
The source code that you have written.
Run your script with at least 3 movies and provide a screenshot of the result. An example is given (note that you should list your results with the highest average correlations first, not the lowest as shown in the example).

Some useful links
Install Java and Hadoop:
ssh to localhost without password:

In the warm-up project, we only used one mapper and one reducer. But in this one, you will need multiple mappers/reducers, so you need to organize them by defining steps. Here is how I define them:
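The instructor's own step definition appeared here as a screenshot in the original handout and is not reproduced. As one possible arrangement only, continuing the MovieSimilarities sketch from above and using method names I made up to mirror the five steps explained below, the mrjob steps() hook could chain the stages like this:

# Sketch only: one way to wire the five stages together with MRStep.
from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieSimilarities(MRJob):

    def steps(self):
        return [
            # Step 1: replace movie IDs with titles, keyed so each user's ratings group together
            MRStep(mapper=self.mapper_parse_input,
                   reducer=self.reducer_join_titles),
            # Step 2: for each user, emit a record for every pair of movies they rated
            MRStep(reducer=self.reducer_make_pairs),
            # Step 3: compute the similarity metrics for each movie pair
            MRStep(reducer=self.reducer_compute_similarity),
            # Step 4: keep only the movies requested with -m
            MRStep(reducer=self.reducer_filter_requested),
            # Step 5: sort and print the top -k results per requested movie
            MRStep(reducer=self.reducer_sort_and_output),
        ]

A step that lists only a reducer simply passes its input records through to that reducer, which matches the note below that no mapper is needed after Step 1.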
Also, to help you with the computation algorithm, here are explanations of each step:

Step 1: You need to take the movie ID, user ID, movie names, and ratings from the two input files. The way to tell the two input files apart is the number of items in each line. The output of this step should be (key, value) sets of the form movie ID, (user ID, movie name, rating). Notice that a key or value can be a tuple.

Step 2: You don't need any mapper after Step 1. In this step, you need to create rating pairs. The output should be (movie_name_1, movie_name_2), (movie_name_1, movie_name_2, rating1, rating2). Notice that the movie ID and user ID are dropped.

Step 3: You need to do the computations in this step. The output should be (movie_name_1, movie_name_2), (movie_name_1, movie_name_2, average (blended) correlation, statistical correlation, cosine correlation, number of co-occurrences).

Step 4: In this step, you need to take information from the argument parser and output the similar movies for the movies of interest.

Step 5: I add this step because the result from the last reducer is not yet sorted. Also, you can take parser information such as the number of items (-k) and print the final results.
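To make Steps 4 and 5 concrete, here is an illustrative sketch only, written as two methods of the MovieSimilarities job sketched earlier and reusing the pass-through option names from that sketch (movie, top_k, lower_bound, min_pairs, all of which are my own naming, not the assignment's):

    def reducer_filter_requested(self, pair, values):
        # pair is (movie_name_1, movie_name_2); keep pairs that involve a -m movie,
        # re-keyed by the requested title so the final step can group and sort them.
        for value in values:
            m1, m2, blend, corr, cos, n = value
            if n < self.options.min_pairs or blend < self.options.lower_bound:
                continue
            for wanted in self.options.movie:
                if wanted == m1:
                    yield wanted, (blend, m2, corr, cos, n)
                elif wanted == m2:
                    yield wanted, (blend, m1, corr, cos, n)

    def reducer_sort_and_output(self, wanted, matches):
        # Sort one requested movie's matches by blended score, largest first,
        # and emit only the top -k of them.
        best = sorted(matches, key=lambda t: t[0], reverse=True)[:self.options.top_k]
        for blend, other, corr, cos, n in best:
            yield wanted, (other, blend, corr, cos, n)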