Hadoop Project
Technical Aspects of Big Data Management - BUAL 5660
Instructor: Dr. Pankush Kalgotra

This is a group assignment divided into three parts. The first part concentrates on the basic Linux commands used to access the Linux environment and HDFS. The second part runs a MapReduce application. The third part explores Pig programming. Provide a screenshot of the code or output when required.

1. Open the Hadoop machine in VMware Workstation. In the VM's system configuration, set Number of cores per processor = 2 and RAM to at least 12 GB. If the host machine has only 8 GB of RAM, assign 4 GB; in that case you will have to force-start Cloudera.
2. Now launch the machine.
3. Download the folder hadoop_project from Canvas.
4. Click Launch Cloudera Express on the desktop.
5. Open the browser from the top bar.
6. Click the bookmark named Cloudera Manager (user: cloudera, password: cloudera).
7. Start two services: HDFS and YARN.
8. Open a terminal from the top bar and change the user to root by typing sudo bash. You may want to change the root password if you do not know it (passwd root).

Part 1: Exploring the Linux File System and HDFS

Question 1: Make a directory named bigdata in the home directory of your local Linux machine (use mkdir). Create a text file named bigtest inside the bigdata folder (use the touch, cat, or vi command). Paste a screenshot showing that the directory and the text file were created.

Question 2: Change the user to hdfs (su hdfs) if you are using Cloudera. Create a directory named classhdfs inside the "user" directory in HDFS (use hadoop fs -mkdir). Then copy the bigtest text file created in Question 1 into it. Paste a screenshot showing that the directory was created and the file was copied. (Hint: use -copyFromLocal. A command sketch for both questions follows below.)
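The following is a minimal command sketch for Questions 1 and 2, assuming the local home directory is /home/cloudera and the HDFS target is /user/classhdfs; these paths are illustrative and should be adjusted to your own environment.

    # Question 1: create the local directory and file
    mkdir /home/cloudera/bigdata              # directory in the local Linux home
    touch /home/cloudera/bigdata/bigtest      # empty text file (cat or vi also work)
    ls -l /home/cloudera/bigdata              # verify for the screenshot

    # Question 2: create the HDFS directory and copy the file into it
    su hdfs                                   # switch to the hdfs user; run the rest in this shell
    hadoop fs -mkdir /user/classhdfs
    hadoop fs -copyFromLocal /home/cloudera/bigdata/bigtest /user/classhdfs
    hadoop fs -ls /user/classhdfs             # verify for the screenshot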
Part 2: HDFS and MapReduce

In software, JAR (Java Archive) is a package file format typically used to aggregate many Java class files and associated metadata and resources (text, images, etc.) into one file to distribute application software or libraries on the Java platform. JAR files are archive files built on the ZIP file format and have the .jar file extension. You can create or extract JAR files using the jar command that comes with a JDK. Zip tools also work; however, the order of entries in the zip file headers is important when compressing, as the manifest often needs to come first. Inside a JAR, file names are Unicode text.

You are given a folder named "Anagram_bigdata". It contains the map and reduce Java files, along with a jar file used to compile them. This assignment is to find anagrams in the English dictionary; the dictionary text file en-US.dic is also in the folder. Your task is to first compile the Java files, then create a jar file for the MapReduce program, and finally run the MapReduce job on the dictionary file to find anagrams. Note: to run MapReduce jobs, the input files must be in HDFS.

Step 1: Move the folder "Anagram_bigdata" from your local machine to the Cloudera virtual machine. Make sure you know the location of the folder (usually it is in your home directory or on the Desktop).
Step 2: Change the current directory to Anagram_bigdata (cd /home/……/Anagram_bigdata).
Step 3: Run the commands below to compile the Java code and create a jar file named anagram:

    sudo bash
    javac -classpath hadoop-core-1.2.1.jar *.java
    jar cvf anagram.jar *.class

Step 4: Change the user to hdfs (su hdfs) if you are using Cloudera.
Step 5: Create a directory named "Anainput" in HDFS (e.g., /Anainput). Use the hadoop fs -mkdir command.
Step 6: Copy the file en-US.dic from Anagram_bigdata into Anainput (use the -copyFromLocal command).
Step 7: Run the jar file with the command below:

    hadoop jar anagram.jar AnagramDriver /Anainput /Anaoutput

Note: Your results will be saved in the /Anaoutput directory. /Anaoutput must not exist before running the command.

Answer the following questions after running the jar file.

Question 3: Provide a screenshot of the output showing that the job ran successfully without errors.
Question 4: How many reads and writes happened on HDFS for the map and reduce tasks?
Question 5: How many map and reduce tasks were launched?
Question 6: Paste a screenshot of the output from the MapReduce application. To see the results, use the -cat command:

    hadoop fs -cat /Anaoutput/*

Note: You can also browse the output files in Cloudera Manager > HDFS > NameNode Web UI > Utilities > File browser, where you can also see the replication factor, block size, etc. Alternatively, you can transfer files from HDFS to the local system using the copyToLocal command:

    hadoop fs -copyToLocal hdfs_location local_location

Part 3: Exploring Pig Latin

The dataset to use for this assignment is the IRIS dataset, which is in the same folder. The datatypes for the variables are listed below:

    Variable        Datatype
    SepalLength     float
    SepalWidth      float
    PetalLength     float
    PetalWidth      float
    Species         chararray

Notes:
- Once you exit the grunt shell, any alias you created ceases to exist, so the tasks in this part must be completed in a single grunt-shell session.
- If you copy the code from a Word document, you may have to delete the quotes and retype them.

For this assignment you need to use the Pig component of Hadoop.

Question 7: Create a directory named "bigdatapigYourName" in HDFS at the path /bigdatapigYourName and store the iris dataset in this directory. Provide a screenshot showing that the data was successfully uploaded to HDFS.

Now move into the grunt shell of Pig and answer the following questions:

Question 8: Create an alias for the dataset. Display the schema of the dataset and display the entire dataset in the terminal. Attach screenshots to support your answer.

Question 9: Display the count of records in each category of Species and attach the screenshots. FOREACH ... GENERATE syntax:

    alias = FOREACH alias GENERATE expression [expression ...];

Question 10: Compute the average Sepal Length of all species. (A Pig Latin sketch for Questions 8 through 10 follows below.)
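A minimal Pig Latin sketch for Questions 8 through 10, assuming the iris data is a comma-separated file uploaded as /bigdatapigYourName/iris.csv with the five columns listed above; the file name and alias names are illustrative assumptions, not part of the assignment.

    -- Question 8: load the data with an explicit schema
    iris = LOAD '/bigdatapigYourName/iris.csv' USING PigStorage(',')
           AS (SepalLength:float, SepalWidth:float,
               PetalLength:float, PetalWidth:float, Species:chararray);
    DESCRIBE iris;                        -- display the schema
    DUMP iris;                            -- display the entire dataset

    -- Question 9: count of records in each Species category
    by_species = GROUP iris BY Species;
    species_counts = FOREACH by_species GENERATE group AS Species, COUNT(iris) AS n;
    DUMP species_counts;

    -- Question 10: average Sepal Length over all records
    all_rows = GROUP iris ALL;            -- collapse the relation into one bag
    avg_sepal = FOREACH all_rows GENERATE AVG(iris.SepalLength) AS avg_sepal_length;
    DUMP avg_sepal;

GROUP ... ALL is used for Question 10 because the average is taken over the whole dataset; a per-species average would instead reuse the GROUP ... BY Species relation from Question 9.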