CCA175 : Practice Questions and Answers


Total 61 Scenarios




About Cloudera CCA175 (Scala): Hands-on Practice Scenarios on CDP with Spark 2.4, Certification Preparation Kit (61 solved scenarios in total).

The Cloudera CCA175 (Hadoop & Spark Developer) certification is one of the most popular and in-demand certifications in the BigData world. HadoopExam has been providing certification preparation material for the BigData ecosystem since its inception, and thanks to the hard work of our technical team we can now offer preparation material for the new version of CCA175. This time we have separated the material into Scala and PySpark editions; this is the Scala edition.

As you may know, Cloudera has changed its BigData platform: it is no longer only the Hadoop framework but an integrated product (CDP) that can run in the public cloud as well as in an on-premises data center. Also note that Cloudera has discontinued its QuickStart VM, so there is currently no VM available for practicing the CCA175 certification. The HadoopExam technical team has therefore written a complete step-by-step guide, included with this preparation material, for setting up a single-node Hadoop cluster instance on Google Cloud (Google Cloud provides $300 of free credits, which is enough for practicing the CCA175 scenarios). The guide shows how to set up the instance and load the data needed for the scenarios. Every scenario in this Practice Scenario Test has been tested and executed on that single-node CDP cluster. To access the scenarios you do not have to install any software; you can access the questions and answers through our browser-based simulator.

CCA175 is a hands-on exam that the candidate has to complete in two hours. You are given 10-12 tasks and must complete at least 70% of them to clear the exam. Being hands-on, it carries more credibility in the industry. This practice set covers the entire syllabus and has been executed on the new Cloudera platform, CDP. It includes in-depth, complex scenarios, and the complexity increases as you move ahead. We are in the process of adding complementary videos for selected problems in the online simulator. The practice and sample problems with their solutions are provided in the HadoopExam Online Simulator only. You can check the sample paper below.

Exam Instructions

Cloudera CCA175 Hadoop (CDP) & Spark 2.4 in Scala, Assessment 1-20

- Please complete all the assessments given on the next pages.
- These assessment questions are valid for the Cloudera Hadoop & Spark Developer Certification CCA175, which is based on CDP and Spark 2.4.
- There are currently 20 assessments in this set.
- As you move further, the complexity of the exercises increases.
- To access the data, a separate link is provided in each exercise.
- These assessments are not time-bound, but the real exam certainly is. As of today, the real exam is 120 minutes, and no separate time limit is given for individual exercises.
- Assessment: you have to implement the solution to each problem on the Cloudera CDP (single-node) platform.
- Before appearing in the real exam, please check with us or drop an email to hadoopexam@, so that if there is any update we can share it with you.
- Once you have taken the exam, please share your feedback: question patterns, what we should improve, and so on. Future learners can benefit from your feedback, and we can improve the material and provide better content to you as well.

Exercise-1

Problem Statement: You have to create data files from the given dataset (check the Data tab to access and download the data).

- hecourses.json
- students.csv

Based on that please accomplish the following activities.

1. Create these two files in a local directory and then upload them to HDFS under the spark4 directory.

2. Use the built-in schema inference for the hecourses.json file and create a new DataFrame from it.

3. Define a new schema for students.csv with the column names given below.
   a. StdID
   b. CourseId
   c. RegistrationDate

4. Using the above schema, create a DataFrame for the "students.csv" data.

5. Using both DataFrames, find the list of courses that are not yet subscribed, and save the result in the "spark4/notsubscribed.json" directory.

6. Find the total fee collected by each course category. The column holding the total fee collected should be named "TotalFeeCollected".

7. Save the result in "spark4/TotalFee.json".

Below is the data for the exercise.

//File Contents for the hecourses.json

[{ "CourseId": 1001, "CourseFee": 7000, "Subscription": "Annual", "CourseName": "Hadoop Professional Training", "Category": "BigData", "Website": ""

}, {

"CourseId": 1002, "CourseFee": 7500, "Subscription": "Annual", "CourseName": "Spark Professional Training", "Category": "BigData", "Website": "" },{ "CourseId": 1003, "CourseFee": 7000, "Subscription": "Annual", "CourseName": "PySpark Professional Training", "Category": "BigData", "Website": "" }, { "CourseId": 1004, "CourseFee": 7000, "Subscription": "Annual", "CourseName": "Apache Hive Professional Training", "Category": "Analytics", "Website": "" },{ "CourseId": 1005, "CourseFee": 10000, "Subscription": "Annual", "CourseName": "Machine Learning Professional Training", "Category": "Data Science", "Website": "" }, { "CourseId": 1006, "CourseFee": 7000, "Subscription": "Annual", "CourseName": "SAS Base", "Category": "Analytics", "Website": "" }]

//File Contents for the students.csv

ST1,1004,20200201
ST1,1003,20200211
ST2,1002,20200206
ST2,1001,20200204
ST3,1004,20200202
ST4,1003,20200211
ST6,1004,20200207
ST7,1005,20200202
ST9,1003,20200206
ST9,1002,20200209
ST3,1001,20200208
ST2,1004,20200207
ST1,1005,20200201
ST2,1003,20200204

Solution:

Step-1: Create the JSON and CSV files locally.

mkdir spark4

cd spark4

vi hecourses.json

vi students.csv

Step-2: Now upload these files to HDFS.

//Go to home directory

cd ~

//Upload spark4 directory to hdfs

hdfs dfs -put -f spark4

//Check whether the data has been uploaded.

hdfs dfs -ls spark4

Step-3: Read the JSON file in Spark as a DataFrame.

//We should note that this is a multiline file,
//so we need to provide the option accordingly.

val heCourseDF = spark.read.option("multiline","true").json("spark4/hecourses.json")
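If the multiline option is omitted, Spark expects one JSON object per line (JSON Lines format); for a file like hecourses.json, which spreads an array over several lines, each line fails to parse on its own. A quick way to see this (a sanity check only, not part of the solution; brokenDF is just an illustrative name):

//Without the multiline option, the parser cannot read the pretty-printed array,
//so the inferred schema typically contains only a _corrupt_record column.
val brokenDF = spark.read.json("spark4/hecourses.json")
brokenDF.printSchema()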

heCourseDF.show(false)

Step-4: Define the custom schema for the students.csv file.

//Use :paste to enter multi-line content in the spark-shell (Ctrl-D to finish).
//We need to import the types package as well.
import org.apache.spark.sql.types._

val heschema = new StructType()
  .add("StdID", StringType, true)
  .add("CourseId", LongType, true)
  .add("RegistrationDate", StringType, true)

Step-5: You can check the created schema with the following statement.

heschema.printTreeString

//Now create a DataFrame for the CSV data as below.
val heStdDF = spark.read.format("csv")
  .option("header", "false")
  .schema(heschema)
  .load("spark4/students.csv")

//Now check the data in the DataFrame.
heStdDF.show(false)

Step-6: Now find all the courses which are not yet subscribed.

//We can use a left join to find all the courses that are not yet subscribed.
heCourseDF.join(heStdDF, heCourseDF("CourseId") === heStdDF("CourseId"), "left").show(false)

//Now filter all the records where StdID is null.
heCourseDF.join(heStdDF, heCourseDF("CourseId") === heStdDF("CourseId"), "left")
  .filter("StdID is null")
  .show(false)
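As a side note (not required by the exercise), Spark also supports a left_anti join type, which keeps only the left-side rows that have no match. It produces the same "not subscribed" list in one step, and joining on Seq("CourseId") avoids the duplicate column. A minimal sketch with the same DataFrames:

//left_anti returns only the courses with no matching registration.
heCourseDF.join(heStdDF, Seq("CourseId"), "left_anti").show(false)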

//We can save the filtered DataFrame as JSON data. Before saving we need to drop the duplicate CourseId column as well.
heCourseDF.join(heStdDF, heCourseDF("CourseId") === heStdDF("CourseId"), "left")
  .filter("StdID is null")
  .drop(heStdDF("CourseId"))
  .write
  .json("spark4/notsubscribed.json")

//Now check whether the data has been saved. (Do it in a different shell or using the Hue UI.)
hdfs dfs -ls spark4/notsubscribed.json
hdfs dfs -cat spark4/notsubscribed.json/part-00000-8992451d-2da8-4448-b6e8869c797da5c8-c000.json

Step-7: Now find the total fee collected in each category.

//We also need to rename the aggregate column to TotalFeeCollected.
//groupBy("Category") already keeps the grouping column, so we do not repeat it inside agg.
//Note: with a left join, a course that has no registrations still contributes its fee once;
//use an inner join if only actual registrations should be counted.
heCourseDF.join(heStdDF, heCourseDF("CourseId") === heStdDF("CourseId"), "left")
  .groupBy("Category")
  .agg(sum("CourseFee"))
  .withColumnRenamed("sum(CourseFee)", "TotalFeeCollected")
  .show(false)

//Now save this data as a JSON file.
//We can use the select function to pick exactly the columns we want before saving.
heCourseDF.join(heStdDF, heCourseDF("CourseId") === heStdDF("CourseId"), "left")
  .groupBy("Category")
  .agg(sum("CourseFee"))
  .withColumnRenamed("sum(CourseFee)", "TotalFeeCollected")
  .select($"Category", $"TotalFeeCollected")
  .write
  .json("spark4/TotalFee.json")

//In case you have to delete an existing output directory, use the command below and re-run the snippet above.
hdfs dfs -rm -R spark4/TotalFee.json

//Now check whether the data has been created in HDFS.
hdfs dfs -ls spark4/TotalFee.json
hdfs dfs -cat spark4/TotalFee.json/part-00104-7a72c85a-a156-4b66-9dc52796a45e7c3f-c000.json

Step-8: I would rather prefer the SQL syntax for implementing the same solution. So let's first create temporary views from the DataFrames.

heCourseDF.createOrReplaceTempView("heCourseView")
heStdDF.createOrReplaceTempView("heStdView")

//Run a select query on each view.
spark.sql("select * from heCourseView").show(false)
spark.sql("select * from heStdView").show(false)

//Now apply the join operation on both views.
spark.sql("select Category, sum(CourseFee) as TotalFeeCollected from heCourseView C left join heStdView S on C.CourseId = S.CourseId group by Category").show(false)

//Save this DataFrame.
val totalFeeDF = spark.sql("select Category, sum(CourseFee) as TotalFeeCollected from heCourseView C left join heStdView S on C.CourseId = S.CourseId group by Category")
totalFeeDF
  .write
  .json("spark4/TotalFee_sql.json")

//Remove the output directory first if it already exists.
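If you prefer to verify the results from within the same spark-shell session rather than a separate terminal, you can read the saved directories back with Spark itself; it picks up all the part files automatically. A small sketch using the output paths from the solution above:

//Read each output directory back and display its contents.
spark.read.json("spark4/notsubscribed.json").show(false)
spark.read.json("spark4/TotalFee.json").show(false)
spark.read.json("spark4/TotalFee_sql.json").show(false)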
