Spark – Print contents of RDD - Tutorial Kart

Spark – Print contents of RDD


RDD (Resilient Distributed Dataset) is a fault-tolerant collection of elements that can be operated on in parallel.

To print RDD contents, we can use the RDD collect action or the RDD foreach action.

RDD.collect() returns all the elements of the dataset as an array at the driver program, and using a for loop over this array, we can print the elements of the RDD.

RDD.foreach(f) runs a function f on each element of the dataset. Note that foreach runs on the executors, so on a cluster the printed output appears in the executor logs rather than at the driver; only in local mode does it reach the driver's console.

In this tutorial, we will go through examples that use the collect and foreach actions, in both Java and Python.

RDD.collect() – Print RDD – Java Example

In the following example, we will write a Java program that loads an RDD from a text file and prints its contents to the console using RDD.collect().

PrintRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PrintRDD {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read text file to RDD
        JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");

        // collect RDD to the driver for printing
        for (String line : lines.collect()) {
            System.out.println("* " + line);
        }
    }
}

file1.txt

Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

Output

18/02/10 16:31:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from
18/02/10 16:31:33 INFO DAGScheduler: ResultStage 0 (collect at PrintRDD.java:18) finished in 0.513
18/02/10 16:31:33 INFO DAGScheduler: Job 0 finished: collect at PrintRDD.java:18, took 0.726936 s
* Welcome to TutorialKart
* Learn Apache Spark
* Learn to work with RDD
18/02/10 16:31:33 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 16:31:33 INFO SparkUI: Stopped Spark web UI at
18/02/10 16:31:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

RDD.collect() – Print RDD – Python Example

In the following example, we will write a Python program that loads an RDD from a text file and prints its contents to the console using RDD.collect().

print-rdd.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Print Contents of RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text file to RDD
    rdd = sc.textFile("data/rdd/input/file1.txt")

    # collect the RDD to a list
    list_elements = rdd.collect()

    # print the list
    for element in list_elements:
        print(element)

Run this Python program from terminal/command-prompt as shown below.

$ spark-submit print-rdd.py


Output

18/02/10 16:37:05 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/readToRD
18/02/10 16:37:05 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/readToR
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
18/02/10 16:37:05 INFO SparkContext: Invoking stop() from shutdown hook

RDD.foreach() – Print RDD – Java Example

In the following example, we will write a Java program that loads an RDD from a text file and prints its contents to the console using RDD.foreach().

PrintRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class PrintRDD {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read text file to RDD
        JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");

        // print each element with foreach
        lines.foreach(new VoidFunction<String>() {
            public void call(String line) {
                System.out.println("* " + line);
            }
        });
    }
}

RDD.foreach() – Print RDD – Python Example

In the following example, we will write a Python program that loads an RDD from a text file and prints its contents to the console using RDD.foreach().

print-rdd.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Print Contents of RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text file to RDD
    rdd = sc.textFile("data/rdd/input/file1.txt")

    def f(x):
        print(x)

    # apply f(x) for each element of rdd
    rdd.foreach(f)

Conclusion

In this Spark tutorial – Print Contents of RDD – we have learnt to print the elements of an RDD using the collect and foreach actions, with the help of Java and Python examples.
