Spark - Read multiple text files to single RDD - Java & Python Examples

Read multiple text files to single RDD

To read multiple text files into a single RDD in Spark, use the SparkContext.textFile() method.

In this tutorial, we shall look into examples addressing different scenarios of reading multiple text files into a single RDD:

Read multiple text files to single RDD [Java Example] [Python Example]
Read all text files in a directory to single RDD [Java Example] [Python Example]
Read all text files in multiple directories to single RDD [Java Example] [Python Example]
Read all text files matching a pattern to single RDD [Java Example] [Python Example]
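Each of these scenarios passes a different path string to the same textFile() call. As a quick preview, a single call can even combine individual files, a directory, and a glob pattern; here is a minimal PySpark sketch, with purely hypothetical paths:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("textFile paths preview")
sc = SparkContext(conf=conf)

# a comma-separated path string may mix files, directories and glob patterns
# (all paths below are hypothetical)
lines = sc.textFile("data/a.txt,data/someFolder,data/logs/*.txt")
print(lines.count())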

Read Multiple Text Files to Single RDD

In this example, we have three text files to read. We take the paths of these three files as comma-separated values in a single string literal. Then, using the textFile() method, we read the content of all three text files into a single RDD.

First, we shall write this example in Java.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                                             .setMaster("local[2]")
                                             .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide text file paths to be read to RDD, separated by comma
        String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}

Note: Take care when providing the input file paths. The paths must be separated by commas only, with no spaces before or after the commas.
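One way to avoid stray spaces is to build the string from a list of paths rather than typing it by hand. A minimal Python sketch, assuming the same three files:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Join file paths")
sc = SparkContext(conf=conf)

# joining with "," guarantees comma-only separators, with no stray spaces
paths = [
    "data/rdd/input/file1.txt",
    "data/rdd/input/file2.txt",
    "data/rdd/input/file3.txt",
]
lines = sc.textFile(",".join(paths))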

file1.txt

This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

file2.txt

This is File 2
Learn to read multiple text files to a single RDD

file3.txt

This is File 3
Learn to read multiple text files to a single RDD

Output

18/02/10 12:13:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/10 12:13:26 INFO DAGScheduler: ResultStage 0 (collect at FileToRddExample.java:21) finished in 0.88 s
18/02/10 12:13:26 INFO DAGScheduler: Job 0 finished: collect at FileToRddExample.java:21, took 0.88 s
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
This is File 2
Learn to read multiple text files to a single RDD
This is File 3
Learn to read multiple text files to a single RDD
18/02/10 12:13:26 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 12:13:26 INFO SparkUI: Stopped Spark web UI
18/02/10 12:13:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

Now, we shall use Python and read multiple text files into a single RDD using the textFile() method.

readToRdd.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files to RDD
    lines = sc.textFile("data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)

Run this Spark Application using spark-submit by executing the following command.

$ spark-submit readToRdd.py

Note: Take care when providing the input file paths. The paths must be separated by commas only, with no spaces before or after the commas.

Read all text files in a directory to single RDD

Now, we shall write a Spark application that reads all the text files in a given directory into a single RDD.

Following is such an application written in Java: it reads the content of all text files in a directory into an RDD.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                                             .setMaster("local[2]")
                                             .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide path to directory containing text files
        String files = "data/rdd/input";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}

In the above example, we have given the directory path via the variable files.

All the text files inside the given directory path, data/rdd/input, shall be read into the lines RDD.

Now, we shall write a Spark application to do the same job of reading data from all text files in a directory into an RDD, but using the Python programming language.

readToRdd.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files present in the directory to RDD
    lines = sc.textFile("data/rdd/input")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)

Run the above Python Spark application by executing the following command in a console.

$ spark-submit readToRdd.py
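As a side note: textFile() does not tell you which file a given line came from. If you need that information, SparkContext also provides a wholeTextFiles() method, which reads each file into a (path, content) pair. A minimal PySpark sketch, assuming the same input directory:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Read files with names - Python")
sc = SparkContext(conf=conf)

# each record is a (file path, entire file content) pair
pairs = sc.wholeTextFiles("data/rdd/input")

for path, content in pairs.collect():
    print(path)
    print(content)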

Read all text files in multiple directories to single RDD

This is the next level to our previous scenarios. We have seen how to read multiple text files, or all the text files in a directory, into an RDD. Now, we shall learn how to read all the text files in not one, but multiple directories.

First, we shall write a Java application that reads all text files in multiple directories.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                                             .setMaster("local[2]")
                                             .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide paths to directories containing text files, separated by comma
        String directories = "data/rdd/input,data/rdd/anotherFolder";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(directories);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}

All the text files in both the directories provided in the variable directories shall be read into the RDD. Similarly, you may provide more than two directories.

Let us write the same program in Python.

readToRdd.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files present in the directories to RDD
    lines = sc.textFile("data/rdd/input,data/rdd/anotherFolder")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)

You may submit this Python application to Spark by running the following command.

$ spark-submit readToRdd.py
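Since the argument to textFile() is just a comma-separated string of paths, nothing stops you from mixing directories and individual files in the same call. A short PySpark sketch, assuming a file1.txt exists in anotherFolder:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Mixed paths - Python")
sc = SparkContext(conf=conf)

# one whole directory plus one specific file, read into a single RDD
lines = sc.textFile("data/rdd/input,data/rdd/anotherFolder/file1.txt")
print(lines.count())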

Read all text files, matching a pattern, to single RDD

This scenario uses glob patterns, which work somewhat like regular expressions, to match file names. All the files whose names match the given pattern will be read into the RDD.

Let us write a Java application that reads only the files matching a given pattern.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                                             .setMaster("local[2]")
                                             .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide file name patterns, separated by comma
        String files = "data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}

file[0-3].txt would match file0.txt, file1.txt, file2.txt and file3.txt; any of these files that are present would be read into the RDD. file* would match any file whose name starts with the string file, for example file-hello.txt, file2.txt, and so on.
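These patterns are Hadoop glob patterns, which textFile() supports in full: besides * and character classes like [0-3], you can use ? for a single character and {a,b} for alternatives. A small PySpark sketch with hypothetical paths:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Glob patterns - Python")
sc = SparkContext(conf=conf)

# ? matches exactly one character: file1.txt, fileA.txt, ...
single_char = sc.textFile("data/rdd/input/file?.txt")

# {file1,file2} matches either alternative
alternatives = sc.textFile("data/rdd/input/{file1,file2}.txt")

print(single_char.count(), alternatives.count())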

Following is a Python application that reads files whose names match a specific pattern into an RDD.

readToRdd.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files matching the patterns to RDD
    lines = sc.textFile("data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)

Conclusion

In this Spark tutorial - Read multiple text files to single RDD - we have covered different scenarios of reading multiple text files into a single RDD.
