Spark - Read multiple text files to single RDD - Java & Python Examples
Read multiple text files to single RDD
To read multiple text files into a single RDD in Spark, use the SparkContext.textFile() method.
In this tutorial, we shall look into examples addressing different scenarios of reading multiple text files to single RDD.
Read multiple text files to single RDD [Java Example] [Python Example]
Read all text files in a directory to single RDD [Java Example] [Python Example]
Read all text files in multiple directories to single RDD [Java Example] [Python Example]
Read all text files matching a pattern to single RDD [Java Example] [Python Example]
Read Multiple Text Files to Single RDD
In this example, we have three text files to read. We pass the paths of these three files as comma-separated values in a single string. Then, using the textFile() method, we read the content of all three text files into a single RDD.
First we shall write this using Java.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide text file paths to be read to RDD, separated by comma
        String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
Note: Take care when providing the input file paths. There should be no spaces around the commas separating the paths.
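If you build the path string in code, joining the paths with a bare comma avoids the stray-space pitfall. A minimal plain-Python sketch (the paths are the ones used in this tutorial's examples):

```python
# File paths to be read into a single RDD, as in the example above.
paths = [
    "data/rdd/input/file1.txt",
    "data/rdd/input/file2.txt",
    "data/rdd/input/file3.txt",
]

# Join with a bare comma: no surrounding spaces, so Spark resolves
# each entry as an exact path when the string is passed to sc.textFile().
files = ",".join(paths)
print(files)
```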
file1.txt
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
file2.txt
This is File 2
Learn to read multiple text files to a single RDD
file3.txt
This is File 3
Learn to read multiple text files to a single RDD
Output
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
This is File 2
Learn to read multiple text files to a single RDD
This is File 3
Learn to read multiple text files to a single RDD
(Spark's INFO log lines are omitted here for brevity.)
Now, we shall use Python and read multiple text files into an RDD using the textFile() method.
readToRdd.py
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files to RDD, paths separated by comma
    lines = sc.textFile("data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
Run this Spark Application using spark-submit by executing the following command.
$ spark-submit readToRdd.py
Read all text files in a directory to single RDD
Now, we shall write a Spark application that reads all the text files in a given directory path into a single RDD.
Following is a Spark application, written in Java, that reads the content of all text files in a directory into an RDD.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide path to directory containing text files
        String files = "data/rdd/input";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
In the above example, we provide the directory path through the variable files. All the text files inside the given directory, data/rdd/input, are read into the lines RDD.
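For intuition, reading a directory with textFile() behaves much like flattening the lines of every file directly under that directory. The following is only a rough plain-Python equivalent for local files (it ignores Spark's partitioning, and the helper name lines_from_directory is our own, not a Spark API):

```python
import os

def lines_from_directory(path):
    """Collect the lines of every regular file directly under `path`,
    roughly what sc.textFile(path).collect() yields for a local directory
    (ignoring partitioning and Spark's ordering of lines across files)."""
    all_lines = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full) as f:
                all_lines.extend(line.rstrip("\n") for line in f)
    return all_lines
```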
Now, we shall write a Spark application that does the same job, reading all text files in a directory into an RDD, using Python.
readToRdd.py
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files present in the directory to RDD
    lines = sc.textFile("data/rdd/input")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
Run the above Python Spark Application, by executing the following command in a console.
$ spark-submit readToRdd.py
Read all text files in multiple directories to single RDD
This builds on the previous scenarios. We have seen how to read multiple text files, or all text files in a directory, into an RDD. Now, we shall read all the text files from not one, but multiple directories.
First, we shall write a Java application that reads all text files in multiple directories.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide paths to directories containing text files, separated by comma
        String directories = "data/rdd/input,data/rdd/anotherFolder";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(directories);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
All the text files in both directories, provided in the variable directories, are read into the RDD. Similarly, you may provide more than two directories.
Let us write the same program in Python.
readToRdd.py
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read text files present in both directories to RDD
    lines = sc.textFile("data/rdd/input,data/rdd/anotherFolder")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
You may submit this Python application to Spark, by running the following command.
$ spark-submit readToRdd.py
Read all text files, matching a pattern, to single RDD
This scenario uses glob-style wildcards to match file names. All files whose names match the given pattern are read into the RDD.
Let us write a Java application that reads only the files matching a given pattern.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide file name patterns, separated by comma
        String files = "data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
file[0-3].txt matches file0.txt, file1.txt, file2.txt and file3.txt; whichever of these files are present are read into the RDD. file* matches any file whose name starts with the string file, for example file-hello.txt, file2.txt, filething.txt, etc.
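You can check this wildcard behaviour locally with Python's fnmatch module, whose [seq] and * wildcards behave like the ones above. (Spark's actual path globbing is handled by the underlying Hadoop filesystem API; this sketch only illustrates which names match.)

```python
from fnmatch import fnmatch

# Which of these names does each pattern pick up?
for name in ["file0.txt", "file2.txt", "file5.txt", "file-hello.txt"]:
    print(name,
          fnmatch(name, "file[0-3].txt"),   # digit-range pattern
          fnmatch(name, "file*"))           # prefix pattern
```

file5.txt falls outside the [0-3] range, so only the file* pattern matches it.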
Following is a Python application that reads files whose names match a given pattern into an RDD.
readToRdd.py
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read text files matching the patterns to RDD
    lines = sc.textFile("data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
Conclusion
In this Spark tutorial on reading multiple text files to a single RDD, we covered different scenarios of reading multiple files: explicit comma-separated paths, whole directories, multiple directories, and file name patterns.