Text Mining Using Hadoop


Adam Kuns &

Steve Jordan

Project Overview

For our project we will use Hadoop components to perform text mining on the State of the Union Addresses provided at the URL:

Components that will be used:

- Bash shell to retrieve HTML files and store them in the Hadoop Distributed File System (HDFS)
- PySpark to preprocess the data in Hadoop and create a structured data format
- Hue to perform initial querying and browse HDFS
- Tableau to create data visualizations
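The document names the Bash shell for the retrieval step; as a rough equivalent sketch in Python, assuming the `hdfs` command-line client is on the PATH (the URL and HDFS paths below are placeholders, not the project's actual ones):

```python
import subprocess
import urllib.request

def hdfs_put_command(local_path, hdfs_dir):
    """Build the HDFS CLI command that copies a local file into HDFS.
    -f overwrites the destination if it already exists."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]

def fetch_and_store(url, local_path, hdfs_dir):
    """Download one HTML page, then push it into HDFS via the CLI."""
    urllib.request.urlretrieve(url, local_path)
    subprocess.run(hdfs_put_command(local_path, hdfs_dir), check=True)

# Show the command that would be run (placeholder paths):
cmd = hdfs_put_command("1790.html", "/user/project/sotu")
print(" ".join(cmd))  # → hdfs dfs -put -f 1790.html /user/project/sotu
```

In practice this would loop over the list of address URLs, calling `fetch_and_store` once per page.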

Map-Reduce Overview

Originally a proprietary Google technology

Framework for processing large datasets using a large number of computers

1. "Map" step
   a. Filtering and sorting
   b. Each node writes data to temporary storage

2. "Shuffle" step
   a. Nodes redistribute data based on the output keys from the "map()" step

3. "Reduce" step
   a. Marshalling and summarizing data
   b. Nodes process output data, per key, in parallel
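The three steps above can be sketched as a toy, in-memory word count — the classic MapReduce example. This is illustrative only; a real job runs distributed across data nodes rather than in a single process:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: filter/tokenize each document and emit (word, 1) pairs."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by their output key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: summarize the values for each key, in this job a sum."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the state of the union", "the union endures"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # → 3
```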

Apache Hadoop Framework

Hadoop is an open-source software framework for distributed storage and distributed processing. The Hadoop core consists of two parts:

1. Storage part - Hadoop Distributed File System (HDFS)

a. stores large files (gigabytes to terabytes) across multiple machines
b. replicates data to achieve reliability

2. Processing part - MapReduce Engine

a. JobTracker - accepts jobs submitted by client applications

Hive Overview

Hive is a Data Warehouse technology that provides an SQL interface (called HiveQL) for end-user analysts to query data within HDFS, as opposed to using a language like Java to write Map-Reduce tasks.

Hive compiles the HiveQL statements into Map-Reduce tasks to run on Hadoop, and then returns the query results.

This provides analysts with a familiar query language and allows them to make immediate contributions with little retraining.
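As a rough sketch of the idea (the table and column names below are hypothetical, not from the project), the HiveQL string is what an analyst would write, and Hive compiles it into map and reduce phases behind the scenes. The plain-Python aggregation underneath computes the same result by hand:

```python
from collections import Counter

# Hypothetical HiveQL over a table of tokenized address text:
hiveql = """
SELECT word, COUNT(*) AS freq
FROM sotu_words
GROUP BY word
"""

# What that query boils down to, expressed directly in Python.
# `rows` stands in for the contents of the sotu_words table.
rows = ["union", "state", "union", "freedom"]
freq = Counter(rows)
print(freq["union"])  # → 2
```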

Cloudera Impala Overview

Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.

Provides, on average, faster query processing than equivalent Hive queries.

This is because Impala:

- Runs daemon services on the data nodes, avoiding startup overhead
- Does not require data to be moved or transformed, and does not perform MapReduce
- Is ideal for data scientists who need to retrieve results quickly

Impala Performance



Hue Overview

Hue is an open-source web interface that supports the Hadoop ecosystem and its components. It provides a graphical user interface for end users to perform HDFS actions through the File Browser, run queries using the Hive and Impala Query Editors, and monitor jobs through the Job Browser. We will be using it to explore HDFS and to execute queries.
