Text Mining Using Hadoop
Adam Kuns &
Steve Jordan
Project Overview
For our project we will use Hadoop components to perform text mining on the State of the Union Addresses provided at the URL:
Components that will be used:
1. Bash shell to retrieve HTML files and store them in the Hadoop Distributed File System (HDFS)
2. PySpark to preprocess the data in Hadoop and create a structured data format
3. Hue to perform initial querying and browse HDFS
4. Tableau to create data visualizations
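As a rough sketch of the preprocessing step (the class and sample page below are illustrative assumptions, not the project's actual code), Python's standard-library html.parser can strip the markup from a downloaded address before the plain text is written to HDFS:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, ignoring all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text nodes
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

# Stand-in for one downloaded State of the Union page
sample = "<html><body><h1>State of the Union</h1><p>Fellow citizens...</p></body></html>"
parser = TextExtractor()
parser.feed(sample)
print(parser.text())  # → State of the Union Fellow citizens...
```

In the real pipeline the cleaned text would then be pushed to HDFS (e.g. via `hdfs dfs -put`) rather than printed.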
Map-Reduce Overview
Originally Proprietary Google Technology
Framework for processing large datasets using a large number of computers
1. "Map" step a. Filtering and sorting b. Each node writes data to temporary storage
2. "Shuffle" step a. Nodes redistribute data based on the output keys from "map()" step
3. "Reduce" step procedure - marshalling and summarizing data a. Nodes process output data, per key, in parallel
Apache Hadoop Framework
Hadoop is an open-source software framework for distributed storage and distributed processing. The Hadoop core consists of two parts:
1. Storage part - Hadoop Distributed File System (HDFS)
a. Stores large files (gigabytes to terabytes) across multiple machines
b. Replicates data to achieve reliability
2. Processing part - MapReduce Engine
a. JobTracker - receives jobs submitted by client applications
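HDFS's two storage ideas, fixed-size blocks and replication, can be modeled with a small toy (block size and node names here are made up for illustration; real HDFS defaults to 128 MB blocks and a replication factor of 3):

```python
def split_into_blocks(data, block_size):
    """Split a file's bytes into fixed-size blocks, as HDFS does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Real HDFS placement is rack-aware; round-robin is just a simple stand-in.
    """
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 10                         # a tiny "file"
blocks = split_into_blocks(data, 4)      # 3 blocks: 4 + 4 + 2 bytes
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(blocks, nodes)
print(placement)
```

Losing any single node still leaves two copies of every block, which is the reliability property the slide describes.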
Hive Overview
Hive is a data warehouse technology, developed at Facebook, that provides an SQL interface (called HiveQL) for end-user analysts to query data within HDFS, as opposed to writing MapReduce tasks in a language like Java.
Hive compiles the HiveQL statements into MapReduce jobs to run on Hadoop, and then returns the query results.
This gives analysts a familiar query language and allows them to contribute immediately with little retraining.
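For illustration, a HiveQL word-frequency query over a hypothetical `addresses` table (the table and column names are assumptions, not from the project) replaces what would otherwise be a hand-written MapReduce job:

```sql
-- Hypothetical table: addresses(year INT, word STRING)
SELECT word, COUNT(*) AS freq
FROM addresses
GROUP BY word
ORDER BY freq DESC
LIMIT 20;
```

Hive turns the GROUP BY into the map/shuffle phases and the COUNT into the reduce phase automatically.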
Cloudera Impala Overview
Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
Provides, on average, faster query processing than equivalent Hive queries. This is because Impala:
1. Runs daemon services on the data nodes, avoiding job-startup overhead
2. Does not require data to be moved or transformed, and does not use MapReduce
3. Is ideal for data scientists who need to retrieve results quickly
Impala Performance
Hue Overview
Hue is an open-source web interface that supports the Hadoop ecosystem and its components. It provides a graphical user interface for end users to perform HDFS actions through the File Browser, run queries using the Hive and Impala query editors, and monitor jobs through the Job Browser. We will be using it to explore HDFS and to execute queries.