MADlib Analytics Library Contributions

Babak Alipour, Aditya Nain, Giang Nguyen

CISE department, University of Florida

Background

• The philosophy behind MADlib is to prevent data from moving between multiple runtimes while using advanced machine learning capabilities.

• MADlib uses SQL-based algorithms and syntax, making it very straightforward to adopt among millions of running systems.

• MADlib is open source and enjoys an active community of developers.

• MADlib has been designed to run on MPP databases, namely Greenplum and HAWQ, for data parallelism.

• Using C++ for per-record processing, Python for driver functions and PL/pgSQL, many popular machine learning algorithms are supported.

• The module anatomy allows for flexible implementations of algorithms and tuning for different database backend engines.

• We focused our contributions on adding new modules.

K-Nearest Neighbors

K-Nearest Neighbors (kNN) is a popular algorithm in machine learning and data analytics. As requested in the corresponding JIRA, our implementation uses a linear search to find the k nearest neighbors. The algorithm was first developed using a combination of Python, C++ and SQL, but all parts were later moved into SQL to simplify debugging and to let the MPP engine partition the data efficiently. A design goal of this project was a generalizable interface, so that users can plug in different tables, distance functions and weighting functions for flexibility and applicability to a wider range of scenarios. The JIRA does not specify the interface or its details, so we made some assumptions through discussions and implemented the module based on those.

• The input is two tables: one is the training set and the other is the test set.

• For every row i in the test set, its k nearest neighbors in the training set are returned in a table of results.

• A distance function should be provided as input, with the signature DOUBLE[] x DOUBLE[] -> DOUBLE.

• Input validation is performed, and appropriate error messages are produced for invalid input parameters.

• The default supported functions are Manhattan, Euclidean and Minkowski distance with an arbitrary choice of p, all of which are implemented purely in PL/pgSQL.

• The longer-term vision is to add support for data structures such as KD-tree or Ball-tree to improve performance and to support more advanced kNN queries such as kNN-join.
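The linear-search approach with a pluggable distance function can be sketched outside the database as follows. This is only an illustrative Python sketch, not the module's PL/pgSQL code: the names `minkowski`, `manhattan`, `euclidean` and `knn_linear` are hypothetical, and in-memory lists of vectors stand in for the training and test tables.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Mirrors the DOUBLE[] x DOUBLE[] -> DOUBLE signature described above.
Distance = Callable[[Vector, Vector], float]


def minkowski(p: float) -> Distance:
    """Build a Minkowski distance function of order p (p >= 1)."""
    if p < 1:
        raise ValueError("p must be >= 1")

    def dist(a: Vector, b: Vector) -> float:
        if len(a) != len(b):
            raise ValueError("vectors must have the same length")
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

    return dist


# Manhattan and Euclidean are the p = 1 and p = 2 special cases.
manhattan = minkowski(1)
euclidean = minkowski(2)


def knn_linear(
    train: List[Vector],
    test: List[Vector],
    k: int,
    distance: Distance = euclidean,
) -> List[List[Tuple[int, float]]]:
    """For each test row, scan every training row and keep the k closest.

    Returns one list per test row of (training_row_index, distance) pairs,
    ordered from nearest to farthest.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of training rows")
    results = []
    for query in test:
        scored = ((distance(query, row), idx) for idx, row in enumerate(train))
        nearest = heapq.nsmallest(k, scored)
        results.append([(idx, d) for d, idx in nearest])
    return results
```

Each query costs one pass over the training set, which is the cost that tree-based structures such as a KD-tree are meant to reduce.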

Gaussian Mixture Models

Merge Step in MPP database.

An example of a blog enhanced by NLP named-entity extraction.

Web Application with MADlib-enabled Database Backend

Motivation

• Use MADlib NLP to improve web application features such as information extraction, summarization, recommendation, sentiment analysis, etc.

• Django is an efficient Python web application development framework that supports PostgreSQL.

• A Django model represents the data of the application; each model is mapped to a single table in the database automatically, and each attribute of the model represents a database field.

• SQL queries can be executed directly from Django with connection.cursor().
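The cursor-based pattern for running raw SQL can be sketched as below. To keep the example runnable without a Django project, an in-memory `sqlite3` connection stands in for the database; in a Django view one would instead pass `django.db.connection`, which exposes the same `cursor()`, `execute()` and `fetchall()` calls. The table `blog_post`, its columns, and the helper names `fetch_posts` and `demo` are assumptions made for illustration, and the MADlib call itself is not shown.

```python
import sqlite3


def fetch_posts(connection):
    """Run a raw SQL query through a DB-API style connection.

    `connection` is anything with a cursor() method, for example
    django.db.connection inside a Django application.
    """
    cursor = connection.cursor()
    try:
        # Hypothetical table holding one row per user post.
        cursor.execute("SELECT id, body FROM blog_post ORDER BY id")
        return cursor.fetchall()
    finally:
        cursor.close()


def demo():
    """Build a throwaway in-memory table and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blog_post (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO blog_post (body) VALUES (?)",
        [("First post",), ("Second post",)],
    )
    rows = fetch_posts(conn)
    conn.close()
    return rows
```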

Idea behind our web blog application

• Each user's post (text) is stored in a table in the Postgres database.

• We can call the MADlib NLP (CRF) function on this table to extract noun entities.

• For each noun entity, check whether it is in Wikipedia and, if so, turn it into a Wiki link.

• This can be accomplished easily by sending an HTTP request and checking whether the status code is 200.
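The status-code check can be sketched with the Python standard library as follows. This is a minimal illustration under stated assumptions rather than the application's code: the helper names `wiki_url`, `http_status` and `link_entity` are hypothetical, the English Wikipedia URL pattern is assumed, and the status lookup is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Assumed URL pattern for English Wikipedia article pages.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wiki_url(entity: str) -> str:
    """Map an extracted entity string to a candidate Wikipedia URL."""
    title = entity.strip().replace(" ", "_")
    return WIKI_BASE + urllib.parse.quote(title)


def http_status(url: str, timeout: float = 5.0) -> int:
    """Send a HEAD request and return the HTTP status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "blog-linker-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their code.
        return err.code


def link_entity(
    entity: str, get_status: Callable[[str], int] = http_status
) -> Optional[str]:
    """Return the Wikipedia URL if the page exists (status 200), else None."""
    url = wiki_url(entity)
    return url if get_status(url) == 200 else None
```

Connection failures (`urllib.error.URLError`) are deliberately left to propagate here; an application would likely treat them as "no link".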

CONCLUSIONS

• An exploration of the MADlib module anatomy was performed, and several introductory documents were produced on getting started with MADlib, its installation on different platforms and its application in some scenarios.

• Per JIRA {MADLIB-927}, an implementation of the k-Nearest Neighbors algorithm was carried out and is now being tested.

• Per JIRA {MADLIB-410}, an implementation of Gaussian Mixture Models was integrated into MADlib and is now being tested.

• An application of the advanced features enabled by a MADlib-enabled database backend was explored.

Supported by Dr Daisy Zhe Wang as part of the Projects in Data Science Course
