MADlib Analytics Library Contributions
Babak Alipour, Aditya Nain, Giang Nguyen
CISE department, University of Florida
Background
• The philosophy behind MADlib is to keep data from moving between multiple runtimes while still using advanced machine learning capabilities.
• MADlib uses SQL-based algorithms and syntax, making adoption straightforward among millions of running systems.
• MADlib is open source and enjoys an active community of developers.
• MADlib is designed to run on MPP databases, namely Greenplum and HAWQ, for data parallelism.
• Many popular machine learning algorithms are supported, using C++ for per-record processing, Python for driver functions, and PL/pgSQL.
• The module anatomy allows for flexible implementations of algorithms and tuning for different database backend engines.
• We focused our contributions on adding new modules.
K-Nearest Neighbors
The k-nearest neighbors (kNN) algorithm is widely used in machine learning and data analytics. As requested in the corresponding JIRA, our implementation uses a linear search to find the k nearest neighbors. The algorithm was first developed using a combination of Python, C++ and SQL, but all parts were later moved into SQL to simplify debugging and to allow the MPP engine to partition the data efficiently. A design goal of this project was a generalizable interface, so that users can plug in different tables, distance functions and weighting functions for flexibility and applicability to a wider range of scenarios. The JIRA does not specify the interface or its details, so we settled on some assumptions through discussion and implemented accordingly.
• The input is two tables: a training set and a test set.
• For every row i in the test set, its k nearest neighbors in the training set are returned in a results table.
• A distance function must be provided as input, with the signature DOUBLE[] x DOUBLE[] -> DOUBLE.
• Input validation is performed, and appropriate error messages are produced for invalid input parameters.
• The distance functions supported by default are Manhattan, Euclidean and Minkowski (with an arbitrary choice of p), all implemented purely in PL/pgSQL.
• The longer-term vision is to add support for data structures such as k-d trees or ball trees to improve performance and to support more advanced kNN queries such as kNN-join.
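The linear-search design described above can be sketched in Python for illustration. This is not the PL/pgSQL code contributed to MADlib; the names `minkowski`, `knn_linear` and the tuple layout of the result rows are assumptions made for this example. It only mirrors the stated interface: two input sets, a pluggable distance function of the form DOUBLE[] x DOUBLE[] -> DOUBLE, basic input validation, and one result row per (test row, neighbor) pair.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
DistanceFn = Callable[[Vector, Vector], float]


def minkowski(p: float) -> DistanceFn:
    """Build a Minkowski distance of order p (hypothetical helper)."""
    if p < 1:
        raise ValueError("p must be >= 1")

    def dist(x: Vector, y: Vector) -> float:
        if len(x) != len(y):
            raise ValueError("vectors must have the same dimension")
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

    return dist


# Manhattan and Euclidean are the p = 1 and p = 2 special cases.
manhattan = minkowski(1)
euclidean = minkowski(2)


def knn_linear(
    train: List[Vector],
    test: List[Vector],
    k: int,
    distance: DistanceFn = euclidean,
) -> List[Tuple[int, int, float]]:
    """Return (test_id, train_id, distance) rows, k rows per test point.

    Every test point is compared against every training point
    (linear search), and the k smallest distances are kept.
    """
    if not train:
        raise ValueError("training set is empty")
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the training set size")

    results = []
    for test_id, point in enumerate(test):
        scored = (
            (distance(point, row), train_id)
            for train_id, row in enumerate(train)
        )
        # Ties in distance are broken by the smaller training row id.
        for d, train_id in heapq.nsmallest(k, scored):
            results.append((test_id, train_id, d))
    return results
```

Passing `distance=manhattan` or `distance=minkowski(3)` swaps the metric without touching the search loop, which is the kind of flexibility the interface is meant to give.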
Gaussian Mixture Models
Merge Step in MPP database.
An example of a blog enhanced by NLP named-entity extraction.
Web Application with MADlib-enabled Database Backend
Motivation
• Use MADlib NLP to improve web application features such as information extraction, summarization, recommendation systems and sentiment analysis.
• Django is an efficient Python web application development framework that supports PostgreSQL.
• A Django model represents the data of the application; each model is mapped automatically to a single table in the database, and each attribute of the model represents a database field.
• SQL queries can be executed directly from Django with connection.cursor().
Idea behind our web blog application
• Each user's post (text) is stored in a table in the Postgres database.
• We can call the MADlib NLP (CRF) function on this table to extract noun entities.
• For each noun entity, check whether it is in Wikipedia and, if so, turn it into a Wiki link.
• This can be easily accomplished by sending an HTTP request and checking whether the status code is 200.
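The Wikipedia check described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the application's actual code: the URL pattern in `WIKI_BASE`, the function names, and the use of a HEAD request are all choices made for this example. The status lookup is passed in as a parameter so the linking logic can be exercised without network access.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Assumed URL pattern for English Wikipedia articles.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wiki_url(entity: str) -> str:
    """Build a candidate article URL for an extracted entity."""
    title = entity.strip().replace(" ", "_")
    return WIKI_BASE + urllib.parse.quote(title)


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if it cannot be reached."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "blog-linker-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 404 arrive as exceptions carrying the code.
        return err.code
    except urllib.error.URLError:
        # Network-level failure: treat as "no article found".
        return 0


def link_entity(
    entity: str, get_status: Callable[[str], int] = http_status
) -> Optional[str]:
    """Return the Wikipedia URL if the page answers with 200, else None."""
    url = wiki_url(entity)
    return url if get_status(url) == 200 else None
```

In a blog view, each entity returned by the extraction step would be passed through `link_entity`, and only entities that yield a URL would be rendered as links.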
CONCLUSIONS
• An exploration of MADlib module anatomy was performed and several introductory documents were produced on getting started with MADlib, its installation on different platforms and its application in some scenarios.
• Per JIRA {MADLIB-927}, an implementation of the k-Nearest Neighbors algorithm was carried out and is now being tested.
• Per JIRA {MADLIB-410}, an implementation of Gaussian Mixture Models was integrated into MADlib and is now being tested.
• An application of advanced features enabled by a MADlib-enabled database backend was explored.
REFERENCES
1. J. M. Hellerstein, F. Schoppmann, D. Z. Wang, E. Fratkin, and C. Welton, "The MADlib Analytics Library or MAD Skills, the SQL," Proc. VLDB Endow., pp. 1700-1711, 2012.
2. MADlib methods, retrieved from:
Supported by Dr Daisy Zhe Wang as part of the Projects in Data Science Course