Terpconnect.umd.edu



PyData 2016 Tutorial

A Data-Driven Dialogue: From Filling Potholes to Disrupting the Cycle of Incarceration - Kelly Jin
Hard problems, tough problems, wicked problems
11.7 million Americans in 3,100 local jails cost us -
A jail: 96 acres, high traffic
Three largest jails: LA (19K), (9.5K), (11K)
Mental health 64%, substance abuse disorder 68%, chronic ..
Data-Driven Justice: 100+ jurisdictions, 91M population
Your talent to operationalize solutions

Keynote: How Open Data Science Opens the World of Innovation (Robert Cohn, Peter Wang)
Anaconda - from Intel
Data revolution
Classroom success: which types of students are most likely to drop out of class
Job search success
Data set: instructor characteristics, attendance date, student end-of-class survey data, certification application, outcome data
Data is everywhere.
Amateurs think about tactics - functions
Professionals think about logistics - bandwidth - Peter Wang
Everything around data is changing
The everything revolution
Every aspect of how we ingest, store, manage and compute on business data will be disrupted.
FPGA: field-programmable gate arrays, semiconductor devices
3D XPoint ("cross point"): breaks the memory/storage barrier
Memory: SRAM, DRAM, 3D XPoint, NAND SSD, HDD
Low latency of Intel Optane - SFrames in GraphLab
Parallelism, performance per watt - deep learning
Threading, communication, math, image processing, encryption - FPGA, ...
Cluster performance with pyDAAL: mpi4py and PySpark
Optimized Python packages in Anaconda
Cambrian era
Continuum Analytics
Era of data literacy
Data exploration and analysis are a new kind of literacy
Language is a human instinct and a natural path to insight; we see this in our interaction with Python/PyData users, whose passion stems from this expressiveness and agility
An analytical language is thoughtware, not software: it extends the reach of the mind
Analyst paid for insight; analyst/data developer paid for code that produces insight; programmer paid for code
Data science is a team sport
Biz analyst, data scientist, developer, data engineer; databases/data warehouse, ETL, DevOps
4 years, 20 conferences
Python is technology for the sake of empowering people (not for technology's sake)
A challenge, because established businesses don't know how to manage that

Bayesian Network Modeling using R and Python - Pragyansmita Nayak
Applications: 1) missing flight 2) predicting poll outcomes 3) relations between genes, environment, and disease propagation 4) short-term solar flare level prediction
Python: NumPy, SciPy, BayesPy, Bayes Blocks, Stan, penb…
R -
Based on evidence (i.e. data), estimate a degree of belief for the possible outcomes
Probability theory and statistics
P(A|B) = P(B|A) * P(A) / P(B)
Evidence (prior) representation: belief; "likelihood" used for prediction; posterior: potential outcomes
Specifying a prior probability can be tricky
The subjective element, also known as the reference class problem
Predicting the future based on past observations lends itself naturally to applications requiring predictive analytics
Belief is updated with new evidence
Contrast "Bayesian" with the alternative, frequentist
Frequentist: probability of an event = frequency over time, proportion of outcomes
Bayesian: prior, posterior, and likelihood
P(A): prior; P(A|B): posterior; P(B|A): likelihood (support of - )
Example: photometric redshift estimation
Bayesian belief network (BBN)
Graphical models that represent and approximate the acyclic relationships between the different subsets of variables; the interconnections represent the dependencies among the set of variables
Naive Bayes: a special case of a Bayes network
The classification node is the parent node of all other nodes.
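Bayes' rule as written in these notes, P(A|B) = P(B|A) * P(A) / P(B), can be checked with a small numeric sketch. The numbers below (a 1% prior, a 90% likelihood, and a 5% false-alarm rate) are made up for illustration and do not come from the talk; the function name `posterior` is also hypothetical. Running the update twice illustrates the note "belief is updated with new evidence": the first posterior becomes the next prior.

```python
def posterior(prior, likelihood, false_alarm):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A) using Bayes' rule."""
    # P(B) by the law of total probability over A and not-A.
    evidence = likelihood * prior + false_alarm * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative numbers only (assumptions, not from the talk).
first = posterior(prior=0.01, likelihood=0.9, false_alarm=0.05)

# Belief is updated with new evidence: reuse the posterior as the new prior.
second = posterior(prior=first, likelihood=0.9, false_alarm=0.05)

print(first, second)
```

A weak prior keeps the first posterior low even with a strong likelihood, which is one reason specifying the prior can be tricky.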
Observations independent
Numeric data
Discretization methods: numeric data to categorical data
Model evaluation and use
Characteristics of the network
Combination of multiple generated networks
Intelligent aid in fixing missing data
Prediction area
Naive Bayes, Markov-blanket-based, 1- or k-dependence Bayesian classification, semi-naive Bayes, selective naive Bayes, Bayesian multinets
Naive Bayes is the most commonly used: simple
Algorithm categories
Constraint-based (more efficient than scoring if data size is large), scoring-based, hybrid methods
scikit-learn, NumPy, matplotlib
from sklearn.naive_bayes import GaussianNB
X = np.array(...)
Y = np.array(...)
model = GaussianNB()
model.fit(X, Y)
Predict
Mixture.GussianNB()
Python: BayesPy - Python 3
Only variational Bayesian inference for the conjugate-exponential family
Python: Bayes Blocks
R: CRAN Task View for Bayesian: packages for general model fitting (general packages); packages for specific models or methods (for specific cases)
R bnlearn: hill-climbing algorithm (a scoring algorithm), modeling phase
Bayesian network
Stan: both Python and R
Sharing experience
Python 2 vs 3
Quick experiment in R, implement in R - depends on use case
R Shiny application for ease of experiments
Search GitHub first
Probability modeling: frequency
Bayesian modeling: relationship of attributes
Data modeling - probability modeling - Bayesian modeling
Optimization approach: NP-hard problems, constraints

Fuzzy Search Algorithms: How and When to Use Them
Github/ 216/pydata_dc
Distance between words
Soundex
Levenshtein
n-gram
NLTK/word2vec
Soundex coding: number assigned according to sounds
Package: Jellyfish
j.soundex("")
j.soundex("")
Soundex with Postgres
CREATE EXTENSION fuzzystrmatch;
from sqlalchemy.sql import text
De--
Levenshtein
Transposition of adjacent characters
n-gram, trigram: grouping
Scoring similarity: Jaccard similarity
Intersect, union
Return intersect/union
Why do you use cosine similarity?
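The Levenshtein distance and the trigram Jaccard score (intersect/union) noted above can be sketched in plain Python. This is a minimal illustration, not the code from the talk: the function names are hypothetical, the trigram function does no padding (unlike the Postgres trigram extension), and the Levenshtein version counts only insertions, deletions, and substitutions, so an adjacent transposition costs 2.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def trigrams(s):
    """Set of lowercase character trigrams, without padding (a simplification)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a, b):
    """Jaccard similarity of the trigram sets: |intersection| / |union|."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 0.0  # both strings too short to form a trigram
    return len(ta & tb) / len(union)


print(levenshtein("kitten", "sitting"))
print(jaccard("potholes", "pothole"))
```

Edit distance is sensitive to single-character typos, while the set-based score tolerates reordered or partially overlapping strings, which is roughly the trade-off behind choosing between the two.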
Vector difference
Trigram in Python
from sqlalchemy.sql import text
Larger data set
GiST and GIN indexes for trigrams
CREATE INDEX trgm_idx ON table_name USING gist (t gist_trgm_ops);
Or
Lucene and Elasticsearch: it's fast
Other similarity metrics: NLTK WordNet; word2vec uses cosine distance
from nltk.corpus import wordnet
wordnet.synsets("pizza")
word1[].similarity(word2) ?
word2vec
Data: zip
import word2vec
model = word2vec.load('text/text.bin')
model.cosine(word)
model.generate_response(indexes, metrics).tolist()
Questions - evaluation: supervised learning modeling, some sort of labeled data set; calibrating our data; testing data; different levels of tolerance (level of confidence)
Address mapping: abbreviated
Data reduction
The kitchen sink approach

Keynote: Become a Data Superhero: How Data Can Change the World - Elizabeth
Byte Back's mission: improve economic opportunity; founded in 1997
"Fight back" inspired "Byte Back"
Need for advanced digital skills to acquire living-wage jobs
Training in tech, career services, referrals to other services
67% government benefits, 27% homeless, 74% unemployed
How do nonprofits use data today? 90% collection; 5% use data in every decision they make
Hackathons for good are happening all the time
How does a data drive work?
2r hours collaborative effort, final report presentation
Where we locate classrooms: optimal classroom location, enrollment prediction, underserved index
Recommendations
Your turn to be a superhero
Lead a data drive at your organization, volunteer, mentor, donate

Agent-based models
AMS - computMAS
Game theory
Cellular

Institute Tutte
Cluster: find groups of data that are all similar; easy in theory and hard in practice
Partition data, summarize data, explore data
K-means: minimize… maximize… based on centroids
Gaussian spherical assumption: not a good assumption
AffinityPropagation: results are sensitive to parameters
MeanShift: mode, region of high density
SpectralClustering: k nearest neighbors -
Birch: fast, low memory consumption
AgglomerativeClustering: distances between groups of points; choosing the cut of the tree; choosing # of clusters is hard
DBSCAN: density-based clusters
What makes a clustering good? How do you tell?
Goodness-of-fit measure: depends on what algorithm?
A possible clustering
What do I mean by a cluster?
Clusters don't need to be balls; not every point is in a cluster; real data has noise
Clusters are dense areas separated by less dense areas
Density based …
We don't know the pdf
Which level sets to choose
Computational complexity
What can we do?
Locally approximate the density
The connected components of level sets form a tree
PDF - increase radius - grouping
Approximate the level set tree and use excess of mass to select clusters
Dedensity cluster tree
Y distance x ?
Simplified cluster tree
Lambda value, log(number of points)
Clusters found by HDBSCAN
Campello 2013, 15
What about performance?
We don't want to run connected components for every possible
Minimum spanning trees! - All information
The weighted graph is complete!
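One way to read the note "Minimum spanning trees! - All information": a single minimum spanning tree of the complete distance graph is enough to recover the connected components at every distance threshold, so connected components need not be rerun per threshold. The sketch below is an illustration under simplifying assumptions, not the HDBSCAN implementation: it uses plain Euclidean distance rather than a density-adjusted distance, an O(N^2) Prim's algorithm rather than the dual-tree Boruvka variant, and hypothetical function names and toy points.

```python
import math


def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def count_components(n, edges, eps):
    """Number of connected components when only MST edges with weight <= eps are kept."""
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for w, i, j in edges:
        if w <= eps:
            root[find(i)] = find(j)
    return len({find(i) for i in range(n)})


# Toy data (an assumption for illustration): two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
mst = prim_mst(pts)
for eps in (0.5, 2.0, 20.0):
    print(eps, count_components(len(pts), mst, eps))
```

The tree is built once; each threshold afterwards only needs a pass over N - 1 edges.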
We can use spatial indexing to compute fewer distances
Spatial indexing is great for neighbor queries
Use Boruvka's algorithm - reduce MST
Modified dual-tree Boruvka algorithm: O(N log N)
Sklearn k-means, scipy k
Have to choose a number-of-points value for the density estimate
Can't we vary that the way we varied epsilon?
Topology to the rescue
Building HDBSCAN
Persistent homology
Multidimensional persistent homology: persistence over multiple variables; the result is not a tree
We can reinterpret HDBSCAN using sheaves instead of trees
Can be generalized to the multidimensional case; it will be based on density-based clustering
Robust, fewer parameters
Con k-mean is not 1st, 2nd s
Think hard cluster
conda install -c conda-forge hdbscan
pip install hdbscan
Q - persistent morphology ?

Hudl: supporting high school sports teams; capture and bring value to every moment of sports
Democratizing data
2006 SQL Server; 2010 MongoDB; 2014 microservice architecture, MySQL server; 2015 several MongoDB
Find how many football teams had more than 3 users watching a game more than 3 times in 3 different years
SSH, SQL, Mongo, Excel, Python, etc.
Data engineering
Where do we put the data: Amazon Redshift (SQL, fully managed on AWS, reasonably priced); alternative: Google BigQuery; do it yourself: Hive, Impala
How does it get there
ETL
MySQL - extract, transform, load - Amazon Redshift
Use a workflow manager: Luigi, Airflow, Azkaban
Dependency management, parallelism, idempotence
Think about tooling: 20 different pipelines
UI
Logging
Triggers: cron
Dependency
GitHub Luigi
Single-machine jobs: Zendesk, Salesforce, Google Sheets
Amazon EMR multi-machine jobs: database exports, Mongo processing
Luigi?
3. How people access it
Commercial options: Looker, Periscope, Tableau
Re:dash: open-source query editor + visualization; hosted version or host your own
Amazon Redshift, Redash
Data analytics
Helping employees use data to make better decisions
Access for all
Finding data isn't easy: SQL, so much data, only 3 data analysts
Identical columns have different names
Removing roadblocks
Hudl University: RDBMS and SQL courses, table familiarity using Redash, data visualization
Data dictionary: understanding relationships
Derived table
Daily_active-users
Userid, teamid, date, has watched, tagged, uploaded
Report automation
Dashboard
Slackalytics = Redash + Python + Slack
Cons: bad data, slow queries
Key takeaways: being data driven is a team sport, get the data architecture in place, make data and metrics accessible, be flexible
Jenkins: scheduling
Luigi: workflow management (Python)
Sqoop: RDBMS extraction
Spark: data transformation (Python)
AWS Lambda: event-driven processing (Python)
Redshift: data warehouse
Redash: query interface + visualization (Python); working well with security extension arcdoc
Scheduled query - Slack ?

Problem statement: build a spam filter for Twitter
Twitter is noisy
Solution
Binary classification problem
Acquire a good-quality dataset, engineer features (some very good indicators)
Select algorithm
Paradise lost
In production, the model started very well
Our data was changing fast
Nonstationary distributions: a stationary process is time independent; averages remain more or less constant
This is also called drift - distribution
Vocabulary in our dataset changing
Brand agent user: classification done by the system is wrong
Degradation over time in the prediction accuracy of the model shouldn't come as a surprise
Monitoring, anomaly detection
Recommendation
Stock market predictions
Problem statement v2.0
Handle drift in data, learn and improve using feedback, handle new vocabulary
Possible solutions: frequently retrain your model on the updated data and deploy the model
Continuous learning: model adapts to the new incoming data
Global model: deep learning model, batch training, large corpus, no short-term updates
Local model: per-brand model, fast learner, instant feedback
Drift: detect drift
Tool combining global and local: meta-classifier
Text representation
Preprocessing
How good is preprocessing? Zipf's law: able to show that frequency is inversely proportional to rank
Zipf: x = query rank, y = query frequency
Raw data: have to show Zipf fit
Preprocessing: replace mentions, hashtags, URLs, emojis, dates, numbers, currency by relevant constants; remove stop words
Text representation
Word embedding
Google's pretrained word2vec model to replace a word by its corresponding embedding (300 dimensions)
For a tweet, we average all the word embedding vectors of its constituent words
Missing word: generate a random number between -0.25 and 0.25 for each of the 300 dimensions
How to handle a word not in word2vec?
Final representation: tweet -> 300-dim vector of real numbers
Python ZENCEN?
Global model
DeepNet: deep learning CNN
Trained over a corpus of 8 million tweets
Off-the-shelf architecture gives us 86% accuracy
Local: improve with every feedback, higher retention of older concepts
Desired properties: online learner, fast learner, aggressive model update, incorporates feedback successfully
After a model update, when the same data is presented, or the last N data points are presented, it must predict the class label with higher accuracy
Building the feedback loop: ML model predicts Yp for a tweet; human gives Y; if Y is not equal to Yp, send (tweet, Y) to the system
Reinforcement learning: reward/punish; for binary classification, the MDP is too small, doesn't learn much
Mini-batch: works fine if the velocity of feedback is high
Instant feedback, tiny batch: just 1 data point can skew the model
Model a feedback point as a data point presented to the local model in an online setting
Online methods in ML
Data modeled as a stream
Online algorithms
Crammer's PA-II as local model
Online Passive-Aggressive Algorithms
Improving accuracy in local model
Aggressiveness parameter
PA-II parameter tuning
Wrong feedback: we tested the local model by feeding it wrong feedback
Glokal: ensembling global and local
Use online stacking to ensemble local and global
Handle drift: drift detection method (Gama et al. 2004)
Improves running accuracy
Personalization: the notion of spam varies from brand to brand
It's lightweight and fast, thus easy to bootstrap, deploy and scale
Cons - shifting
Instead of modeling it as classification, model it as ranking
Actionable tweets are high in ranking, spam tweets are low in ranking
Actionable vs spam = finding a cut of the ranking
Incorporating feedback
In domains where distributions are continuously evolving, handling drift is a must
Anuj Gupta @anujgupta82

Sampling - observation weighting - SMOTE sampling
1. Select minority point
2. Select neighbor
3. Create new point
Outlier detection
PCA, NN: train autoencoder
Outlier score
Label propagation: identify networks of bad actors
"Relax" labels through graph: 1%-3% of graphs (not all)
Low rank models
GLRM: reduce dimensionality for datasets with many variables
Reduce dim with generalized PCA
LDA topic modeling: variables with many levels - documents, topic propagation and bags of words with maximum separation
Modeling
Grid search
NN
Ensemble modeling
Leverage a diverse set of algorithms
Train on different data sets
Genetic algorithms
Score how similar a new autro bjherger

Algorithm performance, visual analytics
Property graph: key/value pairs; in relational databases: sets, interconnection of sets
How to model time?
RDBMS: entities, dates or times with/without time zone
Event: series of discrete timesteps
Time properties: ordinal time, timestamps, durations, timers
Time modifies traversal
What makes dynamics?
Time structures
Dynamic graph
Time aggregation
Keyphrase analysis over time
RSS, HTML
Baleen: corpus ingestion
Natural language graph analysis: data modeling
Minke: corpus processing - extract nouns, weighted TF-IDF
NLGA data wrangling
Centrality of time: degree centrality, eigenvector centrality, betweenness centrality, PageRank centrality, Katz centrality
Extract week of the year as time structure
Keyphrase dynamics
Sequences of time-ordered subgraphs
Animating dynamics
Network visualization
Visual analytics mantra
Overview first, zoom and filter, details on demand - interactive analysis
Are graphs effective for analytics? Yes
NetworkX, graph2 - speed, getty

Trigger warning: mass shootings, guns, probably some salty language
Think like a data journalist
Search - visualize - storytelling - data journalist
Semi-automatic weapons without a background check can be just a click away
Listings by major city on Craigslist for guns
Greatest number of firearms for - no better data source
Legal
Planning to scrape
If you need a scraper, you have a data problem. If it is good, it isn't hard to get ..
Data that is hard to get is usually not well structured
Other ways of liberating the data
Budget appropriately
Scraper not free?
Ethical ramifications, legal
curl -I: easy part
Testing: it is all edge cases
Infrastructure: non-trivial costs
Scraper architecture: data models, tests, atomic requests, parallelization, cloud infrastructure
DDoS
Scrapy, Mechanize
Controller scraper, index scraper, listing scraper
Parallel index scraping
csvkit: Python lib
1. Model classes - parse: encapsulate parsing with data model classes, hide calculation and complexity
from bs4 import BeautifulSoup
Metaprogramming
@property: getter/setter in nprapps/armlist-scrapper/
2. Scraper script
Earthquake
Controller: GNU Parallel
csvcut -c 1 cache/index-. | parallel
80k URLs - header in yourself
Infrastructure: Amazon EC2, $0.05 in computing costs per scrape
Don't accidentally create a denial-of-service attack

Deep learning
Andrew Ng, Coursera: standard
More data improves algorithm performance
Model scalability
Data, features, hidden layer, output
layground.
Width, height, depth: cs231n.github.io
NN architecture for image classification
Inception v1, v2, e
Requires a lot of images
Deep learning, gradient boosting
Interactive playground
NNs learn the most efficient features
What do the self-learned features look like?
1st level: feature; second level: high-level feature
Video: deep visualization tool
NN autoencoder
Input - code - output: encoder, decoder
Input and output are the same images: minimize the difference between input and output
Interactive map of English
Talk is cheap, show me demo
H2O deep learning: image reconstruction and clustering
import glob, import os, defaultdict
GPU? No need
The best of the two worlds: wide and deep models
H2O Deep Water
Local Jupyter notebook, H2O cluster, remote machine execution
TensorFlow, DMLC MXNet, Caffe
Denseflow - C++ lib
Deep learning with batteries included
Convolutional neural networks
pandas ???
Model
user/oxdata

Generalized Low Rank Model 2016
PCA
Sparse PCA
Robust PCA: more robust to outliers
Logistic PCA - loss is LASSO
Logistic regression
Non-negative
Boolean PCA
Categorical PCA
Ordinal
Matrix completion
Dataframe ..
Semi-supervised PCA
Glrm.pdf
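The low-rank idea behind the PCA variants and matrix completion listed above can be sketched as alternating least squares: approximate a table X by the outer product of a row factor u and a column factor v, fitting only the observed cells. This is a minimal sketch under stated assumptions (rank 1, quadratic loss, no regularization, every row and column has at least one observed cell); the function name and toy matrix are hypothetical, and this is not the H2O GLRM API.

```python
def rank1_complete(X, iters=200):
    """Fit X[i][j] ~ u[i] * v[j] on observed cells (None marks a missing cell)."""
    m, n = len(X), len(X[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # Fix v, solve the least-squares problem for each row factor u[i].
        for i in range(m):
            obs = [j for j in range(n) if X[i][j] is not None]
            u[i] = sum(X[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Fix u, solve the least-squares problem for each column factor v[j].
        for j in range(n):
            obs = [i for i in range(m) if X[i][j] is not None]
            v[j] = sum(X[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return u, v


# Toy table (an assumption for illustration): outer product of [1, 2, 3] and
# [1, 2, 4] with the bottom-right cell hidden.
X = [[1.0, 2.0, 4.0],
     [2.0, 4.0, 8.0],
     [3.0, 6.0, None]]
u, v = rank1_complete(X)
filled = u[2] * v[2]  # estimate for the hidden cell
print(filled)
```

Swapping the quadratic loss for a logistic, ordinal, or other loss is what turns this pattern into the other variants in the list.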