Project 3 Hannah Choi Data 8

November 23, 2016

1 Project 3 - Classification

Welcome to the third project of Data 8! You will build a classifier that guesses whether a song is hip-hop or country, using only the numbers of times words appear in the song's lyrics. By the end of the project, you should know how to:

1. Build a k-nearest-neighbors classifier.
2. Test a classifier on data.

Administrivia

Piazza While collaboration is encouraged on this and other assignments, sharing answers is never okay. In particular, posting code or other assignment answers publicly on Piazza (or elsewhere) is academic dishonesty. It will result in a reduced project grade at a minimum. If you wish to ask a question and include your code or an answer to a written question, you must make it a private post.

Partners You may complete the project with up to one partner. Partnerships are an exception to the rule against sharing answers. If you have a partner, one person in the partnership should submit your project on Gradescope and include the other partner in the submission. (Gradescope will prompt you to fill this in.)

For this project, you can partner with anyone in the class.

Due Date and Checkpoint Part of the project will be due early. Parts 1 and 2 of the project (out of 4) are due Tuesday, November 22nd at 7PM. Unlike the final submission, this early checkpoint will be graded for completion. It will be worth approximately 10% of the total project grade. Simply submit your partially-completed notebook as a PDF, as you would submit any other notebook. (See the note above on submitting with a partner.)

The entire project (parts 1, 2, 3, and 4) will be due Tuesday, November 29th at 7PM. (Again, see the note above on submitting with a partner.)

On to the project! Run the cell below to prepare the automatic tests. Passing the automatic tests does not guarantee full credit on any question. The tests are provided to help catch some common errors, but it is your responsibility to answer the questions correctly.

In [1]: # Run this cell to set up the notebook, but please don't change it.
        import numpy as np
        import math
        from datascience import *

        # These lines set up the plotting functionality and formatting.
        import matplotlib
        matplotlib.use('Agg', warn=False)
        %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('fivethirtyeight')
        import warnings
        warnings.simplefilter(action="ignore", category=FutureWarning)

        # These lines load the tests.
        from client.api.assignment import load_assignment
        tests = load_assignment('project3.ok')

=====================================================================
Assignment: Project 3 - Classification
OK, version v1.6.4
=====================================================================

2 1. The Dataset

Our dataset is a table of songs, each with a name, an artist, and a genre. We'll be trying to predict each song's genre.

To predict a song's genre, we have some attributes: the lyrics of the song, in a certain format. We have a list of approximately 5,000 words that might occur in a song. For each song, our dataset tells us how frequently each of these words occurs in that song.

Run the cell below to read the lyrics table. It may take up to a minute to load.

In [2]: # Just run this cell.
        lyrics = Table.read_table('lyrics.csv')

        # A few columns of the row for the song "In Your Eyes":
        lyrics.where("Title", are.equal_to("In Your Eyes"))\
              .select("Title", "Artist", "Genre", "i", "the", "like", "love")\
              .show()

That cell prints a few columns of the row for the song "In Your Eyes". The song contains 168 words. The word "like" appears twice: $\frac{2}{168} \approx 0.0119$ of the words in the song. Similarly, the word "love" appears 10 times: $\frac{10}{168} \approx 0.0595$ of the words.
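The arithmetic for "In Your Eyes" can be checked with a short sketch. The helper name word_proportion is hypothetical and not part of the project; the counts (2 and 10) and the song length (168) are the ones quoted for this song.

```python
def word_proportion(count, total_words):
    """Return the fraction of a song's words accounted for by one word."""
    return count / total_words

total_words = 168  # total words in "In Your Eyes", as quoted in the text

like_prop = word_proportion(2, total_words)   # "like" appears twice
love_prop = word_proportion(10, total_words)  # "love" appears 10 times

print(round(like_prop, 4))  # 0.0119
print(round(love_prop, 4))  # 0.0595
```

These rounded values should match the "like" and "love" entries that the lyrics table shows for this song.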

Our dataset doesn't contain all information about a song. For example, it doesn't include the total number of words in each song, or information about the order of words in the song, let alone the melody, instruments, or rhythm. Nonetheless, you may find that word counts alone are sufficient to build an accurate genre classifier.

All titles are unique. The row_for_title function provides fast access to the one row for each title.

In [3]: title_index = lyrics.index_by('Title')

        def row_for_title(title):
            return title_index.get(title)[0]
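To illustrate why a title index makes lookups fast, here is a minimal sketch in plain Python that mimics the idea with a dictionary mapping each title to a list of matching rows. The rows and the names index_by_title and toy_row_for_title are made up for illustration; this is not the datascience library's implementation.

```python
from collections import defaultdict

# Toy stand-in for the lyrics table: each row is (title, artist, genre).
# These rows are invented for illustration only.
toy_rows = [
    ("Song A", "Artist 1", "Country"),
    ("Song B", "Artist 2", "Hip-hop"),
]

def index_by_title(rows):
    """Map each title to the list of rows that carry that title."""
    index = defaultdict(list)
    for row in rows:
        index[row[0]].append(row)
    return dict(index)

toy_index = index_by_title(toy_rows)

def toy_row_for_title(title):
    # Titles are unique, so the list for a title holds exactly one row.
    return toy_index.get(title)[0]

print(toy_row_for_title("Song B"))
```

Building the index costs one pass over the rows; after that, each lookup is a single dictionary access rather than a scan of the whole table.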

Question 1.1 Set expected_row_sum to the number that you expect will result from summing all proportions in each row, excluding the first three columns.

In [4]: # Set row_sum to a number that's the (approximate) sum of each row of word
        #expected_row_sum = lyrics.with_columns('i',lyrics.column(3),'the',lyrics.c
        expected_row_sum = lyrics.select('i','the','like','love')
        expected_row_sum = expected_row_sum.apply(sum)
        expected_row_sum

Out[4]: array([ 0.07275541, 0.09929078, 0.10815603, ..., 0.0802139 ,
                0.09615385, 0.06938775])

You can draw the histogram below to check that the actual row sums are close to what you expect.

In [5]: # Run this cell to display a histogram of the sums of proportions in each r
        # This computation might take up to a minute; you can skip it if it's too s
        Table().with_column('sums', lyrics.drop([0, 1, 2]).apply(sum)).hist(0)

This dataset was extracted from the Million Song Dataset. Specifically, we are using the complementary datasets from musiXmatch and Last.fm.

The counts of common words in the lyrics for all of these songs are provided by the musiXmatch dataset (called a bag-of-words format). Only the top 5000 most common words are represented. For each song, we divided the number of occurrences of each word by the total number of word occurrences in the lyrics of that song.
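The normalization described above can be sketched on a toy example. The word counts below are made up, and the sketch assumes that the "total number of word occurrences" is the total over the words that are counted; under that assumption, the proportions for a song add up to approximately 1.

```python
import math

# Made-up word counts for one hypothetical song (not from the dataset).
toy_counts = {"i": 6, "the": 3, "love": 1}

# Divide each count by the total number of word occurrences in the song.
total = sum(toy_counts.values())
toy_props = {word: count / total for word, count in toy_counts.items()}

# Floating-point rounding means the sum is only approximately 1.
row_sum = sum(toy_props.values())
print(math.isclose(row_sum, 1.0))  # True
```

If some words of a real song fell outside the counted vocabulary and were still included in the total, the sum would come out below 1, which is one reason to check it against a histogram of the actual row sums.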

The Last.fm dataset contains multiple tags for each song in the Million Song Dataset. Some of the tags are genre-related, such as "pop", "rock", "classic", etc. To obtain our dataset, we first extracted songs with Last.fm tags that included the word "country", or the words "hip" and "hop". These songs were then cross-referenced with the musiXmatch dataset, and only songs with musiXmatch lyrics were placed into our dataset. Finally, inappropriate words and songs with naughty titles were removed, leaving us with 4976 words in the vocabulary and 1726 songs.
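The tag-based selection step might look roughly like the sketch below. It is only an illustration under stated assumptions: the function name genre_from_tags, the returned labels, the crude substring matching, and the choice to check "country" first are all hypothetical, since the text does not say how the extraction was implemented or how songs matching both genres were handled.

```python
def genre_from_tags(tags):
    """Return 'Country', 'Hip-hop', or None for a list of tag strings.

    Assumption: a song qualifies if its tags mention "country", or mention
    both "hip" and "hop". Matching is a simple case-insensitive substring
    test, which is cruder than a real pipeline would likely be.
    """
    text = " ".join(tags).lower()
    if "country" in text:
        return "Country"
    if "hip" in text and "hop" in text:
        return "Hip-hop"
    return None  # the song would not be selected

print(genre_from_tags(["classic country", "guitar"]))
print(genre_from_tags(["Hip-Hop", "rap"]))
print(genre_from_tags(["rock"]))
```

Songs for which the function returns None would be left out before the cross-referencing step with the lyrics data.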
