Technology of Privacy
Fall 2016
Technical Data Processing Project
Version 1.1
Due: Friday, November 4, 2016, 5:00 PM
GRADED Assignment
Worth 20% of final grade

The fourth, and final, graded assignment for this class is much like the problem sets you have completed. Unlike problem sets, however, this assignment will require you to hand in documents with specific answers. Unlike most graded assignments in law school, there are "right" and "wrong" answers for much (although not quite all) of this assignment.

You will be expected to turn in two documents:
A Linux shell script (described below)
An Excel workbook.

The only prompts below that require you to submit something and that will be graded are clearly labeled, in bold, as "Problem N" (e.g. "Problem 1" or "Problem 9"). Throughout this document, additional, small questions or instructions will be presented to help you understand the assignment. If they are not labeled "Problem N", the answers to these questions should not be submitted and will not be graded.

You can (but are not obligated to) work on this assignment in groups of no more than three people each. Each person must independently produce and submit their own individual shell script and workbook.

The Dataset:

For this assignment, you will be analyzing a dataset available on our class Canvas. This dataset, called alldata.csv, contains several years' worth of tweets posted to Twitter by people located on or around the Georgetown Law campus.

IMPORTANT: This dataset was shared with me with the strict understanding that it would not be shared broadly or posted publicly. Although all of the tweets in this file were posted publicly on Twitter, this is a good reminder of a core theme of this class: aggregations of public data can sometimes raise new sensitivities or otherwise harm individual privacy. Please do not misuse this data!

About Shell Scripts

Shell scripts can be thought of as very lightweight computer programs. At their most basic, they are simply text files with Linux command line commands (the kind we were exploring earlier in the semester) listed one per line. When "run," a shell script simply marches through the file line by line, executing each command in turn. It is like a macro that allows you to repeat a series of command line steps, saving you the burden of having to type them all in each time.

Part IV of the Shotts book provides an in-depth overview of shell scripts. Because our shell scripts will be rudimentary, you probably need to read only Chapter 24 to learn all you need to know.

Shell scripts are created using Linux programs called "text editors." Think of a text editor as a simplified word processor. The Shotts book recommends you use a text editor called "vi" (also called "vim") to create and edit your shell scripts, and it provides an introduction to vi in Chapter 12. Although vi is widely used and weirdly beloved, it's also quite counter-intuitive, in my opinion. I use vi, but I think it would be overkill to inflict it on you by forcing you to learn it.

You might find it easier to work with "pico", which is more like Microsoft Word in its commands and conventions. You can find lots of help online about how to use pico. You can use either editor (or any other you happen to know about) for this assignment.

Your shell script should reside in the same directory as alldata.csv and should operate on or analyze this file. If a problem below asks you to "print to the screen" a particular answer, your shell script should simply run the appropriate command. If a problem asks you to "create a file," your shell script should use the file redirect mark ">" after a Linux command line to create the correct file. In other words, you will be submitting neither the answers nor the printed output themselves. Instead, you'll be submitting a shell script that can generate those answers or output.
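To make this concrete, here is a minimal sketch of what a shell script might look like. The script name example.sh, the echo message, and the file first-lines.txt are purely illustrative and not part of the assignment; your actual script will contain the commands that the Problems below call for.

    #!/bin/bash
    # example.sh -- an illustrative script, not an answer to any Problem.
    # Each line below is an ordinary Linux command; when the script is run
    # (for example with: bash example.sh), the commands execute in order.

    # Print a message to the screen.
    echo "Hello from a shell script"

    # Print the first few lines of alldata.csv to the screen.
    head alldata.csv

    # Use the redirect mark ">" to send the output into a new file instead.
    head alldata.csv > first-lines.txt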
Grading

Even though there are "correct" answers for almost every Problem below, I expect some variability in the way you present those answers, which may require some subjectivity on my part as the grader. Each Problem below is worth three points, meaning the entire assignment is worth 30 points. (Problems with multiple parts are worth three points cumulative.) For each problem, I will use the following rubric:

Points  Means
3       Correct answer, correctly presented or generated.
2       Minor errors in answer, presentation, or generation.
1       Major errors in answer, presentation, or generation.
0       Entirely incorrect or impossible to diagnose what happened.

To grade your shell script, I will literally copy it into a new directory containing only alldata.csv and execute your script. If it prints the correct output to the screen and creates the correct files, you will earn full credit. If it prints incorrect output, fails to properly generate the correct file, or doesn't work at all, I will read the script to see if I can find your mistake. In this case, I will deduct some credit.

To grade your Excel workbook, I will open it up and grade your answers by hand.

Setting up your work environment

The following steps are not worth any points, but they are necessary predicates to complete the rest of the assignment.

Using the AWS panel, start your instance.

Once it has started, use Cyberduck (remember to use the "sftp" protocol) to upload alldata.csv to your instance. You may want to create a new folder for this file.

Next, log into your instance using ssh. Find alldata.csv and use the "more" or "less" command to view its contents. Spend some time acquainting yourself with the data. In particular, pay attention to the first line of the file, which we call the "header line". See if you can decode each entry in that line, which will probably require you to glance at some of the data below it. What character is used to separate bits of information on each row? We call this character the "delimiter" and we say that the delimiter separates each row into multiple "fields" of information. What does "csv" in the file extension mean?

Mac and Linux users: it may be possible to use the Terminal program to generate the shell script, meaning you won't need to use your instance at all. Feel free to do this, if you choose.

Shell Script Tasks

You should name your shell script data-<netID>.sh, replacing <netID> with your Net ID. For example, my script would be called data-pko6.sh. Your shell script should do all of the following for full credit:

Problem 1. Print to the screen the number of total rows of data in alldata.csv (for this and all subsequent problems, just print the bare number on a line by itself).

Problem 2. Print to the screen the number of total rows of data in alldata.csv that contain each of the following words, printing each answer on its own row (three rows in total):

2a. Clinton
2b. Trump
2c. Georgetown

Important: Your answers should be case insensitive, meaning they should ignore upper and lower case. In other words, a row should be counted for 2a if it includes "Clinton", "clinton", "CLINTON", "cLiNtOn", etc.
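If you get stuck on Problems 1 and 2, the lines below are a hedged sketch of the general shape of the commands involved, assuming your script sits in the same directory as alldata.csv. Note that wc -l counts every line of the file, including the header line; whether and how to account for that is left to you.

    # Problem 1 (sketch): print only the number of lines in the file.
    # Reading from standard input ("<") keeps the filename out of the output.
    wc -l < alldata.csv

    # Problem 2 (sketch): count rows containing a word, ignoring case.
    # -i makes the match case insensitive; -c prints only the count.
    grep -ci "clinton" alldata.csv
    grep -ci "trump" alldata.csv
    grep -ci "georgetown" alldata.csv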
Problem 3. Print to the screen just the screen name field of each row (called u_screen_name in the header line), but only for the last ten rows of alldata.csv.

Hint: You'll need to use two Linux commands separated by a "pipe" (|) character to accomplish this. The first command is "cut", which is really complicated but powerful. See Shotts, Chapter 20 for more details.

Problem 4. How many unique screen names (u_screen_name) are present in this dataset? Print only the numeric answer to the screen.

Big Hint: You need to use the following four Linux commands, separated by pipe characters and in this order, to accomplish this: cut, sort, uniq, and wc. sort and uniq are both described in Chapter 20. Why is "sort" necessary for this to work?

Problem 5. Create a new file called dates-all.csv containing each unique date appearing in the "date" field of alldata.csv, as well as the number of times that date appears in the file.

Hint: The workflow is almost identical to Problem 4, with two differences. First, "uniq -c" will prepend each row with the count of how many times that value appears, as the problem requires. Second, rather than "wc", end your pipeline with a redirect (">") followed by dates-all.csv.

Hint 2: The first few lines of your file should look like:

  6 2014-01-01
 52 2014-01-02
 73 2014-01-03
...

Problem 6. Three of the most common words tweeted in the file are "fitness", "starbucks", and "happy". For each of these words, your script should create the same type of count/date file you created in Problem 5, but listing dates only for those rows containing the word anywhere in the row. Call these files, respectively, dates-fitness.csv, dates-starbucks.csv, and dates-happy.csv.

Hints: You need to use the redirect symbol ">" to create each file. Also, be sure to include rows that include those keywords, regardless of case. Reminder: your shell script should create these files each time it is run. In other words, you will be submitting a shell script that can generate these three files, rather than submitting the three files themselves.

Problem 7. Find another word that occurs at least 15 times in the dataset. Avoid short words like "and", "or", and "not". For this word, create the same kind of count/date file you created in Problems 5 and 6, listing dates only for those rows containing your word anywhere in the row. Call this file dates-<word>.csv, where <word> is your word.
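The pipelines below are a hedged sketch of how the hints for Problems 3 through 6 fit together; Problem 7 follows the same pattern with a word of your choosing. The field numbers after -f are placeholders only (they assume u_screen_name is the second field and the date is the third), so you will need to count the fields in the header line of alldata.csv yourself, and your working commands may differ.

    # Problem 3 (sketch): pull out one comma-separated field, then keep only
    # the last ten rows.
    cut -d',' -f2 alldata.csv | tail -n 10

    # Problem 4 (sketch): count unique screen names. sort groups identical
    # names together so uniq can collapse them; wc -l counts what remains.
    cut -d',' -f2 alldata.csv | sort | uniq | wc -l

    # Problem 5 (sketch): count how many times each date appears and
    # redirect the result into a file instead of printing it to the screen.
    cut -d',' -f3 alldata.csv | sort | uniq -c > dates-all.csv

    # Problem 6 (sketch): the same pipeline, restricted to rows containing a
    # keyword, regardless of case.
    grep -i "fitness" alldata.csv | cut -d',' -f3 | sort | uniq -c > dates-fitness.csv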
Moving the data to Excel

The remainder of the assignment will be conducted on your desktop or laptop computer, using Microsoft Excel. You must complete the rest of the problems below inside a single Excel workbook, which you will submit with your shell script.

Using Cyberduck, copy the following files from your AWS instance to your local computer: alldata.csv, dates-all.csv, dates-fitness.csv, dates-starbucks.csv, dates-happy.csv, and the dates-<word>.csv file with the <word> you selected. You may also want to copy your shell script (data-<netID>.sh) to your local computer at this time, to prepare to submit it for a grade later.

Open Microsoft Excel and create a new, blank workbook called data-<netID>.xlsx. For example, mine would be called data-pko6.xlsx. In this workbook, create two worksheets, one called "answers" and the other called "charts".

Examining the data in Excel

One benefit of working with csv data is that you can double-click a csv file on your computer and it will load into a spreadsheet in Excel. Even better, Excel will recognize the commas as delimiters and use them to separate fields into columns.

Double-click the alldata.csv file. Spend some time looking through the data, comparing what you see to your earlier review of this file in Linux.

The rest of the problems assume a general working knowledge of Excel, which I have been led to believe in class discussion you each already possess. I have also provided links on the class website to tutorial pages describing most of the skills you'll need to complete the following. If, after reading these, you still don't know how to do some of the tasks described below, Google is your friend.

Problem 8. Use Excel's "sort" functionality to sort various columns of alldata.csv, deriving answers to the following questions:

8a. Which user had posted the most tweets (u_statuses_count) from their account at the time they tweeted? Place that user's screen name (u_screen_name) in cell A1 of the "answers" worksheet.

8b. Which users had the most and second most followers (u_followers) at the time they tweeted? Place the screen name (u_screen_name) for the user with the most followers in cell A2 of the "answers" worksheet and for the user with the second most followers in cell A3 of "answers".

8c. Recall that these tweets are geolocated and restricted to people in or near the Georgetown Law campus. Which user was furthest north at the time they tweeted? Place that user's screen name (u_screen_name) in cell A4 of the "answers" worksheet.

Excel Preprocessing of dates-*.csv files

Double-click dates-all.csv and examine the data in Excel. Scroll all the way to the bottom of the data, and notice that the header field "1 created_at" appears last. Delete just this one cell.

In Excel, this data does not automatically appear split up into columns as with alldata.csv. Recall that the dates-*.csv files you created have the following format:

  6 2014-01-01
 52 2014-01-02
 73 2014-01-03
...

Because the fields in each line of these files are separated by spaces rather than commas, when you open one of these files, Excel won't automatically recognize the separate fields, but will instead cram all of the data on each line (e.g. " 6 2014-01-01") into a single cell.

Excel provides a tool called "Text to Columns" to "parse" this data into columns. In my version of Excel (Mac v.15), this tool is found on the "Data" panel. This function may appear in different places or operate differently in different versions, so the following instructions might not work precisely for you. Google is your friend.

To use "Text to Columns", first highlight the column you are trying to parse (i.e. click the "A" header at the top of the first column). Then click "Text to Columns". A "Convert Text to Columns Wizard" will appear. On this wizard:

Choose "Delimited" and click "Next".
For the delimiter, click "Space". (Others can be selected without causing problems.) The preview at the bottom should show your data separated into columns. Click "Next".
Click "Finish".

If all goes well, your data should now appear in separate columns. To make the next part a bit easier, you need to swap the order of the two columns (i.e. the dates should come first and the counts second). Swap these columns.

Save the file. (You will be warned that "This workbook contains features that will not work or may be removed if you save it in the selected file format." This is fine and you can click "Continue".)

Repeat this process for all of the other dates-*.csv files you generated.
For all of the others, however, there won't be a "created_at" entry for you to delete.

Generating Charts in Excel

Problem 9. If all went well, you should now have five dates-*.csv files with the data in each separated into columns. For each file in turn, do all of the following steps:

Open the file in Excel.
Highlight all of the data.
Create a chart depicting this data in some useful format. Recommended types are "Column", "Line", or "Area". Feel free to experiment and use something else if you think it produces a useful summary. In my version of Excel, creating a chart is as simple as clicking the various chart icons on the "Insert" panel.
Change the Chart Title to summarize what the chart depicts.
Make any other formatting changes you wish to make your chart more readable or useful.
Click on the chart and copy it into the "charts" worksheet of your data-<netID>.xlsx file.

When you are finished, the "charts" worksheet of your file should contain five separate charts, distinguishable by their titles.

Problem 10. Find something interesting you can conclude from one or more of the charts you generated about this data. Ideally, find something that would have been difficult to see just from inspecting the raw data. Place your interesting observation (probably a sentence or two) in cell A5 of the "answers" worksheet.

Submitting Your Answers

Email both your shell script (data-<netID>.sh) and your Excel workbook (data-<netID>.xlsx) to me at ohm@law.georgetown.edu. Do not email any of the csv datasets you used or generated.