NYU Stern School of Business | Full-time MBA, Part-time ...



December 13, 2004

Violent Crime in America

Introduction

Violent crime in the United States is an important subject, particularly in New York City where people perceive the risk of being victimized by crime to be relatively high. As residents of New York City, the risk of violent crimes affects the way we live our lives, whether or not we actually become a victim of a crime. We have to think twice about traveling alone on the subway late at night, or jogging in Central Park after dark. Therefore, in thinking about the quality of our lives here, we wonder what societal factors must be in place in order to live in a more peaceful world, where the risks of being a violent crime victim would be lower (or maybe we should just move out of the city).

Our data analysis project analyzes violent crime in America. We will determine the most important statistical drivers of violent crime over the period 1970-2002. We are interested in environmental and societal factors that fluctuate from year to year and that may be correlated with the rate of violent crime. Are there factors that we assume are correlated but really are not? Are there factors that we assume have no association with violent crime but really do? We aim to draw conclusions about what factors must be in place in order for violent crime to be reduced over the next 30 years.

The Data

To analyze the violent crime rate, and its drivers, we have collected the following data:

|Data |Source |Frequency |Timeframe |
|Violent crime rate (target variable) |Bureau of Justice Statistics |Annual |1960-2002 |
|Unemployment rate |Census Bureau |Monthly |1960-2002 |
|Federal Prison population |Federal Bureau of Prisons |Annual |1970-2002 |
|Poverty rate |Census Bureau |Annual |1960-2002 |
|Economic growth – GDP |Bureau of Economic Analysis |Annual |1960-2002 |

In collecting the data, we faced several issues. First, we had expected to analyze data for 1964-2003; however, several of the data series are not available as far back as 1964, so we limit the analysis to 1970-2002. Specifically, we were unable to find data on the total prison population going back to the 1960s; therefore, we chose to use the Federal Prison Population instead, since this data series extends back to 1970. Although less ideal than the total U.S. prison population, we believe the Federal data series may add to our understanding. Second, we chose to analyze annual observations; however, the unemployment data seems to be available only on a monthly basis, so we had to transform it to annual data. Since we don't have weightings, we annualized it by calculating an unweighted mean of the monthly data. This transformation could potentially have a negative effect on the validity of our conclusions. Third, we expected an issue with the data for 2001 if victims of 9/11 were counted as victims of violent crime, but upon further analysis, they were not. Had they been counted, our other variables would have been less relevant to the 2001 crime rate than they are in other years. Finally, our data is in different units: some series are rates (crime rate, unemployment rate) that fluctuate over time, while others are absolute numbers (prison population) that tend to grow over time. We may need to transform our data to make regression analysis more meaningful.
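The annualization step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure actually used for the project: the function name `annualize` and the twelve monthly values are hypothetical.

```python
from statistics import mean

def annualize(monthly_rates):
    """Collapse 12 monthly rates into one annual figure with an unweighted mean."""
    if len(monthly_rates) != 12:
        raise ValueError("expected 12 monthly observations")
    return mean(monthly_rates)

# Hypothetical monthly unemployment rates (percent) for a single year.
example_year = [6.1, 6.3, 6.2, 6.0, 5.9, 6.1, 6.2, 6.4, 6.3, 6.1, 6.0, 5.8]
annual_rate = annualize(example_year)
print(round(annual_rate, 2))
```

Because every month receives the same weight, months with a larger labor force count no more than months with a smaller one, which is the limitation noted above.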

Expected Outcome

Through an analysis of the data we expect to find that the unemployment rate is correlated with the violent crime rate and that higher unemployment is associated with higher violent crime. This is because unemployment lowers income, which may drive crime related to robbery. We expect that a higher poverty rate will be associated with higher crime for the same reason. We expect that when GDP is lower or falling, violent crime will rise. We expect a higher prison population to be associated with lower violent crime because those most likely to commit violent crime are incarcerated.

We believe that by statistically analyzing the violent crime rate and its potential drivers, we can increase our understanding of crime and what factors are associated with a lower incidence of it.

General Observation of the Variables

We will begin our analysis with an examination of the descriptive statistics as well as a histogram for each of our variables. This will enable us to determine whether or not the data is normally distributed and to see if there are any variables that may cause problems when we go deeper into our statistical analysis of the data. The descriptive statistics are as follows:

Descriptive Statistics

|Variable |Mean |SE Mean |StDev |Minimum |Q1 |Median |Q3 |Maximum |
|Violent Crime rate |565.9 |18.8 |107.9 |363.5 |491.2 |556.6 |636.9 |758.1 |
|Total Prison Pop |48598 |5937 |34105 |19023 |21654 |30104 |75453 |128090 |
|Avg Annual Unemployment Rate |6.285 |0.244 |1.404 |4.000 |5.350 |6.000 |7.200 |9.700 |
|Poverty for Families |10.361 |0.190 |1.091 |8.700 |9.300 |10.300 |11.300 |12.300 |
|GDP in billions |4883 |511 |2937 |1039 |2163 |4463 |7235 | |

As is apparent in the data above, some of the variables seem to be fairly normally distributed, as the mean and median for these variables are similar to each other. This is supported by each of the histograms we looked at as well. The exceptions are the variables Total Prison Population and GDP, which both have a higher mean relative to the median. This lack of normality is apparent in the histograms of each of these variables, as seen below.

[pic]

[pic][pic]

[pic][pic]

Because Total Prison Population has a long right tail, we decided to perform a transformation by taking a log base 10 of the data in order to see if that would help create a more normal distribution. We also logged the GDP data, since it is money data. As is apparent from the histograms of the logged data, this transformation did not seem to sufficiently affect the distribution of the data.

[pic][pic]

This may have to do with the fact that these are time series data, fixing which is beyond the scope of this project. While taking the logs for Prison Population and GDP did not make them normally distributed, we decided to continue using this logged data in the rest of our analysis.
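To make the effect of the base-10 log transformation concrete, the short Python sketch below applies it to the minimum and maximum of the prison population series reported in the descriptive statistics. It only illustrates how the log compresses a long right tail; it does not reproduce the histograms.

```python
import math

# Minimum and maximum federal prison population from the descriptive statistics.
prison_min, prison_max = 19023, 128090

raw_ratio = prison_max / prison_min            # spread on the original scale
log_min, log_max = math.log10(prison_min), math.log10(prison_max)
log_range = log_max - log_min                  # spread on the log10 scale

print(f"raw max/min ratio: {raw_ratio:.2f}")
print(f"log10 range: {log_min:.3f} to {log_max:.3f} (width {log_range:.3f})")
```

On the log scale, equal distances correspond to equal percentage changes, so a series that grows steadily in percentage terms is spread more evenly than it is in raw units.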

We also examined correlations among our variables, substituting our two transformed variables for their original variables. The best regressions arise when the predictor variables are highly correlated with the target variable but not with each other. In our data, the poverty rate and log of GDP are highly correlated with the violent crime rate; however, several pairs of predictor variables are highly correlated with one another.

Correlations

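| |Violent Crime rate |Avg Annual Unemployment Rate |Poverty for Families |LogT Prison Pop |
|Avg Annual Unemployment Rate |0.129 | | | |
|Poverty for Families |0.656 |0.596 | | |
|LogT Prison Pop |0.395 |-0.516 |0.017 | |
|LogT GDP |0.647 |-0.277 |0.284 |0.888 |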

Single Variable Regressions

While we are ultimately concerned with how all the variables together predict Violent Crime, we are first going to examine how each one, on its own, relates to our target. To do this, we created a scatter plot with a fitted regression line for each of the predictor variables against the target of violent crime rate, as displayed below.

[pic][pic][pic][pic]

In looking at the slope of the fitted line, all of the variables appear to have a positive relationship with the target, indicating that as each variable increases, the violent crime rate increases as well. That said, it seems that no one variable alone has a very strong correlation with the violent crime rate. For instance, the scatter of the violent crime rate around the fitted line for the log of GDP increases over time. We can therefore conclude at this point that each variable on its own is not a good predictor of violent crime. It is our hope that when these variables act together, the relationship will be stronger and that as a group they will be better predictors of violent crime. To determine this, we move on to the next step in analyzing the data, a multiple regression model.
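One way to quantify how weak each single-variable fit is: in a regression with one predictor, R-Sq equals the square of the Pearson correlation between predictor and target. The sketch below applies that identity to the correlations reported in the correlation table; the dictionary names are ours, chosen for illustration.

```python
# Correlations with the violent crime rate, copied from the correlation table.
correlations = {
    "Avg Annual Unemployment Rate": 0.129,
    "Poverty for Families": 0.656,
    "LogT Prison Pop": 0.395,
    "LogT GDP": 0.647,
}

# In a one-predictor regression, R-Sq equals the squared Pearson correlation.
r_squared = {name: r ** 2 for name, r in correlations.items()}

for name, value in sorted(r_squared.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R-Sq = {value:.1%}")
```

Even the strongest single predictor, the poverty rate, leaves more than half of the variance in the violent crime rate unexplained.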

Initial Multiple Regression

Next we ran a multiple regression of the violent crime rate and our four predictor variables (Avg Annual Unemployment Rate, Poverty for Families, log of GDP Current Dollars, and log of Federal Prison Population). The regression equation is given below.

Regression Analysis

The regression equation is

Violent Crime rate = - 96 - 16.0 Avg Annual Unemployment Rate + 52.8 Poverty for Families - 200 LogT Prison Pop + 316 LogT GDP

|Predictor |Coef |SE Coef |T |P |
|Constant |-96.2 |302.4 |-0.32 |0.753 |
|Avg Annual Unemployment Rate |-16.05 |13.16 |-1.22 |0.233 |
|Poverty for Families |52.77 |15.87 |3.32 |0.002 |
|LogT Prison Pop |-199.9 |110.7 |-1.81 |0.082 |
|LogT GDP |315.51 |97.71 |3.23 |0.003 |

S = 63.1435 R-Sq = 70.0% R-Sq(adj) = 65.7%

In looking at the coefficients of this regression equation, we learn, for example, that holding all else fixed, a one point increase in the poverty rate is associated with a 52.77 point increase in the violent crime rate. Similarly, the coefficient of the log of the prison population tells us that every one point increase in the log of the prison population is associated with a 199.9 point decrease in the violent crime rate. Interestingly, an increase in the unemployment rate is associated with a decrease in the violent crime rate, and an increase in the logged GDP is associated with an increase in the violent crime rate. Next, the regression model succeeded in reducing the noise in the violent crime rate from a standard deviation of 107.9 before the regression to a standard error of regression of 63.1. This means that, if the regression assumptions hold, roughly 95% of the time the model predicts the crime rate to within ±2 × 63.1, or about ±126 points. This is an indication that a prediction of violent crime using this regression equation would be much more accurate than an estimate based solely on its historical mean and variance. In addition to looking at the standard error, it is also important to examine the degree to which these four variables explain the variance in the violent crime rate. To do this we looked at the adjusted R-Sq, which indicates that the four predictor variables account for 65.7% of the variance in the violent crime rate. It is difficult for us to tell at this time whether this R-Sq is better or worse than that of other models that attempt to explain crime.
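As a rough sketch of how the fitted equation and its standard error would be used together, the Python below evaluates the four-predictor equation and a band of plus or minus two standard errors. The coefficients and S come from the regression output; the function names and the input values are hypothetical and chosen only for illustration.

```python
# Coefficients and standard error of regression from the output above.
COEF = {
    "const": -96.2,
    "unemployment": -16.05,
    "poverty": 52.77,
    "log_prison": -199.9,
    "log_gdp": 315.51,
}
S = 63.1435

def predict(unemployment, poverty, log_prison, log_gdp):
    """Fitted violent crime rate from the four-predictor equation."""
    return (COEF["const"]
            + COEF["unemployment"] * unemployment
            + COEF["poverty"] * poverty
            + COEF["log_prison"] * log_prison
            + COEF["log_gdp"] * log_gdp)

def rough_interval(fitted, s=S):
    """Approximate 95% band: fitted value plus or minus two standard errors."""
    return fitted - 2 * s, fitted + 2 * s

# Hypothetical inputs, chosen only for illustration.
fitted = predict(unemployment=6.0, poverty=10.0, log_prison=4.5, log_gdp=3.7)
low, high = rough_interval(fitted)
print(f"fitted = {fitted:.1f}, rough 95% band = ({low:.1f}, {high:.1f})")
```

Raising `poverty` by one point while leaving the other inputs unchanged moves the fitted value by exactly the poverty coefficient, which is what "holding all else fixed" means in the interpretation above.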

Finally we considered the T and P values of the predictor variables to determine if each is significant to the regression equation. There are two variables for which the P-value is above 0.05 (the log of the prison population and the unemployment rate); therefore, these variables appear statistically insignificant to the model. This indicates that perhaps these variables could be removed without much reduction in model power.
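The T column in the output is simply each coefficient divided by its standard error, and a P value above 0.05 corresponds to an absolute T value below the 5% critical value. The sketch below recomputes the ratios from the reported figures. The critical value 2.048 assumes 28 degrees of freedom (33 annual observations minus 5 estimated coefficients), which is our assumption about the sample size rather than something stated in the output.

```python
# (coefficient, standard error) pairs from the regression output.
terms = {
    "Constant": (-96.2, 302.4),
    "Avg Annual Unemployment Rate": (-16.05, 13.16),
    "Poverty for Families": (52.77, 15.87),
    "LogT Prison Pop": (-199.9, 110.7),
    "LogT GDP": (315.51, 97.71),
}

# Two-sided 5% critical value of the t distribution with 28 degrees of
# freedom (assuming 33 annual observations minus 5 estimated coefficients).
T_CRIT = 2.048

t_ratios = {name: coef / se for name, (coef, se) in terms.items()}
significant = {name: abs(t) > T_CRIT for name, t in t_ratios.items()}

for name, t in t_ratios.items():
    flag = "significant" if significant[name] else "not significant"
    print(f"{name}: T = {t:.2f} ({flag} at the 5% level)")
```

The recomputed ratios can differ from the printed T column in the last digit because the coefficients and standard errors shown in the output are themselves rounded.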

Assumptions

Linear regression involves four major assumptions, and this regression violates two of the four. The first assumption is that the expected value of the error terms for all observations is equal to zero. Judging by the Residuals Versus the Fitted Values plot below, the expected value of the error terms appears approximately equal to zero. Also, there are no known subgroups whose fitted values are systematically above or below the regression line. We believe this first assumption holds. The second assumption is homoscedasticity, that the regression relationship is equally strong throughout the population. That assumption does not hold in this regression. The Residuals Versus the Fitted Values plot shows that the variance is not constant – the variance is larger for larger fitted values. The third assumption is that the residual of one term tells us nothing about the residual of another term. This assumption is violated in this regression, as it is in many regressions of time series data. The Residuals Versus the Order of the Data plot shows that each residual is related to the residual of the prior observation. The fourth assumption of linear regression is that the residuals are normally distributed. The plots Normal Probability Plot of the Residuals and Histogram of the Residuals show that the residuals are approximately normal; therefore this assumption holds for this regression.

[pic]

[pic][pic]

[pic][pic]
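The third assumption can also be checked numerically rather than only by eye. One simple check, sketched below, is the correlation between each residual and the one before it; a value near zero is what independence would suggest, while a value near one indicates the kind of dependence described above. The residuals here are hypothetical, since the report does not list the actual residuals, and the function names are ours.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one immediately before it."""
    return pearson(residuals[:-1], residuals[1:])

# Hypothetical residuals in time order: long runs of the same sign, the
# pattern a Residuals Versus the Order of the Data plot would reveal.
example_residuals = [-40, -35, -20, -5, 10, 30, 45, 50, 35, 15, -10, -30]
print(round(lag1_autocorrelation(example_residuals), 2))
```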

In addition to considering the four assumptions, we also looked for any outliers in the data by more closely examining the Normal Probability Plot of the Residuals. We noticed a couple of outliers toward the very top of the graph. Upon analysis of these outliers, we believe they occurred due to the relative increase in the crime rate during the early 1990s and do not feel it necessary to remove the data points from our model at this time.

Improving the Model

Several factors indicate that our initial model may not be the optimal model possible with our predictor variables. First, two variables, the unemployment rate and the log of prison population, have p-values above 0.05. Second, our model violates two of the four assumptions of linear regression. To improve the model, we ran a “best subsets” regression, the output of which follows.

Best Subsets Regression

Response is Violent Crime rate

A=Avg Annual Unemployment Rate

B=Poverty for Families

C=LogT Prison Population

D=LogT GDP

|Vars |R-Sq |R-Sq(adj) |Mallows C-p |S |A |B |C |D |
|1 |43.1 |41.2 |24.2 |82.686 | |X | | |
|2 |66.2 |63.9 |4.6 |64.792 | |X | |X |
|3 |68.4 |65.2 |4.5 |63.672 | |X |X |X |
|4 |70.0 |65.7 |5.0 |63.143 |X |X |X |X |

The best subsets analysis indicates that only two variables are necessary to reach an adjusted R-Sq of 63.9%, whereas our four-variable equation had an adjusted R-Sq of 65.7%, a very small difference. The two variables that add so little power to the model are the unemployment rate and the log of the prison population; these are the same two variables with high p-values (above 0.05) in our initial regression. We believe that by eliminating these two variables, the model strikes the best balance between model power and complexity. Our optimal model then is as follows.
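The adjusted R-Sq values being compared here follow from R-Sq by a standard formula that penalizes each added predictor. The sketch below recomputes them from the R-Sq column of the best subsets output, assuming 33 annual observations (1970 through 2002); that sample size is our assumption, and the function name is ours.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-Sq penalizes R-Sq for the number of predictors used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_OBS = 33  # assumption: one observation per year, 1970 through 2002

# R-Sq (as a fraction) for the best model of each size, from the table above.
best_r_squared = {1: 0.431, 2: 0.662, 3: 0.684, 4: 0.700}

adjusted = {k: adjusted_r_squared(r2, N_OBS, k) for k, r2 in best_r_squared.items()}
for k, value in adjusted.items():
    print(f"{k} predictor(s): adjusted R-Sq = {value:.1%}")
```

Because the inputs are rounded to one decimal place, the recomputed values can differ from the printed R-Sq(adj) column by about a tenth of a percentage point. The pattern is the same either way: going from two to four predictors raises adjusted R-Sq by less than two points.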

Regression Analysis

The regression equation is

Violent Crime rate = - 592 + 50.8 Poverty for Families + 176 LogT GDP

|Predictor |Coef |SE Coef |T |P |
|Constant |-591.8 |153.2 |-3.86 |0.001 |
|Poverty for Families |50.81 |10.94 |4.64 |0.000 |
|LogT GDP |175.59 |38.79 |4.53 |0.000 |

S = 64.7915 R-Sq = 66.2% R-Sq(adj) = 63.9%

This new model explains 63.9% of the variance in the violent crime rate (as indicated by the adjusted R-Sq). The original noise in our target variable was 107.9; our model reduces noise in the target variable to 64.8 (the standard error of regression). Both predictor variables are significant to the model (as indicated by p-values less than 0.05). The equation tells us that, all else held constant, a one point increase in the poverty rate is associated with a 50.81 point increase in the violent crime rate. Similarly, a one point increase in the log of GDP is associated with a 175.59 point increase in the violent crime rate.
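A one point increase in a base-10 log is a tenfold increase in the underlying quantity, so the GDP coefficient is easier to read when translated into more realistic changes. The sketch below does that translation for the two-variable model; the function name and the chosen GDP ratios are illustrative only.

```python
import math

LOG_GDP_COEF = 175.59  # coefficient on LogT GDP in the two-variable model

def crime_rate_change(gdp_ratio, coef=LOG_GDP_COEF):
    """Change in fitted crime rate when GDP is multiplied by gdp_ratio,
    holding the poverty rate fixed."""
    return coef * math.log10(gdp_ratio)

print(f"GDP x10:  {crime_rate_change(10):+.2f} points")
print(f"GDP x2:   {crime_rate_change(2):+.2f} points")
print(f"GDP +10%: {crime_rate_change(1.10):+.2f} points")
```

Under this reading, a doubling of GDP with the poverty rate held constant is associated with an increase of roughly 53 points in the violent crime rate, not 175.59.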

This new model conforms to the four assumptions of linear regression better than our initial model did. It does not violate the first assumption (expected value of error terms equal to zero), as seen in the plot below. This regression does violate the second assumption (homoscedasticity), since the variance of the residuals is higher for larger fitted values, but the variance is more constant than in our initial model. This regression also violates the third assumption (residuals tell us nothing about one another), since it is a time series. The fourth assumption (normality of residuals) is not violated by this regression equation. While not exactly normal, the residuals are approximately normal and certainly more normal than the residuals of our initial regression equation. In sum, our improved model violates the same two assumptions as our initial model (constant variance and independent residuals), but the departure from constant variance is milder and the residuals are closer to normal.

[pic]

[pic][pic]

Initial Conclusion and Original Expectations

First, let us look at the nature of the relationship between the national violent crime rate and each of the predictor variables, based on the initial multiple regression model. In half of the cases the direction of the relationship matched our expectations, and in the other half the relationship was the opposite of what we had expected. As stated earlier, we had assumed that an increase in GDP would be associated with a decrease in the crime rate. This does not seem to be the case: the coefficient for the logged GDP is positive, so an increase in GDP is associated with an increase in the violent crime rate. Additionally, we had expected that an increase in the unemployment rate would be associated with an increase in the violent crime rate. However, based on the negative coefficient for unemployment, an increase in unemployment is, in our model, associated with a decrease in violent crime. The other two variables do have the relationships we assumed they would have. An increase in the poverty rate is associated with an increase in the violent crime rate, as indicated by the positive coefficient for the poverty rate. In addition, as we had assumed, an increase in the prison population is associated with a decrease in the crime rate. These associations, of course, assume all other variables are held constant.

More importantly perhaps, we chose these four variables under the assumption, prior to statistically analyzing the data, that all four variables together would serve as a fairly good predictor of the national violent crime rate. After looking at the multiple regression model for the data, the results do not fully support our original expectations. To begin with, in order to strengthen our analysis we had to remove two of the four variables completely: the unemployment rate and the prison population. We now believe that, among our four variables, the national rate of violent crime for the period 1970-2002 is best explained by the poverty rate and the level of GDP. That said, violent crime is quite difficult to predict using the data we have analyzed thus far. Therefore, we decided to try one last thing in our effort to predict the national violent crime rate.

Incorporating a Lagged Variable

We considered the fact that the best predictor of the violent crime rate may be the violent crime rate of the prior year. To examine this we first ran a correlation between the violent crime rate and the lag (by one period) of the violent crime rate.

Correlations: Violent Crime rate, Lag of Violent Crime Rate

Pearson correlation of Violent Crime rate and Lag of Violent Crime Rate = 0.957

This very high correlation of 0.957 tells us that the violent crime rate in one period is likely to be a strong predictor of the violent crime rate in the next period. We next constructed a second best subsets regression, this time including the lag variable.
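Constructing the lag variable amounts to shifting the series by one year, which leaves the first year without a prior value; that is why the output that follows reports one case with missing values. The sketch below shows the construction and the correlation on complete cases. The annual rates are hypothetical and the function names are ours, so the printed correlation is not the 0.957 reported above.

```python
from math import sqrt

def lag(series, periods=1):
    """Shift a series back in time; the first `periods` entries have no prior value."""
    if periods < 1:
        raise ValueError("periods must be at least 1")
    return [None] * periods + list(series[:-periods])

def pearson_complete_cases(x, y):
    """Pearson correlation using only the rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n

# Hypothetical annual rates, for illustration only.
rates = [360, 400, 420, 460, 490, 470, 500, 540, 600, 590]
lagged = lag(rates)
r, cases_used = pearson_complete_cases(rates, lagged)
print(f"r = {r:.3f} using {cases_used} of {len(rates)} cases")
```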

Best Subsets Regression

Response is Violent Crime rate

32 cases used, 1 cases contain missing values

A=Avg Annual Unemployment Rate

B=Poverty for Families

C=LogT Prison Population

D=LogT GDP

E=Lag of Violent Crime Rate

Mallows

Vars R-Sq R-Sq(adj) C-p S A B C D E

1 91.5 91.2 12.4 30.531 X

2 93.3 92.9 5.8 27.557 X X

3 94.1 93.5 3.9 26.284 X X X

4 94.4 93.6 4.5 26.105 X X X X

5 94.5 93.5 6.0 26.326 X X X X X

The result was surprising: a regression with only the lag variable had an adjusted R-Sq of 91.2%, significantly higher than the 63.9% adjusted R-Sq of our previous best subsets model. Once the lag variable was included, the other variables added little additional power. As a result, our new best model has only the lag of the violent crime rate as predictor. The regression equation for this model is below.

Regression Analysis: Violent Crime rate versus Lag of Violent Crime Rate

The regression equation is

Violent Crime rate = 56.8 + 0.907 Lag of Violent Crime Rate

32 cases used, 1 cases contain missing values

|Predictor |Coef |SE Coef |T |P |
|Constant |56.83 |29.13 |1.95 |0.061 |
|Lag of Violent Crime Rate |0.90719 |0.05039 |18.00 |0.000 |

S = 30.5314 R-Sq = 91.5% R-Sq(adj) = 91.2%

Analysis of Variance

|Source |DF |SS |MS |F |P |
|Regression |1 |302126 |302126 |324.11 |0.000 |
|Residual Error |30 |27965 |932 | | |
|Total |31 |330091 | | | |

[pic]
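The summary figures reported for this regression can all be recovered from the Analysis of Variance table, which is a useful consistency check. The sketch below recomputes R-Sq, S, and F from the sums of squares and degrees of freedom in that table; the variable names are ours.

```python
from math import sqrt

# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_regression, df_regression = 302126, 1
ss_error, df_error = 27965, 30
ss_total = ss_regression + ss_error

ms_error = ss_error / df_error                       # mean squared error
r_squared = ss_regression / ss_total                 # share of variance explained
s = sqrt(ms_error)                                   # standard error of regression
f_stat = (ss_regression / df_regression) / ms_error  # overall F statistic

print(f"R-Sq = {r_squared:.1%}, S = {s:.2f}, F = {f_stat:.2f}")
```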

The coefficient tells us that each one point increase in the violent crime rate is associated with a 0.907 point increase in the violent crime rate for the following year. This regression reduces the noise of the response variable to a standard error of 30.5 from an original standard deviation of 107.9. The adjusted R-Sq tells us that the regression explains 91.2% of the variance in the violent crime rate. The p-value for the predictor variable tells us that, if the true coefficient were zero, the probability of observing a T value this extreme would be less than 0.0005. Since this is now a one-variable regression, the F statistic and its associated p-value tell us the same information as the p-value of the coefficient.
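A brief sketch of how this one-variable equation would be used: given this year's rate, it returns next year's fitted rate and a rough band of plus or minus two standard errors. It also checks the statement about the F statistic, since in a one-predictor regression F is the square of the T value. The current rate of 500.0 is hypothetical, and the function name is ours.

```python
INTERCEPT, SLOPE = 56.83, 0.90719  # coefficients from the lag regression
S = 30.5314                        # standard error of regression

def forecast_next_year(current_rate):
    """One-step-ahead fitted value and a rough plus-or-minus 2S band."""
    fitted = INTERCEPT + SLOPE * current_rate
    return fitted, (fitted - 2 * S, fitted + 2 * S)

# In a one-predictor regression the F statistic is the square of the T value.
t_value = 18.00
f_from_t = t_value ** 2

fitted, band = forecast_next_year(500.0)  # 500.0 is a hypothetical current rate
print(f"forecast = {fitted:.1f}, band = ({band[0]:.1f}, {band[1]:.1f})")
print(f"T squared = {f_from_t:.0f}, reported F = 324.11")
```

The band is only as trustworthy as the regression assumptions, two of which this model violates, so it should be read as a rough guide rather than a formal prediction interval.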

Our new regression violates two of the four assumptions of linear regression. It does not violate the first assumption, since the expected value of the residuals appears close to zero. The second assumption is violated since the residuals exhibit non-constant variance; the variance increases for larger fitted values. The regression violates the third assumption since each residual value is related to the residual of the prior year. Our regression does not violate the fourth assumption since the residuals are approximately normally distributed.

Implications

Our analysis has taught us three lessons. First, we learned that the poverty rate and the size of the economy (the log of GDP) are each more highly correlated with the violent crime rate than the unemployment rate and the federal prison population are. Second, we confirmed that a higher poverty rate is associated with a higher rate of violent crime and learned that a larger economy is associated with a higher rate of violent crime. Third, we learned that the most effective predictor of the violent crime rate is in fact the rate itself, from the previous year.

In conclusion, it seems reasonable to suppose that since the violent crime rate has fallen every year for the past ten years, it may do so next year as well. As for predicting the national violent crime rate based on our original four variables, we found that it is quite difficult to do.
