NYC Graduation Model

Using the 2005-2015 Graduation Rates Public - School dataset from NYC Open Data, I created a multiple linear regression model that predicted the Total Grads % of Cohort.

To choose the inputs, I iterated through the remaining features and measured each one's effect on the model's r-squared value. A feature was kept if including it improved the metric. Otherwise, it was discarded from the list of input features.
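A minimal sketch of that keep-or-discard loop, using NumPy only. The function names, the 80/20 hold-out split, and the baseline score of 0 are assumptions made for illustration and are not taken from the project code. R-squared is scored on held-out rows here because the in-sample r-squared of a least-squares fit cannot decrease when a feature is added, so an in-sample check would never reject anything.

```python
import numpy as np


def holdout_r_squared(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares on the training rows and score R^2 on the test rows."""
    # Prepend a column of ones so the fit includes an intercept.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_test - A_test @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y_test - y_test.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def select_features(X, y, names, train_fraction=0.8, seed=0):
    """Single pass over the candidates: keep a feature only if it raises held-out R^2."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cut = int(train_fraction * len(y))
    train, test = order[:cut], order[cut:]

    kept = []   # column indices accepted so far
    best = 0.0  # assumed baseline: an intercept-only model scores roughly 0
    for j in range(len(names)):
        trial = kept + [j]
        score = holdout_r_squared(
            X[train][:, trial], y[train], X[test][:, trial], y[test]
        )
        if score > best:
            kept, best = trial, score
    return [names[j] for j in kept], best
```

Calling `select_features(X, y, names)` with a 2-D float array `X`, a target vector `y`, and the matching column names returns the kept names and the final held-out score. Because the pass is greedy, the result can depend on the order in which the features are visited.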

The data required significant pre-processing because the original file had formatting issues. After cleaning the values with the replace and strip methods in Python, I converted the numeric columns from object to float. Null values were then either replaced by the column mean or dropped, depending on the standard deviation of the column's values.
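The cleaning steps can be sketched with pandas roughly as follows. The stray "%" character, the helper name `clean_numeric`, and the `max_std` threshold of 15 are assumptions chosen for illustration; the text above does not state which characters were removed or which cut-off separated mean-filling from dropping.

```python
import pandas as pd


def clean_numeric(df, columns, max_std=15.0):
    """Convert text columns to floats, then fill or drop nulls column by column."""
    out = df.copy()
    for col in columns:
        # Remove the assumed stray "%" and surrounding whitespace before parsing.
        text = out[col].astype(str).str.replace("%", "", regex=False).str.strip()
        # Anything that still is not a number (for example "s") becomes NaN.
        out[col] = pd.to_numeric(text, errors="coerce")

    for col in columns:
        if out[col].std() <= max_std:
            # Low spread: the mean is a reasonable stand-in for missing values.
            out[col] = out[col].fillna(out[col].mean())
        else:
            # High spread: the mean would be misleading, so drop those rows.
            out = out.dropna(subset=[col])
    return out
```

For example, `clean_numeric(df, ["Total Grads % of Cohort"])` would return a copy of `df` with that column stored as floats and its nulls handled by one of the two rules.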

To visualize the relationships between the features, I generated a seaborn heatmap. It indicated that the three columns most strongly correlated with Total Grads % of Cohort were 'Total Regents of % of grads', 'Still Enrolled % of cohort', and 'Dropped Out % of cohort'. All three are in the model's final input list. The heatmap also showed that Cohort Year had fairly weak correlations with most of the features, suggesting that class year had little linear effect on graduation rates. Note that this assumption could throw off predictions for anomalous years like 2020 and 2021.
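A short sketch of how such a heatmap and the ranking behind it might be produced. The helper names are hypothetical, and the target column name is copied from the description above rather than checked against the dataset. The plotting libraries are imported inside `plot_heatmap` so that the ranking helper can be used without them.

```python
import pandas as pd

TARGET = "Total Grads % of Cohort"  # assumed column name, taken from the text above


def rank_correlations(df, target=TARGET):
    """Return the other numeric columns ordered by absolute Pearson correlation with target."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    strongest_first = corr.abs().sort_values(ascending=False).index
    return corr.reindex(strongest_first)


def plot_heatmap(df):
    """Draw an annotated correlation heatmap of the numeric columns."""
    # Imported lazily so rank_correlations works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(
        df.corr(numeric_only=True),
        annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1,
    )
    plt.tight_layout()
    return ax
```

`rank_correlations(df).head(3)` lists the three columns with the strongest correlations, keeping their signs, which is the same information read off the heatmap's target row.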

Other visualizations, including matplotlib scatterplots and a histogram as well as seaborn boxplots, revealed key details about the data. The plots highlighted the link between Regents diplomas and overall graduation rates. The boxplot showed that overall graduation rates were generally around 70%, with outliers at extremely low rates. The histogram indicated that traditional June graduations were the most common and that six-year graduations were more common than August graduations.
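The boxplot reading can also be checked numerically. The sketch below uses the conventional 1.5 x IQR whisker rule to list the low outliers; the helper name and the `whis` parameter are assumptions for illustration, and the numbers it returns depend entirely on the data passed in.

```python
import numpy as np


def boxplot_summary(rates, whis=1.5):
    """Return the median, the lower whisker fence, and the values that fall below it."""
    values = np.asarray(rates, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing rates
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    lower_fence = q1 - whis * (q3 - q1)
    low_outliers = np.sort(values[values < lower_fence])
    return {
        "median": float(median),
        "lower_fence": float(lower_fence),
        "low_outliers": low_outliers.tolist(),
    }
```

Passing the graduation-rate column to `boxplot_summary` gives the central value and the schools that a boxplot would draw as individual points below the lower whisker.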
