Using the 2005-2015 Graduation Rates Public - School dataset from NYC Open Data, I created a multiple
linear regression model that predicted the Total Grads % of Cohort.
To choose the inputs, I iterated through the other features one at a time and measured each one's impact
on the model's r-squared value. A feature was kept if including it improved this metric. Otherwise,
it was discarded from the list of input features.
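
The selection loop described above could look roughly like the sketch below. It is not the project's
actual code. It assumes r-squared is measured on a held-out split (in-sample r-squared never decreases
when a feature is added to a least-squares fit), and the function names, the 75/25 split, and the
synthetic data are all hypothetical.

```python
import numpy as np

def r_squared(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares on the training split; return R^2 on the test split."""
    # Prepend a column of ones so the fit includes an intercept.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_test - A_test @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y_test - y_test.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def select_features(X, y, names, train_fraction=0.75, seed=0):
    """Keep each candidate feature only if it raises held-out R^2 (assumed criterion)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cut = int(train_fraction * len(y))
    train, test = order[:cut], order[cut:]
    kept, best = [], -np.inf
    for j in range(len(names)):
        trial = kept + [j]
        score = r_squared(X[train][:, trial], y[train], X[test][:, trial], y[test])
        if score > best:  # the feature helped, so keep it
            kept, best = trial, score
    return [names[j] for j in kept], best

# Synthetic stand-in data: two informative columns and one pure-noise column.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=400)
selected, score = select_features(X, y, ["signal_a", "noise", "signal_b"])
```

Because features are tried in a fixed order and never revisited, the result of a loop like this can depend
on the order in which the candidate columns are listed.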
The data required significant pre-processing because the original file had formatting issues. After
handling these with the replace and strip methods in Python, I converted the numeric columns
from object to float. Null values were then either replaced by the column mean or dropped, depending on
the standard deviation of the column's values.
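
As an illustration of these cleaning steps, here is a minimal pandas sketch. The specific formatting
problems are assumptions (stray whitespace, percent signs, and a non-numeric placeholder such as "s"),
and the rule "fill with the mean when the standard deviation is at or below a threshold, otherwise drop
the rows" is one hypothetical reading of the approach. The helper names, the max_std value, and the
sample values are made up.

```python
import pandas as pd

def clean_numeric(series):
    """Strip whitespace and percent signs, then convert to float (unparseable -> NaN)."""
    text = series.astype(str).str.strip().str.replace("%", "", regex=False).str.strip()
    return pd.to_numeric(text, errors="coerce")

def handle_nulls(df, columns, max_std=10.0):
    """Fill nulls with the column mean when the spread is small; otherwise drop those rows.

    max_std is a hypothetical threshold, not a value taken from the project.
    """
    out = df.copy()
    for col in columns:
        if out[col].isna().any():
            if out[col].std() <= max_std:
                out[col] = out[col].fillna(out[col].mean())
            else:
                out = out.dropna(subset=[col])
    return out

# Made-up rows that imitate the assumed formatting issues.
raw = pd.DataFrame({
    "Total Grads % of Cohort": [" 70.5%", "82.0% ", "s", "64.1%"],
    "Dropped Out % of cohort": ["10.0%", "2.0%", "55.0%", None],
})
cleaned = raw.apply(clean_numeric)
result = handle_nulls(cleaned, list(cleaned.columns))
```

In this toy frame the first column has a small spread, so its missing value is filled with the mean, while
the second column varies widely, so the row with its missing value is dropped.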
To visualize the relationships between the features, I generated a seaborn heatmap. It indicated that the
three columns that correlated most strongly with Total Grads % of Cohort were 'Total Regents of % of grads',
'Still Enrolled % of cohort', and 'Dropped Out % of cohort'. All of these features are members of the final
input list for the model. The heatmap also revealed that the Cohort Year had fairly weak correlations
with most of the features, suggesting that class year had little linear relationship with graduation rates.
It's important to note that this assumption could throw off predictions in anomalous years like 2020 and 2021.
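
The numbers behind a heatmap like this come from a correlation matrix, which can also be ranked directly.
The sketch below is illustrative only: the data is synthetic, the helper name rank_correlations is
hypothetical, and the seaborn call is left as a comment so the snippet runs without a plotting backend.

```python
import numpy as np
import pandas as pd

def rank_correlations(df, target):
    """Return the correlation matrix and the other columns ranked by |correlation| with target."""
    corr = df.corr()
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    return corr, ranked

# Synthetic stand-in data; these values are not from the NYC dataset.
rng = np.random.default_rng(0)
n = 200
grads = rng.uniform(40, 95, n)
df = pd.DataFrame({
    "Total Grads % of Cohort": grads,
    "Dropped Out % of cohort": (100 - grads) * 0.5 + rng.normal(0, 2, n),
    "Cohort Year": rng.integers(2001, 2012, n),
})
corr, ranked = rank_correlations(df, "Total Grads % of Cohort")

# With seaborn installed, the same matrix can be drawn as a heatmap:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Pearson correlation only captures linear relationships, so a weak coefficient for a column such as Cohort
Year does not by itself rule out a nonlinear or interaction effect.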
Other visualizations, including matplotlib scatterplots and histograms and seaborn boxplots, revealed key
details about the data. The diagrams highlighted links between Regents diplomas and overall graduation rates.
The boxplot revealed that overall graduation rates were generally around 70%, but there were outliers with
extremely low rates. The histogram indicated that traditional June graduations were the most common and that
six-year graduations were more common than August graduations.
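
For reference, the points a boxplot marks as outliers are, by default in matplotlib and seaborn, the values
lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of the lower-fence
calculation follows, using made-up rates and a hypothetical helper name.

```python
import numpy as np

def low_outliers(values, whis=1.5):
    """Return values below Q1 - whis * IQR, together with that lower fence."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    lower_fence = q1 - whis * (q3 - q1)
    return values[values < lower_fence], lower_fence

# Made-up graduation rates clustered near 70% with two very low values.
rates = [68, 70, 72, 71, 69, 74, 66, 73, 12, 5]
outliers, fence = low_outliers(rates)
```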