adding changes to explain first regression figure, edit to future ethical implications and data description
ReadyResearchers · May 4, 2023 · 9289616 · 9289616
1 parent f8749a0
commit 9289616
Showing 1 changed file with 5 additions and 5 deletions.
diff --git a/thesis.md b/thesis.md
@@ -238,7 +238,7 @@ Within the Hispanic ethnic group, the dispersion of educational attainment by su
 
 The literature points to several reasons for the low rates of educational achievement and attainment among Hispanics, including the distinct roles that citizenship and immigration, language barriers, and cultural norms and values play in determining educational outcomes for this ethnic group [@kao], [@schneider], & [@dyer]. Other factors affect educational outcomes for the Hispanic, Black, and Native American populations alike, including the lack of adequate academic preparation and advisement towards postsecondary education and the greater likelihood of coming from low-income or first-generation households. Being from a low-income household can limit the types of schools accessible to these populations. Schools in low-income neighborhoods tend to be underresourced and understaffed, and are often unable to provide the same rigor of coursework or the additional academic preparation and enrichment offered by more affluent schools. Coming from a low-income or first-generation household can also limit how much families know about the education system, and therefore the resources and guidance they can give their children for college entry.
 
-Citizenship and immigration are probably the factors that impact, and potentially skew, the educational outcomes of the US Hispanic population the most. According to Dyer and Roman-Torres [@dyer], citizenship plays a large role in dermining the likelihood of enrollment into and completion of postsecondary education, especially for the Hispanic population due to the influx of undocumented individuals from this ethnic group. Though it is important to note that the aspirations and completion rates to complete college are not very different between citizens and noncitizens; noncitizens face higher risk of needing to curtail their education due to risk of deportation. Additionally, the difficulties associated with immigration and understanding the American education system, coupled with potential language barriers (pertinent especially for first generation immigrants or children of first generation immigrants) further impact the ability for Hispanics to successfully progress through educational transitions and complete some form of postsecondary education [@schneider]. Language barriers especially impact the formative years of education, where lack of reinforcement of literacy activities at home in non-English speaking households is stifled, which can be especially harmful as the child progress through education system.
+Citizenship and immigration are probably the factors that impact, and potentially skew, the educational outcomes of the US Hispanic population the most. According to Dyer and Roman-Torres [@dyer], citizenship plays a large role in determining the likelihood of enrollment in and completion of postsecondary education, especially for the Hispanic population due to the influx of undocumented individuals from this ethnic group. It is important to note, though, that college aspirations and completion rates are not very different between citizens and noncitizens; rather, noncitizens face a higher risk of needing to curtail their education due to the risk of deportation. Additionally, the difficulties associated with immigration and with understanding the American education system, coupled with potential language barriers (especially pertinent for first-generation immigrants or their children), further impact the ability of Hispanics to progress through educational transitions and complete some form of postsecondary education [@schneider]. Language barriers especially affect the formative years of education, when literacy activities may not be reinforced at home in non-English-speaking households, which can be especially harmful as the child progresses through the education system.
 
 Cultural norms and values uniquely inform Hispanic educational outcomes, especially as it pertains to the strong respect of and connection to family, referred to in relevant scholarship as familismo [@kao] & [@dyer]. As a result of this strong family connection, Hispanics have a higher propensity of living at home and attending two year colleges or altogether not continuing school post high school graduation to pursue jobs to support family [@dyer] & [@krog]. In fact, Hispanics make up the largest proportion of associate degree holders compared to all other racial and ethnic groups [@schneider] & [@krog]. According to Kao & Thompson [@kao], familism is also positively linked to higher academic achievement for the Hispanic population.
 
@@ -278,7 +278,7 @@ The focus of this chapter is to present the process that was followed to complet
 
 ## Data Description
 
-One noteworthy element of this data extract is that even though I had selected the data to encompass the years of 2010-2021, when I was actually able to open up and work with the data extract in RStudio (using the `range()` function), I found that the data set only included data for 2010 until 2015. This is something I would consider a potential limitation of using IPUMS for data, as I was under the impression that more data was being accessed, given that my selected extract should have included 2010-2021. Besides that, I had been under the impression that in creating my data extract, equal or at least representative amounts of data for all of the United States would appear in the extract, which was not the case. The goal of this study is to generate key findings for the entirety of the US, split by year, to attempt to capture how rates of educational attainment (based on other variables) change over time. Due to the aforementioned potential issues with the data, this study may result in findings that are not fully representative of the populations being captured within the data, though it should provide some key insights into the trends present in the US. It is unclear why this data is inaccessible or not included in the extract.
+One noteworthy element of this data extract is that, even though the data were selected to cover the years 2010-2021, when the extract was opened in RStudio and checked with the `range()` function, the data set appeared to include only 2010 through 2015. This could be considered a potential limitation of using IPUMS as a data source, as there may be a risk of receiving incomplete data extracts. In addition, the extract was expected to contain equal, or at least representative, amounts of data for all of the United States, which was not the case, as addressed previously in the Ethical Implications, Data Bias section. The goal of this study is to generate key findings for the entirety of the US, split by year, in order to capture how rates of educational attainment (based on other variables) change over time. Because of these issues, the findings may not be fully representative of the populations captured in the data, though they should still provide some key insights into trends present in the US. It is unclear why the data for the remaining years are inaccessible or not included in the extract.
 
 The raw data extract from IPUMS, before cleaning and transformation, looked much like **Table 3**.
 
@@ -999,7 +999,7 @@ Of the results presented in this section, the most interesting is that of how th
 
 The results of the first regression, Figure 112, show that all of the explanatory variables included in the model are statistically significant, given their p-values. This statistical significance can also be observed by looking for those results followed by an asterisk. Statistical significance refers to evidence of a non-random relationship between the dependent variable and an explanatory variable, and is judged from each explanatory variable's p-value: the probability of observing an estimate at least as extreme as the one obtained if there were in fact no relationship. A variable is considered statistically significant when its p-value falls at or below the chosen significance level, and the confidence level is 1 minus the significance level. Typically, and in this project, the significance level is 5% (0.05) and the confidence level is 95%.
 
-The regression output of the first logistic regression, Figure 112, starts with a call to the model for this regression using the `glm()` method. Then the distribution of the deviance residuals are output, which are residuals used in generalized linear models to measure the difference between the predicted probability and the observed probability of each outcome, and to assess the fit of the model. For each deviance residual output for the minimum, first quarter, median, third quarter, and maximum values, the deviance residuals are minimally different than the predicted probabilities for this regression, so the model proves to be a good fit. Next, the coefficients of each of the individual variables in the model are output, which indicate the effect that each independent variable has on the dependent variable. As the independent variables included in the analysis are representative of...
+The regression output of the first logistic regression, Figure 112, starts with the call used to fit the model with the `glm()` function. Next, the distribution of the deviance residuals is output. Deviance residuals are used in generalized linear models to measure how far each observed outcome is from the probability the model predicts for it, and so help assess the fit of the model. The summary reports their minimum, first quartile, median, third quartile, and maximum values; here these values indicate only small differences between the observed outcomes and the predicted probabilities, which suggests a reasonable fit. Next, the coefficient estimates of the intercept and of each independent variable are output, which indicate the effect that each independent variable has on the dependent variable. Because these estimates are on the log-odds scale, they are not interpreted directly in this project; instead, they are exponentiated to obtain odds ratios, which give a more intuitive account of how each variable changes the odds of attaining a certain outcome. The coefficient estimates nonetheless convey the direction and magnitude of the relationship between the dependent variable and each independent variable. In addition to the coefficient estimates, the output includes the standard errors (measuring the uncertainty of each estimate), z-values (each estimate divided by its standard error, expressing how many standard errors the estimate lies from zero), and p-values (used to judge whether a relationship is statistically significant). Since the p-value of every independent variable is less than or equal to 0.05, all of the variables are statistically significant. This is also indicated by the significance codes attached to each result, denoted by one or more asterisks. The significance codes legend is output to give more context on the strength of the evidence for each variable: the more asterisks that accompany a p-value, the smaller the p-value and the stronger the evidence of a relationship. The codes do not measure the size of a variable's effect on the outcome variable, which is read from the coefficient estimates and odds ratios. In this output, almost all of the variables are significant at the strictest level in the legend (the 0 level), while the mixed_race variable is significant only at the 0.05 level. This indicates that the evidence of a relationship with the EDUC variable is weakest for mixed_race and very strong for all of the other variables. Next, the null and residual deviances are output, which give some context on the goodness of fit of the model and on how well the independent variables explain the variation in the response variable. The null deviance describes a model containing only the intercept, and the residual deviance describes the fitted model with the predictor variables included. Given the values output, the fitted model explains more of the variation in EDUC than the null model. The last two components of the regression summary are the Akaike Information Criterion (AIC) and the number of Fisher Scoring iterations. The AIC is a relative measure, with lower values indicating a better trade-off between fit and complexity when candidate models fit to the same data are compared, so the value in Figure 112 is most informative alongside the AIC of alternative models. The number of Fisher Scoring iterations reports how many steps the fitting algorithm took to converge on the estimates; based on the count in Figure 112, the estimates output by the model are likely to be reliable.
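
The relationship between the log-odds coefficients, odds ratios, and z-values described above can be written out as follows. This is a general sketch of the binary logit, not output from Figure 112; the symbols $p$, $x_j$, and $\beta_j$ and the example coefficient value are illustrative assumptions.

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{j} \beta_j x_j, \qquad \mathrm{OR}_j = e^{\beta_j}, \qquad z_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
$$

For example, a hypothetical coefficient of $\hat{\beta}_j = 0.5$ corresponds to an odds ratio of $e^{0.5} \approx 1.65$: a one-unit increase in $x_j$ multiplies the odds of the outcome by about 1.65, holding the other variables fixed.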
 
 In order to assess the accuracy of the results, a confusion matrix was constructed. Given the result depicted in Figure 105, the first binary logit model was 86.5% accurate.
 
@@ -1043,9 +1043,9 @@ Lastly, improvements can be made to the implementation of the binary logistic re
 
 ## Future Ethical Implications and Recommendations
 
-Future ethical implications of EduAttain include the static nature of the dashboard application as well as the implementation of the confustion matrix. In relation to the nature of EduAttain, as the information presented in the tool uses specific data from specific years, the findings from this project may become outdated in coming years, especially if there are shifts in the rates of educational attainment within the populations studied in this project. A recommendation to resolve this would be to expand the project to include data after 2015.
+Future ethical implications of EduAttain include the static nature of the dashboard application as well as the implementation of the confusion matrix. Regarding the static nature of EduAttain, because the tool presents data from specific years, the findings from this project may become outdated in coming years, especially if rates of educational attainment shift within the populations studied. This presents an ethical concern: the findings are confined to the data the project was developed with, and as time passes, the conclusions drawn may no longer hold, which could spread misinformation about a specific population's rates of educational attainment. A recommendation to resolve this would be to expand the project to include data after 2015, or to allow users to add more data themselves.
 
-Additionally, the confusion matrices used in this project to calculate the accuracy of the regressions may result in inaccurate results due to how they were implemented in the project. The recommended implementation of a confusion matrix involves separating the original data into a training and testing set, using the training set for the the regression and the testing set to project the predicted probabilties. Then the training and testing sets are compared against each other to construct the confusion matrix of true positives and negatives, and false positives and negatives, which are then used to compute the accuracy. Instead of following the recommended implementation, due to issues related to working with a sample size within the regression, the confusion matrix was constructed using the sample data only. To resolve this issues, working on a machine with more computing power should allow for a regression to be run using the entirety of the data, so as to be able to divide up this data into testing and training sets for use in the construction of a regression and confusion matrix.
+Additionally, the confusion matrices used in this project to calculate the accuracy of the regressions may give inaccurate results because of how they were implemented. The recommended implementation of a confusion matrix involves separating the original data into a training set and a testing set, fitting the regression on the training set, and using the fitted model to generate predicted probabilities for the testing set. The predicted outcomes for the testing set are then compared against its observed outcomes to construct the confusion matrix of true positives and negatives and false positives and negatives, which are then used to compute the accuracy. Instead of following the recommended implementation, because the regression was limited to working with a sample of the data, the confusion matrix was constructed using that same sample rather than separate training and testing sets. This presents an ethical concern because the computed accuracy may be unreliable: the regression model could be more or less accurate than the original confusion matrix implementation suggests. To resolve this issue, working on a machine with more computing power should allow a regression to be run on the entirety of the data, so that the data can be divided into training and testing sets for use in constructing the regression and confusion matrix.
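
The recommended train/test procedure described above could look roughly like the sketch below. It is not the EduAttain implementation: the data are simulated, and the names `sim`, `educ_binary`, `age`, and `female`, the 80/20 split, and the 0.5 classification cutoff are all assumptions made for illustration.

```r
# Hypothetical simulated data; NOT the IPUMS extract used in the thesis.
set.seed(42)
n <- 1000
sim <- data.frame(
  age = rnorm(n, mean = 40, sd = 12),
  female = rbinom(n, size = 1, prob = 0.5)
)

# Simulated binary outcome standing in for a recoded EDUC indicator (assumption).
linear_pred <- -2 + 0.04 * sim$age + 0.3 * sim$female
sim$educ_binary <- rbinom(n, size = 1, prob = plogis(linear_pred))

# Hold out 20% of the rows as a testing set; the rest form the training set.
test_idx <- sample(seq_len(n), size = round(0.2 * n))
train <- sim[-test_idx, ]
test <- sim[test_idx, ]

# Fit the binary logit on the training rows only.
fit <- glm(educ_binary ~ age + female,
           family = binomial(link = "logit"),
           data = train)

# Predict probabilities for the unseen testing rows and classify at a 0.5 cutoff.
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- factor(as.integer(test_prob >= 0.5), levels = c(0, 1))
actual <- factor(test$educ_binary, levels = c(0, 1))

# Confusion matrix: predicted classes (rows) against observed classes (columns).
conf_mat <- table(Predicted = test_pred, Actual = actual)

# Accuracy is the share of testing rows on the diagonal (true negatives + true positives).
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Because the accuracy is computed only on rows the model never saw during fitting, it is a less optimistic estimate than one computed on the same sample used for the regression.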
 
 ## Conclusions