diff --git a/abstract.md b/abstract.md index 7b3ce69..6d5a786 100644 --- a/abstract.md +++ b/abstract.md @@ -1,3 +1,3 @@ # Abstract -Over the last fifty years, trends in educational attainment have reflected simultaneous movements towards closing and widening disparities between different identity groups. Studying educational attainment, specifically revolved around studying disparities in education, is vital because of the implications for future work opportunities, financial security, and resource access. **EduAttain** identifies and investigates the role certain demographic factors play as determinants of educational attainment, namely, sex, race, and Hispanic ethnicity. Leveraging data from *IPUMS*, and using *R*, *R Shiny*, and *SQLite*, trends in educational attainment across different identity groups are studied through the use of pie charts to display results and draw comparisons displayed on a **[web-based dashboard](https://donizk.shinyapps.io/EduAttain/)**. The statistical relationship between these factors and educational attainment are studied using a *binary logistic regression*, to determine what populations had a higher odds of having a high school diploma or greater. The findings of this project affirm some of the findings presented in the literature, while providing new insight into certain racial and Hispanic ethnic subgroups rates of educational attainment. In general, the highest attaining populations in educational attainment were the White, Non Hispanic, and Female populations, compared to all other respective identity groups. Within the Hispanic ethnic group, the Cuban population maintained the highest level of educational attainment, relative to all other Hispanic ethnic subgroups. Furthermore, these results establish that the *Human Capital Model* fails to consider certain aspects of identity that may greatly influence the level of education an individual attains, outside of the influence of income and financial investments into education. 
\ No newline at end of file +Over the last fifty years, trends in educational attainment have reflected simultaneous movements towards closing and widening disparities between different identity groups. Studying educational attainment, specifically revolved around studying educational inequity, is vital because of the implications for future work opportunities, financial security, and resource access. **EduAttain** identifies and investigates the role certain demographic factors play as determinants of educational attainment, namely, sex, race, and Hispanic ethnicity. Leveraging data from *IPUMS*, and using *R*, *R Shiny*, and *SQLite*, trends in educational attainment across different identity groups are studied through the use of pie charts to display results and draw comparisons displayed on a **[web-based dashboard](https://donizk.shinyapps.io/EduAttain/)**. The statistical relationship between these factors and educational attainment are studied using a *binary logistic regression*, to determine what populations had a higher odds of having a high school diploma or greater. The findings of this project affirm some of the findings presented in the literature, while providing new insight into certain racial and Hispanic ethnic subgroups rates of educational attainment. In general, the highest attaining populations in educational attainment were the White, Non Hispanic, and Female populations, compared to all other respective identity groups. Within the Hispanic ethnic group, the Cuban population maintained the highest level of educational attainment, relative to all other Hispanic ethnic subgroups. Furthermore, these results establish that the *Human Capital Model* fails to consider certain aspects of identity that may greatly influence the level of education an individual attains, outside of the influence of income and financial investments into education. 
diff --git a/config.yaml b/config.yaml index 6dce092..5beacda 100644 --- a/config.yaml +++ b/config.yaml @@ -4,7 +4,7 @@ # Project-specific values title: 'EduAttain: A Statistical Analysis of the Impact of Different Demographic Indicators on Educational Attainment' author: 'Kyrie Doniz' -date: '22 July 2022' +date: '5 May 2023' firstreader: 'Janyl Jumadinova' secondreader: 'Timothy Bianco' logo: 'images/logo' diff --git a/references.bib b/references.bib index 9b6849c..ba40c59 100644 --- a/references.bib +++ b/references.bib @@ -268,4 +268,62 @@ @misc{r publisher={The R Foundation}, url={https://www.r-project.org/about.html}, urldate={2023-03-21} +} +@misc{microdata, + title={What do we mean by microdata?}, + publisher={The World Bank}, + url={https://datahelpdesk.worldbank.org/knowledgebase/articles/228873-what-do-we-mean-by-microdata}, + urldate={2023-05-03} +} +@misc{source, + title={About IPUMS CPS}, + publisher={IPUMS CPS}, + url={https://cps.ipums.org/cps/about.shtml}, + urldate={2023-05-03} +} +@misc{survey, + title={The Use of Self-Report Data in Psychology}, + publisher={Verywell Mind}, + url={https://www.verywellmind.com/definition-of-self-report-425267}, + urldate={2023-05-03} +} +@misc{cps, + title={Frequently Asked Questions}, + publisher={United States Census Bureau}, + url={https://www.census.gov/programs-surveys/cps/about/faqs.html#:~:text=About%2059%2C000%20households%20are%20selected,of%20other%20addresses%20and%20people.}, + urldate={2023-05-03} +} +@misc{nyt, + title={Census Miscounted the Population of 14 States, a Review Shows}, + author={Wines, M.}, + publisher={The New York Times}, + url={https://www.nytimes.com/2022/05/19/us/2020-census-miscount-states.html}, + urldate={2023-05-03} +} +@misc{brook, + title={Why census undercounts are problematic for political representation}, + author={Sanchez, G.R.}, + publisher={Brookings}, + 
url={https://www.brookings.edu/blog/how-we-rise/2022/03/28/why-census-undercounts-are-problematic-for-political-representation/}, + urldate={2023-05-03} +} +@misc{npr, + title={The 2020 census had big undercounts of Black people, Latinos, and Native Americans}, + author={Lo Wang, H.}, + publisher={National Public Radio}, + url={https://www.npr.org/2022/03/10/1083732104/2020-census-accuracy-undercount-overcount-data-quality}, + urldate={2023-05-03} +} +@misc{pew, + title={Key facts about the quality of the 2020 census}, + author={Cohn, D. and Passel, J.S.}, + publisher={Pew Research Center}, + url={https://www.pewresearch.org/short-reads/2022/06/08/key-facts-about-the-quality-of-the-2020-census/}, + urldate={2023-05-03} +} +@misc{jli, + title={Data Equity: What Is It, and Why Does It Matter?}, + publisher={JLI Consulting}, + url={https://www.jliconsultinghawaii.com/blog/2020/7/10/data-equity-what-is-it-and-why-does-it-matter}, + urldate={2023-05-03} } \ No newline at end of file diff --git a/thesis.md b/thesis.md index 88ab88a..ecce55b 100644 --- a/thesis.md +++ b/thesis.md @@ -20,9 +20,9 @@ This chapter introduces the existing research behind **EduAttain**, and detail t The US Census Bureau defines educational attainment as the highest level of education that an individual completes. The study of disparities in educational attainment is paramount because the level of education a person attains is oftentimes directly linked to the professional, personal, financial, and social opportunities available to a person [@holtz-eakin]. Conversely, a person with a lower level of educational attainment may not have access to job opportunities that require a high level of education. Or in contrast, a person with a high level of education may have access to opportunities and resources resulting from a higher social or financial status. 
Additionally, individuals with a bachelor's degree tend to earn higher wages, have longer life expectancies, and increased intergenerational mobility than those that do not [@dyer]. -An individual’s educational attainment is likely to be impacted by differences in race, gender, and ethnicity [@darling-hammond],[@buch], & [@dyer]. Such demographic factors contribute to an individual’s personal expectations, limitations to access, and accessibility to informational resources, which informs the decision to pursue greater educational opportunities, such as a postsecondary education. The main goal of this research is to determine to what degree these demographic factors play a role in determining an individual’s level of educational attainment. +An individual’s educational attainment is likely to be impacted by differences in race, gender, and ethnicity [@darling-hammond],[@buch], & [@dyer]. Such demographic factors contribute to an individual’s personal expectations, limitations to access, and accessibility to informational resources, which informs the decision to pursue greater educational opportunities, such as a post-secondary education. The main goal of this research is to determine to what degree these demographic factors play a role in determining an individual’s level of educational attainment. -Research shows that differences in educational attainment by gender have varied greatly over the last century, as initially the trends appeared to be favorable to men but by 1982, women lead in rates of enrollment and graduation from high school and college [@buch], [@jaco], [@dipre], & [@gamo]. These gendered differences in educational attainment have various causes including gender expectations towards receiving a higher education, accessibility to education, opportunities available post-graduation, and personal aspirations in completing some form of postsecondary education [@buch], [@jaco], & [@dipre]. 
The gender disparities in educational attainment also intermingle with racial and ethnic disparities as there are different historical trends between each racial and ethnic group, which determines how the dispersion of rates of educational attainment plays out differently for each group [@dipre]. +Research shows that differences in educational attainment by gender have varied greatly over the last century, as initially the trends appeared to be favorable to men but by 1982, women lead in rates of enrollment and graduation from high school and college [@buch], [@jaco], [@dipre], & [@gamo]. These gendered differences in educational attainment have various causes including gender expectations towards receiving a higher education, accessibility to education, opportunities available post-graduation, and personal aspirations in completing some form of post-secondary education [@buch], [@jaco], & [@dipre]. The gender disparities in educational attainment also intermingle with racial and ethnic disparities as there are different historical trends between each racial and ethnic group, which determines how the dispersion of rates of educational attainment plays out differently for each group [@dipre]. Race also plays a role in determining the level of education a person completes, or at least it has for the better half of the last century [@dipre], [@darling-hammond], [@gamo]. This results from the opportunities available to different racial groups to access education and the ease of accessing it. Historical events and trends have also exacerbated the racial disparities in educational attainment such as the lingering effects of slavery and segregation [@dipre], [@darling-hammond], [@gamo]. @@ -34,7 +34,7 @@ There is a robust literature surrounding educational attainment that captures th Much of the focus in the existing literature concerns differences between white and black individuals. 
This neglects the disparities and differences in educational attainment that might be observed in other racial and ethnic groups. One of the goals of this study and subsequent tool is to analyze differences across different racial and ethnic groups, with special interest in observing differences in those racial and ethnic groups that literature has failed to cover in detail. -To my knowledge, there is no research that analyzes the combined impact of race, ethnicity, and gender on educational attainment and the degree to which interactions between these factors can influence whether a person will recieve a certain level of educational attainment. This gap in research is particularly concerning given the impact on these intersections of identity on educational outcomes. +To my knowledge, there is no research that analyzes the combined impact of race, ethnicity, and gender on educational attainment and the degree to which interactions between these factors can influence whether a person will receive a certain level of educational attainment. This gap in research is particularly concerning given the impact on these intersections of identity on educational outcomes. None of the studies referenced in the development of this project employed the use of regression to study the effects of different factors on educational attainment. One of the goals of this project is to use a regression model to capture how race, Hispanic ethnicity, and gender explain differential trends in educational attainment. This research shed light on the strength of the relationship between each of the factors. It could also help determine which of these factors plays a bigger role as a determinant of educational attainment. 
 @@ -60,21 +60,23 @@ Across all survey years, the female population had a slightly higher proportion ## Ethical Implications -There are a few ethical issues that need to be addressed within this work involving the security and reliability of the data used to analyze the trends of educational attainment across different groups. Issues related to the data collection and source that could also impact the results include that the data only represents attainable and usable samples, as well, as, issues related to the use of microdata. +The ethical issues that need to be addressed within this work revolve around the security and reliability of the data used to analyze the trends of educational attainment across different groups. Specifically, there are pressing issues related to the source and collection of the data that could also impact the findings of this project, including that the data only represents attainable and usable samples, as well as issues related to the use of microdata, that is, unit-level data obtained from the Current Population Survey administered by the US Bureau of the Census on a monthly basis to select households [@microdata], [@source]. The issues related to the sourcing of the data, the nature of the data collected, and the accuracy of the data may also carry ethical implications for the findings presented in this project, and specifically for the communities that are reflected in it. ### Information Accuracy and Data Collection Issues -To address the first issue of reliability, the data itself is collected from a survey, meaning that responses are self-reported. This can be an issue if there is no internal mechanism, within the Census' collecting of the data, validating the answers given to the survey. This could mean that certain answers could have been given in order to seem better than actuality. 
In the case of this study, there could have been someone who lied about their highest level of educational attainment or about their level of income in order to seem more or less well-off. Also on the note of survey data, there may also be issues in data collection that may be present. Issues such as incomplete surveys or filling out of data extracts (on IPUMS' behalf), may result in untrustworthy or inclusive statistics and results. +To address the first issue of reliability, the data itself is collected from a survey, meaning that responses are self-reported. This can be an issue if there is no internal mechanism, within the Census' collection of the data, that validates the answers given to the survey. Certain answers could have been given in order to seem better than actuality or could reflect differences in the interpretation of survey questions [@survey]. In the case of this study, someone could have misreported their highest level of educational attainment or their level of income either intentionally--in order to seem more or less well-off--or unintentionally, by misunderstanding the prompt given. On the topic of survey data, there may also be issues in data collection, including inconsistencies in the data available between each year, which can result in untrustworthy or inconclusive statistics and results. In fact, IPUMS CPS does disclose that the source data files from the U.S. Census Bureau are relatively inconvenient to use and hard to interpret as a result of the inconsistencies in data collection over the years, but this is resolved through IPUMS's system of harmonizing--or identically coding--related data points from year to year [@source]. 
-Additionally, there seems to be a potential issue in data collection or extraction, as there was a discrepancy between the data extract created for this project using IPUMS' extract tool and the actual data that ended up being usable for the project. The original intent of this project was to study data from the last 10 years in order to capture how educational attainment has changed over time, with help from the literature. But upon opening and working with the IPUMS data extract, which should have included years 2010-2021, it became clear that only data from 2010-2015 was accessible and usable. It is not clear if this is an issue resulting in incomplete data or with the data extract tool, but either way this could still be an issue future researchers may encounter upon using IPUMS as a data source and extraction tool. +Additionally, there seems to be a potential issue in data collection or extraction, as there was a discrepancy between the data extract created for this project using IPUMS' extract tool and the actual data that ended up being usable for the project. The original intent of this project was to study data from the last 10 years in order to capture how educational attainment has changed over time, with help from the literature. But upon opening and working with the IPUMS data extract, which should have included years 2010-2021, it became clear that only data from 2010-2015 was accessible and usable. After some preliminary research, it is not clear if this is an issue resulting in incomplete data or with the data extract tool itself, but either way this issue is worth mentioning as future researchers may encounter it upon using IPUMS as a data source and extraction tool. ### Third Party Risk -As the data for this project was extracted and downloaded from a third-party tool, IPUMS' online data extract creator, there may be risks with using a service like this. 
As aforementioned, there may be discrepancies with the data in that more data is expected in an extract than is actually given. There could also be issues in the accessibility of using this data, as an account is required to access the data. Additionally, there is the risk of just having to trust the reliability of the data provided by the third-party source. +As the data for this project was extracted and downloaded from a third-party tool, IPUMS' online data extract creator, there may be risks with using a service like this. As aforementioned, there may be discrepancies with the data in that more data is expected in an extract than is actually given. There is also the issue of the accessibility of using this data, as an account is required to actually create an extract from and access the data. Additionally, there is the risk of just having to trust the reliability and validity of the data provided by the third-party source. ### Data Bias -In the initial use of the data, it became apparent that there may also be some data bias within the dataset. When observations are counted by state, a varying amount of individuals are represented in each state, but never fully representing the populations actually present in those states. These counts remain constant year to year, which seems to indicate that the same sample of people was captured each year by the survey. The fact that this data only represents samples may bring up issues in the reliability of the data, in that it is unclear if there was any bias in the collection of the data or if the data may be incorrectly skewed towards representing one group over another. An example of this issue is present in the **Tables 1-2** below. If you compare the results of 2010 and 2014, you can see that there are ranging amounts of people represented between both samples by each state, though these sample sizes stay the same across different years. 
The fact that these population counts are relatively constant across each year brings into question what the selection criteria to be considered for the survey was, and what was considered complete data that could be put into IPUMS. This issue may also be a result of a *Census undercount*, where the Census incorrectly counted for certain populations. As a result of this potential bias in the data, the findings produced in this project may be misrepresentative of true circumstances. +With the initial use of the data extract, it became apparent that there were some very large inconsistencies present within the data. When observations were counted by state, a varying number of individuals was represented in each state, but never the full population actually present in that state. These counts remain constant year to year, which seems to indicate that the same sample of people, or at least the same number of observations, was captured each year by the survey. The fact that this data only represents samples may bring up issues in the reliability of the data, in that it is unclear if there was any bias in the collection of the data or if the data may be incorrectly skewed towards representing one group over another. An example of this issue is present in **Tables 1-2** below. If you compare the results of 2010 and 2014, you can see that the number of people represented varies from state to state, while the sample size for each state stays the same across different years. The fact that these population counts are relatively constant across each year brings into question what the selection criteria for the survey were, and what was considered complete data that could be put into IPUMS. This is something that is disclosed by IPUMS, in that the Current Population Survey is distributed monthly to around 65,000 households at a time, according to the About IPUMS CPS page [@source]. 
It is unclear what the specific selection criteria for the survey are, but it is a voluntary survey, meaning that households could opt out of taking the survey altogether, which could also affect the results of this project [@cps]. + +This issue of data accuracy may also be a result of a *Census undercount*, where the Census incorrectly counted certain populations. This is a well-documented issue, as the Census undercounted, in both 2010 and 2020, the Hispanic, Black, and Native American populations, which can greatly impact the political and social representation for these communities [@nyt], [@brook], [@npr], & [@pew]. As a result of this potential bias in the data, the findings produced in this project may be misrepresentative of true circumstances. Table: 2010 Sample Population Count by State @@ -188,11 +190,15 @@ Table: 2014 Sample Population Count by State |50 |MS |1927 | |51 |MT |1810 | -Issues with data entry could also be present in the data, either on behalf of the US Census and Bureau of Labor Statistics, who administer the Current Population Survey or on behalf of IPUMS, where microdata from the CPS is harmonized. This could present additional issues related to data bias because if data is entered in incorrectly, then populations may be missrepresented in analysis. This does not necessarily seem to be big issue as of current, but it is a potential issue that can arise as a result of the data selected for this study. +Issues with data entry could also be present in the data, either on behalf of the US Census and Bureau of Labor Statistics, who administer the Current Population Survey, or on behalf of IPUMS, where microdata from the CPS is harmonized. This could present additional issues related to data bias because if data is entered incorrectly, then populations may be misrepresented in analysis. 
This does not currently seem to be a big issue, but it is a potential issue that can arise as a result of the data selected for this study. ### Issues in Equity -The last but probably the most important ethical issue that may arise as a result of this project includes the potential issues in equity, that the project generally looks to address. Because this project is rooted in looking for trends in educational attainment, focusing in on gendered, racial, and ethnic disparities in attainment, the results of this project may have real implications for the groups represented. With that, there should be a great deal of consideration taken when looking at and comparing trends across groups, as well as, when introducing and talking about educational trajectories. This is in order to avoid an unneccessary harm or unreliable characterizations or representations of certain identity groups. +The last and most pressing ethical issue that may arise as a result of this project concerns the potential issues in equity that the project generally looks to address. Because this project is rooted in looking into trends in educational attainment, focusing on gendered, racial, and ethnic disparities in education, the results of this project may have real implications for the groups represented. In order to avoid unnecessary harm and unreliable characterizations or representations of certain identity groups, a great deal of consideration will be taken when looking at and comparing trends across groups, as well as when introducing and talking about educational trajectories. + +Taking steps to ensure data equity within this project is paramount to the core of the project in attempting to address inequities. In every step of development, the nature of the data and its source, as well as the information being conveyed through that data, were continuously considered through the lens of ensuring that the findings being presented are accurate. 
As this project tackles drawing conclusions about the educational trends of specific marginalized populations, outlining each step in the process of development is an additional step taken within the body of this work to transparently showcase how populations were observed and how conclusions were drawn. + +Additionally, the findings of this project are not stand-alone. That is, each conclusion drawn from the descriptive statistics and statistical analysis of the data is compared to findings from previously published works and statistics. The inclusion of related works not only provides the necessary background knowledge of the educational trends of each population considered within the scope of this project, but also allows for an internal validation mechanism to confirm that the findings presented from the EduAttain tool itself are relatively in line with the literature. # Related Work @@ -991,7 +997,9 @@ Of the results presented in this section, the most interesting is that of how th ![First Regression Summary](images/reg1.jpg) -The results of the first regression, Figure 104, show that all of the explanatory variables included in the model are statistically significant, given their p-values. This statistical significance can also be observed by looking for those results followed by an asterisk. +The results of the first regression, Figure 112, show that all of the explanatory variables included in the model are statistically significant, given their p-values. This statistical significance can also be observed by looking for those results followed by an asterisk. Statistical significance indicates that the estimated relationship between an explanatory variable and the dependent variable is unlikely to be due to chance alone. It is judged from the p-value of each explanatory variable, which is the probability of observing an estimate at least as extreme as the one obtained if the variable truly had no effect. 
Each p-value is compared against the significance level, and a variable is treated as statistically significant when its p-value falls below that level; the confidence level is 1 minus the significance level. Typically and in this project, the confidence level is 95% and the significance level is 5% or 0.05. + +The regression output of the first logistic regression, Figure 112, starts with the call to the model for this regression, made with the `glm()` function. Next, the distribution of the deviance residuals is output; these residuals are used in generalized linear models to measure the difference between the predicted probability and the observed outcome for each observation, and to assess the fit of the model. Across the minimum, first quartile, median, third quartile, and maximum values, the deviance residuals indicate only small differences between predicted and observed values, so the model appears to be a good fit. Next, the coefficients of each of the individual variables in the model are output, which indicate the effect that each independent variable has on the dependent variable. As the independent variables included in the analysis are representative of... In order to look at the accuracy of the results, a confusion matrix of values was constructed in order to compute the accuracy of the model. Given the accuracy result depicted in Figure 105, the accuracy of the first binary logit's results were 86.5% accurate.
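
The accuracy computation described above can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the function names, the 0.5 classification threshold, and the eight observations are all hypothetical and are not taken from the EduAttain data. It only shows how predicted probabilities from a binary logit are turned into a confusion matrix and how accuracy is read off that matrix.

```python
def confusion_counts(observed, predicted_prob, threshold=0.5):
    """Return (tn, fp, fn, tp) for binary outcomes coded 0/1.

    A predicted probability at or above `threshold` is classified as 1.
    The 0.5 default is an assumption, not a value taken from the thesis.
    """
    tn = fp = fn = tp = 0
    for y, p in zip(observed, predicted_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy_from_confusion(tn, fp, fn, tp):
    """Share of observations whose predicted class matches the observed class."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical example: 1 = high school diploma or greater, 0 = otherwise.
observed = [1, 1, 0, 1, 0, 1, 1, 0]
predicted_prob = [0.9, 0.8, 0.3, 0.4, 0.6, 0.7, 0.95, 0.2]

counts = confusion_counts(observed, predicted_prob)
accuracy = accuracy_from_confusion(*counts)
print(counts, accuracy)
```

With these made-up values, six of the eight observations are classified correctly. Accuracy alone can look high when one outcome dominates the sample, so it is usually worth reading the individual cells of the matrix as well.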