final

ReadyResearchers · May 5, 2023 · 5514746
1 parent 59aeb45
commit 5514746

Showing 2 changed files with 36 additions and 36 deletions.
diff --git a/README.md b/README.md
@@ -53,80 +53,80 @@ baseline requirements. Please note that these are only baseline requirements and
 it is expected that an exceptional senior thesis will exceed these requirements.
 
 **General Thesis Requirements**:
-  - [ ] The abstract provides a concise and compelling summary of the research
-  - [ ] The thesis was submitted on time as a PDF in a tagged release on GitHub
-  - [ ] The GitHub repository of the thesis contains evidence of many commits
-  - [ ] The GitHub repository of the thesis contains multiple releases using the
+  - [X] The abstract provides a concise and compelling summary of the research
+  - [X] The thesis was submitted on time as a PDF in a tagged release on GitHub
+  - [X] The GitHub repository of the thesis contains evidence of many commits
+  - [X] The GitHub repository of the thesis contains multiple releases using the
     [Semantic Versioning Standard](https://semver.org/)
-  - [ ] In adherence to the [Semantic Versioning Standard](https://semver.org/),
+  - [X] In adherence to the [Semantic Versioning Standard](https://semver.org/),
     the GitHub repository of the thesis contains a release greater than `1.0.0`
     for the work in CMPSC 600 and a release greater than `2.0.0` for CMPSC 610
-  - [ ] The thesis has the correct format created through the use of Pandoc and
+  - [X] The thesis has the correct format created through the use of Pandoc and
     LaTeX and the senior thesis template for the Department of Computer Science
-  - [ ] The title of the thesis is both interesting and appropriate
-  - [ ] The thesis includes at least twelve references
-  - [ ] Unless there is a convincing reason to require otherwise, each chapter
+  - [X] The title of the thesis is both interesting and appropriate
+  - [X] The thesis includes at least twelve references
+  - [X] Unless there is a convincing reason to require otherwise, each chapter
     in the senior thesis should contain at least ten to twenty pages of
     contents formatting in the required style
-  - [ ] The thesis consists of at least `7500` words
-  - [ ] The thesis follows a logical flow at the level of chapters, sections,
+  - [X] The thesis consists of at least `7500` words
+  - [X] The thesis follows a logical flow at the level of chapters, sections,
     subsections, and individual paragraphs
-  - [ ] The thesis includes appropriate visual aids, which fall under the broad
+  - [X] The thesis includes appropriate visual aids, which fall under the broad
     categories of:
   * `image`
   * `figure`
   * `table`
   * `graph`
-  - [ ] The thesis contains a sufficient amount of content with a focus on
+  - [X] The thesis contains a sufficient amount of content with a focus on
     scientific, technical, engineering, and/or mathematical content
-  - [ ] The thesis highlights and explains the societal impacts and ethical
+  - [X] The thesis highlights and explains the societal impacts and ethical
     implications of the completed research
-  - [ ] There are no typographical or grammatical errors in the thesis
-  - [ ] There is no extraneous text in the thesis
+  - [X] There are no typographical or grammatical errors in the thesis
+  - [X] There is no extraneous text in the thesis
 
 **Introduction Section Requirements**
-  - [ ] The introduction section clearly describes the completed work
-  - [ ] The introduction section motivates the completed work from a
+  - [X] The introduction section clearly describes the completed work
+  - [X] The introduction section motivates the completed work from a
     professional perspective focused on science, technology, engineering,
     mathematics, and societal implications
-  - [ ] The introduction section outlines the ethical implications of the thesis
+  - [X] The introduction section outlines the ethical implications of the thesis
 
 **Related Work Section Requirements**
-  - [ ] The related work section references and describes relevant literature
-  - [ ] The related work section explains how relevant literature connects to the thesis
-  - [ ] The related work section does not provide a "laundry list" of the related literature
-  - [ ] The related work section situates the completed project in the broader scope
+  - [X] The related work section references and describes relevant literature
+  - [X] The related work section explains how relevant literature connects to the thesis
+  - [X] The related work section does not provide a "laundry list" of the related literature
+  - [X] The related work section situates the completed project in the broader scope
 
 **Method Section Requirements**
-  - [ ] The method section explains the process utilized in the completed study
-  - [ ] The method section addresses as many of the following which are
+  - [X] The method section explains the process utilized in the completed study
+  - [X] The method section addresses as many of the following which are
     applicable (minimum `1`):
   * `description of algorithms`
   * `programming languages`
   * `libraries`
   * `platforms`
   * `software tools`
   * `hardware`
-  - [ ] The method section references the GitHub repository that contains the
+  - [X] The method section references the GitHub repository that contains the
     implementation of the project's computational artifact(s)
-  - [ ] The method section gives examples of the input and output of the
+  - [X] The method section gives examples of the input and output of the
     project's computational artifact(s) and, when appropriate, explains how to
     run the computational artifact (note that the `README.md` file of the GitHub
     repository that contains the computational artifact(s) should furnish
     complete details about the input, output, behavior, and use of the project)
 
 **Experimental Results Section Requirements**
-  - [ ] The experimental results section includes a description of experiments
+  - [X] The experimental results section includes a description of experiments
     such that a reader should be able to reproduce them
-  - [ ] The evaluation subsection describes how the work is validated
-  - [ ] The evaluation subsection contains at least one graph, table of data, or
+  - [X] The evaluation subsection describes how the work is validated
+  - [X] The evaluation subsection contains at least one graph, table of data, or
     some other relevant presentation of the results from the experimental study
-  - [ ] The experimental results section details threats to validity
+  - [X] The experimental results section details threats to validity
 
 **Discussion and Future Work Section Requirements**
-  - [ ] The discussion and future work section discusses the impact of the completed research project
-  - [ ] The discussion and future work section critically reflects on the completed research project
-  - [ ] The conclusion outlines, with sufficient depth and detail, avenues for further and/or future work
+  - [X] The discussion and future work section discusses the impact of the completed research project
+  - [X] The discussion and future work section critically reflects on the completed research project
+  - [X] The conclusion outlines, with sufficient depth and detail, avenues for further and/or future work
 
 ## Explanation
 

diff --git a/thesis.md b/thesis.md
@@ -293,7 +293,7 @@ Table: Raw Data
 | 2.00912E+13 | 2010 | ...      | 71   | 34814   |
 | 2.00912E+13 | 2010 | ...      | 50   | 34814   |
 
-In order to analyze the data properly, each variable was considered in relation to how it needs to be used in analysis. In the case of *CPSIDP*, the nature of the variable is to serve as an identifier variable for each person in the sample, made using a combination of the survey year, the unique identifier assigned each person from every household (captured in the data), and the survey month. This variable will not be considered in analysis because of the nature of the variable and because of the several instances of blank values in the data extract. Additionally, the *SERIAL, YEAR, PERNUM, BPL, INCTOT*, and *MONTH* variables are not be considered in the analysis. The *SEX* variable is a binary variable, taking in values of either 1 or 2, with 1 representing males and 2 representing females. The *STATEFIP* variable represents a qualitative nominal variable, which is one that is seperated into levels of no particular order, and specifies entries by state with a numerical code. EDUC is an ordinal variable, as the entries are sorted into numerical codes, each representing a level of education, in order.  For use in analysis, however, the *EDUC* variable is recoded into a binary variable that takes in a value of 0, indicating an educational attainment at or below some high school participation,  or 1, indicating an educational attainment at or above a high school diploma (or equivalent). *RACE* and *HISPAN* are also qualitative nominal variables, where each level of identification of race, and Hispanic ethnicity is assigned to a numerical code, in no particular order. Also for the purposes of use in statistical analysis, the RACE and HISPAN variables are recoded to create binary variables for each identity group captured within the variables. The full list of numerical code assignments is available on the IPUMS CPS official website. Although recommended by IPUMS, WTFINL and ASECWT are not be used for analysis, as there are some missing values present.
+In order to analyze the data properly, each variable was considered in relation to how it needs to be used in analysis. In the case of *CPSIDP*, the nature of the variable is to serve as an identifier variable for each person in the sample, made using a combination of the survey year, the unique identifier assigned each person from every household (captured in the data), and the survey month. This variable will not be considered in analysis because of the nature of the variable and because of the several instances of blank values in the data extract. Additionally, the *SERIAL, YEAR, PERNUM, BPL, INCTOT*, and *MONTH* variables are not be considered in the analysis. The *SEX* variable is a binary variable, taking in values of either 1 or 2, with 1 representing males and 2 representing females. The *STATEFIP* variable represents a qualitative nominal variable, which is one that is separated into levels of no particular order, and specifies entries by state with a numerical code. EDUC is an ordinal variable, as the entries are sorted into numerical codes, each representing a level of education, in order.  For use in analysis, however, the *EDUC* variable is recoded into a binary variable that takes in a value of 0, indicating an educational attainment at or below some high school participation,  or 1, indicating an educational attainment at or above a high school diploma (or equivalent). *RACE* and *HISPAN* are also qualitative nominal variables, where each level of identification of race, and Hispanic ethnicity is assigned to a numerical code, in no particular order. Also for the purposes of use in statistical analysis, the RACE and HISPAN variables are recoded to create binary variables for each identity group captured within the variables. The full list of numerical code assignments is available on the IPUMS CPS official website. Although recommended by IPUMS, WTFINL and ASECWT are not be used for analysis, as there are some missing values present.
 
 ## **Descriptive Statistics**
 
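
The recoding described in the changed paragraph of this hunk (an ordinal *EDUC* code collapsed to a 0/1 flag, and nominal codes such as *RACE* expanded into one binary variable per group) can be sketched as follows. This is a minimal Python sketch, not the thesis's implementation: the threshold `HS_DIPLOMA_CODE`, the helper names `recode_educ` and `one_hot`, and the sample values are all assumptions made for illustration, and the real code assignments should be taken from the IPUMS CPS codebook.

```python
# Hypothetical threshold: codes at or above this value are treated as
# "high school diploma (or equivalent) or higher". Verify the value
# against the IPUMS CPS codebook before relying on it.
HS_DIPLOMA_CODE = 73


def recode_educ(educ_code, threshold=HS_DIPLOMA_CODE):
    """Return 1 for a diploma or higher, 0 for anything below it."""
    return 1 if educ_code >= threshold else 0


def one_hot(values):
    """Build one 0/1 indicator column per distinct nominal code."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Made-up sample codes, used only to exercise the helpers.
educ = [71, 50, 73, 111]
race = [100, 200, 100, 651]

educ_binary = [recode_educ(code) for code in educ]
race_indicators = one_hot(race)

print(educ_binary)
print(race_indicators[100])
```

Keeping the threshold as a named parameter makes the cut point between "some high school" and "diploma or equivalent" explicit, so it is easy to audit or change.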
@@ -1105,7 +1105,7 @@ Lastly, improvements can be made to the implementation of the binary logistic re
 
 Future ethical implications of EduAttain include the static nature of the dashboard application as well as the implementation of the confustion matrix. In relation to the nature of EduAttain, as the information presented in the tool uses specific data from specific years, the findings from this project may become outdated in coming years, especially if there are shifts in the rates of educational attainment within the populations studied in this project. This presents an ethical concern as the findings presented are confined to the data that the project was developed with and as time passes, the conclusions drawn may not always hold to be true, causing misinformation about a specific population's rates of educational attainment. A recommendation to resolve this would be to expand the project to include data after 2015 or to allow the ability for the inclusion of more data by a user.
 
-Additionally, the confusion matrices used in this project to calculate the accuracy of the regressions may result in inaccurate results due to how they were implemented in the project. The recommended implementation of a confusion matrix involves separating the original data into a training and testing set, using the training set for the the regression and the testing set to project the predicted probabilties. Then the training and testing sets are compared against each other to construct the confusion matrix of true positives and negatives, and false positives and negatives, which are then used to compute the accuracy. Instead of following the recommended implementation, due to being limited to working with a sample size within the regression, the confusion matrix was constructed using the same sample size, instead of separating data into train and test sets. This presents an ethical concern due to the computed accuracy's potential for unreliability, meaning that the regression model constructed could be more or less accurate than what was actually projected in the original implementation of the confusion matrix accuracy. To resolve this issue, working on a machine with more computing power should allow for a regression to be run using the entirety of the data, so as to be able to divide up this data into testing and training sets for use in the construction of a regression and confusion matrix.
+Additionally, the confusion matrices used in this project to calculate the accuracy of the regressions may result in inaccurate results due to how they were implemented in the project. The recommended implementation of a confusion matrix involves separating the original data into a training and testing set, using the training set for the regression and the testing set to project the predicted probabilties. Then the training and testing sets are compared against each other to construct the confusion matrix of true positives and negatives, and false positives and negatives, which are then used to compute the accuracy. Instead of following the recommended implementation, due to being limited to working with a sample size within the regression, the confusion matrix was constructed using the same sample size, instead of separating data into train and test sets. This presents an ethical concern due to the computed accuracy's potential for unreliability, meaning that the regression model constructed could be more or less accurate than what was actually projected in the original implementation of the confusion matrix accuracy. To resolve this issue, working on a machine with more computing power should allow for a regression to be run using the entirety of the data, so as to be able to divide up this data into testing and training sets for use in the construction of a regression and confusion matrix.
 
 ## Conclusions
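
The hold-out evaluation that the changed paragraph in this hunk recommends can be sketched as follows: fit the regression on a training split, predict on the held-out rows, and compare those predictions with the held-out rows' observed outcomes to fill the confusion matrix. This is a minimal sketch under stated assumptions, not the project's code. It uses synthetic data, a one-feature logistic regression fitted by plain gradient descent, and hypothetical helper names (`train_test_split`, `fit_logistic`, `confusion_matrix`, `accuracy`), with a 0.5 probability cutoff chosen for illustration.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, lr=0.1, epochs=500):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / len(rows)
        b -= lr * grad_b / len(rows)
    return w, b


def confusion_matrix(model, rows, cutoff=0.5):
    """Count true/false positives and negatives on the given rows."""
    w, b = model
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for x, y in rows:
        predicted = 1 if sigmoid(w * x + b) >= cutoff else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts


def accuracy(counts):
    """Share of held-out rows that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Synthetic (x, y) rows; y is more likely to be 1 when x is large.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(-3, 3)
    rows.append((x, 1 if rng.random() < sigmoid(2 * x) else 0))

train, test = train_test_split(rows)
model = fit_logistic(train)           # fit on the training rows only
counts = confusion_matrix(model, test)  # evaluate on rows the model never saw
print(counts, accuracy(counts))
```

Because the model never sees the test rows during fitting, the resulting accuracy estimates performance on new data rather than on the sample used to build the regression.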