-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmylab6.Rmd
283 lines (200 loc) · 9.34 KB
/
mylab6.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
title: "Geog533 Lab 6 - ANOVA"
author: "Winnie Ngare"
output:
html_document:
df_print: paged
toc: yes
html_notebook:
toc: yes
toc_float: yes
---
Complete the following exercises in Chapter 6 (Analysis of Variance) of the textbook pages 199-203. For each question, you need to specify the null hypothesis and why you accept or reject the null hypothesis.
## Question 1
This is Exercise 2 in Chapter 6 of the Textbook [R].
### Problem
Assume that an analysis of variance is conducted for a study where there are $N = 50$ observations and $k = 5$ categories. Fill in the blanks in the following ANOVA table:
| | Sum of squares | Degrees of freedom | Mean square | *F* |
|----------|-----------------|--------------------|----------------|-----------------|
| Between | | | 116.3 | |
| Within | 2000 | | | |
| Total | | | | |
### Solution
```{r}
N <- 50
K <- 5
df2 <- N-K
BSS <- 116.3*4
mws <- 2000/45
mws
ratio <- 116.3/mws
ratio
total <- BSS+2000
total
f.critical <- qf(0.95,df1 = 4,df2 = 45)
f.critical
if(ratio>f.critical){print("we reject the null hypothesis")
}else{print("we do not reject the null hypothesis")}
print("the means are significantly different")
```
| | Sum of squares | Degrees of freedom | Mean square | *F* |
|----------|-----------------|--------------------|----------------|-----------------|
| Between |`r 116.3*(K-1)` | `r K-1` | 116.3 | `r ratio` |
| Within | 2000 | `r N-K` | `r mws` | |
| Total | `r total` | `r N-1` | | |
## Question 2
This is Exercise 6 in Chapter 6 of the Textbook [R].
### Problem
Is there a significant difference between the distances moved by low- and high-income individuals? Twelve respondents in each of the income categories are interviewed, with the following results for the distances associated with residential moves:
| Respondent | Low income | High income |
|-------------|-------------|-------------|
| 1 | 5 | 25 |
| 2 | 7 | 24 |
| 3 | 9 | 8 |
| 4 | 11 | 2 |
| 5 | 13 | 11 |
| 6 | 8 | 10 |
| 7 | 10 | 10 |
| 8 | 34 | 66 |
| 9 | 17 | 113 |
| 10 | 50 | 1 |
| 11 | 17 | 3 |
| 12 | 25 | 5 |
| Mean | 17.17 | 23.17 |
| Std. dev. | 13.25 | 33.45 |
Test the null hypothesis of homogeneity of variances by forming the ratio $s_1^2 / s_2^2$ which has an F-ratio with $n_1 â 1$ and $n_2 â 1$ degrees of freedom. Then use ANOVA (with $\alpha = 0.10$) to test whether there are differences in the two population means. Set up the null and alternative hypotheses, choose a value of α and a test statistic, and test the null hypothesis. What assumption of the test is likely not satisfied?
### Solution
```{r message=FALSE, warning=FALSE}
low <- c(5,7,9,11,13,8,10,34,17,50,17,25)
high <- c(25,24,8,2,11,10,10,66,113,1,3,5)
groups <- c(rep("low",12),rep("high",12))
income <- c(low,high)
df <- data.frame(income,groups)
###aov test
ftest <- aov(income~groups,data =df)
ftest
fcritical <- qf(0.90,df1 = 1,df2 = 22)
fcritical
summary(ftest)
print("we accept the null hypothesis")
###homogeneity test
library(reshape2)
library(car)
dt <- melt(df)
leveneTest(income~groups,dt)
print("The variances are significantly equal.")
###normality test
shapiro.test(low)
shapiro.test(high)
print("p-value is less than 0.05, The normality test has not been satsified.")
```
## Question 3
This is Exercise 9 in Chapter 6 of the Textbook [R].
### Problem
A sample is taken of incomes in three neighborhoods, yielding the following data:
| | A | B | C | Overall (Combined sample) |
|----------|-----------------|--------------------|----------------|---------------------------|
| N | 12 | 10 | 8 | 30 |
| Mean | 43.2 | 34.3 | 27.2 | 35.97 |
| Std. dev.| 36.2 | 20.3 | 21.4 | 29.2 |
Use analysis of variance (with α = 0.05) to test the null hypothesis that the means are equal.
### Solution
```{r}
library(MASS)
a <- mvrnorm(n=12,mu=43.2,Sigma = 36.2^2,empirical = T)
b <- mvrnorm(n=10,mu=34.3,Sigma = 20.3^2,empirical = T)
c<- mvrnorm(n=8,mu=27.2,Sigma = 21.4^2,empirical = T)
groups <- c(rep("a",12),rep("b",10),rep("c",8))
incomes <- c(a,b,c)
df <- as.data.frame(incomes,groups)
f <- aov(incomes~groups,data=df)
f
summary.aov(f)
fcritical <- qf(0.95,df1 = 2,df2 = 27)
fcritical
print("Accept the null hypothesis:Therefore the means are significantly equal.")
```
## Question 4
This is Exercise 10 in Chapter 6 of the Textbook [R].
### Problem
Use the KruskalâWallis test (with α = 0.05) to determine whether you should reject the null hypothesis that the means of the four columns of data are equal:
| Col 1 | Col 2 | Col 3 | Col 4 |
|----------|-----------------|--------------------|----------------|
| 23.1 | 43.1 | 56.5 | 10002.3 |
| 13.3 | 10.2 | 32.1 | 54.4 |
| 15.6 | 16.2 | 43.3 | 8.7 |
| 1.2 | 0.2 | 24.4 | 54.4 |
### Solution
```{r}
col1 <- c(23.1,13.3,15.6,1.2)
col2 <- c(43.1,10.2,16.2,0.2)
col3 <- c(56.5,32.1,43.3,24.4)
col4 <- c(10002.3,54.4,8.7,54.4)
group <- c(rep("col1",4),rep("col2",4),rep("col3",4),rep("col4",4))
columns <- c(col1,col2,col3,col4)
df <- data.frame(columns, group)
x <- kruskal.test(columns~group,data = df)
x
if(x$p.value>0.05){print("we do not reject the null hypothesis")
}else{print("we reject the null hypothesis")}
print("Therefore the means of the 4 columns are significantly equal")
```
## Question 5
This is Exercise 12 in Chapter 6 of the Textbook [R].
### Problem
A researcher wishes to know whether distance traveled to work varies by income. Eleven individuals in each of three income groups are surveyed. The resulting data are as follows (in commuting miles, one-way):
```{r message=FALSE, warning=FALSE}
## This is the script to generate the table. Do not write your answer inside in this block.
Observations <- seq(1:11)
Low <- c(5,4,1,2,3,10,6,6,4,12,11)
Medium <- c(10,10,8,6,5,3,16,20,7,3,2)
High <- c(8,11,15,19,21,7,7,4,3,17,18)
df<- data.frame(Observations,Low,Medium,High)
library(knitr)
kable(df)
```
Use analysis of variance (with α = 0.05) to test the hypothesis that commuting distances do not vary by income. Also evaluate (using R and the Levene test) the assumption of homoscedasticity. Finally, lump all of the data together and produce a histogram, and comment on whether the assumption of normality appears to be satisfied.
### Solution
```{r}
groups <- c(rep("Low",11),rep("Medium",11),rep("High",11))
income <- c(Low,Medium,High)
ftest <- aov(income~groups,df)
summary(ftest)
q.test <- qf(0.95,df1 = 2,df2 = 30)
q.test
print("Reject the null hypothesis;Therefore commuting distances vary by income.")
###homoscedasticity
leveneTest(income~groups,df)
print("Variances are significantly equal.")
###normality test using a histogram.
shapiro.test(income)
hist(income)
print("Assumption of Normality has not been satsified.")
```
## Question 6
This is Exercise 13 in Chapter 6 of the Textbook [R].
### Problem
Data are collected on automobile ownership by surveying residents in central cities, suburbs and rural areas. The results are:
| | Central cities | Suburbs | Rural areas |
|----------------------|-----------------|--------------------|----------------|
|Number of observations| 10 | 15 | 15 |
| mean | 1.5 | 2.6 | 1.2 |
| Std. dev | 1.0 | 1.1 | 1.2 |
|Overall mean: 1.725 | | | |
|Overall std.dev: 1.2 | | | |
Test the null hypothesis that the means are equal in all three areas.
### Solution
```{r}
c <- mvrnorm(n=10,mu=1.5,Sigma = 1,empirical = T)
s <- mvrnorm(n=15,mu=2.6,Sigma = 1.1^2,empirical = T)
r <- mvrnorm(n=15,mu=1.2,Sigma = 1.2^2,empirical = T)
groups <- c(rep("c",10),rep("s",15),rep("r",15))
resident <- c(c,s,r)
df <- data.frame(resident,groups)
test <- aov(resident~groups,df)
test
summary(test)
fcritical <- qf(0.95,df1 = 2,df2 = 37)
fcritical
print("Reject the null hypothesis:Therefore the means in the areas are significantly different.")
```