forked from gexijin/learnR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
05-Analyzing-heart-attach-data-set-1.Rmd
462 lines (341 loc) · 27.9 KB
/
05-Analyzing-heart-attach-data-set-1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
# The heart attack data set (I)
The heart attack data set (http://statland.org/R/RC/tables4R.htm), included in the ActivStats^1^ CD, contains all 12,844 cases of hospital discharges of the patients admitted for heart attack but did not have surgery in New York State in 1993. This information is essential for the interpretation of our results, as this is a purely **observational study**. It is not a random sample or **controlled experiment**.
The data set is formatted as a table (Table \@ref(tab:4-01)) with rows representing cases and columns representing characteristics, which is a typical format for many datasets. If you download and open the file in NotePad++ or Excel, you can see that the columns are separated by tabs, and there are **missing values** noted by “NA”. Four columns (DIAGNOSIS, SEX, DRG, DIED) contain **nominal** values, representing labels of **categories**. See an excellent explanation of data types [here](http://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/). DIAGNOSIS column contains codes defined in the International Classification of Diseases (IDC), 9th Edition. This is the code that your doctor sends to your insurance company for billing. The numbers, such as 41041, actually code for which part of the heart was affected. Although these are numbers, it does not make any sense to add or subtract or compute the mean. If I have a room of 30 students each with student ids, such as 732324, it does not make any sense to compute the average of these numbers. Such **categorical data needs to be recognized as factors in R**. Similarly, DRG column has three possible numbers, 121 for survivors with cardiovascular complications, 122 for survivors without complications, and 123 for patients who died. Moreover, DIED also codes for prognosis, it is 1 if the patient passed away, and 0 if survived.
```{r }
URL <- "https://raw.githubusercontent.com/gexijin/learnR/master/datasets/heartatk4R.txt"
heartatk4R <- read.table(URL,
header = TRUE,
sep = "\t",
colClasses = c("character", "factor", "factor", "factor",
"factor", "numeric", "numeric", "numeric"))
```
```{r 4-01, echo=FALSE}
knitr::kable(
head(heartatk4R[, 1:8], 15),
booktabs = TRUE,
caption = 'First 15 rows of the heart attack dataset')
```
Take a look at this dataset in Excel, and consider these questions. What type of people are more likely to suffer from heart attacks? Which patient is more likely to survive a heart attack? Suppose you have a friend who was just admitted to hospital for heart attack. She is a 65 years old with DIAGNOSIS code of 41081. What is the odds that she survives without complication? Also, consider yourself as a CEO of an insurance company, and you want to know what types of patients incur more charges and whether a particular subgroup of people, such as men, or people over 70, should pay a higher premium.
To answer these questions, we need to do is:
• Import data files into R
• Exploratory data analysis (EDA)
• Statistical modeling (regression)
```{r }
x <- heartatk4R # Make a copy of the data for manipulation, call it x.
str(x) # structure of data object, data types for each column
```
## Begin your analysis by examining each column separately
If you are single and meet someone in a bar, you typically start with small talks and learn some basic information about him/her. We should do the same thing with data. But too often, we go right to the business of building models or testing hypothesis without exploring our data.
As a first step, we are going to examine each column separately. This can be very basic things such as mean, median, ranges, distributions, and normality. This is important because sometimes the data is so skewed or far from normal distribution that we need to use **non-parametric tests**, or **transformate the raw data** using log transformation, or more generally box-cox transformation, before conducting other analyses.
>
```{exercise}
>
Perform the following analysis for the heartatk4R dataset. If you forgot the R commands, refer to our previous learning materials. You may also find these commands faster by asking Dr. Google.
>
a. Graphical EDA: Plot distribution of **charges** using box plot, histogram, qqplot, lag plot, sequence plot. And interpret your results in PLAIN English. Note that there are missing values in this column that may cause some problems for some plots. You can remove missing values by defining a new variable by running **temp = CHARGES [ ! is.na (CHARGES) ]** and then run your plot on **temp**.
b. Quantitative EDA: test of normality, and confidence interval. Note that if the Shapiro-Wilk normality test cannot handle the 12,000 data points, you can either try to find other tests in the nortest library or sample randomly by running **temp = sample( CHARGES, 4000)**
```
You can attach your data set if you want to refer to the columns directly by name, such as LOS instead of x$LOS.
```{r}
attach(x)
```
For categorical columns, we want to know how many different levels, and their frequencies. In addition to quantitative analysis, we also use various charts.
For categorical values such as SEX and DIAGNOSIS, we can produce **pie charts and bar plots**, or percentages using the **table( )** function followed by **pie( )** and **barplot()**.
```{r fig.keep='none'}
barplot(table(DIAGNOSIS))
```
This generates a bar plot of counts. This basic plot could be further refined:
(ref:4-1) Barplot by percentage.
```{r 4-1, fig.cap='(ref:4-1)', fig.align='center'}
counts <- sort(table(DIAGNOSIS), decreasing = TRUE) # tabulate&sort
percentages <- 100 * counts / length(DIAGNOSIS) # convert to %
barplot(percentages, las = 3, ylab = "Percentage", col = "green") # Figure 5.1
```
Note that the “las = 3”, changes the orientation of the labels to vertical. Try plot without it or set it to 2. Of course you can do all these in one line.
```{r eval=FALSE}
barplot(100* sort(table(DIAGNOSIS), decreasing=T) / length(DIAGNOSIS), las = 3, ylab = "Percentage", col = "green")
```
(ref:14-2) Pie chart of patients by SEX.
```{r 14-2, message=FALSE, out.width='50%', fig.cap='(ref:14-2)', fig.align='center'}
table(SEX) # tabulate the frequencies of M and F
pie(table(SEX)) # pie chart
```
>
```{exercise}
Compute the count and percentage of each level of DRG by filling the blanks below. Use bar plot and pie chart similar to Figure \@ref(fig:4-1) and Figure \@ref(fig:14-2) to visualize. Briefly discuss your results.
>
drg <- __________ #Import values of DRG
>
counts <- sort(________(drg), decreasing = TRUE) # tabulate & sort
>
percentages <- 100 * __________ # Compute the percentage and convert to %
>
barplot(percentages, ylab = "Percentage", col = "blue") # barplot
>
_________(counts) # pie chart
```
## Possible correlation between two numeric columns?
This is done using various measures of correlation coefficients such as Pearson’s correlation coefficients (PPC), which is given by
$$r=∑_{i=1}^{n}(\frac{x_i-\bar{x}}{s_x}) (\frac{y_i-\bar{y}}{s_y})$$
where $x_i$ and $y_i$ are the $i^{th}$ values, $\bar{x}$ and $\bar{y}$ are sample means, and $s_x$ and $s_y$ are sample standard deviations.
Note that Pearson’s correlation ranges from -1 to 1, with -1 and 1 indicating perfect negative and possitive correlation respectively. **Negative correlations are just as important and informative as positive ones**.
(ref:4-2) Interpretation of Pearson's correlation coefficient. The numbers are Pearson’s correlation coefficient r.
```{r 4-2, echo=FALSE, out.width='80%' ,fig.cap='(ref:4-2)', fig.align='center'}
knitr::include_graphics("images/img0402_coefficient.png")
```
Figure \@ref(fig:4-2) shows some examples of Pearson’s correlation with many scatter plots. The second row of figures shows examples with X-Y plots with different slopes but Pearson’s correlation are all 1. Pearson’s correlation only indicates degree of correlation, and is independent of slope. The figures in the 3rd row show that Pearson’s correlation coefficient’s limitation: it cannot detect nonlinear correlation.
Table \@ref(tab:4-02) below gives some guideline on how to interpret Pearson’s correlation coefficient.
```{r echo=FALSE, message=FALSE}
Correlation <- c("-", "Small", "Medium", "Large")
Negative <- c("-0.09 to 0.0", "-0.3 to -0.1", "-0.5 to -0.3", "-1.0 to -0.5")
Positive <- c("0.0 to 0.09", "0.1 to 0.3", "0.3 to 0.5", "0.5 to 1.0")
dat <- data.frame(Correlation, Negative, Positive)
```
```{r 4-02, echo=FALSE}
knitr::kable(
data.frame(dat),
booktabs = TRUE,
caption = 'Interpretation of correlation coefficient.'
)
```
There is a small, but statistically significant correlation between age and length of stay in the hospital after a heart attack. **The plain English interpretation (the explanation that could be understood by your grandmother) is this**: Older people tend to stay slightly longer in the hospital after a heart attack.
```{r}
cor.test(AGE, LOS)
```
Note that the correlation coefficient r and the p value measure two different things. r indicates the size of effect, while p value tells us statistical significance. Based on the statistic sample, p value tells how certain we are about the difference being real, namely not due to random fluctuation. If we have a large sample, we could detect very small correlation with significance. Conversely, if we only have a few observations, a large r could have large p value, hence not significant. More generally, we need to distinguish effect size and significance in statistical analyses.
```{r echo=FALSE, out.width='80%', fig.align='center'}
knitr::include_graphics("images/img0400_sample.png")
```
Like many commonly-used parametric statistical methods which rely on means and standard deviations, the Pearson’s correlation coefficient is not robust, meaning its value are sensitive to outliers and can be misleading. It is also very sensitive to distribution.
**Non-parametric** approaches typically rank original data and do calculations on the ranks instead of raw data. They are often more robust. The only drawback might be loss of sensitivity. There are corresponding non-parametric versions for most of the parametric tests.
**Spearman’s rank correlation coefficient $\rho$** is a non-parametric measure of correlation. The Spearman correlation coefficient ρ is often thought of as being the Pearson correlation coefficient between the ranked variables. In practice, however, a simpler procedure is normally used to calculate ρ. The n raw scores $X_i$, $Y_i$ are converted to ranks $R_{x_i}$, $R_{y_i}$, and the differences $d_i$ = $R_{x_i}$-$R_{y_i}$ between the ranks of each observation on the two variables are calculated.
If there are no tied ranks, then $\rho$ is given by:$$ρ=1-\frac{6∑d_{i}^{2}}{n(n_{}^{2}-1)}$$
In R, we can calculate Spearman’s $\rho$ and test its significance but customize the cor.test() function:
```{r}
cor.test(AGE, LOS, method = "spearman")
```
Interpretation of Spearman’s $\rho$ is similar to Pearson’s r. The statistical significance can also be determined similarly as demonstrated above. Alternative non-parametric statistic for correlation is Kendall tau rank correlation coefficient.
We already know that we could use scatter plots to visualize correlation between two numeric columns. But when there are many data points, in this case we have over 12,000, it could be hard to comprehend. This is especially the case, when the data is integers and there are a lot of data points overlap on top of each other. Yes, graphics can be misleading.
```{r echo=c(1, 3), fig.show='hold', out.width='50%', fig.cap='Smoothed Scatter plots use colors to code for the density of data points. This is useful when there are overlapping points.', fig.align='center'}
plot(AGE, LOS) # standard scatter plot
text(30, 37, labels = "plot(AGE, LOS)", col = "red")
smoothScatter(AGE, LOS) # a smoothed color density representation of a scatterplot
text(37, 37, labels = "smoothScatter(AGE, LOS)", col = "red")
```
>
```{exercise}
>
Investigate the correlation between length of stay and charges by filling in the blanks below. Remember to include plain English interpretation of your results even your grandpa can understand.
>
LOS <- ______________ # import the values of LOS in the heart attack data
>
CHARGES <- _____________ #import the values of Charges
>
cor.test(LOS, CHARGES, method = ___________) # Pearson correlation
>
cor.test(LOS, CHARGES, method = __________) # spearman correlation
>
plot(LOS, CHARGES) # standard scatter plot
>
__________(LOS, CHARGES) # a smoothed color density representation of a scatterplot
```
## Associations between categorical variables?
There are four columns in the heart attack data set that contain categorical values (DIAGNOSIS, DRG, SEX, and DIED). These columns could be associated with each other. For example, there is a correlation between SEX and DIED. Are men and women equally likely to survive a heart attack?
```{r results='hide', message=FALSE}
counts <- table(SEX, DIED) # tabulates SEX and DIED and generate counts in a two dimension array.
counts
```
```{r echo=FALSE, results='hide'}
Sex <- c("F", "M")
DIED_0 <- c("4298", "7136")
DIED_1 <- c("767", "643")
dat <- data.frame(Sex, DIED_0, DIED_1)
data.frame(dat)
```
```{r 5-01, echo=FALSE}
knitr::kable(
data.frame(dat),
booktabs = TRUE,
caption = 'A 2x2 contingency table summarizing the distribution of DIED frequency by SEX.'
)
```
We got a contingency table as shown in Table \@ref(tab:5-01). To convert into percentages of survived, we can do:
```{r}
counts / rowSums(counts)
```
We can see that 15.1% of females died in the hospital, much higher than the 8.26% for male patients. This gender difference is quite a surprise to me. But could this happen just by chance? To answer this question, we need a statistical test. Chi-square test can be used for testing the correlation of two categorical variables. The null hypothesis is that men and women are equally likely to die from a heart attack.
```{r}
chisq.test(counts)
```
Have you seen this p-value before? Probably! It is the smallest non-zero number R shows for lots of tests. However, p is definitely small! Hence we reject the hypothesis that the mortality rate is the same for men and women. Looking at the data, it is higher for women. The chi-square test for a 2x2 contingency table gives accurate p-values provided that the number of expected observation is greater than 5. If this is not true, then you should use the Fisher Exact test. The chi-square test is an approximation to the Fisher Exact test. The Fisher Exact test is computationally intensive; Karl Pearson developed the chi-square approximation before we had computers to do the work. With fast computers available today, you can use the Fisher Exact test for quite large data sets, and be more confident in the p-values.
You can use the chi-square test for contingency tables that have more than two rows or two columns. For contingency tables that have more than two rows or two columns, the p-value computed by the chi square approximation is reasonably accurate provided that the expected number of observations in every cell is greater than 1, and that no more than 20 percent of the cells have an expected number of observations less than 5. Again, the Fisher Exact test can handle quite large data sets with today’s computers, and avoid the problems with chi-square test.
```{r }
fisher.test(counts) # Fisher’s Exact test
```
In this case, the result of fisher's test is the same as chi-square test.
If you want to make your point to a boss who is either stupid or too busy, you need a chart. Below we show two types of barplots: stacked and side by side.
(ref:5-1) Barplot showing the correlation of two categorical variables. A. Stacked. B. Side by side.
```{r 5-1, echo=c(1, 2, 4), fig.show='hold', out.width='50%', fig.cap='(ref:5-1)', fig.align='center'}
counts <- table(DIED, SEX) # SEX define columns now, as I want the bars to represent M or F.
# Figure A
barplot(counts, legend = rownames(counts), col = rainbow(2), xlab = "DIED", beside = F)
text(0.42, 7500, labels = "A")
# Figure B
barplot(counts, legend = rownames(counts), col = rainbow(2), xlab = "DIED", beside = T)
text(1, 7000, labels = "B")
```
You can add an argument, such as "*args.legend = list(x = "topleft")*", in the function barplot() to change the position of legend. For example, we can move the legend to the top-left of Figure \@ref(fig:5-1) A.
```{r}
barplot(counts, legend = rownames(counts), col = rainbow(2), xlab = "DIED", beside = F,
args.legend = list(x = "topleft"))
```
Another way of showing the proportions is mosaic plot.
(ref:5-2) Mosaic plot of DIED by SEX.
```{r 5-2, fig.width=6, fig.height=4, fig.cap='(ref:5-2)', fig.align='center'}
mosaicplot(table(SEX, DIED), color = T) # Figure 5.6
```
The mosaic plot in Figure \@ref(fig:5-2) is similar to the barplot in Figure \@ref(fig:5-1), but the bars are stretched to the same height, the width is defined by proportion of Male vs. Female. The size of the four blocks in the figure represents the counts of the corresponding combination. Also note that the blocks are also color-coded for different combination. Horizontally, the blocks are divided by SEX, we could observe that there are more men in this dataset than women. Vertically, the blocks are divided by DIED (1 for died in hospital). We could conclude that regardless of gender, only a small proportion of patients died in hospital. Between men and women, we also see that the percentage of women that died in hospital is higher than that in men. This is a rather unusual.
We could use mosaic plots for multiple factors.
(ref:5-3) Mosaic plot of three factors.
```{r 5-3, fig.cap='(ref:5-3)', fig.align='center'}
mosaicplot(table(SEX, DIED, DRG), color = rainbow(3)) # Figure 5.7
```
Here we nested the tabulate command inside the mosaic plot. As shown in Figure \@ref(fig:5-3), we further divided each of the 4 quadrants of Figure \@ref(fig:5-2) into three parts according to DRG codes, in red, green and blue. One thing we could tell is that a smaller proportion of surviving males developed complications, compared with females.
Activity: interpret the mosaic plot of the Titanic dataset(built-in in R).
```{r message=FALSE, eval = FALSE}
? Titanic # this leads you to information about the famous Titanic dataset.
```
```{r message=FALSE}
mosaicplot(~ Sex + Age + Survived, data = Titanic, color = rainbow(2))
```
This is a mosaic plot of the whole Titanic dataset.
```{r}
mosaicplot(Titanic, color = rainbow(2))
```
Think about the questions: Did men and women survive by equal proportion? Did girls and women survive by equal proportion?
>
```{exercise}
>
The DIAGNOSIS column contains IDC codes that specifies the part of the heart that are affected. Use a stacked bar plot and a mosaic plot to compare the difference in frequency of DIAGNOSIS between men and women. Based on your observation, do you think men and women are equal in their frequencies of diagnoses?
>
DIAGNOSIS <- heartatk4R$DIAGNOSIS
>
SEX <- heartatk4R$SEX
>
counts <- _____________(DIAGNOSIS, SEX) # Take SEX as columns
>
barplot(_____________________________________________________)
>
legend("topleft", legend = rownames(counts), fill = rainbow(9), ncol = 3, cex = 0.75)
>
mosaicplot(_____________________________________________)
```
## Associations between a categorical and a numeric variables?
Do women stay longer in the hospital? Does the charges differ for people with different diagnosis (part of the heart affected)? We should know by now how to answer these questions with T-test, and more generally ANOVA, following our examples commands used in our analysis of the Iris flower data set. For data visualization, **boxplot** is the most straight forward way. But beyond boxplot, we can use the ggplot2 package for more detailed examination of distribution of variables in two or more groups.
(ref:new1) Histogram of AGE by SEX using ggplot2 package
```{r new1, fig.width=6, fig.height=4, fig.cap='(ref:new1)', fig.align='center'}
library(ggplot2)
ggplot(heartatk4R, aes(x = AGE, group = SEX,
y = c(..count..[..group.. == 1]/sum(..count..[..group.. == 1]),
..count..[..group.. == 2]/sum(..count..[..group.. == 2])) * 100)) +
geom_histogram(binwidth = 6, colour = "black", fill = "lightblue") +
facet_grid(SEX ~ .) + #split a graph according to panels defined by SEX
labs(y = "Percent of Total")
```
Now the two histograms are arranged, and it is very easy to see that women’s age are more skewed to the right, meaning women are considerably older than men. I am surprised at first by this huge difference, as the average age of women if bigger by 11. Further research shows that for women the symptoms of heart attacks are milder and often go unnoticed.
We can further divide the population according to survival status by adding another factor:
(ref:new2) Histogram of AGE by SEX and DIED using ggplot2 package
```{r new2, warning=FALSE, message=FALSE, fig.cap='(ref:new2)', fig.align='center'}
heartatk4R$GROUP <- paste(DIED, SEX, sep = "-")
ggplot(heartatk4R, aes(x = AGE, group = GROUP,
y = c(..count..[..group.. == 1]/sum(..count..[..group.. == 1]),
..count..[..group.. == 2]/sum(..count..[..group.. == 2]),
..count..[..group.. == 3]/sum(..count..[..group.. == 3]),
..count..[..group.. == 4]/sum(..count..[..group.. == 4])) * 100))+
geom_histogram(binwidth = 5, colour = "black", fill = "lightblue") +
facet_grid(DIED ~ SEX) + #split a graph according to matrix of panels defined by DIED*SEX
labs(y = "Percent of Total")
```
We can see that patients who did not survive heart attack tend to be older, for both men and women. This is perhaps better illustrated with density plot:
(ref:new3) Density plot of AGE by SEX using ggplot2 package
```{r new3, fig.width=6, fig.height=3, fig.cap='(ref:new3)', fig.align='center'}
ggplot(heartatk4R, aes(x = AGE, fill = SEX)) + geom_density(alpha = .3)
```
The result is similar to Figure \@ref(fig:new1), but as a density plot.
Now for each gender, we further divide the patients by their survival status. Instead of splitting into multiple panels, the curves are overlaid.
(ref:new4) Density plot of AGE by SEX and DIED using ggplot2 package
```{r new4, fig.width=6, fig.height=4, fig.cap='(ref:new4)', fig.align='center'}
ggplot(heartatk4R, aes(x = AGE, fill = DIED)) + geom_density(alpha = .3) + facet_grid(SEX ~ .)
```
>
```{exercise}
>
Use the ggplot2 package to compare the distribution of lengths of stay among patients who survived and those who did not. Use both histogram and density plot. Interpret your results.
```
>
```{exercise}
>
Use the ggplot2 package to compare the distribution of lengths of stay among patients who survived and those who did not, but compare men and women separately (similar to Figure \@ref(fig:new4)).
```
>
```{exercise}
>
Use student’s t-test, boxplot, histogram and density plots to compare the age distribution between survived and those who didn’t.
```
>
```{exercise}
>
Use ANOVA, boxplot, and histogram and density plots to compare the charges among people who have different DRG codes.
```
## Associations between multiple columns?
We can use the ggplot2 package to investigate correlations among multiple columns by figures with multiple panels.
(ref:5-6) Multiple boxplots of AGE by DRG and SEX using ggplot2 package
```{r 5-6, fig.width=5, fig.height=3, fig.cap='(ref:5-6)', fig.align='center'}
ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_boxplot(color = "blue") + facet_grid(SEX ~ .)
```
Recall that 121 indicate survivors with complication, 122 survivors with no complication, and 123, died. As you could see this clearly indicate our previous observation people who died in hospital are older than survivors and that patients who developed complications seems to be older than those that did not. Did people with complications stayed longer in the hospital?
>
```{exercise}
>
Are the surviving women younger than the women who died? Similar question can be asked for men. Produce a figure that compares, in a gender-specific way, age distribution between patients who died in the hospital and those who survived.
```
>
```{exercise}
>
Use the ggplot2 package to produce boxplots to compare the length of stage of men vs. women for each of the DRG categories indicating complication status. You should produce a plot similar to Figure 5.13. Offer interpretation.
```
>
```{r echo=FALSE, fig.width=6, fig.height=4, fig.cap='Multiple boxplots using ggplot2 package', fig.align='center'}
ggplot(heartatk4R, aes(x = SEX, y = AGE)) +
geom_boxplot(color = "blue") + coord_flip() + facet_grid(DRG ~ .) +
ggtitle("Multiple boxplots using ggplot2 package") +
theme(plot.title = element_text(hjust = 0.5))
```
```{r fig.width=6, fig.height=4, fig.keep='last', fig.align='center'}
# scatterplot of LOS vs. AGE
ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_point()
# scatterplot of LOS vs. AGE, divided by SEX
ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_point() + facet_grid(SEX ~ .)
# scatterplot colored by DIED
ggplot(heartatk4R, aes(x = AGE, y = LOS, color = DIED)) + geom_point() + facet_grid(SEX ~ .)
```
Here we only show the figure for third code.
Note that ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_point() + facet_grid(SEX ~ .) generates multiple scatterplots of LOS ~ AGE according to different values of SEX, while color = DIED will add these two color-coded scatter plots into the same figure.
(ref:5-7) A scatter plot of LOS vs. AGE, using SEX and DIED as factors.
```{r 5-7, echo=FALSE, fig.cap='(ref:5-7)', fig.align='center'}
# scatterplot of LOS vs. AGE, divided by SEX and DIED
ggplot(heartatk4R, aes(x = AGE, y = LOS)) + geom_point(color = "blue") + facet_grid(SEX ~ DIED)
```
Figure \@ref(fig:5-7) seems to suggest that the positive association between AGE and LOS is noticeable in patients who did not die in hospital, regardless of sex. This is a statistician’s language. Try this instead that could be understood by both the statistician and his/her grandmother. Older patients tend to stay longer in the hospital after surviving a heart attack. This is true for both men and women.
Another way to visualize complex correlation is bubble plot. **Bubble plot** is an extension of scatter plot. It uses an additional dimension of data to determine the size of the symbols. Interesting video using bubble plot: [http://youtu.be/jbkSRLYSojo](http://youtu.be/jbkSRLYSojo)
(ref:5-8) Bubble plot example.
```{r 5-8, fig.cap='(ref:5-8)', fig.align='center'}
y <- x[sample(1:12844, 200), ] # randomly sample 200 patients
plot(y$AGE, y$LOS, cex = y$CHARGES / 6000, col = rainbow(2)[y$SEX], xlab = "AGE", ylab = "LOS")
legend("topleft", levels(y$SEX), col = rainbow(2), pch = 1)
```
Figure \@ref(fig:5-8) is a busy plot. Female patients are shown in red while males in blue. Size of the plot is proportional to charges. So on this plot we are visualizing 4 columns of data!
Other common methods we can use to detect complex correlations and structures include principal component analysis (PCA), Multidimensional scaling (MDS), hierarchical clustering etc.
All of these techniques we introduced so far enable us to **LEARN** about your dataset without any of priori hypothesis, ideas, and judgments. Many companies claim that they want to know their customers first as individuals and then do business. Same thing applies to data mining. You need to know you dataset as it is before making predictions, classifications etc.
You should also **INTERACT** with your data by asking questions based on domain knowledge and common sense. Generates lots and lots of plots to support or reject hypothesis you may have. I demonstrated this by using the heart attack dataset in the last few pages. You should do the same thing when you have a new dataset. Sometimes, the thing that you discovered is more important than the initial objectives.
1. Velleman, P. F.; Data Description Inc. ActivStats, 2000-2001 release.; A.W. Longman,: Glenview, IL, 2001.