Why do we need graphs?
Characteristics of exploratory graphs
Simple Summary of data
Five-number summary
summary(pollution$pm25)
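As a small sketch of what this call returns, the snippet below uses the built-in faithful data set as a stand-in, since the pollution data frame is not defined in these notes (that substitution is an assumption). Note that summary() reports six values (the five-number summary plus the mean), while fivenum() returns exactly five.

```r
# Stand-in data (assumption): the notes use pollution$pm25, which is not available here
waiting <- faithful$waiting

# summary() gives Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s <- summary(waiting)
print(s)

# fivenum() gives Tukey's five numbers: min, lower hinge, median, upper hinge, max
f <- fivenum(waiting)
print(f)
```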
boxplot
boxplot(pollution$pm25, col = "blue")
# We can also overlay features
# This will draw a line y = 12
abline(h = 12)
histogram
hist(pollution$pm25, col = "green")
hist(pollution$pm25, col = "green", breaks = 100)
# Rug: add a strip under the histogram marking the location of each data point
rug(pollution$pm25)
# This will draw a line x = 12, with width 2.
abline(v = 12, lwd = 2)
# This will draw a vertical line at the median of the data.
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)
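The histogram steps above can be tried end to end with a built-in data set. This is only a sketch: faithful$waiting stands in for pollution$pm25, and the number of breaks is an arbitrary choice.

```r
# Stand-in data (assumption): waiting times from the built-in faithful data set
waiting <- faithful$waiting

# hist() draws the plot and invisibly returns the bin breaks and counts
h <- hist(waiting, col = "green", breaks = 20)

# One tick per observation under the bars
rug(waiting)

# Mark the median with a thick vertical line
abline(v = median(waiting), col = "magenta", lwd = 4)
```

Keeping the return value of hist() is optional, but it makes it easy to check how the observations were binned.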
barplot: it is for categorical data.
barplot(table(pollution$region), col = "wheat", main = "Number of Counties in Each Region")
boxplot for 2 dimensional data
boxplot(pm25 ~ region, data = pollution, col = "red")
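A runnable sketch of the same formula interface, assuming the built-in mtcars data set as a stand-in for pollution (mpg plays the role of pm25, cyl the role of region):

```r
# Stand-in data (assumption): mpg by number of cylinders in the built-in mtcars data set
b <- boxplot(mpg ~ cyl, data = mtcars, col = "red",
             xlab = "Cylinders", ylab = "Miles per gallon")

# The returned list holds one column of box statistics per group
print(b$names)
print(b$stats)
```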
histogram for 2 dimensional data
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")
mar is for margin, see the man page of par.
scatterplot
with(pollution, plot(latitude, pm25))
with(pollution, plot(latitude, pm25, col = region))
From the above examples, we can see there are two ways to refer to the data in a data frame: the $ operator, as in pollution$pm25, and with(pollution, ...).
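A minimal sketch comparing the $ operator and with(), using the built-in mtcars data set as a stand-in for pollution (an assumption):

```r
# Stand-in data (assumption): built-in mtcars instead of pollution
by_dollar <- mtcars$mpg          # name the data frame and the column explicitly
by_with   <- with(mtcars, mpg)   # evaluate the expression inside the data frame

# Both approaches return the same vector
identical(by_dollar, by_with)

# with() saves typing when several columns are used in one call
with(mtcars, plot(wt, mpg))
```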
Further resource
Base plot system: the “artist’s palette” model, where pieces are added one by one. First generate the plot with plot or a similar function, then annotate it by adding labels, axes, etc.
Lattice plot system: plots are created all at once in a single function call. Good for conditioning plots and panel plots, and for putting many plots on one screen.
ggplot2 system: it mixes the ideas of both.
Some questions to think about?
There are 2 steps: generate the plot, then annotate it.
plot and hist will launch a graphics device if none is open.
plot has lots of arguments, letting you set title, labels.
Most of them are documented in the par function man pages.
3 base graph commands: plot, hist, boxplot.
pch: plotting character (see the man page for function points for details)
lty: line type
lwd: line width
col: color; can be a number, a string, or hex format. The colors() function gives a vector of colors by name.
xlab: x-axis label
ylab: y-axis label
Use the par() function to specify these parameters and also to read their current values, like par("col"). Don’t forget the double quotes.
las: the orientation of axis labels on the plot
bg: background color
mar: margin, starting from the bottom and going clockwise. The unit is lines of text.
oma: the outer margin
mfrow: number of rows and columns of plots, c(nrow, ncol), filled row-wise.
mfcol: number of rows and columns of plots, c(nrow, ncol), filled column-wise.
plot: make a scatterplot, or another plot depending on the class of the objects being plotted.
lines: add lines to a plot.
points: add points to a plot.
text: add text labels to a plot.
title: add titles.
mtext: m means margin; add text to the margins.
axis: add axis ticks and labels.
legend: add a legend. For lines, specify lty; for plotting characters, specify pch.
The base system is very flexible and offers a high degree of control, but it can be tedious.
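A short sketch of setting, reading, and restoring par() parameters. The built-in faithful data set is used only as filler data (an assumption); the layout and margin values mirror the ones used earlier in these notes.

```r
# Set a 1 x 2 layout and tighter margins; par() returns the previous values
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Read a parameter back by name (note the quotes)
current_layout <- par("mfrow")
print(current_layout)

# Two plots side by side, using the built-in faithful data as filler
hist(faithful$eruptions, col = "green", main = "Eruptions")
hist(faithful$waiting, col = "green", main = "Waiting")

# Restore the previous settings so later plots are not affected
par(old)
```

Saving the old values and restoring them afterwards is a common habit, since par() settings otherwise persist for the life of the device.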
We can use the example function to see the examples of a function, like
example(points)
pch 21 to 25 have the same shapes as the open symbols 1, 0, 5, 2 and 6, but they have a border color (set with the col parameter)
and a fill color (set with the bg parameter).
We can set up the plot without drawing the data by doing
plot(x, y, type = "n")
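# Example data (an assumption; x and y are not defined in these notes)
x <- rnorm(100)
y <- x + rnorm(100)
g <- gl(2, 50, labels = c("Male", "Female"))
plot(x, y, type = "n")
# Use points() to add each group; calling plot() again would start a new plot
points(x[g == "Male"], y[g == "Male"], col = "blue")
points(x[g == "Female"], y[g == "Female"], col = "green")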
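The generate-then-annotate workflow can be sketched in one self-contained block. The simulated data, the regression line, and the legend position are illustrative choices, not something these notes prescribe.

```r
set.seed(1)                      # reproducible, hypothetical example data
x <- rnorm(100)
y <- x + rnorm(100)
g <- gl(2, 50, labels = c("Male", "Female"))

# Step 1: set up axes and labels without drawing any points
plot(x, y, type = "n", xlab = "x", ylab = "y")

# Step 2: add pieces one at a time
points(x[g == "Male"], y[g == "Male"], pch = 21, col = "black", bg = "blue")
points(x[g == "Female"], y[g == "Female"], pch = 21, col = "black", bg = "green")
fit <- lm(y ~ x)
abline(fit, lwd = 2)             # fitted regression line
title(main = "Two groups on one plot")
legend("topleft", legend = levels(g), pch = 21, pt.bg = c("blue", "green"))
```

Using pch = 21 here also shows the border (col) and fill (bg) colors of the filled symbols.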
Screen devices: quartz() on Mac, X11() on Linux, windows() on Windows
?Devices to find all devices
plot, xyplot, qplot will send the plot to the screen device. There is only one screen device on each of the 3 platforms.
pdf(file = "myplot.pdf")
# ... plotting calls go here ...
dev.off()
Vector formats:
Bitmap formats:
dev.cur()
dev.set(integer)
dev.copy(png, file = "myfile.png")
dev.off()
# If you want a PDF instead
dev.copy2pdf(file = "myfile.pdf")
Warning: the copied plot may not look exactly the same as it does on screen.
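A minimal sketch of the file-device workflow: open the device, plot, close the device. It writes to a temporary file (an assumption made so the example leaves nothing in the working directory) and uses the built-in faithful data set.

```r
# Temporary output path, so nothing is left behind in the working directory
pdf_file <- tempfile(fileext = ".pdf")

pdf(file = pdf_file)                  # open a PDF file device; it becomes the active device
active <- names(dev.cur())            # name of the currently active device
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful eruptions")
dev.off()                             # close the device so the file is written

print(active)
print(file.exists(pdf_file))
```

Nothing appears on screen in this workflow; the plot only exists in the file once dev.off() has been called.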
lattice package: xyplot, bwplot, levelplot
grid package, which we seldom use directly
xyplot: scatterplot
bwplot: boxplot
histogram: histograms
stripplot: boxplot with actual points
dotplot: plot dots like “violin strings”
splom: scatterplot matrix; like pairs in the base system
levelplot, contourplot: for plotting image data
xyplot function
xyplot(y ~ x | f * g, data)
library(lattice)
library(datasets)
xyplot(Ozone ~ Wind, data = airquality)
Use transform to change a variable in a data frame
library(lattice)
library(datasets)
xyplot(Ozone ~ Wind, data = airquality)
airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))
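Fundamental difference from the base plot system: lattice functions do not draw directly to the graphics device. They return a trellis object, which is drawn when it is printed.
# Panel functions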
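# Custom panel function: draw the points, then a dashed horizontal line at the median of y
xyplot(y ~ x | f, panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)
    panel.abline(h = median(y), lty = 2)
})
# Custom panel function: draw the points, then a linear regression line
xyplot(y ~ x | f, panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)
    panel.lmline(x, y, col = 2)
})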
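The panel-function templates above use placeholder names x, y and f. A runnable sketch with the built-in airquality data might look like the following; the copy named aq and the na.rm = TRUE argument are additions for this example, since Ozone contains missing values.

```r
library(lattice)
library(datasets)

# Work on a copy so the built-in data set is left untouched
aq <- transform(airquality, Month = factor(Month))

# One panel per month; each panel draws its points plus a dashed median line
p <- xyplot(Ozone ~ Wind | Month, data = aq, layout = c(5, 1),
            panel = function(x, y, ...) {
                panel.xyplot(x, y, ...)
                panel.abline(h = median(y, na.rm = TRUE), lty = 2)
            })

class(p)   # the call returns an object rather than drawing
print(p)   # printing the object is what draws the plot
```

At the console the print step happens automatically, which is why lattice calls appear to plot immediately.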
qplot
qplot hides what goes on underneath
ggplot is the core function and very flexible
# installation
install.packages("ggplot2")
library(ggplot2)
str(mpg)
qplot(displ, hwy, data = mpg)
We map the drv variable to different colors, and the plot is automatically labeled.
qplot(displ, hwy, data = mpg, color = drv)
We can add a smooth line here. Note that we want 2 geometric objects: the data points themselves and a smooth line.
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))
Make a histogram by specifying just a single variable. Note that here we need to use the
fill argument to specify colors.
qplot(hwy, data = mpg, fill = drv)
Facets take the form rows ~ columns; use . when there is no variable on that side.
qplot(displ, hwy, data = mpg, facets = . ~ drv)
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
qplot(log(eno), data = maacs, geom = "density", color = mopos)
shape argument
method
# separate by shape
qplot(log(eno), log(pm25), data = maacs, shape = mopos)
# separate by color
qplot(log(eno), log(pm25), data = maacs, color = mopos)
# Adding linear regression model smooth line
qplot(log(eno), log(pm25), data = maacs, color = mopos, geom = c("point", "smooth"), method = "lm")
# separate by facets argument
qplot(log(eno), log(pm25), data = maacs, facets = . ~ mopos, geom = c("point", "smooth"), method = "lm")
ggplot2 full power
ggplot part 3
qplot(logpm25, NocturnalSympt, data = maacs, facets = . ~ bmicat, geom = c("point", "smooth"), method = "lm")
# Initial call to ggplot, specify dataframe, x, y
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
# Add objects to plot using +
p <- g + geom_point()
print(p)
# Can add smooth line
p <- g + geom_point() + geom_smooth()
p <- g + geom_point() + geom_smooth(method = "lm")
# Then add facets
# The labels are from the variable
# It's better to make sure to label data properly
p <- p + facet_grid(. ~ bmicat)
xlab, ylab, labs, ggtitle
theme(legend.position = "none")
theme_gray(), theme_bw()
geom_point(color = "steelblue", alpha = 1/2, size = 4)
# To color points by a variable, wrap the mapping in aes(); each level of the
# factor variable then gets its own color
geom_point(aes(color = bmicat), alpha = 1/2, size = 4)
# Add labels and title
+ labs(title = "MAACS Cohort")
+ labs(x = expression("log " * PM[2.5]), y = "Nocturnal Symptoms")
# Modify smooth line; se = FALSE turns off the confidence interval
+ geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)
# Change the background and font
+ theme_bw(base_family = "Times")
ggplot2 part 5
Sometimes we may not want to look at the outliers and only want to focus on the typical data
# If we do this, ggplot will subset the data to the given range, so the outlier is excluded
g <- ggplot(testdat, aes(x, y))
g + geom_line() + ylim(-3, 3)
# To keep all the data and only zoom the axis, we might instead do
g + geom_line() + coord_cartesian(ylim = c(-3, 3))
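The difference between the two approaches can be checked by inspecting the data each plot would draw. This is a sketch under assumptions: testdat is not defined in these notes, so a hypothetical series with a single outlier is simulated here, and layer_data() from ggplot2 is used to look at the built layer.

```r
library(ggplot2)

# Hypothetical test data: a noisy series with one large outlier at position 50
set.seed(42)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, 2] <- 100

g <- ggplot(testdat, aes(x, y)) + geom_line()

p_clip <- g + ylim(-3, 3)                        # values outside the limits are dropped
p_zoom <- g + coord_cartesian(ylim = c(-3, 3))   # data kept, only the view is zoomed

# Inspect the data each plot will actually draw
clipped <- layer_data(p_clip)
zoomed  <- layer_data(p_zoom)
sum(is.na(clipped$y))   # out-of-range values were replaced by NA
sum(is.na(zoomed$y))    # nothing was removed
```

With ylim() the line simply skips the outlier, while coord_cartesian() still draws the line heading toward it and off the top of the panel.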
We want to look at NO2 and BMI, but NO2 is a continuous variable. We can use the cut()
function to make it a categorical variable.
# Calculate the tertiles of the data (4 cutpoints give 3 intervals)
cutpoints <- quantile(maacs$logno2_new, seq(0, 1, length = 4), na.rm = TRUE)
# Cut the data at the tertiles and create a new factor variable
maacs$no2dec <- cut(maacs$logno2_new, cutpoints)
# See the levels of the new factor variable
levels(maacs$no2dec)
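A self-contained sketch of the quantile() and cut() step, using wind speed from the built-in airquality data set as a stand-in for maacs$logno2_new (an assumption, since the maacs data is not available here). It also shows one detail worth knowing: by default the lowest interval is open on the left, so the minimum value is not assigned to any bin.

```r
# Stand-in data (assumption): wind speed from the built-in airquality data set
wind <- airquality$Wind

# Four cutpoints (0%, 33%, 67%, 100%) define three intervals
cutpoints <- quantile(wind, seq(0, 1, length = 4), na.rm = TRUE)

# Default intervals are open on the left, so the minimum value gets NA
default_bins <- cut(wind, cutpoints)
sum(is.na(default_bins))

# include.lowest = TRUE closes the first interval and keeps the minimum
fixed_bins <- cut(wind, cutpoints, include.lowest = TRUE)
levels(fixed_bins)
table(fixed_bins)
```

Whether to pass include.lowest = TRUE depends on whether dropping the minimum observation matters for the analysis.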
# The real plotting
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
# Keep each + at the end of a line so R knows the expression continues
g + geom_point(alpha = 1/3) +
    facet_wrap(bmicat ~ no2dec, nrow = 3, ncol = 4) +
    geom_smooth(method = "lm", col = "steelblue", se = FALSE) +
    theme_bw(base_family = "Avenir", base_size = 10) +
    labs(x = expression("log " * PM[2.5])) +
    labs(y = "Nocturnal Symptoms") +
    labs(title = "MAACS Cohort")
ggplot