Explore Two Variables Using R

This article continues Exploratory Data Analysis in R ― One Variable, where we discussed EDA of the pseudo-Facebook dataset.

In this article, we will learn about data aggregation, conditional means and scatter plots, using the pseudo-Facebook dataset curated by Udacity.

Now we will look at two continuous variables at the same time. A scatter plot is one of the best ways to examine the relationship between two variables. Let's draw a scatter plot of friend count against age for all users.

qplot(age,friend_count,data=pf)

ggplot(aes(x=age,y=friend_count),data=pf)+ geom_point()

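qplot() draws a scatter plot by default when it is given two continuous variables. With ggplot(), we ask for a scatter plot explicitly by adding geom_point().

Plot 1 Scatter Plot ― Friend Count Vs Age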

From the above plot, the following two observations are quite evident:

Some people under the age of 30 have thousands of friends.

There are vertical lines of points at age 69 and close to age 100, which seem incorrect.

We know that the minimum age required to hold a Facebook account is 13 years, and 90 seems a pretty big age, so let's restrict our x-axis from 13 to 90.

ggplot(aes(x=age,y=friend_count),data=pf)+ geom_point()+ xlim(13,90)

Plot 2 Scatter Plot ― Age Vs Friend Count (x axis restricted)
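One detail worth knowing here: in ggplot2, xlim() sets the limits of the x scale, and rows that fall outside those limits are dropped from the plot (with a warning) rather than merely hidden. The base R sketch below counts how many rows that would be. It uses a tiny made-up data frame named pf_demo, which is a hypothetical stand-in and not the real dataset.

```r
# Hypothetical stand-in for pf; the ages and counts are invented for illustration.
pf_demo <- data.frame(
  age          = c(13, 17, 25, 69, 90, 103, 108, 113),
  friend_count = c(50, 400, 120, 30, 10, 2000, 5, 900)
)

# Rows whose age lies inside the limits used with xlim(13, 90).
in_range <- pf_demo$age >= 13 & pf_demo$age <= 90

n_kept    <- sum(in_range)   # rows that would still be drawn
n_dropped <- sum(!in_range)  # rows that would be removed from the plot

cat("kept:", n_kept, "dropped:", n_dropped, "\n")
```

Running the same check on the real pf data frame shows how many users the restricted plot leaves out.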

Plot 2 is better than Plot 1, but the lower part is overcrowded, which makes it difficult to judge how many points lie in that area.

The alpha aesthetic of geom_point can be used to set the transparency of the points, which helps in the overcrowded area of the plot.

For our scenario, let's set an alpha value of 1/20, which means it takes roughly 20 overlapping points to produce one fully black dot.

ggplot(aes(x=age,y=friend_count),data=pf)+ geom_point(alpha=1/20)+ xlim(13,90)

Plot 3 Scatter Plot ― Age Vs Friend Count ― Alpha Aesthetics

Next, let's add jitter to our plot to spread out the points that are overplotted.

Jittering refers to adding a small amount of random noise to the data.
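To make the idea concrete, here is a minimal base R sketch of jittering on a handful of made-up integer ages. It uses base::jitter() rather than ggplot2 itself, and the amount of 0.4 is an illustrative choice (ggplot2 documents a default jitter width of 40% of the resolution of the data, which for whole-number ages comes to 0.4).

```r
# Integer ages, as in the dataset, pile up in vertical columns.
# These particular values are invented for illustration.
ages <- c(13L, 13L, 13L, 14L, 14L, 15L)

set.seed(42)  # only to make the illustration reproducible

# base::jitter() with a positive `amount` adds uniform noise in [-amount, amount].
jittered_ages <- jitter(ages, amount = 0.4)

print(round(jittered_ages, 2))

# Largest distance any point moved from its original integer age.
max_shift <- max(abs(jittered_ages - ages))
```

Because every point moves by at most 0.4, each user still sits closest to their true age, but identical ages no longer stack on the same vertical line.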

ggplot(aes(x=age,y=friend_count),data=pf)+ geom_jitter(alpha=1/20)+ xlim(13,90)

Plot 4 Scatter Plot ― Age Vs Friend Count ― Jittering

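From Plot 4, it can be inferred that the majority of users below 25 years of age have fewer than 1,000 friends. This contradicts our first observation from Plot 1, which shows the importance of EDA.

Next, let's use coord_trans to apply a square-root transformation to the y-axis, which spreads out the crowded low friend counts. The second call below also adds jitter, with h=0 so that only the ages are jittered.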

ggplot(aes(x=age,y=friend_count),data=pf)+ geom_point(alpha=1/20)+ xlim(13,90)+ coord_trans(y = "sqrt")

ggplot(aes(x=age,y=friend_count),data=pf)+ geom_point(alpha=1/20,position=position_jitter(h=0))+ xlim(13,90)+ coord_trans(y = "sqrt")

Plot 5 Scatter Plot ― Age Vs Friend Count ― Use coord_trans

With this plot, it is much easier to see the distribution of friend count conditional on age.
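The second coord_trans call above passes position_jitter(h=0), which keeps the vertical jitter at zero. A likely reason (this explanation is an assumption, not something the text states) is that vertical noise can push a friend count of 0 below zero, and the square root of a negative number is undefined. A small base R sketch with made-up counts and hand-picked noise values:

```r
# Hypothetical friend counts, including users with zero friends.
friend_count <- c(0, 0, 3, 100)

# Hand-picked vertical noise, standing in for what random jitter might add.
noise <- c(-0.3, 0.2, -0.1, 0.4)

jittered <- friend_count + noise

# sqrt() of a negative number yields NaN (with a warning), so that point
# cannot be placed on a square-root axis.
transformed <- suppressWarnings(sqrt(jittered))
print(transformed)

# With no vertical noise (the h = 0 case) every value transforms cleanly.
transformed_h0 <- sqrt(friend_count)
print(transformed_h0)
```

Only the first value, which was pushed below zero, becomes NaN. Jittering the ages alone avoids the problem while still separating the columns of points.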

So far, our visualizations show every point in the dataset. Still, it is not possible to read important quantities such as the mean and median from such a display. Sometimes we want to understand how the mean or median of one variable varies with another variable. Let's create a table that gives the mean and median friend count for each age. For this we will use dplyr. A basic tutorial for dplyr can be found here and here. The same table can be created using base R functions such as lapply, tapply, and split.

age_groups<-group_by(pf,age)
pf.fc_by_age<-summarise(age_groups,
  friend_count_mean=mean(friend_count),
  friend_count_median=median(friend_count),
  n=n())
pf.fc_by_age<-arrange(pf.fc_by_age,age)
head(pf.fc_by_age)

The above code produces a tibble, and the output is shown below:

# A tibble: 6 x 4
    age friend_count_mean friend_count_median     n
  <int>             <dbl>               <dbl> <int>
1    13              165.                 74    484
2    14              251.                132   1925
3    15              348.                161   2618
4    16              352.                172.  3086
5    17              350.                156   3283
6    18              331.                162   5196

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating. More about them can be found here.
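The text above mentions tapply as a base R alternative to dplyr. As a rough sketch of that route, the code below builds the same three summaries (mean, median and count per age) on a tiny made-up data frame. The names pf_demo and fc_by_age_demo are hypothetical, and the result is a plain data frame, not a tibble.

```r
# Hypothetical miniature of pf; values are invented for illustration.
pf_demo <- data.frame(
  age          = c(13, 13, 13, 14, 14, 15),
  friend_count = c(10, 20, 60, 5, 15, 100)
)

# tapply() applies a function to friend_count within each age group.
fc_mean   <- tapply(pf_demo$friend_count, pf_demo$age, mean)
fc_median <- tapply(pf_demo$friend_count, pf_demo$age, median)
fc_n      <- tapply(pf_demo$friend_count, pf_demo$age, length)

# Assemble the per-age summaries into one data frame, ordered by age.
fc_by_age_demo <- data.frame(
  age                 = as.numeric(names(fc_mean)),
  friend_count_mean   = as.vector(fc_mean),
  friend_count_median = as.vector(fc_median),
  n                   = as.vector(fc_n)
)
print(fc_by_age_demo)
```

The dplyr version reads more naturally once several summaries are involved, which is presumably why the article prefers it.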

Using the pf.fc_by_age tibble, we will draw a scatter plot of friend_count_mean against age.

ggplot(aes(age,friend_count_mean),data=pf.fc_by_age) + geom_point()

Plot 6 Scatter Plot ― Age Vs Friend Count Mean

Next, we join the scattered dots with a line. Note the use of geom_line in place of geom_point.

ggplot(aes(age,friend_count_mean),data=pf.fc_by_age) + geom_line()

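Plot 7 Line Plot ― Age Vs Friend Count Mean

Cool! Let's overlay these summaries on our earlier scatter plot from Plot 5. Here the summaries are computed on the fly with stat='summary' instead of being read from the tibble.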

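# Note: in ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun.
ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_point(alpha=1/20, position=position_jitter(h=0), color='red')+
  xlim(13,90)+
  coord_trans(y = "sqrt")+
  # mean friend count per age (solid black line)
  geom_line(stat='summary',fun.y=mean)+
  # median friend count per age (dashed blue line)
  geom_line(stat='summary',fun.y=median,linetype=2,color='blue')+
  # 90th percentile of friend count per age (solid blue line)
  geom_line(stat='summary',fun.y=quantile,fun.args=list(probs=0.9),color='blue')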

Plot 8 ― Overlaying summaries with raw data
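Each overlaid line in Plot 8 connects one summary value per age. As a rough illustration of what those values are, the base R sketch below computes the mean, median and 90th percentile of friend count for two ages in a made-up data frame. The names pf_demo and p90 are hypothetical helpers, not part of the original code.

```r
# Hypothetical friend counts for two ages; values are invented for illustration.
pf_demo <- data.frame(
  age          = rep(c(20, 30), each = 11),
  friend_count = c(0:10 * 10, 0:10 * 100)
)

# Helper mirroring fun.args = list(probs = 0.9): the 90th percentile of a vector.
p90 <- function(x) unname(quantile(x, probs = 0.9))

# One summary value per age, which is what each overlaid line connects.
mean_by_age   <- tapply(pf_demo$friend_count, pf_demo$age, mean)
median_by_age <- tapply(pf_demo$friend_count, pf_demo$age, median)
p90_by_age    <- tapply(pf_demo$friend_count, pf_demo$age, p90)

print(rbind(mean = mean_by_age, median = median_by_age, p90 = p90_by_age))
```

Reading the 90th-percentile line this way, 90% of users of a given age have a friend count at or below the plotted value.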

From Plot 8 we can infer that having more than 1,000 friends is a rare occurrence. To zoom in further, we can use coord_cartesian.

ggplot(aes(
