In exploratory analysis of tabular data, univariate analysis is the first step. It consists of exploring, summarizing, and visualizing the columns of a dataset one at a time.
Table wrangling is usually a prerequisite.
The appropriate univariate techniques then depend on the type of column at hand.
For numerical samples/columns, to name a few:
Boxplots
Histograms
Density plots
CDF
Quantile functions
Miscellanea
For categorical samples/columns, we have:
Bar plots
Column plots
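As a rough illustration of the techniques listed above, the following sketch computes a few numerical and categorical summaries with base R. The sample `x` and the labels `g` are simulated toy data, not taken from the census file.

```r
set.seed(1)

# Toy data (assumption): a skewed numerical sample and a categorical one
x <- rlnorm(200, meanlog = 3, sdlog = 0.5)
g <- sample(c("a", "b", "c"), size = 200, replace = TRUE)

# Numerical column: five-number summary, quantiles, empirical CDF
summary(x)
quantile(x, probs = c(0.25, 0.5, 0.75))
Fn <- ecdf(x)        # Fn is a function: Fn(t) = proportion of values <= t
Fn(median(x))

# Categorical column: counts per level, the input of a bar plot
counts <- table(g)
counts
```

The graphical counterparts are available in base R as `boxplot(x)`, `hist(x)`, `plot(density(x))`, `plot(Fn)` and `barplot(counts)`.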
Dataset
Since 1948, the US Census Bureau has carried out a monthly Current Population Survey, collecting data on residents aged above 15 from \(150\,000\) households. This survey is one of the most important sources of information about the American workforce. The data reported in file Recensement.txt originate from the 2012 census.
In this lab, we investigate the numerical columns of the dataset.
After downloading, dataset Recensement can be found in file Recensement.csv.
Choose a loading function suited to the file format. The RStudio IDE provides a valuable import helper.
Load the data into the session environment and call it df.
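A minimal loading sketch follows. It assumes a comma-separated file with a header row, which may not match the actual layout of Recensement.csv; if the delimiter differs, `readr::read_delim()` with an explicit `delim` would be the natural substitute. To keep the example runnable on its own, it reads a small toy file whose column names, apart from SAL_HOR, are hypothetical.

```r
library(readr)

# Toy stand-in for Recensement.csv (assumption): three rows, three columns.
# Only SAL_HOR is a column name that appears in the lab.
path <- tempfile(fileext = ".csv")
writeLines(
  c("AGE,SEXE,SAL_HOR",
    "34,F,12.5",
    "51,M,20.0",
    "27,F,9.75"),
  path
)

# Assumption: comma-separated values with a header row
df <- read_csv(path, show_col_types = FALSE)

# Inspect column types before deciding what is numerical or categorical
str(df)
```

With the real file, replace `path` by the location of the downloaded Recensement.csv.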
Which columns should be considered as categorical/factor?
Deciding which variables are categorical sometimes requires judgement.
Let us base the decision on a checkable criterion: determine the number of distinct values in each column, and treat the columns with fewer than 20 distinct values as factors.
across() allows us to pick the columns to be categorized, to apply as_factor() to each of them, and to replace each old column with the result.
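The criterion above can be sketched as follows on a toy data frame. Only SAL_HOR is a column name taken from the lab; the other columns and all values are hypothetical. The threshold of 20 is the one stated above.

```r
library(dplyr)
library(forcats)

# Toy data frame (assumption), standing in for df
toy <- tibble(
  SAL_HOR = seq(8, 37, length.out = 30),   # 30 distinct values: stays numeric
  SEXE    = rep(c("F", "M"), times = 15),  # 2 distinct values
  NB_ENF  = rep(0:2, times = 10)           # 3 distinct values
)

# Number of distinct values in each column
toy |>
  summarise(across(everything(), n_distinct))

# Convert every column with fewer than 20 distinct values to a factor;
# across() overwrites each selected column with the result of as_factor()
toy <- toy |>
  mutate(across(where(~ n_distinct(.x) < 20), as_factor))

str(toy)
```

The fixed threshold is only a heuristic: a numerical column that happens to take few distinct values in the sample would also be converted, so the result is worth checking by eye.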
For square-integrable distributions, the median and the mean are related by at least one inequality, Lévy’s inequality: \[|\text{Median} - \text{Mean}| \leq \text{sd}\]
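The inequality can be checked numerically on a sample. The sketch below uses a simulated, right-skewed sample (an assumption, not the census data); the same three lines apply to any numerical column such as SAL_HOR.

```r
set.seed(42)

# Toy right-skewed sample (assumption): exponential with mean 20
x <- rexp(1000, rate = 1 / 20)

gap <- abs(median(x) - mean(x))

# Compare the gap between median and mean with the standard deviation
c(gap = gap, sd = sd(x))
gap <= sd(x)
```

The comparison holds for every sample, since the empirical distribution is itself square-integrable and sd() (divisor n - 1) is at least as large as its standard deviation.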
Question
Are the standard deviation and the IQR systematically related?
truc <- rlang::expr({fill = alpha("white", .5)})

p <- df |>
  ggplot() +
  aes(x = SAL_HOR, y = after_stat(density)) +
  labs(
    title = "Wage distribution",
    subtitle = "Census data",
    x = "Hourly wage",
    y = "Density"
  ) +
  theme_minimal()
p +
  stat_density(fill = alpha("grey", 0.5), color = "black") +
  geom_histogram(fill = alpha("grey", 0.5), color = "black", bins = 15) +
  labs(caption = "Overlaid Density Estimates")
Question
How could you comply with the DRY (Don't Repeat Yourself) principle?
solution
This amounts to programming with ggplot2 functions. This is not straightforward, since ggplot2 relies on data masking.
A major requirement of a good data analysis is flexibility. If your data changes, or you discover something that makes you rethink your basic assumptions, you need to be able to easily change many plots at once. The main inhibitor of flexibility is code duplication. If you have the same plotting statement repeated over and over again, you’ll have to make the same change in many different places. Often just the thought of making all those changes is exhausting! This chapter will help you overcome that problem by showing you how to program with ggplot2.
To make your code more flexible, you need to reduce duplicated code by writing functions. When you notice you’re doing the same thing over and over again, think about how you might generalise it and turn it into a function. If you’re not that familiar with how functions work in R, you might want to brush up your knowledge at https://adv-r.hadley.nz/functions.html.
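One possible way to remove the duplication in the overlaid plot is sketched below. The shared fill and colour are stored once, and a helper returns both layers as a list, which ggplot2 accepts on the right-hand side of `+`. The names `style`, `overlay_layers` and `toy` are hypothetical, and the data are simulated so that the example runs on its own.

```r
library(ggplot2)

# Shared appearance, written once (same values as in the overlaid plot)
style <- list(fill = alpha("grey", 0.5), color = "black")

# Hypothetical helper: builds the density and histogram layers
# from the same style, so a change to `style` affects both
overlay_layers <- function(bins = 15, style_args = style) {
  list(
    do.call(stat_density, style_args),
    do.call(geom_histogram, c(style_args, list(bins = bins)))
  )
}

# Toy data (assumption) standing in for the census data frame
set.seed(1)
toy <- data.frame(SAL_HOR = rlnorm(500, meanlog = 3, sdlog = 0.5))

p_toy <- ggplot(toy) +
  aes(x = SAL_HOR, y = after_stat(density)) +
  overlay_layers(bins = 15) +
  theme_minimal()

p_toy
```

Because the helper only builds layers and never refers to a column name, it sidesteps data masking entirely; passing column names as function arguments would need extra care.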