ANDA'S IT LIBRARY
DATA EXERCISES & PROJECTS (chronological)
R EXAMPLES OF INDIVIDUAL TECHNIQUES
(2019)
Packages
Used in These Exercises
These exercises primarily focus on
R core packages
(automatically initialized)
base
= R Base package
datasets
= R Datasets package
graphics
= R Graphics package
grDevices
= R Graphics Devices and Support for Colours and Fonts
methods
= Formal Methods and Classes
stats
= R Stats package
utils
= R Utils package
Other packages used in these exercises
package::car
- Companion to Applied Regression
package::MASS
- Support Functions and Datasets for Venables and Ripley's MASS
package::psych
- Procedures for Psychological, Psychometric, and Personality Research
3D plotting packages
:
scatterplot3d
/
rgl
/
RColorBrewer
Datasets
Used in These Exercises
datasets::
cars
datasets::
chickwts
datasets::
HairEyeColor
datasets::
InsectSprays
datasets::
iris
datasets::
lynx
datasets::
mtcars
datasets::
quakes
datasets::
sleep
datasets::
state.area
datasets::
swiss
datasets::
Titanic
datasets::
trees
datasets::
USJudgeRatings
datasets::
warpbreaks
MASS::
painters
quantreg::
engel
datasets from packages
imported datasets
manual data collection
:
Google Correlate
,
Google Search
manual data creation
: using random number generators
DATA VISUALIZATION
Describing 1 Outcome Variable
bar chart
categorical (
plot
|
barplot
)
database
:
datasets::chickwts
- 71 observations of 2 variables-predictors
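A minimal sketch of the bar chart entry, assuming the categorical variable of interest is `feed` and that sorting the bars by count is wanted (the sorting and the axis labels are choices made here, not something this index specifies):

```r
# Count how many chicks received each feed type (one categorical variable).
feed_counts <- table(chickwts$feed)

# Sort the counts so the tallest bar comes first, then draw the bar chart.
feed_sorted <- feed_counts[order(feed_counts, decreasing = TRUE)]
barplot(feed_sorted,
        main = "Chicks per feed type",
        xlab = "Feed", ylab = "Number of chicks")
```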
pie chart
categorical (
pie
|
par)
database
:
datasets::chickwts
- 71 observations of 2 variables-predictors
histogram
quantitative (
hist
|
curve
)
timeseries
:
datasets::lynx
- 114 observations
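A possible version of the histogram with a curve on top, assuming a normal curve is the overlay of interest. The number of breaks, the colour and the labels are illustrative choices:

```r
# Histogram on the density scale so a density curve can share the y-axis.
h <- hist(lynx, freq = FALSE, breaks = 14,
          main = "Annual lynx trappings", xlab = "Number trapped")

# Overlay a normal curve with the sample mean and standard deviation.
curve(dnorm(x, mean = mean(lynx), sd = sd(lynx)),
      add = TRUE, col = "red", lwd = 2)
```

The gap between the bars and the curve gives a quick visual check of how far the series is from normal.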
histogram overlay plots
quantitative (
hist
|
curve
|
lines
)
database
:
datasets::swiss
- 47 observations of 6 variables-predictors
database
:
datasets::iris
- 150 observations of 5 variables
box plot
quantitative (
boxplot
)
database
:
datasets::USJudgeRatings
- 43 observations of 12 variables-predictors
Describing & Comparing 2 Outcome Variables
means bar chart
IF bivariate association between predictor (6 categories-factors) and outcome (count-frequencies) THEN use bar chart to compare group means
database
(like frequency table):
datasets::InsectSprays
- 72 observations of 2 variables
group box plots
IF several predictors on same quantitative outcome THEN grouped box plot
database
:
MASS::painters
- 54 observations of 5 variables
database
: import
SearchData.csv
(Polson) - 51 observations of 1 variables (data from "
Google Correlate
" for search terms across states)
scatter plots
IF relationship between 2 quantitative variables THEN scatterplots
database
:
datasets::cars
- 50 observations of 2 variables
package
:
car
(Companion to Applied Regression) for plotting tools
Discovering, Describing & Comparing Multiple (3+) Outcome Variables
bar chart
IF 3 variables (1 outcome frequency - 2 predictor variables) THEN clustered bar chart
table
:
datasets::warpbreaks
- 54 observations of 3 variables
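One way the clustered bar chart could be built, assuming the bar heights are the mean number of breaks for each wool and tension combination (the choice of the mean as the summary is an assumption):

```r
# Mean number of breaks for every wool x tension cell (2 x 3 matrix).
mean_breaks <- tapply(warpbreaks$breaks,
                      list(wool = warpbreaks$wool,
                           tension = warpbreaks$tension),
                      mean)

# beside = TRUE puts the wool bars next to each other within each tension.
barplot(mean_breaks, beside = TRUE,
        legend.text = rownames(mean_breaks),
        main = "Mean breaks by wool and tension",
        xlab = "Tension", ylab = "Mean breaks")
```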
scatter plots
IF 1 categorical, 2 quantitative variables THEN scatterplot for grouped data
database
:
datasets::iris
- 150 observations of 5 variables
plot matrices
IF look at association of several quantitative variables THEN scatterplot matrices
package
:
car
database
:
datasets::iris
- 150 observations of 5 variables
database
: import
Search.Data.csv
(polson) - 51 observations of 10 variables
3D plots
IF 3 variables THEN
3D scatterplot
packages
:
scatterplot3d
/
rgl
/
RColorBrewer
database
:
datasets::iris
- 150 observations of 5 variables
DATA STATISTICS
Describing 1 Outcome Variable
frequencies
categorical (
table
|
prop.table
)
table
: google search frequencies-outcome by category-predictor
descriptive summaries
quantitative (
summary
|
fivenum
|
boxplot.stats
|
psych::describe
)
database
:
datasets::cars
- 50 observations of 2 variables
database
:
datasets::mtcars
- 32 observations of 11 variables
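A short sketch of the descriptive summaries listed above, assuming `cars$speed` as the single variable. The `psych::describe` call is guarded because that package may not be installed:

```r
# Min, quartiles, mean and max for one quantitative variable.
summary(cars$speed)

# Tukey's five-number summary (min, lower hinge, median, upper hinge, max).
five <- fivenum(cars$speed)
five

# Whisker ends, hinges and median, plus any points flagged as outliers.
box_stats <- boxplot.stats(cars$speed)
box_stats$out

# Summaries for every column of a data frame at once.
summary(mtcars)

# Extended descriptives (skew, kurtosis, se) if psych is available.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(mtcars))
}
```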
inferential proportion tests
quantitative (
prop.test
)
single proportion hypothesis test and confidence interval
infer by: compare a team's winning percentage to a result by chance (50%)
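A sketch of the single-proportion test. The win and game counts below are hypothetical placeholders, not figures from the exercise data:

```r
wins  <- 98    # hypothetical number of wins
games <- 162   # hypothetical number of games played

# Test H0: true winning proportion = 0.5; also returns a 95% interval.
result <- prop.test(wins, games, p = 0.5)
result$estimate   # observed proportion of wins
result$p.value    # evidence against the 50% chance level
result$conf.int   # confidence interval for the true proportion
```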
inferential means t-tests
quantitative (
t.test
)
single mean hypothesis test & confidence interval - done on 1 variable-predictor
database
:
datasets::quakes
- 1000 observations of 5 variables-predictors
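A possible one-sample t-test on a single column, assuming `mag` is the variable tested. The null value of 4 is an arbitrary choice for illustration:

```r
# Test H0: mean earthquake magnitude = 4 (hypothetical null value).
result <- t.test(quakes$mag, mu = 4)
result$estimate   # sample mean
result$conf.int   # 95% confidence interval for the mean
result$p.value
```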
inferential chi square tests
categorical (
chisq.test
|
margin.table
)
1-sample, goodness-of-fit test (one outcome - from 3 predictors)
table
:
datasets::HairEyeColor
- 3 dimensions x factors for each
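A sketch of the goodness-of-fit test, assuming eye colour is the one dimension kept and that the null hypothesis is equal proportions across categories (the default of `chisq.test` for a one-way table):

```r
# Collapse the 3-way table to a single dimension: eye colour (dimension 2).
eye_counts <- margin.table(HairEyeColor, 2)
eye_counts

# Goodness-of-fit test against equal expected frequencies.
result <- chisq.test(eye_counts)
result
```

Passing a `p = ...` vector to `chisq.test` would test against other expected proportions instead.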
Describing & Comparing 2 Outcome Variables
correlations
IF relationship between 2 quantitative variables THEN
correlation coefficient
database
:
datasets::swiss
- 47 observations of 6 variables
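A minimal correlation sketch, assuming `Fertility` and `Education` as the pair of interest (any two numeric columns would work the same way):

```r
# Pearson correlation coefficient for one pair of variables.
r <- cor(swiss$Fertility, swiss$Education)
r

# Same coefficient with a hypothesis test and confidence interval.
result <- cor.test(swiss$Fertility, swiss$Education)
result$conf.int

# Correlation matrix for all columns, rounded for reading.
cor_matrix <- round(cor(swiss), 2)
cor_matrix
```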
bivariate regression
IF one variable
predicts
outcome of second variable THEN association statistic (
linear regression
)
database
:
datasets::trees
- 31 observations of 3 variables
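One way the bivariate regression could look, assuming `Girth` is the predictor and `Volume` the outcome (the choice of columns is an assumption):

```r
# Simple linear regression: timber volume predicted from trunk girth.
fit <- lm(Volume ~ Girth, data = trees)

summary(fit)   # coefficients, R-squared, overall F-test
confint(fit)   # confidence intervals for intercept and slope
```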
means t-test
IF compare means of 2 groups (quantitative variables) THEN
independent 2-group t-test
database
:
datasets::sleep
- 20 observations of 3 variables
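A sketch of the two-group comparison, treating the two `group` levels as independent samples as this entry describes. R's default here is the Welch test, which does not assume equal variances:

```r
# Compare mean extra sleep between the two groups (Welch two-sample t-test).
result <- t.test(extra ~ group, data = sleep)
result$estimate   # mean of each group
result$p.value
```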
paired means t-test
IF compare means of one group, before and after intervention THEN
paired t-tests
datasets
: create random number datasets -> do intervention on that dataset to create the other dataset
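A possible simulation of the before-and-after setup. The sample size, means, standard deviations and seed are all invented for illustration:

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated scores before the intervention.
before <- rnorm(50, mean = 52, sd = 6)

# "Intervention": each case shifts by about 3 points, with some noise.
after <- before + rnorm(50, mean = 3, sd = 4)

# Paired t-test on the within-case differences (after minus before).
result <- t.test(after, before, paired = TRUE)
result$estimate   # mean difference
result$conf.int
```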
means ANOVA
IF compare means on one dimension across many groups THEN one-way
one-factor ANOVA
(analysis of variance)
datasets
: create random data for 4 groups (same N and std dev, different means)
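A sketch of the one-factor ANOVA on simulated data, following the recipe above (same N and standard deviation, different means). The specific numbers and group labels are made up:

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 30      # same N in every group

# Four groups with equal sd and different means, in long format.
scores <- data.frame(
  group = factor(rep(c("g1", "g2", "g3", "g4"), each = n)),
  score = c(rnorm(n, 40, 3), rnorm(n, 41, 3),
            rnorm(n, 45, 3), rnorm(n, 45, 3))
)

# One-factor ANOVA: does mean score differ across groups?
fit <- aov(score ~ group, data = scores)
anova_table <- summary(fit)[[1]]
anova_table

# Post-hoc pairwise comparisons between groups.
TukeyHSD(fit)
```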
proportions
IF compare proportions of 2 groups THEN
prop.test
datasets
: manual
database
: import
mlb2011.csv
(polson)
categorical: cross tabs
IF tabulate across many dimensions THEN crosstabs for categorical variables. (Narrow down to 2 dimensions >
Chi-square test for independence
)
dataset
: table
datasets::Titanic
- 4 dimensions
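A sketch of narrowing the 4-way table to two dimensions and testing independence, assuming `Class` (dimension 1) and `Survived` (dimension 4) are the pair of interest:

```r
# Collapse the 4-way table to a 2-way crosstab: Class x Survived.
class_survival <- margin.table(Titanic, c(1, 4))
class_survival

# Row percentages: survival rate within each class.
round(prop.table(class_survival, 1), 2)

# Chi-square test of independence between class and survival.
result <- chisq.test(class_survival)
result
```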
Discovering, Describing & Comparing Multiple (3+) Outcome Variables
regression
IF several variables to predict scores on single outcome variable THEN compute
multiple regression
database
:
datasets::USJudgeRatings
- 43 observations of 12 variables
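A minimal multiple regression sketch, assuming `RTEN` (worthy of retention) is the outcome and every other rating is a predictor:

```r
# Regress the retention rating on all other columns ("." = everything else).
fit <- lm(RTEN ~ ., data = USJudgeRatings)

summary(fit)   # coefficient table, R-squared, overall F-test
```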
2 factor ANOVA
IF 2 categorical predictor variables, 1 quantitative outcome THEN
2-factor ANOVA
database
: datasets::
warpbreaks
- 54 observations of 3 variables
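A possible two-factor ANOVA, assuming both main effects and their interaction are of interest (`wool * tension` expands to `wool + tension + wool:tension`):

```r
# Two categorical predictors, one quantitative outcome, with interaction.
fit <- aov(breaks ~ wool * tension, data = warpbreaks)
anova_table <- summary(fit)[[1]]
anova_table

# Cell means behind the effects.
model.tables(fit, type = "means")
```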
clusters
IF grouping observations, based on similarities of variables THEN
clustering
3 major types of clustering
split
: into number of clusters (
k-means
)
hierarchical
: start separate > combine
divide
: start single > split
database
:
datasets::mtcars
- 32 observations of 11 variables
database
: import
StateClusterData.csv
(polson) - google search data
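A sketch of two of the clustering approaches on `mtcars`. Standardising the columns first and asking for 3 clusters are assumptions made here, as is the seed:

```r
# Standardise columns so variables on large scales do not dominate distances.
cars_scaled <- scale(mtcars)

# Hierarchical clustering: start separate, then combine (complete linkage).
hc <- hclust(dist(cars_scaled))
hc_groups <- cutree(hc, k = 3)   # cut the tree into 3 groups

# k-means: split directly into a chosen number of clusters.
set.seed(1)                      # k-means starts from random centres
km <- kmeans(cars_scaled, centers = 3, nstart = 25)

# Cross-tabulate the two solutions to see how far they agree.
table(hierarchical = hc_groups, kmeans = km$cluster)
```

`plot(hc)` draws the dendrogram, which can help in choosing the number of clusters.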
principal component factors
IF looking for most significant predictors THEN
principal components
= components as result of variables
factor analysis
= variables as result of factor
database
:
datasets::mtcars
- 32 observations of 11 variables
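A minimal principal components sketch using `prcomp` from the stats package. Centring and scaling the columns is an assumption, made because the `mtcars` variables are on very different scales:

```r
# Principal components on standardised variables.
pc <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component.
var_share <- pc$sdev^2 / sum(pc$sdev^2)
round(var_share, 3)

# Loadings of each variable on the first two components.
round(pc$rotation[, 1:2], 2)
```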
STATISTICS FOR ROBUST DATA
Procedures resistant to outliers and non-normal distributions
uni-variate tests
: IF data do not meet the assumptions of a test THEN robust uni-variate tests
(
median
|
mean
-
trim
|
sd
|
mad
|
IQR
|
fivenum
)
database
: "US States Facts and Figures"
datasets::state.area
- highly skewed with outliers
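A sketch comparing ordinary and robust summaries on the skewed `state.area` vector. The 10% trim level is an arbitrary choice:

```r
area <- state.area   # land area of the 50 US states, in square miles

# Centre: the mean is pulled up by a few very large states.
c(mean    = mean(area),
  trimmed = mean(area, trim = 0.10),   # drop 10% from each tail
  median  = median(area))

# Spread: sd is sensitive to outliers, mad and IQR much less so.
c(sd = sd(area), mad = mad(area), IQR = IQR(area))

# Five-number summary as a compact robust description.
fivenum(area)
```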
bivariate tests
IF data do not meet the assumptions of a test THEN robust bi-variate tests
packages:
robust
/
robustbase
/
MASS
/
quantreg
database
:
quantreg::engel
- 235 observations of 2 variables
R EXAMPLES: PREPARING DATA
(2019)
Data Wrangling
getting started:
built in data sets
importing files
manual data entry
exporting files
tables, factors, data frames
convert table to rows in data frame
data set adjusting
creating cases from data sets
subgrouping data sets
splitting & merging data sets
data fixing
outliers
skewed distributions
variable combinations
missing data
Working with Colour
using R's built-in colour tools
using
package::RColorBrewer
R colour examples