ANDA'S IT LIBRARY
DATA EXERCISES & PROJECTS (chronological)
R EXAMPLES OF INDIVIDUAL TECHNIQUES (2019)
Packages Used in These Exercises
These exercises rely mainly on the R core packages (loaded automatically at startup)
  • base = R Base package
  • datasets = R Datasets package
  • graphics = R Graphics package
  • grDevices = R Graphics Devices and Support for Colours and Fonts
  • methods = Formal Methods and Classes
  • stats = R Stats package
  • utils = R Utils package
Other packages used in these exercises
  • package::car - Companion to Applied Regression
  • package::MASS - Support Functions and Datasets for Venables and Ripley's MASS
  • package::psych - Statistics for Psychology
  • 3D plotting packages: scatterplot3d / rgl / RColorBrewer
Datasets Used in These Exercises
datasets::cars
datasets::chickwts
datasets::HairEyeColor
datasets::InsectSprays
datasets::iris
datasets::lynx
datasets::mtcars
datasets::quakes
datasets::sleep
datasets::state.area
datasets::swiss
datasets::Titanic
datasets::trees
datasets::USJudgeRatings
datasets::warpbreaks
MASS::painters
quantreg::engel
datasets from packages
imported datasets
manual data collection: Google Correlate, Google Search
manual data creation: using random number generators
DATA VISUALIZATION
Describing 1 Outcome Variable
  • bar chart categorical (plot | barplot)
    • database: datasets::chickwts - 71 observations of 2 variables-predictors
  • pie chart categorical (pie | par)
    • database: datasets::chickwts - 71 observations of 2 variables-predictors
  • histogram quantitative (hist | curve)
    • time series: datasets::lynx - 114 observations
  • histogram overlay plots quantitative (hist | curve | line)
    • database: datasets::swiss - 47 observations of 6 variables-predictors
    • database: datasets::iris - 150 observations of 5 variables
  • box plot quantitative (boxplot)
    • database: datasets::USJudgeRatings - 43 observations of 12 variables-predictors
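
A minimal sketch of the one-variable charts above, using base graphics and the built-in datasets only. Assumptions: the plot titles, the sort order of the bars, and the normal curve as the histogram overlay are illustrative choices.

```r
# Categorical variable: count chicks per feed type, then chart the counts.
feed_counts <- table(chickwts$feed)
barplot(feed_counts[order(feed_counts, decreasing = TRUE)],
        main = "Chicks per feed type", ylab = "Count")
pie(feed_counts, main = "Share of chicks per feed type")

# Quantitative variable: density-scaled histogram with a normal curve overlay.
lynx_values <- as.numeric(lynx)
hist(lynx_values, freq = FALSE,
     main = "Annual lynx trappings", xlab = "Lynx trapped")
curve(dnorm(x, mean = mean(lynx_values), sd = sd(lynx_values)),
      add = TRUE, col = "red")

# Several quantitative variables side by side as box plots.
boxplot(USJudgeRatings, las = 2, main = "US judge ratings")
```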
Describing & Comparing 2 Outcome Variables
  • means bar chart
    • IF bivariate association between predictor (6 categories-factors) and outcome (count-frequencies) THEN bar chart to compare group means
    • database (like frequency table): datasets::InsectSprays - 72 observations of 2 variables
  • group box plots
    • IF several predictors on same quantitative outcome THEN grouped box plot
    • database: MASS::painters - 54 observations of 5 variables
    • database: import SearchData.csv (Polson) - 51 observations of 1 variable (data from "Google Correlate" for search terms across states)
  • scatter plots
    • IF relationship between 2 quantitative variables THEN scatterplots
    • database: datasets::cars - 50 observations of 2 variables
    • package: car (Companion to Applied Regression) for plotting tools
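
A minimal sketch of the two-variable charts above, using base graphics only. Assumptions: the grouped box plot uses datasets::InsectSprays in place of MASS::painters so that no extra package is needed, and the fitted line comes from lm() rather than from the car plotting tools.

```r
# Group means as a bar chart: mean insect count per spray.
spray_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
barplot(spray_means, xlab = "Spray", ylab = "Mean insect count")

# Grouped box plot: one quantitative outcome split by a categorical predictor.
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Spray", ylab = "Insect count")

# Scatter plot of two quantitative variables with a least-squares line.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(lm(dist ~ speed, data = cars), col = "red")
```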
Discovering, Describing & Comparing Multiple (3+) Outcome Variables
  • bar chart
    • IF 3 variables (1 outcome frequency - 2 predictor variables) THEN clustered bar chart
    • table: datasets::warpbreaks - 54 observations of 3 variables
  • scatter plots
    • IF 1 categorical, 2 quantitative variables THEN scatterplot for grouped data
    • database: datasets::iris - 150 observations of 5 variables
  • plot matrices
    • IF look at association of several quantitative variables THEN scatterplot matrices
    • package: car
    • database: datasets::iris - 150 observations of 5 variables
    • database: import Search.Data.csv (Polson) - 51 observations of 10 variables
  • 3D plots
    • IF 3 variables THEN 3D scatterplot
    • packages: scatterplot3d / rgl / RColorBrewer
    • database: datasets::iris - 150 observations of 5 variables
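
A minimal sketch of the multi-variable charts above, using base graphics only. Assumptions: the clustered bar chart shows mean breaks per wool/tension cell, and the scatterplot matrix uses pairs() rather than the car package. The 3D plots are not shown because they need the add-on packages listed above.

```r
# Clustered bar chart: mean breaks for each wool/tension combination.
break_means <- tapply(warpbreaks$breaks,
                      list(warpbreaks$wool, warpbreaks$tension), mean)
barplot(break_means, beside = TRUE, legend.text = rownames(break_means),
        xlab = "Tension", ylab = "Mean breaks")

# Scatter plot for grouped data: colour the points by species.
plot(Petal.Length ~ Sepal.Length, data = iris,
     col = as.integer(iris$Species), pch = 16)
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 16)

# Scatterplot matrix of the four quantitative columns.
pairs(iris[, 1:4], col = as.integer(iris$Species))
```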
DATA STATISTICS
Describing 1 Outcome Variable
  • frequencies categorical (table | prop.table)
    • table: Google search frequencies-outcome by category-predictor
  • descriptive summaries quantitative (summary | fivenum | boxplot.stats | psych::describe)
    • database: datasets::cars - 50 observations of 2 variables
    • database: datasets::mtcars - 32 observations of 11 variables
  • inferential proportion tests quantitative (prop.test)
    • single proportion hypothesis test and confidence interval
    • infer by: compare a team's winning average to a result by chance (50%)
  • inferential means t-tests quantitative (t.test)
    • single mean hypothesis test & confidence interval - done on 1 variable-predictor
    • database: datasets::quakes - 1000 observations of 5 variables-predictors
  • inferential chi square tests categorical (chisq.test | margin.table)
    • 1-sample, goodness-of-fit test (one outcome - from 3 predictors)
    • table: datasets::HairEyeColor - 3 dimensions x factors for each
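
A minimal sketch of the one-variable statistics above, using the stats package only. Assumptions: the win/game counts are made-up numbers, mu = 4 is an arbitrary null value, and the goodness-of-fit test checks eye colour against equal frequencies. psych::describe is omitted because it needs an add-on package.

```r
# Frequencies and proportions of a categorical variable.
feed_counts <- table(chickwts$feed)
feed_props <- prop.table(feed_counts)

# Descriptive summaries of a quantitative variable.
speed_summary <- summary(cars$speed)
speed_fivenum <- fivenum(cars$speed)

# Single-proportion test: 98 wins in 162 games (made-up numbers) against 50%.
wins_test <- prop.test(x = 98, n = 162, p = 0.5)

# Single-mean t-test: does mean quake magnitude differ from 4 (arbitrary)?
mag_test <- t.test(quakes$mag, mu = 4)

# Goodness-of-fit chi-square: collapse the 3-way table to eye colour,
# then test the counts against equal expected frequencies.
eye_counts <- margin.table(HairEyeColor, 2)
eye_test <- chisq.test(eye_counts)
```

Each test object carries its statistic, p-value and (where applicable) confidence interval, for example wins_test$conf.int.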
Describing & Comparing 2 Outcome Variables
  • correlations
    • IF relationship between 2 quantitative variables THEN correlation coefficient
    • database: datasets::swiss - 47 observations of 6 variables
  • bivariate regression
    • IF one variable predicts outcome of second variable THEN association statistic (linear regression)
    • database: datasets::trees - 31 observations of 3 variables
  • means t-test
    • IF compare means of 2 groups (quantitative variables) THEN independent 2-group t-test
    • database: datasets::sleep - 20 observations of 3 variables
  • paired means t-test
    • IF compare means of one group, before and after intervention THEN paired t-test
    • datasets: generate a random dataset, then apply a simulated intervention to it to create the second dataset
  • means ANOVA
    • IF compare means of one dimension across many groups THEN one-way (one-factor) ANOVA (analysis of variance)
    • datasets: create random data for 4 groups (same N and std dev, different means)
  • proportions
    • IF compare proportions of 2 groups THEN prop.test
    • datasets: manual
    • database: import mlb2011.csv (Polson)
  • categorical: cross tabs
    • IF tabulate across many dimensions THEN crosstabs for categorical variables. (Narrow down to 2 dimensions > Chi-square test for independence)
    • dataset: table datasets::Titanic - 4 dimensions
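
A minimal sketch of the two-variable statistics above, using the stats package only. Assumptions: the variable pairs picked from swiss and trees, all simulated values (seed, sample sizes, means, standard deviations) and the counts in the two-group proportion test are illustrative, not taken from the original exercises.

```r
# Correlation between two quantitative variables.
swiss_cor <- cor.test(swiss$Fertility, swiss$Education)

# Bivariate regression: girth predicting height.
tree_fit <- lm(Height ~ Girth, data = trees)

# Independent two-group t-test.
sleep_test <- t.test(extra ~ group, data = sleep)

# Paired t-test: simulate "before", then add a simulated intervention effect.
set.seed(1)
before <- rnorm(30, mean = 50, sd = 10)
after <- before + rnorm(30, mean = 3, sd = 4)
paired_test <- t.test(after, before, paired = TRUE)

# One-way ANOVA: four simulated groups, same N and sd, different means.
groups <- data.frame(
  score = c(rnorm(30, 40, 8), rnorm(30, 41, 8),
            rnorm(30, 45, 8), rnorm(30, 46, 8)),
  group = factor(rep(c("g1", "g2", "g3", "g4"), each = 30))
)
anova_fit <- aov(score ~ group, data = groups)

# Two-group proportion test (made-up counts: successes out of trials).
two_prop <- prop.test(x = c(48, 39), n = c(100, 100))

# Cross tab narrowed to 2 dimensions, then chi-square test for independence.
class_survival <- margin.table(Titanic, c(1, 4))
indep_test <- chisq.test(class_survival)
```

summary(tree_fit) and summary(anova_fit) print the coefficient and ANOVA tables; TukeyHSD(anova_fit) is one option for follow-up pairwise comparisons.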
Discovering, Describing & Comparing Multiple (3+) Outcome Variables
  • regression
    • IF several variables to predict scores on single outcome variable THEN compute multiple regression
    • database: datasets::USJudgeRatings - 43 observations of 12 variables
  • 2 factor ANOVA
    • IF 2 categorical predictor variables, 1 quantitative outcome THEN 2-factor ANOVA
    • database: datasets::warpbreaks - 54 observations of 3 variables
  • clusters
    • IF grouping observations, based on similarities of variables THEN clustering
    • 3 major types of clustering
      • split: into number of clusters (k-means)
      • hierarchical: start separate > combine
      • divide: start single > split
    • database: datasets::mtcars - 32 observations of 11 variables
    • database: import StateClusterData.csv (Polson) - google search data
  • principal components & factor analysis
    • IF looking for most significant predictors THEN
      • principal components = components as result of variables
      • factor analysis = variables as result of factor
    • database: datasets::mtcars - 32 observations of 11 variables
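
A minimal sketch of the multi-variable statistics above, using the stats package only. Assumptions: RTEN as the regression outcome, k = 3 clusters, 2 factors and the seed are arbitrary illustrative choices. The divisive ("start single > split") clustering type is not shown because it needs an add-on package.

```r
# Multiple regression: all other ratings predicting RTEN.
judge_fit <- lm(RTEN ~ ., data = USJudgeRatings)

# Two-factor ANOVA with interaction.
breaks_fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# Clustering on standardized variables so no column dominates the distances.
cars_scaled <- scale(mtcars)
set.seed(1)
km <- kmeans(cars_scaled, centers = 3, nstart = 25)  # split into k clusters
hc <- hclust(dist(cars_scaled))                      # start separate > combine
hc_groups <- cutree(hc, k = 3)

# Principal components: components as a result of the variables.
pc <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Factor analysis: variables as a result of latent factors.
fa <- factanal(mtcars, factors = 2)
```

plot(hc) draws the dendrogram, and summary(pc) lists the proportion of variance explained by each component.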
STATISTICS FOR ROBUST DATA
Procedures resistant to outliers and non-normal distributions
  • univariate tests
    • IF data do not meet the assumptions of the test THEN robust univariate tests
    • (median | mean with trim | sd | mad | IQR | fivenum)
    • database: "US States Facts and Figures" datasets::state.area - highly skewed with outliers
  • bivariate tests
    • IF data do not meet the assumptions of the test THEN robust bivariate tests
    • packages: robust / robustbase / MASS / quantreg
    • database: quantreg::engel - 235 observations of 2 variables
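
A minimal sketch of the univariate robust summaries above, comparing outlier-sensitive and outlier-resistant measures on the skewed state.area data. Assumptions: the 10% trim is an arbitrary choice. The bivariate procedures are not shown because they need the add-on packages listed above.

```r
area <- state.area  # land area of the 50 US states, in square miles

# Centre: the mean is pulled up by a few very large states,
# the trimmed mean and the median are not.
centre <- c(mean = mean(area),
            trimmed = mean(area, trim = 0.1),
            median = median(area))

# Spread: sd reacts strongly to outliers, mad and IQR are resistant.
spread <- c(sd = sd(area), mad = mad(area), IQR = IQR(area))

# Tukey's five-number summary (min, lower hinge, median, upper hinge, max).
area_fivenum <- fivenum(area)

print(round(centre))
print(round(spread))
```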
