ANDA'S IT LIBRARY
DATA EXAMPLES (chronological)
Reference Document by Anda Vitols
Here I explore the tidyverse, piping, and other tools that make data wrangling more intuitive and efficient. I also delve deeper into modeling and mining. R now integrates with non-proprietary apps and has a whole set of publishing tools for creating documents and applications.
ON THIS PAGE
section 1: PREPARE DATA
  • Getting Started
  • Data Wrangling - dataset structure
  • Data Wrangling - recoding values
section 2: DESCRIBE DATA SET
  • Visualizing Data
  • Explore Data
section 3: DATA MODELING
  • Analyzing Data
  • Predict Outcomes
  • Machine Learning for socio-psychological data

section 2: DESCRIBE THE DATASET (2023)
VISUALIZE DATA
About Colour / About Palettes
One Variable or Measure
  • creating - barcharts - shows counts, proportions
  • creating - histograms - shows distributions
  • creating - boxplots - shows distributions
  • creating - scatterplots - shows relationship between 2 distributions
  • creating - multiple graphs - shows relationship amongst multiple distributions
  • creating - cluster charts - shows grouping of observations/cases
EXPLORE DATA
  • computing - frequencies
  • computing - descriptive statistics
  • computing - correlations
  • creating - contingency tables
DESCRIBE THE DATASET
Visualizing Data
  • About Colour
    • functions: barplot(), colors()
  • About Palettes
    • packages: RColorBrewer, wesanderson
    • functions: barplot(), palette(), display.brewer.all(), display.brewer.pal(), names()
  • BARCHART
    • dataset: datasets::HairEyeColor
    • packages: datasets, pacman, psych, rio, tidyverse
    • functions: plot(), ggplot2::qplot(), ggplot(), ggsave()
  • HISTOGRAM
    • dataset: datasets::iris
    • packages: datasets, pacman, psych, rio, tidyverse
    • functions: ggplot(), stat_function(), rnorm(), qplot(), geom_histogram(), geom_density()
  • BOXPLOT
    • dataset: datasets::iris
    • packages: datasets, pacman, rio, tidyverse
    • functions: boxplot(), qplot(), ggplot(), geom_boxplot()
  • SCATTERPLOT
    • dataset: datasets::iris
    • packages: datasets, pacman, rio, tidyverse
    • functions: qplot(), ggplot(), geom_point(), geom_jitter(), geom_smooth(), geom_density2d()
  • MULTIPLE GRAPHS
    • dataset: datasets::iris
    • packages: datasets, pacman, rio, tidyverse
    • functions: ggplot(), geom_histogram(), facet_grid(), geom_density(), geom_smooth(), geom_density2d()
  • CLUSTER CHARTS
    • dataset: datasets::mtcars
    • packages: datasets, pacman, rio, tidyverse
    • functions: glimpse(), rownames_to_column(), select(), mutate(), print(), dist(), hclust(), plot(), rect.hclust()
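As a rough sketch of the cluster-chart workflow listed above, the base R steps below cluster datasets::mtcars and draw the dendrogram. The choice of columns and of three clusters is an assumption for illustration, and the tidyverse preparation steps (glimpse(), select(), mutate()) are left out.

```r
# Illustrative column choice (an assumption, not taken from the original notes)
cars <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])  # standardize so no variable dominates

d  <- dist(cars)    # Euclidean distances between cars
hc <- hclust(d)     # hierarchical clustering (complete linkage by default)

plot(hc, cex = 0.6)                      # dendrogram
rect.hclust(hc, k = 3, border = "red")   # outline 3 clusters (k = 3 is arbitrary)

groups <- cutree(hc, k = 3)  # cluster membership for each car
table(groups)
```

Changing k in both rect.hclust() and cutree() shows how the grouping shifts as the tree is cut higher or lower.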
Exploring Data Using Descriptive Statistics
  • dataset used: imported StateData.xlsx
  • FREQUENCIES - computing frequencies
    • packages: magrittr, pacman, rio, tidyverse
  • DESCRIPTIVE STATISTICS - computing descriptive statistics
    • packages: magrittr, pacman, rio, tidyverse, psych
    • functions: summary(), ggplot(), boxplot.stats(), psych::describe()
  • CORRELATIONS - computing correlations
    • packages: corrplot, magrittr, pacman, rio, tidyverse
    • functions: cor(), corrplot(), cor.test()
  • CONTINGENCY TABLES - creating contingency tables, then comparing categories & frequency totals
    • packages: magrittr, pacman, rio, tidyverse
    • functions: table(), rowSums(), colSums(), prop.table(), chisq.test()
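A minimal sketch of the contingency-table functions listed above. StateData.xlsx is not included here, so this assumes datasets::HairEyeColor (used in the barchart section) as a stand-in. It is already a table, so table() itself is not needed.

```r
# Stand-in data: collapse the 3-way HairEyeColor table to Hair x Eye
tab <- margin.table(HairEyeColor, c(1, 2))

rowSums(tab)    # frequency totals per hair colour
colSums(tab)    # frequency totals per eye colour

row_pct <- prop.table(tab, 1)  # row proportions: eye colour within each hair colour
round(row_pct, 2)

result <- chisq.test(tab)      # chi-square test of independence
result
```

With raw case-level data, table(df$a, df$b) would produce the two-way table that the remaining calls expect.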

section 3: DATA MODELING (2023)
ANALYZING DATA
  • comparing - proportions - with chi-square for contingency tables
  • comparing - one-mean to a population - with one-sample t-test
  • comparing - paired means - with paired-samples t-test
  • comparing - two-means - with independent-samples t-test
  • comparing - multiple-means - with one-factor analysis-of-variance
  • comparing - means with multiple predictors - with factorial analysis of variance
PREDICTING OUTCOMES
  • predicting - outcomes - with linear regression
  • predicting - outcomes - with lasso regression
  • predicting - outcomes - with quantile regression
  • predicting - outcomes - with logistic regression
  • predicting - outcomes - with log-linear or Poisson regression
  • assessing - predictions - with blocked-entry models
MACHINE LEARNING FOR SOCIO-PSYCHOLOGICAL DATA
About variables: exploring dimension reduction
  • conducting - a principal component analysis
  • conducting - an item analysis
  • conducting - a confirmatory factor analysis
About cases: clustering & classifying observations/cases
  • grouping - cases - with hierarchical clustering
  • grouping - cases - with k-means clustering
  • classifying - cases - with k-nearest neighbors
  • classifying - cases - with decision-tree analysis
  • creating - ensemble models - with random forests
DATA MODELING
Analyzing Data by Comparisons & Inferences
  • proportions: compare proportions
    • PROPORTIONS - comparing proportions
      • dataset: survival::lung
      • packages: magrittr, pacman, survival, tidyverse
      • functions: prop.test()
  • t-tests: compare 2 things (distributions)
    • ONE-SAMPLE T-TEST - compare 1 mean to population
      • dataset: datasets::quakes
      • packages: datasets, magrittr, pacman, tidyverse
    • PAIRED MEANS T-TEST - compare paired means
      • dataset: create artificial data with random numbers
      • packages: GGally, magrittr, pacman, tidyverse
    • INDEPENDENT-SAMPLES T-TEST - compare 2 means
      • datasets: datasets::sleep, create artificial data with random numbers
      • packages: datasets, magrittr, pacman, tidyverse
  • ANOVA: compare multiple things (distributions)
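The three t-tests listed above can be sketched with base R's t.test(). The comparison value mu = 4, the seed, and the parameters of the artificial data are assumptions chosen only for illustration.

```r
# One-sample: does the mean quake magnitude differ from 4?
# (mu = 4 is an arbitrary comparison value)
one_sample <- t.test(quakes$mag, mu = 4)

# Independent samples: extra sleep by drug group (Welch test by default)
two_sample <- t.test(extra ~ group, data = sleep)

# Paired means: artificial before/after scores from random numbers
set.seed(1)
before <- rnorm(20, mean = 50, sd = 10)
after  <- before + rnorm(20, mean = 2, sd = 5)
paired <- t.test(after, before, paired = TRUE)

one_sample
two_sample
paired
```

Each call returns an "htest" object, so the statistic, degrees of freedom, and p-value can be pulled out with $statistic, $parameter, and $p.value.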
Predicting Outcomes using Regressions
  • LINEAR REGRESSION
    • dataset: import StateData.xlsx
    • packages: GGally, magrittr, pacman, rio, tidyverse
    • functions: ggpairs(), ggplot(), lm(), summary(), confint(), predict(), lm.influence(), influence.measures(), plot()
  • LASSO REGRESSION
    • dataset: import winequality-red.csv
    • packages: lars, magrittr, pacman, rio, tidyverse
    • functions: summary(), scale(), as.matrix(), lars(), plot(), view(), coef()
  • QUANTILE REGRESSION
    • dataset: import StateData.xlsx
    • packages: GGally, magrittr, pacman, quantreg, rio, tidyverse
    • functions: ggpairs(), ggplot(), rq(), summary()
  • LOGISTIC REGRESSION
    • dataset: import Big5 b5_df.rds
    • packages: broom, magrittr, pacman, rio, skimr, tidyverse
    • functions: ggplot(), skim(), glm(), summary(), tidy(), confint(), view(), predict(), head(), table(), prop.table()
  • LOG-LINEAR (POISSON) REGRESSION
    • dataset: datasets::InsectSprays
    • packages: datasets, magrittr, pacman, rio, tidyverse
    • functions: summary(), glm()
  • BLOCKED-ENTRY MODELS
    • dataset: import Big5 b5_df.rds
    • packages: jmv, magrittr, pacman, rio, tidyverse
    • application: JAMOVI
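Of the regressions listed above, the log-linear (Poisson) model is the one that runs on a built-in dataset, so it makes a convenient minimal sketch. The model formula count ~ spray is an assumption about how the data would be modeled.

```r
# Poisson regression: insect counts as a function of spray type
fit <- glm(count ~ spray, family = poisson, data = InsectSprays)

summary(fit)  # coefficients are on the log scale

# Exponentiate to read them as counts and ratios:
# the intercept becomes the mean count for spray A,
# the others become rate ratios versus spray A
exp(coef(fit))
```

The same glm() call with family = binomial gives a logistic regression, which is how the two generalized models in this list relate.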
Machine Learning
  • Clustering Cases
    • HIERARCHICAL CLUSTERING
      • dataset: import StateData.xlsx
      • packages: cluster, factoextra, magrittr, pacman, rio, tidyverse
    • K-MEANS CLUSTERING
      • dataset: datasets::mtcars
      • packages: cluster, datasets, factoextra, magrittr, pacman, rio, tidyverse
  • Classifying Cases
  • Variable-Component Reduction
    • dataset used: import Big 5 b5.csv
    • PRINCIPAL COMPONENT ANALYSIS (PCA)
      • packages: GPArotation, magrittr, pacman, psych, rio, tidyverse
      • functions: prcomp(), princomp(), principal(), summary(), plot(), vss(), nfactors(), fa(), fa.diagram(), iclust()
    • ITEM ANALYSIS
      • packages: GPArotation, magrittr, pacman, rio, tidyverse
      • functions: function(), as_tibble(), mutate_at(), mutate(), pull(), describe(), error.bars(), list(), scoreItems(), pairs.panels(), psych::alpha(), hist(), irt.fa(), plot()
    • CONFIRMATORY FACTOR ANALYSIS (CFA)
      • using lavaan = "LAtent VAriable ANalysis" for CFA
      • packages: lavaan, magrittr, pacman, rio, tidyverse
      • functions: names(), cfa(), summary()
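A small sketch of the principal component step from the variable-reduction list above, using base R's prcomp(). The Big 5 file is not included here, so datasets::mtcars is assumed as a stand-in. The psych and lavaan functions are not shown.

```r
# PCA on standardized variables (mtcars stands in for the Big 5 data)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

summary(pca)         # proportion of variance explained per component
plot(pca)            # bar chart of component variances (scree-style)
head(pca$x[, 1:2])   # scores on the first two components
```

Standardizing with scale. = TRUE matters when variables are on different scales, as they are in mtcars.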

section 1: PREPARING DATA (2023)
GETTING STARTED
  • preparing - my R workspace
  • importing & saving - data in R format
DATA WRANGLING - prepare dataset STRUCTURE for analysis
  • preparing - data frames for data analysis
  • converting - dataset structure
  • extracting - data from dataset
DATA WRANGLING - recoding data VALUES in dataset
  • recoding - variables (categorical / quantitative)
  • transforming - outliers
  • recoding - scores (count / average)
GETTING STARTED
Prepare R Workspace
Import data and Save in R format
DATA WRANGLING - prepare dataset structure for analysis
Preparing Ideal Data Frames for Computer Analysis
  • TIDY DATA
    • Create - tidy data - from time-series, XML, JSON, compound-variables
    • packages: datasets, pacman, tidyverse, lubridate, XML
    • TIME SERIES
      • dataset: datasets::sunspots
    • XML & JSON
      • dataset: XMLdata.xml
    • COMPOUND VARIABLES
      • dataset: enter example artificial data
  • TIBBLES
    • Using - tibbles - for clean data frame
    • packages: datasets, pacman, tidyverse
    • dataset: datasets::Orange
  • DATA.TABLE
    • Using - data.table - package for very large datasets
    • packages: pacman, data.table, httr, tidyverse
    • dataset: figshare.com website
Converting Dataset Structure
Extracting Data from the Dataset
  • DATES & TIMES
    • packages: datasets, lubridate, pacman, tidyverse, tsibble
    • dataset: datasets::EuStockMarkets
    • functions: ggplot::GAM
  • LISTS
    • packages: pacman, tidyverse
    • dataset: manual entry
  • XML
    • packages: pacman, tidyverse, xml2
    • dataset: Missouri data portal
    • functions: str_squish()
  • CATEGORICAL
    • packages: pacman, tidyverse, vcd
    • dataset: vcd::Arthritis
  • FILTERING
    • Filtering cases and subgroups
    • packages: pacman, rio, tidyverse
    • dataset: StateData.xlsx
DATA WRANGLING - recoding data values in dataset
Recoding Variables
  • CATEGORICAL
    • packages: magrittr, pacman, rio, tidyverse
    • dataset: MobileOS.US.xlsx
    • functions from forcats: fct_relevel(), fct_infreq(), fct_collapse(), fct_lump(), fct_lump-num()
  • QUANTITATIVE
    • packages: pacman, tidyverse
    • dataset: enter data
    • functions: scale()
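For the quantitative recoding above, scale() converts raw values to z-scores. A minimal sketch with made-up scores (the numbers are hypothetical):

```r
x <- c(2, 4, 4, 5, 7, 9)   # hypothetical entered scores

z <- as.numeric(scale(x))  # center on the mean, divide by the standard deviation
round(z, 2)

# scale() returns a one-column matrix; as.numeric() drops it back to a vector
mean(z)   # approximately 0
sd(z)     # 1, up to floating-point rounding
```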
Transforming Outliers
  • OUTLIERS
    • packages: datasets, pacman, tidyverse
    • datasets: datasets::islands
    • functions: enframe(), arrange(), filter(), pull(), select()
    • method: winsorizing
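Winsorizing replaces extreme values with a chosen cut-off instead of deleting them. The sketch below uses base R on datasets::islands. The 5th and 95th percentile cut-offs are an assumption for illustration, and the tidyverse functions listed above are not used.

```r
x <- islands   # named vector of landmass areas

# Cut-offs (assumed): 5th and 95th percentiles
limits <- quantile(x, c(0.05, 0.95))

# Pull values below the lower limit up, and values above the upper limit down
x_w <- pmin(pmax(x, limits[1]), limits[2])

summary(x)     # original: heavily right-skewed
summary(x_w)   # winsorized: same length, extremes capped
```

Every case is kept, so the sample size is unchanged. Only the most extreme values move.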
Recode Scores
  • COUNT-SCORES
    • packages: pacman, rio, tidyverse
    • dataset: StateData.xlsx
    • functions: select(), arrange(), pull(), mutate(), as.tibble()
  • AVERAGE-SCORES
    • packages: pacman, rio, tidyverse
    • dataset: StateData.xlsx
    • functions: as.tibble(), select(), mutate(), arrange(), pull()