R

101

R       # interactive cli
Rstudio # gui

# run single command in shell
Rscript -e "print(system('hostname'))"   
Rscript --quiet -e 'install.packages("pacman",    repos = "http://cran.us.r-project.org")'    

# run multiple commands by placing them in {}.  p_load() won't run unless the library is loaded explicitly.
Rscript --quiet --no-readline --slave -e '{ library(pacman); p_load( ggplot2, scales ) }'


Special symbols

^   power. 2^8 
$   separate table name and column name, eg nhanes$gender, opt$indir
~   tilde operator, 
-   minus, subtraction.  eg a-b .  Don't use it in a variable name, as it is treated as subtraction of the terms listed!

log(100)           # Think ln() here!  By default, R's log() is base e, same as bc's l(), python math.log(), and matlab's log().  Calculators are about the only place where log means base 10 and ln means base e.
log(1000, base=10) # this is log with base 10
log(1000, 10)      # same as above
log10(1000)        # ok, it is used often enough they do have a fn name for log base 10!

exp(1)             # exponential function (e^1).  same as bc's e(), python math.exp()

log(exp(1))        # identity
exp(log(1))        # identity

Note, R doesn't treat the dot specially: it is often used as part of a variable or function name, like chisq.test() or p.hat.
However, neither Python nor bc allows a dot in a variable name, so cutting-n-pasting simple calculator lines containing dotted variables results in an error.  Consider avoiding dots in variable names for code portability (eg when pasting from R into a Jupyter notebook with a Python kernel).
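
A quick R illustration (hypothetical names, just to show the syntax difference):

p.hat = 0.6          # legal R name containing a dot
chisq.test           # built-in function whose name contains a dot
p_hat = p.hat        # the underscore spelling pastes cleanly into Python
# in Python, "p.hat = 0.6" is attribute assignment on an object named p, so it fails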



See also:
  • the tilde operator and formula are among the weird things in R.
  • model operators (below)

    R

    Jupyter Notebook with R kernel Container

    
    singularity pull shub://tin6150/DIOS_demonstration
    singularity exec DIOS_demonstration_latest.sif /opt/conda/bin/jupyter lab --allow-root
    
    -or-
    
    docker
    podman run  -it --entrypoint=/opt/conda/bin/jupyter  tin6150/r4envids lab --allow-root
    
    
    

    Jupyter Notebook IR Kernel Setup

    conda install -c conda-forge jupyterlab
    
    IRkernel is the R kernel for Jupyter Notebook.  Install it as an R package, then add its kernel spec to Jupyter Notebook:
    
    Rscript -e "install.packages('IRkernel')"
    
    # run multiple commands in Rscript by placing them inside {}, cmd separated by ;
    Rscript --quiet --no-readline --slave -e '{ print("Hello") ; print("hi") }'
    # p_load requires library(pacman) to be run first, so, try like this:
    Rscript --quiet --no-readline --slave -e '{library(pacman) ; p_load( ggplot2, scales ) ;  print("Hello") ; print("hi") }'
    
    # system wide
    IRkernel::installspec(user = FALSE)
    
    # per user
    Rscript --no-readline --slave -e "IRkernel::installspec(user = TRUE)"
    
    
    /home/tin/anaconda3/bin/jupyter lab 
    # expect to have browser launched with URL http://localhost:8888/lab
    # then click icon to create new R notebook
    
    

    Environment

    
    setenv R_LIBS /path/to/user/R-libraries
    
    # setting R_LIBS will change both where R looks for lib to load and when installing libs 
    
    library( rJava );	# see if it can load the R library named rJava (case sensitive)
    
    listing all installed libraries:
    Rscript -e 'library()' 
    
    sI = sessionInfo()
    sI
    
    
    system('env')	# this is running the env command via the shell and display that info
    system('id -a; hostname')
    
    # in a jupyter notebook, the system() command output goes to the (server) console 
    # use a combination of intern=TRUE and wrap with print() to capture it in the notebook
    print(system( 'uptime', intern=TRUE ))  
    
    
    library(data.table)
    fread( "cat /proc/loadavg", check.names=TRUE ) ## run system command and read result into data.frame
    
    
    

    Setting dir for user package install

    
    setting user's own library install dir.
    for installing CRAN libraries...
    
    You can have an R_LIBS under your home dir.  
    If you set up R_LIBS to be a shared folder, then your students can use the same libraries that you (or I, not as root) install.  
    
    
    
    mkdir -p  ~/rlibs36
    export R_LIBS=~/rlibs36    (or R_LIBS=/global/home/groups/ac_seasonal/rlibs36 )
    (I am assuming this for a new R 3.6; are libraries compatible with the old version?  For Perl I keep them in separate directories; python is a different beast.)
    
    # setting R_LIBS will change both where R looks for lib to load and when installing libs 
    
    Running this as user should install a package called diagram:
        Rscript --quiet -e 'install.packages("diagram",    repos = "http://cran.us.r-project.org")'    
    
    after install, you should see files added to ~/rlibs36
    
    If the above doesn't work,
    start interactive R as user
    install.packages('diagram',lib='~/rlibs36')
    
    this page has a good write up about this issue:
    https://statistics.berkeley.edu/computing/R-packages
    
    

    Installing packages

    
    installing packages from cran:
    (see r4eta or metabolics container)
    
    
    scripted to run from shell:
    
    Rscript --quiet -e 'install.packages("ggplot2",  repos = "http://cran.us.r-project.org")'    
    Rscript --quiet -e 'install.packages(c("aCRM", "akima", "broom", "cluster", "clusterCrit"), repos = "http://cran.us.r-project.org")'
    
    # do NOT have TAB(s) inside the arg list for packages(...), as Rscript can't tolerate it when invoked from a bash script.
    
    
    Rscript --quiet --no-readline --slave ...
    	# --quiet is same as --silent 
    	# --slave was to make R run as quietly as possible 
    	# --no-readline hopefully does less drawing on screen and makes the install log more readable...
    	# there isn't a --no-verbose or --no-interactive
    
    
    install many packages in single command:
    https://stackoverflow.com/questions/29041423/how-to-install-multiple-packages
    
    install.packages(c("EIAdata", "gdata", "ggmap", "ggplot2")) # rest omitted
    
    have not tried yet:
    install.r EIAdata gdata ggmap ggplot2    # rest omitted again
    
    

    CSV example

    Simple example reading a csv file and getting summary info about it.  Uses magrittr (the %>% pipe, also used by dplyr), which allows unix pipe-like operations.
    library( 'magrittr')
    print(getwd())
    data <- read.csv("/mnt/CSV_adjoin/dacsjvnew_AVOC_07_All_Sp.csv")
    data %>% head() # like unix cat data | head
    # print(data)   # maybe big
    data %>% summary()  # summary stat for each column of data, eg min, median, quartiles
    
    str( data ) # see list of variables (column headers) 
    
    

    Misc

    capabilities() 		# see if support X11, etc
    source("myFuncList.R")	# import function list from file
    
    system("id;pwd")	# run OS commands
    setwd("dirname")	# cd to dirname, most useful in interactive R session.
    
    # loading files in Rstudio 
    
    spatial_paths = loadPaths(fileName = normalizePath("C:/Users/tin61/tin-gh-inet-class-only/inet-dev-class/epi_info/travHistProtocol/files/Protocol4/EPI_ISL_412975_MJhist.csv") )
    
    # note that in windoze, C: is assumed and added by system at run time if not specified
    
    # these might be helpful with knitting
    setwd("c:")
    setwd("/Users/tin61/tin-gh-inet-class-only/inet-dev-class/epi_info/travHistProtocol/files/Protocol4/")
    
    # Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
    knitr::opts_chunk$set(root.dir = normalizePath("/Users/tin61/tin-gh-inet-class-only/inet-dev-class/epi_info/travHistProtocol/files/Protocol4/"))
    
    .libPaths()		# list path used to scan for R libs 
    .libPaths(  c( .libPaths(), '/home/tin/R/x86_64-pc-linux-gnu-library/3.4' ) )   # add path to library 
    
    
    rm(list=ls())  		# remove all var in memory
    
    

    Oracle connection

    Sys.setenv(TNS_ADMIN="/site/conf/TNS_ADMIN",ORACLE_HOME="/usr/prog/oracle-client/11.1.0.6",ORACLE_SID='')
    # The TNS_ADMIN may not matter if ORACLE_HOME has a network/admin with working conf for tns name resolution.
    # ORACLE_SID is very important!  if set, the instance specifier in the dbConnect will be ignored, and db always connect to instance specified by ORACLE_SID !!
    
    library( 'ROracle' )
    
    ora=Oracle()
    con=dbConnect(ora,user='user/pass@instance')
    summary(con)
    # verify connected to desired db
    
    result=dbGetQuery(con, "select * from tab")
    # perform SQL query
    
    result
    # see result of query (display content of variable)
    
    
    result2=dbListTables( con )
    # see list of tables for the connected schema
    
    

    GC: Garbage Collection

    gc(verbose=TRUE) 	# explicitly call garbage collection
    ls() 			# list all vars
    object.size( get("myData") )	
    rm()
    rm(list=ls())  		# remove all var in memory # probably useful in rstudio when playing, but not in actual program!
    sort( sapply( ls(), function(x) { object.size(get(x))} ),decreasing=TRUE )	# list all vars and their sizes
    
    vars go out of scope at the end of a function, and gc can happen on them automatically
    
    bigmemory leaves data on file.  some packages are built using bigmemory
    
    file-backed big matrix -- https://ljdursi.github.io/beyond-single-core-R/#/132 
    data = read.big.matrix(...)
    
    csv is slow.
    HDF5 and netCDF recommended. 
    
    
    ref: https://ljdursi.github.io/beyond-single-core-R/#/111

    R Programming

    101.2
    R short primer (pdf) (page 12)
    
    • function(arglist){expr}: 
    function definition: do expr with list of arguments arglist
    
    • if(cond){expr1}else{expr2}: if-statement:
    if cond is true, then expr1, else expr2
    
    • for(var in vec) {expr}: 
    for-loop: the counter var runs through the vector vec 
    and does expr each run
    
    • while(cond){expr}: while-loop: while cond is
    true, do expr each run
    
    
    • x[n]: the n-th element   # R index starts at 1
    • x[m:n]: the m-th to n-th element
    • x[c(k,m,n)]: specific elements
    • x[x>m & x<n]: elements between m and n
    
    • x$n: element of list or data.frame named n.  ie, x is a data.frame, n is the column name
    
    • x[["n"]]: idem		# [[ ]] in R ... 
    
    • [i,j]: element at i-th row and jth column
    • [i, ]: row i
    
    
    • return: return a value from a function (the value is displayed on screen when called interactively)
    
    
    • as.numeric or as.character: change class to
    number or character string
    
    • strptime: change class from character to
    date-time (POSIX)
    
    • data types: lgl (logical: TRUE FALSE; T F are auto parsed as well, but NOT True False), int (Integer), dbl (Double), chr (characters, for strings), bytes (bytes!)  ref: https://cran.r-project.org/web/packages/readxl/vignettes/cell-and-column-types.html
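
    A quick way to check the underlying type of a value is typeof():

    typeof(TRUE)    # "logical"
    typeof(1L)      # "integer"  (the L suffix makes an integer literal)
    typeof(1.5)     # "double"
    typeof("abc")   # "character"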
    
    ?"%*%"  # get help on topic, quote it if it has "funny" symbols
    
    %*% # matrix multiply
    %/% # integer division, eg 7 %/% 3 is 2
    
    %:% # compose or nest foreach objects - https://ljdursi.github.io/beyond-single-core-R/#/84
    
    %>% # like unix pipe, used a lot by dplyr.  can be chained.
    
    %in%	# membership evaluation (may not need to be a true math set)
    
    %%  # modulus, ie reminder.  eg 7%%3 is 1
    2^4 and 2**4 are both exponentiation and give 16.  The 2^4 notation is preferred.
    
    
    R stuff - NA : Not available data
    
    NA is the token to indicate a value that is not available.  Kinda like UNDEF.  
    It is not a string, and it is not NULL.
    It is spelled exactly as NA , both caps.
    
    > foo=NA
    > foo
    [1] NA
    > bar=na
    Error: object 'na' not found
    
    • is.na: test if variable is NA
    
    R by default will not compute on dataset that contain NA.
    If you don’t mind about the missing data and
    want to compute the statistics anyway, you can
    add the argument na.rm=TRUE 
    (Should I remove the NAs? Yes!).
    
    > max(j, na.rm=TRUE)
    [1] 2
    
    
    
    R stuff - stat stuff
    
    • lm(v1∼v2): linear fit (regression line) between 
    vector v1 on the y-axis and v2 on the x-axis
    
    • nls(v1∼a+b*v2, start=list(a=1,b=0)): nonlinear fit. 
    Should contain an equation with variables
    (here v1 and v2) and parameters (here a and b)
    with starting values
    
    • coef: returns coefficients from a fit
    • summary: returns all results from a fit
    
    • sum: sum of a vector
    • mean: mean of a vector
    • sd: standard deviation of a vector
    
    
    • aggregate(x,by=list(y),FUN="mean"): split
    data set x into subsets (defined by y) and computes means of the subsets.
    Result: a new vector.
    
    • approx: interpolate. Argument: vector with NAs. 
    Result: vector without NAs.
    
    
    stat functions
    Ref:
    - "UNM Stat 5101"
    - CDF and PPF in Excel, R and Python (good graphical guide covering what the p, q, d, r functions cover.)
    - SciPy stats function
    
    
    prefix: p, q, d, r ::
    
    p for "probability", the cumulative distribution function (c.d.f.)
    	return the AUC for X<=x, ie the probability  (when lower.tail=TRUE)
         	return the p-value                           (when lower.tail=FALSE)
    
    	F(x) = P(X ≤  x)
       	Answer What is  P(X < 27.4) when X has Normal distribution with mean 50 and standard deviation of 20 aka N(50,20).  N often styled as cursive cap N.
   	if variance (σ²) is given, use the relation s.d. = sqrt(variance) 
       	pnorm( q=27.4, mean=50, sd=20, lower.tail=TRUE )
    	pnorm( 27.4, 50, 20 )
       	In R, optional param default to mean=0, sd=1, lower.tail=TRUE.
       	Python: scipy.stats.norm.cdf( 27.4, loc=50, scale=20 )
       	Excel:  norm.dist( 27.4, 50, 20, TRUE ) 
    
    	For calculation of the p-value in a right tailed test for a given z-score:
       	1 - pnorm( 27.4, 50, 20 )
       	pnorm(27.4, 50, 20, lower.tail=FALSE )  # lower.tail=FALSE -> upper/right tail
    	Python: 
    	1 - scipy.stats.norm.cdf( 27.4, 50, 20 )
    	scipy.stats.norm.sf( 27.4, 50, 20 )     # sf = 1 - cdf
    
    
    q- for "quantile", the inverse c.d.f. - aka PPF(q) for probability ( 1 - α )
    	Have AUC (ie probability or p-value, typically alpha=0.05), get value of x (aka q, quantile)
       	p = F(x)
    	x = F-inv(p)
       	F-inv is styled as F^-1.
       	So given a number p between zero and one, qnorm looks up the p-th quantile of the normal distribution.
       	Answer What is F-inv(0.95) when X has the N(100, 15^2) distribution?
       	qnorm(0.95, 100, 15)
       	qnorm(p=0.95, mean=100, sd=15, lower.tail=TRUE)
   	qnorm(0.05, mean=100, sd=15, lower.tail=FALSE)   # equivalently: qnorm(p, lower.tail=FALSE) == qnorm(1-p, lower.tail=TRUE)
       	Python: scipy.stats.norm.ppf( 0.95, loc=100, scale=15 )
       	Excel: norm.inv( 0.95, 100, 15 ) .  Excel has a norm.s.inv(0.05), but only take mean=0 sd=1.
    
       	
    d- for "density", the density function (p.f. or p.d.f.)
    	or more often discrete?  eg dbinom, dpois
      
       		
    
    r- for "random", a random variable having the specified distribution
    	Generating random numbers from a standard normal distribution
    	N(mu=0,sigma=1)  # N cursive, mu=mean of 0, sigma=standard deviation of 1
    	rnorm(n=1, mean=0, sd=1 )
    	Python: scipy.stats.norm.rvs( loc=0, scale=1, size=1, random_state=123 ) # random_state is optional, use a seed for repeatable/deterministic results
    	If n or size > 1 get a vector.
    	Excel:  norm.s.inv( rand() )
    
         [d- and p- are often contrasted: X = exact value (d-, natural for discrete dist) vs X ≤ up-to a specific value (p-, cumulative)]
    
    pnorm,  qnorm,  dnorm,  and rnorm.  are the p-, q-, d-, r- fn for Normal Distribution.  
         
    
    pbinom  qbinom  dbinom  rbinom - for binomial distribution
    pchisq	qchisq	dchisq	rchisq - for Chi-Square   scipy.stats.chi2.cdf() etc
    pexp	qexp	dexp	rexp   - for Exponential
    plogis	qlogis	dlogis	rlogis - for Logistic
    plnorm	qlnorm	dlnorm	rlnorm - for Log Normal
    ppois	qpois	dpois	rpois  - for Poisson
    pt	qt	dt	rt     - for Student t    scipy.stats.t.cdf() etc
    punif	qunif	dunif	runif  - for Uniform
    
    Note that for continuous distributions, some of these fn are defined using integrals; R does not do symbolic integration (the built-ins evaluate them numerically).
    
    
    All the *norm functions take optional mean and standard deviation.  If omitted, the defaults are mean=0, sd=1.  
    
    pnorm(27.4, mean=50, sd=20)
    pnorm(27.4, 50, 20)
    
    
    
    

    The Statistics Theme Park "Ride the Distributions!"
    Source: Loony Labs - Day 45 or Claudia @ Eigenblogger

    assignment
    a =  5
    a <- 5
    
    # both = and <- can be used for assignment
    # But <- is preferred by the R crowd.  It also screws with my html code here, so maybe this R.html should have been done in .rst ...
    
    # there are some subtle differences, side effect, formula, etc.
    # see https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/
    # overall, for better readability of R code, 
    # it was suggested to only use <- for assignment and = for specifying named parameters.
    # But I was not convinced and will likely continue to use = 
    # at least in this R.html doc.
    
    
    # converting a numeric 1, 2 to categorical male, female, by creating a new col (sex_cat) in the existing table (nhanes)
    
    nhanes$sex_cat = ifelse(nhanes$sex==1, "Male", "Female") 
    
    
    string to integer conversion (search term: casting)
    myNumber = strtoi(myString, base = 10)
    
    
    

    Data Structures

    Adv R
    
    1D:
    vector: homogeneous elements.  c()
    list:   heterogeneous elements.  list()
    
    2D:
    matrix:     homogeneous elements.  matrix()
    dataframe:  heterogeneous elements, named columns. think table of data/spreadsheet :-D.  data.frame(), tidyverse
    
    3D, multi-dimension:
    array:     homogeneous elements.  1-indexed.  array(), dim()
    
    
    attributes()
    structure()
    
    factor() to store categorical data -- ie predefined values
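
    a minimal factor() sketch (made-up values, just to show the idea):

    sex_cat = factor( c("Male","Female","Female","Male"), levels=c("Female","Male") )
    levels(sex_cat)       # "Female" "Male"
    table(sex_cat)        # counts per category
    as.integer(sex_cat)   # underlying integer codes: 2 1 1 2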
    
    
    vector, matrix
    R short primer (pdf)
    
    vec1 = c( 1, 3, 5, 7)	# c = concatenate, paste together.  
    vec2 = c( 2, 4, 6, 8)
    vec3 = seq(10,25,5)     # aka: vec3 = seq(from=10,to=25,by=5)  # result: [1] 10 15 20 25
    
    sum(vec2)
    
    
    > mat1=matrix( data = c(1,2,3,4,5,6), ncol=3 )   # ncol define how many column the matrix has
    > mat1
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6
    
    > mat1[2,2]
    [1] 4
    
    matlab/octave matrix syntax is A LOT nicer :-D
    pay special attention to the sequence of the numbers.
    
    cbind	# column bind to create matrix
    rbind 	# row    bind
    
    
    
    array
    R is 1-indexed, so first entry of df is 1
    A 1D array is a vector.
    A 2D array is a matrix.
    3D and higher dim structure is rarer in stats.
    
    dim()
    length()    # length( any_col_name ) to get number of rows, typically can provide count for sample size n.
    names()
    rownames()
    colnames()
    dimnames()
    t() - transpose
    
    c() generalizes to cbind() and rbind() for matrix
    abind() for arrays
    
    as.matrix()
    as.array()  -- for conversions
    
    
    dataframe
    R is 1-indexed, so first entry of df is 1
    
    coord = data.frame(x = runif(n.total), y = runif(n.total), sampled = 0) 
    coord.sampled = coord[coord$sampled == 1,]   # initial sites
    
    data.frame( coord ) is like a table  # data.frame() builds a table out of vectors
    and $ separates the table name and the column name.
    Thus, the above eg has 
    coord$x
    coord$y
    coord$sampled
    each of them is an array (representing the cell values across many rows of the table).
    coord$x[1]  # table coord, column x, row 1  # R is 1-indexed.  (unlike python 0-indexed)
    coord$x[2]  # table coord, column x, row 2
    
    
    > t = data.frame(x = c(11,12,14), y = c(19,20,21), z = c(10,9,7))
    > t
       x  y  z
    1 11 19 10
    2 12 20  9
    3 14 21  7
    
    > mean(t$x)
    [1] 12.33333
    
    
    typeof( t )
    class( t )
    is.data.frame( t )
    
    cbind(t, data.frame( z = 3:1 ) )
    rbind(t, data.frame(...) )
    
    
    
    
    
    It’s a common mistake to try and create a data frame by cbind()ing vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly
    Since a data frame is a list of vectors, then each vector represents a column, and the data.frame() becomes a list of columns.
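
    a small sketch of that difference (x and y are made-up vectors):

    x = c(1, 2, 3)
    y = c("a", "b", "c")
    m  = cbind(x, y)         # matrix: everything gets coerced to character
    df = data.frame(x, y)    # data frame: x stays numeric, y stays character
    class(m)                 # "matrix" (and "array" in R >= 4.0)
    class(df)                # "data.frame"
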
    List
    A list in its simplest form is like a vector,
    but it can hold an arbitrary number of objects, each of a different length (or type).
    
    syntax looks like it has some nested structure, pay close attention.
    
    > L2=list(vec1,vec2)
    > L2
    [[1]]
    [1] 1 3 5 7
    
    [[2]]
    [1] 2 4 6 8
    
    
    > L = list(one=1, two=c(1,2),five=seq(1, 4, length=5))
    > L
    $one
    [1] 1
    
    $two
    [1] 1 2
    
    $five
    [1] 1.00 1.75 2.50 3.25 4.00
    
    
    [ ]   returns a list
    [[ ]] returns the object which is stored in list
    If it is a named list, then
        List$name or List[["name"]] will return same as List[[ ]]
    While List["name"] returns a list
    
    ref: https://stackoverflow.com/questions/32819539/proper-way-to-access-list-elements-in-r
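
    eg, using the named list L defined above:

    L["five"]             # a list of length 1 that holds the vector
    L[["five"]]           # the vector itself: 1.00 1.75 2.50 3.25 4.00
    L$five                # same as L[["five"]]
    class(L["five"])      # "list"
    class(L[["five"]])    # "numeric"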
    
    for list returned by ans = mccollect( list(jobs...) )
    typically ans[[1]], ans[[2]] etc will contain the original answer desired.
    
    

    Strange R stuff

    Formula and tilde (~) operator

    Coming from a programming world, R's definition of formula and its use of the tilde (~) operator is one of the weirdest sh-IT of R!! =)

    R has this concept of a formula, which, at least from a user perspective, is different from a function. (Actually, a formula can be called 'f' and another function 'f', and they can coexist and be called correctly in the right context!)

    For python programmer, forget about programming for a bit. Put on the stat thinking cap :)
    Remember the flower classification example in data science class? stack overflow describes this construct:
    Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
    as "Model Species depends on Sepal Length, Sepal Width, Petal Length and Petal Width"

    The tilde (~) operator is often used for regression models.
    The LHS of the ~ operator is the response being modelled (Species).
    The RHS lists the variables that are fed into the model, with the + operator being overloaded (it doesn't literally mean add them, but to treat each as "columns" to find covariance). Most models use data frames, and each "variable" is a column vector of data.

    formula has delayed evaluation

    Ref: Cran Lazy Eval, scroll down to the Formulas section.

    A formula captures and delays the evaluation of an expression so that you can later evaluate it with f_eval(), ie
    f <- ~ 1 + 2 + 3
    
    The formula 'f' will be defined, but the RHS expression will not be calculated yet.
    Only when
    f_eval(f)
    
    is executed is the expression calculated.
    Above was a one-sided formula.
    Below is a two-sided formula:
    	g <- y ~ x + z
    
    lazyeval provides abstraction like:
    f_rhs(g)   # x + z
    f_lhs(g)   # y         
    

    Common formula

    R Doc

    lm fits a linear regression model (line of best fit); glm fits generalized linear models (eg logistic regression).

    ## seed_weight is "Y", the response aka predicted value
    ## seed_count is "X", the explanatory ie input value to find a prediction for
    ## lm( y ~ x )
    seed_mod <- lm( seed_weight ~ seed_count, data = seed_data )
    tidy( seed_mod )
    seed_mod
    seed_mod %>% str   # there is a lot more than just intercept and slope in the linear model, just not shown when asked for it plainly.  tidy() is a decent compromise

    ## linear_model = lm( y ~ x, data = someTable )
    m = linear_model$coefficients[2]    # slope (it is a "named number" in R)
    m = linear_model$coefficients[[2]]  # slope ( [[]] returns a number )
    b = linear_model$coefficients[[1]]  # intercept

    the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators.


    Model operators:
    + 	separate a series of terms that affect the model.
    : 	interaction between terms; a:b is the interaction of a and b
    * 	factor crossing: a*b interpreted as a+b+a:b
    ^ 	crossing to specific degree.  eg (a+b)^2 for crossing twice
    %in%   	terms on the left are  nested within those on the right.
    -      	remove dependence on the specified series
    
    I() - To avoid this confusion, the function I() can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense
    
    
    
    1. The lattice package uses them to specify the variables to plot.
    2. The ggplot2 package uses them to specify panels for plotting.
    3. The dplyr package uses them for non-standard evaluation.

    Tilde operator

    Try to understand these:
    # sequences 
    x <-   c(1:100)
    y <- seq(0.1, 10, 0.1)
    	
    
    	
    # all these 3 are the same:
    plot( y     ~ I(x^3) ) 
    plot( I(x^3), y      ) 
    plot( x^3,    y      )    # no need to use I() function as ~ operator never used.
    
    # note how ~ syntactically swapped the order of the independent and dependent variables x and y
    
    # but this below is very different:
    # ^ is interpreted as the model operator for crossing  (due to the presence of the ~ operator)
    plot( y ~ x^3 ) 
    
    	
    
    StackOverflow more extensive discussion on formula
    Regression in Python using R-style formula seems to discuss use of ~ and regression formula in R.
    import pandas as pd
    from statsmodels.formula.api import ols
    	
    model = ols("MPG ~ Year", data=df)
    results = model.fit()
    

    Pipe

    stack overflow
    
    
    %>% is defined by magrittr, which is used by dplyr.
    It is the equivalent of unix shell pipe.
    
    iris %>% head() %>% summary() is equiv to summary(head(iris))
    which one to use is a matter of preference.
    
    Note: there is also %in%
    
    

    apply family of function

    see R-bloggers
    apply(t(beaver1),1,max) ## apply function max to data, 1=row-wise
    lapply() returns a list
    sapply() returns a vector instead of a list
    tapply() does grouping before running summary function on them
    by() ... think of GROUP_BY in SQL
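
    a small sketch of the differences, using the built-in iris data set (chosen here just for illustration):

    lapply(iris[1:4], mean)                          # list of per-column means
    sapply(iris[1:4], mean)                          # same, simplified to a named vector
    tapply(iris$Sepal.Length, iris$Species, mean)    # group by Species, then mean within each group
    by(iris[1:4], iris$Species, colMeans)            # GROUP_BY-like summary per Species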
    
    
    

    Libraries

    likfit

    tbd

    ModeSim

    geoR::likfit
    
    ## likfit - Likelihood Based Parameter Estimation For Gaussian Random Fields
    ## https://www.rdocumentation.org/packages/geoR/versions/1.8-1/topics/likfit
    
    SimAnneal()
    kridge
    
    ~
    
    raster::extract()
    
    
    makeCluster
    registerDoParallel
    %dopar%
    
    

    Python using R

    1. Quick hack: Just write to file, in /dev/shm so it is RAM based.
    2. https://towardsdatascience.com/from-r-vs-python-to-r-and-python-aa25db33ce17
    3. PypeR - R within Python
    4. pyRserve - connect to a remote R server
    5. rpy2 - runs embedded R in a Python process

    R using python

    1. rPython
    2. SnakeCharmR - fork, more modern ver of rPython
    3. PythonInR
    4. Reticulate [cheatsheet]

    Parallelization in R

    2 categories of parallelization:
      - Shared memory: ie, multiple cores, single machine memory space.  typically fork-based.
      - Distributed memory: multiple machines (multiple OS instances), using Socket, MPI, etc.
    The main types/bundles of parallelization in R:
  • same host, FORK based, parent/child memory model with copy-on-write: mclapply, mcparallel, mccollect. multicore is easy if the code already uses lapply. Not usable in Windows natively as it does not support fork().
  • Rdsm - shared memory within a single computer, so multiple PSOCK same-node processes can read/write to 1 single memory region.
  • PSOCK based multi-node parallelization, without needing MPI (cluster): SNOW: parLapply, clusterApplyLB
  • MPI based -- SNOW cluster implement on this as well.
  • pbdR, hiding some of the complexity of MPI
  • Hadoop/SparkR, but that's a platform of its own.
  • Packages that do parallelism under the hood. eg BLAS, caret, many others
  • future_map(), purrr -- https://byuistats.github.io/M335/parallel_furrr.html


    Tutorials

    Chris Paciorek (UC Berkeley Stats Dept) Parallel Processing for Distributed Computing Covers R, Python, Matlab and C, as well as how to use them in a Slurm cluster.

    Jonathan Dursi Beyond Single-Core R a slide deck on Parallelization in R. Summary slides:
  • Packages that do parallelism under the hood -- https://ljdursi.github.io/beyond-single-core-R/#/22
  • parallel/multicore with mc* -- https://ljdursi.github.io/beyond-single-core-R/#/51
  • cluster* parLapply snow -- https://ljdursi.github.io/beyond-single-core-R/#/73
  • foreach doParallel isplit %:% -- https://ljdursi.github.io/beyond-single-core-R/#/98
  • Best Practices -- https://ljdursi.github.io/beyond-single-core-R/#/106
  • bigmemory morder mpermute mwhich sub.big.matrix -- https://ljdursi.github.io/beyond-single-core-R/#/133
  • Rdsm -- shared memory within node by multiple process -- https://ljdursi.github.io/beyond-single-core-R/#/142

    Dursi's github repo has data for running his example, and I wrapped that with a Jupyter Notebook container so everyone can try using this Docker Container with Jupyter notebook server:
    Start jupyter notebook web server (on specific port, eg 6997):
    
    docker run -p 6997:6997 -v "$PWD":/mnt -it --entrypoint=/opt/conda/bin/jupyter  tin6150/beyond-single-core-r  lab --allow-root  --no-browser --port=6997 --ip=0.0.0.0
    
    Point web browser to something like
      http://127.0.0.1:6997/
    and paste the token URL link shown on the terminal console
    
    /mnt is a mount of the current dir (PWD) where you started the docker process, and files written there will persist after the container is terminated.  (Other files inside the container are ephemeral!)
    

    Parallelization in R using mc* functions

    mclapply
    mcparallel (likely just called parallel in early days)
    mccollect
    mc.core

    Ref: Beyond single core R
    mc*        are multi-core routines
    mcparallel is good for parallel tasks
    mclapply   is good for data parallelism - easily replace lapply()
    mc.cores   really should represent number of tasks, core can be oversubscribed often with little perf degradation
    
    parallel::mclapply - multi-core lapply - https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html
    
    instead of 
    	result = lapply(starts, fx)
    use
    	result = mclapply(starts, fx, mc.cores=4)
    mclapply gathers up the responses from each of these function calls, and returns a list of responses that is the same length as the list or vector of input data (one return per input item).
    
    starts is a variable holding data, eg rep(100, 40)
    fx is a custom function that lapply() runs
    
    pvec - for simple use case of applying func to each element of vector and return vector, pvec is easier to use
    # https://ljdursi.github.io/beyond-single-core-R/#/48
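
    a minimal pvec sketch (sqrt chosen just as an example of a vectorized function):

    library(parallel)
    v = 1:1000000
    res = pvec(v, sqrt, mc.cores=4)   # splits v into chunks, applies sqrt to each chunk in parallel
    identical(res, sqrt(v))           # TRUE
    # like mclapply, pvec is fork() based, so mc.cores > 1 does not work natively on Windows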
    
    
    
    https://ljdursi.github.io/beyond-single-core-R/#/30
    mcparallel  # fork a task , run in background
    mccollect   # wait and collect result.  so kinda like MPI scatter/gather
    
    
    library(multicore)                    ## functionality replaced by library('parallel') 
    fun1 = function() {Sys.sleep(10); 1}
    fun2 = function() {Sys.sleep(5);  2}
    f1 = parallel(fun1())                 ## new fn name is mcparallel()
    f2 = parallel(fun2())
    result = collect(list(f1, f2))
    
    result[[1]]  # would contain result of f1
    
    
    Linear Regression Model fitting, in series vs parallel
    https://ljdursi.github.io/beyond-single-core-R/#/32 and slide 33
    
    fit1 =  lm(DISTANCE  ~ AIR_TIME,  data=jan2010)
    fit2 =  lm(ARR_DELAY ~ DEP_DELAY, data=jan2010)
    
    -vs-
    
    parfits = function() {
         pfit1 = mcparallel(lm(DISTANCE  ~ AIR_TIME,  data=jan2010))
         pfit2 = mcparallel(lm(ARR_DELAY ~ DEP_DELAY, data=jan2010))
         mccollect( list(pfit1, pfit2) )
    }
    parfits()
    
    Note that mccollect asked for the tasks' output to be combined into a list.
    Some data structure may be needed to extract more complex model results??
    
    
    domino data lab on caret/doMC
    
    library(doMC)
    library(caret) # popular ML using foreach to parallelize CV-folds, etc.
    	fit.lda.ser vs
    	fit.lda.par
    
    

    Parallelization in R using cluster functions

    makeCluster()
    stopCluster()
    doParallel()
    registerDoSEQ()
    snow
    foreach
    %do%
    %dopar%
    
    for is run for its side effects;
    foreach returns a result, and side effects may be gone (when using a parallel backend).
    foreach should be thought of as lapply, not a for loop (see the small sketch below).
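
    a tiny sketch of that difference (serial %do%, so no backend is needed):

    library(foreach)
    squares = foreach(i = 1:3, .combine = c) %do% { i^2 }
    squares                                    # 1 4 9  -- the loop itself returns the result
    # the for-loop equivalent only "returns" via a side effect on a variable:
    squares2 = c(); for (i in 1:3) { squares2 = c(squares2, i^2) }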
    
    
    %do% is serial
    %dopar% is parallel, and need a backend.
    Parallel libs: doParallel, doMC, doMPI
      cl = parallel::makeCluster(4)
      cl = parallel::makeCluster(4, type = "FORK")   # unix fork() based, not usable in windows natively.
      cl = parallel::makeCluster(4, type = "PSOCK")  # default.  Parallel Sockets on a SNOW cluster with 4-node
      cl = parallel::makeForkCluster(4)
    
    use registerDoSEQ() if want to force foreach to run in serial 
    
    much of the parallelization is done using fork() in unix (may not work in windows).
    Also, fork() allows access to vars/memory from the parent, but results are not returned at the end.
    Need to do some special collection/return if you need to save those data.
    Thus, "side-effects" such as those produced by `for` vanish in parallelization.
    
    SNOW - like PSOCK, an entirely new process, so need to explicitly copy data.  But the process can be on a remote machine.
    
    
    https://ljdursi.github.io/beyond-single-core-R/#/55
    clusterCall()
    clusterEvalQ() # easier to use for tasks
    
    
      cl = parallel::makeCluster(4, type = "PSOCK")  # Parallel Sockets creates a brand new session in each node, 
    						 # R copies data and .packages to each node, so more overhead than fork().  
    						 # 4 here indicate create a 4-node SNOW cluster, which are just cores in a SMP machine.   
    						 # for multinode, provide a vector of hostnames, which R need to ssh to (and run 1 core per machine?) 
    						 # https://stackoverflow.com/questions/36794063/r-foreach-from-single-machine-to-cluster
      multinode foreach call would need to have param for .packages to copy to each node, as well as .export for variables  eg:
      foreach(i = 1:nrow(data_frame_1), .packages = c("package_1","package_2"), .export = c("variable_1","variable_2"))  %dopar% { ... }     
    
    If the foreach loop is in the global environment, variables should be exported automatically. If not, you can use .export = ls(globalenv()) (or .GlobalEnv).
    per https://stackoverflow.com/questions/45767416/how-to-export-many-variables-and-functions-from-global-environment-to-foreach-lo
    
    
    
    

    doParallel

    One way to do parallel processing in R
    NumThread=15
    library(doParallel)
    cl = makeCluster( NumThread, outfile="mkcluster.out" )   ## create a cluster object
    registerDoParallel(cl)
    foreach(i=1:3) %dopar% sqrt(i)
    # the %dopar% says to run the foreach loop in parallel for each item
    # all console output inside cluster object will go to outfile 
    # there are options like .combine=rbind to merge results
    stopCluster(cl)
    
    .combine=rbind # row-bind, ie, each result becomes a row
    .combine=cbind # col-bind, ie, each result becomes a col
    .combine=c     # c(), concatenate results into a vector
    
    
    detectCores() ## return number of cores on machine
    
    
    Ref:

    Rdsm

    https://ljdursi.github.io/beyond-single-core-R/#/139
    library(parallel)
    library(Rdsm)
    
    nrows = 7
    cl = makePSOCKcluster(3)       # form 3-process PSOCK (share-nothing) cluster
    init = mgrinit(cl)             # initialize Rdsm
    mgrmakevar(cl,"m",nrows,nrows)  # make a 7x7 shared matrix
    bar = makebarr(cl)
    
    clusterEvalQ(cl, myid <- myinfo$id)
    
    clusterExport(cl,c("nrows"))
    dmy = clusterEvalQ(cl, myidxs <- getidxs(nrows))
    dmy = clusterEvalQ(cl, m[myidxs,1:nrows] <- myid)
    dmy = clusterEvalQ(cl,"barr()")
    
    print(m[,])
    
    
    

    Clustering on Clusters

    https://ljdursi.github.io/beyond-single-core-R/#/57
    ship data to other nodes of the cluster:
    cl = makeCluster(8)
    clusterExport(cl, "jan2010")
    cares = clusterApply(cl, rep(5,4), do.n.kmeans)
    stopCluster(cl)
    mcres = mclapply(rep(5,4), do.n.kmeans, mc.cores=4)
    
    
    https://ljdursi.github.io/beyond-single-core-R/#/59
    # Running across machines:
    # 2 machines, each with 8 cores/processes?
    
    hosts = c( rep("localhost",8), rep("192.168.0.10", 8) )
    cl = makePSOCKcluster(names=hosts)
    clusterCall(cl, rnorm, 5)
    clusterCall(cl, system, "hostname")
    stopCluster(cl)
    
    

    Snow

    https://ljdursi.github.io/beyond-single-core-R/#/61
    scheduling
    viz of run
    
    library(snow,quiet=TRUE)
    
    do.kmeans.nclusters <- function(n) { kmeans(jan2010, centers=n, nstart=10) }
    
    cl = makeCluster(2)
    clusterExport(cl,"jan2010")
    
    tm = snow.time( clusterApply(cl, 1:6, do.kmeans.nclusters) )
    plot(tm)
    
    # load balance, below is more like mc.preschedule=FALSE, kicking off task to worker only when needed
    tm.lb = snow.time(clusterApplyLB(cl, 1:6, do.kmeans.nclusters))
    plot(tm.lb)
    
    
    https://ljdursi.github.io/beyond-single-core-R/#/73 - Summary for parallel/snow
  • clusterExport good for SIMD
  • clusterApplyLB good for MIMD
  • mcparallel is best for MIMD?
  • parLapply
  • makePSOCKcluster for small cluster
  • makeMPIcluster for larger cluster -- but read up on pbdR first
  • pbdR -- https://ljdursi.github.io/beyond-single-core-R/#/143
  • pbd*apply -- https://ljdursi.github.io/beyond-single-core-R/#/152
  • using pbdR -- bring hpc distributed memory processing (SPMD) to R, see section 3.1.2 of CPaciorek's Parallel Processing for Distributed Computing
  • https://ljdursi.github.io/beyond-single-core-R/#/83
    foreach parallel
    .multicombine=TRUE
    
    %:% nest foreach object # slide 84
    
    
    
    MPI cluster
    https://ljdursi.github.io/beyond-single-core-R/#/149
    
    init()
    rank = comm.rank()
    ...
    allreduce(length(dta), op="sum")
    comm.print(data.min)
    
    finalize()
    

    nestedParallel

    https://stackoverflow.com/questions/21306027/nesting-parallel-functions-in-r
    
    cl = makeCluster(2, type = "SOCK")
    clusterCall(cl, function() {
      library(doSNOW)
      registerDoSNOW(makeCluster(2, type = "SOCK"))
      NULL
    })
    registerDoSNOW(cl)
    
    


    Parallel R Baggage

    Some history of the R parallel back ends...
    see: https://stackoverflow.com/questions/28989855/the-difference-between-domc-and-doparallel-in-r
    doParallel is essentially merge of multicore (doMC) and snow.
    
    multicore is w/in same machine, so faster than snow.  
    multicore is fork based.
    snow was intended to leverage HPC MPI stack.   so registerDoSNOW() has more overhead than registerDoParallel().  
    
    parallel is a merger of snow and multicore
    
    
    registerDoParallel vs registerDoSNOW
    registerDoParallel() could be invoking the multicore backend and use doMC/fork when on single node, thus less overhead than registerDoSNOW()
    
    registerDoSEQ() is run in serial, useful for debugging to test whether parallel code produce same result as sequential code.
    
    SNOW use PSOCK, easier to deal with than MPI.
    
    

    EpiStat

    Stat for Epidemiologist
    library("EpiStat")
    
    
    # not sure if these all are from this library, but what I am learning in epi1.
    
    nhanes <- read.csv("ps7_nhanes.csv",header=TRUE)
    
    # converting a numeric 1, 2 to categorical male, female, by creating a new col (sex_cat) in the existing table (nhanes)
    nhanes$sex_cat = ifelse(nhanes$sex==1, "Male", "Female") 
    
    tab1x2 <- table(nhanes$sex) # cmd to create table ... by counting freq of values in col sex
    
    table2x2 <- table(nhanes$dpq_bin, nhanes$vitd_def)  # create 2x2 epi table
    addmargins(table2x2)   # addmargins create total (row and col) 
    
    
    # express table "nhanes" entries as proportions
    prop.table(tab2)    
    # OR
    proportions(tab2)
    
    sum(tab2) # return a scalar containing total freq (ie count)  ## here think of count row by group
    
    help(CS) # Cohort Study measuring Risk
    CS( x,         "cases",   "exposure", full=FALSE)
    CS( tableName, "outcome", "exposure", full=FALSE)
    CS(nhanes,     "dpq_bin", "sex_cat",  full=FALSE) 
    
    

    R studio cheatsheets eg on data viz using ggplot, dplyr, etc.

    Rnd tidbits

    shortcuts: R allows abbreviating option names when they can be uniquely identified. eg:
    Both of these accepted by R:
    
    Result = t.test(Data, mu=8, alternative="two.sided",  conf.level=0.95)
    Result = t.test(Data, mu=8, alt="two.sided",          conf=0.95)
    
    This is why col='Red' and color='Red' are both allowed in ggplot.
    
    PS. alternative refers to the alternate hypothesis (vs null hypothesis).
    
    
    
    attributes
    Fn like t.test(), binom.test() return a sizable list of things.  They are internally separated by "attributes" and can be retrieved directly.  eg
    attributes( Result )
    
    Result$conf.int
    Result$p.value
    Result$parameter    # only df is returned as "parameter" 
    
    
    ref: R tutorial on Student t Test

    PH142 - Stats for Public Health

    Much of the R code below is from the PH 142 stat class by Prof Dr Mi-Suk Kang Dufour

    course core concepts - ccc

    in: x-axis value.  out: AUC, which is prob of event x
    	d- (density) and p- (distribution) class of function.  eg dnorm, dpois , pbinom,
    in: AUC, a probability of event.  out: x-axis value.  eg, have 60% chance of event, what is the expected value.
    	q- (quantile) class of function.  eg qpois, qbinom,
    generate a list of possible values according to a distribution.
    	r- (random) class.  eg rnorm(n,mean,sd)
    eg: sim_01 = rbinom(n=200,size=1,prob=0.3030) # think of the rbinom as "200 choose 1, with prob of 30.30%" but not quite?
    
    
    	• dnorm(x=…, mean = 0, sd = 1 )    density.  (for dbinom this is an exact probability, but for a Normal dist of a continuous var, a point probability does not usually make sense)
    	• pnorm(q = 1.2, mean = 0, sd = 2), to calculate the cumulative probability below a given x-axis value
    	• qnorm(p = 0.75, mean = 0, sd = 1) to calculate the x-value for which some percent of the data lies below it (ie input is AUC, a probability value, thus named p=… in first input argument)
    
    	• dbinom(x=#success,size,prob of success)    k=number of successful event desired, exact match.
    	• pbinom(q=k, size=10,prob=0.29) binomial, CUMULATIVE prob, X ≤ k. [Lect 16 slide 35]
    	• qbinom(p, size, prob, lower.tail=T)
    	• rbinom(n=200,size=1,prob=0.3030) # think of the rbinom as "200 choose 1, with prob of 30.30%" but not quite?
    
    	• dpois(x=…, lambda=mu) : discrete prob, X=x exact value.
    ppois(q=k, lambda=mu)  : cumulative, up to X ≤ k
    
    
    calculating critical cut-off values of z.  [wk8 p119]
    eg z* for 90 or 95% Confidence Interval
    z.star95 = qnorm(0.025, lower.tail=F)  #1.96
    z.star90 = qnorm(0.05,  lower.tail=F)  #1.64
    
    
    


    Exam 1: Review

    Constructing 2x2 or RxC contingency table
    # Lab 11 [wk14] video @ 30:15
    # easiest way to generate 2x2 table out of raw data!
    d.frame %>% group_by(exploratoryVarColumn) %>% count(ResponseVarColumn)
    
    # manually
    # wk14 Lect 32 slide 33
    two_way_tibble = tribble( ~disease, ~ noDisease, # header create col for response vars
                                 16,      84,        # treatment grp / exploratory var 1 
                                  3,      97)        # treatment grp / exploratory var 2 
    
    

    Course: Review

    What test to use? Decision tree
    
    
    input:  continuous
    output: continuous
    Test:   lm(), scatter plot with linear fit.  [or convert to log scale first, or quadratic, etc]
    
    ~~~~~
    
    What test to use?  Decision tree: 
    
    output: CONTINUOUS variables as OUTCOME, 
    input:  1 , 2 or more categories, form basis of decision tree.
    
    eg experiment/sample has a new x.bar as result.
    essentially, non True/False binomial output.
    [wk15 video@28:45 p16]
    ch 25 of text does decision tree a bit differently.
    
    Num of (input) Group/Categories
      |
      +----- One ---know sigma? --+--→  Z-test (know sigma)     parametric, definition of Normal dist
      |                           +--→  t-test (unknown sigma)  still parametric, assume  Normal distribution
      |                           \--→  bootstrap (non parametric)
      |
      +----- Two ---+--- Independent? --+-- Non-parametric -→ Wilcoxon Rank Sum 
      |             |                   \-----  param      -→ two-sample T test
      |             |
      |             \--- Non-Indep    --+-- Non-parametric -- Wilcoxon Signed Rank
      |                                 \------ param      -- paired T test
      |
      +--- Three+ --+--- non-parametric ---Kruskal-Wallis
      |             \------  parametric -- anova  ----- if support H_a , subsequent post-hoc test, eg Tukey HSD or Bonferroni correction tests
      |
      \--- continuous --→ Linear Regression  lm()
    
    t-test can be used when data is large enough, 
    from Central Limit Theorem that when data is large enough --→ approx Normal dist.
    
    ~~~~
    
    Categorical variables (as input and output)
    Input:  exposure Yes/No  (Categorical, proportions of Yes in explanatory [input] var).  calc p.hat1 vs p.hat2 w/in each of these groups!
    Output: disease  Yes/No  (Categorical, proportions of Yes as response)
    The underlying data has a binomial distribution; we are counting "number of Yes (success)"
    And still need #success ≥ 10 ; #failure ≥ 10 (so that binomial approx Normal)
    ONE group/cat:   eg juror by race distribution, COVID deaths by race dist, does the count match their pop% for each race (chisq GOF)
    TWO groups/cats: Common with 2x2 or RxC contingency tables.
    
    
    Num of Group/Categories
      |
      +----- One --- --+---- binary (2 groups) ----→  1 sample proportion (Z based)
      |                |
      |                \--→  mult cat of outcome --→  chi-sq Goodness of Fit GOF (non parametric)
      |
      +----- two --- --+---- binary   -------------→  2 sample proportion (Z based)
      |                |
      |                \--→  mult cat of outcome --→  chi-sq test of Independence (non param)  [or Rnd Permutation]
      |                                                   ^
      |                                                   |
      \---- Mult -----------------------------------------/
    
    
    Chi-squared test requirements:
    Ei ≥ 5 for 80% of the cells (in a RxC contingency table)
            2x2 table then need to have each cell ≥ 5
    Ei ≥ 1 in all cells (here used ≥ rather than a strict greater-than!) [video @34:56]
            10x2 table, then some small cell can be ≥ 1 (but NOT zero success)
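
    a minimal sketch of running the test itself (counts reused from the tribble example above):

    tab = matrix( c(16, 84,
                     3, 97), nrow=2, byrow=TRUE )
    chisq.test(tab)             # X-squared, df, p-value (2x2 tables get Yates' continuity correction by default)
    chisq.test(tab)$expected    # the Ei values, to check the requirements above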
    
    
    Testing method:
    
    Null hypothesis:  
    
    signal
    ------- = mu   compare to some distribution
    noise  
    
    CI:
    
    signal estimate -/+ theoretical dist * (noise)
    
    
    
    Final review slide by GSI, p91.
    Cover a case of 
    input: categorical var
    output: continuous var
    Use Linear Regression with Categorical Vars.
    (not covered in lecture of Stat142?)
    
    
    

    Exam 1: Review

    dplyr
    
    PPDAC = Problem, Plan, Data, Analyze, Conclusion.
    
    3 main problem types for stat using the PPDAC model:
    ● Descriptive: learning about some particular attribute of a population
    ● Causative/Etiologic: do changes in an _explanatory_ variable cause changes in a response variable?
      eg: study the whole scatter plot of how x relates to y
    ● Predictive: how can we best predict the value of the response variable for an individual
      eg: laser focus on individual, how to use all the stat knowledge and figure out what will happen to 1 person
    
    
    Note it is explaNAtory, not exploRAtory, for the x vs y axis interpretation!
    
    
    col operation
    select() 	# awk picking column
    mutate() 	# add computed column
    rename() 	# change col name
    
    
    row operation
    arrange() 	# sort
    filter()	# grep 
    
    think SQL summary operation
    group_by() 
    summarize() 
    
    
    # create data.frame
    OFC1=c(.379,4.1,2.01,12.6,4.98,1.92)
    OFC2=c(6.31,7.94, 9.46,11.0, 11.1, 8.9)
    OFCdata = data.frame(OFC1,OFC2)
    OFCdata
    
    # lm() only works on dataframe, not matrix.  
    fit = lm(OFC1 ~ OFC2, data=OFCdata)
    library(broom)
    tidy(fit)
    # tidy produce intercept, slope, and stats like std.error, p.value of the fit.
    
    
    # marginal distribution vs conditional distribution, p 35
    # marginal refers to margin, ie whole group of the 2x2 table.  
    # Essentially just the whole pop prob(A)
    # conditional is a narrower group.  P(A|B) kind of stuff.
    
    # Simpson's Paradox is tied to confounder (lurking variable).
    # when looking at the whole population, a different pattern emerges, 
    # because the confounder was the cause, and it was not being measured
    # review deck page 37
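
    # a small sketch of marginal vs conditional proportions, assuming a 2x2 table
    # like table2x2 from the EpiStat section above:
    prop.table(table2x2)                # joint proportions (divide each cell by the grand total)
    prop.table(table2x2, margin=1)      # conditional: each row sums to 1, ie P(outcome | row group)
    prop.table(table2x2, margin=2)      # conditional on columns instead
    addmargins(prop.table(table2x2))    # adds the marginal totals along the edges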
    
    
    
    ggplot
    
    ggplot(data = ______, aes(x = var1, y = var2)) +
    geom_point(aes(col = var3)) +
    facet_wrap(~ var4) +
    labs(title = "Main", x = "Type x-axis here", y = "Type y-axis here")
    
    facet_wrap(      ~ var1, nrow=3 )	# create a series of ggplots based on a single var1
    facet_grid( var1 ~ var2 )        	# create a n*m grids of graph for var1 vs var2  [review page 61]
    
    stat="identity" # when y is specified  # arg of geom_histogram() (and geom_bar?)  
    # eg lab06 line 300
    p16 = ggplot(data=obs_data,aes(x=x_vals,y=probs)) + geom_histogram( binwidth = 1, stat="identity" )
    
    geom_bar( ... ) params: 
    position="dodge"	# side by side grouping in barchart
    position="stack"  	# on top of, less liked by stat wanting to do comparison.
    
    

    Exam 1: Review

    slide 65
    food = read.csv("food_supply.csv")  # food is a df
    
    # convert select columns to vector, for easier average calc
    vector_of_alcohol_supply = food$alc.beverages
    sum(vector_of_alcohol_supply) / length (vector_of_alcohol_supply)
    
    # dplyr finding average and created as df
    food %>% summarize( mu=mean(vector_of_alcohol_supply) )
    
    
    

    Lecture 1: Intro to R

    # File menu, new, R Markdown doc = will generate a new doc with basic example in it
    # knit button in toolbar will render the doc for web view
    # To speed up knit, one time run in console.
    tinytex::install_tinytex()
    # http://rmarkdown.rstudio.com = Rmd info
    # http://socviz.co/gettingstarted.html = tutorial for Rmd and knit
    
    ```{r code-chunk-desc, echo = FALSE}
    print("R code chunk goes between ```{r}  ``` in Rmd")
    ```
    
    ```{r another-code-eg, eval = FALSE}
    # can use F or FALSE, T or TRUE
    # echo = F   # do not echo command into output by kinit
    # eval = F   # do NOT evaluate/interpret as code to run, just do verbatim text output [also for output of knit]
    
    bad-var-name = 0  # don't use - in a variable name, R treats it as math subtraction!  use _
    ```
    
    ```{r}
    # this one below is example of invoking autograder to check this checkpoint
    . = ottr::check("tests/p1.R")
    ```
    
    
    help("library")
    ?library
    ??histogram	# double ? for "topic" search
    
    functions/packages introduced in this lecture:
    - ggplot2, 
    - filter() aka stats::filter()
    
    
    Resources:

  • Hadley Wickham and Garrett Grolemund have written a helpful book about R that is freely available here: https://r4ds.had.co.nz/
  • UCLA has a website with many helpful examples here: https://stats.idre.ucla.edu/r/


    Lecture 2: Working with data in R

    library(readr)                              # provides read_csv
    lake_data = read_csv("mercury-lake.csv")		# read csv into a tibble / data frame  (R is 1-indexed).
    
    
    head(lake_data)		# show first few lines
    lake_data %>% head	# same, using unix pipe style 
    dim(lake_data)		# dimension as num of rows, cols
    str(lake_data)		# structure, compact display of the kind of data in the array
    names(lake_data)	# aka colnames(lake_data), show column heading name  
    			# this eg has 6 cols:  lakes ph chlorophyll mercury number_fish age_data
    
    # be careful with spaces in csv: no space allowed after the comma, or R will read the space into the data structure, no automatic stripping.
    # foo,TRUE will be parsed as a logical, but "foo, TRUE" will just be a string.
    # TSV, tab delimited, maybe better...
    
    
    library(dplyr) 		# PH142 use dplyr extensively
    
    # rename column heading names, note first param is the array variable holding the data.
    # new_array    = rename(old_array,  new_col_name = old_col_name)
    lake_data_tidy = rename(lake_data,  name_of_lake = lakes       )
    #                                               ^^^ don't use <- here      
    
    # rename multiple cols at same time:
    lake_data_tidy = rename(lake_data,                  # array_name
                               name_of_lake = lakes,    # new_name = old_name
                               ph_level = ph)
    
    
    # using %>% pipe method.  think unix bash cli:
    lake_data_tidy = lake_data  %>% rename( new_name = old_name )
    
    
    # there are some nuances with pipe, if run like this, rename result NOT stored back into the array!
    # because there were no assignment operation where the rename() line ran
    new_lake_data_tidy = lake_data  
    new_lake_data_tidy              %>% rename( new_name = old_name )
    new_lake_data_tidy %>% head
    
    
    
    # selecting (choosing columns) to keep, think SQL
    # think awk to select which column to keep
    # new_array  = select(old_array,  col1,  col2, col3       )   # first param for select() is source array/table
    smaller_data = select(lake_data,  lakes, ph,   chlorophyll)
    smaller_data %>% head(2)
    
    # Negative select, ie drop named columns, note the minus sign before the column name
    # new_array    = select(old_array,  - col_name_to_drop)
    smaller_data_2 = select(lake_data,  - age_data        )
    
    
    # drop multiple columns, using pipe operator syntax
    smaller_data_3 = lake_data %>% select(- age_data, - ph)
    smaller_data_3 %>% head(2)
    
    View(lake_data)			# interactive table to view array data, note it is uppercase V.  utils::View(df)
    
    # arrange() = display data with sorting		# think sort, display rows in new order (but no acutal change to the data)
    lake_data %>% arrange(ph)			# show data sorted by value in the ph column, ascending order is default
    lake_data %>% arrange(desc(ph))			# show data sorted by value in the ph column, DESCending order
    lake_data %>% arrange(age_data, - ph)		# can sort by mult cols, and ph in descending order
    
    
    # mutate() perform operation on a column, adding it as new col to a new array
    # new_array        = old_array %>% mutate(new_column_name     = old_col_name OPERATION )
    lake_data_new_fish = lake_data %>% mutate(actual_fish_sampled = number_fish    * 100   )
    
    
    # chain multiple operations together using pipe operator
    tidy_lake_data = lake_data %>% 
      rename(lake_name = lakes) %>%
      mutate(actual_fish_sampled = number_fish * 100) %>%
      select(- age_data, - number_fish)
    
    tidy_lake_data %>% head(3)
    
    
    
    # filter() - this is like grep, extract select rows.  eg keep record where col age_data has value "recent"
    # new_array        = old_array %>% filter(col_name  FILTER_CONDITION             )
    lake_data_filtered = lake_data %>% filter(age_data   ==     'recent'             )
    lake_data_filtered = lake_data %>% filter(age_data   %in% c('recent','str2')     )
    # %in% with a concat set of strings allow multiple == conditions to be specified.  eg match one of the listed string in the c() set.
    # use != for not equal
    
    
    # filter() also works with comparison/math operations
    # eg: only keep records where the condition is satisfied (in col named ph, keep rows whose value is larger than 7)
    lake_data_filtered = lake_data %>% filter(ph > 7)
    lake_data_filtered = lake_data %>% filter(age_data == "recent")   # note ==, not = (and quotes, not back ticks)
    
    # use %in% operator
    # eg, filter (ie show) rows where lake (name) is Alligator or  Blue Cypress:
    #                               VVV--- c() is the concatenate function in R to create a list.
    lake_data %>% filter(lakes %in%  c("Alligator", "Blue Cypress")  ) 
    
    
    # multiple filters at once, comma or & separated:
    lake_data %>% filter(ph > 6 , chlorophyll > 30)
    lake_data %>% filter(ph > 6 & chlorophyll > 30)
    #                          ^^^--- use of & is optional, comma also implied to be an AND operation
    lake_data %>% filter(ph > 6 | chlorophyll > 30)
    #                          ^^^--- vertical bar is an OR operation
    
    
    
    
    # group_by() and summarize() can create new summary tables with aggregate data
    # note fn name is summarize, summary() does a very different thing!
    # mean() and sd() calculates the mean (average) and standard deviation of variables.
    
    lake_data %>% 
      group_by(age_data) %>% 
      summarize(mean_ph = mean(ph))
    
    # eg multiple summary operations
    lake_data %>% 
      group_by(age_data) %>% 
      summarize(mean_ph               = mean(ph),
                standard_deviation_ph = sd(ph),
                length                 = length(ph) )   # this counts the number of items in the group  # wk11 reader p8
    
    length()    # length( any_col_name ) gives the number of rows, typically used as the sample size n.
    nrow( nhanes %>% filter( hs=='No' ) )   # this counts the number of rows, only those where hs == 'No'.  # Lab09 wk12
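    
    The more idiomatic dplyr way to count rows per group is n() inside summarize(); a minimal sketch, assuming the nhanes data frame and hs column above:
    nhanes %>% group_by(hs) %>% summarize( n = n() )   # one row per hs category, with its row count (sample size)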
    
    

    Reference material:
    - [15 min intro to dplyr](https://www.youtube.com/watch?v=aywFompr1F4)
    - [Data wrangling cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

    Lecture 3: Visualizing/plotting data

     - Visualization of categorical data: use `ggplot`'s `geom_bar()`
     - Visualization of continuous  data: use `ggplot`'s `geom_histogram()`
    
    
    1. `ggplot()` to set up a canvas for graphics
    2. `geom_bar(stat = "identity")` to make a bar chart when you specify the y variable
    3. `geom_histogram()` to make a histogram for which ggplot needs to calculate the count
    4. `fct_reorder(var1, var2)` to reorder a categorical variable (`var1`) by a numeric variable (`var2`)
        - from the `forcats` package
    4a.`fct_relevel( apgar5Cat, "Low", "Med")` to reorder a categorical variable by hand  # Lect 5, Prj 3
    5. `geom_point()` to make a plot with dots  (scatter plots also use this)
    6. `geom_line()` to make a plot with lines
    7. `geom_boxplot()` to make a box and whisker plot 
    8. `geom_jitter(width=.1)` to add points with jittering to boxplot
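    
    A minimal sketch combining items 7 and 8, assuming the lake_data columns used earlier (age_data as the category, ph as the value):
    ggplot(lake_data, aes(x = age_data, y = ph)) +
      geom_boxplot() +
      geom_jitter(width = 0.1)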
    
    
    colors() # returns the color names that R has built-in (also spelled colours())
    
    
    id_data = read_csv("Ch01_ID-data.csv")
    library(ggplot2)
    ggplot(id_data, aes(x = disease, y = percent_cases)) 	+   # aes specify axis
      geom_bar(stat = "identity")   +                           # geom_bar specify plotting bar chart
      labs(y = "Percent", x = "")   +                           # label the axis
      theme_minimal(base_size = 15)                             # font size for labels
    
    
    
    eg2: 
    fct_reorder()                    # order disease by decreasing percentage -- library(forcats) ?
    aes( fill=type )                 # aes in geom_bar: color bar graph according to disease type (one of the col in data table)
    theme(legend.position = "top")   # add legends of color vs types at top of plot
    
    id_data = id_data %>% 
      mutate(disease_ordered = fct_reorder(disease,             percent_cases, .desc = T))
    #        ^^new_col_name^               ^^what to rearrange  ^^order by this field/col.
    
    
    # used in group project part 3
    cord_clamp_song_ext = na.omit(song) %>%
      mutate(  apgar5Cat  =   case_when(       apgar5==0  ~ "Low" ,
                                               apgar5==6  ~ "Low" ,
                                               apgar5==7  ~ "Med" ,
                                               apgar5==9  ~ "High" 
                                               ) ) %>%
      mutate( apgar5Cat = fct_relevel( apgar5Cat, "Low", "Med", "High" ))
    
    
    ggplot(id_data, aes(x = disease_ordered, y = percent_cases)) + 
      geom_bar(stat = "identity", aes(fill = type)) +
      labs(y = "Percent", x = "")      +
      theme_minimal(base_size = 15)    +
      theme(legend.position = "top")
    
    
    
    eg3, geom_histogram()
    
    
    - Histograms look a lot like bar charts, except that the bars touch because the 
    underlying scale is continuous and the order of the bars matters
    - In order to make a histogram, the underlying data needs to be **binned** into 
    categories and the number or percent of data in each category becomes the height 
    of each bar. 
    - the **bins** divide the entire range of data into a series of intervals, and the histogram counts
    the number of observations in each interval
    - the intervals must be consecutive and non-overlapping and are almost always chosen to be of equal size
    
    
    geom_point() produces an x-y sequence of points (but not connected by line).  
    kinda like a x-y scatter plot, but seems like 1 value per x entry.
    
    geom_line() is like geom_point(), but draws a continuous line instead of a series of points.
    
    
    ### Lab 2
    
    ### df CS_data_raw is reassigned to itself!  
    ### so think of this as appending column to existing df.
    ### fct_relevel affects plot output (ordering of categories), but does NOT change the data in the source table (df)
    CS_data_raw <- CS_data_raw %>% 
      mutate(Income_Group_order = forcats::fct_relevel(Income_Group, "Low income", 
                                                  "Lower middle income", "Upper middle income", 
                                                  "High income: nonOECD", "High income: OECD"))
    
    ### df CS_data_raw is reassigned to itself!
    ### so think of this as appending column to existing df.
    ### mutate end up adding new column,  named Income_Group_order
    ### new col created using forcats::fct_relevel()
    ### if one does:
    str(CS_data_raw)
    ### it shows that the new column store "Factor" types data, albeit View() doesn't show any difference
    ### Income_Group_order: Factor w/ 5 levels "Low income","Lower middle income
    
    
    Factor as variable type to store categorical data, 
    so that it is sortable.
    eg
    
    CS_data %>% pull(Income_Group_order) %>% levels()
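    
    A minimal sketch of creating a factor with an explicit level order by hand (toy example, not from the lab):
    income = factor( c("High", "Low", "Med"), levels = c("Low", "Med", "High") )
    levels(income)   # "Low" "Med" "High"
    sort(income)     # sorts by level order, not alphabetically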
    
    lab02 p6 eg:
    ggplot(CS_data, aes(x=Income_Group)) + geom_bar()
      ### with geom_bar() the y-axis is the implicit count,
      ### stat = "identity" was not specified here, so it counted the x axis var (Income_Group)
      ### use x=Income_Group_order to use the Factor var.  
    
    
    p14 = ggplot(CS_data, aes(x=GDP_2006)) + geom_histogram( binwidth=8000)
    p14
    
    p14 = p14 + facet_wrap( ~ Income_Group_order )  # facet_wrap adds a series of panels to the ggplot, one per level of the "~ var_class" variable (Income_Group_order here)   ### hw02
    
    ggsave( "lab02_cs_plot.png", plot = last_plot())   # save plot to file
    
    ## hw02
    
    ### add a vertical line to indicate median.  lty=1 is solid line, 2 is dashed line, 3 is dotted.
    geom_vline(aes(xintercept = median(GDP_2006)),  lty=2)
    
    
    

    Lecture 4/HW02

    
    ### manually calculate (Min, Q1, Median, Q3, Max)
    ### (Note that R's calc method is very slightly different than textbook, unless add "type=2" as option to quantile() )
    ### and add them as yintercept line to a boxplot
    
    ### most R stat fns have na.rm=F as the default, which does not drop empty values/cells, so the result comes back as NA when such a cell is hit
    
    Min = min(CS_data$CS_rate_100, na.rm=T)
    Q1 = quantile(CS_data$CS_rate_100, probs = 0.25)
    Median = median(CS_data$CS_rate_100)
    Q3 = quantile(CS_data$CS_rate_100, probs = 0.75)
    Max = max(CS_data$CS_rate_100)
    Outliers: any data point more than 1.5*IQR below Q1 or above Q3.
    IQR: InterQuartile Range: Q3 - Q1
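    
    A minimal sketch of getting the IQR and the outlier fences in R (IQR() is base R; reusing Q1 and Q3 from above):
    iqr = IQR(CS_data$CS_rate_100, na.rm=T)    # same as Q3 - Q1
    lower_fence = Q1 - 1.5*iqr                 # points below this are outliers
    upper_fence = Q3 + 1.5*iqr                 # points above this are outliers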
    
    five_num_summary = CS_data %>% summarize( min = Min,
                                              Q1 = Q1,
                                              median = Median,
                                              Q3 = Q3,
                                              max = Max)
    fns2 = five_num_summary
    
    p14b =  ggplot(data = CS_data, aes(y = CS_rate_100)) +
              geom_boxplot(col = 'black', fill = 'sienna2') +
              theme_minimal() + 
              geom_hline(aes(yintercept = fns2$min), col = 'green') +
              geom_hline(aes(yintercept = fns2$Q1), col = 'green') +
              geom_hline(aes(yintercept = fns2$median), col = 'green') +
              geom_hline(aes(yintercept = fns2$Q3),     col = 'green') +
              geom_hline(aes(yintercept = fns2$max),    col = 'green') 
    
    
    
    

    Lecture 6

    
    ## read data from excel spreadsheet:
    library(readxl)
    spending_dat = read_xlsx("country-spending.xlsx", sheet = 2, range = "A7:D47" )
    
    
    library(tidyverse)  # loads the tidyverse packages (ggplot2, dplyr, tidyr, ...)
    library(dplyr)      # provides %>% and the verbs used below (already loaded by tidyverse)
    
    ## back ticks must be used when variable name has space in them
    ## thus use rename() to change the variable name (column heading)
    spending_dat = spending_dat %>%
    rename(country_code = Country.code,
           health_expenditure = `Health expenditure per capita`, # back ticks
           GDP = `GDP per capita`) # back ticks must be used when variable name has space in them
    # new_name =  old_col_name
    
    
    tidy(model_object)		# broom::tidy() turns a model object into a tidy data frame (one row per term)
    
    lin_model =  lm ( y ~ x ) 	# lm() fits a linear regression model, y ~ x (x is the predictor); add data=df when x and y are columns of a data frame df
    glance( lin_model )		# provide data like r^2, etc
    cor( x, y ) 			# Pearson's correlation = r // xref correlation-squared aka r^2, r2, r-squared [wk02 L05 slide 51]
    				# x and y are column names of table/dataframe, thus use inside summarize()
    mana_cor = mana_data %>% summarize(corr_mana = cor(powerboats, deaths))
    
    
    # ? correlation vs R^2 ??    
    # correlation is r, but calc often returns r^2 (R-squared) ## slide 25
    # so would need sqrt() if asked for the correlation r  
    # r^2, when *100%, represents the % of variance in y explained by the model.  R^2 of 0.92 means 92% of the variance is explained.
    
    # wk03 L06 slide 13
    # correlation r is related to slope b.  In linear model y = a + bx, b aka beta, in algebra often as m
    # b = r( Sy/Sx )
    # r is the correlation variable
    # pull( r.squared ) returns r^2, so take sqrt() to get r
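    
    A minimal sketch checking the b = r( Sy/Sx ) relationship, assuming hypothetical columns x and y in a data frame dat:
    fit = lm(y ~ x, data = dat)
    r   = cor(dat$x, dat$y)
    fit$coefficients[[2]]      # slope b from the linear model
    r * sd(dat$y) / sd(dat$x)  # same value, computed from r and the two standard deviations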
    
    
    library(broom)
    glance( lin_model ) %>% pull( r.squared )  	# this return the r^2 value.  # slide 25
    
    # is pull() like awk, grabbing column from data.frame?  ++ ??   select() pick columns from list (those with named entries)
    # pull() is from dplyr.  returns a vector (the same str as one returned by concat cmd c() )
    
    geom_abline() adds a straight (fitted/trend) line; it needs an intercept and a slope.
    geom_abline(intercept = b, slope = m)
    geom_abline(intercept = -46.75, slope = 0.1358)   # name the args; positionally the first arg is mapping, not intercept
    
    geom_text_repel()	# repel text, ie, don't render them with overlap. 	# slide 46
    
    m = linear_model$coefficients[2]   # slope  (it is a "named number" in R)
    m = linear_model$coefficients[[2]] # slope ( [[]] returns a number )
    b = linear_model$coefficients[[1]] # intercept
    
    
    geom_smooth( method="lm", se=FALSE)  # R automatic way to have trend line from lm()  ## Lab10 wk13
    geom_abline( slope=0, intercept=mean(boston2$medv), col="red" )  # horizontal line indicating mean value
    
    # stuff about prediction in p28, review++
    
    
    ## scatter plot: geom_point()
    mana_death1 = ggplot(mana_data, aes(x = powerboats, y = deaths)) +
      geom_point() +
      theme_minimal(base_size = 15)
    
    ## + line plot
    tmp1 = ggplot(CS_data, aes( x=CS_rate_100, y=GDP_2006) ) + geom_point() +  geom_line() 
    tmp2 = tmp1 + geom_smooth()  # add smooth line best curve fit
    
    # next, colour the points by income_group
    # note that the aes() is placed inside geom_point         -----------VVVVVVV
    tmp6  = ggplot(CS_data_log, aes( y=log_CS, x=log_GDP) ) + geom_point(    aes(col=Income_Group) )   + geom_smooth() 
    ugly6 = ggplot(CS_data_log, aes( y=log_CS, x=log_GDP) ) + geom_point() + aes(col=Income_Group)     + geom_smooth() 
    # if the aes() coloring by Income_Group is tacked onto the parent plot object, geom_smooth() gets split by Income_Group rather than spanning across the categories, as seen in ugly6 above
    
    
    
    
    

    Lecture 7

    
    lecture topic summary:
    1. geom_bar(aes(col = var), stat = "identity", position = "dodge")
    2. geom_bar(aes(col = var), stat = "identity", position = "stack")
         fct_relevel() to reorder grouping w/in bar charts
    3. Marginal vs conditional distributions
         Margins are the row and col totals
         conditional distribution is conditional probabilities, eg P(LungCancer|Smoker),
           look at the row value and row total only, eg A/(A+B),
           rather than the whole 2x2 table.
           English + Math interpretation should make sense.
    4. Simpson’s Paradox
    
    
    # slide 14, some crazy multiple ~ operator thing
    
    two_way_data <- tribble(~ smoking, ~ lung_cancer, ~ percent, ~number,
    "smoker", "lung cancer", 4.8, 12,
    "smoker", "no lung cancer", 95.2,238)
    
    

    Lecture 8

    
    # SRS = Simple Random Sample, slide 33, 38
    set.seed(123)    # make future run repeatable
    CS_100_1 = CS_data %>% sample_n(100)        # pick n=100 rnd sample from data source
    CS_100_1 = CS_data %>%  sample_frac(0.05)   # do 5% rather than static n size
    
    # Proprtionate Stratified Sampling , slide 40
    # group them, then sample; this draws from each stratum in proportion to its size (eg for an epi group randomized study)
    CS_10percent_grouped <- CS_data %>%
      group_by(HOSP_BEDSIZE) %>%
      sample_frac(0.1)
    dim(CS_10percent_grouped)
    CS_10percent_grouped%>%group_by(HOSP_BEDSIZE) %>% tally()   
    ## tally() can be used to do counts!   like SQL count(*) 
    ## there is also a count() fn, but it is much pickier than tally().  the following does work:
    CS_10percent_grouped%>%group_by(HOSP_BEDSIZE) %>% count()   
    
    
    Q9df = read.csv("c:/Users/tin61/tin-gh/inet-dev-class/r_stat142/exam/simpleDF.csv")
    Q9df %>% filter(status=="True") %>% group_by(zip) %>% tally()
    
    

    Lab/HW03

    case_when() eg, also note use of ~ :
    
    insure_data = insure_data %>%
      mutate(bmi_cat = case_when(bmi < 16 ~ "Underweight",
                                 bmi >= 16 & bmi < 25 ~ "Normal",
                                 bmi >= 25 & bmi < 30 ~ "Overweight",
                                 bmi >= 30 ~ "Obese"                   )  )
    #
    
    types of variables in stat:
    
    - 'ordinal'
    - 'nominal'
    - 'continuous'	# measured
    - 'discrete'	# counting
    Note that integers are considered discrete!  thus num_of_child "int" is discrete
    continuous is only for floats
    
    
    p11 = ggplot(insure_subset, aes(x=age,y=charges)) + geom_point() +
      labs(title="age vs insurance charges for smoker with normal bmi")
    p16 = p11 + geom_abline(intercept=b,slope=m)
    p20 = p16 + geom_abline(intercept=new_b,slope=new_m,lty=2,col="red")
    # lty=2 means use dashed line
    
    insure_model = lm(charges ~ age, data=insure_subset)    
    
    
    paused in line 387 ++FIXME++
    

    Lecture 11-13 - Probability (week 5, post Midterm 1)

    
    dplyr::sample_n( tbl=cold_data, size=5 )    # reader p24
    geom_density    # reader p50  like a smoothed histogram.
    
    
    ## Lab04 ?    HW04 
    round( 123.4567, digits=1 )   # round to 1 decimal place
    
    uncount()
    library(janitor)
    janitor::tabyl     # tabulate
    janitor::adorn_*   # family of formatting helpers (adorn_percentages, adorn_pct_formatting, adorn_ns, ...)
    
    
    # lab04 video ~[19:00]
    # Try using uncount() and tabyl() here!
    chd_dat_exp = chd_dat %>% uncount(n)     # n is the col holding the count of individuals; uncount() expands it to one row per individual, ending up with a long table!
    #chd_dat_exp
    
    chd_dat_exp %>%
      tabyl(Smoking, CHD,  Age) %>%
            ^^rows   ^col  ^stratified tables
    
    eg output:
    $old            #CHD=        #CHD=
     Smoking           no          yes
      +->  no 70.00% (490) 30.00% (210) <-- cell b: smoke=no,CHD=yes (age=old)
      \-> yes 40.00% (120) 60.00% (180)
              ^^^^^^^^^^^^
              cell c is smoke=yes
                        CHD=no
    
    
    chd_dat_exp %>% janitor::tabyl(Smoking, CHD, Age) %>%
      janitor::adorn_percentages("row") %>%
      janitor::adorn_pct_formatting(digits=2) %>%
      janitor::adorn_ns()    # give n values in parenthesis
    
    

    Lecture 14 - Normal Distribution (week 6)

  • rnorm(n = 100, mean = 2, sd = 0.4), to generate Normally distributed data from the specified distribution
  • pnorm(q = 1.2, mean = 0, sd = 2), to calculate the cumulative probability below a given value
  • qnorm(p = 0.75, mean = 0, sd = 1) to calculate the x-value for which some percent of the data lies below it
  • stat_qq() and stat_qq_line() to make a QQplot. Notice that aes(sample = var1) is needed
  • dbinom(x=3, size=10,prob=0.29) binomial, exactly X=x [Lect 15 slide 39]
  • pbinom(q=3, size=10,prob=0.29) binomial, CUMULATIVE prob, X <= k (note the param is named q in R). [Lect 16 slide 35]
    dbinom(#success, size, prob of success)
    dbinom is referred to as the discrete prob of the binomial (it is not a density fn as in the normal dist case)
  • sim_01 = rbinom(n=200,size=1,prob=0.3030) # 200 draws of a single trial (Bernoulli) each with success prob 30.30%, ie a vector of 200 zeros and ones
  • pois = poisson dist, similar to binomial, but no upper bound. typically for rare events, eg Infect Disease. [Lec 17 wk7]
  • dpois(x=?, lambda=?) discrete prob, when x=exact number. lambda (aka mu, ie average) is the sole param in poisson.
  • ppois(q=?, lambda=?) CUMULATIVE prob, for x up to the value specified, up to and including: X<=k. **if using lower.tail=FALSE then decrement k by 1 to properly account for the discrete unit change. [Lect 17 slide 18]. eg: 1-ppois(q=3,lambda=0.1) # ans: [1] 3.846834e-06
    ppois(q=3,lambda=0.1,lower.tail=FALSE)
  • r____(n=?, param) : r-class is always for generating random number of entries for simulation, according to parameters necessary for the stat dist (so Normal vs Binom vs Poisson would take different number of params).




  • Finding Normal *Probabilities* - pnorm
    calc probability using pnorm( q=x ) given specific value of x
    
    pnorm( q=x, mean=0, sd=1, lower.tail=T )  # last 3 params are defaults in R
    
    mean is value on x-axis where peak of curve is
    sd, standard deviation, is reflected on how tall and fat the curve of the CDF is.
    
    pnorm( q=x,    mean=m, sd=sigma)  : provide q as input, get probability as output, which is the AUC up to the cut off point q=x
    	x is a cut off point, up to this value.  area on the left is the probability (CDF)
    	eg P( x<1.2)  --->  prob with x as cut off at 1.2.  in r is:
    	pnorm( q=1.2 ) # default mean=0, sd=1
    	the result is the AUC up to the cut off point q of x=1.2.  # slide 35
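    
    	for the upper tail, P(X > x), either subtract from 1 or use lower.tail=FALSE; a quick check with the same cut off:
    	pnorm( q=1.2 )                      # P(X < 1.2)  ~ 0.885
    	1 - pnorm( q=1.2 )                  # P(X > 1.2)  ~ 0.115
    	pnorm( q=1.2, lower.tail=FALSE )    # same as the line above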
    
    
    Finding Normal *Percentiles* - qnorm
    Finding Normal Prob is calc probability using pnorm( q=x ) given specific value of x
    Here is the inverse:
    Given the probability, find the corresponding x-value.
    qnorm() is the function for this task.
    (slide 43)
    
    the percentile given is the area under the curve, thus:
    
    qnorm( p=area, mean=mu, sd=sigma)  : provide p as input, get q as output, which is on the x axis .  
    
    eg, given a N(mu,sigma), want to know the 75%, what x value that corresponds to.  qnorm(0.75) provides the answer.
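    
    eg, a quick numeric check using the heights example further below (mean 64.5, sd 2.5): the 75th percentile height is
    qnorm( p=0.75, mean=64.5, sd=2.5 )   # ~ 66.2, ie 75% of heights fall below about 66.2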
    
    
    Standard Normal and z-score
    (slide 48)
  • The standard Normal distribution N(0,1) has µ = 0 and σ = 1.
  • X ∼ N(0, 1), implies that the random variable X is Normally distributed.
  • Conversion: z = ( x - mu ) / sigma
  • This standardized value is called the z-score
  • Interpretation: z is the number of standard deviations that x is above or below the mean of the data.
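    
    A quick worked example using the same heights numbers (mean 64.5, sd 2.5, taken from the simulation below): a height of 67 is exactly one standard deviation above the mean.
    x = 67; mu = 64.5; sigma = 2.5
    z = ( x - mu ) / sigma    # = 1
    pnorm( z )                # ~ 0.841, the fraction of heights below 67
    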
    Simulating Normally distributed data in R
    (slide 55) (line 404 of su21/l12-code.Rmd)
    
    rnorm( )  # generate an array with rnd data that is normally distributed
              # optionally add mean and standard deviation (else result in std normal set)
    runif(n=1,min=0,max=1)  # rnd number with uniform dist; this is what most programming languages' default random number generator produces
    
    heights.women = rnorm(n = 1000, mean = 64.5, sd = 2.5)  # return array of 1000 elements, with mean and sd as indicated
    heights.women = data.frame(heights.women)    # convert array to dataframe  (unlike unix, the var will be reassigned correctly, not overwritten and lost as with `cat > foo`)
    
    
    ```{r plot heights, out.width="80%", echo=F, warning=F, message=F}
    # create a histogram using data generated by rnorm() ::
    hist = ggplot(heights.women, aes(x = heights.women)) + geom_histogram(col = "white") + theme_minimal(base_size = 15) 
    
    # density plot
    dens = ggplot(heights.women, aes(x = heights.women)) + 
      geom_density()  +                                                                       # density fn line plot for df heights.women
      stat_function(fun = dnorm, args = c(mean = 64.5, sd = 2.5), col = "forest green")  +    # add a green line showing what the expected distribution is supposed to look like (dnorm = the Normal density function)
     theme_minimal(base_size = 15)
    
    hist + dens + plot_layout()
    
    plot_layout() (from the patchwork package) affects how multiple ggplots are displayed, ie controls the number of cols and rows when there are many plots.
    ```
    
    
    # this does the conversion to z-scores to conform to the standard Normal  
    # ie, the z-score is calculated using the manual formula (there might be a built-in R fn, but not presented for class)
    heights.women = heights.women %>% mutate(mean = mean(heights.women), 
                                              sd = sd(heights.women), 
                                              z = (heights.women - mean)/sd)
    
    
    

    Lecture 15

    
    obj:
      - Introduce the binomial distribution
      - What kinds of outcomes follow a binomial
      - Understanding the probability space for a binomial
      - What is the theoretical distribution for binomials 
      - discuss exact vs. cumulative probabilities for the binomial 
      - introduce some R code for binomial distributions
      - choose(n,k) is the R fn for C(n,k) aka "n choose k"
    
    
    QQplot ...
    quantile-quantile plot
    
    quantile = row_number() / n()
    z_score = qnorm(quantile, mean=m,sd=sigma)
    
    R can do the qq plot automatically, instead of doing the manual calc with z-score and stuff.
    counts = c(18, 4, 22, 15, 18, 19, 22, 12, 12)     # number of trees in 9 plots as vector
    tree_data = data.frame(counts)                    # convert to df, as that's what ggplot (?stat_qq) needs
    ggplot(tree_data, aes(sample=counts)) + stat_qq() + stat_qq_line()   # L15 slide 10, hw05 wk07
    
    
    Gaussian  == Normal
    Bernoulli == Binomial
    
    params for normal: 
    	mean, std dev
    
    params for binomial, these 2 params are enough to tell shape of the distribution graph: 
      1. number of trials, number of samples to pick.  eg n=10 for sampling 10 bottles.  size param in R dbinom/pbinom
      2. prob of success  (ie, how frequent get yes, head, true, contaminated by benzene, etc)
    
    Binomial need independent outcomes.  sample picking must be from a large batch, pop size >> sample size.  at least 20x was good?
      
    Also needed for calculations:
      * k, like "n choose k", k is the number of success being sought for (eg k = 3 bottles contaminated by benzene)
    
    dbinom(k,n,p)   X=x   exact, (k is standin for x, fuzzy when to use which)
    pbinom(k,n,p)   X<=k  cumulative, up to 
    
    dbinom(x=3, size=10,prob=0.29) binomial, DISCRETE prob,    X=x exactly. param name is x in R [Lect 15 slide 39]
    pbinom(q=3, size=10,prob=0.29) binomial, CUMULATIVE prob,  X<=k.        param name is q in R [Lect 16 slide 35] 
    
    can still calculate:
    mean: mu = n*p
    std dev: sigma = sqrt( n*p*(1-p) )
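    
    a minimal numeric sketch, using the bottle example above (size=10, prob=0.29):
    n = 10; p = 0.29
    mu    = n * p                    # 2.9
    sigma = sqrt( n * p * (1-p) )    # ~ 1.43
    sum( 0:n * dbinom(0:n, size=n, prob=p) )   # same mean, recomputed from the full distribution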
    
    
    # rep() = replicate elements of Vectors and Lists
    # sequentially placed, ie NOT randomized placement
    c(rep(0,4),rep(1,2)) # get [1] 0 0 0 0 1 1   
    
    slide 41, [~35:20]   su21:l13-code.rmd line 330
    simulate data... pop size of 10,000 containers , 10% contaminated
    container.id = 1:10000
    benzene = c( rep(0,9000), rep(1, 1000) )     # this create sequence of 9000 zeros/failure and 1000 ones/success , ie NOT random!
    pop_data = data.frame(container.id, benzene)
    
    set.seed(1)
    
    # below are code to create the simulation, creating a hypothetical data set and seeing the distribution stat.
    # provide whole pop stat: num of contaminated w/ benzene, average (ie %contamination)
    # part of the trick here depends on success = 1
    # this overall is just to say that the artificially created sample fits the desired probability
    # they were not in random order, it is fine, cuz sample_n() will pick randomly.
    pop_stats  = pop_data %>% summarize( pop_num_success = sum(benzene), pop_mean = mean(benzene) )
    
    # draw a sample of 10 and see which one is contaminated with beneze (ie success in this framework)
    sample_dat = pop_data %>% sample_n(10)  
    
    
    head(many.sample.stats) ##??
    
    ## if the pop is 20x larger than the sample size, selection is generally considered not to affect the next outcome, cuz the diff is tiny and won't affect the conclusion.
    
    # this seems to be what HW04 was asking for.
    
    
    [42:00]
    big X is the random variable,
    little x is a specific value ... 
    P(X=x)  the probability that the number of successes equals some value x...
    
    
    
    HW04
    (R quirkiness, though understandable if one understands the underlying mechanics)
    
        these two are slightly different
    output_01 = sim_01 %>% as.data.frame() 
    output_01 =            as.data.frame(sim_01) 
        as.data.frame() with "piped input" doesn't get the name of the source variable,
        and as a result, a subsequent rename will not find it.
        thus, only the latter of the next two lines works:
    output_01 = sim_01 %>% as.data.frame()       %>% rename(birth_defect = sim_01)   # rename fails!
    output_01 =            as.data.frame(sim_01) %>% rename(birth_defect = sim_01)   # rename works
    
    
    ascii diagram , ascii arts =P
    
    template tree diagram for bayes in ascii, potential paste for lab or exam:
    ref: Lect 13 wk5 reader p131
    study (pop) = medicare enrolled heavy smokers
    
             /-- T+
           D+
         /   \-- T-
    study
         \    /---- T+
           D-
              \---- T-
    
    
    HIV status:
    
             /-- Test+     0.9985
           Ab+
         /   \-- Test-     0.0015
    root  
         \    /---- Test+  0.0060
           Ab-
              \---- Test-  0.9940 
    
    
    lecture example:
    first layer for disease state D+/D-
    second layer for test result  T+/T-
    
    Rmd ascii 2x2 contingency table:
    
          |--response vars-|
    
    2x2   | D+       | D-  | margin |
    ------|----------|-----|--------|
    T+    |   26.7   | 67.9|  94.6  |
    T-    |    3.3   |902.1| 905.4  |
    margin|   30     |970  |1000    |
    
    ↑
    explanatory input, 
    researcher controlled vars
    
    
    
    |            |   Disease    | Disease     | (Margin)     |
    |            |   True +     |  True -     |  Sum         |
    |------------|--------------|-------------|--------------|
    | test +     |        TP    |    FP       | denom of PPV |
    | test -     |        FN    |    TN       | denom of NPV |
    | sum        | denom of Se  | denom of Sp | pop tot      |
    
    

    Lecture 17: Poisson Dist

  • pois = poisson dist, similar to binomial, but no upper bound. typically for rare events, eg Infect Disease. [Lec 17 wk7]
  • dpois(x=?, lambda=?) discrete prob, when x=exact number. lambda (aka mu, ie average) is the sole param in poisson.
  • ppois(q=?, lambda=?) CUMULATIVE prob, for x up to the value specified, up to and including: X≤k. **if using lower.tail=FALSE then decrement k by 1 to properly account for the discrete unit change. [Lect 17 slide 18]. eg: 1-ppois(q=3,lambda=0.1) # ans: [1] 3.846834e-06
    ppois(q=3,lambda=0.1,lower.tail=FALSE)
  • simulation  for 5 years  see reader page 50
    
    rpois( n=12*5, lambda=0.536 )
    
    vs manual simulation
    
    set.seed(200)
    polydactyly=(rpois(n = 12*5, lambda = 0.536))
    polydactyly
    
    mean(polydactyly)  # 0.45,       randomness and small sample made our number differ from theoretical of 1/500*268 = 0.536  [27:45]
    sd(polydactyly)    # 0.64
    
    
    when to use poisson vs binomial...
    [33:08]
    binomial: number of successes out of a fixed number of trials (a known size n)
    vs
    poisson: number of events in a fixed interval of time/space, with no fixed number of trials.
    
    (the real diff: binom has a cap, a finite size n, vs poisson has no upper bound)
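    
    One connection between the two (a quick check, not from the lecture): when n is large and p is small, the binomial is well approximated by a Poisson with lambda = n*p.
    dbinom( x=3, size=1000, prob=0.001 )   # ~ 0.0613
    dpois(  x=3, lambda=1 )                # ~ 0.0613  (lambda = 1000 * 0.001)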
    
    
    

    Lab06

    
    # test and install a package as needed
    if (!require("MASS")){
      install.packages("MASS")
    }
    
    
    # unload functions loaded by package MASS (as they clobber with fn of same name from dplyr):
    detach("package:MASS", unload = T)		
    
    
    # binomial sample data generation, individually create each sample point
    # lab06 line 280
    # This is just the range of values x can take
    x_vals = 0:812
    
    # This generates the probabilities
    probs = dbinom(0:812, size = 812, prob = 0.27)
    
    # This combines the range of values with the probabilities as a dataframe with 2 columns: x_vals and probs.
    # View(obs_data) in your console to see what this dataframe looks like
    obs_data = as.data.frame(cbind(x_vals, probs))
    
    
    
    

    Lecture 24: T and paired T test

    
    [wk 9 reader p20]
    for t-distribution (dist of t-scores?)
    very much like pnorm and qnorm.
    pt(q = , df = , lower.tail = ) 
    qt(p = , df = , lower.tail = )
    
    
    
    [ wk 11 reader p 50 - Lect 24 (Topic 1, video 2, toward the end of the video)   ]
    
    coding notes
    
    A one sample t- test will take the form:
    t.test(x = x_variable, alternative = "greater", "less" or "two.sided", mu = null_hypothesis_value)
    
    
    
    A two sample t-test will take the form:
    t.test(first_sample_data, second_sample_data, alternative = "greater", "less" or "two.sided")
    
    	t.test( df$col1   ~ df$col2 , alternative="two.sided") # 2 vars, independent # w11 reader p11
    	t.test( df$weight ~ df$sex  , alternative="two.sided") # note use of ~ and 2nd var would have been the group_by filter
    
    A paired t-test will take the form:
    t_stat = t.test(first_data_points, second_data_points, alternative = "greater", "less" or "two.sided", paired=TRUE)
    t_stat$statistic  # contains the actual t-score
    t_stat$p.value    # contains the p-value
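    
    A runnable example using R's built-in sleep data set (extra = extra hours of sleep, group = drug, 10 paired subjects), just to show the output shape:
    t_stat = with( sleep, t.test( extra[group == "1"], extra[group == "2"], paired = TRUE ) )
    t_stat$statistic   # t ~ -4.06
    t_stat$p.value     # ~ 0.0028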
    
    

    Lecture 25: ANOVA: Comparing 3 or more means

    First mostly about visualizing multi-column data (eg think of multiple treatment methods being tested for cancer, some being combinations of treatments). Background: how 3 groups of data need to be structured for comparison.
    
    
    grp1 = rnorm( n=2, mean=15, sd=3 )   # each rnorm() produces a vector as a result
    grp2 = rnorm( n=2, mean=15, sd=3 )
    
    egSame = data.frame( grp1, grp2 ) 
    # construct dataframe, each vector becomes a column, which also inserts the original variable name as the column heading!!
    # output looks like this
          grp1     grp2
    1 19.07604 16.16301
    2 14.69164 14.83858
    
    
    egSameNarrow = egSame %>% gather( key="Grp", value="Measure", grp1:grp2 )
    # gather() is interesting…  it produces a new data.frame() with a pivoted arrangement of the data, reshaping it into the tidy (long) format.
    # the main cells of the data.frame() become the value column, for which gather() also supplies a header
    # the "key" is the first col, which gets its values from the headers of the original data.frame() cols
    # output looks like this
       Grp  Measure
    1 grp1 19.07604
    2 grp1 14.69164
    3 grp2 16.16301
    4 grp2 14.83858
    
    
    # ah, so geom_point would be xy scatter plot when fed with continuous variables on both axis
    # but when one var is categorical, it produce the data points of a box plot, w/o the box
    ggplot(egSameNarrow, aes(x = Grp, y = Measure)) + geom_point()
    
    + geom_boxplot()    # to get box, add geom_boxplot()   // order matter, last one is top layer
    
    
    # density plot in [wk11 reader p80]
    ggplot(egSameNarrow, aes(x = Measure))  + geom_density(aes(fill=Grp), alpha=0.4)
    
    
    # Lecture 26: ANOVA Part 2.  (Su21 L20 Line 510)
    ggplot(egSameNarrow, aes(x = Grp, y = Measure)) +
      geom_jitter(pch = 4, width = 0.1)                            # like geom_point, but wiggle points randomly within "width" so points can be seen when they overlap
    
    ggplot(cancer_data, aes(x = trt_order, y = tumor_volume)) +
      geom_jitter(pch = 4, width = 0.1)                            # like geom_point, but wiggle points randomly within "width" so points can be seen when they overlap
    
    # pch = plotting character, ie the point shape; pch=4 draws an 'x'
    
    
    ANOVA test:
    
    library(broom)   # wk12 reader p12
    cancer_anova = aov(formula = tumor_volume ~ treatment, data = cancer_data)
    tidy(cancer_anova)
    
    # for aov(), the var after ~ is column containing the grouping.
    
    		cancer_data data.frame looks like this
    		     treatment          tumor_volume
    		1  Irradiation           30
    		2  Irradiation           46
    		...
    		5 Cannabinoids           12
    		^^
    		row number added by R
    
    pf() would find the p-value, but aov() already provides it, so maybe seldom need to call this explicitly?
    
    after finding stat sig anova, need to run multiple pairwise t-test, or use TukeyHSD() wrapper, wk12 p29
    
    Bonferroni Correction wk12 p22
    
    
    diffs = TukeyHSD( cancer_anova, conf.level=0.95 )  %>% tidy()   # conf.level is the confidence level (default 0.95), not alpha
    diffs %>% select(contrast, adj.p.value)		# select() pick columns from list (with named col),  vs pull(), for data.frame ? 
    
    TukeyHSD() provides an adj p.value, correcting for multiple tests.  this adjustment prevents it from being used for CIs.  [wk12 reader 31]
    
    Bonferroni vs TukeyHSD produce slightly different result.  [wk12 p33]
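    
    A minimal sketch of the Bonferroni route using base R's pairwise.t.test(), assuming the cancer_data columns shown above:
    pairwise.t.test( cancer_data$tumor_volume, cancer_data$treatment,
                     p.adjust.method = "bonferroni" )   # all pairwise comparisons, with Bonferroni-adjusted p-values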
    	
    
    

    Lecture 27: non parametrics


    
    
    

    Lab10

    # maybe move...
    
    library(MASS)           # load it to import the data set
    detach("package:MASS")	# remove the select() fn which would conflict with dplyr 
    
    
    




    Missing Data

    na.rm=T   # many functions take this as an option to drop entries with missing data, ie, where the data is NA
    nhanes = read_csv("nhanes.csv")
    nhanes = na.omit( nhanes )       # remove row that has missing values.
    
    

    Constructing 2x2 or RxC contingency table using tibble, chi-squared test

    
                  Exposure+        Exposure-       ←- explanatory var as input
    
    |Group        | Sandwich     | No Sandwich   | Row total |
    |-------------|--------------|---------------|-----------|
    |Ill          | 109          |  4            | 113       |
    |Not Ill      | 116          | 34            | 150       |
    |Column total | 225          | 38            | 263       |
    
    ↑
    response
    output
    D+ vs D-
    (this is a more unusual layout of 2x2... )
    
    # the library is tibble, but the construction fn is tribble()
    library(tibble)
    two_way = tribble(~ sandwich, ~ nosandwich,
                         109,        4,   #row for Illness
                         116,        34)  #row for no Illness
    
    chisq.test(two_way, correct = F) # correct = F means not using Yates' correction for continuity, so that it can match calc "by hand".
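    
    the returned htest object also carries the expected counts, handy for checking the chi-squared assumptions (a small addition, not from the lab):
    chisq.test(two_way, correct = F)$expected   # expected cell counts under independence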
    
    

    ascii/unicode symbols

    for easy cut-n-paste
    
    ↑ ↓ ← →   arrows up down left right
    « » 
    ⟨ ⟩ ‹ ›   angle brackets not parsed by HTML code, but may create confusion with cut-n-paste .  
              useful to replace R› prompt
    
    • bullet
    · middle dot 
    ◆ diamond
    ◇
    ♢
    ▶
    better bullet for table▶yes?
    ▶  ▶
    
    
    √¯¯¯  sqrt
    ≈  approx
    ≤  <=   combined less than equal to.  
    ≥  >=   useful when dont want to be parsed by HTML 
    ≠  !=   not equal, R Knit LaTeX barf at this symbol :-\
    ±
    ¬    not
    ∪ ∩  union intersect
    ×    multiply
    ÷    divide
    Ø    empty set
    
    λ - lambda - eigenvalue
    Λ - Lambda - eigenvector
    
    ε - epsilon
    Ε - Epsilon
    
    χ² - chi-squared
    σ² - sigma squared = variance
    μ  - mu mean average
    
    αβ∑Δ
    
    ΑΒΚΔΕΦΓΗΞΚΛΜΝΟΠΚΡΣΤΎΒΩΧΥΖ
    αβκδεφγηξκλμνοπκρστύβωχυζ
    
    Ñ  alt 0209  spanish n
    ñ  alt 0241  mac: option + n, n
    
    Symbols
    
    ° zeroth :P degree ascii 176
    ¹ first
    ² square
    ³ cube
    ª primera
    º primero
    ‰ per mill
    
    § section
    ¤ currency
    €
    ¢ cent
    Ð Latin Eth
    Ñ ñ
    ¡ ¿ 
    ¶ pi new paragraph
    
    
    
    https://www.ascii-code.com/ has an ascii table that is text based, can copy-n-paste, quite a bit of ads at top, but at least no pop ups

    [Doc URL: http://tin6150.github.io/psg/R.html]
    [Doc URL: http://tin6150.gitlab.io/psg/R.html]
    Last Updated: 2022-07-07
    (cc) Tin Ho. See main page for copyright info.

