Sys Admin Pocket Survival Guide - OS
Home	python	Perl	Kubernetes	Singularity	CoreOS	AWS	BigData	Vagrant	shellScript

R

101

R       # interactive cli
Rstudio # gui

# run single command in shell
Rscript -e "print(system('hostname'))"   
Rscript --quiet -e 'install.packages("pacman",    repos = "http://cran.us.r-project.org")'    

# run a list (set?) of commands by placing them in {}.  p_load() won't run unless the library is loaded explicity.
Rscript --quiet --no-readline --slave -e '{ library(pacman); p_load( ggplot2, scales ) }'


Special symbols

^   power. 2^8 
$   separate table name and column name, eg nhanes$gender, opt$indir
~   tilde operator, 
-   minus, substraction.  eg a-b . Just-dont-use-it-in-var-name, as it is treated as substraction of each terms listed!  

log(100)           # Think ln() here!  By default, R log has base e!  same as bc's l(), python math.log(), matlab too?    calculator is the only place log is base 10, and use ln for base e??
log(1000, base=10) # this is log with base 10
log(1000, 10)      # same as above
log10(1000)        # ok, it is used often enough they do have a fn name for log base 10!

exp(1)             # expoent of natual log.  same as bc's e(), python math.exp()

log(exp(1))        # identity
exp(log(1))        # identity

Note, R doesn't treat dot differently.  it is often used as part of variable name or function name, like chisq.test() or p.hat.
However, neither Python nor bc likes dot in the variable name, so cut-n-paste simple calculator lines with variables with dots in them result in error.  May try to avoid use of dot in variable names for code portability.  (eg pasting from R to jupyther notebook with python kernel)

R

Jupyter Notebook with R kernel Container


singularity pull shub://tin6150/DIOS_demonstration
singularity exec DIOS_demonstration_latest.sif /opt/conda/bin/jupyter lab --allow-root

-or-

docker
podman run  -it --entrypoint=/opt/conda/bin/jupyter  tin6150/r4envids lab --allow-root

Jupyter Notebook IR Kernel Setup

conda install -c conda-forge jupyterlab

IRkernel is R kernel for Jupyther Notebook.  Need to install as R package, then add kernel spec to Jupyter Notebook:

Rscript -e "install.packages('IRkernel')"

# run multiple commands in Rscript by placing them inside {}, cmd separated by ;
Rscript --quiet --no-readline --slave -e '{ print("Hello") ; print("hi") }'
# p_load requires library(pacman) to be run first, so, try like this:
Rscript --quiet --no-readline --slave -e '{library(pacman) ; p_load( ggplot2, scales ) ;  print("Hello") ; print("hi") }'

# system wide
IRkernel::installspec(user = FALSE)

# per user
Rscript --no-readline --slave -e "IRkernel::installspec(user = TRUE)"


/home/tin/anaconda3/bin/jupyter lab 
# expect to have browser launched with URL http://localhost:8888/lab
# then click icon to create new R notebook

Environment


setenv R_LIBS /path/to/user/R-libraries

# setting R_LIBS will change both where R looks for lib to load and when installing libs 

library( rJava );	# see if it can load the R library named rJava (case sensitive)

listing all installed libraries:
Rscript -e 'library()' 

sI = sessionInfo()
sI


system('env')	# this is running the env command via the shell and display that info
system('id -a; hostname')

# in jupyter notebook, the system() command output to console 
# use combination of intern=TRUE and wrap with print() just to be sure
print(system( 'uptime', intern=TRUE ))  


library(data.table)
fread( "cat /proc/loadavg", check.names=TRUE ) ## run system command and read result into data.frame

Setting dir for user package install


setting user's own library install dir.
for installing CRAN libraries...

You can have a R_LIBS under your home dir.  
If you setup R_LIBS to be a shared folder, then your student can use the same library that you (or I, not as root) install.  



mkdir -p  ~/rlibs36
export R_LIBS=~/rlibs36    (or R_LIBS=/global/home/groups/ac_seasonal/rlibs36 )
I am assuming this for a new R 3.6, are libraries compatible with old version? For Perl I keep them in separate directories; python is a different beast)

# setting R_LIBS will change both where R looks for lib to load and when installing libs 

Running this as user should install a package called diagram:
    Rscript --quiet -e 'install.packages("diagram",    repos = "http://cran.us.r-project.org")'    

after install, you should see files added to ~/rlibs36

If the above don't work,
start interactive R as user
install.packages('diagram',lib='~/rlibs36')

this page has a good write up about this issue:
https://statistics.berkeley.edu/computing/R-packages

Installing packages


installing packages from cran:
(see r4eta or metabolics container)


scripted to run from shell:

Rscript --quiet -e 'install.packages("ggplot2",  repos = "http://cran.us.r-project.org")'    
Rscript --quiet -e 'install.packages(c("aCRM", "akima", "broom", "cluster", "clusterCrit"), repos = "http://cran.us.r-project.org")'

# do NOT have TAB(s) inside the arg list for packages(...), as Rscript can't tolerate it when invoked from a bash script.


Rscript --quiet --no-readline --slave ...
	# --quiet is same as --silent 
	# --slave was to make R run as quietly as possible 
	# --no-readline hopefully do less drawing on screen and make install log more readable...
	# there isn't a --no-verbose or --no-interactive


install many packages in single command:
https://stackoverflow.com/questions/29041423/how-to-install-multiple-packages

install.packages(c("EIAdata", "gdata", "ggmap", "ggplot2")) # rest omitted

have not tried yet:
install.r EIAdata gdata ggmap ggplot2    # rest omitted again

CSV example

Simple example reading csv file and getting summary info about it. Utilizes magrittr from dplyr, which allow unix pipe-like operation using %>%

library( 'magrittr')
print(getwd())
data <- read.csv("/mnt/CSV_adjoin/dacsjvnew_AVOC_07_All_Sp.csv")
data %>% head() # like unix cat data | head
# print(data)   # maybe big
data %>% summary()  # summary stat for each column of data, eg min,median,quartilers

str( data ) # see list of variables (column headers)

Misc

capabilities() 		# see if support X11, etc
source("myFuncList.R")	# import function list from file

system("id;pwd")	# run OS commands
setwd("dirname")	# cd to dirname, most useful in interactive R session.

# loading files in Rstudio 

spatial_paths = loadPaths(fileName = normalizePath("C:/Users/tin61/tin-gh-inet-class-only/inet-dev-class/epi_info/travHistProtocol/files/Protocol4/EPI_ISL_412975_MJhist.csv") )

# note that in windoze, C: is assumed and added by system at run time if not specified

# these might be helpful with knitting
setwd("c:")
setwd("/Users/tin61/tin-gh-inet-class-only/inet-dev-class/epi_info/travHistProtocol/files/Protocol4/")

# Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
knitr::opts_chunk$set(root.dir = normalizePath("/Users/tin61/tin-gh-inet-class-only/inet-dev-class/epi_info/travHistProtocol/files/Protocol4/"))

.libPaths()		# list path used to scan for R libs 
.libPaths(  c( ./libPaths(), '/home/tin/R/x86_64-pc-linux-gnu-library/3.4' ) )   # add path to library 


rm(list=ls())  		# remove all var in memory

Oracle connection

Sys.setenv(TNS_ADMIN="/site/conf/TNS_ADMIN",ORACLE_HOME="/usr/prog/oracle-client/11.1.0.6",ORACLE_SID='')
# The TNS_ADMIN may not matter if ORACLE_HOME has a network/admin with working conf for tns name resolution.
# ORACLE_SID is very important!  if set, the instance specifier in the dbConnect will be ignored, and db always connect to instance specified by ORACLE_SID !!

library( 'ROracle' )

ora=Oracle()
con=dbConnect(ora,user='user/pass@instance')
summary(con)
# verify connected to desired db)

result=dbGetQuery(con, "select * from tab")
# perform SQL query

result
# see result of query (display content of variable)


result2=dbListTables( con )
# see list of tables for the connected schema

GC: Garbage Collection

gc(verbose=TRUE) 	# explicitly call garbage collection
ls() 			# list all vars
object.size( get("myData") )	
rm()
rm(list=ls())  		# remove all var in memory # probably useful in rstudio when playing, but not in actual program!
sort( sapply( ls(), function(x) { object.size(get(x))} ),decreasing=TRUE )	# list all vars and their sizes

vars are out of scope at the end of function, and gc can happen to them automatically

bigmemory leave data on file.  some packages are build using bigmemory

file-backed big matrix -- https://ljdursi.github.io/beyond-single-core-R/#/132 
data = read.big.matrix(...)

csv is slow.
HDF5 and netCDF recommended.

ref: https://ljdursi.github.io/beyond-single-core-R/#/111

R Programming

101.2

R short primer (pdf) (page 12)


• function(arglist){expr}: 
function definition: do expr with list of arguments arglist

• if(cond){expr1}else{expr2}: if-statement:
if cond is true, then expr1, else expr2

• for(var in vec) {expr}: 
for-loop: the counter var runs through the vector vec 
and does expr each run

• while(cond){expr}: while-loop: while cond is
true, do expr each run


• x[n]: the n-th element   # R index starts at 1
• x[m:n]: the m-th to n-th element
• x[c(k,m,n)]: specific elements
• x[x>m & x<n]: elements between m and n

• x$n: element of list or data.frame named n.  ie, x is a data.frame, n is the column name

• x[["n"]]: idem		# [[ ]] in R ... 

• [i,j]: element at i-th row and jth column
• [i, ]: row i


• return: show variable on the screen (used in functions)


• as.numeric or as.character: change class to
number or character string

• strptime: change class from character to
date-time (POSIX)

• data types: lgl (logical: TRUE FALSE; T F are auto parsed as well, but NOT True False), int (Integer), dbl (Double), chr (characters, for strings), bytes (bytes!)  ref: https://cran.r-project.org/web/packages/readxl/vignettes/cell-and-column-types.html

?"%*%"  # get help on topic, quote it if it has "funny" symbols

%*% # matrix multiply
%/% # ?? partitioning data into partition for shipping to different nodes 

%:% # compose or nest foreach objects - https://ljdursi.github.io/beyond-single-core-R/#/84

$>% # like unix pipe, used a lot by dplyr.  can be chained.

%in%	# membership evaluation (may not need to be a true math set)

%%  # modulus, ie reminder.  eg 7%%3 is 1
2^4 and 2**4 both seems to be power, get 16.  2^4 notation maybe preferred.

R stuff - NA : Not available data


NA is the token to indicate a value that is not availabe.  Kinda like UNDEF.  
It is not a string, and it is not NULL.
It is spelled exactly as NA , both caps.

> foo=NA
> foo
[1] NA
> bar=na
Error: object 'na' not found

• is.na: test if variable is NA

R by default will not compute on dataset that contain NA.
If you don’t mind about the missing data and
want to compute the statistics anyway, you can
add the argument na.rm=TRUE 
(Should I remove the NAs? Yes!).

> max(j, na.rm=TRUE)
[1] 2

R stuff - stat stuff


• lm(v1∼v2): linear fit (regression line) between 
vector v1 on the y-axis and v2 on the x-axis

• nls(v1∼a+b*v2, start=ls(a=1,b=0)): nonlinear fit. 
Should contain equation with variables
(here v1 and v2 and parameters (here a and b)
with starting values

• coef: returns coefficients from a fit
• summary: returns all results from a fit

• sum: sum of a vector
• mean: mean of a vector
• sd: standard deviation of a vector


• aggregate(x,by=ls(y),FUN="mean"): split
data set x into subsets (defined by y) and computes means of the subsets.
Result: a new vector.

• approx: interpolate. Argument: vector with NAs. 
Result: vector without NAs.

stat functions

Ref:
- "UNM Stat 5101"
- CDF and PPF in Excel, R and Python (good graphical guide covering what the p, q, d, r functions cover.)
- SciPy stats function



prefix: p, q, d, r ::

p for "probability", the cumulative distribution function (c.d.f.)
	return the AUC for X<=x, ie the probability  (when lower.tail=TRUE)
     	return the p-value                           (when lower.tail=FALSE)

	F(x) = P(X ≤  x)
   	Answer What is  P(X < 27.4) when X has Normal distribution with mean 50 and standard deviation of 20 aka N(50,20).  N often styled as cursive cap N.
   	if variance (σ) is given, use the relation s.d. = sqrt(sigma) 
   	pnorm( q=27.4, mean=50, sd=20, lower.tail=TRUE )
	pnorm( 27.4, 50, 20 )
   	In R, optional param default to mean=0, sd=1, lower.tail=TRUE.
   	Python: scipy.stats.norm.cdf( 27.4, loc=50, scale=20 )
   	Excel:  norm.dist( 27.4, 50, 20, TRUE ) 

	For calculation of the p-value in a right tailed test for a given z-score:
   	1 - pnorm( 27.4, 50, 20 )
   	pnorm(27.4, 50, 20, lower.tail=FALSE )  # lower.tail=FALSE -> upper/right tail
	Python: 
	1 - scipy.stats.norm.cdf( 27.4, 50, 20 )
	scipy.stats.norm.sf( 27.4, 50, 20 )     # sf = 1 - cdf


q- for "quantile", the inverse c.d.f. - aka PPF(q) for probability ( 1 - α )
	Have AUC (ie probability or p-value, typically alpha=0.05), get value of x (aka q, quantile)
   	p = F(x)
	x = F-inv(p)
   	F-inv is styled as F^-1.
   	So given a number p between zero and one, qnorm looks up the p-th quantile of the normal distribution.
   	Answer What is F-inv(0.95) when X has the N(100, 15^2) distribution?
   	qnorm(0.95, 100, 15)
   	qnorm(p=0.95, mean=100, sd=15, lower.tail=TRUE)
   	qnorm(0.05, mean=100, sd=15, lower.tail=FALSE)   # from the relation (1-alpha) + alpha = 1, here alpha=0.95
   	Python: scipy.stats.norm.ppf( 0.95, loc=100, scale=15 )
   	Excel: norm.inv( 0.95, 100, 15 ) .  Excel has a norm.s.inv(0.05), but only take mean=0 sd=1.

   	
d- for "density", the density function (p.f. or p.d.f.)
	or more often discrete?  eg dbinom, dpois
  
   		

r- for "random", a random variable having the specified distribution
	Generating random numbers from a standard normal distribution
	N(mu=0,sigma=1)  # N cursive, mu=mean of 0, sigma=standard deviation of 1
	rnorm(n=1, mean=0, sd=1 )
	Python: scipy.stats.norm.rvs( loc=0, scale=1, size=1, random_state=123 ) # random_state is optional, use seed for repeatable/determinsistic result
	If n or size > 1 get a vector.
	Excel:  norm.s.inv( rand() )

     [d- and p- are typically grouped as they are often for discrte vs continuous dist, ie X=exact vs X≤up-to a specific value ]

pnorm,  qnorm,  dnorm,  and rnorm.  are the p-, q-, d-, r- fn for Normal Distribution.  
     

pbinom  qbinom  dbinom  rbinom - for binomial distribution
pchisq	qchisq	dchisq	rchisq - for Chi-Square   scipy.stats.chi2.cdf() etc
pexp	qexp	dexp	rexp   - for Exponential
plogis	qlogis	dlogis	rlogis - for Logistic
plnorm	qlnorm	dlnorm	rlnorm - for Log Normal
ppois	qpois	dpois	rpois  - for Poisson
pt	qt	dt	rt     - for Student t    scipy.stats.t.cdf() etc
punif	qunif	dunif	runif  - for Uniform

Note that for continuous distribution, some of these fn are defined using integrals, and R doesnt do integrations!


All function take option mean and standard deviation.  If omitted, default mean=0, sd=1.  

pnorm(27.4, mean=50, sd=20)
pnorm(27.4, 50, 20)

The Statistics Theme Park "Ride the Distributions!

Source: Loony Labs - Day 45 or Claudia @ Eigenblogger

assignment

a =  5
a <- 5

# both = and <- can be used for assignment
# But <- is prefered by the R crowd.  it also screw with my html code here, so maybe this R.html should go have been done in .rst ...

# there are some subtle differences, side effect, formula, etc.
# see https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/
# overall, for better readability of R code, 
# it was suggested to only use <- for assignment and = for specifying named parameters.
# But I was not convinced and will likely continue to use = 
# at least in this R.html doc.

# converting a numeric 1, 2 to categorical male, female, but creating a new col (sec_cat) in existing table (nhanes)

nhanes$sex_cat = ifelse(nhanes$sex==1, "Male", "Female") 


string to integer conversion (search food: casting)
myNumber = strtoi(myString, base = 10)

Data Structures

Adv R


1D:
vector: homegenoeus elements.  c()
list:   heterogeneous elements.  list()

2D:
matrix:     homegenoeus elements.  matrix()
dataframe:  heterogeneous elements, named columns. think table of data/spreadsheet :-D.  data.frame(), tidyverse

3D, multi-dimention:
array:     homogeneous elements.  1-indexed.  array(), dim()


attribute()
structure()

factor() to store categorical data -- ie predefined values

vector, matrix

R short primer (pdf)


vec1 = c( 1, 3, 5, 7)	# c = concatenate, paste together.  
vec2 = c( 2, 4, 6, 8)
vec3 = seq(10,25,5)     # aka: vec3 = seq(from=10,to=25,by=5)  # result: [1] 10 15 20 25

sum(vec2)


> mat1=matrix( data = c(1,2,3,4,5,6), ncol=3 )   # ncol define how many column the matrix has
> mat1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

> mat[2,2]
[1] 4

matlb/octave matrix syntax is A LOT nicer :-D
pay special attention to the sequence of the numbers.

cbind	# column bind to create matrix
rbind 	# row    bind

array

R is 1-indexed, so first entry of df is 1
A 1D array is a vector.
A 2D array is a matrix.
3D and higher dim structure is rarer in stats.


dim()
length()    # length( any_col_name ) to get number of rows, typically can provide count for sample size n.
names()
rownames()
colnames()
dimnames()
t() - transpose

c() generalizes to cbind() and rbind() for matrix
abind() for arrays

as.matrix()
as.array()  -- for conversions

dataframe

R is 1-indexed, so first entry of df is 1


coord = data.frame(x = runif(n.total), y = runif(n.total), sampled = 0) 
coord.sampled = coord[coord$sampled == 1,]   # initial sites

data.frame( coord ) is like a table  # convert a vector into a data.frame()
and $ separate the table name and the column name.
Thus, the above eg has 
coord$x
coord$y
coord$sampled
each of them is an array (representing the cell values across many rows of the table).
coord$x[1]  # table coord, column x, row 1  # R is 1-indexed.  (unlike python 0-indexed)
coord$x[2]  # table coord, column x, row 2


› t = data.frame(x = c(11,12,14), y = c(19,20,21), z = c(10,9,7))
› t
   x  y  z
1 11 19 10
2 12 20  9
3 14 21  7

› mean(t$x)
[1] 12.33333


typeof( t )
calss( t )
is.data.frame( t )

cbind(t, data.frame( z = 3:1 )
rbind(t, data.frame(...) )

It’s a common mistake to try and create a data frame by cbind()ing vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly
Since a data frame is a list of vectors, then each vector represents a column, and the data.frame() becomes a list of columns.

List

List in a simple form is a vector.
but it can be arbitrary number of object, each having different length.

syntax looks like it has some nested structure, pay close attention.

> L2=list(vec1,vec2)
> L2
[[1]]
[1] 1 3 5 7

[[2]]
[1] 2 4 6 8


> L = list(one=1, two=c(1,2),five=seq(1, 4, length=5))
> L
$one
[1] 1

$two
[1] 1 2

$five
[1] 1.00 1.75 2.50 3.25 4.00


[ ]   returns a list
[[ ]] returns the object which is stored in list
If it is a named list, then
    List$name or List[["name"]] will return same as List[[ ]]
While List["name"] returns a list

ref: https://stackoverflow.com/questions/32819539/proper-way-to-access-list-elements-in-r

for list returned by ans = mccollect( list(jobs...) )
typically ans[[1]], ans[[2]] etc will contain the original answer desired.

Strange R stuff

Formula and tilde (~) operator

Coming from a programming world, R's definition of formula and its use of the tilde (~) operator is one of the weirest sh-IT of R!! =)

R has this concept of formula, and, at least from a user perspective, is different than function. (Actually, a formula can be called 'f' and another function 'f' and they can coexist and be called correctly by the right context!)

For python programmer, forget about programming for a bit. Put on the stat thinking cap :)
Remember the flower classification example in data science class? stack overflow describes this construct:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
as "Model Species depends on Sepal Length, Sepal Width, Petal Length and Petal Width"

The tilde (~) operator is often used for regression model.
The LHS of the ~ operator is the name of the model (Species).
The RHS are the variables that is fed into the model, with the + operator being overloaded (don't literary means add them, but to treat each as "columns" to find covariance). Most models use data frames, and each "variable" is a column vector of data.

formula has delayed evaluation

Ref: Cran Lazy Eval, scroll down to the Formulas section.

A formula capture delays the evaluation of an expression so that you can later evaluate it with f_eval(). ie

f <- ~ 1 + 2 + 3

The formula 'f' will be defined, but the RHS expression will not be calculated yet.
Only when

f_eval(f)

is executed is the expression calculated.
Above was a one-sided formula.
Below is a two-sided formula:

	g <- y ~ x + z

lazyeval provides abstraction like:

f_rhs(g)   # x + z
f_lhs(g)   # y

Common formula

R Doc

lm and glm are logistic regresion models (ie linear line of best fit).
## seed_weight is "Y", the response aka predicted value ## seed_count is "X", the exploratory ie input value to find prediction for ##lm( y ~ x ) seed_mod <- lm( seed_weight ~ seed_count, data = seed_data ) tidy( seed_mod ) seed_mod seed_mod %>% str # there are a lot more than just intercept and slope in the linear model, just not shown when just asked for it plainly. tidy() is a decent compromise ## linear_mode = lm( y ~ x , data="someTable") m = linear_model$coefficients[2] # slope (it is a "named number" in R) m = linear_model$coefficients[[2]] # slope ( [[]] returns a number ) b = linear_model$coefficients[[1]] # intercept

the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators.

Model operators:

+ 	separate a series of terms that affect the model.
: 	??
* 	factor crossing: a*b interpreted as a+b+a:b
^ 	crossing to specific degree.  eg (a+b)^2 for crossing twice
%in%   	terms on the left are  nested within those on the right.
-      	remove dependence on the specified series

I() - To avoid this confusion, the function I() can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense

The lattice package uses them to specify the variables to plot.
The ggplot2 package uses them to specify panels for plotting.
The dplyr package uses them for non-standard evaulation.

Tilde operator

Try to understand these:

# sequences 
x <-   c(1:100)
y <- seq(0.1, 10, 0.1)

	
# all these 3 are the same:
plot( y     ~ I(x^3) ) 
plot( I(x^3), y      ) 
plot( x^3,    y      )    # no need to use I() function as ~ operator never used.

# note how ~ syntatically swapped the order of the independent and dependent variables x and y

# but this below is very different:
# ^ is interpreted as model operator for cross reference.  ( due to presence of ~ operator)
ploy( y ~ x^3 )

StackOverflow more extensive discussion on formula
Regression in Python usinr R-style formula seems to discuss use of ~ and regression formula in R.

import pandas as pd
from statsmodels.formula.api import ols
	
model = ols("MPG ~ Year", data=df)
results = model.fit()

Pipe

stack overflow



%>% is defined by magrittr, which is used by dplyr.
It is the equivalent of unix shell pipe.

iris %>% head() %>% summary() is equiv to summary(head(iris))
which one to use is a matter of preference.

Note: there is also %in%

apply family of function

see R-bloggers

apply(t(beaver1),1,max) ## apply function max to data, 1=row-wise)
lapply() returns a list
sapply() returns a vector instead of a list
tapply() does grouping before running summary function on them
by() ... think of GROUP_BY in SQL

Libraries

lifkit

tbd

ModeSim

geoR::lifkit

## likfit - Likelihood Based Parameter Estimation For Gaussian Random Fields
## https://www.rdocumentation.org/packages/geoR/versions/1.8-1/topics/likfit

SimAnneal()
kridge

~

raster::extract()


makCluster
registerDoParallel
%dopar%

Python using R

Quick hack: Just write to file, in /dev/shm so it is RAM based.
https://towardsdatascience.com/from-r-vs-python-to-r-and-python-aa25db33ce17
PypeR - R within Python
pyRserve - connect to a remote R server
rpy2 - runs embedded R in a Python process

R using python

rPython
SnakeCharmR - fork, more modern ver of rPython
PythonInR
Reticulate [cheatsheet]

Parallelization in R

2 categories of parallelization: - Shared memory: ie, multiple cores, single machine memory space. typically fork-based. - distributed memory: multiple machines (multiple OS instances), using Socket, MPI, etc 6 main types/bundles of parallization exist in R:

same host, FORK based, parent/child memory model with copy-on-write: mclapply, mcparallel, mccollect. multicore is easy if code already use lapply. Not usable in Windows natively as it does not support fork().

Rdsm - shared memory within a single computer, so multiple PSOCK same-node processes can read/write to 1 single memory region.

PSOCK based multi-node parallelization, without needing MPI (cluster): SNOW: parLapply, clusterApplyLB

Socket based, can work on windoze.
Snow cluster is well suited for use in HPC env. But better to use parallel rather than code using snow primitive, since it implent snow and multicore in a single library, as well as parallel random num generation.

MPI based -- SNOW cluster implement on this as well.

pbdR, hiding some of the complexity of MPI

Hardoop/SparkR but that's a platforms of its own.

Packages that does parallelism under the hood. eg BLAS, caret, many others

future_map(), purrr -- https://byuistats.github.io/M335/parallel_furrr.html

Tutorials

Chris Paciorek (UC Berkeley Stats Dept) Parallel Processing for Distributed Computing Cover R, Python, Matlab and C, as well as how to use them in a Slurm cluster.

Jonathan Dursi Beyond Single-Core R a slide deck on Parallelization in R. Summary slides:

Packages that does parallelism under the hood -- https://ljdursi.github.io/beyond-single-core-R/#/22

paralle/multicore with mc* -- https://ljdursi.github.io/beyond-single-core-R/#/51

cluster* parLapply snow -- https://ljdursi.github.io/beyond-single-core-R/#/73

foreach doParallel isplit %:% -- https://ljdursi.github.io/beyond-single-core-R/#/98

Best Practices -- https://ljdursi.github.io/beyond-single-core-R/#/106

bigmemory morder mpermute mwhich sub.big.matrix -- https://ljdursi.github.io/beyond-single-core-R/#/133

Rdsm -- shared memory within node by multiple process -- https://ljdursi.github.io/beyond-single-core-R/#/142

Dursi's github repo has data for running his example, and I wrapped that with a Jupyter Notebook container so everyone can try using this Docker Container with Jupyter notebook server:

Start jupyter notebook web server (on specific port, eg 6997):

docker run -p 6997:6997 -v "$PWD":/mnt -it --entrypoint=/opt/conda/bin/jupyter  tin6150/beyond-single-core-r  lab --allow-root  --no-browser --port=6997 --ip=0.0.0.0

Point web browser to something like
  http://127.0.0.1:6997/
and paste the token URL link shown on the terminal console

/mnt is a mount of the current dir (PWD) where you started the docker process, and files written there will persist after the container is terminated.  (Other files inside the container are ephemeral!)

Parallelization in R using mc* functions

mclapply
mcparallel (likely just called parallel in early days)
mccollect
mc.core

Ref: Beyond single core R

mc*        are multi-core routines
mcparallel is good for parallel tasks
mclapply   is good for data parallelism - easily replace lapply()
mc.cores   really should represent number of tasks, core can be oversubscribed often with little perf degradation

parallel::mclapply - multi-core lapply - https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html

instead of 
	result = lapply(starts, fx)
use
	result = mclapply(starts, fx, cores=4)
mclapply gathers up the responses from each of these function calls, and returns a list of responses that is the same length as the list or vector of input data (one return per input item).

starts is a variable holding data, eg rep(100, 40)
fx is a custom function that lapply() runs

pvec - for simple use case of applying func to each element of vector and return vector, pvec is easier to use
# https://ljdursi.github.io/beyond-single-core-R/#/48



https://ljdursi.github.io/beyond-single-core-R/#/30
mcparallel  # fork a task , run in background
mccollect   # wait and collect result.  so kinda like MPI scatter/gather


library(multicore)                    ## functionality replaced by library('parallel') 
fun1 = function() {Sys.sleep(10); 1}
fun2 = function() {Sys.sleep(5);  2}
f1 = parallel(fun1())                 ## new fn name is mcparallel()
f2 = parallel(fun2())
result = collect(list(f1, f2))

result[[1]]  # would contain result of f1

Linear Regression Model fitting, in series vs parallel
https://ljdursi.github.io/beyond-single-core-R/#/32 and slide 33


fit1 =  lm(DISTANCE  ~ AIR_TIME,  data=jan2010))
fit2 =  lm(ARR_DELAY ~ DEP_DELAY, data=jan2010))

-vs-

parfits = function() {
     pfit1 = mcparallel(lm(DISTANCE  ~ AIR_TIME,  data=jan2010))
     pfit2 = mcparallel(lm(ARR_DELAY ~ DEP_DELAY, data=jan2010))
     mccollect( list(pfit1, pfit2) )
}
parfits()

Note that mcollect asked for the tasks output to be combined into a list.
Some data structure maybe needed to extract more complex model result??

domino data lab on caret/doMC


library(doMC)
library(caret) # popular ML using foreach to parallelize CV-folds, etc.
	fit.lda.ser vs
	fit.lda.par

Parallelization in R using cluster functions

makeCluster()
stopCluster()
doParallel()
registerDoSEQ()
snow
foreach
%do%
%dopar%


for is run for its side effect
foreach returns result, side effects maybe gone (when using parallel backend).
foreach should be tought of lapply, not for loop


%do% is serial
%dopar% is parallel, and need a backend.
Parallel libs: doParallel, doMC, doMPI
  cl = parallel::makeCluster(4)
  cl = parallel::makeCluster(4, type = "FORK")   # unix fork() based, not usable in windows natively.
  cl = parallel::makeCluster(4, type = "PSOCK")  # default.  Parallel Sockets on a SNOW cluster with 4-node
  cl = parallel::makeForkCluster(4)

use registerDoSEQ() if want to force foreach to run in serial 

much of the parallelization is done using fork() in unix (may not work in windows).
Also, fork() allow access to vars/memory from parent, but result not returned at the end.
Need to do some special collection/return if need to save those data.
Thus, "side-effects" such as those produced by `for` vanishes in parallelization.

SNOW - like PSOCK, entirely new process, so ned to explicitly copy data.  but process can be on remote machine.


https://ljdursi.github.io/beyond-single-core-R/#/55
clusterCall()
clusterEvalQ() # easier to use for tasks


  cl = parallel::makeCluster(4, type = "PSOCK")  # Parallel Sockets creates a brand new session in each node, 
						 # R copies data and .packages to each node, so more overhead than fork().  
						 # 4 here indicate create a 4-node SNOW cluster, which are just cores in a SMP machine.   
						 # for multinode, provide a vector of hostnames, which R need to ssh to (and run 1 core per machine?) 
						 # https://stackoverflow.com/questions/36794063/r-foreach-from-single-machine-to-cluster
  multinode foreach call would need to have param for .packages to copy to each node, as well as .export for variables  eg:
  foreach(i = 1:nrow(data_frame_1), .packages = c("package_1","package_2"), .export = c("variable_1","variable_2"))  %dopar% { ... }     

If the foreach loop is in the global environment, variables should be exported automatically. If not, you can use .export = ls(globalenv()) (or .GlobalEnv).
per https://stackoverflow.com/questions/45767416/how-to-export-many-variables-and-functions-from-global-environment-to-foreach-lo

doParallel

One way to do parallel processing in R

NumThread=15
library(doParallel)
cl = makeCluster( NumThread, outfile="mkcluster.out" )   ## create a cluster object
registerDoParallel(cl)
foreach(i=1:3) %dopar% sqrt(i)
# the %dopar% says to run the foreach loop in parallel for each item
# all console output inside cluster object will go to outfile 
# there are eg combine=rbind to merge result
stopCluster(cl)

.combine=rbind # row-bind, ie, each result becomes a row
.combine=cbind # col-bind, ie, each result becomes a col
.combine=c     # c() ?  concatenate?


detectCores() ## return number of cores on machine

Ref:

CRAN doParallel
R Doc on doParallel
Quick Intro to Parallel Computing in R

Beyond Single-Core R (and github source at https://github.com/ljdursi/beyond-single-core-R

Rdsm

https://ljdursi.github.io/beyond-single-core-R/#/139

library(parallel)
library(Rdsm)

nrows = 7
cl = makePSOCKcluster(3)       # form 3-process PSOCK (share-nothing) cluster
init = mgrinit(cl)             # initialize Rdsm
mgrmakevar(cl,"m",nrows,nrows)  # make a 7x7 shared matrix
bar = makebarr(cl)

clusterEvalQ(cl,myid = myinfo$id)

clusterExport(cl,c("nrows"))
dmy = clusterEvalQ(cl,myidxs = getidxs(nrows))
dmy = clusterEvalQ(cl, m[myidxs,1:nrows] = myid)
dmy = clusterEvalQ(cl,"barr()")

print(m[,])

Clustering on Clusters

https://ljdursi.github.io/beyond-single-core-R/#/57

ship data to other nodes of the cluster:
cl = makeCluster(8)
clusterExport(cl, "jan2010")
cares = clusterApply(cl, rep(5,4), do.n.kmeans)
stopCluster(cl)
mcres = mclapply(rep(5,4), do.n.kmeans, mc.cores=4)

https://ljdursi.github.io/beyond-single-core-R/#/59

# Running across machines:
# 2 machine, each with 8 core of process?

hosts = c( rep("localhost",8), rep("192.168.0.10", 8) )
cl = makePSOCKcluster(names=hosts)
clusterCall(cl, rnorm, 5)
clusterCall(cl, system, "hostname")
stopCluster(cl)

Snow

https://ljdursi.github.io/beyond-single-core-R/#/61

scheduling
viz of run

library(snow,quiet=TRUE)

do.kmeans.nclusters <- function(n) { kmeans(jan2010, centers=n, nstart=10) }

cl = makeCluster(2)
clusterExport(cl,"jan2010")

tm = snow.time( clusterApply(cl, 1:6, do.kmeans.nclusters) )
plot(tm)

# load balance, below is more like mc.preschedule=FALSE, kicking off task to worker only when needed
tm.lb = snow.time(clusterApplyLB(cl, 1:6, do.kmeans.nclusters))
plot(tm.lb)

https://ljdursi.github.io/beyond-single-core-R/#/73 - Summary for parallel/snow

clusterExport good for for SIMD

clusterApplyLB good for MIMD

mcparallel is best for MIMD?

parLapply

makePSOCKcluster for small cluster

makeMPIcluster for larger cluster -- but read up on pbdR first

pdbR -- https://ljdursi.github.io/beyond-single-core-R/#/143

pdb*apply -- https://ljdursi.github.io/beyond-single-core-R/#/152

using pdbR -- bring hpc distributed memory processing (SPMD) to R, see section 3.1.2 of CPaciorek's Parallel Processing for Distributed Computing

https://ljdursi.github.io/beyond-single-core-R/#/83

foreach parallel
.multicombine=TRUE

%:% nest foreach object # slide 84

MPI cluster

https://ljdursi.github.io/beyond-single-core-R/#/149

init()
rank = comm.rank()
...
allreduce(length(dta), op="sum")
comm.print(data.min)

finalize()

nestedParallel

https://stackoverflow.com/questions/21306027/nesting-parallel-functions-in-r


cl = makeCluster(2, type = "SOCK")
clusterCall(cl, function() {
  library(doSNOW)
  registerDoSNOW(makeCluster(2, type = "SOCK"))
  NULL
})
registerDoSNOW(cl)

Parallel R Baggage

Some history of the R parallel back ends...
see: https://stackoverflow.com/questions/28989855/the-difference-between-domc-and-doparallel-in-r

doParallel is essentially merge of multicore (doMC) and snow.

multicore is w/in same machine, so faster than snow.  
multicore is fork based.
snow was intended to leverage HPC MPI stack.   so registerDoSNOW() has more overhead than registerDoParallel().  

parallel is a merger of snow and multicore


registerDoParallel vs registerDoSNOW
registerDoParallel() could be invoking the multicore backend and use doMC/fork when on single node, thus less overhead than registerDoSNOW()

registerDoSEQ() is run in serial, useful for debugging to test whether parallel code produce same result as sequential code.

SNOW use PSOCK, easier to deal with than MPI.

EpiStat

Stat for Epidemiologist

library("EpiStat")


# not sure if these all are from this library, but what I am learning in epi1.

nhanes <- read.csv("ps7_nhanes.csv",header=TRUE)

# converting a numeric 1, 2 to categorical male, female, but creating a new col (sec_cat) in existing table (nhanes)
nhanes$sex_cat = ifelse(nhanes$sex==1, "Male", "Female") 

tab1x2 <- table(nhanes$sex) # cmd to create table ... by counting freq of values in col sex

table2x2 <- table(nhanes$dpq_bin, nhanes$vitd_def)  # create 2x2 epi table
addmargins(table2x2)   # addmargins create total (row and col) 


# express table "nhanes" entries as proportions
prop.table(tab2)    
# OR
proportions(tab2)

sum(tab2) # return a scalar containing total freq (ie count)  ## here think of count row by group

help(CS) # Cohort Study measuring Risk
CS( x,         "cases",   "exposure", full=FALSE)
CS( tableName, "outcome", "exposure", full=FALSE)
CS(nhanes,     "dpq_bin", "sex_cat",  full=FALSE)

R studio cheatsheets eg on data viz using ggplot, dplyr, etc.

Rnd tidbits

shortcuts: R allow abreviating option names when things can be uniquely identified. eg:

Both of these accepted by R:

Result = t.test(Data, mu=8, alternatives="two.sided", conf.level=0.95)
Result = t.test(Data, mu=8, alt="two.sided",          conf=0.95)

This is why col='Red' and color='Red' are both allowed in ggplot.

PS. alternative refers to the alternate hypothesis (vs null hypothesis).

attributes

Fn like t.test(), binom.test() return a sizable list of things.  They are internally separated by "attributes" and can be retrieved directly.  eg
attributes( Result )

Result$conf.int
Result$p.value
Result$parameter    # only df is returned as "parameter"

ref: R tutorial on Student t Test

PH142 - Stats for Public Health

Much of the R code below are from the PH 142 stat class by Prof Dr Mi-Suk Kang Dufour

course core concepts - ccc

in: x-axis value.  out: AUC, which is prob of event x
	d- (density) and p- (distribution) class of function.  eg dnorm, dpois , pbinom,
in: AUC, a probability of event.  out: x-axis value.  eg, have 60% chance of event, what is the expected value.
	q- (quantile) class of function.  eg qpois, qbinom,
generate a list of possible values according to a distribution.
	r- (random) class.  eg rnorm(n,mean,sd)
eg: sim_01 = rbinom(n=200,size=1,prob=0.3030) # think of the rbinom as "200 choose 1, with prob of 30.30%" but not quite?


	• dnorm(x=…, mean = 0, sd = 1 )    density .  (in dbinom is exact value, but for Norm dist of continuous var, this does not usually make sense (?)
	• pnorm(q = 1.2, mean = 0, sd = 2), to calculate the cumulative probability below a given x-axis value
	• qnorm(p = 0.75, mean = 0, sd = 1) to calculate the x-value for which some percent of the data lies below it (ie input is AUC, a probability value, thus named p=… in first input argument)

	• dbinom(x=#success,size,prob of success)    k=number of successful event desired, exact match.
	• pbinom(q=k, size=10,prob=0.29) binomial, CUMULATIVE prob, X ≤ k. [Lect 16 slide 35]
	• qbinom(p, size, prob, lower.tail=T)
	• rbinom(n=200,size=1,prob=0.3030) # think of the rbinom as "200 choose 1, with prob of 30.30%" but not quite?

	• dpois(x=…, lambda=mu) : discrete prob, X=x exact value.
ppois(q=k, lambda=mu)  : cumulative, up to X ≤ k


calculating critical cut-off values of z.  [wk8 p119]
eg z* for 90 or 95% Confidence Intervarl
z.star95 = qnorm(0.025, lower.tail=F)  #1.96
z.star90 = qnorm(0.05,  lower.tail=F)  #1.64

Exam 1: Review

Constructing 2x2 or RxC contingency table

# Lab 11 [wk14] video @ 30:15
# easiest way to generate 2x2 table out of raw data!
d.frame %>% group_by(exploratoryVarColumn) %>% count(ResponseVarColumn)

# manually
# wk14 Lect 32 slide 33
two_way_tibble = tribble( ~disease, ~ noDisease, # header create col for response vars
                             16,      84,        # treatment grp / exploratory var 1 
                              3,      97)        # treatment grp / exploratory var 2

Course: Review

What test to use? Decision tree



input:  continuous
output: continuous
Test:   lm(), scatter plot with linear fit.  [or convert to log scale first, or quadratic, etc]

~~~~~

What test to use?  Decision tree: 

output: CONTINUOUS variables as OUTCOME, 
input:  1 , 2 or more categories, form basis of decision tree.

eg experiment/sample has an new x.bar as result.
essentially, non True/False binomial output.
[wk15 video@28:45 p16]
ch 25 of text does decision tree a bit differently.

Num of (input) Group/Categories
  |
  +----- One ---know sigma? --+--→  Z-test (know sigma)     parametric, definition of Normal dist
  |                           +--→  t-test (unknown sigma)  still parametric, assume  Normal distribution
  |                           \--→  bootstrap (non parametric)
  |
  +----- Two ---+--- Independent? --+-- Non-parametric -→ Wilcoxon Rank Sum 
  |             |                   \-----  param      -→ two-sample T test
  |             |
  |             \--- Non-Indep    --+-- Non-parametric -- Wilcoxon Signed Rank
  |                                 \------ param      -- paired T test
  |
  +--- Three+ --+--- non-parametric ---Kruskal-Wallis
  |             \------  parametric -- anova  ----- if support H_a , subsequent post-hoc test, eg Tukey HSD or Bonferroni correction tests
  |
  \--- continuous --→ Linear Regression  lm()

t-test can be used when data is large enough, 
from Central Limit Theorem that when data is large enough --→ approx Normal dist.

~~~~

Categorical variables (as input and output)
Input:  exposure Yes/No  (Categorical, proportions of Yes in explanatory [input] var).  calc p.hat1 vs p.hat2 w/in each of these groups!
Output: disease  Yes/No  (Categorical, proportions of Yes as response)
The underlaying data has binomial distribution., we are counting "number of Yes(sucess)"
And still need #sucess ≥ 10 ; #failure ≥ 10 (so that binomial appox Normal)
ONE group/cat:   eg juror by race distribution, COVID deaths by race dist, does the count match their pop% for each race (chisq GOP)
TWO groups/cats: Common with 2x2 or RxC contingency tables.


Num of Group/Categories
  |
  +----- One --- --+---- binary (2 groups) ----→  1 sample proportion (Z based)
  |                |
  |                \--→  mult cat of outcome --→  chi-sq Goodness of Fit GOF (non parametric)
  |
  +----- two --- --+---- binary   -------------→  2 sample proportion (Z based)
  |                |
  |                \--→  mult cat of outcome --→  chi-sq test of Independence (non param)  [or Rnd Permutation]
  |                                                   ^
  |                                                   |
  \---- Mult -----------------------------------------/


Chi-squared test requirements:
Ei ≥ 5 for 80% of the cells (in a RxC contingency table)
        2x2 table then need to have each cell ≥ 5
Ei ≥ 1 in all cells (here used ≥ rather than a strictly gretater!) [video @34:56]
        10x2 table, then some small cell can be ≥ 1 (but NOT zero success)


Testing method:

Null hypothesis:  

signal
------- = mu   compare to some distribution
noise  

CI:

signal estimate -/+ theoretical dist * (noise)



Final review slide by GSI, p91.
Cover a case of 
input: categorical var
output: continuous var
Use Linear Regression with Categorical Vars.
(not covered in lecture of Stat142?)

Exam 1: Review

dplyr


PPDAC = Problem, Plan, Data, Analyze, Conclusion.

3 main poblem types for stat using PPDAC model:
● Descriptive: learning about some particular attribute of a population
● Causative/Etiologic: do changes in an _explanatory_ variable cause changes in a response variable?
  eg: study the whole scatter plot of how x relates to y
● Predictive: how can we best predict the value of the response variable for an individual
  eg: laser focus on individual, how to use all the stat knowledge and figure out what will happen to 1 person


Note it is explaNAtory, not exploRAtory, for the x vs y axis interpretation!


col operation
select() 	# awk picking column
mutate() 	# add computed column
rename() 	# change col name


row operation
arrange() 	# sort
filter()	# grep 

think SQL summary operation
group_by() 
summarize() 


# create data.frame
OFC1=c(.379,4.1,2.01,12.6,4.98,1.92)
OFC2=c(6.31,7.94, 9.46,11.0, 11.1, 8.9)
OFCdata = data.frame(OFC1,OFC2)
OFCdata

# lm() only works on dataframe, not matrix.  
fit = lm(OFC1 ~ OFC2, data=OFCdata)
library(broom)
tidy(fit)
# tidy produce intercept, slope, and stats like std.error, p.value of the fit.


# marginal distribution vs conditional distribution, p 35
# marginal refers to margin, ie whole group of the 2x2 table.  
# Essentially just the whole pop prob(A)
# conditional is a narrower group.  P(A|B) kind of stuff.

# Simpson's Paradox is tied to confounder (lurking variable).
# when looking at whole population, a different pattern emerge, 
# because cofounder was the cause, and it was not being measured
# review deck page 37

ggplot


ggplot(data = ______, aes(x = var1, y = var2)) +
geom_point(aes(col = var3)) +
facet_wrap(~ var4) +
labs(title = “Main”, x = “Type x-axis here”, y = “Type y-axis here”)

facet_wrap(      ~ var1, nrow=3 )	# cretae series of ggplot based on single var1
facet_grid( var1 ~ var2 )        	# create a n*m grids of graph for var1 vs var2  [review page 61]

stat="identity" # when y is specified  # arg of geom_histogram() (and geom_bar?)  
# eg lab06 line 300
p16 = ggplot(data=obs_data,aes(x=x_vals,y=probs)) + geom_histogram( binwidth = 1, stat="identity" )

geom_bar( ... ) params: 
position="dodge"	# side by side grouping in barchart
position="stack"  	# on top of, less liked by stat wanting to do comparison.

Exam 1: Review

slide 65
food = read.csv("food_supply.csv")  # food is a df

# convert select columns to vector, for easier average calc
vector_of_alcohol_supply = food$alc.beverages
sum(vector_of_alcohol_supply) / length (vector_of_alcohol_supply)

# dplyr finding average and created as df
food %>% summarize( mu=mean(vector_of_alcohol_supply) )

Lecture 1: Intro to R

# File menu, new, R Markdown doc = will generate a new doc with basic example in it
# kinit button in toolbar will render the doc for web view
# To speed up knit, one time run in console.
tinytex::install_tinytex()
# http://rmarkdown.rstudio.com = Rmd info
# http://socviz.co/gettingstarted.html = tutorial for Rmd and kinit

```{r code-chunk-desc, echo = FALSE}
print("R code chunk goes between ```{r}  ``` in Rmd")
```

```{r another-code-eg, eval = FALSE}
# can use F or FALSE, T or TRUE
# echo = F   # do not echo command into output by kinit
# eval = F   # do NOT evaluate/interpret as code to run, just do verbatim text output [also for output of kinit]

bad-var-name = 0  # dont use - in variable name, R treats it as math substraction!  use _
```

```{r}
# this one below is example of invoking autograder to check this checkpoint
. = ottr::check("tests/p1.R")
```


help("library")
?library
??histogram	# double ? for "topic" search

functions/packages introducted in this lecture:
- ggplot2, 
- filter() aka stats::filter()

Resources:

Hadley Wickham and Garrett Grolemund have written a helpful book about R that is freely available here: https://r4ds.had.co.nz/

UCLA has a website with many helpful examples here: https://stats.idre.ucla.edu/r/

Lecture 2: Working with data in R

library(readr)                              # provides read_csv
lake_data = read_csv("mercury-lake.csv")		# read csv into an array (R is 1-indexed).


head(lake_data)		# show first few lines
lake_data %>% head	# same, using unix pipe style 
dim(lake_data)		# dimension as num of rows, cols
str(lake_data)		# structure, compact display of the kind of data in the array
names(lake_data)	# aka colnames(lake_data), show column heading name  
			# this eg has 6 cols:  lakes ph chlorophyll mercury number_fish age_data

# becareful with space in csv, no space allowed after comma, or R will read the space into the data structure, no automatic stripping.
# foo,TRUE will be logical operator, but "foo, TRUE" will just be string.
# TSV, tab delimited, maybe better...


library(dplyr) 		# PH142 use dplyr extensively

# rename column heading names, note first param is the array variable holding the data.
# new_array    = rename(old_array,  new_col_name = old_col_name)
lake_data_tidy = rename(lake_data,  name_of_lake = lakes       )
#                                               ^^^ don't use <- here      

# rename multiple cols at same time:
lake_data_tidy = rename(lake_data,                  # array_name
                           name_of_lake = lakes,    # new_name = old_name
                           ph_level = ph)


# using %>% pipe method.  think unix bash cli:
lake_data_tidy = lake_data  %>% rename( new_name = old_name )


# there are some nuances with pipe, if run like this, rename result NOT stored back into the array!
# because there were no assignment operation where the rename() line ran
new_lake_data_tidy = lake_data  
new_lake_data_tidy              %>% rename( new_name = old_name )
new_lake_data_tidy %>% head



# selecting (choosing columns) to keep, think SQL
# think awk to select which column to keep
# new_array  = select(old_array,  col1,  col2, col3       )   # first param for select() is source array/table
smaller_data = select(lake_data,  lakes, ph,   chlorophyll)
smaller_data %>% head(2)

# Negative select, ie drop named columns, note the minus sign before the column name
# new_array    = select(old_array,  - col_name_to_drop)
smaller_data_2 = select(lake_data,  - age_data        )


# drop multiple columns, using pipe operator syntax
smaller_data_3 = lake_data %>% select(- age_data, ph)
smaller_data_3 %>% head(2)

View(lake_data)			# interactive table to view array data, note it is uppercase V.  utils::View(df)

# arrange() = display data with sorting		# think sort, display rows in new order (but no acutal change to the data)
lake_data %>% arrange(ph)			# show data sorted by value in the ph column, ascending order is default
lake_data %>% arrange(desc(ph))			# show data sorted by value in the ph column, DESCending order
lake_data %>% arrange(age_data, - ph)		# can sort by mult cols, and ph in descending order


# mutate() perform operation on a column, adding it as new col to a new array
# new_array        = old_array %>% mutate(new_column_name     = old_col_name OPERATION )
lake_data_new_fish = lake_data %>% mutate(actual_fish_sampled = number_fish    * 100   )


# chain multiple operations together using pipe operator
tidy_lake_data = lake_data %>% 
  rename(lake_name = lakes) %>%
  mutate(actual_fish_sampled = number_fish * 100) %>%
  select(- age_data, - number_fish)

tidy_lake_data %>% head(3)



# filter() - this is like grep, extract select rows.  eg keep record where col age_data has value "recent"
# new_array        = old_array %>% filter(col_name  FILTER_CONDITION             )
lake_data_filtered = lake_data %>% filter(age_data   ==     'recent'             )
lake_data_filtered = lake_data %>% filter(age_data   %in% c('recent','str2')     )
# %in% with a concat set of strings allow multiple == conditions to be specified.  eg match one of the listed string in the c() set.
# use != for not equal


# filter() can also with math operations
# eg: only keep records where math operation is satisfied (in col named ph, keep rows whose values s larger than 7)
lake_data_filtered = lake_data %>% filter(ph > 7)	
lake_data_filtered = lake_data %>% filter(age_data=`recent`)

# use %in% operator
# eg, filter (ie show) rows where lake (name) is Alligator or  Blue Cypress:
#                               VVV--- c() is the concatenate function in R to create a list.
lake_data %>% filter(lakes %in%  c("Alligator", "Blue Cypress")  ) 


# multiple filters at once, comma or & separated:
lake_data %>% filter(ph > 6 , chlorophyll > 30)
lake_data %>% filter(ph > 6 & chlorophyll > 30)
#                          ^^^--- use of & is optional, comma also implied to be an AND operation
lake_data %>% filter(ph > 6 | chlorophyll > 30)
#                          ^^^--- vertical bar is an OR operation




# group_by() and summarize() can create new summary tables with aggregate data
# note fn name is summarize, summary() does a very different thing!
# mean() and sd() calculates the mean (average) and standard deviation of variables.

lake_data %>% 
  group_by(age_data) %>% 
  summarize(mean_ph = mean(ph))

# eg multiple summary operations
lake_data %>% 
  group_by(age_data) %>% 
  summarize(mean_ph               = mean(ph),
            standard_deviation_ph = sd(ph),
	    length                = length(ph) )   # this count number of item in the group  # wk11 reader p8

length()    # length( any_col_name ) to get number of rows, typically can provide count for sample size n.
nrow( nhanes %>% filter( hs=='No' )   # this count number of rows, only those where hs=no.  $ Lab09wk12

Reference material:
- [15 min intro to dplyr](https://www.youtube.com/watch?v=aywFompr1F4)
- [Data wrangling cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

Lecture 3: Visualizing/plotting data

 - Visualization of categorical data: use `ggplot`'s `geom_bar()`
 - Visualization of continuous  data: use `ggplot`'s `geom_histogram()`


1. 'ggplot' to set up a canvas for graphics
2. `geom_bar(stat = "identity")` to make a bar chart when you specify the y variable
3. `geom_histogram()` to make a histogram for which ggplot needs to calculate the count
4. `fct_reorder(var1, var2)` to reorder a categorical variable (`var1`) by a numeric variable (`var2`)
    - from the `forcats` package
4a.`fct_relevel( apgar5Cat, "Low", "Med")` to reorder a categorical variable by hand  # Lect 5, Prj 3
5. `geom_point()` to make a plot with dots  (scatter plots also use this)
6. `geom_line()` to make a plot with lines
7. `geom_boxplot()` to make a box and whisker plot 
8. `geom_jitter(width=.1)` to add points with jittering to boxplot


color() # return the color name that R has "build-in"


id_data = read_csv("Ch01_ID-data.csv")
library(ggplot2)
ggplot(id_data, aes(x = disease, y = percent_cases)) 	+   # aes specify axis
  geom_bar(stat = "identity")   +                           # geom_bar specify plotting bar chart
  labs(y = "Percent", x = "")   +                           # label the axis
  theme_minimal(base_size = 15)                             # font size for labels



eg2: 
fct_reorder()                    # order disease by decreasing percentage -- library(forcats) ?
aes( fill=type )                 # aes in geom_bar: color bar graph according to disease type (one of the col in data table)
theme(legend.position = "top")   # add legends of color vs types at top of plot

id_data = id_data %>% 
  mutate(disease_ordered = fct_reorder(disease,             percent_cases, .desc = T))
#        ^^new_col_name^               ^^what to rearrange  ^^order by this field/col.


# used in group project part 3
cord_clamp_song_ext = na.omit(song) %>%
  mutate(  apgar5Cat  =   case_when(       apgar5==0  ~ "Low" ,
                                           apgar5==6  ~ "Low" ,
                                           apgar5==7  ~ "Med" ,
                                           apgar5==9  ~ "High" 
                                           ) ) %>%
  mutate( apgar5Cat = fct_relevel( apgar5Cat, "Low", "Med", "High" ))


ggplot(id_data, aes(x = disease_ordered, y = percent_cases)) + 
  geom_bar(stat = "identity", aes(fill = type)) +
  labs(y = "Percent", x = "")      +
  theme_minimal(base_size = 15)    +
  theme(legend.position = "top")



eg3, geom_histogram()


- Histograms look a lot like bar charts, except that the bars touch because the 
underlying scale is continuous and the order of the bars matters
- In order to make a histogram, the underlying data needs to be **binned** into 
categories and the number or percent of data in each category becomes the height 
of each bar. 
- the **bins** devide the entire range of data into a series of intervals and counts
the number of observations in each interval
- the intervals must be consecutive and non-overlapping and are almost always chosen to be of equal size


geom_point() produces an x-y sequence of points (but not connected by line).  
kinda like a x-y scatter plot, but seems like 1 value per x entry.

geom_line() is like geom_point(), but draws a continuous line instead of a series of points.


### Lab 2

### df CS_data_raw is reassigned to itself!  
### so think of this as appending column to existing df.
### fct_relevel affects plot output (ordering of categories), but does NOT change the data in the source table (df)
CS_data_raw <- CS_data_raw %>% 
  mutate(Income_Group_order = forcats::fct_relevel(Income_Group, "Low income", 
                                              "Lower middle income", "Upper middle income", 
                                              "High income: nonOECD", "High income: OECD"))

### df CS_data_raw is reassigned to itself!
### so think of this as appending column to existing df.
### mutate end up adding new column,  named Income_Group_order
### new col created using forcats::fct_relevel()
### if one do:
str(CS_data_raw)
### it shows that the new column store "Factor" types data, albeit View() doesn't show any difference
### Income_Group_order: Factor w/ 5 levels "Low income","Lower middle income


Factor as variable type to store categorical data, 
so that it is sortable.
eg

CS_data %>% pull(Income_Group_order) %>% levels()

lab02 p6 eg:
ggplot(CS_data, aes(x=Income_Group)) + geom_bar()
  ### geom_bar() is the implicit y-axis,
  ### here didn't specify stats="identity", so it counted the x axis var (Income_Group)
  ### use x=Income_Group_order to use the Factor var.  


p14 = ggplot(CS_data, aes(x=GDP_2006)) + geom_histogram( binwidth=8000)
p14

p14 = p14 + facet_wrap( ~ Income_Group_order )  # facet_wrap create a series of panel to ggplot, one per each variable entry in the  "~ var_class" Income_Group_order   ### hw02

ggsave( "lab02_cs_plot.png", plot = last_plot())   # save plot to file

## hw02

### add a vertical line to indicate median.  lty=1 is solid line, 2 is dashed line, 3 is dotted.
geom_vline(aes(xintercept = median(GDP_2006)),  lty=2)

Lecture 4/HW02


### manually calculate (Min, Q1, Median, Q3, Max)
### (Note that R's calc method is very slightly different than textbook, unless add "type=2" as option to quantile() )
### and add them as yintercept line to a boxplot

### most R stat fn has na.rm=F as default, which is not to drop empty value/cell, but may return an error about NaN when it hit such cell

Min = min(CS_data$CS_rate_100, na.rm=T)
Q1 = quantile(CS_data$CS_rate_100, probs = 0.25)
Median = median(CS_data$CS_rate_100)
Q3 = quantile(CS_data$CS_rate_100, probs = 0.75)
Max = max(CS_data$CS_rate_100)
Outliers: any data point more than 1.5*IQR away from median.
IQR: InterQuartile Range: Q1-Median or Q3-Median

five_num_summary = CS_data %>% summarize( min = Min,
                                          Q1 = Q1,
                                          median = Median,
                                          Q3 = Q3,
                                          max = Max)
fns2 = five_num_summary

p14b =  ggplot(data = CS_data, aes(y = CS_rate_100)) +
          geom_boxplot(col = 'black', fill = 'sienna2') +
          theme_minimal() + 
          geom_hline(aes(yintercept = fns2$min), col = 'green') +
          geom_hline(aes(yintercept = fns2$Q1), col = 'green') +
          geom_hline(aes(yintercept = fns2$median), col = 'green') +
          geom_hline(aes(yintercept = fns2$Q3),     col = 'green') +
          geom_hline(aes(yintercept = fns2$max),    col = 'green')

Lecture 6


## read data from excel spreadsheet:
library(readxl)
spending_dat = read_xlsx("country-spending.xlsx", sheet = 2, range = "A7:D47" )


library(tidyverse)  # provides tidy
library(dplyr)      # provides %>%

## back ticks must be used when variable name has space in them
## thus use rename() to change the variable name (column heading)
spending_dat = spending_dat %>%
rename(country_code = Country.code,
       health_expenditure = `Health expenditure per capita`, # back ticks
       GDP = `GDP per capita`) # back ticks must be used when variable name has space in them
# new_name =  old_col_name


tidy(data_table)		# fit data per tidyverse convention?

lin_model =  lm ( y ~ x ) 	# lm() is linear regression fit model, for y ~ x (x is predictor)
glance( lin_model )		# provide data like r^2, etc
cor( x, y ) 			# Pearson's correlation = r // xref correlation-squared aka r^2, r2, r-squared [wk02 L05 slide 51]
				# x and y are column names of table/dataframe, thus use inside summarize()
mana_cor = mana_data %>% summarize(corr_mana = cor(powerboats, deaths))


# ? correlation vs R^2 ??    
# correlation is r, but calc often return r^2 (R-squared) ## slide 25
# so would need sqrt() if ask for correlation r  
# r^2 value, when *100%, represent the % of correlation for the data points.  R^2 of 0.92 means 92% correlated.

# wk03 L06 slide 13
# correlation r is related to slope b.  In linear model y = a + bx, b aka beta, in algebra often as m
# b = r( Sy/Sx )
# r is the correlation variable
# pull( r.squared ) returns r^2, so take sqrt() to get r


library(broom)
glance( lin_model ) %>% pull( r.squared )  	# this return the r^2 value.  # slide 25

# is pull() like awk, grabbing column from data.frame?  ++ ??   select() pick columns from list (those with named entries)
# pull() is from dplyr.  returns a vector (the same str as one returned by concat cmd c() )

geom_abline() add the fitting/trend line.   it need intercept and slope)
geom_abline(intercept = b, slope = m)
geom_abline( -46.75, 0.1358)

geom_text_repel()	# repel text, ie, don't render them with overlap. 	# slide 46

m = linear_model$coefficients[2]   # slope  (it is a "named number" in R)
m = linear_model$coefficients[[2]] # slope ( [[]] returns a number )
b = linear_model$coefficients[[1]] # intercept


geom_smooth( method="lm", se=FALSE)  # R automatic way to have trend line from lm()  ## Lab10 wk13
geom_abline( slope=0, intercept=mean(boston2$medv), col="red" )  # horizontal line indicating mean value

# stuff about prediction in p28, review++


## scatter plot: geom_point()
mana_death1 = ggplot(mana_data, aes(x = powerboats, y = deaths)) +
  geom_point() +
  theme_minimal(base_size = 15)

## + line plot
tmp1 = ggplot(CS_data, aes( x=CS_rate_100, y=GDP_2006) ) + geom_point() +  geom_line() 
tmp2 = tmp1 + geom_smooth()  # add smooth line best curve fit

# next, colour the points by income_group
# note that the aes() is placed inside geom_point         -----------VVVVVVV
tmp6  = ggplot(CS_data_log, aes( y=log_CS, x=log_GDP) ) + geom_point(    aes(col=Income_Group) )   + geom_smooth() 
ugly6 = ggplot(CS_data_log, aes( y=log_CS, x=log_GDP) ) + geom_point() + aes(col=Income_Group)     + geom_smooth() 
# if aes of coloring by income_group is tackled onto the parent plot object, it would make geom_smooth() interrupted by Income_Group rather than span across these categories, as seen in ugly6 above

Lecture 7


lecture topic summary:
1. geom_bar(aes(col = var), stat = "identity", position = "dodge")
2. geom_bar(aes(col = var), stat = "identity", position = "stack")
     fct_relevel() to reorder grouping w/in bar charts
3. Marginal vs conditional distributions
     Margins are the row and col totals
     conditional distribution is conditional probabilities, eg P(LungCancer|Smoker), 
     	look at row value and totals only, eg A/(A+B)
	rather than whole 2x2 table.
	English + Math interpretation should make sense.
4. Simpson’s Paradox


# slide 14, some crazy multiple ~ operator thing

two_way_data <- tribble(~ smoking, ~ lung_cancer, ~ percent, ~number,
"smoker", "lung cancer", 4.8, 12,
"smoker", "no lung cancer", 95.2,238)

Lecture 8


# SRS = Simple Random Sample, slide 33, 38
set.seed(123)    # make future run repeatable
CS_100_1 = CS_data %>% smaple_n(100)        # pick n=100 rnd sample from data source
CS_100_1 = CS_data %>%  sample_frac(0.05)   # do 5% rather than static n size

# Proprtionate Stratified Sampling , slide 40
# group them , then sample, this provide stratum equal size proportions (eg for epi group randomized study)
CS_10percent_grouped <- CS_data %>%
  group_by(HOSP_BEDSIZE) %>%
  sample_frac(0.1)
dim(CS_10percent_grouped)
CS_10percent_grouped%>%group_by(HOSP_BEDSIZE) %>% tally()   
## tally() can be used to do counts!   like SQL count(*) 
## there is also a count() fn, it it is much pickier than tally().  the following does work:
CS_10percent_grouped%>%group_by(HOSP_BEDSIZE) %>% count()   


Q9df = read.csv("c:/Users/tin61/tin-gh/inet-dev-class/r_stat142/exam/simpleDF.csv")
Q9df %>% filter(status=="True") %>% group_by(zip) %>% tally()

Lab/HW03

case_when() eg, also note use of ~ :

insure_data = insure_data %>%
  mutate(bmi_cat = case_when(bmi < 16 ~ "Underweight",
                             bmi >= 16 & bmi < 25 ~ "Normal",
                             bmi >= 25 & bmi < 30 ~ "Overweight",
                             bmi >= 30 ~ "Obese"                   )  )
#

types of variables in stat:

- 'ordinal'
- 'nominal'
- 'continuous'	# measured
- 'discrete'	# counting
Note that integers are considered discrete!  thus num_of_child "int" is discrete
continuous is only for floats


p11 = ggplot(insure_subset, aes(x=age,y=charges)) + geom_point() +
  labs(title="age vs insurance charges for smoker with normal bmi")
p16 = p11 + geom_abline(intercept=b,slope=m)
p20 = p16 + geom_abline(intercept=new_b,slope=new_m,lty=2,col="red")
# lty=2 means use dashed line

insure_model = lm(charges ~ age, data=insure_subset)    


paused in line 387 ++FIXME++

Lecture 11-13 - Probability (week 5, post Midterm 1)


dplyr::sample_n( tbl=cold_data, size=5 )    # reader p24
geom_density    # reader p50  like smothened histogram.


## Lab04 ?    HW04 
round( 123.4567, digits=1 )   # round to 1 decimal place

uncount()
library(janitor)
janitor.tibyl    # tabulate
janitor.adorn_


# lab04 video ~[19:00]
# Try using uncount() and tabyl() here!
chd_dat_exp = chd_dat %>% uncount(n)     # n is the col for number of individual  uncount()expands to individual row, end up with long table!
#chd_dat_exp

chd_dat_exp %>%
  tabyl(Smoking, CHD,  Age) %>%
        ^^rows   ^col  ^stratified tables

eg output:
$old            #CHD=        #CHD=
 Smoking           no          yes
  +->  no 70.00% (490) 30.00% (210) <-- cell b: smoke=no,CHD=yes (age=old)
  \-> yes 40.00% (120) 60.00% (180)
          ^^^^^^^^^^^^
          cell c is smoke=ye
                    CHD=no


chd_dat_exp %>% janitor::tabyl(Smoking, CHD, Age) %>%
  janitor::adorn_percentages("row") %>%
  janitor::adorn_pct_formatting(digits=2) %>%
  janitor::adorn_ns()    # give n values in parenthesis

Lecture 14 - Normal Distribution (week 6)

rnorm(n = 100, mean = 2, sd = 0.4), to generate Normally distributed data from the specified distribution

pnorm(q = 1.2, mean = 0, sd = 2), to calculate the cumulative probability below a given value

qnorm(p = 0.75, mean = 0, sd = 1) to calculate the x-value for which some percent of the data lies below it

stat_qq() and stat_qq_line() to make a QQplot. Notice that aes(sample = var1) is neede

dbinom(x=3, size=10,prob=0.29) binomial, exactly X=x [Lect 15 slide 39]

pbinom(k=3, size=10,prob=0.29) binomial, CUMULATIVE prob, X <= k. [Lect 16 slide 35]
dbinom(#success,size,prob of success)
dbinom is referred as discrte prob of binomial, (it is not the density fn as when applied to normal dist?)

sim_01 = rbinom(n=200,size=1,prob=0.3030) # think of the rbinom as "200 choose 1, with prob of 30.30%" but not quite?

pois = poisson dist, similar to binomial, but no upper bound. typically for rare events, eg Infect Disease. [Lec 17 wk7]

dpois(x=?, lambda=?) discrete prob, when x=exact number. lambda (aka mu, ie average) is the sole param in poisson.

ppois(x=?, lambda=?) continuous prob, for x up to value specified. up to and including. X<=k **if do lower.tail=FALSE then decrement k by 1 to properly account for the discrete unit change. [Lect 17 slide 18]. eg: 1-ppois(q=3,lambda=0.1) # ans: [1] 3.846834e-06
ppois(q=3,lambda=0.1,lower.tail=FALSE)

r____(n=?, param) : r-class is always for generating random number of entries for simulation, according to parameters necessary for the stat dist (so Normal vs Binom vs Poisson would take different number of params).

Finding Normal Probabilities - pnorm

calc probability using pnorm( q=x ) given specific value of x


pnorm( q=x, mean=0, sd=1, lower.tail=T )  # last 3 params are defaults in R

mean is value on x-axis where peak of curve is
sd, standard deviation, is reflected on how tall and fat the curve of the CDF is.

pnorm( q=x,    mean=m. sd=sigma)  : provide q as input, get probability as output, which is AUC up to the cut off point q=x)
	x is a cut off point, up to this value.  area on left is the probility (CDF)
	eg P( x<1.2)  --->  prob with x as cut off at 1.2.  in r is:
	pnorm( q=1.2 ) # default mean=0, sd=1
	the result is the AUC up to the cut off point q of x=1.2.  # slide 35

Finding Normal Percentiles - qnorm

Finding Normal Prob is calc probability using pnorm( q=x ) given specific value of x
Here is the inverse:
Given the probability, find the corresponding x-value.
qnorm() is the function for this task.
(slide 43)

the percentile given is the area under the curve, thus:

qnorm( p=area, mean=mu, sd=sigma)  : provide p as input, get q as output, which is on the x axis .  

eg, given a N(mu,sigma), want to know the 75%, what x value that corresponds to.  qnorm(0.75) provides the answer.

Standard Normal and z-score

(slide 48)

The standard Normal distribution N(0,1) has µ = 0 and σ = 1.

X ∼ N(0, 1), implies that the random variable X is Normally distributed.

Conversion: z = ( x - mu ) / sigma

This standardized value is called the z-score

Interpretation: z is the number of standard deviations that x is above or below the mean of the data.

Simulating Normally distributed data in R

(slide 55) (line 404 of su21/l12-code.Rmd)


rnorm( )  # generate an array with rnd data that is normally distributed
          # optionally add mean and standard deviation (else result in std normal set)
runif(n=1,min=0,max=1)  # rnd number with uniform dist, this is what most programming lang urand number generates ? 

heights.women = rnorm(n = 1000, mean = 64.5, sd = 2.5)  # return array of 1000 elements, with mean and sd as indicated
heights.women = data.frame(heights.women)    # convert array to dataframe  (unlike unix, the var will be reassigned correctly, not overwritten and lost as with `cat > foo`


```{r plot heights, out.width="80%", echo=F, warning=F, message=F}
# create a histogram using data generated by rnorm() ::
hist = ggplot(heights.women, aes(x = heights.women)) + geom_histogram(col = "white") + theme_minimal(base_size = 15) 

# density plot
dens = ggplot(heights.women, aes(x = heights.women)) + 
  geom_density()  +                                                                       # density fn line plot for df heights.women
  stat_function(fun = dnorm, args = c(mean = 64.5, sd = 2.5), col = "forest green")  +    # add a green line for what is the expected distribution supposed to look like (dnorm=?density normal dist?) 
 theme_minimal(base_size = 15)

hist + dens + plot_layout()

plot_layout affect how multiple ggplots are displayed, was to control num of cols and rows when there are many plots.
```


# this does conversion to z-score t comform to stadard normal  
# ie, the z-score is calculated using manual formula (There might be build-in R fn, but not presented for class)
heights.women = heights.women %>% mutate(mean = mean(heights.women), 
                                          sd = sd(heights.women), 
                                          z = (heights.women - mean)/sd)

Lecture 15


obj:
  - Introduce the binomial distribution
  - What kinds of outcomes follow a binomial
  - Understanding the probability space for a binomial
  - What is the theoretical distribution for binomials 
  - discuss exact vs. cummulative probabilities for the binomial 
  - introduce some R code for binomial distributions
  - choose(n,k) is the R fn for C(n,k) aka "n choose k"


QQplot ...
quantile-quantile plot

quantile = row_number() / n()
z_score = qnorm(quantile, mean=m,sd=sigma)

R can do the qq plot automatically, instead of doing the manual calc with z-score and stuff.
counts = c(18, 4, 22, 15, 18, 19, 22, 12, 12)     # number of trees in 9 plots as vector
tree_data = data.frame(counts)                    # convert to df, as that's what ggplot (?stat_qq) needs
ggplot(tree_data, aes(sample=counts)) + stat_qq() + stat_qq_line()   # L15 slide 10, hw05 wk07


Gaussian  == Normal
Bernoulli == Binomial

params for normal: 
	mean, std dev

params for binomial, these 2 params are enough to tell shape of the distribution graph: 
  1. number of trials, number of samples to pick.  eg n=10 for sampling 10 bottles.  size param in R dbinom/pbinom
  2. prob of success  (ie, how frequent get yes, head, true, contaminated by benzene, etc)

Binomial need independent outcomes.  sample picking must be from a large batch, pop size >> sample size.  at least 20x was good?
  
Also needed for calculations:
  * k, like "n choose k", k is the number of success being sought for (eg k = 3 bottles contaminated by benzene)

dbinom(k,n,p)   X=x   exact, (k is standin for x, fuzzy when to use which)
pbinom(k,n,p)   X<=k  cumulative, up to 

dbinom(x=3, size=10,prob=0.29) binomial, DISCRETE prob,    X=x exactly. param name is x in R [Lect 15 slide 39]
pbinom(q=3, size=10,prob=0.29) binomial, CUMULATIVE prob,  X<=k.        param name is q in R [Lect 16 slide 35] 

can still calculate:
mean: mu = n*p
std dev: sigma = sqrt( n*p*(1-p) )


# rep() = replicate elements of Vectors and Lists
# sequentially placed, ie NOT randomized placement
c(rep(0,4),rep(1,2)) # get [1] 0 0 0 0 1 1   

slide 41, [~35:20]   su21:l13-code.rmd line 330
simulate data... pop size of 10,000 containers , 10% contaminated
container.id = 1:10000
benzene = c( rep(0,9000), rep(1, 1000) )     # this create sequence of 9000 zeros/failure and 1000 ones/success , ie NOT random!
pop_data = data.frame(container.id, benzene)

set.seed(1)

# below are code to create the simulation, creating a hypothetical data set and seeing the distribution stat.
# provide whole pop stat: num of contaminated w/ benzene, average (ie %contamination)
# part of the trick here depends on success = 1
# this overall is just to say that the artificially created sample fits the desired probability
# they were not in random order, it is fine, cuz sample_n() will pick randomly.
pop_stats  = pop_data %>% summarize( pop_num_success = sum(benzene), pop_mean = mean(benzene) )

# draw a sample of 10 and see which one is contaminated with beneze (ie success in this framework)
sample_dat = pop_data %>% sample_n(10)  


head(many.sample.stats) ##??

## pop is 20x larger than sample size, then generally considered selection not affecting the next outcome, cuz it is a tiny diff, won't affect conclusion.

# this seems to be what HW04 was asking for.


[42:00]
X big x is 
little x ... 
P(X=x)  number of success equal to some value...

HW04
(R quirkyness, though undertandable if understand the underneath mechanics)

    these two are slightly different
output_01 = sim_01 %>% as.data.frame() 
output_01 =            as.data.frame(sim_01) 
    as.data.frame() with "piped input" don't get a name of the source variable
    and as a result, subsequent rename will not find it.
    thus, only the latter of the next two line works:
output_01 = sim_01 %>% as.data.frame()       %>% rename(birth_defect = sim_01)   # rename fails!
output_01 =            as.data.frame(sim_01) %>% rename(birth_defect = sim_01)   # rename works

ascii diagram , ascii arts =P


template tree diagram for bayes in ascii, potential paste for lab or exam:
ref: Lect 13 wk5 reader p131
study (pop) = medicare enrolled heavy smokers

         /-- T+
       D+
     /   \-- T-
study
     \    /---- T+
       D-
          \---- T-


HIV status:

         /-- Test+     0.9985
       Ab+
     /   \-- Test-     0.0015
root  
     \    /---- Test+  0.0060
       Ab-
          \---- Test-  0.9940 


lecture example:
first layer for disease state D+/D-
second layer for test result  T+/T-

Rmd ascii 2x2 contingency table:

      |--response vars-|

2x2   | D+       | D-  | margin |
------|----------|-----|--------|
T+    |   26.7   | 67.9|  94.6  |
T-    |    3.3   |902.1| 905.4  |
margin|   30     |970  |1000    |

↑
explanatory input, 
researcher controlled vars



|            |   Diease     | Disease     | (Margin)     |
|            |   True +     |  True -     |  Sum         |
|------------|--------------|-------------|--------------|
| test +     |        TP    |    FP       | denom of PPV |
| test -     |        FN    |    TN       | denom of PPN |
| sum        | denom of Se  | denom of Sp | pop tot      |

Lecture 17: Poisson Dist

pois = poisson dist, similar to binomial, but no upper bound. typically for rare events, eg Infect Disease. [Lec 17 wk7]

dpois(x=?, lambda=?) discrete prob, when x=exact number. lambda (aka mu, ie average) is the sole param in poisson.

ppois(q=?, lambda=?) continuous prob, for x up to value specified. up to and including. X≤k **if do lower.tail=FALSE then decrement k by 1 to properly account for the discrete unit change. [Lect 17 slide 18]. eg: 1-ppois(q=3,lambda=0.1) # ans: [1] 3.846834e-06
ppois(q=3,lambda=0.1,lower.tail=FALSE)

simulation  for 5 years  see reader page 50

rpois( n=12*5, lambda=0.536 )

vs manual simulation

set.seed(200)
polydactyly=(rpois(n = 12*5, lambda = 0.536))
polydactyly

mean(polydactyly)  # 0.45,       randomness and small sample made our number differ from theoretical of 1/500*268 = 0.536  [27:45]
sd(polydacttyly)   # 0.64


when to use poisson vs binomial...
[33:08]
mean of binomial outcome 
vs
how many success of failure from larger size (poisson).

(hmm... fuzzy... real diff i hear is that binom has a cap, a finite size, vs poisson is infinite)

Lab06


# test and install a package as needed
if (!require("MASS")){
  install.packages("MASS")
}


# unload functions loaded by package MASS (as they clobber with fn of same name from dplyr):
detach("package:MASS", unload = T)		


# binomial sample data generation, individually create each sample point
# lab06 line 280
# This is just the range of values x can take
x_vals = 0:812

# This generates the probabilities
probs = dbinom(0:812, size = 812, prob = 0.27)

# This combines the range of values with the probabilities as a dataframe with 2 columns: x_vals and probs.
# View(obs_data) in your console to see what this dataframe looks like
obs_data = as.data.frame(cbind(x_vals, probs))

Lecture 24: T and paired T test


[wk 9 reader p20]
for t-distribution (dist of t-scores?)
very much like pnorm and qnorm.
pt(q = , df = , lower.tail = ) 
qt(p = , df = , lower.tail = )



[ wk 11 reader p 50 - Lect 24 (Topic 1, video 2, toward the end of the video)   ]

coding notes

A one sample t- test will take the form:
t.test(x = x_variable, alternative = greater, less or two.sided, mu = null_hypothesis_value)



A two sample t-test will take the form:
t.test(first_sample_data, second_sample_data, alternative = greater, less or two.sided)

	t.test( df$col1   ~ df$col2 , alternative="two.sided") # 2 vars, independent # w11 reader p11
	t.test( df$wegith ~ df$sex  , alternative="two.sided") # note use of ~ and 2nd var would have been the group_by filter

A paired t-test will take the form:
t_stat = t.test(first_data_points, second_data_points, alternative = greater, less or two.sided, paired=TRUE)
t_stat$statistics # contain the actual t-score
t_stat$p.value    # contain the p-value

Lecture 25: ANOVA: Comparing 3 or more means

First mostly about visualizing multi column data (eg think multiple treatment method being teest for cancer, some are combination of treatments). Background, how are 3 groups of data need to be structured for comparison



grp1 = rnorm( n=2, mean=15, sd=3 )   # # each rnorm() produce a vector as a result
grp2 = rnorm( n=2, mean=15, sd=3 )

egSame = data.frame( grp1, grp2 ) 
# construct dataframe, each vector becomes a column, which also insert the original variable name as column heading!!
# output looks like this
      grp1     grp2
1 19.07604 16.16301
2 14.69164 14.83858


egSameNarrow = egSame %>% gather( key="Grp", value="Measure", grp1:grp2 )
# gather() is interesting…  it produced a new data.frame() with a pivoted arrangement of the data, maybe formatting it into the tidyverse way.
# the main cells of the data.frame() becomes the value part, which  gather() also give a header
# the "key" is the first col, which get the value from the header of the original data.frame() cols
# output looks like this
   Grp  Measure
1 grp1 19.07604
2 grp1 14.69164
3 grp2 16.16301
4 grp2 14.83858


# ah, so geom_point would be xy scatter plot when fed with continuous variables on both axis
# but when one var is categorical, it produce the data points of a box plot, w/o the box
ggplot(egSameNarrow, aes(x = Grp, y = Measure)) + geom_point()

+ geom_boxplot()    # to get box, add geom_boxplot()   // order matter, last one is top layer


# density plot in [wk11 reader p80]
ggplot(egSameNarrow, aes(x = Measure))  + geom_density(aes(fill=Grp), alpha=0.4)


# Lecture 26: ANOVA Part 2.  (Su21 L20 Line 510)
ggplot(egSameNarrow, aes(x = Grp, y = Measure)) +
  geom_jitter(pch = 4, width = 0.1)                            # like geom_point, but wiggle points randomly within "width" so points can be seen when they overlap

ggplot(cancer_data, aes(x = trt_order, y = tumor_volume)) +
  geom_jitter(pch = 4, width = 0.1)                            # like geom_point, but wiggle points randomly within "width" so points can be seen when they overlap

# pch ???

ANOVA test:


library(broom)   # wk12 reader p12
cancer_anova = aov(formula = tumor_volume ~ treatment, data = cancer_data)
tidy(cancer_anova)

# for aov(), the var after ~ is column containing the grouping.

		cancer_data data.frame looks like this
		     treatment          tumor_volume
		1  Irradiation           30
		2  Irradiation           46
		...
		5 Cannabinoids           12
		^^
		row number added by R

pf() would find the p-value, but aov() already provides it, so maybe seldom need to call this explicitly?

after finding stat sig anova, need to run multiple pairwise t-test, or use TukeyHSD() wrapper, wk12 p29

Bonferroni Correction wk12 p22


diffs = TukeyHSD( cancer_anova, conf.level=0.05, )  %>% tidy()
diffs %>% select(contrast, adj.p.value)		# select() pick columns from list (with named col),  vs pull(), for data.frame ? 

TukeyHSD() provides an adj p.value, correcting for multiple tests.  this adj prevent it from use in CI.  [wk12 reader 31]

Bonferroni vs TukeyHSD produce slightly different result.  [wk12 p33]

Lecture 27: non parametrics

Lab10

# maybe move...

library(MASS)           # load it to import the data set
detach(package:MASS)	# remove the select() fn which would conflict with dplyr

Missing Data

na.rm=T   # many function take this as option to drop entries with missing data, ie, where data is "na"
nhanes = read_csv("nhanes.csv")
nhanes = na.omit( nhanes )       # remove row that has missing values.

Constructing 2x2 or RxC contingency table using tibble, chi-squared test


              Exposure+        Exposure-       ←- explanatory var as input

|Group        | Sandwich     | No Sandwich   | Row total |
|-------------|--------------|---------------|-----------|
|Ill          | 109          |  4            | 113       |
|Not Ill      | 116          | 34            | 150       |
|Column total | 225          | 38            | 263       |

↑
response
output
D+ vs D-
(this is a more unusual layout of 2x2... )

# the library it tibble, but the contruction fn is "tribble"
library(tibble)
two_way = tribble(~ sandwich, ~ nosandwich,
                     109,        4,   #row for Illness
                     116,        34)  #row for no Illness

chisq.test(two_way, correct = F) # correct = F means not using Yates' correction for continuity, so that it can match calc "by hand".

ascii/unicode symbols

for easy cut-n-paste


↑ ↓ ← →   arrows up down left right
« » 
⟨ ⟩ ‹ ›   angle brackets not parsed by HTML code, but may create confusion with cut-n-paste .  
          useful to replace R› prompt

• bullet
· middle dot 
◆ diamond
◇
♢
▶
better bullet for table▶yes?
▶  ▶


√¯¯¯  sqrt
≈  approx
≤  <=   combined less than equal to.  
≥  >=   useful when dont want to be parsed by HTML 
≠  !=   not equal, R Knit LaTeX barf at this symbol :-\
±
¬    not
∪ ∩  union intersect
×    multiply
÷    divide
Ø    empty set

λ - lambda - eigenvalue
Λ - Lamda - eigenvector

ε - epsilon
Ε - Epsilon

χ² - chi-squared
σ² - sigma squared = variance
μ  - mu mean average

αβ∑Δ

ΑΒΚΔΕΦΓΗΞΚΛΜΝΟΠΚΡΣΤΎΒΩΧΥΖ
αβκδεφγηξκλμνοπκρστύβωχυζ

Ñ  alt 0209  spanish n
ñ  alt 0241  mac: option + n, n

Symbols

° zeroth :P degree ascii 176
¹ first
² square
³ cube
ª primera
º primero
‰ per mill

§ section
¤ currency
€
¢ cent
Ð Latin Eth
Ñ ñ
¡ ¿ 
¶ pi new paragraph

https://www.ascii-code.com/?msclkid=fada967ec20611ec82769a8bbf923192 has an ascii table that is text based, can copy-n-paste, quite a bit of ads at top, but at least no pop ups

[Doc URL: http://tin6150.github.io/psg/R.html]
[Doc URL: http://tin6150.gitlab.io/psg/R.html]
Last Updated: 2022-07-07
(cc) Tin Ho. See main page for copyright info.

psg101 sn50 tin6150
hoti1
nSarCoV2
bofh1

R

101

R

Jupyter Notebook with R kernel Container

Jupyter Notebook IR Kernel Setup

Environment

Setting dir for user package install

Installing packages

CSV example

Misc

Oracle connection

GC: Garbage Collection

R Programming

101.2

R stuff - NA : Not available data

R stuff - stat stuff

stat functions

assignment

Data Structures

vector, matrix

array

dataframe

List

Strange R stuff

Formula and tilde (~) operator

formula has delayed evaluation

Common formula

Tilde operator

Pipe

apply family of function

Libraries

lifkit

ModeSim

Python using R

R using python

Parallelization in R

Parallelization in R using mc* functions

Parallelization in R using cluster functions

doParallel

Rdsm

Clustering on Clusters

Snow

nestedParallel

Parallel R Baggage

EpiStat

Rnd tidbits

PH142 - Stats for Public Health

course core concepts - ccc

Exam 1: Review

Constructing 2x2 or RxC contingency table

Course: Review

Exam 1: Review

dplyr

ggplot

Exam 1: Review

Lecture 1: Intro to R

Lecture 2: Working with data in R

Lecture 3: Visualizing/plotting data

Lecture 4/HW02

Lecture 6

Lecture 7

Lecture 8

Lab/HW03

Lecture 11-13 - Probability (week 5, post Midterm 1)

Lecture 14 - Normal Distribution (week 6)

Finding Normal *Probabilities* - pnorm

Finding Normal *Percentiles* - qnorm

Standard Normal and z-score

Simulating Normally distributed data in R

Lecture 15

Lecture 17: Poisson Dist

Lab06

Lecture 24: T and paired T test

Lecture 25: ANOVA: Comparing 3 or more means

Lecture 27: non parametrics

Lab10

Missing Data

Constructing 2x2 or RxC contingency table using tibble, chi-squared test

ascii/unicode symbols

Finding Normal Probabilities - pnorm

Finding Normal Percentiles - qnorm