## NCEAS plant traits DGS - spring 2008
## R tutorial day 3
## Nathan Kraft & David Ackerly, UC Berkeley
## nkraft@berkeley.edu, dackerly@berkeley.edu

## This tutorial is intended to give you an introduction to: 1) if/else statements, 2) function writing, 3) the use of monte carlo randomization and null models for significance testing, and 4) methods to test for non-random assembly of ecological communities, based on species traits. We'll assume that you have worked through the first two tutorials before starting this one. As before, the commenting focuses on the mechanics of the R language - please refer to your favorite stats reference for more details.

#####################################################
##  "If/else" statements
#####################################################

##"if" statements are similar in structure to for loops, and are useful for doing something if a certain condition is met

##lets set up a boolean (true/false) variable:
friendly <- TRUE

##if statements just do whatever is in the {} if the statement in () following "if" is TRUE:
if(friendly){
	print("hello")
	}

##input the entire statement - you should get a greeting

##lets change "friendly" to be false:
friendly <- FALSE
if(friendly){
	print("hello")
	}

##now you should get no greeting at all

##you can follow up the "if" with an "else" statement that gives an alternative if the first test is false:
if(friendly){
	print("hello")
	}else{
		print("grrrrr...")
		}

##switch "friendly" and run it again:
friendly <- !friendly        ##remember the "!" means "not"
if(friendly){
	print("hello")
	}else{
		print("grrrrr...")
		}

##The statement in the parentheses can also be a comparison that generates a true or false result. In this case, remember that the double equals is used to make a comparison, single equals to assign a value:

x <- 1
if (x==1) print('x = 1')

##notice that you can omit the {} if you only have one command to perform after the if(test...)
##there is a lot more that can be done here, but this is a good start for now!!

#####################################################
##  Writing functions
#####################################################

## Dan gave us a quick peek at function writing in week one with the meanstderr() function that he wrote.
##Today we want to go more in depth and have you all practice writing some of your own later on.

##Functions are very useful for saving time in scripting.  If you have a complex operation that you do all the time, writing a function can save you a lot of time typing the same list of things over and over again.

##lets start with a *very* basic one:
greet=function(){
	print("hello there!")
	}

##this is the format of function writing: a function name of your choosing - "greet" in this case - (action verbs are recommended to keep them distinct from variables), followed by "=function" plus a () containing any arguments that the function requires.  Here the function doesn't require an argument.  The code inside the {} is performed whenever the function is called.

##You need to copy and paste the entire text of the function into the console before you can use the function.  There are other ways to do this that we'll cover in a little while. Once this is input, type:
greet()
## and you should see a message in the console.

##Here is an example with an argument:
square=function(x){
	x*x
	}

##as with function names, you can call arguments anything you want.
##When you call functions, you can refer to the arguments explicitly:
square(x=2) ## should tell you "4"

## this should also work without a hitch:
square(2)

##once a function is input, you can always see the code for it by typing it in without the ():
square

##you can write functions with multiple arguments:
divide=function(x,y){
	x/y
	}

##try it out:
divide(x=8, y=2)
##same as:
divide(8,2)
##not the same as:
divide(2,8)

## But if you explicitly name the arguments, the order of the arguments does not matter:
divide(y=2, x=8)

##You need to specify something for all arguments in this kind of function.  For example, this should fail:
divide(4)

##You can however specify defaults for certain arguments when you write a function. Here is an example:
increase=function(x, factor=2){
	x*factor
	}

##input the function then run these:
increase(x=3, factor=2)
##is the same as:
increase(x=3)
##because we specified a default for 'factor'
##same as:
increase(3)

##but you can also modify the argument "factor" when you call the function even though it has a default:
increase(3, factor=10)
##same as:
increase(3,10)

##the "return" command is very useful in functions for "outputting" what you want from the function.  Here we've modified the "increase" function to assign the output of the operation to a variable called "output":
increase_V2=function(x, factor=2){
	output <- (x*factor)
	}

##now run:
increase_V2(50)
##whoops!  No result!

##try calling the "output" object that we created:
output  ##(this should fail if you've run the script as is)

##Just as with "for" loops, anything created in the function (such as the 'output' variable) vanishes when it is done running.
## here is a third version that has added a line of code telling the function to "return" the output variable - that is, send the value of it out to the world once it is done:
increase_V3=function(x, factor=2){
	output <- (x*factor)
	return(output)
	}

##now run:
increase_V3(50)
##now we have the value of the "output" object to play with again

##notice that we still do not have a variable named output:
output  ##should fail again

## Assign the returned value to a variable for further use:
result <- increase_V3(50)
result

##Here's one final example.  It is very common to use booleans (true/false) as arguments with defaults in a function.  Lets say you want a function that returns the range of values in a vector, and sometimes you want the absolute minimum and maximum numbers, but sometimes you want to know how far apart the extremes are.  You could put both options into one function:
get_range=function(vector, minMax=FALSE){
	vector_min <- min(vector)
	vector_max <- max(vector)

	if(minMax){
		temp <- data.frame(vector_min, vector_max)
		return(temp)
		}else{
			range <- vector_max-vector_min
			return(range)
			}
	}

##you actually *don't* need the "else" here - but lets keep it in for clarity.

##once it's input, try it out:
##here is a practice vector:
v <- seq(5,20)
v
get_range(vector=v, minMax=FALSE)
##same as:
get_range(v)
##different from:
get_range(v, minMax=TRUE)

##notice that if minMax is FALSE the function returns a number, but if it is TRUE the function returns a dataframe.  This might cause errors in other things you might be doing with the output of this function, so you might want to avoid things like this when writing functions for your own use.

##as a final note, you can place functions in a .r text file and then import them into your workspace with the source() command, such as: source("my_favorite_functions.r").  This can be faster than copying and pasting the functions you always use into the start of a script.

#####################################################
##  data import
#####################################################

##set your working directory to the place where Will's datafiles are stored using setwd() or the GUI menus.
##read in the data:raw <- read.csv("cornwellPlots.csv") ##raw plot datahead(raw)##lets give the data less cryptic names:names(raw) <- c("plot", "species", "abund", "sla", "leaf_area", "seedsize", "wood_density")head(raw)##read in a file of species trait means (created in last week's session):read.csv(file="CornwellTraitMeans.csv")[,2:8]->traitmeanshead(traitmeans)#########################################################  Basic community trait stats ##############################################################Here is code to look at a few common community trait metrics.##Let's start with data from one plot at Jasper Ridge:sample_plot <- subset(raw, raw$plot==10) ##pick plot 10sample_plot ##looks like a good place to start			## notice that poison oak (line 111) has no leaf trait data measured in that plot. I wonder why?? : )			## lets replace those NA's with species means. This is a clunky way to do this- in a bigger dataset you'd want to write a script to do this automatically.  See the end of the script for more on this. poison_oak<-subset(traitmeans, traitmeans$species=="Toxicodendron diversilobum") poison_oak  ##here are the trait means for the species in question##replace the plot NA for sla with the species mean slasample_plot[sample_plot$species=="Toxicodendron diversilobum", "sla"] <- poison_oak$sla##same for leaf area:sample_plot[sample_plot$species=="Toxicodendron diversilobum", "leaf_area"] <- poison_oak$leaf_area##lets check:sample_plot##looks good- no more NA's#######################################on to community trait measures:## means ########## simple meanmean(sample_plot$sla)##abundance weighted mean:weighted.mean(sample_plot$sla, w=sample_plot$abund)##variance:var(sample_plot$sla)##standard deviation:sd(sample_plot$sla)##range##after all that monkeying around with functions there actually is  a function called "range()" already to give us min and max:range(sample_plot$sla)

##and we can make a simple function to give us the difference (the diff command would work as well):
range_difference=function(vector){
	return(range(vector, na.rm=TRUE)[2] - range(vector, na.rm=TRUE)[1])
	}

range_difference(sample_plot$sla)

#############################
##nearest neighbor distance

##first we need to order the vector of SLA's to figure out who the nearest neighbors are in trait space. Sort is good for this:
ordered_sla <- sort(sample_plot$sla)
ordered_sla

##now we need the distance between each neighbor. diff() will do this.
length(ordered_sla)
##there are ten species so we should get 9 nearest neighbor distances:
sla_NN_dist <- diff(ordered_sla)
length(sla_NN_dist) ##good - there are 9 differences!
sla_NN_dist ## and there they are!

mean_NN_dist <- mean(sla_NN_dist)
mean_NN_dist	##the average NN distance
sd_NN_dist <- sd(sla_NN_dist)
sd_NN_dist 		##std deviation of nearest neighbor distance

##handy functions for the last two:
get_ave_NN_dist=function(vector){
	return(mean(diff(sort(vector))))
	}

get_sd_NN_dist=function(vector){
	return(sd(diff(sort(vector))))
	}

##another advantage to writing functions for more complex operations is that you can apply them with a function like aggregate() from the last two tutorials:
##this will calculate the range of each trait in each plot:
plot_ranges <- aggregate(raw[,4:7], by=list(raw$plot), FUN=range_difference)	

## The aggregate command is very powerful, though not necessarily intuitive. You'll probably want to read its help page (?aggregate) to understand better why that command does what it does.
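##to see aggregate() in isolation before reading the help page, here is a tiny standalone sketch - the data frame "toy" and its columns are made up for illustration only:

```r
## aggregate() splits the columns of its first argument into groups
## defined by the "by" list, then applies FUN within each group:
toy <- data.frame(grp = c("a", "a", "b", "b"), val = c(1, 3, 10, 20))

agg <- aggregate(toy["val"], by = list(group = toy$grp), FUN = max)
agg  ## one row per group: the max of "val" within "a" (3) and within "b" (20)
```

##this is the same pattern as the plot_ranges call above: raw[,4:7] plays the role of toy["val"], list(raw$plot) the role of the grouping list, and range_difference the role of max.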
names(plot_ranges)[1] <- "plot" ##fix col 1 header
plot_ranges ## looks like the range of values in each community varies widely.
#e.g.:
min(plot_ranges$seedsize)
max(plot_ranges$seedsize)
##and:
summary(plot_ranges)

##we can do the same with the standard deviation of nearest neighbor distances:
plot_NN_SD <- aggregate(raw[,4:7], by=list(raw$plot), FUN=get_sd_NN_dist)
plot_NN_SD ##again - looks like there is a lot of variation in the data

## so now what if we want to test hypotheses with these metrics??

#####################################################
##  Intro to Monte Carlo Methods
#####################################################

## Before we test these statistics, let's take a side trip and learn how to construct monte carlo simulations for significance testing. 

##The core of any monte carlo simulation in R is the sample() command. Lets try it on a test dataset:

species <- c("a","b","c","d","e","f","g","h","i","j")
abundance <- seq(100,10,-10)

##fake species and abundance data:
species
abundance

## note that "a" is the most abundant, "b" slightly less, etc. on down the alphabet...

##sample() draws a specified number of elements from a vector at random:

sample(species, 5)

##you can sample the vector weighting the chance of choosing each species by its abundance using the "prob" argument:

sample(species, 5, prob=abundance)

##notice that you should have gotten more species with names early in the alphabet, as they are more abundant in our fake data. 

##if you sample an entire vector, it is the same as shuffling the order of the elements. Here we are sampling without replacement.

sample(species, length(species)) ## We'll use this later.

##you can sample with replacement if you need to:

sample(species, length(species), prob=abundance, replace=TRUE)

##you should see a lot of a, b and c, as the draws are abundance weighted.

## Now that we have "sample" down, lets build a monte carlo simulation for a familiar statistic, so we can compare the results to the p-values obtained from statistical tables. For this example, we'll use the t-test:

## Start by defining two samples. Maybe these represent seed size of two populations of a plant, and we want to test whether the mean seed size of the two populations is different:

ss1 <- c(3.13,3.22,3.35,4.11,2.97,2.07,2.98,2.43,4.21,3.13)
ss2 <- c(3.29,4.01,5.62,4.06,5.65,3.94,3.94,2.66,5.29,3.50)

## now let's put the data into a dataframe and label the two populations
pop = as.factor(c(rep(1,10),rep(2,10)))
pop

ss = data.frame(pop,c(ss1,ss2))
names(ss) = c('pop','seed_size')
ss

## start by comparing means and calculating difference between the means:
means = aggregate(ss$seed_size,by=list(ss$pop),mean)
means

diff.mean = diff(means$x)
diff.mean ##the difference between the two means

## box-whisker plots to visualize two populations
plot(ss$pop,ss$seed_size)

## with only two variables in this dataframe, we can also simply plot ss and get the same boxplot:
plot(ss)

## They look pretty different! Let's try a t-test, as we explored last week: 

t.test(seed_size~pop, data=ss, alternative='two.sided')

## Yes, they are significantly different, at a p-value = 0.01542
## What does this mean? Under the null hypothesis that these two populations are drawn from the same underlying distribution, with equal standard deviations, the probability of drawing two samples of 10 each with means that differ by as much or more than we observe here is less than 2%. 

## Let's generate the corresponding p-value by simulation. Rather than making any assumption about whether the values are drawn from a normal distribution, we will simply take the 20 values of seed size that we measured. In each randomization, we will randomly assign the 20 values into two groups, and test the difference in means between the groups. Since it's a two tailed test, we take the absolute value of the mean difference, as we don't have a prior hypothesis about which group should be bigger. Then we count the number of times that the result is as large or larger than our observed difference, which gives the probability of achieving the observed result by chance. Be patient - a loop of 10000 reps takes a few seconds. Note that we assign the observed result to the first position in our vector of results. This way, if none of the randomizations ends up with a more extreme value, the p-value is 1/10000 = 0.0001, the smallest value this test can report, rather than zero.

##initialize a variable to hold the results:
results = c()  ##can do this with "NULL" or "c()" 

##put our observed difference in the first position in the results vector:
results[1] = diff.mean

##how many times to run the null model??:
nreps = 10000

for (i in 2:nreps) ##notice that the loop starts on 2 because we already have a value for the first element of the results vector.
{
	##here we're using "sample" (see above) to draw 20 values
	##from the seed size vector without replacement.
	##Because the vector is only 20 observations long,
	##this operation is the same as shuffling it:
	
	rand.samp = sample(ss$seed_size,20)
	
	##Here we are using the original list of population 
	##factors (ss$pop) to group the randomly shuffled
	##data.  This means that the first 10 randomly drawn 
	##seed sizes are grouped into pop 1 and the second 10 
	##in pop two.  Then we use aggregate to take the means of
	##the newly shuffled population.  Make sure you
	##understand what is going on here - this is the key step!
	
	r.means = aggregate(rand.samp, by=list(ss$pop), mean)
	
	##store the absolute value of the difference between the
	##shuffled population means:
	results[i] = abs(diff(r.means$x))
}

## Look at the distribution of results under the null
hist(results)

##remember our observed difference was 1.036- which is in the right tail of this distribution.

##to test significance, start by calculating the number of our shuffled results that are greater than or equal to our observed difference between means.  Remember that the observed difference is stored in position one of the results vector:
num_greater_than = length(which(results>=results[1]))

##Now divide that number by the total number of reps to get a p-value:

pval <- num_greater_than/nreps

pval

## We got 0.015- not bad compared to our t-test result of 0.0154. You may get a slightly different value, since there is a random component to this test.
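##if you want your randomizations to be repeatable from run to run, you can fix R's random number seed with set.seed() before sampling - the seed value used here (42) is arbitrary:

```r
## fixing the seed makes sample() reproducible:
set.seed(42)
first_draw <- sample(1:10, 3)

## resetting to the same seed replays the same random sequence:
set.seed(42)
second_draw <- sample(1:10, 3)

first_draw
second_draw  ## identical to first_draw, in the same order
```

##this is handy when you want a monte carlo result (like the p-value above) to come out exactly the same every time you run a script.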

## Sometimes you may want to tabulate the p-value as you go, so you don't have to save the entire vector of results. We can do this by keeping track of the number of times that the random result is greater than or equal to our observed difference.

##we'll use a counter variable for that purpose:
greater_than_counter = 1 ## for the observed value

for (i in 2:nreps) 
{
	##first two lines are the same:
	rand.samp = sample(ss$seed_size,20)
	r.means = aggregate(rand.samp,by=list(ss$pop),mean)
	
	##if the random difference is greater or equal,
	##increase the counter by one:
	if (abs(diff(r.means$x))>=diff.mean){
		 greater_than_counter <- greater_than_counter + 1
		 }
}

##notice how adding the "if" statement in the "for" loop can slow things down a little more. 

pval = greater_than_counter/nreps
pval ##should be close to what you got before

## Finally, we can use the 'replicate' command to avoid the 'for' loop. The replicate command is very powerful, as you can pass any function to it. If the function returns a single value, then replicating it N times returns a vector of N values; if the function returns a vector, then replicate returns a matrix. For example, the following gives a matrix of 10 columns where each column is a vector of 5 numbers drawn from a normal distribution:

replicate(10,rnorm(5))

## By calculating the mean of the random numbers, you can get a vector of the expected means from samples of 5 from a normal distribution:
replicate(10,mean(rnorm(5)))

## Back to our t-test example. By building up a series of nested commands, we can conduct this entire monte carlo test in a single line! The final result is not for the faint of heart. 

## random sampling
sample(ss$seed_size,20) 

## nest this in the mean calculation
aggregate(sample(ss$seed_size,20),by=list(ss$pop),mean) 

## nest this in the command to calculate the difference in means:
abs(diff(aggregate(sample(ss$seed_size,20),by=list(ss$pop),mean)$x))

## nest this in the replicate command, and put the observed value in front to get a vector of length nreps:
results <- c(diff.mean,replicate(nreps-1,abs(diff(aggregate(sample(ss$seed_size,20),by=list(ss$pop),mean)$x))))

pval <- length(which(results>=results[1]))/nreps
pval ## Voila!

#####################################################
##  Monte Carlo methods and Jasper Ridge
#####################################################

##Lets try to apply these monte carlo methods to the new community metrics we've been playing with at Jasper Ridge


#############################
##remember that we have plot 10 pulled out as a sampled plot:

sample_plot

##and we have a file of trait means that we can use as our species pool for the entire site:

traitmeans

##lets say we want to know if the SLA range that we observed in plot 10:

obs_range <- range_difference(sample_plot$sla)

obs_range

##is smaller than what a null model would predict.  Remember from David's lecture that this would be a pattern consistent with habitat filtering.

##For now, lets build a null model that keeps richness constant at 10 species (the number of species in this sample plot), and draws species with equal probability from the total pool of species. We'll be using a one tailed test because we are interested in a range reduction in traits associated with habitat filtering.

nreps <- 10000 

range_results <- NULL ##initialize a vector for results
range_results[1] <- obs_range ## put the observed value in slot 1

for(i in 2:nreps){
	##make a null community with 10 species drawn from the pool
	null_community <- sample(traitmeans$sla, 10)
	
	##calculate the range of the null community
	range_results[i] <- range_difference(null_community)

	}

hist(range_results)

##notice that the shape of the null distribution of trait ranges is non-normal and multimodal.  This is common with these kinds of monte carlo trait tests, and is a big difference from parametric tests, which have well behaved distributions of test statistics (like the t or F distribution).

obs_range

##looks like our observed range in plot 10 is one of the smallest in the distribution.

##to get a p-value, see how many of the null communities had a smaller range than the one we observed and divide that count by the number of null communities we created:

range_p_val <- length(which(range_results<=range_results[1]))/nreps

range_p_val  ## I got p = 0.0039 - yours may vary slightly, as there is a random component to the test. It looks like the range of SLA is significantly reduced in plot 10!!

##############################
##In the null model we just used, species were drawn with equal probability.  This gives added importance to rare species and reduced weight to common species.  You could argue that an abundance weighted null model - that is, a null model that draws species from the pool in proportion to their abundance - is more appropriate.  Lets see if our result is robust to this consideration:

range_results_abwt <- NULL ##initialize a vector for results
range_results_abwt[1] <- obs_range ## put the observed value in slot 1

for(i in 2:nreps){
	null_community <- sample(traitmeans$sla, 10, prob=traitmeans$total_abundance)
	
	##note that we're weighting the sampling by abundance

	range_results_abwt[i] <- range_difference(null_community)

	}

hist(range_results_abwt)

range_p_val_abwt <- length(which(range_results_abwt<=range_results_abwt[1]))/nreps
range_p_val_abwt

##I got p = 0.0209 - still significant, but not quite as strong a result. Which null model do you think is more appropriate??

#############################
##Here's a more complete example to test every plot in the dataset for even spacing in seed size:

nreps <- 1000 ##this is scaled down in the interest of time here.

plot_list <- unique(raw$plot)
seedsize_sdNN_pval <- NULL

for(i in 1:length(plot_list)){
	current_plot <- subset(raw, raw$plot==plot_list[i])
	##pull out one plot at a time
	
	richness <- dim(current_plot)[1] ##number of species (rows, not columns!) - needed for the null model
	
	obs_sdNN<-get_sd_NN_dist(log(current_plot$seedsize))
	##get the observed seed size sdNN- note the log transform first
	
	seedsize_sdNN_results <- NULL
	seedsize_sdNN_results[1] <- obs_sdNN
	
	##the null model part for each plot:
	for(j in 2:nreps){
		null_community<-sample(traitmeans$seedsize, richness, prob=traitmeans$total_abundance)
		seedsize_sdNN_results[j]<-get_sd_NN_dist(log(null_community))
		}
		
	##record the p value for the current plot:
	seedsize_sdNN_pval[i] <- length(which(seedsize_sdNN_results<=seedsize_sdNN_results[1]))/nreps

	
	}
result <- data.frame(plot_list, seedsize_sdNN_pval)
names(result) <- c("plot", "pval")
result  ##table of plot names and p-values

sum(result$pval<=.05)  ##I get that 13 of the 44 plots have significantly even spacing - your results may vary!
#####################################################
#####  Suggested exercises  #####
#####################################################

##1. Database Management: write a script (or better yet, a function!) to replace all of the NA's in the "raw" species-by-plot file with species means from the "traitmeans" file.  We did this manually for SLA in the sample plot (plot 10), but in bigger datasets you'd probably want to automate this process. The "is.na()" test function from last week will be helpful for this.
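##one possible sketch for exercise 1 - this is not the only (or necessarily the best) approach, and it assumes the trait columns share names between "raw" and "traitmeans" and that every species in "raw" appears in traitmeans$species:

```r
## loop over the requested trait columns, find rows with NA's,
## and fill each from the matching species row in traitmeans:
fill_NAs <- function(raw, traitmeans,
                     traits = c("sla", "leaf_area", "seedsize", "wood_density")){
	for(trait in traits){
		missing <- which(is.na(raw[[trait]]))  ##rows needing a species mean
		for(i in missing){
			sp <- as.character(raw$species[i])  ##as.character avoids factor-level mismatches
			raw[i, trait] <- traitmeans[traitmeans$species == sp, trait]
		}
	}
	return(raw)
}
```

##try writing your own version first, then compare - and test it on simple "dummy" data where you know what the answer should be.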

##2. Function Writing:  write a function or two to perform some of the additional tests for even spacing described in the Stubbs and Wilson paper David mentioned in lecture.   The univariate ones are relatively easy; the multivariate ones will be a little more tricky if you are looking for a bigger challenge. Try it out on very simple "dummy" data to make sure it is doing what you want it to be doing.

##3. Application.  Test one plot (or all plots) in Will's dataset (or your own!) with your new metric.  Think carefully about the null model that you want to use!! Is it presence/absence or abundance weighted?  Does it include all species in your dataset or only a subset?  Do you transform the data first or not?  Is your test one tailed or two tailed??

## A good paper to start you thinking about null models in a community assembly context is:

##Colwell, R. K., and D. W. Winkler. 1984. A null model for null models in biogeography. Pages 344-359 in D. R. Strong, D. S. Simberloff, L. G. Abele, and A. B. Thistle, eds. Ecological communities: conceptual issues and the evidence. Princeton University Press, Princeton, NJ.

