No Longer Reading: An inequality involving beta

Continued from the previous post and following William James Tychonievich's post "Calculating beta diversity"

In the previous post, a formula was described that related alpha, gamma, the number of forests, and delta if we assumed all forests are the same size. The formula is:

$$\gamma = \frac{\alpha}{F} + \frac{F-1}{F}\,\delta$$

Notice that α is divided by F, so as the number of forests increases, the effect of α on γ diminishes. On the other hand, δ is multiplied by (F-1) and divided by F, a factor that approaches 1. So, as the number of forests increases, the effect of δ on γ becomes much larger than that of α. This makes sense because δ is the probability of picking two different trees given that the two trees come from different forests. As the number of forests becomes larger, there will be many more ways to pick different trees from different forests than there are ways to pick different trees from the same forest.

γ is the probability of selecting two different trees from the population as a whole, and there are two ways to do this: select two different trees from the same forest or two different trees from two different forests. The equation consists of one term for the average probability of the first and one term for the average probability of the second.
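As a rough check of the two-term formula, here is a minimal Python sketch. It assumes equal-sized forests (at least two of them), that trees are drawn with replacement so that probabilities depend only on species proportions, and it uses made-up example forests; the function names are hypothetical and not from the original posts.

```python
from itertools import combinations

def prob_different(p, q):
    """Probability that one tree drawn from p and one drawn from q differ in species."""
    species = set(p) | set(q)
    return 1.0 - sum(p.get(s, 0.0) * q.get(s, 0.0) for s in species)

def gamma_from_alpha_delta(forests):
    """Combine the mean alpha and mean delta using the two-term formula."""
    F = len(forests)
    alpha = sum(prob_different(f, f) for f in forests) / F
    pairs = list(combinations(forests, 2))
    delta = sum(prob_different(p, q) for p, q in pairs) / len(pairs)
    return alpha / F + (F - 1) / F * delta

def gamma_direct(forests):
    """Gamma computed from the pooled population (equal-sized forests assumed)."""
    F = len(forests)
    species = set().union(*forests)
    pooled = {s: sum(f.get(s, 0.0) for f in forests) / F for s in species}
    return prob_different(pooled, pooled)

# Hypothetical example data: species proportions in three equal-sized forests.
forests = [
    {"oak": 0.5, "pine": 0.5},
    {"oak": 0.25, "maple": 0.75},
    {"birch": 1.0},
]
print(gamma_from_alpha_delta(forests), gamma_direct(forests))
```

Under these assumptions the two printed numbers should agree up to floating-point rounding.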

If we wanted to know the probability of selecting three different trees from a population as a whole, we would probably have an equation with three terms because there are three ways to select three different trees: they may all be from the same forest, all three may be from different forests, or two may be from the same forest and one from a different forest. We would then have terms representing the average probability of all three of these events.

In his previous post, William James Tychonievich wondered if there was a formula relating alpha, gamma, the number of trees, and beta, where β is determined using his method of slice-matching. I have not been able to find an equation, but there is an inequality involving β because β and δ are related. To see this, let's first look at some easy cases. If every forest is the same, then β = 0 and α = δ because the probability of picking two different trees from different forests is the same as the probability of picking two different trees from the same forest. We can then substitute α for δ in the above formula and determine that in this case γ = α. Or, if every forest is completely different, then β = 1 and the probability of picking two different trees from two different forests is 1, so we can substitute δ = 1 into the formula and it simplifies to:

$$\gamma = \frac{\alpha}{F} + \frac{F-1}{F}$$

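We see that β affects δ. Consider the following example:

[Figure: two forests. The first is half red trees and half blue trees; the second is half blue trees and half yellow trees.]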

In this situation, we have two forests where exactly half of the tree species overlap, so β = 0.5. We can also calculate δ. There are four ways to pick one tree from the first forest and one from the second. We can pick a red tree from the first and a blue from the second, a red from the first and a yellow from the second, a blue from the first and a blue from the second, and a blue from the first and a yellow from the second. Of these, three are ways to pick two different trees and one is a way to select two of the same type of trees. The probability of any of these four cases is (0.5)(0.5) = 0.25. So, we can determine δ for this pair either by adding up the three instances that give two different trees or subtracting from 1 the instance that gives two of the same trees. Either way, δ for this pair is 0.75.
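The arithmetic for this pair can be reproduced with a short Python sketch. It assumes that slice-matching β for a pair equals one minus the summed species-by-species minimum of the two proportions (my reading of "the fraction which remains once all overlap is removed"); the function names are hypothetical.

```python
def beta_pair(p, q):
    """Share of each forest left over after removing the largest matching slice."""
    species = set(p) | set(q)
    overlap = sum(min(p.get(s, 0.0), q.get(s, 0.0)) for s in species)
    return 1.0 - overlap

def delta_pair(p, q):
    """Probability that a tree from p and a tree from q are different species."""
    species = set(p) | set(q)
    same = sum(p.get(s, 0.0) * q.get(s, 0.0) for s in species)
    return 1.0 - same

# The two forests from the example, as species proportions.
first = {"red": 0.5, "blue": 0.5}
second = {"blue": 0.5, "yellow": 0.5}

print(beta_pair(first, second))   # 0.5
print(delta_pair(first, second))  # 0.75
```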

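And this relationship holds for any pair of forests. Suppose we have a pair of forests, forest j and forest k. β for this pair of forests is the fraction of each forest which remains once we have removed all overlap.

[Figure: two circles representing forest j (left) and forest k (right). The region common to both is labeled X; the remainder of forest j is Y and the remainder of forest k is Z.]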

In other words, given any pair of forests, we first find the largest possible region which is identical in both forests (denoted here by X). Then, whatever remains (denoted here by Y and Z) takes up a proportion β of each forest. (If both forests are the same, then X is the entire forest, while if the two forests are completely different, Y and Z are the entire forests, respectively.) We know that Y and Z are completely dissimilar in terms of their tree species because if there were any overlap between Y and Z, we could remove it and add it to X.

There are four ways to select one tree from forest j and one from forest k. We could select one tree from Y and one from Z, a tree from Y and a tree from X, a tree from Z and a tree from X, or a tree from X in one forest and a tree from X in the other forest.

If we select a tree from Y and a tree from Z, we know that both trees will be different species because Y and Z do not overlap at all. Since Y and Z each make up a proportion beta of their forest for this pair, the probability of this case is beta squared, so delta for this pair is at least as large as beta squared for this pair. We can express this symbolically as:

$$\delta_{jk} \ge \beta_{jk}^2$$

In other words, the probability of selecting two different trees from both forests is at least as big as the probability of selecting one tree from Y and one from Z. The subscripts emphasize that this is delta and beta specifically for the pair forest j and forest k.
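The pairwise inequality can be spot-checked on random forests. This is only a sketch: it assumes β for a pair is one minus the summed species-by-species minimum of the proportions, and the species list, helper names, and random generator seed are all made up for illustration.

```python
import random

def beta_pair(p, q):
    """Share of each forest left over after removing the largest matching slice."""
    species = set(p) | set(q)
    return 1.0 - sum(min(p.get(s, 0.0), q.get(s, 0.0)) for s in species)

def delta_pair(p, q):
    """Probability that a tree from p and a tree from q are different species."""
    species = set(p) | set(q)
    return 1.0 - sum(p.get(s, 0.0) * q.get(s, 0.0) for s in species)

def random_forest(all_species, rng):
    """Random proportions over a random non-empty subset of the species."""
    present = rng.sample(all_species, rng.randint(1, len(all_species)))
    weights = [rng.random() + 1e-9 for _ in present]
    total = sum(weights)
    return {s: w / total for s, w in zip(present, weights)}

rng = random.Random(0)
all_species = ["oak", "pine", "maple", "birch", "elm"]

violations = 0
for _ in range(10_000):
    p = random_forest(all_species, rng)
    q = random_forest(all_species, rng)
    # Small tolerance so floating-point rounding is not counted as a violation.
    if delta_pair(p, q) < beta_pair(p, q) ** 2 - 1e-12:
        violations += 1

print(violations)
```

A count of zero here is consistent with the inequality but is of course not a proof; the argument from Y and Z above is what establishes it.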

In addition to selecting a tree from Y and one from Z, any of the other three ways to select trees could also pick two different trees, so we want to calculate the probability of that. Consider the left-hand circle, representing forest j. Alpha for forest j is the probability of selecting two different trees from forest j. There are three ways this could happen. Both trees could come from Y, both from X, or one from Y and one from X. We can express this in the following equation:

In this equation, alpha j is the probability of selecting two different trees from forest j as a whole, alpha x is the probability of selecting two different trees from X, alpha y the same for Y, and alpha x y the probability of selecting two different trees when one comes from X and one from Y. We need to multiply alpha x y by 2 because we could select a tree from X first and then a tree from Y, or a tree from Y first and then a tree from X.

We can make a similar equation for forest k:

Based on these two equations, we can express delta for forest j and k:
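One consistent way to write these three decompositions is sketched below. It rests on assumptions that go beyond what is stated above: X takes up a proportion 1 − β of each forest and Y and Z a proportion β (writing β for β of this pair), the alpha terms for X, Y, Z and the mixed terms are read as conditional probabilities within those regions, and α_z and α_xz are names introduced here for the forest k analogues of α_y and α_xy.

$$
\begin{aligned}
\alpha_j &= (1-\beta)^2\,\alpha_x + \beta^2\,\alpha_y + 2\beta(1-\beta)\,\alpha_{xy} \\
\alpha_k &= (1-\beta)^2\,\alpha_x + \beta^2\,\alpha_z + 2\beta(1-\beta)\,\alpha_{xz} \\
\delta_{jk} &= (1-\beta)^2\,\alpha_x + \beta(1-\beta)\,\alpha_{xy} + \beta(1-\beta)\,\alpha_{xz} + \beta^2
\end{aligned}
$$

Under the same assumptions, adding the first two lines and halving gives

$$\delta_{jk} = \frac{\alpha_j + \alpha_k}{2} + \beta^2\left(1 - \frac{\alpha_y + \alpha_z}{2}\right)$$

For the red, blue, and yellow example, α_j = α_k = 0.5, α_y = α_z = 0, and β = 0.5, which gives 0.5 + 0.25 = 0.75, matching the value of δ found earlier.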

So, delta for this pair of forests depends on the alpha of each forest as well. I am not sure how to deal with these alpha terms. Maybe they cancel nicely when you add up all the pairs. But, even without those, we can form an inequality which relates beta to alpha and gamma. So, we return to the inequality:

$$\delta_{jk} \ge \beta_{jk}^2$$

We want to find an expression in terms of δ, the average of the deltas of each of the pairs. Given that there are F forests, the number of pairs of forests is:

$$\binom{F}{2} = \frac{F(F-1)}{2}$$

So, for both sides of the inequality, we will add up all F(F-1)/2 terms (one for each pair of forests) and then divide by F(F-1)/2. On the left hand side, this is just δ, but what do we get on the right hand side?
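To make the averaging step concrete, here is a small Python sketch that enumerates every pair of forests, confirms the pair count, and averages both sides of the pairwise inequality. The forests are made-up example data, β for a pair is assumed to be one minus the summed species-by-species minimum of the proportions, and the helper names are hypothetical.

```python
from itertools import combinations
from math import comb

def beta_pair(p, q):
    """Share of each forest left over after removing the largest matching slice."""
    species = set(p) | set(q)
    return 1.0 - sum(min(p.get(s, 0.0), q.get(s, 0.0)) for s in species)

def delta_pair(p, q):
    """Probability that a tree from p and a tree from q are different species."""
    species = set(p) | set(q)
    return 1.0 - sum(p.get(s, 0.0) * q.get(s, 0.0) for s in species)

# Hypothetical example data: four forests given as species proportions.
forests = [
    {"red": 0.5, "blue": 0.5},
    {"blue": 0.5, "yellow": 0.5},
    {"red": 0.25, "yellow": 0.75},
    {"green": 1.0},
]

pairs = list(combinations(forests, 2))
F = len(forests)
print(len(pairs), comb(F, 2), F * (F - 1) // 2)  # all three counts agree

# Left-hand side: the average delta. Right-hand side: the average of beta squared.
mean_delta = sum(delta_pair(p, q) for p, q in pairs) / len(pairs)
mean_beta_sq = sum(beta_pair(p, q) ** 2 for p, q in pairs) / len(pairs)
print(mean_delta, mean_beta_sq)
```

Note that the right-hand side is the average of the squared betas, which is in general not the same as the square of the average beta.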

In order to form a simple expression for the right hand side, first we must conceptualize α, δ, and β in a somewhat different way. Suppose we have a collection of forests, each of which has its own α value, and for which each pair of forests has its own β and δ values. Consider the entire distribution of values over all forests. There are three distributions, one for α, one for β, and one for δ.

We will call these three distributions (or populations) A, B, and D, respectively. Each of the three has a mean, which is the average of the alphas of each forest or of the betas and deltas of each pair. We already know the means: they are α, β, and δ. But each of these populations also has a variance, which measures the spread of values around the mean. Do the different forests have widely varied α diversity values, or are the α values similar? We can ask the same question for the β and δ values. The more famous standard deviation is the square root of the variance.

There is a formula for the variance of a general population X, which states that:

$$\operatorname{Var}(X) = \operatorname{mean}(X^2) - \bigl(\operatorname{mean}(X)\bigr)^2$$

In other words, to find the population variance, first we square each value in the population X, then take their average. Next, we take the average of the values of the population and square this average. Subtracting the second of these numbers from the first gives the population variance.
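The variance identity can be checked numerically with Python's standard statistics module. The values below are an arbitrary made-up population, not data from the forests.

```python
from statistics import fmean, pvariance

# Hypothetical population of values (for example, beta values of four pairs).
values = [0.2, 0.5, 0.5, 0.9]

mean_of_squares = fmean(v * v for v in values)  # square each value, then average
square_of_mean = fmean(values) ** 2             # average the values, then square
variance = pvariance(values)                    # population variance computed directly

print(variance, mean_of_squares - square_of_mean)
```

The two printed numbers should agree up to floating-point rounding.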

But we can solve this for the first term on the right hand side and write:

$$\operatorname{mean}(X^2) = \operatorname{Var}(X) + \bigl(\operatorname{mean}(X)\bigr)^2$$