__WHY Statistics in Biology?__ (from **Chad S Vanhouten**)

**Introduction**:
In AP biology this year you will be collecting MULTIPLE sets of data. The word Biology actually means “the study of
life,” therefore, the GOAL is to actually study life in this class. To do this, you will have to become “well
versed” in the following skills…

● Observations and constructing observational studies

● Hypothesis generating

● Experimental design

● Data collection and data manipulation (graphing, tables, charts, statistics, confidence intervals, hypothesis testing, presentations)

● Claims, evidence (for these claims), and reasoning (why biological principles and your evidence together support your claim)

To become proficient in these skills, you will need a short statistics introduction (which will be followed by a year-long build-up of statistical skills). But MOST IMPORTANTLY, WHY DO WE NEED STATISTICS IN BIOLOGY? WHAT DOES IT DO FOR US?

1) __Let’s look at observations and observational studies.__

a. This is where you should start. WHAT DO YOU SEE?

b. Secondly, WHAT DO YOU ALREADY KNOW?

c. You need this for a BASELINE. It’s really to have something to compare to. How can you control an experiment on plant growth unless you know how they grow normally? How can you do a pillbug behavior experiment unless you know how they naturally behave? How can you compare environments for trichome density if you don’t know what happens naturally?

d. Observational studies allow you to make some observations (because you really aren’t controlling anything purposely), which in conjunction with your past knowledge now allow you to make a hypothesis.

e. For example, suppose you grow some plants and monitor them for a few weeks. What do you know about them? Can you make a hypothesis about two different plants and the trichome densities that you see (you might need to research trichomes here)? What if we grew them in different environments? Would anything change?

f. In an observational study, you are looking for some **CORRELATIONS** (that’s why these are sometimes called correlational studies).

g. So before you get started, look around and ask yourself does X influence Y? When I see X, what happens to Y? (I know this sounds like algebra, but math will help us soon enough).
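If you want to put a number on "when I see X, what happens to Y?", one common option is a correlation coefficient. Here is a minimal sketch, assuming a made-up X (hours of light) and Y (trichome count); the data and the function name `pearson_r` are for illustration only and are not from our lab.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up observations (assumption, illustration only)
light_hours = [2, 4, 6, 8, 10]       # X: hours of direct light per day
trichomes = [11, 14, 18, 19, 24]     # Y: trichomes counted per leaf sample

r = pearson_r(light_hours, trichomes)
print(f"r = {r:.2f}")  # near +1: Y rises with X; near 0: no clear pattern
```

Remember that a strong correlation only suggests a hypothesis. It does not show that X causes Y.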

__Look up some stuff about plants and trichomes…notes below…can you find any X’s with Y’s?__

2) __Hypothesis generating__

a. This begins the fun part. Now you have some ideas (proposed correlations). But this is where you take your correlations and create a POSSIBILITY for causation. This is your first introduction to Statistics. You need to think about Statistics and data generation NOW! What could you test? Why do you think that? Is this test even feasible? What do you EXPECT to happen, and WHY do you expect that? How would you know if your hypothesis is supported or rejected?

b. FOR EVERY EXPERIMENT FROM NOW ON, you will always **WRITE TWO HYPOTHESES** (but you need to “mess about” for a little while before you settle on one you want to pursue; it’s good for thinking of all possibilities).

i. **THE NULL HYPOTHESIS (H_{0} = H-nought)** – This is where you say there is NO DIFFERENCE between the variables in your experiment.

We will use stats TO TRY TO REFUTE THE NULL HYPOTHESIS. In fact, this is your GOAL in every experiment: __GO GET THE STRONGEST STATISTICAL EVIDENCE AGAINST THE NULL YOU CAN.__

ii. **THE ALTERNATIVE HYPOTHESIS (H_{1})** – This is the opposite of the null hypothesis. You basically say there IS A DIFFERENCE between the variables in your experiment.

iii. To stick with plants that have trichomes: maybe in an observational study of their habitat, you noticed that they grow trichomes most of the time, but not in the same numbers/density per plant. You do some background checking, and you find that trichomes seem to have evolved for protection from predators, protection from frost, etc. So you think to yourself that a logical experiment would ask: do they grow more trichomes in different environments? Notice that a hypothesis is often called an “educated guess”; you have reason to think (education) they will grow more trichomes in the outside areas (predators) than in the inside areas (no predators), which you decided on from your observations and background checking about plants. HOWEVER, notice your hypotheses…

**H_{0} = There IS NO SIGNIFICANT DIFFERENCE between the trichome density of a plant and its environment.**

**H_{1} = There IS A SIGNIFICANT DIFFERENCE between the trichome density of a plant and its environment.**

iv. It is important to note, you are trying to support your alternative hypothesis by REJECTING THE NULL HYPOTHESIS. Please don’t ever say you have proved anything. You are now going to design an experiment to ATTACK that Null Hypothesis. SHOW US THAT YOU HAVE SIGNIFICANT EVIDENCE AGAINST IT. If you can do this, by definition you will be lending support (not proof) to the alternative hypothesis. So the next question is…HOW STRONG IS YOUR EVIDENCE?
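If it helps to see the two hypotheses in symbols, here is one hedged way to write them for the trichome example. The symbols are assumptions introduced for illustration: \mu_{outside} and \mu_{inside} stand for the true (population) mean trichome density in each environment.

```latex
H_{0}:\ \mu_{\text{outside}} = \mu_{\text{inside}}
\qquad\text{versus}\qquad
H_{1}:\ \mu_{\text{outside}} \neq \mu_{\text{inside}}
```

Notice that both statements are about the population means, which you can never measure directly. Your sample only gives you evidence for or against H_{0}.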

__Are there any hypotheses you could generate right now? Practice writing an H_{0} and H_{1}.__

3) __Experimental Design:__

a. Here’s another fun part…you get to DO SOME SCIENCE.

b. First, your experimental design needs to have only **ONE VARIABLE** that is manipulated (to refer to our plant lab, maybe environment, like outside vs. inside). This variable is called the **EXPLANATORY VARIABLE** (also called the independent variable).

c. The variable you measure is the **RESPONSE VARIABLE** (also called the dependent variable).

1. EXPLANATORY VARIABLE **CAUSES** RESPONSE VARIABLE.

2. INDEPENDENT VARIABLE **CAUSES** DEPENDENT VARIABLE.

3. X **CAUSES** Y (which from algebra will lead you right into a graph).

d. To do any of the above, the null hypothesis will state there is NO CAUSAL RELATIONSHIP between the variables. You want to be able to say I HAVE EVIDENCE TO **REJECT** THE NULL HYPOTHESIS (and therefore can lend support to the alternative hypothesis that one variable causes the other; notice I didn’t say prove it). If you have a good audience during a presentation, someone will ask you, “Well, how confident are you?” You will be MUCH MORE CONFIDENT if you only manipulated one variable. In fact, the reason why we only manipulate one variable at a time is so that any change in the response variable can be attributed to that one variable.

e. SET IT UP! I’m going to let you figure this out, but this includes, what data will you collect? What variable is being manipulated? You need to think about everything else to this point (see above). What do I already know? What happened in my observational studies to this point? What data will REJECT MY NULL HYPOTHESIS? How will I collect that data? What controls do I need to set up to ensure only ONE VARIABLE is tested?

f. I left this all rather vague because I want you to figure it out, and because all experimental designs do not need to be identical. What is the best way to get data to CONFIDENTLY REJECT THE NULL HYPOTHESIS? (see quantitative skills guide if necessary)

__What should we take into consideration for this experiment? Variables (DV/IV)? Constants and Controls (not the same things).__

4) __Data collection__

__SO WHY DO WE NEED STATISTICS IN BIOLOGY? WHAT DOES IT DO FOR US?__

If you collected every plant in the world, measured their trichome density in outside and inside areas over time, and collected the results, THERE WOULD BE NO NEED FOR MORE STATISTICS. You would have every piece of data in the world. You would be 100% confident in your results (whatever they are). You would be able to say, I am 100% sure **there IS or ISN’T A SIGNIFICANT DIFFERENCE between the trichome density of a plant and its environment.** But is this really feasible? Can you collect every plant in the world? Instead we collect a RANDOM sample of plants from the population. Because the sample is random, there is some variability in the results (ex. maybe we randomly selected a bunch of healthy seeds).

**So guess what, you can pretty much NEVER be 100% confident in your results; hence, we can’t say we can PROVE things in science.** So, if we can’t prove it, then HOW CONFIDENT ARE WE? To answer this, you need to know BASIC STATISTICS.

It is important to note that most of your data this year will fall into the category of a normal distribution. We will talk about others later, but for the most part, here are the stats used with a normal distribution (see below, but right off Wikipedia, and it should look familiar from algebra).

A. __THE SAMPLE YOU ARE STUDYING:__

a. How random is your sample? Since we can’t collect all plants, which ones do you think are in our sample? F_{1}/F_{2}? Green/yellow? Are they all the same species? Is the sample we are studying truly representative of the TOTAL POPULATION (which wasn’t feasible to collect)? The sample you use could definitely affect how CONFIDENT you are in your results. One of the characteristics of good statistics, therefore, is that your sample is truly **RANDOM** (or as random as we can get it). A random sample is important because it means that our sample should be representative of the total population.
b. It is also important that you have a **LARGE ENOUGH SAMPLE SIZE**. A large sample size is important for two reasons: (1) it allows us to safely say that our sampling distribution is approximately normal (see above), and (2) as the sample size increases, the variability of the sampling distribution decreases (the normal curve gets taller and skinnier).
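You can watch reason (2) happen with a small simulation. This is only a sketch under stated assumptions: the "total population" below is 10,000 made-up trichome densities (mean 20, standard deviation 5), and `spread_of_sample_means` is a hypothetical helper name. It takes many random samples of a given size and reports how much the sample means bounce around.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical "total population" (assumption): 10,000 made-up densities
population = [random.gauss(20, 5) for _ in range(10_000)]

def spread_of_sample_means(n, repeats=1000):
    """Draw `repeats` random samples of size n; return the SD of their means."""
    means = [mean(random.sample(population, n)) for _ in range(repeats)]
    return stdev(means)

for n in (5, 20, 80):
    # Larger samples give sample means that cluster more tightly
    print(f"n = {n:2d}: spread of sample means = {spread_of_sample_means(n):.2f}")
```

The spread should shrink as n grows, which is the "taller and skinnier" curve described above.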

B. __THE MEAN__

a. You will do this experiment with a sample (not one individual). Because of that, you need to calculate the **Mean** (average) across all individuals in the sample, because your results will have some variance. Not every individual will perform the same, and therefore you need to account for this by calculating the mean.
b. Once you have the mean, you want to know how much your data varied around the mean. This is the **standard deviation**. It gives you a better idea of how far your data spread out from the mean. Once you have calculated standard deviation, go back to your algebra days and think about a normal distribution (above, or refer to quantitative skills guide page 34 if necessary). Once you have the standard deviation of your sample (s), you can find the standard error of the mean (SEM) by dividing s by the square root of the sample size (n): SEM = s / √n (see quantitative skills guide).

c. With your normal curve (above), the mean you calculated is in the middle. If you move one standard error to the left and one to the right, you would now be encompassing 68% of the distribution (see above in dark blue). So you could say that you are 68% Confident (+/- 1S.E.) that the true population mean (which wasn’t feasible to collect) falls within the range of the confidence interval.

d. If you move two standard errors to the left and two to the right, you would now be encompassing 95% of the distribution (see above in light blue). So you could say that now you are 95% Confident (+/- 2S.E.) that the true population mean (which wasn’t feasible to collect) falls within the range of the confidence interval.

e. If you move three standard errors to the left and three to the right, you are about 99% confident (99.7%, to be precise).

f. The SEM ends up as confidence bars on your data graph (whichever type you decide)

g. Notice, you __aren’t saying what the true mean of the population is__, just that you are (68%, 95%, 99%) confident that __it will fall within this range__.

h. SO YOUR FINAL QUESTION SHOULD BE, could the Mean(s) you collected have any error? (which would make you less confident)
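The steps in this section (mean, standard deviation, SEM, then +/- 1, 2, or 3 S.E.) can be sketched in a few lines of Python. The nine trichome counts are made up for illustration, and the names `summarize` and `confidence_range` are hypothetical. Keep in mind that +/- 2 S.E. is the approximation used in this handout; with small samples, a more careful interval would be a bit wider.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(sample):
    """Return (mean, standard deviation, standard error of the mean)."""
    n = len(sample)
    m = mean(sample)
    s = stdev(sample)      # sample standard deviation (divides by n - 1)
    sem = s / sqrt(n)      # SEM = s / sqrt(n)
    return m, s, sem

def confidence_range(sample, k=2):
    """Range of mean +/- k standard errors (k=1 ~68%, k=2 ~95%, k=3 ~99.7%)."""
    m, _, sem = summarize(sample)
    return m - k * sem, m + k * sem

# Made-up trichome counts for nine outside plants (assumption, illustration only)
outside = [19, 21, 23, 25, 27, 21, 23, 25, 23]

m, s, sem = summarize(outside)
low, high = confidence_range(outside, k=2)
print(f"mean = {m:.2f}, SD = {s:.2f}, SEM = {sem:.2f}")
print(f"about 95% confident the true mean is between {low:.2f} and {high:.2f}")
```

The `low` and `high` values are where the ends of your error bars would sit on a graph of this mean.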

C. __ERROR of the mean.__

a. It is possible that you do __multiple investigations__ with __multiple populations__ (in fact, this is preferred in science as it improves your confidence). Therefore, calculating a mean happens lots of times, and you want to compare those means. How much do the means differ from each other? This is another example of the use of Standard Error of the Mean (SEM).

b. Perhaps you are comparing one set of plants to another (not just your outside and inside), like Chad’s outside plants to Luke’s outside plants. Here’s how to think about this.

1) How confident am I in my results from each experiment? 68%, 95%, 99%? (each mean for each population/experiment should be equally confident) Luke should be 95% confident and Chad should be 95% confident in their means for their outside plant populations.

2) How much variance is there in the calculated means between populations? Use standard error of the mean. Does Chad’s outside population differ significantly vs Luke’s outside population at 95% confidence (+/- 2S.E.)?

3) Think of this the same way with Chad’s outside population vs his inside population.
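Here is a minimal sketch of that comparison, using the error-bar rule of thumb from this handout: build each mean's +/- 2 S.E. range and check whether the ranges overlap. All three data sets are made up, and `ci`, `intervals_overlap`, and `compare` are hypothetical helper names. This is a rule of thumb, not a formal statistical test.

```python
from math import sqrt
from statistics import mean, stdev

def ci(sample, k=2):
    """Mean +/- k standard errors, returned as (low, high)."""
    sem = stdev(sample) / sqrt(len(sample))
    m = mean(sample)
    return m - k * sem, m + k * sem

def intervals_overlap(a, b):
    """True if two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

def compare(sample_a, sample_b):
    """Apply the handout's rule of thumb at roughly 95% confidence."""
    if intervals_overlap(ci(sample_a), ci(sample_b)):
        return "fail to reject H0 (error bars overlap)"
    return "reject H0 (error bars do not overlap)"

# Made-up trichome counts (assumption, illustration only)
chad_outside = [19, 21, 23, 25, 27, 21, 23, 25, 23]
luke_outside = [20, 22, 24, 26, 28, 22, 24, 26, 24]
chad_inside = [12, 14, 16, 18, 20, 14, 16, 18, 16]

print("Chad outside vs Luke outside:", compare(chad_outside, luke_outside))
print("Chad outside vs Chad inside: ", compare(chad_outside, chad_inside))
```

With these invented numbers, the two outside populations look alike, while outside and inside look different.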

__Why do you think 95% is used so often? What’s the probability of getting some data that doesn’t fall in your confidence interval? Do you think different confidence intervals could be used for different things? WHY?__

5) __Claim, Evidence and Reasoning.__

a. At this point, you have performed an experiment(s) and are ready to make a claim, use evidence to support it, and biological principles for reasoning. Really your claim is just your hypothesis restated, and then encompassing the statistical test you used (evidence). REMEMBER YOU WERE GOING AFTER (attacking) THE NULL HYPOTHESIS, SO…

H_{0} = There IS NO SIGNIFICANT DIFFERENCE between the trichome density of a plant and its environment. **If you are less than 95% confident that your means, with their standard error of the mean (SEM), show a difference between environments (or if error bars on a graph overlap), your response is…**

__CLAIM__: We fail to reject the null hypothesis (notice that we didn’t say “accept the null”). We DO NOT have significant __EVIDENCE__ against the null. Our results could have happened purely by chance, and we do not have good evidence to support the alternative hypothesis.

OR

H_{0} = There IS NO SIGNIFICANT DIFFERENCE between the trichome density of a plant and its environment. **If you are more than 95% confident that your means, with their standard error of the mean (SEM), show a difference between the environments (or if error bars on a graph DO NOT overlap), your response is…**

__CLAIM__: WE DO have significant __EVIDENCE__ to reject the null hypothesis. Our results are unlikely to have happened purely by chance; therefore, we have good evidence to support the alternative hypothesis (which is below).

H_{1} = There IS A SIGNIFICANT DIFFERENCE between the trichome density of a plant and its environment.

b. **Reasoning** – Reasoning happens when you combine all you know with the results of an experiment.

i. For Example, if plant trichomes were significantly more dense in an outside population

1. **Claim/Evidence:** at 95% confidence (which is +/- 2 S.E. from the mean), we believe there is a significant difference in trichome density for different environments.

2. **Reasoning:** We believe this to be true because…

a. Research suggests trichomes protect plants from predators (and there are more predators outside than inside)

b. Trichomes help keep the frost away from the living surface of cells (and it is colder outside than inside)

c. Our test was a fair test (due to 95% confidence, set up, controls, etc)

d. You also acknowledge where people will have problems and refute them if possible…(ex. you might think we had a small sample, but let me remind you of why we had a representative sample).

e. NOTICE how you are bringing everything together at the end: your research, knowledge, hypothesis, evidence, biological principles, etc. That’s what reasoning is: discussing **REASONS** you are CONFIDENT that your evidence supports your claim.

__Any Questions? Let’s look at the AP Biology Test from 2014. See if you can answer FRQ #1.__

__Also, look at FRQ #4. Don’t answer the whole question, just tell me everything you can about the graph, and answer part A.__