WHY Statistics in Biology? (from Chad S Vanhouten)

 

Introduction: In AP Biology this year you will be collecting MULTIPLE sets of data.  The word Biology actually means “the study of life”; therefore, the GOAL is to actually study life in this class.  To do this, you will have to become “well versed” in the following skills…

     Observations and constructing observational studies

     Hypothesis generating

     Experimental design

     Data collection and data manipulation (graphing, tables, charts, statistics, confidence intervals, hypothesis testing, presentations)

     Claims, evidence (for these claims), and reasoning (why biological principles and your evidence together support your claim)

 

In order to be proficient in these skills, you will need a short statistics introduction (which will be followed by a year-long build-up of statistical skills).  But MOST IMPORTANTLY, WHY DO WE NEED STATISTICS IN BIOLOGY?  WHAT DOES IT DO FOR US?

 

1)      Let’s look at observations and observational studies.

a.       This is where you should start. WHAT DO YOU SEE? 

b.      Secondly, WHAT DO YOU ALREADY KNOW?

c.       You need this for a BASELINE.  It’s really to have something to compare to. How can you control an experiment on plant growth unless you know how they grow normally?  How can you do a pillbug behavior experiment unless you know how they naturally behave?  How can you compare environments for trichome density if you don’t know what happens naturally?

d.      Observational studies allow you to make some observations (because you really aren’t controlling anything purposely), which in conjunction with your past knowledge now allow you to make a hypothesis.

e.       For example, suppose you grow some plants and monitor them for a few weeks. What do you know about them? Can you make a hypothesis about two different plants and the trichome densities that you see (you might need to research trichomes here)?  What if we grew them in different environments?  Would anything change?

f.       In an observational study, you are looking for some CORRELATIONS (that’s why this is sometimes called a correlational study).  This DOESN’T MEAN CAUSATION, just that two things seem to change together, so you think one might be influencing the other.  It won’t become an experiment until you deliberately impose a treatment.

g.      So before you get started, look around and ask yourself: does X influence Y?  When I see X, what happens to Y? (I know this sounds like algebra, but math will help us soon enough.)

 

Look up some stuff about plants and trichomes…notes below…can you find any X’s with Y’s?

 

 

 

 

 

 

 

 

 

 

 

 

 

2)      Hypothesis generating

a.       This begins the fun part. Now you have some ideas (proposed correlations).  But this is where you take your correlations and create a POSSIBILITY for causation. This is your first introduction to Statistics. You need to think about Statistics and data generation NOW! What could you test? Why do you think that? Is this test even feasible? What do you EXPECT to happen, and WHY do you expect that? How would you know if your hypothesis is supported or rejected?

b.      FOR EVERY EXPERIMENT FROM NOW ON, you will always WRITE TWO HYPOTHESES (but you need to “mess about” for a little while before you settle on the one you want to pursue; it’s good for thinking of all the possibilities).

 

                                                              i.      THE NULL HYPOTHESIS (H0 = H-nought) – This is where you say there is NO DIFFERENCE between the groups you are comparing (in other words, no relationship between the variables in your experiment).

We will use stats TO TRY TO REFUTE THE NULL HYPOTHESIS.  In fact, this is your GOAL in every experiment; GO GET THE STRONGEST STATISTICAL EVIDENCE AGAINST THE NULL YOU CAN.

 

                                                            ii.      THE ALTERNATIVE HYPOTHESIS (H1) – This is the opposite of the null hypothesis.  You basically say there IS A DIFFERENCE between the groups you are comparing.   THE GOAL is to lend support to H1 by REFUTING H0 (which you believe would show some causation between variables, as in X CAUSES Y to happen).

 

                                                          iii.      To stick with plants that have trichomes…maybe in an observational study of their habitat, you noticed that they seem to grow trichomes most of the time, but not in the same numbers/density per plant. You do some background checking, and you find that trichomes seem to have evolved for protection from predators, for protection from frost, etc. So you think to yourself, a logical experiment would be: do they grow more trichomes in different environments?  Notice, a hypothesis is often called an “educated guess”; therefore, you have reason to think (education) they will grow more trichomes in the outside areas (predators) than in the inside areas (no predators) (which you decided on from your observations and background checking about plants). HOWEVER, notice your hypotheses…

 

H0 = There IS NO SIGNIFICANT DIFFERENCE in trichome density between plants in different environments.

 

H1 = There IS A SIGNIFICANT DIFFERENCE in trichome density between plants in different environments.

 

                                                          iv.      It is important to note, you are trying to support your alternative hypothesis by REJECTING THE NULL HYPOTHESIS.  Please don’t ever say you have proved anything.  You are now going to design an experiment to ATTACK that Null Hypothesis. SHOW US THAT YOU HAVE SIGNIFICANT EVIDENCE AGAINST IT.  If you can do this, by definition you will be lending support (not proof) to the alternative hypothesis. So the next question is…HOW STRONG IS YOUR EVIDENCE?

 

Are there any Hypotheses you could generate right now?  Practice writing an H0 and an H1.

 

 

 

 

 

3)      Experimental Design:

a.       Here’s another fun part…you get to DO SOME SCIENCE. 

b.      First, your experimental design needs to have only ONE VARIABLE that is manipulated (to refer to our plant lab, maybe environment like outside vs. inside).  This variable is called the EXPLANATORY VARIABLE (or the INDEPENDENT variable).

c.       The variable you measure is the RESPONSE VARIABLE (DEPENDENT variable). Now you are looking for an experiment where you impose a treatment…(lots of ways to say this)

 

1. EXPLANATORY VARIABLE CAUSES RESPONSE VARIABLE

2. INDEPENDENT VARIABLE CAUSES DEPENDENT VARIABLE

3. X CAUSES Y (which from algebra will lead you right into a graph)

 

d.      To do any of the above, the null hypothesis will state there is NO CAUSAL RELATIONSHIP between the variables. You want to be able to say I HAVE EVIDENCE TO REJECT THE NULL HYPOTHESIS (and therefore can lend support to the alternative hypothesis that one variable causes the other. Notice I didn’t say prove it).  If you have a good audience during a presentation, someone will ask you, “Well, how confident are you?”  You will be MUCH MORE CONFIDENT if you only manipulated one variable.  In fact, the reason we only manipulate one variable at a time is so that IF we notice a difference in the response variable (or dependent variable or Y – all the same thing), we can attribute it to the ONE variable that we manipulated rather than to something else.

e.       SET IT UP! I’m going to let you figure this out, but this includes: What data will you collect?  What variable is being manipulated? You need to think about everything else to this point (see above). What do I already know? What happened in my observational studies to this point?  What data will REJECT MY NULL HYPOTHESIS?  How will I collect that data?  What controls do I need to set up to ensure only ONE VARIABLE is tested?

f.       I left this all rather vague because I want you to figure it out, and because not all experimental designs need to be identical.  What is the best way to get data to CONFIDENTLY REJECT THE NULL HYPOTHESIS? (see quantitative skills guide if necessary)

 

What should we take into consideration for this experiment?  Variables (DV/IV)? Constants and Controls (not the same things)?

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

4)      Data collection

 

SO WHY DO WE NEED STATISTICS IN BIOLOGY?  WHAT DOES IT DO FOR US?

 

If you collected every plant in the world, measured their trichome density in outside and inside areas over time, and collected the results, THERE WOULD BE NO NEED FOR MORE STATISTICS. You would have every piece of data in the world. You would be 100% confident in your results (whatever they are).  You would be able to say, I am 100% sure there IS or ISN’T A DIFFERENCE in trichome density between plants in different environments.  But, is this really feasible? Can you collect every plant in the world?  Instead we collect a RANDOM sample of plants from the population.  Because a sample is only part of the population, there is some variability in the results from one sample to the next (ex. maybe we randomly selected a bunch of healthy seeds).

 

So guess what, you can pretty much NEVER be 100% confident in your results, hence, we can’t say we can PROVE things in science.  So, if we can’t prove it, then HOW CONFIDENT ARE WE? To answer this, you need to know BASIC STATISTICS. 

 

It is important to note that most of your data this year will fall into the category of a normal distribution.  We will talk about others later, but for the most part, here are the stats used with a normal distribution (see the normal curve below, which is right off Wikipedia and should look familiar from algebra).

 

A.    THE SAMPLE YOU ARE STUDYING:

a.       How random is your sample? Since we can’t collect all plants, which ones do you think are in our sample? F1/F2?  Green/yellow? Are they all the same species? Is the sample we are studying truly representative of the TOTAL POPULATION (which wasn’t feasible to collect)?  The sample you use could definitely affect how CONFIDENT you are in your results. One of the characteristics of good statistics, therefore, is that your sample is truly RANDOM (or as random as we can get it).  A random sample is important because it means that our sample should be representative of the population.  This allows us to generalize our results to the entire population from which we took our random sample.

b.      It is also important that you have a LARGE ENOUGH SAMPLE SIZE.  A large sample size is important for two reasons.  (1) It allows us to safely say that our sampling distribution (the distribution of the sample means we would get from many repeated samples) is approximately normal (see above), and (2) as the sample size increases, the variability of the sampling distribution decreases (the normal curve gets taller and skinnier).
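
If you want to see point (2) for yourself, here is a small Python sketch. Everything in it is an assumption made up for illustration: the "population" is 10,000 invented trichome densities (not real data), and the helper name `spread_of_sample_means` is hypothetical. It takes many random samples of a given size and measures how much the sample means bounce around.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# A made-up "population" of 10,000 trichome densities.
# Invented numbers for illustration only; this is NOT real data.
population = [rng.gauss(50, 10) for _ in range(10_000)]


def spread_of_sample_means(sample_size, repeats=500):
    """Draw many random samples and report how much their means vary."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    # The standard deviation of the sample means is the "width"
    # of the sampling distribution.
    return statistics.stdev(means)


for size in (5, 25, 100):
    print("sample size", size, "-> spread of means", round(spread_of_sample_means(size), 2))
```

The spread should shrink as the sample size grows, which is the "taller and skinnier" curve described above.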

 

B.     THE MEAN

a.       You will do this experiment with a sample (not one individual). Because of that, you need to calculate the Mean (average) across all the individuals in the sample, because your results will have some variance. Not every individual will perform the same, and therefore, you need to account for this by calculating the Mean (which tends to get closer to the “true population mean” the larger the sample). This is probably the easiest calculation to do, and while it could be close to the true mean, how confident are you that it is?

b.      Once you have the mean, you want to know, how much did the data vary around the mean? This is what the standard deviation (s) measures. It gives you a better idea of how far away your data varied from the mean.  Once you have calculated standard deviation, go back to your algebra days and think about a normal distribution (above, or refer to quantitative skills guide page 34 if necessary).  Once you have the standard deviation of your sample, you can find the standard error of the mean (SEM) by doing s/(square root of n), where n is the number of individuals in your sample....(see quantitative skills guide)

c.       With your normal curve (above), the mean you calculated is in the middle. If you move one standard error to the left and one to the right, you would now be encompassing about 68% of the distribution of sample means (see above in dark blue).  So you could say that you are about 68% Confident (+/- 1 S.E.) that the true population mean (which wasn’t feasible to collect) falls within this range, which is called a confidence interval.

d.      If you move two standard errors to the left and two to the right, you would now be encompassing about 95% of the distribution of sample means (see above in light blue).  So you could say that now you are about 95% Confident (+/- 2 S.E.) that the true population mean (which wasn’t feasible to collect) falls within the range of the confidence interval.

e.       If you move three standard errors to each side, you are about 99.7% confident (often rounded to 99%).

f.       The SEM ends up as error bars on your data graph (whichever type of graph you decide on).

g.      Notice, you aren’t saying what the true mean of the population is, just that you are (68%, 95%, 99%) confident that the true mean falls within this range.

h.      SO YOUR FINAL QUESTION SHOULD BE, could the Mean(s) you collected have any error? (which would make you less confident)
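
To make the steps in this section concrete, here is a minimal Python sketch that works through the mean, the standard deviation, the SEM (s divided by the square root of n), and the +/- 1, 2, and 3 S.E. ranges. The trichome counts are invented for illustration (they are not real data), and the helper name `interval` is hypothetical. Treat the percentages as the rough rule of thumb used in this handout; with small samples, statisticians usually use a slightly wider multiplier than 2 for 95%.

```python
import math
import statistics

# Hypothetical trichome counts per leaf for one group of plants.
# These numbers are invented for illustration only; they are NOT real data.
counts = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

n = len(counts)
mean = statistics.mean(counts)   # the sample mean
s = statistics.stdev(counts)     # sample standard deviation (divides by n - 1)
sem = s / math.sqrt(n)           # standard error of the mean: s / sqrt(n)


def interval(mean, sem, k):
    """Return the range from mean - k*SEM to mean + k*SEM."""
    return (mean - k * sem, mean + k * sem)


print(f"mean = {mean:.2f}, s = {s:.2f}, SEM = {sem:.2f}")
for k, label in [(1, "about 68%"), (2, "about 95%"), (3, "about 99.7%")]:
    low, high = interval(mean, sem, k)
    print(f"+/- {k} S.E. ({label}): {low:.2f} to {high:.2f}")
```

With these invented counts the mean is 13.5 and the SEM works out to 0.5, so the +/- 2 S.E. range runs from 12.5 to 14.5. That range is what you would draw as error bars around the mean.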

 

C.     ERROR of the mean.

a.       It is possible that you do multiple investigations with multiple populations (in fact, this is preferred in science as it improves your confidence). Therefore, calculating a mean happens lots of times, and you want to compare those means.  How much do the means differ from each other? This is another example of the use of Standard Error of the Mean (SEM).

b.      Perhaps you are comparing one set of plants to another (not just your outside and inside), like Chad’s outside plants to Luke’s outside plants. Here’s how to think about this.

 

1)      How confident am I in my results from each experiment?  68%, 95%, 99%?  (The mean for each population/experiment should be reported at the same confidence level.) Luke should be 95% confident and Chad should be 95% confident in their means for their outside plant populations.

2)      How much do the calculated means differ between populations? Use standard error of the mean. Does Chad’s outside population differ significantly from Luke’s outside population at 95% confidence (+/- 2 S.E.)?

3)      Think of this the same way with Chad’s outside population vs his inside population.
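
Here is one way the comparison above could be sketched in Python. It builds a mean +/- 2 S.E. range for each group and checks whether the two ranges overlap. The counts are invented for illustration (not real data), and the function names `two_se_interval` and `intervals_overlap` are hypothetical. Keep in mind that checking for overlap is the rule of thumb used in this handout, not a formal statistical test.

```python
import math
import statistics


def two_se_interval(values):
    """Return (low, high) for the mean +/- 2 standard errors."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return (mean - 2 * sem, mean + 2 * sem)


def intervals_overlap(a, b):
    """True if the two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented trichome counts, for illustration only (NOT real data).
outside = [18, 21, 19, 22, 20, 23, 19, 21]
inside = [12, 14, 11, 15, 13, 12, 14, 13]

outside_ci = two_se_interval(outside)
inside_ci = two_se_interval(inside)

if intervals_overlap(outside_ci, inside_ci):
    print("The ranges overlap: we fail to reject the null hypothesis.")
else:
    print("The ranges do not overlap: we have evidence against the null hypothesis.")
```

The same two functions work for Chad’s outside plants vs. Luke’s outside plants; just swap in the two lists you want to compare.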

 

Why do you think 95% is used so often?  What’s the probability of getting some data that doesn’t fall in your confidence interval?  Do you think different confidence intervals could be used for different things? WHY?

 

 

 

 

 

5)      Claim, Evidence and Reasoning.

a.       At this point, you have performed an experiment(s) and are ready to make a claim, use evidence to support it, and biological principles for reasoning.  Really your claim is just your hypothesis restated, and then encompassing the statistical test you used (evidence).  REMEMBER YOU WERE GOING AFTER (attacking) THE NULL HYPOTHESIS, SO…

 

H0 = There IS NO SIGNIFICANT DIFFERENCE in trichome density between plants in different environments.  If you are less than 95% confident that your data show a difference between environments (for example, if the +/- 2 S.E. error bars on your graph overlap), your response is…

 

CLAIM: We fail to reject the null hypothesis (notice that we didn’t say “accept the null”). We DO NOT have significant EVIDENCE against the null. Our results could have happened purely by chance, and we do not have good evidence to support the alternative hypothesis.

 

                                    OR

 

H0 = There IS NO SIGNIFICANT DIFFERENCE in trichome density between plants in different environments.  If you are at least 95% confident that your data show a difference between the environments (for example, if the +/- 2 S.E. error bars on your graph DO NOT overlap), your response is…

 

CLAIM: WE DO have significant EVIDENCE to reject the null hypothesis. Our results are UNLIKELY to have happened purely by chance; therefore, we have good evidence to support the alternative hypothesis (which is below).

 

H1 = There IS A SIGNIFICANT DIFFERENCE in trichome density between plants in different environments.

 

b.      Reasoning – Reasoning happens when you apply all you know with the results of an experiment.

                                                              i.      For Example, if plant trichomes were significantly more dense in an outside population

1.      Claim/ Evidence: at 95% confidence (which is +/- 2 S.E. from the mean), we believe there is a significant difference in trichome density for different environments.

2.      Reasoning: We believe this to be true because…

a.       Research suggests trichomes protect plants from predators (and there are more predators outside than inside)

b.      Trichomes help keep the frost away from the living surface of cells (and some environments are colder than others)

c.       Our test was a fair test (due to 95% confidence, set up, controls, etc)

d.      You also acknowledge where people will have problems and refute them if possible…(ex. you might think we had a small sample, but let me remind you of why we had a representative sample).

e.       NOTICE how you are bringing everything together at the end, your research, knowledge, hypothesis, evidence, biological principles, etc. That’s what reasoning is; discussing REASONS you are CONFIDENT that your EVIDENCE allows you to make this CLAIM!

 

Any Questions? Let’s look at the AP Biology Test from 2014. See if you can answer FRQ #1.

Also, look at FRQ #4.  Don’t answer the whole question, just tell me everything you can about the graph, and answer part A.