# Fall of the *p* paradigm: Modern methods for reporting linguistic research

Null Hypothesis Significance Testing (NHST) has been thoroughly discredited, yet *ignoramus et ignorabimus*, many linguists continue to report *p* values as if they provide their readers with something of value. In fact, "statistically significant" (often *p* < .05) is so often confused with practical importance that authors who report *p* (and the journal editors who publish the research) do little to support the narrated conclusions and qualitative *bona fides* of otherwise very good research.

The question becomes: how do we report research without relying on the ubiquitous *p*? NHST is what we know, and often the only thing we know well. We have learned to interpret probability claims like *p* < .05, and statements like "was found to be significant" or "very significant," or, for something like *p* = .051, "trending towards significance". As we know, these are all meaningless statements unless they are cast as simply revealing that the *p* value fell under an arbitrary alpha value for the sample size used in the study. Whether *p* is larger or smaller than your chosen alpha says nothing about the size of the effect you are trying to measure, nor does it affirm or deny a linguistic theory or hypothesis, probabilistic or otherwise. It simply reports the internal dynamics of your sample. At best, it can help you at the beginning of your study.

If you are compelled to report *p*, it is possible to mention it as part of your research preparation or pretesting procedure. For example, you could write:

"I found that my sample size did not give me a *p* value smaller than my alpha value, which I set at 5%. Therefore, I increased my sample size from 22 to 32 participants before proceeding with my study. A sample size of 32 allowed me to cross my 5% alpha threshold, meaning that my sample size may help me get closer to the true population parameter when I perform my statistical analysis."

As shown above, some elements of NHST have value when they are used at the outset of your study. Major software programs like SAS include a sample size calculator, and some programs specialize in these and other calculations to help you get off to a good start. One commercial program is NCSS PASS; a free one is G*Power. These leverage what NHST is good for: planning.
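As a sketch of that planning step, the required per-group sample size for a two-sided, two-sample *t* test can be approximated with nothing but the Python standard library. The function name and defaults below are illustrative, and the normal approximation used here slightly undershoots what exact *t*-based software such as G*Power or PASS reports:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample t-test.

    Normal approximation: n ~ 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    Exact t-based tools (G*Power, PASS) give a slightly larger n.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for power = .80
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Planning example: detecting a medium effect (Cohen's d = 0.5) with 80% power.
print(sample_size_per_group(0.5))  # 63 per group; G*Power's exact answer is 64
```

The point is that the alpha and power values do their honest work here, before data collection, rather than being invoked after the fact to bless a result.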

It is important to restate in no uncertain terms that NHST, if used at all, should be part of planning your study; it should not be used for drawing conclusions about your linguistic hypotheses, or for comparing your study with somebody else's. Doing so invites a cascade of fallacies, as described by many authors, including Rex Kline in *Beyond Significance Testing: Statistics Reform in the Behavioral Sciences*, published by the APA (American Psychological Association) in 2013.

Now, what do you report?

As the APA recommends, we should report **effect sizes** and **confidence intervals** for all important statistics. The size of the effect, or effect size, is what researchers actually care about. Confidence intervals reveal how precise your statistics are once you account for sampling error, an unavoidable part of almost all linguistic research. It is best to be honest about sampling error by reporting confidence intervals, and using them in your charts and graphics does exactly that. Each arm of the confidence interval is the margin of error, so the full width of the interval is roughly twice the margin of error. Confidence intervals are part of the estimation thinking that defines modern methods for reporting research.

For estimates of effect sizes, it is common to provide Pearson's *r*, Cohen's *d*, or ratios. If you want the least biased measures, replace Pearson's *r* (and its variance-explained cousin, eta-squared) with omega-squared, ω², and replace Cohen's *d* with Hedges' *g*. What is bias? Bias means that, on average across repeated samples, your statistic (mean, effect size, regression coefficient, etc.) systematically lands away from the true population parameter you are trying to estimate. For example, Cohen's *d* is inflated by about 4% when your total sample size is 20, and by about 2% when it is 50. This means the estimate of *d* reports a larger effect than is real (note that zero means no effect; unlike *r*, which is bounded between -1 and +1, *d* has no upper or lower bound). As you look into it further, you will notice that many statistical biases are worst at small sample sizes, which, as it turns out, are common in some of the most meaningful linguistic research, especially in sociolinguistics and SLA (Second Language Acquisition).
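The correction behind Hedges' *g* is small enough to sketch by hand: *g* is Cohen's *d* multiplied by the factor J = 1 − 3/(4·df − 1), which shrinks *d* by roughly the 4% mentioned above when the total sample size is 20. A minimal standard-library sketch, with invented toy data:

```python
from statistics import mean, variance

def cohens_d(group1, group2):
    """Cohen's d with a pooled standard deviation (biased upward in small samples)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) +
                  (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5

def hedges_g(group1, group2):
    """Hedges' g: Cohen's d times the correction J = 1 - 3/(4*df - 1)."""
    df = len(group1) + len(group2) - 2
    return cohens_d(group1, group2) * (1 - 3 / (4 * df - 1))

# Invented reaction-time data (ms) for two groups of 10 learners:
control   = [520, 540, 515, 560, 530, 545, 525, 550, 535, 540]
treatment = [495, 510, 500, 530, 505, 515, 490, 520, 500, 510]

d = cohens_d(control, treatment)
g = hedges_g(control, treatment)
print(f"d = {d:.3f}, g = {g:.3f}")  # g is about 4% smaller than d at n = 20
```

With larger samples the correction factor approaches 1 and *g* converges on *d*, which is exactly the small-sample bias pattern described above.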

As for confidence intervals, they are just another checkbox in your statistics program's output options. In prose they are reported like this:

"The sample mean was M = 120, 95% CI [115, 125]. Given the confidence interval, it was plausible that the sample was drawn from a population whose mean was somewhere between 115 and 125."

Important: The statement above did NOT include the word "probability". Remember, a confidence interval does not make a probability statement about the population you are studying; it is a statement about the interval-generating procedure applied to the sample you took from that population. Curiously enough, confidence intervals are sometimes put through a frequentist sieve to produce statements like "the confidence interval captures the true population mean," followed by a citation to somebody who is fettered to NHST. This should be avoided. It represents point thinking (an illusion of accuracy), not estimation thinking. Use the language above, as sanctioned by the statisticians who invented confidence interval methodology, and refer to APA guidelines for other elements.
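A CI like the one quoted above can be computed and formatted with only the Python standard library. This is a hedged sketch: the data are invented, and the normal critical value (about 1.96) is a large-sample shortcut; for small samples you would substitute the *t* critical value your statistics program reports:

```python
from statistics import NormalDist, mean, stdev

def mean_with_ci(data, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation CI.

    Each arm of the interval is the margin of error: critical value
    times standard error. For small samples, substitute the t critical
    value for the z value below.
    """
    m = mean(data)
    se = stdev(data) / len(data) ** 0.5             # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for a 95% CI
    margin = z * se
    return m, m - margin, m + margin

# Invented test scores for 10 participants:
scores = [118, 122, 119, 125, 116, 121, 124, 120, 117, 123]
m, lo, hi = mean_with_ci(scores)
print(f"M = {m:.0f}, 95% CI [{lo:.0f}, {hi:.0f}]")
```

Reporting the bracketed interval alongside the point estimate is the whole habit: it keeps sampling error in front of the reader rather than buried in a *p* value.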

Unfortunately, if you want more information, our published statistics references within linguistics (before 2012) cannot be recommended without reservation at this time. Besides presenting NHST as if it were the only thing that exists in statistics, simple errors, like confusing alpha with *p*, seem to pop up every few pages. Major issues crop up as well. For example, one noteworthy linguist who authored a book on statistics for linguistic research repeatedly confuses the mathematical framework of Analysis of Variance (ANOVA) with a later development, and then cites a respected statistics book that did not make this mistake. Just be aware that these things are going on, and that it may behoove you to access statistics resources outside of linguistics. Coursera, for example, offers free online introductory statistics courses for social science researchers. Among the best is *Statistics One* by Dr. Andrew Conway of Princeton. In addition, here are two excellent books:

**Discovering Statistics using IBM SPSS Statistics** by Andy Field, 2013.

**Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis** by Geoff Cumming, 2012.

You can rent Dr. Cumming's book from Amazon.com for under $11 (as of April 2014 when this was written).

In conclusion, the paradigm of *p* as a standalone measure of importance has fallen. It is time for linguists to move forward with statistical methods that provide information that is useful, meaningful, and honest, and that encourage replication and communication among researchers.

**- Paul Kidhardt, Arizona State University**