Avoiding Data Analysis Pitfalls

By: Quinhua (Queenie) Nian

As a statistician and researcher with over 10 years of experience, I have provided data analysis assistance to many stakeholders and clients on a variety of projects. In my current role as a Statistician and Data Analyst with the Discovery Center for Evaluation, Research, and Professional Learning, much of my work involves interacting with project leaders, researchers, other statisticians, and doctoral and graduate students to plan and implement appropriate quantitative and qualitative methods for evaluating pK-20+ education initiatives. Through my interactions with clients and colleagues over the years, I have observed a wide range of expertise and experience with data analysis. Essential skills for successfully planning and executing data analyses include:

Applying a careful and logical approach to data analysis;
Making sense of and measuring change among statistical results;
Conversing with colleagues to validate their own, or others’ analyses; and
Accurately interpreting and visualizing results.

Based on these observations, and combined with my own expertise and experiences, I offer a set of practical tips that others, particularly novices, can follow when conducting data analysis. My goal is to help others better understand, anticipate, and avoid the potential pitfalls that can occur during the data analysis process.

Tip #1: Ensure that the analysis is necessary and accurate: Before conducting data analysis, be sure you understand the purpose of the analysis, and the characteristics and components of the data with which you are working.

Always start with questions, not data or a technique. Whether you are collecting data or analyzing pre-existing data for a study, make sure you understand the purpose of the work to be completed. Take time to formulate questions or hypotheses about the measurable outcomes/impacts related to the data you are analyzing. This ensures that you are collecting appropriate data or helps you identify possible gaps in data already collected. Generally, analysis without preliminary inquiry is aimless. In addition, don’t get hung up on using a “favorite technique” or any other fancy statistical method for your analysis, if it’s not appropriate for your data and inquiry. This could limit the scope of your work by leading you to focus only on certain evaluation questions or data where those specific techniques can be applied.
Be aware of data “vital signs.” During the early stages of analysis, before running analyses that address your research or evaluation questions, check the data’s “vital signs” (e.g., frequencies and missing data). This is a critical step for “validating” your data, and it may help you detect problems with the data.
One slice of data vs. all. Consider “slicing” the data you are working with (e.g., filtering the data to define subgroups) to find out whether results differ for affected subgroups (e.g., gender, race, grade level). Even if you do not anticipate differential results, looking at a few slices of data for internal consistency gives you greater confidence that you are measuring what you intended to measure. Slicing can also save time when you are working with a large data set: instead of working with millions of records, you can draw a random sample[1] to test your code, detect key relationships, and decide on the next steps of your analysis.
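The checks described in Tip #1 can be sketched in a few lines of code. The example below is only an illustration, assuming a small, made-up list of student records; the field names (student_id, gender, grade, score) and the helper functions are hypothetical and do not come from any particular project. It counts frequencies and missing values for one field, compares a mean across subgroups, and draws a reproducible random sample.

```python
import random
from collections import Counter

# Hypothetical records; the field names and values are made up for illustration.
records = [
    {"student_id": 1, "gender": "F", "grade": 3, "score": 78.0},
    {"student_id": 2, "gender": "M", "grade": 3, "score": None},
    {"student_id": 3, "gender": "F", "grade": 4, "score": 85.5},
    {"student_id": 4, "gender": None, "grade": 4, "score": 91.0},
    {"student_id": 5, "gender": "M", "grade": 3, "score": 66.0},
    {"student_id": 6, "gender": "F", "grade": 4, "score": 72.5},
]

def vital_signs(rows, field):
    """Return the row count, missing-value count, and value frequencies for one field."""
    values = [row.get(field) for row in rows]
    missing = sum(1 for v in values if v is None)
    freqs = Counter(v for v in values if v is not None)
    return {"n": len(values), "missing": missing, "frequencies": dict(freqs)}

def slice_mean(rows, by, value_field):
    """Mean of value_field within each subgroup defined by `by`, skipping missing values."""
    groups = {}
    for row in rows:
        key, val = row.get(by), row.get(value_field)
        if key is None or val is None:
            continue
        groups.setdefault(key, []).append(val)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

def random_subset(rows, k, seed=0):
    """Simple random sample without replacement; seeded so the draw is reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

print(vital_signs(records, "gender"))
print(slice_mean(records, "grade", "score"))
print(len(random_subset(records, 3)))
```

With real data you would likely run the same checks in whatever statistical package you already use; the point is simply to look at counts, gaps, and subgroup results before any modeling.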

Tip #2: Ensure the analysis is appropriate: Once you have defined the purpose of your study, and you have a good understanding of the data you will be working with, you can determine the type of statistical methods to use.

Check data distribution. Most often, findings from a data set are represented using summary statistics (e.g., mean, median, standard deviation). While these statistics can be accurate measures of a data set, you should also examine the data distribution using histograms, Q-Q plots, or P-P plots.[2] These will allow you to see important or interesting features of the data, such as a significant class of outliers, skewness, or kurtosis (Field, 2000 & 2009; Gravetter & Wallnau, 2014; Trochim & Donnelly, 2006). Understanding specific characteristics of your data will help you choose appropriate statistical methods for analyzing it (Helberg, 1996).
Observe and explore the outliers. Always investigate any outliers in your data. Outliers can raise or reveal fundamental problems with your analysis, particularly if the outliers contain patterns. If so, consider conducting exploratory analyses to find the reason for the patterns. Consider the purpose of your study, your research questions, and your sample size before deciding whether to exclude outliers from your data, modify them, or lump them together into reporting categories (Morrow, 2016). When reporting, be sure to note how you handled and used your data. I also recommend keeping your original raw data in case you decide to conduct additional analyses in the future.
Don’t stop at “p-value < .05.” Every quantitative estimate that you report should have a “notion of confidence” (i.e., a measure of how certain the estimate is) attached to it. Typically, people rely on the p-value and consider p < .05 the gold standard for “statistically significant” findings. This practice ignores the fact that if the sample is large, nearly any difference, no matter how small or meaningless from a practical view, will be “statistically significant.” Consider reporting confidence intervals and effect sizes together with the p-value to convey the magnitude and relative importance of an effect and to reach a more rigorous conclusion (Helberg, 1996; Nuzzo, 2014).
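To make the points in Tip #2 concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the function names are hypothetical, the scores are simulated rather than real, the outlier rule is Tukey's 1.5 × IQR rule (one common choice among several), and the p-value and confidence interval rely on a large-sample normal approximation. It shows a numeric skewness check, a simple outlier flag, and a large-sample comparison in which the p-value is far below .05 while the effect size (Cohen's d) remains very small.

```python
import math
import random
import statistics

def skewness(xs):
    """Moment-based skewness: about 0 for symmetric data, > 0 for a long right tail."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)

def iqr_outliers(xs, k=1.5):
    """Flag values more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    spread = q3 - q1
    return [x for x in xs if x < q1 - k * spread or x > q3 + k * spread]

def compare_groups(a, b, z_crit=1.96):
    """Difference in means with a p-value, an approximate 95% CI, and Cohen's d.

    Uses a large-sample normal approximation, so treat it as a sketch for big groups.
    """
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(va / na + vb / nb)
    p_value = math.erfc(abs(diff) / se / math.sqrt(2))  # two-sided
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return {
        "diff": diff,
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
        "cohens_d": diff / pooled_sd,
    }

# Simulated scores (not real data): a 1-point true difference on a scale with SD 15.
rng = random.Random(42)
group_a = [rng.gauss(101, 15) for _ in range(50_000)]
group_b = [rng.gauss(100, 15) for _ in range(50_000)]
result = compare_groups(group_a, group_b)
print(result)

print(skewness([1, 1, 1, 2, 10]))  # positive: long right tail
print(iqr_outliers([10, 12, 11, 13, 12, 11, 10, 12, 95]))  # flags 95
```

Reporting the difference, its interval, and the effect size side by side lets a reader judge whether a "significant" result is also large enough to matter in practice.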

Tip #3: Ensure that you are explaining results properly: After examining your data or data sets, and applying appropriate statistical methods to analyze data, you’ll then need to determine how to interpret and report your findings.

Correlation is not causation. Many studies are very limited in their ability to illuminate causal relationships. Correlation is a tool every data analyst uses frequently. The biggest caution in using and interpreting correlation analyses is that they should not be treated as causation (Helberg, 1996). If two events happen close to each other or around the same time, that does not necessarily mean that one causes the other. Random assignment, time-order relationship, and covariation are required in order to determine a causal inference[3]—in the absence of these constraints, the result can only show evidence of a relationship, without explaining why or how it occurred (Trochim & Land, 1982).
Provide interpretation for lay audiences. Often, you will present your analyses and results to people who are not data experts. You must help them understand the data, and how to interpret findings as well as draw conclusions. This is especially important when analysis results could be misinterpreted or might be professionally referenced. You may need to explain the following to your audience: the definition of statistical terms, such as “confidence intervals” or “correlation;” explanations of typical effect sizes (good and bad); and/or reasoning why a particular statistical method is unreliable or unfit for a specific project. Increasingly, it is the statistician or data analyst’s responsibility to provide context behind the numbers.
Converse and cross-check with colleagues. It can be very helpful to share your analyses with colleagues before presenting and/or sharing findings with others. Discussing your analyses with colleagues is a very good way of cross-checking and validating your work. Colleagues can offer opinions and suggestions based on their own experiences and additional expertise to detect possible inconsistencies, to correct illegal values (i.e., values outside of a domain range), or to find conflicts between your findings and those of previous research.
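The caution about correlation and causation in Tip #3 can be illustrated with a small simulation. The sketch below rests on stated assumptions: the data are simulated, the variable names are hypothetical, and a single unobserved-in-practice factor (here called confounder) is built to drive both x and y. Neither x nor y affects the other, yet they are clearly correlated; the standard first-order partial correlation, which adjusts for the confounder, is close to zero.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after adjusting both for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Simulated data (not real): the confounder drives both x and y,
# and there is no direct link between x and y.
rng = random.Random(7)
n = 5000
confounder = [rng.gauss(0, 1) for _ in range(n)]
x = [c + rng.gauss(0, 1) for c in confounder]
y = [c + rng.gauss(0, 1) for c in confounder]

r_raw = pearson_r(x, y)
r_adjusted = partial_r(x, y, confounder)
print(f"raw correlation: {r_raw:.2f}")
print(f"partial correlation given the confounder: {r_adjusted:.2f}")
```

In real studies the confounder is often unmeasured, which is one reason design features such as random assignment matter for causal claims.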

Working with large or complicated data sets and navigating the data analysis process can be tedious and tricky. There are many resources, groups (e.g., AEA, OPEG, and WWC, to name a few), and experienced professionals who can support you in this work, including me and other colleagues at the Discovery Center. Please feel free to contact me (nianq@miamioh.edu) or the Discovery Center (discoverycenter@miamioh.edu) if you need assistance with your research or data analysis work.

References

Field, A. (2000). Discovering statistics using SPSS for Windows. London: SAGE.
Field, A. (2009). Discovering statistics using SPSS. London: SAGE.
Gravetter, F., & Wallnau, L. (2014). Essentials of statistics for the behavioral sciences (8th ed.). Belmont, CA: Wadsworth.
Helberg, C. (1996, November). Pitfalls of data analysis. Practical Assessment, Research & Evaluation, 5(5), 1-3.
Morrow, J. (2016, July 13). PD presenters week: Jennifer Ann Morrow on what to do with “dirty” data – Steps for getting evaluation data clean and useable. Retrieved from American Evaluation Association: http://aea365.org/blog/
Nuzzo, R. (2014, February 13). Statistical errors. Nature, 506, 150-152.
Trochim, W. M., & Donnelly, J. P. (2006). The research methods knowledge base (3rd ed.). Cincinnati, OH: Atomic Dog.
Trochim, W., & Land, D. (1982). Designing designs for research. The Researcher, 1(1), 1-6.

Endnotes

[1] For more information see: http://www.stat.yale.edu/Courses/1997-98/101/sample.htm
[2] For more information see: http://www.theanalysisfactor.com/anatomy-of-a-normal-probability-plot/
[3] For more information see: http://core.ecu.edu/psyc/wuenschk/StatHelp/Correlation-Causation.htm
