Regression Analysis: Correlation vs Causation
Will you lose more weight if you exercise for more hours? Will your math scores rise if you study more? Will you grow tall if your parents are tall? These are some examples of questions that regression analysis can help answer.
What is regression analysis?
Regression analysis is a statistical method for measuring how strongly one variable is related to another. The strength of a linear relationship is summarized by the correlation coefficient r (sometimes loosely called the regression coefficient), which takes a value between -1 and 1. Positive r values represent positive correlation; negative r values represent negative correlation. The closer r is to 1 or -1 (like 0.9 or -0.9), the stronger the correlation between the variables; the closer it is to 0, the weaker the correlation. I will demonstrate this using the graphs below.
The above image shows four common types of graphs when comparing two random variables. Graph 1 displays a strong positive correlation: all the data points lie on or very close to the line of best fit. Similarly, graph 2 also shows a strong correlation, except it’s negative. Thus, these two graphs have r values of about 0.95 and -0.95 respectively. Graphs 3 and 4 show weak positive and weak negative correlation respectively. We know the correlation is weak because most data points are not close to the line of best fit. Thus, their r values are closer to zero, about 0.4 and -0.4 respectively.
Calculating the Regression Coefficient
Note: Although it is possible to work through all the data points and calculate the coefficient by hand, people rarely do so because it is slow and error-prone. In practice, a computer analyzes all the data entries and gives us the coefficient.
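For readers curious about what the computer does behind the scenes, here is a rough sketch of the standard Pearson formula: the sum of the products of each point's deviations from the two means, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the sample data are my own illustrative choices, not part of any particular tool.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Deviation of each data point from its variable's mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    # Numerator: how the two variables vary together.
    covariation = sum(a * b for a, b in zip(dx, dy))
    # Denominator: how much each variable varies on its own.
    spread = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("r is undefined when one variable is constant")

    return covariation / spread


# Hypothetical data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
midterm_score = [52, 60, 61, 70, 77]

r = pearson_r(hours_studied, midterm_score)
print(f"r = {r:.2f}")
```

On this invented data r comes out at about 0.98, a strong positive correlation. Points that fall exactly on a rising straight line give 1, and points exactly on a falling line give -1.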
Correlation vs Causation
The last point I want to mention here, and almost any statistics/data science book would tell you this, is that correlation does not mean causation. For example, if the number of hamburgers sold per day positively correlates with how much it rains, it does not mean that buying more hamburgers causes more rain; there is only correlation here, which may be nothing more than a coincidence.
That said, sometimes a correlation does reflect causation. For example, the number of hours studied correlates with midterm scores, and here there is a plausible causal link: if you study more, you gain more knowledge and thus tend to get a better result in your midterm exam.