Sampling

Mar 21

Hey everyone, I recently learned about sampling. Sampling is simply the method of choosing a random group of people from a population, collecting unbiased data (which will be discussed later in this blog), and then making predictions about the rest of the population.

However, can’t data just be collected from the entire population for maximum accuracy? Most of the time, no: it would take too long to collect all the data, and it can get excessively expensive. Thus, a random group of people is selected, and a survey is conducted to obtain data. The results will not be exact, since it is only a sample. Nevertheless, if the sample is random and not too small, its results are likely to be close to the true values for the population.

Just to clear up any confusion for the rest of this blog, a population is the group of people that we are interested in, and a sample is the group of people (from that population) that we will obtain data from.

For example, suppose you want to figure out the mean weight of Americans in their 30s. You can’t measure the weight of every single American in their 30s, so a random sample needs to be collected. Even though there are millions of Americans in their 30s, a random sample of even 100 people can give a reasonably close estimate.
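To get a feel for this, here is a minimal Python sketch. It doesn’t use real weight data: the population is made up (synthetic weights with an assumed mean of 80 kg and spread of 15 kg), and the names population and sample are just for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 200,000 synthetic weights in kg.
# The mean (80) and spread (15) are assumptions, not real figures.
population = [random.gauss(80, 15) for _ in range(200_000)]

# Draw a simple random sample of 100 people, without replacement.
sample = random.sample(population, 100)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"Population mean: {population_mean:.2f} kg")
print(f"Sample mean (n=100): {sample_mean:.2f} kg")
```

If you change the seed, the sample mean will move around a little, but with a random sample of 100 it should usually stay within a few kg of the population mean.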

However, some measures need to be taken to avoid biases, so that we can obtain an accurate result. Biases are avoided by selecting a random sample (as previously mentioned), so that every person in the population has an equal chance of being picked. Firstly, don’t collect the sample from only one state. Secondly, don’t select only men or only women; select both. Thirdly, don’t select only people who are on a diet. By taking these measures, we can make the sample much more representative of the population, so it is more likely to give us an accurate result.

Nonetheless, even after avoiding all biases, the sample mean weight will not exactly equal the population mean weight. This uncertainty is measured by the ‘standard error’, and it can be calculated. The formula for the standard error is: SE = σ/sqrt(n). SE is short for standard error, σ is the standard deviation of the population (in practice, it is usually estimated using the standard deviation of the sample), and n is the size of the sample (sqrt is short for square root). As you can see here, the standard error decreases as the size of the sample increases.
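Here is a small Python sketch of the formula SE = σ/sqrt(n). The function name standard_error and the standard deviation of 15 kg are my own assumptions, just to show how the standard error shrinks as the sample gets bigger.

```python
import math


def standard_error(sigma, n):
    """Return SE = sigma / sqrt(n) for a standard deviation sigma and sample size n."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


sigma = 15  # assumed standard deviation in kg (illustrative only)

for n in (25, 100, 400):
    print(f"n = {n}: SE = {standard_error(sigma, n)} kg")
# prints SE = 3.0, 1.5 and 0.75 kg
```

Notice that the sample has to get four times bigger to cut the standard error in half, because n sits under a square root.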

The standard error basically tells us how much the population mean could differ from the sample mean. For example, let’s assume that the sample mean is 60 kg, with a standard error of 5 kg. The population mean is about 68% likely to be within one standard error of the sample mean, so there’s roughly a 68% chance that the population mean falls between 55 and 65 kg. And there is about a 95% chance that the population mean is within 2 standard errors of the sample mean, so between 50 and 70 kg. (This is actually related to the normal curve; you can learn more about it in my previous blog.)
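The arithmetic in this example can be sketched in Python like this. The helper name interval_around_mean is made up for this post, and the 68% and 95% figures are the usual normal curve approximations.

```python
def interval_around_mean(sample_mean, se, k):
    """Return the range from k standard errors below to k above the sample mean."""
    return (sample_mean - k * se, sample_mean + k * se)


sample_mean = 60  # kg, from the example in this post
se = 5            # kg, from the example in this post

# About 68% of the time: within 1 standard error
print(interval_around_mean(sample_mean, se, 1))  # (55, 65)

# About 95% of the time: within 2 standard errors
print(interval_around_mean(sample_mean, se, 2))  # (50, 70)
```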

Shivansh Goel
