Blog 4: Statistical Analysis
This week I wanted to get back to R and write some code to help students solve statistical analysis problems like mean, median, mode, and quartiles from first principles. I tried to find these values without the built-in function for each one, to help illustrate the algebra and show where the answers come from.
Starting with the simplest equation allowed me to put in some groundwork. Clearly I can just sum all the terms in the dataset and then divide by the count, but for all the other solutions I needed my dataset to be ordered from least to greatest. The solution is really simple: use the "sort" function with its "decreasing" argument set to FALSE. Here it is at work:
a_new = sort(a, decreasing = FALSE)
a_new
[1]  2  3  8 10 14 19 19
And, voila, sorted from least to greatest and printed out.
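The post never shows how "a" was created, so here is a minimal sketch that assumes the seven values from the output above were typed in some unsorted order. The starting order below is made up for illustration; only the sorted values come from the post.

```r
# Hypothetical starting order; only the sorted result appears in the post.
a <- c(19, 3, 14, 2, 19, 10, 8)

# decreasing = FALSE is also the default, so sort(a) would do the same thing.
a_new <- sort(a, decreasing = FALSE)
a_new

# Flipping the argument sorts from greatest to least instead.
a_desc <- sort(a, decreasing = TRUE)
a_desc
```

Because "decreasing" is just a TRUE/FALSE switch on "sort", the same call covers both directions.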
Now let's do the actual average/mean using the same data set, a, from before.
# the "sum" function adds up all the terms, and the "length" function counts the number of terms in the vector.
a_avg = sum(a_new) / length(a_new)
print(paste("The mean is", a_avg,"!"))
[1] "The mean is 10.7142857142857 !"
Rounded to two significant digits, the answer is 11. Both "length" and "sum" are functions built into R, and they will act as the basic building blocks for the solutions that follow.
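To go one step closer to first principles, the sum and the count can themselves be built up one term at a time. This is only a sketch of that idea; the variable names "total", "count", and "a_avg_loop" are my own and are not from the original code.

```r
a_new <- c(2, 3, 8, 10, 14, 19, 19)

# Add the terms one at a time and count them as we go.
total <- 0
count <- 0
for (x in a_new) {
  total <- total + x
  count <- count + 1
}
a_avg_loop <- total / count

# Check against the built-in mean(); all.equal() tolerates tiny rounding differences.
all.equal(a_avg_loop, mean(a_new))

# round() keeps a number of decimal places, signif() keeps significant digits.
round(a_avg_loop, 2)   # 10.71
signif(a_avg_loop, 2)  # 11
```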
The length function gets used here again to help find the middle term in the dataset, which is the median. I tried to write my own "if" statements to tell even-length and odd-length datasets apart, but I found an existing solution instead:
https://www.datamentor.io/r-programming/examples/odd-even/
Here is their code (tweaked a bit for our use):
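# An even length leaves no remainder when divided by 2.
if (length(a_new) %% 2 == 0) {
  # Even: average the two middle terms.
  middle = length(a_new) / 2
  a_mid = (a_new[middle] + a_new[middle + 1]) / 2
  print(paste("The median is", a_mid, "!"))
} else {
  # Odd: take the single middle term.
  middle = length(a_new) / 2 + 0.5
  a_mid = a_new[middle]
  print(paste("The median is", a_mid, "!"))
}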
[1] "The median is 10 !"
The beginning of this code looks complicated because it uses the math of remainders: the "%%" operator gives the remainder after division, so "length(a_new) %% 2" is 0 for an even length and 1 for an odd length. If the length is even, the condition is "TRUE" and R runs the first block. If it is "FALSE", R skips to the "else" block and does that calculation instead. This website was very useful through this whole investigation: https://www.datamentor.io/r-programming/if-else-statement/. I used it, along with Google image searches of simple questions like "how to print in r", to get what I was looking for.
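Here is a small sketch of the remainder test on its own, in case it helps to see it outside the "if" statement. The helper name "is_even" is hypothetical and only used for this illustration.

```r
# %% gives the remainder after division.
7 %% 2    # remainder 1, so 7 is odd
10 %% 2   # remainder 0, so 10 is even

# Hypothetical helper that wraps the same test used in the median code.
is_even <- function(n) {
  n %% 2 == 0
}

is_even(length(c(2, 3, 8, 10, 14, 19, 19)))  # 7 terms, so FALSE
is_even(length(1:10))                        # 10 terms, so TRUE
```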
In order to pull the "middle" term out of "a_new", I needed to use square brackets [], which is called indexing. I found out how that works here: https://rspatial.org/intr/4-indexing.html. "middle" is the position of the middle term in the sorted dataset, found by taking the length and dividing by two. For an odd length, I need to add a half to round up to a whole number, and for an even length this is the position of the lower of the two middle numbers.
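The same logic can be wrapped in a function so it is easy to try on datasets of different lengths. This is a sketch under my own naming ("my_median" is hypothetical), and it sorts its input first so it does not depend on "a_new" already existing.

```r
# Hypothetical wrapper around the even/odd logic described above.
my_median <- function(v) {
  v_sorted <- sort(v)
  n <- length(v_sorted)
  if (n %% 2 == 0) {
    # Even length: average the two middle terms.
    middle <- n / 2
    (v_sorted[middle] + v_sorted[middle + 1]) / 2
  } else {
    # Odd length: (n + 1) / 2 is the same as n / 2 + 0.5.
    middle <- (n + 1) / 2
    v_sorted[middle]
  }
}

my_median(c(2, 3, 8, 10, 14, 19, 19))  # odd length: the 4th term, 10
my_median(1:10)                        # even length: average of 5 and 6, 5.5

# The built-in median() should agree.
median(c(2, 3, 8, 10, 14, 19, 19))
```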
From there, I had a little fun figuring out how to print a sentence in R, namely using the "print(paste())" idea. I'm not sure how it works, but it works, and that's good enough for me.
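For anyone curious about what "print(paste())" is doing, here is a short sketch. "paste" turns each argument into text and joins the pieces with a space, and "print" displays the result. The variable names "msg" and "msg_tight" are just for illustration.

```r
a_mid <- 10

# paste() converts each argument to text and joins them with a space by default.
msg <- paste("The median is", a_mid, "!")
print(msg)

# paste0() joins with no separator, which removes the gap before the "!".
msg_tight <- paste0("The median is ", a_mid, "!")
print(msg_tight)
```

That default space is why the printed answers in this post have a gap before the exclamation mark.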
I sort of cheated on the mode. R has no built-in function for the statistical mode, but I found a very good one while doing my research, and it felt like a waste of time to try to recreate it myself. Instead I'm going to show it and then explain its parts.
https://www.tutorialspoint.com/r/r_mean_median_mode.htm
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
As you can see, this is a function, meaning we are creating a set of instructions that we can use later. Let's read the function from top to bottom, then right to left to see what it does.
First, it collects all the unique numbers in the set (it uses v as a placeholder) into a new variable "uniqv", meaning any duplicates will only be listed once. Then, it uses the "match" function to find where each term of v sits in uniqv, and "tabulate" counts how many times each unique number shows up.
For reference, uniqv is a vector of six numbers, "uniqv = 2, 3, 8, 10, 14, 19", and when asked to tabulate, the number of times each number shows up becomes a new vector: "1, 1, 1, 1, 1, 2".
And then finally, it asks which count is the highest using "which.max" (the 6th one) and calls the 6th number in the uniqv vector. It's just two lines of code, but it gets complicated really quickly.
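To make the nested line easier to follow, here it is unpacked into one step per line. The intermediate names "positions", "counts", and "winner" are my own and do not appear in the original function.

```r
v <- c(2, 3, 8, 10, 14, 19, 19)

uniqv <- unique(v)             # each distinct value once, in order of first appearance
positions <- match(v, uniqv)   # for each term of v, its position within uniqv
counts <- tabulate(positions)  # how many times each position occurs
winner <- which.max(counts)    # index of the first largest count

uniqv          # 2 3 8 10 14 19
positions      # 1 2 3 4 5 6 6
counts         # 1 1 1 1 1 2
winner         # 6
uniqv[winner]  # 19
```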
a_mode = getmode(a_new)
print(paste("The mode is", a_mode, "!"))
This is the easy part. Apply "getmode" to "a_new" and call the answer "a_mode". Then ask R to print the answer out.
This code is very robust. It works without sorting the data and for datasets of any length. When more than one number ties for the most common, it still runs, but it returns only the first of them.
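"getmode" hands back a single value because "which.max" stops at the first largest count. If every tied value is wanted, one possible variant is sketched below. This is my own adaptation, not the tutorial's code, and the name "get_all_modes" is hypothetical.

```r
# Hypothetical variant of getmode() that keeps every value tied for the top count.
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]
}

get_all_modes(c(2, 3, 3, 8, 8, 10))        # 3 and 8 tie
get_all_modes(c(2, 3, 8, 10, 14, 19, 19))  # single mode, 19
```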
Some rules about quartiles before we get started, because what we choose to be true will influence the code. And that's kind of the point.
- For an even number of terms, Q1 is the middle number of the bottom half, Q3 is the middle number of the top half.
- For an odd number of terms, Q1 is the middle number of the bottom half INCLUDING THE MEDIAN, Q3 is the middle number of the top half INCLUDING THE MEDIAN.
For this solution, I thought about getting decimal numbers and rounding up to get the correct index. For example, for a dataset of 13 numbers, one quarter of that is 3.25. If I round up, the 4th number is my Q1 number, and if I check, that is correct!
For an odd number of terms:
1 2 3 4 5 6 7 8 9 10 11 12 13
The key numbers are (in order): Min=1, Q1=4, Q2=7, Q3=10, Max=13
And for an even number of terms:
1 2 3 4 5 6 7 8 9 10
The middle number in this set is the average of 5 and 6, so 5.5. Q1 is the middle number between 1 and 5, which is 3. Q3 is the middle number between 6 and 10, which is 8.
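The two rules and both examples above can be checked with a short sketch that literally splits the data into halves and takes the middle of each. The helper names "mid_value" and "quartiles_by_rule" are hypothetical, and this is one way to write the rules down rather than the code used in this post.

```r
# Hypothetical helper: middle value of an already sorted vector.
mid_value <- function(v) {
  n <- length(v)
  if (n %% 2 == 0) {
    (v[n / 2] + v[n / 2 + 1]) / 2
  } else {
    v[(n + 1) / 2]
  }
}

# Hypothetical helper: Q1 and Q3 by the two rules stated above.
quartiles_by_rule <- function(v) {
  v <- sort(v)
  n <- length(v)
  # Odd n: both halves include the median. Even n: a clean split down the middle.
  half <- ceiling(n / 2)
  lower <- v[1:half]
  upper <- v[(n - half + 1):n]
  c(Q1 = mid_value(lower), Q3 = mid_value(upper))
}

quartiles_by_rule(1:13)  # Q1 = 4, Q3 = 10
quartiles_by_rule(1:10)  # Q1 = 3, Q3 = 8
```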
Great, now I can start the code.
I have found that ceiling(x) is the function that takes any number and rounds it up to the nearest integer. For completeness, floor(x) rounds down. Source: https://www.statology.org/round-in-r/
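A quick sketch of both functions on the 13-number example from earlier. The variable names are only for illustration.

```r
quarter <- 13 / 4                 # 3.25

index_up <- ceiling(quarter)      # rounds up to 4
index_down <- floor(quarter)      # rounds down to 3

index_up
index_down

# A whole number is left unchanged by either function.
ceiling(4)
floor(4)
```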
a_Q1 = a_new[ceiling(length(a_new)/4)]
print(paste("The first quartile number is", a_Q1, "!"))
Simple!
First, take the length of the dataset and divide it by four. Then, round it up to the nearest integer, and finally, call up the number in the dataset that fits that index. In fact, this is how I'll do Q3, as well.
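The indexing steps above can be sketched as a small function. The Q1 line follows the description directly. The Q3 line assumes "the same way" means rounding up three quarters of the length, which is my reading and not something the post spells out. The name "quartile_by_index" is hypothetical.

```r
# Hypothetical helper; the three-quarters index for Q3 is an assumption.
quartile_by_index <- function(v) {
  v <- sort(v)
  n <- length(v)
  c(Q1 = v[ceiling(n / 4)], Q3 = v[ceiling(3 * n / 4)])
}

quartile_by_index(1:13)  # indices 4 and 10, so Q1 = 4 and Q3 = 10
quartile_by_index(1:10)  # indices 3 and 8, so Q1 = 3 and Q3 = 8

a_new <- c(2, 3, 8, 10, 14, 19, 19)
quartile_by_index(a_new)  # indices 2 and 6, so Q1 = 3 and Q3 = 19
```

This shortcut matches the quartile rules for the 13-term and 10-term examples. It may not agree for every length: when a half contains an even number of terms (as with 7 or 8 terms in total), the rules would average two values while the shortcut picks a single one.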
Quartile 2 is just the Median, silly!
print(paste("The third quartile number is", a_Q3, "!"))
print(paste("The max is", a_max, "!"))
And
print(paste("The min is", a_min, "!"))
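The post does not show how "a_max" and "a_min" were calculated. One possibility that fits the first-principles approach, assuming the data is already sorted from least to greatest, is to read off the last and first terms by index. Treat this as a sketch of that assumption, not the original code.

```r
a_new <- sort(c(2, 3, 8, 10, 14, 19, 19))

# In a vector sorted from least to greatest, the ends are the extremes.
a_min <- a_new[1]
a_max <- a_new[length(a_new)]

print(paste("The max is", a_max, "!"))
print(paste("The min is", a_min, "!"))

# The built-in min() and max() should agree.
c(min(a_new), max(a_new))
```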
And there you go!
I definitely think figuring out the median and the mode were the most difficult parts of this problem, which is funny because on paper they're the simplest to find. I think that's a good thing, because students can then weigh the strengths and weaknesses of code against classic calculating and use whichever helps them best. Finding quartiles is usually like pulling teeth in a math class, but the code makes it seem so simple. I definitely think it's worth including.