Commenting on the data generally (answering 1b)
Nothing much of note here. The distributions of the values match expectations. 110/118 students aspiring to higher education is interesting. There are also a LOT of parents who are still together. Maybe that is the child of divorce in me talking, but 103/118 seems like a lot for this age group.
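For reference, a minimal sketch of the tallies behind those two observations, assuming pandas and the standard UCI student-mat.csv layout (semicolon separators; 'higher' and 'Pstatus' are that dataset's actual column names):

```python
import pandas as pd

# Assumed file name and layout: the UCI student files use ';' separators.
df = pd.read_csv("student-mat.csv", sep=";")

print(df["higher"].value_counts())   # wants to take higher education (yes/no)
print(df["Pstatus"].value_counts())  # parents' cohabitation status (T = together, A = apart)
```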
^Just using G2
^Using G1 and G2
Using all variables (except G3, of course)
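A sketch of how those three fits could be reproduced, assuming scikit-learn and the same file as above (the in-sample R^2 reporting is my choice; the original runs may have been set up differently):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("student-mat.csv", sep=";")
y = df["G3"]
X_all = pd.get_dummies(df.drop(columns="G3"), drop_first=True)  # one-hot encode categoricals

# The three predictor sets described above.
for name, X in [("G2 only", df[["G2"]]),
                ("G1 and G2", df[["G1", "G2"]]),
                ("all variables", X_all)]:
    model = LinearRegression().fit(X, y)
    print(f"{name}: R^2 = {model.score(X, y):.3f}")
```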
Comments:
Getting an R^2 of 0.8 with only two predictors (G1 and G2) is pretty good, but to answer whether we could "predict the final grade based on the previous two grades" I would need more information. An R^2 of 0.8 is not high enough to base any decisions on the model, but the model does clearly show that better grades on G1, and especially G2 (the model leans most heavily on G2 to predict G3), lead to better grades on G3. As with any course, students with good grades should feel comfortable about their performance, and students with worse grades should feel less comfortable.
Using all the variables gives the most accurate model, but not by much (the R^2 is only a point or two higher).
The "best model" is probably just the model that includes G1 and G2 only. However, this model cannot be used to predict grades prior to class starting. This makes the model not terribly "useful" beyond confirming that doing well in class leads to continuing to do well.
Of note, one of the most significant variables in the all-variables model is schoolMS, the dummy for which school the student attends. We would hope that school choice does not affect performance, so this is a concern.
Moving to Student-Por:
I created a logistic regression model to predict the final outcome (pass/fail). Originally, the directions had us include G1, G2, and G3 in the model, so of course the model was 100% accurate (G3 itself determines the outcome). I adjusted by fitting a model that does not include ANY of the grades, only the other variables:
The model correctly predicts failure 65 times, correctly predicts passing 73 times, incorrectly predicts failure 30 times, and incorrectly predicts passing 26 times. That is 138 correct out of 194 tries, or about 71% accuracy. 71% is pretty good for a model that does not account for performance in the class so far, but again, it is not good enough to support any real inferences or decisions, at least at the individual level.
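A sketch of that grade-free logistic fit, assuming scikit-learn, a pass mark of G3 >= 10, and a 70/30 split (the seed and split size are my assumptions, so the counts will not match the confusion matrix above exactly):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("student-por.csv", sep=";")
y = (df["G3"] >= 10).astype(int)  # assumed pass threshold of 10/20
X = pd.get_dummies(df.drop(columns=["G1", "G2", "G3"]), drop_first=True)  # no grades at all

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)  # high max_iter to ensure convergence
pred = clf.predict(X_te)
print(confusion_matrix(y_te, pred))
print(f"accuracy: {accuracy_score(y_te, pred):.2f}")
```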
Moving back to the Student-Mat data:
I created a decision tree to predict course outcome (pass/fail).
It appears that the most important variable is failures: if the student had any prior failures, the model predicts failure, and that branch covers 13% of the data.
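A sketch of a tree like that, under the same assumptions as before (pass mark of G3 >= 10, grades excluded; the depth cap is arbitrary):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("student-mat.csv", sep=";")
y = (df["G3"] >= 10).astype(int)  # assumed pass threshold of 10/20
X = pd.get_dummies(df.drop(columns=["G1", "G2", "G3"]), drop_first=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # 'failures' tends to be the top split
```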
It could, but not in any particularly insightful way. The model shows that students who have failed before, are frequently absent, and do not study much fail more often. That is not news to teachers; these students should already be flagged as at risk.
This decision tree is just a different view of the logistic model created in the previous section. As I said before, 71% is pretty good for a model that does not account for performance in the class so far, but it is not good enough to support any real inferences or decisions. We could argue that students who have failed a class before should receive extra attention or monitoring, but that should already be the case. Maybe this model suggests further study of why failure, once it happens, tends to repeat.
I also built a random forest model to predict pass/fail for Student-Mat. This model had an 83% accuracy rate, which is really good. Medu, failures, and Mjob were the most important factors in the model. I am still not sure what the question "Which threshold selection value would you use to create an application to guide instructor-led intervention for students at risk of failing the course?" is asking. The students at risk of failure are the students with bad grades... Nothing in the random forest or decision tree suggests that any student is at significant risk of failure PRIOR to the beginning of the course. I would just keep an eye on students who have failed a class before.
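For what it's worth, "threshold" here presumably means the cutoff on the model's predicted failure probability: lowering it flags more students for intervention at the cost of more false alarms. A sketch under the same assumptions as before (the pass mark, split, and candidate cutoffs are all mine):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

df = pd.read_csv("student-mat.csv", sep=";")
y = (df["G3"] < 10).astype(int)  # 1 = fail, the class we want to catch
X = pd.get_dummies(df.drop(columns=["G1", "G2", "G3"]), drop_first=True)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)

# Feature importances (the text above reports Medu, failures, and Mjob near the top).
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp.head())

# Lowering the cutoff on P(fail) catches more at-risk students (higher recall on
# failures) but also flags more students who would have passed anyway.
p_fail = rf.predict_proba(X_te)[:, 1]
for cutoff in (0.5, 0.3, 0.2):
    flagged = (p_fail >= cutoff).astype(int)
    print(cutoff, "recall on failures:", round(recall_score(y_te, flagged), 2),
          "| students flagged:", int(flagged.sum()))
```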