Student Performance Analysis

This Student Performance dataset includes scores from three exams and a variety of personal, social, and economic factors that have interaction effects upon them.
Data source: http://roycekimmons.com/tools/generated_data/exams

Data samples:
table_sample.png

1
2
3
4
5
6
7
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts import Bar

color = sns.color_palette()
SP = pd.read_csv("StudentsPerformance.csv")

Purpose of visualization:

How effective is the test preparation course?

(x, y) represents the proportion(y) of the number of people with this score(x) to the total number of people with/without test preparation.

1
2
3
4
5
6
7
8
9
prep_none = SP.loc[SP['test preparation course'] == 'none']
prep = SP.loc[SP['test preparation course'] == 'completed']

bar0 = Bar("Test Preparation Efficiency", "Math")
bar0.add("none", prep_none['math score'].value_counts().sort_index().index,
prep_none['math score'].value_counts().sort_index().values / len(prep_none['math score']))
bar0.add("completed", prep['math score'].value_counts().sort_index().index,
prep['math score'].value_counts().sort_index().values / len(prep['math score']))
bar0.render("1-prep2math.html")

1_Test_math.png
1
2
3
4
5
6
bar1 = Bar("Test Preparation Efficiency", "Reading")
bar1.add("none", prep_none['reading score'].value_counts().sort_index().index,
prep_none['reading score'].value_counts().sort_index().values / len(prep_none['reading score']))
bar1.add("completed", prep['reading score'].value_counts().sort_index().index,
prep['reading score'].value_counts().sort_index().values / len(prep['reading score']))
bar1.render("1-prep2reading.html")

1_Test_reading.png
1
2
3
4
5
6
bar2 = Bar("Test Preparation Efficiency", "Writing")
bar2.add("none", prep_none['writing score'].value_counts().sort_index().index,
prep_none['writing score'].value_counts().sort_index().values / len(prep_none['writing score']))
bar2.add("completed", prep['writing score'].value_counts().sort_index().index,
prep['writing score'].value_counts().sort_index().values / len(prep['writing score']))
bar2.render("1-prep2writing.html")

1_Test_writing.png

These histograms show that test preparation courses do not play a big role in overall scores, but they do help people to get high scores(100) in reading and writing.

What is the distribution of parent level of education and how they contribute to students’ scores?

1
2
3
4
5
6
7
8
plt.rcParams['figure.figsize'] = (15, 9)
sns.countplot(SP['parental level of education'], palette='RdPu')
plt.title('Comparison of Parental Education', fontweight=30, fontsize=20)
plt.xlabel('Degree')
plt.ylabel('count')
plt.savefig("2-EducationLevel.png")
plt.show()
plt.clf()

EducationLevel.png

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def getgrade(score):
if score >= 90:
return '>=90'
if score >= 80:
return '80~90'
if score >= 70:
return '70~80'
if score >= 60:
return '60~70'
else:
return 'Fail'


sns.countplot(x=SP['parental level of education'], data=SP, hue=SP.apply(lambda x: getgrade(x['math score']), axis=1),
palette='Reds')
plt.title('Parent Education vs Math Scores', fontsize=20, fontweight=30)
plt.savefig("2-Education2Math.png")
plt.show()
plt.clf()

sns.countplot(x=SP['parental level of education'], data=SP,
hue=SP.apply(lambda x: getgrade(x['reading score']), axis=1), palette='Blues')
plt.title('Parent Education vs Reading Scores', fontsize=20, fontweight=30)
plt.savefig("2-Education2Reading.png")
plt.show()
plt.clf()

sns.countplot(x=SP['parental level of education'], data=SP,
hue=SP.apply(lambda x: getgrade(x['writing score']), axis=1), palette='Set2')
plt.title('Parent Education vs Writing Scores', fontsize=20, fontweight=30)
plt.savefig("2-Education2Writing.png")
plt.show()
plt.clf()

Education2Math.png
Education2Reading.png
Education2Writing.png
Most of parents have degree from college or high school, only a few parents have bachelor’s or master’s degrees. Students whose parents have bachelor’s or master’s degrees are less likely to fail the class.

What are the statistical characteristics of each subject’s scores and the distribution of total score?

1
2
3
4
5
6
7
plt.rcParams['figure.figsize'] = (15, 9)
colorDic = dict(boxes='DarkGreen', whiskers='DarkOrange', medians='DarkBlue', caps='Gray')
SP.plot(kind='box', color=colorDic, sym='r+', vert=False)
plt.title('Comparison of Scores', fontweight=30, fontsize=20)
plt.savefig("3-Scores_box.png")
plt.show()
plt.clf()

Education2Writing.png

1
2
3
4
5
plt.rcParams['figure.figsize'] = (15, 9)
SP.hist(['math score', 'reading score', 'writing score'], ec='black', grid=False, color='slategrey')
plt.savefig("3-Scores_hist.png")
plt.show()
plt.clf()

Education2Writing.png
The highest scores in three subjects are 100. Median of writing and reading is around 72 while median of math score is around 66. 25% - 75% scores are in range of 56/58 to 78/80.
The distribution of each subject scores can be approximated to the normal distribution.

What is the hidden relationship among students’ performance in various subjects?

1
2
3
4
5
plt.rcParams['figure.figsize'] = (15, 9)
SP.plot.scatter(x='math score', y='reading score', c='writing score', colormap='viridis')
plt.savefig("4-Relationship.png")
plt.show()
plt.clf()

Relation.png
We can see from the scatter plot combined with heat map that the performance of each student in the three courses is very similar, and there is no obvious biased student. Most students pass all three courses. Students with high reading scores also have high writing scores, but their math scores are not necessarily as high. There are more people with a score of 80 or higher in reading and writing than in math.

How does gender and race influence to each subject?

1
2
3
4
5
6
7
sns.countplot(x=SP['gender'], data=SP,
hue=SP.apply(lambda x: getgrade((x['math score'] + x['reading score'] + x['writing score']) / 3), axis=1),
palette='pastel')
plt.title('Gender vs Average Scores', fontsize=20, fontweight=30)
plt.savefig("5-Gender2Ave.png")
plt.show()
plt.clf()

Gender2Ave.png
This histogram shows that the average score of female is higher than that of male, and the proportion of male who drop out of courses is greater than female.

1
2
3
4
5
6
7
sns.countplot(x=SP['race/ethnicity'], data=SP,
hue=SP.apply(lambda x: getgrade((x['math score'] + x['reading score'] + x['writing score']) / 3), axis=1),
palette='husl')
plt.title('Race/Ethnicity vs Average Scores', fontsize=20, fontweight=30)
plt.savefig("5-Race2Ave.png")
plt.show()
plt.clf()

Race2Ave.png
Group C has a higher proportion of failing the sources in average while group E are more likely to get a nice score (>=90).