Student Habits & Academic Performance Analysis

Author

Tim Chen

Published

May 18, 2025

1 Introduction

This report analyzes the Student Habits and Academic Performance dataset, exploring how demographic, lifestyle, and psychological factors relate to exam outcomes. We cover data loading, cleaning, exploratory visualization, feature engineering, and predictive modeling, and we interpret the key findings in light of recent global developments.

2 Setup and Data Loading

We begin by importing core Python libraries and loading the CSV file into a pandas DataFrame.
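The loading code itself is not reproduced in this report. A minimal sketch of the step is given here; the file name `student_habits_performance.csv` and the helper `load_students` are assumptions for illustration, not names taken from the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical file name: point this at the CSV downloaded from Kaggle.
DATA_PATH = Path("student_habits_performance.csv")


def load_students(path=DATA_PATH):
    """Read the CSV into a DataFrame and report its dimensions."""
    df = pd.read_csv(path)
    print(f"Loaded {df.shape[0]:,} rows and {df.shape[1]} columns")
    return df
```

Calling `df = load_students()` would then make the data available to the later sections.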

3 Data Source

This analysis uses the “Student Habits and Academic Performance” dataset from Kaggle (Aryan208). The dataset and accompanying code are available on Kaggle (kaggle.com).

4 Data Overview

A quick summary reveals:

  • 80,000 student records, each with 31 features (demographics, habits, support, and performance).

  • No missing values detected across any columns.

  • Key summary statistics:

    • Study Hours/Day: mean ~4.17 hrs, median ~4.13 hrs, range 0–12 hrs.
    • Attendance %: centered around ~70% with a span of 40–100%.
    • Sleep Hours: mean ~7.0 hrs; the middle half of students sleep between 6 and 8 hrs.
    • Exam Score: left-skewed toward the top of the scale (mean ~89.1, median 93); at least a quarter of students score a perfect 100.

These sleep patterns resonate with findings from an NIH‑funded study (Mathew & Hale 2024) showing that irregular bedtimes and inconsistent sleep are significantly associated with poorer grades and behavior problems among adolescents (wgrt.com).

       study_hours_per_day  attendance_percentage   sleep_hours    exam_score
count         80000.000000           80000.000000  80000.000000  80000.000000
mean              4.174388              69.967884      7.017417     89.141350
std               2.004135              17.333015      1.467377     11.591497
min               0.000000              40.000000      4.000000     36.000000
25%               2.800000              55.000000      6.000000     82.000000
50%               4.125624              69.900000      7.000000     93.000000
75%               5.500000              84.900000      8.000000    100.000000
max              12.000000             100.000000     12.000000    100.000000
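The table above has the shape of pandas `describe()` output. A minimal sketch of how the overview could be reproduced follows; the four column names come from the table, while the helper name `overview` is an assumption.

```python
import pandas as pd

# Column names as they appear in the summary table above.
KEY_COLUMNS = [
    "study_hours_per_day",
    "attendance_percentage",
    "sleep_hours",
    "exam_score",
]


def overview(df, columns=KEY_COLUMNS):
    """Return (shape, total number of missing cells, summary statistics)."""
    n_missing = int(df.isna().sum().sum())
    summary = df[columns].describe()
    return df.shape, n_missing, summary
```

A missing-cell count of 0 corresponds to the "no missing values" observation.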

5 Data Cleaning & Preprocessing

All categorical columns were cast to the category type and text entries standardized (e.g., Yes/No to lowercase). No rows were removed.
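The cleaning step described above might look like the following sketch. It assumes that every text column should be treated as categorical, and the function name `clean_categoricals` is hypothetical.

```python
import pandas as pd


def clean_categoricals(df):
    """Strip and lowercase text columns, then cast them to the category dtype."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.columns:
        # Check the dtype only, so numeric and existing category columns are skipped.
        if pd.api.types.is_string_dtype(out[col].dtype):
            out[col] = out[col].str.strip().str.lower().astype("category")
    return out
```

Because the function only rewrites values and dtypes, the row count is unchanged, in line with "no rows were removed".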

6 Exploratory Data Analysis (EDA)

6.1 Univariate Distributions

  • Study Hours show a roughly normal distribution with a right tail (some students study up to 12 hrs).
  • Attendance is roughly symmetric around 70% (mean ~70.0, median 69.9, quartiles 55 and 84.9); about half of students attend at least 70% of classes.
  • Sleep peaks around 7 hrs.
  • Exam Score heavily clusters at high values, indicating a ceiling effect.

This ceiling effect mirrors global exam trends, where extraordinary clustering at top scores challenges differentiation. In India’s NEET‑UG 2024, 67 candidates scored a perfect 720/720, sparking debate about score inflation and exam fairness (Times of India 2024) (timesofindia.indiatimes.com).
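The distribution shapes listed in 6.1 can also be checked numerically rather than by eye. The sketch below compares mean and median, computes skewness, and measures the share of values at the top of the scale; the helper name `distribution_shape` is an assumption.

```python
import pandas as pd


def distribution_shape(series, ceiling=None):
    """Summarize skewness and, optionally, the share of values at a ceiling."""
    shape = {
        "mean": float(series.mean()),
        "median": float(series.median()),
        "skew": float(series.skew()),  # > 0: right tail, < 0: left tail
    }
    if ceiling is not None:
        shape["share_at_ceiling"] = float((series >= ceiling).mean())
    return shape
```

For example, `distribution_shape(df["exam_score"], ceiling=100)` would quantify how strong the ceiling effect is.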

6.2 Categorical Insights

  • Majors: Arts (16.9%), Business (16.3%), Computer Science (14.8%), etc.
  • Gender: roughly equal representation of Male and Female, with a small Other category.
  • Learning Styles: Kinesthetic most common (~20%), followed by Reading and Auditory.
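Percentages like those above can be obtained from normalized value counts. In this sketch the column names `major`, `gender`, and `learning_style` used in the usage note are assumptions about the dataset's schema.

```python
import pandas as pd


def category_shares(series):
    """Return each category's share of rows in percent, largest first."""
    return (series.value_counts(normalize=True) * 100).round(1)
```

A call such as `category_shares(df["major"])` would list each major with its percentage of students.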

6.3 Correlation Analysis

A heatmap of key numeric features indicates:

  • Study Hours & Exam Score: moderate positive correlation (~0.35).
  • Sleep Hours & Exam Score: mild positive correlation (~0.20).
  • Attendance & Exam Score: weaker but positive (~0.15).

Although study time correlates with performance, the literature suggests that excessive workload yields diminishing returns. A systematic review of K–12 homework found that while homework generally benefits learning, too much increases cognitive load and reduces motivation, ultimately impairing achievement (Guo et al. 2024) (pubmed.ncbi.nlm.nih.gov).
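The correlations reported in 6.3 are pairwise Pearson coefficients. A minimal sketch of how they could be computed and ranked against the exam score follows; the helper name `correlations_with_target` is an assumption.

```python
import pandas as pd

NUMERIC_COLUMNS = [
    "study_hours_per_day",
    "attendance_percentage",
    "sleep_hours",
    "exam_score",
]


def correlations_with_target(df, target="exam_score", columns=NUMERIC_COLUMNS):
    """Pearson correlation of each listed column with the target, strongest first."""
    corr = df[columns].corr(method="pearson")
    return corr[target].drop(target).sort_values(ascending=False)
```

The full matrix `df[NUMERIC_COLUMNS].corr()` is what a heatmap would display, for instance via `seaborn.heatmap(..., annot=True)` if seaborn is installed.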

7 Feature Engineering

Two new features were created:

  1. Age Group (<=18, 19–22, 23–26, 27+) to capture life‑stage differences.
  2. Total Screen Time: sum of social media, Netflix, and overall screen usage.

Global policies increasingly address digital distractions. In England, secondary schools with effective mobile phone bans were found more than twice as likely to be rated outstanding and saw students achieve one to two grades higher on GCSE exams, underscoring the impact of screen management on learning (Busby 2024) (independent.co.uk).
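The two engineered features from this section can be sketched as follows. The input column names `age`, `social_media_hours`, `netflix_hours`, and `screen_time` are assumptions about the dataset's schema, and the function name is hypothetical.

```python
import pandas as pd


def add_engineered_features(df):
    """Add an age_group category and a total_screen_time sum."""
    out = df.copy()
    # Right-closed bins give the groups <=18, 19-22, 23-26, and 27+.
    out["age_group"] = pd.cut(
        out["age"],
        bins=[float("-inf"), 18, 22, 26, float("inf")],
        labels=["<=18", "19-22", "23-26", "27+"],
    )
    # Assumed column names for the three screen-related measures.
    screen_cols = ["social_media_hours", "netflix_hours", "screen_time"]
    out["total_screen_time"] = out[screen_cols].sum(axis=1)
    return out
```

If the overall screen usage column already includes social media and Netflix time, this sum counts those hours twice, so it is worth checking the column definitions before relying on the feature.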

8 Predictive Modeling

A baseline Linear Regression model was built using four features:

  • Study Hours/Day
  • Sleep Hours
  • Motivation Level
  • Time Management Score

Evaluation metrics:

RMSE: 10.88
R²:   0.125
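A baseline of this kind could be fitted as in the sketch below, using scikit-learn. The report does not state how the data was split, so the 80/20 hold-out split and the seed are assumptions, as are the column names `motivation_level` and `time_management_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# The last two column names are assumed, not confirmed by the report.
FEATURES = [
    "study_hours_per_day",
    "sleep_hours",
    "motivation_level",
    "time_management_score",
]


def fit_baseline(df, features=FEATURES, target="exam_score", test_size=0.2, seed=42):
    """Fit ordinary least squares and return (model, RMSE, R²) on a hold-out set."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    r2 = float(r2_score(y_test, pred))
    return model, rmse, r2
```

Computing RMSE as the square root of the mean squared error keeps the metric in exam-score points.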

9 Results & Interpretation

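  • RMSE ≈ 10.88: the typical prediction error is about 11 points (out of 100), only slightly below the exam-score standard deviation of 11.59, so the model barely improves on always predicting the mean score.
  • R² ≈ 0.125: the model explains only ~12.5% of variance, indicating substantial unexplained factors.
  • The residual plot shows wide scatter, consistent with nonlinearity and missing predictor variables.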
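The two metrics can be cross-checked against each other. For a least-squares fit evaluated on the data it was computed from, RMSE equals the target's standard deviation times sqrt(1 − R²); on a hold-out set the relation holds only approximately. The sketch below applies it to the figures reported above.

```python
import math

exam_score_std = 11.591497  # standard deviation of exam_score from the summary table
r_squared = 0.125           # reported R²

# RMSE implied by R²: std(y) * sqrt(1 - R²).
implied_rmse = exam_score_std * math.sqrt(1 - r_squared)

# Relative improvement over always predicting the mean (whose RMSE is std(y)).
improvement = 1 - implied_rmse / exam_score_std

print(f"Implied RMSE: {implied_rmse:.2f}")
print(f"Improvement over mean-only baseline: {improvement:.1%}")
```

The implied value of about 10.84 is close to the reported 10.88, so the two metrics tell the same story: the error shrinks by only around 6.5% relative to predicting the mean.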

10 Conclusion & Next Steps

Our analysis indicates that study habits, sleep, and time management are positively associated with academic outcomes, but the associations are modest: the strongest correlation is ~0.35, and the baseline model explains only ~12.5% of the variance. Recent global trends (exam score clustering in high‑stakes tests, nuanced homework policies, and phone bans) are consistent with these findings and highlight the importance of balanced, evidence‑based interventions.

Recommendations:

  1. Test non‑linear models (e.g., Random Forest, XGBoost) to capture complex relationships.
  2. Incorporate additional predictors such as total_screen_time and previous_gpa.
  3. Perform cross‑validation and hyperparameter tuning.
  4. Explore classification approaches for dropout risk and policy impact analysis.
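Recommendations 1 and 3 could start from a sketch like the one below, which scores a Random Forest with k-fold cross-validation in scikit-learn. The hyperparameter values are placeholders rather than tuned settings, and the function name `cv_rmse` is hypothetical.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_rmse(df, features, target="exam_score", folds=5, n_estimators=100, seed=42):
    """Return (mean, standard deviation) of cross-validated RMSE for a Random Forest."""
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        min_samples_leaf=5,  # placeholder value; tune via grid or random search
        random_state=seed,
    )
    # scikit-learn reports negated RMSE so that larger scores are always better.
    scores = cross_val_score(
        model, df[features], df[target],
        cv=folds, scoring="neg_root_mean_squared_error",
    )
    return float(-scores.mean()), float(scores.std())
```

Comparing the returned mean RMSE with the linear baseline's 10.88 would show whether a non-linear model helps on this data.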