Portfolio / PYTHON LAB

PYTHON ANALYTICS LAB

Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn applied to real datasets — EDA, statistical analysis, machine learning, and a full Python desktop application built in Semester III.

5
PROJECTS
26+
VISUALIZATIONS
6
LIBRARIES USED
4
DATASETS ANALYZED
1
ML MODEL BUILT
• LIBRARIES USED ACROSS ALL PROJECTS
pandas
Data manipulation • cleaning • aggregation
numpy
Numerical computation • arrays
matplotlib
Charts • histograms • scatter plots
seaborn
Statistical charts • heatmaps • boxplots
scikit-learn
Logistic regression • PCA • train/test split
tkinter + sqlite3
Desktop app • embedded database
↗ View Code on GitHub • See Case Studies

PYTHON PROJECTS

PROJECT_01 · SEM III · EDA
📊 Data Science Jobs Market — EDA
Exploratory analysis of a real data science job postings dataset — uncovering salary distributions, in-demand skills, top hiring companies, and the relationship between company rating and compensation.
pandas seaborn matplotlib numpy 12 CHARTS
histplot — salary distribution
barplot — top 10 companies
boxplot — salary by job category
scatterplot — rating vs salary
countplot — skill demand (Python/R/AWS)
heatmap — correlation matrix
kde=True — density overlay
value_counts + head(10) — top filter
• KEY SNIPPET — Correlation Heatmap + Skill Demand Chart
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("eda_data.csv")

# Skill demand — Python, R, Spark, AWS, Excel
skills = ['python_yn', 'R_yn', 'spark', 'aws', 'excel']
skill_counts = df[skills].sum()
sns.barplot(x=skill_counts.index, y=skill_counts.values)
plt.title("Skill Demand in Data Science Job Postings")
plt.show()

# Correlation heatmap across all numeric variables
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

# Salary vs Company Rating — colored by job category
sns.scatterplot(data=df, x="Rating", y="avg_salary", hue="job_simp")
plt.title("Salary vs Company Rating by Job Category")
plt.show()
PROJECT_02 · SEM III · EDA ASSIGNMENT
🧠 Student Stress Dataset — EDA
Analyzed a survey dataset on student stress indicators — anxiety, sleep problems, concentration issues, and mood. Produced 14 visualizations mapping frequency distributions and correlations between stress symptoms.
pandas matplotlib seaborn 14 CHARTS SURVEY DATA
groupby + mean — anxiety by age
line plot — trend over age groups
countplot — stress distribution
pie chart — palpitations, loneliness
histogram — anxiety frequency bins
bar chart — sleep problems, headaches
scatter — age vs overwhelm
marker, linestyle, alpha — chart styling
• KEY SNIPPET — Anxiety by Age + Stress Distribution
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Stress_Dataset.csv")

# Line plot: Average anxiety level by age group
avg_anxiety = df.groupby("Age")[
    "Have you been dealing with anxiety or tension recently?"
].mean()

plt.figure(figsize=(8, 5))
plt.plot(avg_anxiety.index, avg_anxiety.values,
         marker="s", markersize=7, linestyle="--",
         linewidth=2, color="g")
plt.title("Anxiety Level by Age")
plt.xlabel("Age")
plt.ylabel("Average Anxiety Level")
plt.grid(True, alpha=0.4)
plt.show()

# Countplot: Stress level distribution
sns.countplot(
    x="Have you recently experienced stress in your life?",
    data=df, palette="Set2"
)
plt.title("Stress Level Distribution")
plt.show()
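The chart list for this project also mentions pie charts for yes/no symptoms and fixed-width histogram bins; a hedged sketch of both, with synthetic responses standing in for Stress_Dataset.csv (column names here are assumptions):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for the survey data
df = pd.DataFrame({
    "loneliness":    ["Yes", "No", "Yes", "Yes", "No", "No", "Yes"],
    "anxiety_level": [2, 4, 5, 1, 3, 4, 5],
})

# Pie chart — share of Yes/No answers for a symptom
counts = df["loneliness"].value_counts()
plt.pie(counts.values, labels=counts.index, autopct="%1.0f%%")
plt.title("Loneliness (Yes/No)")
plt.savefig("loneliness_pie.png")
plt.close()

# Histogram — anxiety frequency in explicit bins
plt.hist(df["anxiety_level"], bins=[0, 1, 2, 3, 4, 5], edgecolor="black")
plt.title("Anxiety Frequency Bins")
plt.savefig("anxiety_hist.png")
plt.close()
```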
PROJECT_03 · SEM III · END SEM ASSIGNMENT
📱 Digital Habits Survey — Data Cleaning + EDA
End-semester assignment analyzing a real survey on digital habits and environmental awareness. A full data cleaning pipeline followed by descriptive statistics and visualization, including the correlation between screen time and environmental concern.
pandas matplotlib DATA CLEANING SURVEY · 121 ROWS
fillna — handle missing Name values
drop_duplicates — remove dupes
str.strip — clean whitespace
isnull / df[df[col].isnull()] — audit
mean, median, mode, std — desc stats
min, max, range — spread analysis
pivot_table — age group aggregation
scatter — screen time vs env concern
• KEY SNIPPET — Full Data Cleaning Pipeline
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("Your Digital Habits Survey.xlsx")

# ── Data Cleaning Pipeline ──
df["Name"] = df["Name"].fillna("Unknown")       # fill nulls
df.drop_duplicates(inplace=True)                # remove dupes
df["What is your age?"] = df["What is your age?"].str.strip()  # clean whitespace

# ── Descriptive Statistics ──
col = 'How concerned are you about environmental issues?'
print(f"Mean:   {df[col].mean():.2f}")
print(f"Median: {df[col].median()}")
print(f"Std:    {df[col].std():.2f}")
print(f"Range:  {df[col].max() - df[col].min()}")

# ── Pivot: Average concern by age group ──
pivot = df.pivot_table(values=col,
                       index='What is your age?',
                       aggfunc='mean')
pivot.plot(kind='bar')
plt.title("Environmental Concern by Age Group")
plt.xticks(rotation=45)
plt.show()
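The technique list also names the `isnull` audit and `mode`, which the snippet skips. A minimal sketch on a tiny synthetic frame (column names and values are illustrative, not the real survey):

```python
import pandas as pd

# Tiny synthetic stand-in for the survey data
df = pd.DataFrame({
    "Name":    ["Asha", None, "Ravi", "Asha"],
    "concern": [4, 5, 3, 4],
})

# Audit: count nulls per column, then inspect the offending rows
null_counts = df.isnull().sum()
missing_names = df[df["Name"].isnull()]

# Mode — the most frequent response value
concern_mode = df["concern"].mode()[0]
print(f"Nulls per column:\n{null_counts}")
print(f"Mode of concern: {concern_mode}")
```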
PROJECT_04 · KAGGLE COMPETITION · MACHINE LEARNING
🤖 Customer Churn Prediction — Logistic Regression
Kaggle Playground Series competition entry — built a churn prediction model using logistic regression. Full pipeline: combined train/test datasets, label-encoded all categorical features, trained a classifier, and generated probability predictions for submission.
scikit-learn pandas LabelEncoder LogisticRegression KAGGLE COMPETITION
LabelEncoder — encode categoricals
pd.concat — combine train + test
train_test_split — holdout set
LogisticRegression(max_iter=1000)
predict_proba — soft predictions
target mapping: Yes→1, No→0
submission CSV generation
select_dtypes — auto detect cols
• KEY SNIPPET — Full ML Pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

# Map target to binary
train["Churn"] = train["Churn"].map({"Yes": 1, "No": 0})

# Combine for consistent label encoding
combined = pd.concat([train.drop("Churn", axis=1), test], axis=0)

for col in combined.select_dtypes(include="object").columns:
    if col != "id":
        le = LabelEncoder()
        combined[col] = le.fit_transform(combined[col].astype(str))

# Split back into train / test
X_train = combined.iloc[:len(train)]
X_test  = combined.iloc[len(train):]
y_train = train["Churn"]

# Train and predict churn probabilities
model = LogisticRegression(max_iter=1000)
model.fit(X_train.drop("id", axis=1), y_train)
pred_prob = model.predict_proba(X_test.drop("id", axis=1))[:, 1]

# Generate submission file
pd.DataFrame({"id": test["id"], "Churn": pred_prob}).to_csv(
    "submission.csv", index=False
)
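The technique list above includes `train_test_split`, but the snippet goes straight to the submission file. A hedged sketch of validating on a holdout set before the final fit — the data here is synthetic, not the Kaggle frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the encoded training frame
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
y = (X["a"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hold out 20% to estimate generalization before refitting on everything
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation AUC: {auc:.3f}")
```

Checking a validation AUC like this before generating `submission.csv` catches encoding or leakage mistakes early.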
PROJECT_05 · PYTHON APPLICATION · MOST ADVANCED
🌿 Eco Monitor — Desktop CO₂ Tracker App
A fully functional desktop application that monitors real-time CPU and GPU usage, estimates CO₂ emissions per second, and stores daily history in SQLite. Features a live animated UI — a plant that wilts as emissions rise, or a city skyline whose sky darkens — built with Tkinter. Most advanced Python work in the portfolio.
tkinter sqlite3 psutil threading GPUtil FULL APP LIVE UI
REAL-TIME MONITORING
Reads CPU and GPU usage every second using psutil and GPUtil — converts estimated power draw in watts to CO₂ grams using a grid-intensity factor (gCO₂/kWh)
🌱
ANIMATED PLANT UI
Plant grows green when CO₂ is low, wilts brown as daily budget is exceeded — visual feedback using Canvas color interpolation
🏙️
CITY VIEW TOGGLE
Switchable city skyline view — sky color fades from blue to grey as pollution rises, with animated building windows
🗄️
SQLITE DATABASE
Every second of usage is written to SQLite — the daily summary row is upserted with INSERT OR REPLACE so it always reflects the running total for the day
🧵
MULTI-THREADING
Background daemon thread runs the monitoring loop — UI thread stays responsive while data collection runs continuously
🤖
AI USAGE DETECTION
Detects browser + GPU usage spikes to estimate CO₂ from AI tool usage — flags sessions likely involving LLM inference
• KEY SNIPPET — CO₂ Calculation + SQLite Write
import psutil, sqlite3, threading, time
from datetime import date

CPU_MAX_POWER_W   = 100
GPU_MAX_POWER_W   = 200
gCO2_per_kWh      = 700
DAILY_BUDGET_G    = 200

def update_metrics_loop(self):
    while self.running:
        cpu_pct   = psutil.cpu_percent(interval=1)
        gpu_pct   = self.get_gpu_usage()

        # Watts → CO₂ grams per second (CPU + GPU draw)
        power_w   = (CPU_MAX_POWER_W * cpu_pct / 100.0
                     + GPU_MAX_POWER_W * gpu_pct / 100.0)
        co2_per_s = (power_w / 1000) * (gCO2_per_kWh / 3600)

        self.cumulative_co2 += co2_per_s

        # Persist to SQLite with daily summary rollover
        conn = sqlite3.connect("sustainability_data.db")
        conn.execute(
            "INSERT INTO usage (cpu_percent, co2_grams) VALUES (?, ?)",
            (cpu_pct, co2_per_s)
        )
        conn.execute(
            "INSERT OR REPLACE INTO daily_summary (day, total_co2_grams) VALUES (?, ?)",
            (date.today().isoformat(), self.cumulative_co2)
        )
        conn.commit()
        conn.close()

        # Update UI — health_factor drives plant / city color
        health = max(0, 1.0 - (self.cumulative_co2 / DAILY_BUDGET_G))
        self.root.after(0, lambda h=health: self.update_visuals(h))
        time.sleep(1)
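The "Canvas color interpolation" the plant UI description mentions can be sketched as a pure function: blend two RGB endpoints by the health factor and emit a hex string Tkinter accepts. The endpoint colors and helper name here are assumptions, not the app's actual values:

```python
def interpolate_color(health, low=(139, 90, 43), high=(34, 139, 34)):
    """Blend from brown (wilted, health=0) to green (healthy, health=1)."""
    h = max(0.0, min(1.0, health))  # clamp so over-budget days stay fully brown
    r, g, b = (round(lo + (hi - lo) * h) for lo, hi in zip(low, high))
    return f"#{r:02x}{g:02x}{b:02x}"  # Tkinter Canvas accepts hex color strings

# Usage inside the UI update, e.g.:
# canvas.itemconfig(leaf_id, fill=interpolate_color(health))
```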
• FULL SOURCE CODE

All Python Files on GitHub

Complete Jupyter notebooks and .py scripts from Semester III — EDA, ML pipeline, and the full Eco Monitor application.

↗ View on GitHub • See Analytics Projects