Portfolio / PYTHON LAB
PYTHON ANALYTICS LAB
Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn applied to real datasets — EDA, statistical analysis, machine learning, and a full Python desktop application built in Semester III.
5
PROJECTS
26+
VISUALIZATIONS
7+
LIBRARIES USED
4
DATASETS ANALYZED
1
ML MODEL BUILT
• LIBRARIES USED ACROSS ALL PROJECTS
pandas
Data manipulation • cleaning • aggregation
numpy
Numerical computation • arrays
matplotlib
Charts • histograms • scatter plots
seaborn
Statistical charts • heatmaps • boxplots
scikit-learn
Logistic regression • PCA • train/test split
tkinter + sqlite3
Desktop app • embedded database
PYTHON PROJECTS
PROJECT_01 · SEM III · EDA
📊 Data Science Jobs Market — EDA
Exploratory analysis of a real data science job postings dataset — uncovering salary distributions, in-demand skills, top hiring companies, and the relationship between company rating and compensation.
pandas
seaborn
matplotlib
numpy
12 CHARTS
histplot — salary distribution
barplot — top 10 companies
boxplot — salary by job category
scatterplot — rating vs salary
countplot — skill demand (Python/R/AWS)
heatmap — correlation matrix
kde=True — density overlay
value_counts + head(10) — top filter
• KEY SNIPPET — Correlation Heatmap + Skill Demand Chart
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("eda_data.csv")
# Skill demand — Python, R, Spark, AWS, Excel
skills = ['python_yn', 'R_yn', 'spark', 'aws', 'excel']
skill_counts = df[skills].sum()
sns.barplot(x=skill_counts.index, y=skill_counts.values)
plt.title("Skill Demand in Data Science Job Postings")
plt.show()
# Correlation heatmap across all numeric variables
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
# Salary vs Company Rating — colored by job category
sns.scatterplot(data=df, x="Rating", y="avg_salary", hue="job_simp")
plt.title("Salary vs Company Rating by Job Category")
plt.show()
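The chart list also mentions the `value_counts + head(10)` pattern behind the top-10 companies bar chart. A minimal sketch of that filter, using a tiny demo frame in place of `eda_data.csv` (the `Company Name` column is an assumption):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Demo frame standing in for eda_data.csv (column name assumed)
df = pd.DataFrame({"Company Name": ["A", "B", "A", "C", "A", "B"]})

# Count postings per company, keep only the largest ten
top10 = df["Company Name"].value_counts().head(10)

sns.barplot(x=top10.values, y=top10.index, orient="h")
plt.title("Top Hiring Companies by Posting Count")
plt.show()
```

`value_counts` already sorts in descending order, so `head(10)` is all the filtering the chart needs.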
PROJECT_02 · SEM III · EDA ASSIGNMENT
🧠 Student Stress Dataset — EDA
Analyzed a survey dataset on student stress indicators — anxiety, sleep problems, concentration issues, and mood. Produced 14 visualizations mapping frequency distributions and correlations between stress symptoms.
pandas
matplotlib
seaborn
14 CHARTS
SURVEY DATA
groupby + mean — anxiety by age
line plot — trend over age groups
countplot — stress distribution
pie chart — palpitations, loneliness
histogram — anxiety frequency bins
bar chart — sleep problems, headaches
scatter — age vs overwhelm
marker, linestyle, alpha — chart styling
• KEY SNIPPET — Anxiety by Age + Stress Distribution
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("Stress_Dataset.csv")
# Line plot: Average anxiety level by age group
avg_anxiety = df.groupby("Age")[
    "Have you been dealing with anxiety or tension recently?"
].mean()
plt.figure(figsize=(8, 5))
plt.plot(avg_anxiety.index, avg_anxiety.values,
         marker="s", markersize=7, linestyle="--",
         linewidth=2, color="g")
plt.title("Anxiety Level by Age")
plt.xlabel("Age")
plt.ylabel("Average Anxiety Level")
plt.grid(True, alpha=0.4)
plt.show()
# Countplot: Stress level distribution
sns.countplot(
    x="Have you recently experienced stress in your life?",
    hue="Have you recently experienced stress in your life?",
    data=df, palette="Set2", legend=False  # hue mirrors x: palette without hue is deprecated in seaborn >= 0.14
)
plt.title("Stress Level Distribution")
plt.show()
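The 14-chart list includes pie charts for palpitations and loneliness, which the snippet above does not cover. A minimal sketch with a demo column (the question wording here is an assumption, not the dataset's exact header):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Demo responses standing in for the survey column (question text assumed)
col = "Do you experience heart palpitations under stress?"
df = pd.DataFrame({col: ["Yes", "No", "Yes", "Yes", "No"]})

# Pie charts take category counts, so aggregate first
counts = df[col].value_counts()
plt.pie(counts.values, labels=counts.index, autopct="%1.1f%%", startangle=90)
plt.title("Palpitations Under Stress")
plt.show()
```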
PROJECT_03 · SEM III · END SEM ASSIGNMENT
📱 Digital Habits Survey — Data Cleaning + EDA
End-semester assignment analyzing a real survey on digital habits and environmental awareness. Full data cleaning pipeline followed by descriptive statistics and visualization — screen time vs environmental concern correlation.
pandas
matplotlib
DATA CLEANING
SURVEY · 121 ROWS
fillna — handle missing Name values
drop_duplicates — remove dupes
str.strip — clean whitespace
isnull / df[df[col].isnull()] — audit
mean, median, mode, std — desc stats
min, max, range — spread analysis
pivot_table — age group aggregation
scatter — screen time vs env concern
• KEY SNIPPET — Full Data Cleaning Pipeline
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("Your Digital Habits Survey.xlsx")
# ── Data Cleaning Pipeline ──
df["Name"] = df["Name"].fillna("Unknown") # fill nulls
df.drop_duplicates(inplace=True) # remove dupes
df["What is your age?"] = df["What is your age?"].str.strip() # clean whitespace
# ── Descriptive Statistics ──
col = 'How concerned are you about environmental issues?'
print(f"Mean: {df[col].mean():.2f}")
print(f"Median: {df[col].median()}")
print(f"Std: {df[col].std():.2f}")
print(f"Range: {df[col].max() - df[col].min()}")
# ── Pivot: Average concern by age group ──
pivot = df.pivot_table(values=col,
                       index='What is your age?',
                       aggfunc='mean')
pivot.plot(kind='bar')
plt.title("Environmental Concern by Age Group")
plt.xticks(rotation=45)
plt.show()
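The technique list above also names the `isnull` audit and mode, which the pipeline snippet skips. A minimal sketch of both on a demo frame (column names match the snippet; the data itself is invented for illustration):

```python
import pandas as pd

# Demo frame standing in for the survey sheet (values invented)
col = "How concerned are you about environmental issues?"
df = pd.DataFrame({
    "Name": ["Asha", None, "Ravi", "Asha"],
    col: [4, 5, None, 4],
})

# Audit: rows where the concern column is null, plus per-column null counts
missing_rows = df[df[col].isnull()]
null_counts = df.isnull().sum()

# Mode complements mean/median for ordinal survey scales
mode_val = df[col].mode()[0]
print(null_counts)
print(f"Mode: {mode_val}")
```

Running the audit before `fillna` / `drop_duplicates` makes it clear what the cleaning steps actually changed.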
PROJECT_04 · KAGGLE COMPETITION · MACHINE LEARNING
🤖 Customer Churn Prediction — Logistic Regression
Kaggle Playground Series competition entry — built a churn prediction model using logistic regression. Full pipeline: combined train/test datasets, label-encoded all categorical features, trained a classifier, and generated probability predictions for submission.
scikit-learn
pandas
LabelEncoder
LogisticRegression
KAGGLE COMPETITION
LabelEncoder — encode categoricals
pd.concat — combine train + test
train_test_split — holdout set
LogisticRegression(max_iter=1000)
predict_proba — soft predictions
target mapping: Yes→1, No→0
submission CSV generation
select_dtypes — auto detect cols
• KEY SNIPPET — Full ML Pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# Map target to binary
train["Churn"] = train["Churn"].map({"Yes": 1, "No": 0})
# Combine for consistent label encoding
combined = pd.concat([train.drop("Churn", axis=1), test], axis=0)
for col in combined.select_dtypes(include="object").columns:
    if col != "id":
        le = LabelEncoder()
        combined[col] = le.fit_transform(combined[col].astype(str))
# Split back into train / test
X_train = combined.iloc[:len(train)]
X_test = combined.iloc[len(train):]
y_train = train["Churn"]
# Train and predict churn probabilities
model = LogisticRegression(max_iter=1000)
model.fit(X_train.drop("id", axis=1), y_train)
pred_prob = model.predict_proba(X_test.drop("id", axis=1))[:, 1]
# Generate submission file
pd.DataFrame({"id": test["id"], "Churn": pred_prob}).to_csv(
    "submission.csv", index=False
)
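The technique list mentions `train_test_split` for a holdout set, which the submission pipeline above goes without. A sketch of that validation step on synthetic stand-in features (the data here is generated, not the Kaggle set):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the encoded training features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))
y = (X["a"] + 0.5 * X["b"] > 0).astype(int)

# Hold out 20% to sanity-check the model before generating a submission
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_prob = model.predict_proba(X_val)[:, 1]
print(f"Validation ROC-AUC: {roc_auc_score(y_val, val_prob):.3f}")
```

ROC-AUC on held-out probabilities mirrors the Playground Series scoring, so the local number is a rough preview of the leaderboard score.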
PROJECT_05 · PYTHON APPLICATION · MOST ADVANCED
🌿 Eco Monitor — Desktop CO₂ Tracker App
A fully functional desktop application that monitors real-time CPU and GPU usage, estimates CO₂ emissions per second, and stores daily history in SQLite. Features a live animated UI — a plant that wilts as emissions rise, or a city skyline whose sky darkens — built with Tkinter. Most advanced Python work in the portfolio.
tkinter
sqlite3
psutil
threading
GPUtil
FULL APP
LIVE UI
REAL-TIME MONITORING
Reads CPU and GPU usage every second using psutil and GPUtil, then converts the estimated power draw (watts) into CO₂ grams per second via a gCO₂/kWh grid-intensity factor
ANIMATED PLANT UI
Plant grows green when CO₂ is low, wilts brown as daily budget is exceeded — visual feedback using Canvas color interpolation
CITY VIEW TOGGLE
Switchable city skyline view — sky color fades from blue to grey as pollution rises, with animated building windows
SQLITE DATABASE
Every second of usage is written to SQLite — daily totals kept in a single per-day summary row via INSERT OR REPLACE upserts, giving a running figure for budget and anomaly checks
MULTI-THREADING
Background daemon thread runs the monitoring loop — UI thread stays responsive while data collection runs continuously
AI USAGE DETECTION
Detects browser + GPU usage spikes to estimate CO₂ from AI tool usage — flags sessions likely involving LLM inference
• KEY SNIPPET — CO₂ Calculation + SQLite Write
import psutil, sqlite3, threading, time
from datetime import date

CPU_MAX_POWER_W = 100
GPU_MAX_POWER_W = 200
gCO2_per_kWh = 700
DAILY_BUDGET_G = 200

def update_metrics_loop(self):
    while self.running:
        cpu_pct = psutil.cpu_percent(interval=1)  # blocks ~1 s, so this paces the loop
        gpu_pct = self.get_gpu_usage()
        # Watts → CO₂ grams per second (CPU + GPU draw, scaled by load)
        power_w = (CPU_MAX_POWER_W * cpu_pct / 100.0
                   + GPU_MAX_POWER_W * gpu_pct / 100.0)
        co2_per_s = (power_w / 1000) * (gCO2_per_kWh / 3600)
        self.cumulative_co2 += co2_per_s
        # Persist to SQLite with daily summary rollover
        conn = sqlite3.connect("sustainability_data.db")
        conn.execute(
            "INSERT INTO usage (cpu_percent, co2_grams) VALUES (?, ?)",
            (cpu_pct, co2_per_s)
        )
        conn.execute(
            "INSERT OR REPLACE INTO daily_summary (day, total_co2_grams) VALUES (?, ?)",
            (date.today().isoformat(), self.cumulative_co2)
        )
        conn.commit()
        conn.close()
        # Update UI — health_factor drives plant / city color
        health = max(0, 1.0 - (self.cumulative_co2 / DAILY_BUDGET_G))
        self.root.after(0, lambda h=health: self.update_visuals(h))
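The multi-threading card describes running this loop on a background daemon thread while the Tkinter UI stays responsive. A minimal sketch of that setup, with a stub class standing in for the app (class and attribute names here are assumptions, not the app's real ones):

```python
import threading
import time

class MonitorSketch:
    """Minimal stand-in for the app's threading setup (names assumed)."""
    def __init__(self):
        self.running = True
        self.ticks = 0

    def update_metrics_loop(self):
        # The real loop samples psutil and writes SQLite; here we just count
        while self.running:
            self.ticks += 1
            time.sleep(0.01)

    def start(self):
        # daemon=True: the thread dies with the main process, so closing
        # the Tkinter window never leaves an orphaned monitoring loop
        t = threading.Thread(target=self.update_metrics_loop, daemon=True)
        t.start()
        return t

mon = MonitorSketch()
thread = mon.start()
time.sleep(0.05)          # main thread stays free (in the app: Tk mainloop)
mon.running = False       # cooperative shutdown flag
thread.join(timeout=1)
```

The `self.running` flag plus `root.after(0, ...)` in the loop above is the standard pattern: the worker never touches widgets directly, it only schedules UI updates back onto the main thread.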
• FULL SOURCE CODE
All Python Files on GitHub
Complete Jupyter notebooks and .py scripts from Semester III — EDA, ML pipeline, and the full Eco Monitor application.