Comprehensive Learning Reference

Data Science & ML Mastery Guide

Complete tutorials and reference notes for data processing, machine learning algorithms, OpenRefine, Flask development, Regular Expressions, and advanced ML concepts.

🔧 OpenRefine Data Cleaning Complete Guide

Master data cleaning, transformation, and quality improvement with OpenRefine

Installation & Setup

Step 1: Download OpenRefine

Download OpenRefine 3.6.2 from: https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2

Choose your OS version (Windows, Mac, Linux). The Windows and Mac packages for 3.6.2 come with an embedded Java runtime; on Linux, install Java separately.

Step 2: Launch OpenRefine

1. Extract the downloaded file

2. Run the executable (openrefine.exe on Windows)

3. Open browser and go to: http://127.0.0.1:3333

Project Creation & Data Import

Creating a New Project

1. Click "Create Project"

2. Choose "This Computer" → Browse for your CSV file

3. Click "Next" to preview data

4. Verify column headers and data types

5. Click "Create Project" (top right)

💡 Data Import Best Practices

• Always preview your data before creating the project

• Check if headers are properly detected

• Verify column separation (comma, tab, semicolon)

• Note any encoding issues with special characters

Core Data Cleaning Operations

1. Removing Blank/Null Values

Using Text Facets to Remove Blanks

1. Click on the column dropdown → Facet → Text facet

2. In the facet panel (left side), you'll see all unique values

3. Click on (blank) to select only blank rows

4. Click All → Edit rows → Remove all matching rows

5. Close the facet when done (a pandas equivalent is sketched below for comparison)
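
For comparison, the same blank-row cleanup can be scripted in pandas; this is a minimal sketch with placeholder file and column names:

Pandas equivalent (sketch)
import pandas as pd

# Load the raw export (file and column names are placeholders)
df = pd.read_csv('data.csv')

# Treat empty and whitespace-only cells as missing, then drop those rows
df['Subject'] = df['Subject'].replace(r'^\s*$', pd.NA, regex=True)
df_clean = df.dropna(subset=['Subject'])

print(f"Removed {len(df) - len(df_clean)} blank rows")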

2. Filtering Data

Text Filtering

1. Click column dropdown → Text filter

2. Enter your search term (e.g., "Airport")

3. Note: Filtering is case-insensitive by default; tick the "case sensitive" box in the filter panel for exact-case matching

4. To keep only matching rows: All → Edit rows → Remove all non-matching rows

3. Creating New Columns with GREL

Adding Columns Based on Existing Data

1. Click the source column dropdown → Edit column → Add column based on this column

2. Enter new column name

3. Write GREL expression in the expression box

4. Preview shows first few results

5. Click OK to create column

GREL (General Refine Expression Language) Reference

Essential GREL Functions

Function | Purpose | Example
contains(value, "text") | Check if value contains text | contains(value, "Airport")
value.match(/pattern/) | Extract text matching regex | value.match(/([A-Za-z\s]+Airport)/)
split(value, delimiter) | Split text into array | split(value, ",")
trim(value) | Remove leading/trailing spaces | trim(value)
if(condition, true_value, false_value) | Conditional logic | if(contains(value, "Airport"), "Yes", "No")

Real-World GREL Examples

Extract Airport Names
if(contains(value, "Airport"),
  value.match(/([A-Za-z\s]+Airport)/)[0].trim(),
  ""
)
Clean and Standardize Text
value.trim().replace(/\s+/, " ").toTitlecase()
Extract Text After Specific Word
if(contains(value, " at "),
  split(value, " at ")[1].trim(),
  value
)

Clustering & Merging Similar Values

Using Cluster and Edit Feature

1. Select the column with similar values

2. Go to Edit cells → Cluster and edit

3. Try different clustering methods:

Key collision → fingerprint: Basic similarity

Key collision → ngram-fingerprint: Handles typos

Nearest neighbor → levenshtein: Character differences

4. Review suggested clusters and choose merge values

5. Check Merge? for clusters you want to merge

6. Click Merge Selected & Re-Cluster

Exporting Results

Export Cleaned Data

1. Click Export (top right) → Comma-separated value

2. Save the cleaned CSV file

Export History (for Reproducibility)

1. Go to the Undo/Redo tab → Extract... → Export

2. Downloads history.json with all operations

3. Can be used to replay operations on similar datasets

⚠️ Common Mistakes to Avoid

• Always back up original data before major operations

• Test GREL expressions on small datasets first

• Check for case sensitivity in filters and matching

• Verify row counts after filtering operations

🌐 Flask Web Development Complete Tutorial

Build data-driven web applications with Python Flask

Flask Installation & Setup

Install Flask
Terminal/Command Prompt
# Install Flask
pip install Flask

# Optional: Create virtual environment first
python -m venv flask_env
# Windows: flask_env\Scripts\activate
# Mac/Linux: source flask_env/bin/activate
pip install Flask

Basic Flask Application Structure

project_folder/
├── run.py                 # Main application runner
├── data_processing.py     # Data loading / wrangling helpers
├── flaskapp/
│   ├── __init__.py       # Flask app initialization
│   ├── routes.py         # URL routes and view functions
│   └── templates/
│       └── index.html    # HTML templates
└── data/
    └── dataset.csv       # Data files

1. Application Runner (run.py)

run.py
""" run.py - Run the Flask app """
from flaskapp import app

if __name__ == '__main__':
    # Start the Flask development server
    app.run(host='127.0.0.1', port=3001, debug=True)

2. Flask App Initialization

flaskapp/__init__.py
from flask import Flask

# Create Flask application instance
app = Flask(__name__)

# Import routes (must be after app creation)
from flaskapp import routes

3. Data Processing Module

data_processing.py
import csv

def username():
    return 'your_username'

def data_wrangling(filter_class=None):
    """
    Process CSV data and return formatted results
    
    Args:
        filter_class (str): Optional filter for data category
    
    Returns:
        tuple: (header, table_data, dropdown_options)
    """
    with open('data/dataset.csv', 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        table = []
        all_classes = set()
        
        # Read header row
        header = next(reader)
        
        # Read and process data rows
        for row in reader:
            # Convert numeric columns
            processed_row = [
                row[0],                    # species (string)
                row[1],                    # class (string) 
                int(row[2]) if row[2] else 0  # count (integer)
            ]
            table.append(processed_row)
            all_classes.add(row[1])  # Collect unique classes
        
        # Create sorted dropdown options
        dropdown_options = sorted(list(all_classes))
        
        # Apply filtering if specified
        if filter_class:
            table = [row for row in table if row[1] == filter_class]
        
        # Sort by count (descending) and limit to top 10
        table.sort(key=lambda x: x[2], reverse=True)
        table = table[:10]
    
    return header, table, dropdown_options

4. Routes and View Functions

flaskapp/routes.py
from flask import render_template, request
from flaskapp import app
from data_processing import data_wrangling

@app.route('/')
@app.route('/index')
def index():
    """Main page route"""
    # Get filter parameter from URL
    filter_class = request.args.get('filter_class')
    
    # Process data with optional filter
    header, table, dropdown_options = data_wrangling(filter_class)
    
    # Render template with data
    return render_template('index.html',
                         header=header,
                         table=table,
                         dropdown_options=dropdown_options,
                         current_filter=filter_class)

@app.route('/api/data')
def api_data():
    """API endpoint for JSON data"""
    filter_class = request.args.get('filter_class')
    header, table, dropdown_options = data_wrangling(filter_class)
    
    return {
        'header': header,
        'data': table,
        'filters': dropdown_options,
        'count': len(table)
    }

Running Your Flask Application

Start the Development Server
Terminal
# Navigate to your project folder
cd your_project_folder

# Run the application
python run.py

# Output should show:
# * Running on http://127.0.0.1:3001

Open your browser and go to: http://127.0.0.1:3001

Flask Development Best Practices

💡 Essential Tips

• Use debug=True for development (auto-reload on changes)

• Organize code into modules (routes, models, utilities)

• Use templates for all HTML (avoid HTML in Python code)

• Handle errors gracefully with try/except blocks and Flask error handlers (see the sketch below)

• Use environment variables for configuration
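
As an example of the error-handling tip above, here is a minimal sketch for routes.py; the /report route and the 404.html template are illustrative, not part of the tutorial app:

Error handling sketch
from flask import render_template
from flaskapp import app
from data_processing import data_wrangling

@app.errorhandler(404)
def page_not_found(e):
    # Render a friendly page instead of Flask's default 404 (404.html is assumed to exist)
    return render_template('404.html'), 404

@app.route('/report')
def report():
    try:
        header, table, dropdown_options = data_wrangling()
    except FileNotFoundError:
        # Fail gracefully if the CSV is missing instead of showing a stack trace
        return "Data file not found", 500
    return render_template('index.html', header=header, table=table,
                           dropdown_options=dropdown_options, current_filter=None)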

🔍 Regular Expressions Complete Reference

Master pattern matching for text processing and data extraction

RegEx Fundamentals

Basic Syntax Elements

Pattern | Description | Example | Matches
. | Any single character | a.c | abc, axc, a1c
* | Zero or more of preceding | ab*c | ac, abc, abbc
+ | One or more of preceding | ab+c | abc, abbc (not ac)
? | Zero or one of preceding | ab?c | ac, abc
^ | Start of string | ^Hello | "Hello world"
$ | End of string | world$ | "Hello world"

Data Extraction Patterns

Airport Name Extraction
# Python
import re

# Basic airport pattern
pattern = r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+((?:International\s+)?Airport)\b'

text = "77kg of sandalwood at the Chhatrapati Shivaji Maharaj International Airport"
match = re.search(pattern, text)
if match:
    airport_name = f"{match.group(1)} {match.group(2)}"
    print(airport_name)  # "Chhatrapati Shivaji Maharaj International Airport"
Email Address Extraction
# Basic email pattern
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

text = "Contact us at support@example.com or admin@site.org"
emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'admin@site.org']

Python RegEx Functions

Function | Purpose | Returns
re.search(pattern, text) | Find first match | Match object or None
re.findall(pattern, text) | Find all matches | List of strings
re.sub(pattern, replacement, text) | Replace matches | Modified string
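
A short example exercising all three functions on made-up text:

Python RegEx functions in action
import re

text = "Order 123 shipped, order 456 pending"

print(re.search(r'\d+', text).group())   # '123'  (first match only)
print(re.findall(r'\d+', text))          # ['123', '456']
print(re.sub(r'\d+', '#', text))         # 'Order # shipped, order # pending'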

🐍 Python Data Processing Patterns

Essential Python techniques for data manipulation and analysis

Pandas Data Manipulation

Essential Pandas Operations

Data Loading and Basic Operations
import pandas as pd
import numpy as np

# Load CSV data
df = pd.read_csv('data.csv')

# Basic data exploration
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
print(f"Missing values:\n{df.isnull().sum()}")

# Display first few rows
print(df.head())

# Basic statistics
print(df.describe())

Data Cleaning Patterns

Remove Duplicates and Handle Missing Data
# Remove duplicate rows
df_clean = df.drop_duplicates()

# Remove rows with missing values in specific columns
df_clean = df.dropna(subset=['important_column'])

# Fill missing values
df['column'] = df['column'].fillna('default_value')
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())

# Remove rows where column is empty string
df_clean = df[df['column'].str.strip() != '']

Filtering and Querying

Advanced Filtering Techniques
# Basic filtering
airport_data = df[df['Subject'].str.contains('Airport', case=True, na=False)]

# Multiple conditions
filtered = df[
    (df['Count'] > 10) & 
    (df['Country'].isin(['China', 'India'])) &
    (df['Date'].str.contains('2024'))
]

# Using query method (more readable for complex conditions)
result = df.query("Count > 10 and Country in ['China', 'India']")

# String operations
df['clean_name'] = df['name'].str.strip().str.title()
has_keyword = df[df['description'].str.contains('wildlife', case=False, na=False)]

🌲 Ensemble Methods (Boosting vs Bagging)

Understanding parallel vs sequential ensemble learning approaches

Installation & Setup

Required packages
# Required packages
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

Core Concepts

Bagging (Bootstrap Aggregating)

Boosting

Bagging: f(x) = (1/M) Σ f_m(x)
Boosting: f(x) = Σ α_m h_m(x)

Code Examples

Implementation Examples
# Random Forest (Bagging)
rf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',  # Random feature selection
    bootstrap=True,       # Bootstrap sampling
    oob_score=True       # Out-of-bag error estimation
)

# AdaBoost (Boosting)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # named base_estimator in scikit-learn < 1.2
    n_estimators=50,
    learning_rate=1.0
)
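
cross_val_score is imported in the setup above but not used; a quick comparison of the two ensembles might look like this sketch (X and y are assumed to be an existing feature matrix and label vector):

Comparing the ensembles (sketch)
# Compare bagging vs boosting with 5-fold cross-validation (X, y assumed given)
for name, model in [('Random Forest', rf), ('AdaBoost', ada)]:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")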

Best Practices

Random Forest Tips

• Use √p features for classification, p/3 for regression

• More trees rarely hurt (diminishing returns after ~100)

• Use OOB error for model selection

• Feature importance provides good interpretability

Boosting Tips

• Start with weak learners (shallow trees)

• Lower learning rate with more estimators

• Monitor training/validation error to avoid overfitting

• More sensitive to noise than bagging

Troubleshooting

Common Issues

Memory issues: Reduce n_estimators or use n_jobs=1

Slow training: Use n_jobs=-1 for parallel processing

Overfitting in boosting: Reduce learning rate or max_depth

Poor Random Forest performance: Check max_features and min_samples_leaf

🌳 CART Decision Trees

Classification and Regression Trees for interpretable machine learning

Installation & Setup

Required imports
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

Core Concepts

CART Algorithm

Gini Impurity: G = 1 - Σ p_i²
Entropy: H = -Σ p_i log(p_i)
Information Gain: IG = H(parent) - Σ (n_i/n) H(child_i)
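
As a quick sanity check of these formulas, a small sketch computing Gini impurity and entropy for a toy label vector (numpy is assumed available in addition to the imports above):

Impurity computation (sketch)
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum(p_i^2)"""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(y):
    """Entropy: -sum(p_i * log2(p_i))"""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # 3 of class 0, 5 of class 1
print(gini(y), entropy(y))               # ~0.469, ~0.954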

Overfitting Prevention

Code Examples

CART Implementation
# Classification Tree
clf = DecisionTreeClassifier(
    criterion='gini',        # or 'entropy'
    max_depth=10,           # Limit tree depth
    min_samples_split=20,   # Min samples to split
    min_samples_leaf=10,    # Min samples in leaf
    random_state=42
)

# Visualization
plt.figure(figsize=(20, 10))
plot_tree(clf, 
         feature_names=feature_names,
         class_names=class_names,
         filled=True,
         rounded=True,
         max_depth=3)  # Show only first 3 levels

Best Practices

Hyperparameter Guidelines

max_depth: Start with 5-10, tune based on validation

min_samples_split: 2-20, higher for noisy data

min_samples_leaf: 1-10, higher prevents overfitting

max_features: sqrt(p) for classification, p/3 for regression (p = number of features)

Troubleshooting

Common Problems

Overfitting: Reduce max_depth, increase min_samples_leaf

Underfitting: Increase max_depth, reduce stopping criteria

Unbalanced trees: Check data distribution, consider class weights

Large trees hard to visualize: Use max_depth=3 in plot_tree

📡 Compressed Sensing & Medical Imaging

Sparse signal recovery from underdetermined systems

Installation & Setup

Required packages
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV
import scipy.io
import matplotlib.pyplot as plt

Core Concepts

Compressed Sensing Theory

Measurement Model: y = Ax + ε
Lasso: min ||y - Ax||₂² + λ||x||₁
Ridge: min ||y - Ax||₂² + λ||x||₂²

Lasso vs Ridge for Sparse Recovery

Code Examples

Compressed Sensing Implementation
# Generate a sparse ground-truth signal and random measurements
n_measurements, n_pixels = 1300, 2500
x_true = np.zeros(n_pixels)                                # sparse ground truth (illustrative)
support = np.random.choice(n_pixels, 100, replace=False)   # 100 nonzero entries
x_true[support] = np.random.normal(0, 10, 100)
A = np.random.normal(0, 1, (n_measurements, n_pixels))
noise = np.random.normal(0, 5, n_measurements)
y = A @ x_true + noise

# Lasso with cross-validation
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=10)
lasso_cv.fit(A, y)
x_lasso = lasso_cv.coef_

# Ridge with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 50), cv=10)
ridge_cv.fit(A, y)
x_ridge = ridge_cv.coef_
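
To compare the two recoveries under this setup, check reconstruction error and sparsity (x_true is the ground truth generated above):

Recovery comparison (sketch)
# Compare recovered signals against the ground truth
for name, x_hat in [('Lasso', x_lasso), ('Ridge', x_ridge)]:
    mse = np.mean((x_hat - x_true) ** 2)
    n_nonzero = np.sum(np.abs(x_hat) > 1e-6)
    print(f"{name}: MSE = {mse:.3f}, nonzero coefficients = {n_nonzero}")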

Best Practices

Compressed Sensing Guidelines

Measurement matrix: Use random Gaussian or Fourier measurements

Lambda selection: Use cross-validation or information criteria

Sparsity level: Ensure sufficient measurements (m ≥ 2s log p)

Noise handling: Account for noise level in regularization

Troubleshooting

Recovery Issues

Poor recovery: Check sparsity level and measurement count

Over-regularization: lambda too large, the recovered signal shrinks to all zeros

Under-regularization: Lambda too small, doesn't promote sparsity

Numerical issues: Scale features, add small regularization

⚡ AdaBoost Algorithm

Adaptive boosting for sequential weak learner combination

Installation & Setup

Required imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

Core Concepts

AdaBoost Algorithm Steps

  1. Initialize weights: D₁(i) = 1/n for all i
  2. For t = 1 to T:
    • Train weak learner hₜ with weights Dₜ
    • Calculate error: εₜ = Σ Dₜ(i) × I[hₜ(xᵢ) ≠ yᵢ]
    • Calculate alpha: αₜ = ½ ln((1-εₜ)/εₜ)
    • Update weights: Dₜ₊₁(i) = Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ)) / Zₜ
  3. Final classifier: H(x) = sign(Σ αₜhₜ(x))
Weight Update: Dₜ₊₁(i) = (Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ))) / Zₜ
Alpha: αₜ = ½ ln((1-εₜ)/εₜ)
Normalization: Zₜ = Σ Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ))

Code Examples

Manual AdaBoost Implementation
# Manual AdaBoost implementation (key parts)
def find_best_stump(X, y, weights):
    """Find the decision stump (feature, threshold, polarity) with lowest weighted error"""
    best_error = float('inf')
    best_stump = None
    
    for feature in range(X.shape[1]):              # For each feature
        thresholds = np.unique(X[:, feature])      # Candidate thresholds from the data
        for threshold in thresholds:
            for polarity in [1, -1]:
                predictions = np.where(X[:, feature] <= threshold, polarity, -polarity)
                error = np.sum(weights[predictions != y])
                
                if error < best_error:
                    best_error = error
                    best_stump = {'feature': feature, 'threshold': threshold,
                                  'polarity': polarity, 'predictions': predictions}
    
    return best_stump, best_error

# Weight updates
alpha = 0.5 * np.log((1 - error) / error)
Z = np.sum(D * np.exp(-alpha * y * predictions))
D_new = D * np.exp(-alpha * y * predictions) / Z
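
Tying the stump search and the weight updates together, a complete training loop might look like the following sketch (X and y are assumed given, with labels in {-1, +1}):

Full AdaBoost loop (sketch)
def adaboost_train(X, y, T=20):
    """Train T decision stumps with AdaBoost; y must be in {-1, +1}"""
    n = len(y)
    D = np.full(n, 1 / n)          # Initialize uniform weights D_1(i) = 1/n
    stumps, alphas = [], []

    for t in range(T):
        stump, error = find_best_stump(X, y, D)
        error = max(error, 1e-10)  # Guard against division by zero / log(0)
        alpha = 0.5 * np.log((1 - error) / error)

        # Reweight: misclassified points get more weight
        D = D * np.exp(-alpha * y * stump['predictions'])
        D = D / D.sum()            # Normalize (this sum is Z_t)

        stumps.append(stump)
        alphas.append(alpha)

    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """H(x) = sign(sum_t alpha_t * h_t(x))"""
    agg = np.zeros(len(X))
    for stump, alpha in zip(stumps, alphas):
        h = np.where(X[:, stump['feature']] <= stump['threshold'],
                     stump['polarity'], -stump['polarity'])
        agg += alpha * h
    return np.sign(agg)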

Best Practices

AdaBoost Guidelines

Weak learners: Use shallow trees (stumps or max_depth=1-3)

Number of iterations: Monitor training error, stop when it plateaus

Data quality: AdaBoost is sensitive to noise and outliers

Class balance: Works best with balanced datasets

Troubleshooting

Common Issues

Overfitting: Reduce number of estimators or use early stopping

Weak learners too strong: Reduce tree depth

Numerical instability: Check for very small errors (ε ≈ 0)

Poor performance: Ensure weak learners are actually weak but better than random

🌲 Random Forest

Ensemble of decision trees with bootstrap aggregating

Installation & Setup

Required packages
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

Core Concepts

Random Forest Components

Key Parameters

OOB Error: Each tree is evaluated on the ~1/3 of samples left out of its bootstrap sample, giving a built-in validation estimate
Feature Importance: Average decrease in impurity across all trees

Code Examples

Random Forest Implementation
# Random Forest with parameter tuning
rf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',     # √p for classification
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    oob_score=True,         # Out-of-bag error
    random_state=42,
    n_jobs=-1              # Parallel processing
)

# Parameter sensitivity analysis
n_features = X_train.shape[1]                 # number of features in the training data
max_features_range = [1, 2, 5, 10, int(np.sqrt(n_features)), n_features // 3]
oob_errors = []
test_errors = []

for max_feat in max_features_range:
    rf_temp = RandomForestClassifier(max_features=max_feat, oob_score=True)
    rf_temp.fit(X_train, y_train)
    oob_errors.append(1 - rf_temp.oob_score_)
    test_errors.append(1 - rf_temp.score(X_test, y_test))

Best Practices

Optimization Tips

max_features: √p for classification, p/3 for regression

n_estimators: Start with 100, increase until OOB error stabilizes

Tree depth: Don't limit unless overfitting (Random Forest handles this well)

Feature importance: Use for feature selection and interpretation (see the example below)
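
For the feature-importance tip, a minimal sketch after fitting the forest defined above (feature_names is assumed to be the list of column names):

Feature importance ranking (sketch)
# Rank features by mean decrease in impurity (feature_names assumed given)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True)
for name, imp in ranked[:10]:
    print(f"{name}: {imp:.3f}")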

Troubleshooting

Performance Issues

High bias: Increase n_estimators, reduce min_samples_leaf

High variance: Increase min_samples_split, limit max_depth

Slow training: Use n_jobs=-1, reduce n_estimators

Memory issues: Reduce max_depth or n_estimators

🎯 One-Class SVM

Novelty detection and anomaly identification

Installation & Setup

Required imports
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

Core Concepts

Novelty Detection Approach

Key Parameters

Decision Function: f(x) = sign(Σ αᵢ K(xᵢ, x) - ρ)
RBF Kernel: K(x, y) = exp(-γ||x - y||²)

Code Examples

One-Class SVM Implementation
# One-Class SVM setup
# Extract only normal examples for training
X_normal = X_train[y_train == 0]  # Only non-spam emails

# Feature scaling (important for SVM)
scaler = StandardScaler()
X_normal_scaled = scaler.fit_transform(X_normal)
X_test_scaled = scaler.transform(X_test)

# Hyperparameter tuning
nu_values = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5]
gamma_values = ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]

best_params = {}
best_accuracy = 0

for nu in nu_values:
    for gamma in gamma_values:
        ocsvm = OneClassSVM(nu=nu, gamma=gamma, kernel='rbf')
        ocsvm.fit(X_normal_scaled)
        
        predictions = ocsvm.predict(X_test_scaled)
        # Convert: +1 → 0 (normal), -1 → 1 (anomaly)
        y_pred = np.where(predictions == 1, 0, 1)
        accuracy = accuracy_score(y_test, y_pred)
        
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {'nu': nu, 'gamma': gamma}

Best Practices

One-Class SVM Guidelines

Feature scaling: Always scale features for SVM

Nu parameter: Start with expected outlier fraction

Gamma tuning: Use grid search with cross-validation

Evaluation: Focus on recall for the anomaly class (see the evaluation sketch below)
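
classification_report is imported in the setup but unused; a sketch of a final evaluation with the tuned parameters, using the 0 = normal / 1 = anomaly convention from the code above:

Evaluation with classification_report (sketch)
# Refit with the best hyperparameters and inspect per-class precision/recall
ocsvm = OneClassSVM(nu=best_params['nu'], gamma=best_params['gamma'], kernel='rbf')
ocsvm.fit(X_normal_scaled)
y_pred = np.where(ocsvm.predict(X_test_scaled) == 1, 0, 1)
print(classification_report(y_test, y_pred, target_names=['normal', 'anomaly']))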

Troubleshooting

Common Challenges

Poor anomaly detection: Adjust nu parameter, try different gamma

Too many false positives: Decrease nu value

Too many false negatives: Increase nu value

Training instability: Ensure sufficient normal examples

📈 Locally Weighted Linear Regression

Non-parametric regression with local model fitting

Installation & Setup

Required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold

Core Concepts

Mathematical Foundation

Objective: min Σ (yᵢ - β₀ - (x - xᵢ)ᵀβ₁)² Kₕ(x - xᵢ)
Solution: β̂ = (XᵀWX)⁻¹XᵀWY
Gaussian Kernel: Kₕ(z) = exp(-||z||²/(2h²))

Matrix Definitions

Code Examples

LWLR Implementation
def gaussian_kernel(z, h):
    """Gaussian kernel function"""
    return np.exp(-np.sum(z**2, axis=-1) / (2 * h**2))

def locally_weighted_regression(X_train, y_train, x_pred, h):
    """LWLR at single prediction point"""
    n, p = X_train.shape
    
    # Compute weights
    weights = gaussian_kernel(X_train - x_pred, h)
    
    # Design matrix: [1, (X_train - x_pred)]
    X_design = np.column_stack([np.ones(n), X_train - x_pred])
    
    # Weighted least squares
    W = np.diag(weights)
    XTW = X_design.T @ W
    beta = np.linalg.solve(XTW @ X_design, XTW @ y_train)
    
    # Prediction is β₀
    return beta[0]

# Cross-validation for bandwidth selection
def cross_validate_lwlr(X, y, h_values, cv_folds=5):
    kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
    cv_errors = []
    
    for h in h_values:
        fold_errors = []
        for train_idx, val_idx in kf.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]
            
            y_pred = [locally_weighted_regression(X_train, y_train, x, h) 
                     for x in X_val.flatten()]
            mse = np.mean((y_val - np.array(y_pred))**2)
            fold_errors.append(mse)
        
        cv_errors.append(np.mean(fold_errors))
    
    return np.array(cv_errors)
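
A minimal usage sketch on synthetic 1-D data, just to show the calling convention of the functions above (the data is entirely made up):

LWLR usage (sketch)
# Synthetic 1-D example
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 100)

# Pick a bandwidth by cross-validation
h_values = np.array([0.1, 0.3, 0.5, 1.0, 2.0])
cv_errors = cross_validate_lwlr(X, y, h_values)
h_best = h_values[np.argmin(cv_errors)]

# Evaluate the fit on a grid of query points
x_grid = np.linspace(0, 10, 200)
y_fit = [locally_weighted_regression(X, y, x, h_best) for x in x_grid]
# y_fit now holds the smoothed curve at x_grid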

Best Practices

LWLR Implementation Tips

Bandwidth selection: Use cross-validation for optimal h

Numerical stability: Add small regularization to XᵀWX

Computational efficiency: Pre-compute distances when possible

Kernel choice: Gaussian is most common, but others work

Troubleshooting

Implementation Issues

Singular matrix: Add regularization or increase bandwidth

Poor predictions: Check bandwidth selection and data quality

Slow computation: Use approximate methods for large datasets

Boundary effects: Consider different kernels or padding

⚖️ Bias-Variance Tradeoff

Understanding the fundamental tradeoff in machine learning

Installation & Setup

Required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

Core Concepts

Fundamental Decomposition

Total Error = Bias² + Variance + Irreducible Error
Bias: E[f̂(x)] - f(x)
Variance: E[(f̂(x) - E[f̂(x)])²]

Model Complexity Effects

Method-Specific Tradeoffs

Code Examples

Bias-Variance Analysis
# Validation curve to visualize bias-variance tradeoff
from sklearn.model_selection import validation_curve

# Example: Decision tree depth
param_range = np.arange(1, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), 
    X, y, param_name='max_depth', param_range=param_range,
    cv=5, scoring='accuracy'
)

# Plot bias-variance tradeoff
plt.figure(figsize=(10, 6))
plt.plot(param_range, np.mean(train_scores, axis=1), 'o-', label='Training Score')
plt.plot(param_range, np.mean(test_scores, axis=1), 'o-', label='Validation Score')
plt.xlabel('Max Depth (Model Complexity)')
plt.ylabel('Accuracy')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.grid(True)

# Bias-variance decomposition (conceptual)
def bias_variance_decomposition(true_function, model_predictions):
    """
    Conceptual bias-variance decomposition
    """
    mean_prediction = np.mean(model_predictions, axis=0)
    bias_squared = np.mean((mean_prediction - true_function)**2)
    variance = np.mean(np.var(model_predictions, axis=0))
    return bias_squared, variance
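
One way to feed bias_variance_decomposition is to resample training sets from a known true function; everything below is synthetic and purely illustrative:

Bias-variance estimate on synthetic data (sketch)
# Simulate many training sets, fit a model on each, and decompose the error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)
true_function = np.sin(2 * np.pi * x_test[:, 0])

model_predictions = []
for _ in range(200):
    x_tr = rng.uniform(0, 1, 80).reshape(-1, 1)
    y_tr = np.sin(2 * np.pi * x_tr[:, 0]) + rng.normal(0, 0.3, 80)
    tree = DecisionTreeRegressor(max_depth=3).fit(x_tr, y_tr)
    model_predictions.append(tree.predict(x_test))

bias_sq, variance = bias_variance_decomposition(true_function, np.array(model_predictions))
print(f"Bias^2: {bias_sq:.3f}, Variance: {variance:.3f}")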

Best Practices

Managing Bias-Variance Tradeoff

Cross-validation: Use for model selection and hyperparameter tuning

Regularization: Add penalty terms to control model complexity

Ensemble methods: Combine multiple models to reduce variance

Data size: More data generally reduces variance

Troubleshooting

Diagnosis and Solutions

High Bias: Increase model complexity, add features, reduce regularization

High Variance: Decrease complexity, add regularization, get more data

Both High: Review model choice and data quality

Validation: Use learning curves to diagnose the problem

✅ Cross-Validation Techniques

Robust model evaluation and selection methods

Installation & Setup

Required imports
import numpy as np
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold, 
    LeaveOneOut, cross_validate, GridSearchCV
)
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

Core Concepts

Types of Cross-Validation

Applications

K-Fold CV Error: CV = (1/k) Σ L(fᵢ, Dᵢ)
Standard Error: SE = √(Var(CV errors)/k)

Code Examples

Cross-Validation Implementation
# Basic cross-validation
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Stratified K-Fold for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

# Manual cross-validation loop
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_errors = []

for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    error = 1 - accuracy_score(y_val, y_pred)
    cv_errors.append(error)

cv_error = np.mean(cv_errors)
cv_std = np.std(cv_errors)

# Grid search with cross-validation
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

Best Practices

Cross-Validation Guidelines

Fold selection: 5-10 folds typically optimal (bias-variance tradeoff)

Stratification: Use for classification to maintain class balance

Shuffle data: Unless temporal dependencies exist

Nested CV: Use for unbiased model comparison with hyperparameter tuning (sketched below)
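
A concrete nested-CV sketch (X, y assumed given): the inner GridSearchCV tunes hyperparameters, the outer loop estimates generalization error:

Nested cross-validation (sketch)
# Nested cross-validation: inner GridSearchCV for tuning, outer CV for evaluation
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

tuned_model = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [3, 5, 7, 10]},
    cv=inner_cv, scoring='accuracy'
)
nested_scores = cross_val_score(tuned_model, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")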

Troubleshooting

Common Pitfalls

Data leakage: Ensure no information flows from test to train

Small datasets: Use LOO or stratified sampling

Imbalanced data: Use stratified CV and appropriate metrics

Time series: Use time-based splits, not random

🔧 Common Issues & Solutions

Solutions to frequent problems in data processing and machine learning

OpenRefine Issues

⚠️ "GREL Expression Error"

Problem: GREL syntax errors or unexpected results

Solutions:

• Test expressions on small data samples first

• Use parentheses to ensure proper order of operations

• Check for typos in function names (case-sensitive)

• Use value.toString() if working with numbers

⚠️ "Row Count Mismatch After Filtering"

Problem: Lost rows during data processing

Solutions:

• Check for case sensitivity in filters

• Verify GREL expressions don't return empty strings

• Use facets to inspect data before filtering

• Export history to track which operation caused loss

Python/Pandas Issues

⚠️ "KeyError: Column not found"

Problem: Referencing non-existent column

Solution
# Check available columns
print(df.columns.tolist())

# Safe column access
if 'column_name' in df.columns:
    result = df['column_name']
else:
    print("Column not found")

# Use .get() method for safe access
value = df.get('column_name', default_value)
⚠️ "SettingWithCopyWarning"

Problem: Pandas warning about chained assignments

Solution
# Avoid chained assignment
# Bad: df[df['A'] > 5]['B'] = 'new_value'

# Good: Use .loc
df.loc[df['A'] > 5, 'B'] = 'new_value'

# Or create explicit copy
subset = df[df['A'] > 5].copy()
subset['B'] = 'new_value'

Machine Learning Issues

⚠️ "Model Overfitting"

Problem: High training accuracy, poor test performance

Solutions:

• Reduce model complexity (less depth, fewer features)

• Add regularization (L1/L2 penalties)

• Use cross-validation for hyperparameter tuning

• Collect more training data

• Use ensemble methods to reduce variance

⚠️ "Poor Cross-Validation Results"

Problem: High variance in CV scores

Solutions:

• Check data distribution and class balance

• Use stratified CV for classification

• Increase number of CV folds

• Check for data leakage

• Ensure proper preprocessing

Flask Issues

⚠️ "Template Not Found"

Problem: Flask can't locate HTML templates

Solutions:

• Ensure templates are in templates/ folder

• Check file name spelling and case

• Verify Flask app initialization includes template folder

Correct Structure
app = Flask(__name__, template_folder='templates')

RegEx Debugging

💡 RegEx Testing Strategy

• Use online tools: regex101.com, regexr.com

• Test patterns incrementally (build complexity gradually)

• Print intermediate results to debug extraction

Debug RegEx Pattern
import re

def debug_regex(pattern, text):
    """Debug regex pattern with detailed output"""
    print(f"Pattern: {pattern}")
    print(f"Text: {text}")
    
    match = re.search(pattern, text)
    if match:
        print(f"Full match: '{match.group()}'")
        for i, group in enumerate(match.groups(), 1):
            print(f"Group {i}: '{group}'")
    else:
        print("No match found")
    
    # Test with findall for multiple matches
    all_matches = re.findall(pattern, text)
    print(f"All matches: {all_matches}")

# Usage
debug_regex(r'([A-Z][a-z]+)\s+(Airport)', 
           "Beijing Capital Airport and Shanghai Pudong Airport")

🚀 Future Learning Areas

Structured areas for expanding knowledge and skills

Advanced Machine Learning

🧠 Deep Learning

Add notes on: neural networks, TensorFlow/PyTorch, CNNs, RNNs, transformers, deep learning architectures

📊 Advanced ML Algorithms

Add tutorials for: SVM kernels, clustering algorithms, dimensionality reduction, reinforcement learning

Data Science & Analytics

📈 Data Visualization

Add tutorials for: matplotlib, seaborn, plotly, dashboard creation, interactive charts

🔍 Feature Engineering

Add content on: feature selection, feature creation, text processing, time series features

Web Development

🌐 Advanced Flask

Add content on: database integration, user authentication, API development, deployment strategies

⚛️ Frontend Technologies

Add learning notes for: JavaScript, React, HTML/CSS advanced techniques, responsive design

Database & SQL

🗄️ Database Design

Add tutorials on: SQL queries, database normalization, performance optimization, NoSQL databases

Tools & Automation

🔧 Development Tools

Add guides for: Git version control, Docker containerization, CI/CD pipelines, testing frameworks

☁️ Cloud Platforms

Add learning for: AWS, Azure, Google Cloud, serverless computing, cloud databases

How to Add New Learning Sections

📝 Template for New Topics

1. Create a new section with appropriate ID

2. Include: Overview, Installation/Setup, Core Concepts, Code Examples, Best Practices

3. Add practical examples and real-world use cases

4. Include troubleshooting section for common issues

5. Update table of contents with new section