Comprehensive Learning Reference

Data Science & ML Mastery Guide

Complete tutorials and reference notes for data processing, machine learning algorithms, OpenRefine, Flask development, Regular Expressions, and advanced ML concepts.

🔧 OpenRefine Data Cleaning Complete Guide

Master data cleaning, transformation, and quality improvement with OpenRefine

Installation & Setup

Step 1: Download OpenRefine

Download OpenRefine 3.6.2 from: https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2

Choose your OS version (Windows, Mac, Linux). The Windows and Mac packages for 3.6.2 come with an embedded Java runtime; on Linux, install Java separately.

Step 2: Launch OpenRefine

1. Extract the downloaded file

2. Run the executable (openrefine.exe on Windows)

3. Open browser and go to: http://127.0.0.1:3333

Project Creation & Data Import

Creating a New Project

1. Click "Create Project"

2. Choose "This Computer" → Browse for your CSV file

3. Click "Next" to preview data

4. Verify column headers and data types

5. Click "Create Project" (top right)

💡 Data Import Best Practices

• Always preview your data before creating the project

• Check if headers are properly detected

• Verify column separation (comma, tab, semicolon)

• Note any encoding issues with special characters

Core Data Cleaning Operations

1. Removing Blank/Null Values

Using Text Facets to Remove Blanks

1. Click on the column dropdown → Facet → Text facet

2. In the facet panel (left side), you'll see all unique values

3. Click on (blank) to select only blank rows

4. Click All → Edit rows → Remove all matching rows

5. Close the facet when done (a pandas equivalent is sketched below for comparison)
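
For comparison, the same blank-row cleanup can be scripted in pandas; this is a minimal sketch with placeholder file and column names:

Pandas equivalent (sketch)
import pandas as pd

# Load the raw export (file and column names are placeholders)
df = pd.read_csv('data.csv')

# Treat empty and whitespace-only cells as missing, then drop those rows
df['Subject'] = df['Subject'].replace(r'^\s*$', pd.NA, regex=True)
df_clean = df.dropna(subset=['Subject'])

print(f"Removed {len(df) - len(df_clean)} blank rows")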

2. Filtering Data

Text Filtering

1. Click column dropdown → Text filter

2. Enter your search term (e.g., "Airport")

3. Note: Filtering is case-insensitive by default; tick the "case sensitive" box in the filter panel for exact-case matching

4. To keep only matching rows: All → Edit rows → Remove all non-matching rows

3. Creating New Columns with GREL

Adding Columns Based on Existing Data

1. Click the source column dropdown → Edit column → Add column based on this column

2. Enter new column name

3. Write GREL expression in the expression box

4. Preview shows first few results

5. Click OK to create column

GREL (General Refine Expression Language) Reference

Essential GREL Functions

Function | Purpose | Example
contains(value, "text") | Check if value contains text | contains(value, "Airport")
value.match(/pattern/) | Extract text matching regex | value.match(/([A-Za-z\s]+Airport)/)
split(value, delimiter) | Split text into array | split(value, ",")
trim(value) | Remove leading/trailing spaces | trim(value)
if(condition, true_value, false_value) | Conditional logic | if(contains(value, "Airport"), "Yes", "No")

Real-World GREL Examples

Extract Airport Names
if(contains(value, "Airport"),
  value.match(/([A-Za-z\s]+Airport)/)[0].trim(),
  ""
)
Clean and Standardize Text
value.trim().replace(/\s+/, " ").toTitlecase()
Extract Text After Specific Word
if(contains(value, " at "),
  split(value, " at ")[1].trim(),
  value
)

Clustering & Merging Similar Values

Using Cluster and Edit Feature

1. Select the column with similar values

2. Go to Edit cells → Cluster and edit

3. Try different clustering methods:

Key collision → fingerprint: Basic similarity

Key collision → ngram-fingerprint: Handles typos

Nearest neighbor → levenshtein: Character differences

4. Review suggested clusters and choose merge values

5. Check Merge? for clusters you want to merge

6. Click Merge Selected & Re-Cluster

Exporting Results

Export Cleaned Data

1. Click Export (top right) → Comma-separated value

2. Save the cleaned CSV file

Export History (for Reproducibility)

1. Go to the Undo/Redo tab → Extract... → Export

2. Downloads history.json with all operations

3. Can be used to replay operations on similar datasets

⚠️ Common Mistakes to Avoid

• Always back up original data before major operations

• Test GREL expressions on small datasets first

• Check for case sensitivity in filters and matching

• Verify row counts after filtering operations

🌐 Flask Web Development Complete Tutorial

Build data-driven web applications with Python Flask

Flask Installation & Setup

Install Flask
Terminal/Command Prompt
# Install Flask
pip install Flask

# Optional: Create virtual environment first
python -m venv flask_env
# Windows: flask_env\Scripts\activate
# Mac/Linux: source flask_env/bin/activate
pip install Flask

Basic Flask Application Structure

project_folder/
├── run.py                 # Main application runner
├── data_processing.py     # Data loading / wrangling helpers
├── flaskapp/
│   ├── __init__.py       # Flask app initialization
│   ├── routes.py         # URL routes and view functions
│   └── templates/
│       └── index.html    # HTML templates
└── data/
    └── dataset.csv       # Data files

1. Application Runner (run.py)

run.py
""" run.py - Run the Flask app """
from flaskapp import app

if __name__ == '__main__':
    # Start the Flask development server
    app.run(host='127.0.0.1', port=3001, debug=True)

2. Flask App Initialization

flaskapp/__init__.py
from flask import Flask

# Create Flask application instance
app = Flask(__name__)

# Import routes (must be after app creation)
from flaskapp import routes

3. Data Processing Module

data_processing.py
import csv

def username():
    return 'your_username'

def data_wrangling(filter_class=None):
    """
    Process CSV data and return formatted results
    
    Args:
        filter_class (str): Optional filter for data category
    
    Returns:
        tuple: (header, table_data, dropdown_options)
    """
    with open('data/dataset.csv', 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        table = []
        all_classes = set()
        
        # Read header row
        header = next(reader)
        
        # Read and process data rows
        for row in reader:
            # Convert numeric columns
            processed_row = [
                row[0],                    # species (string)
                row[1],                    # class (string) 
                int(row[2]) if row[2] else 0  # count (integer)
            ]
            table.append(processed_row)
            all_classes.add(row[1])  # Collect unique classes
        
        # Create sorted dropdown options
        dropdown_options = sorted(list(all_classes))
        
        # Apply filtering if specified
        if filter_class:
            table = [row for row in table if row[1] == filter_class]
        
        # Sort by count (descending) and limit to top 10
        table.sort(key=lambda x: x[2], reverse=True)
        table = table[:10]
    
    return header, table, dropdown_options

4. Routes and View Functions

flaskapp/routes.py
from flask import render_template, request
from flaskapp import app
from data_processing import data_wrangling

@app.route('/')
@app.route('/index')
def index():
    """Main page route"""
    # Get filter parameter from URL
    filter_class = request.args.get('filter_class')
    
    # Process data with optional filter
    header, table, dropdown_options = data_wrangling(filter_class)
    
    # Render template with data
    return render_template('index.html',
                         header=header,
                         table=table,
                         dropdown_options=dropdown_options,
                         current_filter=filter_class)

@app.route('/api/data')
def api_data():
    """API endpoint for JSON data"""
    filter_class = request.args.get('filter_class')
    header, table, dropdown_options = data_wrangling(filter_class)
    
    return {
        'header': header,
        'data': table,
        'filters': dropdown_options,
        'count': len(table)
    }

Running Your Flask Application

Start the Development Server
Terminal
# Navigate to your project folder
cd your_project_folder

# Run the application
python run.py

# Output should show:
# * Running on http://127.0.0.1:3001

Open your browser and go to: http://127.0.0.1:3001

Flask Development Best Practices

💡 Essential Tips

• Use debug=True for development (auto-reload on changes)

• Organize code into modules (routes, models, utilities)

• Use templates for all HTML (avoid HTML in Python code)

• Handle errors gracefully with try/except blocks and Flask error handlers (see the sketch below)

• Use environment variables for configuration
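
As an example of the error-handling tip above, here is a minimal sketch for routes.py; the /report route and the 404.html template are illustrative, not part of the tutorial app:

Error handling sketch
from flask import render_template
from flaskapp import app
from data_processing import data_wrangling

@app.errorhandler(404)
def page_not_found(e):
    # Render a friendly page instead of Flask's default 404 (404.html is assumed to exist)
    return render_template('404.html'), 404

@app.route('/report')
def report():
    try:
        header, table, dropdown_options = data_wrangling()
    except FileNotFoundError:
        # Fail gracefully if the CSV is missing instead of showing a stack trace
        return "Data file not found", 500
    return render_template('index.html', header=header, table=table,
                           dropdown_options=dropdown_options, current_filter=None)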

🔍 Regular Expressions Complete Reference

Master pattern matching for text processing and data extraction

RegEx Fundamentals

Basic Syntax Elements

Pattern | Description | Example | Matches
. | Any single character | a.c | abc, axc, a1c
* | Zero or more of preceding | ab*c | ac, abc, abbc
+ | One or more of preceding | ab+c | abc, abbc (not ac)
? | Zero or one of preceding | ab?c | ac, abc
^ | Start of string | ^Hello | "Hello world"
$ | End of string | world$ | "Hello world"

Data Extraction Patterns

Airport Name Extraction
# Python
import re

# Basic airport pattern
pattern = r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+((?:International\s+)?Airport)\b'

text = "77kg of sandalwood at the Chhatrapati Shivaji Maharaj International Airport"
match = re.search(pattern, text)
if match:
    airport_name = f"{match.group(1)} {match.group(2)}"
    print(airport_name)  # "Chhatrapati Shivaji Maharaj International Airport"
Email Address Extraction
# Basic email pattern
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

text = "Contact us at support@example.com or admin@site.org"
emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'admin@site.org']

Python RegEx Functions

Function | Purpose | Returns
re.search(pattern, text) | Find first match | Match object or None
re.findall(pattern, text) | Find all matches | List of strings
re.sub(pattern, replacement, text) | Replace matches | Modified string
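
A short example exercising all three functions on made-up text:

Python RegEx functions in action
import re

text = "Order 123 shipped, order 456 pending"

print(re.search(r'\d+', text).group())   # '123'  (first match only)
print(re.findall(r'\d+', text))          # ['123', '456']
print(re.sub(r'\d+', '#', text))         # 'Order # shipped, order # pending'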

🐍 Python Data Processing Patterns

Essential Python techniques for data manipulation and analysis

Pandas Data Manipulation

Essential Pandas Operations

Data Loading and Basic Operations
import pandas as pd
import numpy as np

# Load CSV data
df = pd.read_csv('data.csv')

# Basic data exploration
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
print(f"Missing values:\n{df.isnull().sum()}")

# Display first few rows
print(df.head())

# Basic statistics
print(df.describe())

Data Cleaning Patterns

Remove Duplicates and Handle Missing Data
# Remove duplicate rows
df_clean = df.drop_duplicates()

# Remove rows with missing values in specific columns
df_clean = df.dropna(subset=['important_column'])

# Fill missing values
df['column'] = df['column'].fillna('default_value')
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())

# Remove rows where column is empty string
df_clean = df[df['column'].str.strip() != '']

Filtering and Querying

Advanced Filtering Techniques
# Basic filtering
airport_data = df[df['Subject'].str.contains('Airport', case=True, na=False)]

# Multiple conditions
filtered = df[
    (df['Count'] > 10) & 
    (df['Country'].isin(['China', 'India'])) &
    (df['Date'].str.contains('2024'))
]

# Using query method (more readable for complex conditions)
result = df.query("Count > 10 and Country in ['China', 'India']")

# String operations
df['clean_name'] = df['name'].str.strip().str.title()
has_keyword = df[df['description'].str.contains('wildlife', case=False, na=False)]

🌲 Ensemble Methods (Boosting vs Bagging)

Understanding parallel vs sequential ensemble learning approaches

Installation & Setup

Required packages
# Required packages
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

Core Concepts

Bagging (Bootstrap Aggregating)

Boosting

Bagging: f(x) = (1/M) Σ f_m(x)
Boosting: f(x) = Σ α_m h_m(x)

Code Examples

Implementation Examples
# Random Forest (Bagging)
rf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',  # Random feature selection
    bootstrap=True,       # Bootstrap sampling
    oob_score=True       # Out-of-bag error estimation
)

# AdaBoost (Boosting)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # named base_estimator in scikit-learn < 1.2
    n_estimators=50,
    learning_rate=1.0
)
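
cross_val_score is imported in the setup above but not used; a quick comparison of the two ensembles might look like this sketch (X and y are assumed to be an existing feature matrix and label vector):

Comparing the ensembles (sketch)
# Compare bagging vs boosting with 5-fold cross-validation (X, y assumed given)
for name, model in [('Random Forest', rf), ('AdaBoost', ada)]:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")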

Best Practices

Random Forest Tips

• Use √p features for classification, p/3 for regression

• More trees rarely hurt (diminishing returns after ~100)

• Use OOB error for model selection

• Feature importance provides good interpretability

Boosting Tips

• Start with weak learners (shallow trees)

• Lower learning rate with more estimators

• Monitor training/validation error to avoid overfitting

• More sensitive to noise than bagging

Troubleshooting

Common Issues

Memory issues: Reduce n_estimators or use n_jobs=1

Slow training: Use n_jobs=-1 for parallel processing

Overfitting in boosting: Reduce learning rate or max_depth

Poor Random Forest performance: Check max_features and min_samples_leaf

🌳 CART Decision Trees

Classification and Regression Trees for interpretable machine learning

Installation & Setup

Required imports
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

Core Concepts

CART Algorithm

Gini Impurity: G = 1 - Σ p_i²
Entropy: H = -Σ p_i log(p_i)
Information Gain: IG = H(parent) - Σ (n_i/n) H(child_i)
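
As a quick sanity check of these formulas, a small sketch computing Gini impurity and entropy for a toy label vector (numpy is assumed available in addition to the imports above):

Impurity computation (sketch)
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum(p_i^2)"""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(y):
    """Entropy: -sum(p_i * log2(p_i))"""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # 3 of class 0, 5 of class 1
print(gini(y), entropy(y))               # ~0.469, ~0.954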

Overfitting Prevention

Code Examples

CART Implementation
# Classification Tree
clf = DecisionTreeClassifier(
    criterion='gini',        # or 'entropy'
    max_depth=10,           # Limit tree depth
    min_samples_split=20,   # Min samples to split
    min_samples_leaf=10,    # Min samples in leaf
    random_state=42
)

# Visualization
plt.figure(figsize=(20, 10))
plot_tree(clf, 
         feature_names=feature_names,
         class_names=class_names,
         filled=True,
         rounded=True,
         max_depth=3)  # Show only first 3 levels

Best Practices

Hyperparameter Guidelines

max_depth: Start with 5-10, tune based on validation

min_samples_split: 2-20, higher for noisy data

min_samples_leaf: 1-10, higher prevents overfitting

max_features: sqrt(p) for classification, p/3 for regression (p = number of features)

Troubleshooting

Common Problems

Overfitting: Reduce max_depth, increase min_samples_leaf

Underfitting: Increase max_depth, reduce stopping criteria

Unbalanced trees: Check data distribution, consider class weights

Large trees hard to visualize: Use max_depth=3 in plot_tree

📡 Compressed Sensing & Medical Imaging

Sparse signal recovery from underdetermined systems

Installation & Setup

Required packages
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV
import scipy.io
import matplotlib.pyplot as plt

Core Concepts

Compressed Sensing Theory

Measurement Model: y = Ax + ε
Lasso: min ||y - Ax||₂² + λ||x||₁
Ridge: min ||y - Ax||₂² + λ||x||₂²

Lasso vs Ridge for Sparse Recovery

Code Examples

Compressed Sensing Implementation
# Generate a sparse ground-truth signal and random measurements
n_measurements, n_pixels = 1300, 2500
x_true = np.zeros(n_pixels)                                # sparse ground truth (illustrative)
support = np.random.choice(n_pixels, 100, replace=False)   # 100 nonzero entries
x_true[support] = np.random.normal(0, 10, 100)
A = np.random.normal(0, 1, (n_measurements, n_pixels))
noise = np.random.normal(0, 5, n_measurements)
y = A @ x_true + noise

# Lasso with cross-validation
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=10)
lasso_cv.fit(A, y)
x_lasso = lasso_cv.coef_

# Ridge with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 50), cv=10)
ridge_cv.fit(A, y)
x_ridge = ridge_cv.coef_
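
To compare the two recoveries under this setup, check reconstruction error and sparsity (x_true is the ground truth generated above):

Recovery comparison (sketch)
# Compare recovered signals against the ground truth
for name, x_hat in [('Lasso', x_lasso), ('Ridge', x_ridge)]:
    mse = np.mean((x_hat - x_true) ** 2)
    n_nonzero = np.sum(np.abs(x_hat) > 1e-6)
    print(f"{name}: MSE = {mse:.3f}, nonzero coefficients = {n_nonzero}")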

Best Practices

Compressed Sensing Guidelines

Measurement matrix: Use random Gaussian or Fourier measurements

Lambda selection: Use cross-validation or information criteria

Sparsity level: Ensure sufficient measurements (m ≥ 2s log p)

Noise handling: Account for noise level in regularization

Troubleshooting

Recovery Issues

Poor recovery: Check sparsity level and measurement count

Over-regularization: lambda too large, the recovered signal shrinks to all zeros

Under-regularization: Lambda too small, doesn't promote sparsity

Numerical issues: Scale features, add small regularization

⚡ AdaBoost Algorithm

Adaptive boosting for sequential weak learner combination

Installation & Setup

Required imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

Core Concepts

AdaBoost Algorithm Steps

  1. Initialize weights: D₁(i) = 1/n for all i
  2. For t = 1 to T:
    • Train weak learner hₜ with weights Dₜ
    • Calculate error: εₜ = Σ Dₜ(i) × I[hₜ(xᵢ) ≠ yᵢ]
    • Calculate alpha: αₜ = ½ ln((1-εₜ)/εₜ)
    • Update weights: Dₜ₊₁(i) = Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ)) / Zₜ
  3. Final classifier: H(x) = sign(Σ αₜhₜ(x))
Weight Update: Dₜ₊₁(i) = (Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ))) / Zₜ
Alpha: αₜ = ½ ln((1-εₜ)/εₜ)
Normalization: Zₜ = Σ Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ))

Code Examples

Manual AdaBoost Implementation
# Manual AdaBoost implementation (key parts)
def find_best_stump(X, y, weights):
    """Find the decision stump (feature, threshold, polarity) with lowest weighted error"""
    best_error = float('inf')
    best_stump = None
    
    for feature in range(X.shape[1]):              # For each feature
        thresholds = np.unique(X[:, feature])      # Candidate thresholds from the data
        for threshold in thresholds:
            for polarity in [1, -1]:
                predictions = np.where(X[:, feature] <= threshold, polarity, -polarity)
                error = np.sum(weights[predictions != y])
                
                if error < best_error:
                    best_error = error
                    best_stump = {'feature': feature, 'threshold': threshold,
                                  'polarity': polarity, 'predictions': predictions}
    
    return best_stump, best_error

# Weight updates
alpha = 0.5 * np.log((1 - error) / error)
Z = np.sum(D * np.exp(-alpha * y * predictions))
D_new = D * np.exp(-alpha * y * predictions) / Z
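
Tying the stump search and the weight updates together, a complete training loop might look like the following sketch (X and y are assumed given, with labels in {-1, +1}):

Full AdaBoost loop (sketch)
def adaboost_train(X, y, T=20):
    """Train T decision stumps with AdaBoost; y must be in {-1, +1}"""
    n = len(y)
    D = np.full(n, 1 / n)          # Initialize uniform weights D_1(i) = 1/n
    stumps, alphas = [], []

    for t in range(T):
        stump, error = find_best_stump(X, y, D)
        error = max(error, 1e-10)  # Guard against division by zero / log(0)
        alpha = 0.5 * np.log((1 - error) / error)

        # Reweight: misclassified points get more weight
        D = D * np.exp(-alpha * y * stump['predictions'])
        D = D / D.sum()            # Normalize (this sum is Z_t)

        stumps.append(stump)
        alphas.append(alpha)

    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """H(x) = sign(sum_t alpha_t * h_t(x))"""
    agg = np.zeros(len(X))
    for stump, alpha in zip(stumps, alphas):
        h = np.where(X[:, stump['feature']] <= stump['threshold'],
                     stump['polarity'], -stump['polarity'])
        agg += alpha * h
    return np.sign(agg)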

Best Practices

AdaBoost Guidelines

Weak learners: Use shallow trees (stumps or max_depth=1-3)

Number of iterations: Monitor training error, stop when it plateaus

Data quality: AdaBoost is sensitive to noise and outliers

Class balance: Works best with balanced datasets

Troubleshooting

Common Issues

Overfitting: Reduce number of estimators or use early stopping

Weak learners too strong: Reduce tree depth

Numerical instability: Check for very small errors (ε ≈ 0)

Poor performance: Ensure weak learners are actually weak but better than random

🌲 Random Forest

Ensemble of decision trees with bootstrap aggregating

Installation & Setup

Required packages
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

Core Concepts

Random Forest Components

Key Parameters

OOB Error: Each tree is evaluated on the ~1/3 of samples left out of its bootstrap sample, giving a built-in validation estimate
Feature Importance: Average decrease in impurity across all trees

Code Examples

Random Forest Implementation
# Random Forest with parameter tuning
rf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',     # √p for classification
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    oob_score=True,         # Out-of-bag error
    random_state=42,
    n_jobs=-1              # Parallel processing
)

# Parameter sensitivity analysis
n_features = X_train.shape[1]                 # number of features in the training data
max_features_range = [1, 2, 5, 10, int(np.sqrt(n_features)), n_features // 3]
oob_errors = []
test_errors = []

for max_feat in max_features_range:
    rf_temp = RandomForestClassifier(max_features=max_feat, oob_score=True)
    rf_temp.fit(X_train, y_train)
    oob_errors.append(1 - rf_temp.oob_score_)
    test_errors.append(1 - rf_temp.score(X_test, y_test))

Best Practices

Optimization Tips

max_features: √p for classification, p/3 for regression

n_estimators: Start with 100, increase until OOB error stabilizes

Tree depth: Don't limit unless overfitting (Random Forest handles this well)

Feature importance: Use for feature selection and interpretation (see the example below)
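
For the feature-importance tip, a minimal sketch after fitting the forest defined above (feature_names is assumed to be the list of column names):

Feature importance ranking (sketch)
# Rank features by mean decrease in impurity (feature_names assumed given)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True)
for name, imp in ranked[:10]:
    print(f"{name}: {imp:.3f}")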

Troubleshooting

Performance Issues

High bias: Increase n_estimators, reduce min_samples_leaf

High variance: Increase min_samples_split, limit max_depth

Slow training: Use n_jobs=-1, reduce n_estimators

Memory issues: Reduce max_depth or n_estimators

🎯 One-Class SVM

Novelty detection and anomaly identification

Installation & Setup

Required imports
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

Core Concepts

Novelty Detection Approach

Key Parameters

Decision Function: f(x) = sign(Σ αᵢ K(xᵢ, x) - ρ)
RBF Kernel: K(x, y) = exp(-γ||x - y||²)

Code Examples

One-Class SVM Implementation
# One-Class SVM setup
# Extract only normal examples for training
X_normal = X_train[y_train == 0]  # Only non-spam emails

# Feature scaling (important for SVM)
scaler = StandardScaler()
X_normal_scaled = scaler.fit_transform(X_normal)
X_test_scaled = scaler.transform(X_test)

# Hyperparameter tuning
nu_values = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5]
gamma_values = ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]

best_params = {}
best_accuracy = 0

for nu in nu_values:
    for gamma in gamma_values:
        ocsvm = OneClassSVM(nu=nu, gamma=gamma, kernel='rbf')
        ocsvm.fit(X_normal_scaled)
        
        predictions = ocsvm.predict(X_test_scaled)
        # Convert: +1 → 0 (normal), -1 → 1 (anomaly)
        y_pred = np.where(predictions == 1, 0, 1)
        accuracy = accuracy_score(y_test, y_pred)
        
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {'nu': nu, 'gamma': gamma}

Best Practices

One-Class SVM Guidelines

Feature scaling: Always scale features for SVM

Nu parameter: Start with expected outlier fraction

Gamma tuning: Use grid search with cross-validation

Evaluation: Focus on recall for the anomaly class (see the evaluation sketch below)
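
classification_report is imported in the setup but unused; a sketch of a final evaluation with the tuned parameters, using the 0 = normal / 1 = anomaly convention from the code above:

Evaluation with classification_report (sketch)
# Refit with the best hyperparameters and inspect per-class precision/recall
ocsvm = OneClassSVM(nu=best_params['nu'], gamma=best_params['gamma'], kernel='rbf')
ocsvm.fit(X_normal_scaled)
y_pred = np.where(ocsvm.predict(X_test_scaled) == 1, 0, 1)
print(classification_report(y_test, y_pred, target_names=['normal', 'anomaly']))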

Troubleshooting

Common Challenges

Poor anomaly detection: Adjust nu parameter, try different gamma

Too many false positives: Decrease nu value

Too many false negatives: Increase nu value

Training instability: Ensure sufficient normal examples

📈 Locally Weighted Linear Regression

Non-parametric regression with local model fitting

Installation & Setup

Required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold

Core Concepts

Mathematical Foundation

Objective: min Σ (yᵢ - β₀ - (x - xᵢ)ᵀβ₁)² Kₕ(x - xᵢ)
Solution: β̂ = (XᵀWX)⁻¹XᵀWY
Gaussian Kernel: Kₕ(z) = exp(-||z||²/(2h²))

Matrix Definitions

Code Examples

LWLR Implementation
def gaussian_kernel(z, h):
    """Gaussian kernel function"""
    return np.exp(-np.sum(z**2, axis=-1) / (2 * h**2))

def locally_weighted_regression(X_train, y_train, x_pred, h):
    """LWLR at single prediction point"""
    n, p = X_train.shape
    
    # Compute weights
    weights = gaussian_kernel(X_train - x_pred, h)
    
    # Design matrix: [1, (X_train - x_pred)]
    X_design = np.column_stack([np.ones(n), X_train - x_pred])
    
    # Weighted least squares
    W = np.diag(weights)
    XTW = X_design.T @ W
    beta = np.linalg.solve(XTW @ X_design, XTW @ y_train)
    
    # Prediction is β₀
    return beta[0]

# Cross-validation for bandwidth selection
def cross_validate_lwlr(X, y, h_values, cv_folds=5):
    kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
    cv_errors = []
    
    for h in h_values:
        fold_errors = []
        for train_idx, val_idx in kf.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]
            
            y_pred = [locally_weighted_regression(X_train, y_train, x, h) 
                     for x in X_val.flatten()]
            mse = np.mean((y_val - np.array(y_pred))**2)
            fold_errors.append(mse)
        
        cv_errors.append(np.mean(fold_errors))
    
    return np.array(cv_errors)
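
A minimal usage sketch on synthetic 1-D data, just to show the calling convention of the functions above (the data is entirely made up):

LWLR usage (sketch)
# Synthetic 1-D example
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 100)

# Pick a bandwidth by cross-validation
h_values = np.array([0.1, 0.3, 0.5, 1.0, 2.0])
cv_errors = cross_validate_lwlr(X, y, h_values)
h_best = h_values[np.argmin(cv_errors)]

# Evaluate the fit on a grid of query points
x_grid = np.linspace(0, 10, 200)
y_fit = [locally_weighted_regression(X, y, x, h_best) for x in x_grid]
# y_fit now holds the smoothed curve at x_grid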

Best Practices

LWLR Implementation Tips

Bandwidth selection: Use cross-validation for optimal h

Numerical stability: Add small regularization to XᵀWX

Computational efficiency: Pre-compute distances when possible

Kernel choice: Gaussian is most common, but others work

Troubleshooting

Implementation Issues

Singular matrix: Add regularization or increase bandwidth

Poor predictions: Check bandwidth selection and data quality

Slow computation: Use approximate methods for large datasets

Boundary effects: Consider different kernels or padding

⚖️ Bias-Variance Tradeoff

Understanding the fundamental tradeoff in machine learning

Installation & Setup

Required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

Core Concepts

Fundamental Decomposition

Total Error = Bias² + Variance + Irreducible Error
Bias: E[f̂(x)] - f(x)
Variance: E[(f̂(x) - E[f̂(x)])²]

Model Complexity Effects

Method-Specific Tradeoffs

Code Examples

Bias-Variance Analysis
# Validation curve to visualize bias-variance tradeoff
from sklearn.model_selection import validation_curve

# Example: Decision tree depth
param_range = np.arange(1, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), 
    X, y, param_name='max_depth', param_range=param_range,
    cv=5, scoring='accuracy'
)

# Plot bias-variance tradeoff
plt.figure(figsize=(10, 6))
plt.plot(param_range, np.mean(train_scores, axis=1), 'o-', label='Training Score')
plt.plot(param_range, np.mean(test_scores, axis=1), 'o-', label='Validation Score')
plt.xlabel('Max Depth (Model Complexity)')
plt.ylabel('Accuracy')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.grid(True)

# Bias-variance decomposition (conceptual)
def bias_variance_decomposition(true_function, model_predictions):
    """
    Conceptual bias-variance decomposition
    """
    mean_prediction = np.mean(model_predictions, axis=0)
    bias_squared = np.mean((mean_prediction - true_function)**2)
    variance = np.mean(np.var(model_predictions, axis=0))
    return bias_squared, variance
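
One way to feed bias_variance_decomposition is to resample training sets from a known true function; everything below is synthetic and purely illustrative:

Bias-variance estimate on synthetic data (sketch)
# Simulate many training sets, fit a model on each, and decompose the error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)
true_function = np.sin(2 * np.pi * x_test[:, 0])

model_predictions = []
for _ in range(200):
    x_tr = rng.uniform(0, 1, 80).reshape(-1, 1)
    y_tr = np.sin(2 * np.pi * x_tr[:, 0]) + rng.normal(0, 0.3, 80)
    tree = DecisionTreeRegressor(max_depth=3).fit(x_tr, y_tr)
    model_predictions.append(tree.predict(x_test))

bias_sq, variance = bias_variance_decomposition(true_function, np.array(model_predictions))
print(f"Bias^2: {bias_sq:.3f}, Variance: {variance:.3f}")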

Best Practices

Managing Bias-Variance Tradeoff

Cross-validation: Use for model selection and hyperparameter tuning

Regularization: Add penalty terms to control model complexity

Ensemble methods: Combine multiple models to reduce variance

Data size: More data generally reduces variance

Troubleshooting

Diagnosis and Solutions

High Bias: Increase model complexity, add features, reduce regularization

High Variance: Decrease complexity, add regularization, get more data

Both High: Review model choice and data quality

Validation: Use learning curves to diagnose the problem

✅ Cross-Validation Techniques

Robust model evaluation and selection methods

Installation & Setup

Required imports
import numpy as np
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold, 
    LeaveOneOut, cross_validate, GridSearchCV
)
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

Core Concepts

Types of Cross-Validation

Applications

K-Fold CV Error: CV = (1/k) Σ L(fᵢ, Dᵢ)
Standard Error: SE = √(Var(CV errors)/k)

Code Examples

Cross-Validation Implementation
# Basic cross-validation
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Stratified K-Fold for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

# Manual cross-validation loop
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_errors = []

for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    error = 1 - accuracy_score(y_val, y_pred)
    cv_errors.append(error)

cv_error = np.mean(cv_errors)
cv_std = np.std(cv_errors)

# Grid search with cross-validation
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

Best Practices

Cross-Validation Guidelines

Fold selection: 5-10 folds typically optimal (bias-variance tradeoff)

Stratification: Use for classification to maintain class balance

Shuffle data: Unless temporal dependencies exist

Nested CV: Use for unbiased model comparison with hyperparameter tuning (sketched below)
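
A concrete nested-CV sketch (X, y assumed given): the inner GridSearchCV tunes hyperparameters, the outer loop estimates generalization error:

Nested cross-validation (sketch)
# Nested cross-validation: inner GridSearchCV for tuning, outer CV for evaluation
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

tuned_model = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [3, 5, 7, 10]},
    cv=inner_cv, scoring='accuracy'
)
nested_scores = cross_val_score(tuned_model, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")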

Troubleshooting

Common Pitfalls

Data leakage: Ensure no information flows from test to train

Small datasets: Use LOO or stratified sampling

Imbalanced data: Use stratified CV and appropriate metrics

Time series: Use time-based splits, not random

🔧 Common Issues & Solutions

Solutions to frequent problems in data processing and machine learning

OpenRefine Issues

⚠️ "GREL Expression Error"

Problem: GREL syntax errors or unexpected results

Solutions:

• Test expressions on small data samples first

• Use parentheses to ensure proper order of operations

• Check for typos in function names (case-sensitive)

• Use value.toString() if working with numbers

⚠️ "Row Count Mismatch After Filtering"

Problem: Lost rows during data processing

Solutions:

• Check for case sensitivity in filters

• Verify GREL expressions don't return empty strings

• Use facets to inspect data before filtering

• Export history to track which operation caused loss

Python/Pandas Issues

⚠️ "KeyError: Column not found"

Problem: Referencing non-existent column

Solution
# Check available columns
print(df.columns.tolist())

# Safe column access
if 'column_name' in df.columns:
    result = df['column_name']
else:
    print("Column not found")

# Use .get() method for safe access
value = df.get('column_name', default_value)
⚠️ "SettingWithCopyWarning"

Problem: Pandas warning about chained assignments

Solution
# Avoid chained assignment
# Bad: df[df['A'] > 5]['B'] = 'new_value'

# Good: Use .loc
df.loc[df['A'] > 5, 'B'] = 'new_value'

# Or create explicit copy
subset = df[df['A'] > 5].copy()
subset['B'] = 'new_value'

Machine Learning Issues

⚠️ "Model Overfitting"

Problem: High training accuracy, poor test performance

Solutions:

• Reduce model complexity (less depth, fewer features)

• Add regularization (L1/L2 penalties)

• Use cross-validation for hyperparameter tuning

• Collect more training data

• Use ensemble methods to reduce variance

⚠️ "Poor Cross-Validation Results"

Problem: High variance in CV scores

Solutions:

• Check data distribution and class balance

• Use stratified CV for classification

• Increase number of CV folds

• Check for data leakage

• Ensure proper preprocessing

Flask Issues

⚠️ "Template Not Found"

Problem: Flask can't locate HTML templates

Solutions:

• Ensure templates are in templates/ folder

• Check file name spelling and case

• Verify Flask app initialization includes template folder

Correct Structure
app = Flask(__name__, template_folder='templates')

RegEx Debugging

💡 RegEx Testing Strategy

• Use online tools: regex101.com, regexr.com

• Test patterns incrementally (build complexity gradually)

• Print intermediate results to debug extraction

Debug RegEx Pattern
import re

def debug_regex(pattern, text):
    """Debug regex pattern with detailed output"""
    print(f"Pattern: {pattern}")
    print(f"Text: {text}")
    
    match = re.search(pattern, text)
    if match:
        print(f"Full match: '{match.group()}'")
        for i, group in enumerate(match.groups(), 1):
            print(f"Group {i}: '{group}'")
    else:
        print("No match found")
    
    # Test with findall for multiple matches
    all_matches = re.findall(pattern, text)
    print(f"All matches: {all_matches}")

# Usage
debug_regex(r'([A-Z][a-z]+)\s+(Airport)', 
           "Beijing Capital Airport and Shanghai Pudong Airport")

🚀 Future Learning Areas

Structured areas for expanding knowledge and skills

Advanced Machine Learning

🧠 Deep Learning

Add notes on: neural networks, TensorFlow/PyTorch, CNNs, RNNs, transformers, deep learning architectures

📊 Advanced ML Algorithms

Add tutorials for: SVM kernels, clustering algorithms, dimensionality reduction, reinforcement learning

Data Science & Analytics

📈 Data Visualization

Add tutorials for: matplotlib, seaborn, plotly, dashboard creation, interactive charts

🔍 Feature Engineering

Add content on: feature selection, feature creation, text processing, time series features

Web Development

🌐 Advanced Flask

Add content on: database integration, user authentication, API development, deployment strategies

⚛️ Frontend Technologies

Add learning notes for: JavaScript, React, HTML/CSS advanced techniques, responsive design

Database & SQL

🗄️ Database Design

Add tutorials on: SQL queries, database normalization, performance optimization, NoSQL databases

Tools & Automation

🔧 Development Tools

Add guides for: Git version control, Docker containerization, CI/CD pipelines, testing frameworks

☁️ Cloud Platforms

Add learning for: AWS, Azure, Google Cloud, serverless computing, cloud databases

How to Add New Learning Sections

📝 Template for New Topics

1. Create a new section with appropriate ID

2. Include: Overview, Installation/Setup, Core Concepts, Code Examples, Best Practices

3. Add practical examples and real-world use cases

4. Include troubleshooting section for common issues

5. Update table of contents with new section