Indexes
Purpose: Speed up data retrieval (like a book’s index)
Types:
- Primary Index: Automatically created on primary keys
- Unique Index: Ensures uniqueness (like email addresses)
- Composite Index: Multiple columns together (lastName, firstName)
- Full-text Index: For text searching
Example:
CREATE INDEX idx_customer_email ON Customers(email);
CREATE INDEX idx_order_date ON Orders(order_date, customer_id);
Trade-offs: Indexes speed up reads but slow down writes (inserts/updates). Don’t index everything—only columns you frequently search, filter, or join on.
Constraints
Data Integrity Rules:
NOT NULL: Field must have a value
email VARCHAR(255) NOT NULL
UNIQUE: No duplicate values allowed
username VARCHAR(50) UNIQUE
CHECK: Validates data against conditions
age INT CHECK (age >= 0 AND age <= 150)
price DECIMAL(10,2) CHECK (price > 0)
DEFAULT: Provides automatic values
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
status VARCHAR(20) DEFAULT 'pending'
FOREIGN KEY with actions:
customer_id INT REFERENCES Customers(customer_id)
ON DELETE CASCADE -- Delete orders when customer deleted
ON UPDATE CASCADE -- Update order's customer_id if customer_id changes
Other FK options: SET NULL, SET DEFAULT, RESTRICT
Data Types
Choose appropriately for performance and storage:
Numeric:
INTvsBIGINTvsSMALLINT(storage size matters)DECIMAL(10,2)for money (exact precision)FLOAT/DOUBLEfor scientific data (approximate)
Strings:
CHAR(10)– Fixed length, faster but wastes spaceVARCHAR(255)– Variable length, more flexibleTEXT– Large text (don’t index these)
Dates:
DATE– Just the dateTIMESTAMP– Date + time with timezone awarenessDATETIME– Date + time without timezone
Boolean:
BOOLEANorTINYINT(1)
Bad example: Storing dates as strings "2024-12-06" – can’t do date math or proper sorting
Naming Conventions
Consistency is critical:
Table names:
- Plural vs singular (choose one):
UsersorUser - PascalCase, snake_case, or lowercase (be consistent)
Column names:
- Descriptive:
created_atnotdt - Avoid reserved words: Don’t name a column
orderoruser - Foreign keys:
customer_idclearly shows it references Customers
Prefixes to avoid:
- Don’t prefix tables with
tbl_ - Don’t use Hungarian notation:
str_name,int_age
Soft Deletes vs Hard Deletes
Hard Delete: Actually remove the row
DELETE FROM Users WHERE user_id = 123;
Soft Delete: Mark as deleted but keep the data
-- Add column
deleted_at TIMESTAMP NULL
-- "Delete"
UPDATE Users SET deleted_at = NOW() WHERE user_id = 123;
-- Query active users
SELECT * FROM Users WHERE deleted_at IS NULL;
Use soft deletes when:
- You need audit trails
- Data has dependencies
- Regulatory compliance requires retention
- Users might want to “undelete”
Audit Trails and History
Track changes for compliance/debugging:
Simple approach: Add tracking columns
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
created_by INT REFERENCES Users(user_id)
updated_at TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
updated_by INT
Advanced approach: Separate audit table
-- Main table
Products (product_id, name, price)
-- Audit table
Product_History (
history_id,
product_id,
name,
price,
action, -- 'INSERT', 'UPDATE', 'DELETE'
changed_by,
changed_at
)
Use database triggers to automatically populate audit tables.
Partitioning
Split large tables for performance:
Range Partitioning: By date ranges
-- Orders table partitioned by year
Orders_2023
Orders_2024
Orders_2025
Hash Partitioning: Distribute data evenly List Partitioning: By specific values (regions, categories)
Benefits: Faster queries, easier maintenance, can archive old partitions
Denormalization Patterns
When performance trumps perfect normalization:
Calculated fields:
-- Store order_total instead of always calculating from OrderItems
Orders (
order_id,
customer_id,
order_total -- Denormalized!
)
Redundant data for performance:
-- Store customer_name in Orders to avoid joins
Orders (
order_id,
customer_id,
customer_name -- Duplicates data from Customers table
)
Materialized views: Pre-computed query results updated periodically
Scalability Considerations
Vertical Partitioning: Split wide tables
-- Instead of one huge Users table
Users (user_id, username, email)
User_Profiles (user_id, bio, avatar, preferences) -- Rarely accessed columns
Horizontal Sharding: Split by rows across servers
- User IDs 1-1M on Server A
- User IDs 1M-2M on Server B
Read Replicas: Copy databases for read-only queries
Caching: Redis/Memcached for frequently accessed data
Security Design
Encryption:
- Encrypt sensitive columns (passwords, SSN, credit cards)
- Use bcrypt/argon2 for passwords, never plain text
- Consider encryption at rest and in transit
Access Control:
- Principle of least privilege
- Different database users for different applications
- Row-level security for multi-tenant applications
SQL Injection Prevention:
- Use parameterized queries/prepared statements
- Never concatenate user input into SQL
Performance Optimization
Query Optimization:
- Use EXPLAIN to analyze slow queries
- Avoid SELECT * (fetch only needed columns)
- Limit result sets with pagination
- Use proper JOIN types
Connection Pooling: Reuse database connections
Batch Operations: Insert/update many rows at once instead of one-by-one
Caching Strategies:
- Application-level caching
- Query result caching
- Object caching
Backup and Recovery
Strategy:
- Regular automated backups
- Test restore procedures
- Point-in-time recovery capability
- Geographic redundancy
Types:
- Full backups (weekly)
- Incremental backups (daily)
- Transaction log backups (hourly)
Schema Versioning / Migrations
Track database changes like code:
Use migration tools (Flyway, Liquibase, Alembic):
-- Migration: V001__create_users_table.sql
CREATE TABLE Users (...);
-- Migration: V002__add_phone_to_users.sql
ALTER TABLE Users ADD COLUMN phone VARCHAR(20);
Never modify migrations after they’re deployed—create new ones
Design Patterns
UUID vs Auto-increment IDs:
- Auto-increment: Simple, sequential, can expose scale
- UUID: Globally unique, harder to guess, larger storage
Polymorphic Associations: One table relates to multiple tables
Comments (
comment_id,
commentable_id, -- Could be post_id or photo_id
commentable_type -- 'Post' or 'Photo'
)
EAV (Entity-Attribute-Value): Flexible but slow
-- For products with varying attributes
Product_Attributes (
product_id,
attribute_name, -- 'color', 'size', 'weight'
attribute_value
)
Use sparingly—usually better to use JSONB or separate tables
Documentation
Essential to maintain:
- ER diagrams showing table relationships
- Data dictionary explaining each table/column
- Business rules and constraints
- Index strategy and rationale
- Backup/recovery procedures
Testing
Database testing:
- Test constraints actually prevent bad data
- Performance testing with realistic data volumes
- Test backup/restore procedures
- Load testing for concurrent users