Q&A 3 How do you evaluate models before deployment?

3.1 Explanation

Before deploying machine learning models, it’s important to evaluate their performance on unseen test data. This helps you:

  • Compare models based on accuracy, precision, recall, and F1 score (see the short sketch after this list)
  • Select the best model(s) for deployment
  • Detect overfitting or underfitting
  • Create a summary table for documentation or reporting
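
The short sketch below shows how these metrics can be computed with scikit-learn on a hypothetical set of labels and predictions (the values are made up purely for illustration); average="macro" is the same unweighted per-class average that the script in 3.2 reads from classification_report.

# Illustration only: hypothetical labels and predictions
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual classes
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # model predictions

accuracy = accuracy_score(y_true, y_pred)
# average="macro" takes the unweighted mean of the per-class scores
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(accuracy, precision, recall, f1)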

In this Q&A, we load previously saved models from the models/ folder, evaluate them on test data, and store the results in a single CSV file: evaluation_summary.csv.
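
Given the .joblib filter in the script and the model names in the output below, the models/ folder is expected to contain one saved estimator per candidate model, roughly like this:

models/
  decision_tree.joblib
  gradient_boosting.joblib
  knn.joblib
  logistic_regression.joblib
  naive_bayes.joblib
  random_forest.joblib
  svc.joblib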

3.2 Python Code

# scripts/evaluate_models.py

import os
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Paths
MODEL_DIR = "models"
DATA_PATH = "data/titanic.csv"
OUTPUT_FILE = "data/evaluation_summary.csv"

# Load and preprocess Titanic data
df = pd.read_csv(DATA_PATH)
df = df.dropna(subset=["Age", "Fare", "Embarked", "Sex", "Survived"])
df["Sex"] = df["Sex"].astype("category").cat.codes
df["Embarked"] = df["Embarked"].astype("category").cat.codes
df["Survived"] = df["Survived"].astype(int)

features = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
X = df[features]
y = df["Survived"]

# Train/test split (assumed to use the same test_size and random_state as the
# training script, so the rows evaluated here were not seen during training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Store results
results = []

# Evaluate all saved models
for filename in os.listdir(MODEL_DIR):
    if filename.endswith(".joblib"):
        model_path = os.path.join(MODEL_DIR, filename)
        model = joblib.load(model_path)
        model_name = filename.replace(".joblib", "")

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred, output_dict=True)
        
        # Use macro avg for simplicity
        precision = report["macro avg"]["precision"]
        recall = report["macro avg"]["recall"]
        f1 = report["macro avg"]["f1-score"]

        results.append({
            "Model": model_name,
            "Accuracy": round(acc, 4),
            "Precision": round(precision, 4),
            "Recall": round(recall, 4),
            "F1 Score": round(f1, 4)
        })

# Save results to CSV
results_df = pd.DataFrame(results)
results_df.to_csv(OUTPUT_FILE, index=False)
print(f"\nβœ… Evaluation summary saved to: {OUTPUT_FILE} see results below:\n")

print(results_df)
Output:

✅ Evaluation summary saved to: data/evaluation_summary.csv. See results below:

                 Model  Accuracy  Precision  Recall  F1 Score
0                  knn    0.6853     0.6841  0.6867    0.6838
1                  svc    0.6364     0.6378  0.6109    0.6038
2  logistic_regression    0.7902     0.8057  0.7737    0.7784
3    gradient_boosting    0.7762     0.7858  0.7612    0.7652
4        random_forest    0.7832     0.7837  0.7742    0.7769
5          naive_bayes    0.7692     0.7734  0.7566    0.7600
6        decision_tree    0.6783     0.6746  0.6653    0.6664
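
As a possible next step (not part of the script above), the summary can be used to pick the deployment candidate, for example by sorting on macro F1. The sketch below is hypothetical and assumes the file layout produced earlier: one <name>.joblib per model in models/ and the data/evaluation_summary.csv just written.

# scripts/select_best_model.py  (hypothetical follow-up sketch)
import os
import joblib
import pandas as pd

summary = pd.read_csv("data/evaluation_summary.csv")

# Pick the row with the highest macro F1 score
best = summary.sort_values("F1 Score", ascending=False).iloc[0]
print(f"Best model: {best['Model']} (F1 Score = {best['F1 Score']})")

# Reload the corresponding saved estimator for deployment
best_model = joblib.load(os.path.join("models", f"{best['Model']}.joblib"))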

3.3 R Code

# For a Python-based deployment workflow, use Python for evaluation.
# For R-based workflows, use caret::confusionMatrix() or metrics from modelr or yardstick.

✅ Takeaway: Always evaluate your models and store the results before deployment. This ensures you deploy with confidence and clarity.