Q&A 1 How do you read the dataset from the data/ folder before deployment?

1.1 Explanation

Before deploying any machine learning model, it’s essential to understand the data it was trained on. Reading the dataset up front helps ensure consistent preprocessing, reproducibility, and a shared input structure across the tools that consume it.

In the CDI deployment pipeline, we assume that cleaned and prepared data (such as the Titanic or Iris datasets) is stored in a data/ folder at the project root. This convention keeps workflows organized and lets scripts, APIs, and notebooks load the same files from the same relative path.
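
As a quick illustration, the snippet below resolves the data/ directory relative to the project root rather than the current working directory, so the same code behaves identically when run from a script, a test, or an API process. This is a minimal sketch, not part of the CDI pipeline itself: it assumes the script lives one level below the project root, which may differ in your setup.

from pathlib import Path

# Assumed layout (illustration only): this script sits one level below
# the project root, e.g. scripts/load_data.py next to a sibling data/ folder.
PROJECT_ROOT = Path(__file__).resolve().parent.parent
DATA_DIR = PROJECT_ROOT / "data"

titanic_path = DATA_DIR / "titanic.csv"
if not titanic_path.exists():
    raise FileNotFoundError(f"Expected dataset at {titanic_path}")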

We’ll demonstrate how to read a typical dataset using both Python and R, preparing it for evaluation or serving.

1.2 Python Code

import pandas as pd

# Load the Titanic dataset
df = pd.read_csv("data/titanic.csv")

# Preview the first few rows
print(df.head())
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
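Once the dataframe is loaded, it is worth confirming that it contains the columns the downstream model expects before handing it to an evaluation or serving step. The check below is a hedged sketch that continues from the df loaded above; the EXPECTED_COLUMNS set is an assumption based on the preview, not a fixed contract from the CDI pipeline.

# Hypothetical set of columns the deployed model expects (assumption,
# based on the preview above -- adjust to your actual feature set).
EXPECTED_COLUMNS = {"Survived", "Pclass", "Sex", "Age",
                    "SibSp", "Parch", "Fare", "Embarked"}

missing = EXPECTED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"Dataset is missing expected columns: {sorted(missing)}")

# Basic sanity check on the target column before evaluation or serving
assert df["Survived"].isin([0, 1]).all(), "Survived should be binary"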

1.3 R Code

library(readr)

# Load the Titanic dataset
df <- read_csv("data/titanic.csv")

# Preview the first few rows
head(df)
# A tibble: 6 × 12
  PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
        <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
# ℹ 1 more variable: Embarked <chr>

✅ Takeaway: Store your datasets in a consistent data/ directory and load them early to ensure your models, APIs, and frontends share the same input structure.