Data Preparation
Step 4.1: Generate CSV File
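- Generate CSV: Create a CSV file with one row per image: the image path, a binary label (0 = Benign, 1 = Malignant), and a subtype label (0 = Pre-B, 1 = Pro-B, 2 = early Pre-B; -1 for Benign images).
- Automation: Labels are derived from each image's parent folder name, which automates labeling and reduces human error. Images in folders with unrecognized names are skipped.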
- Function: create_csv(data_dir, csv_filename)
- Iterates through the dataset directory and extracts information such as image paths and labels.
- Purpose: Provides an organized structure for efficiently loading data during training.
import os

import pandas as pd


def create_csv(data_dir, csv_filename):
    data = []
    # Walk the whole dataset tree and collect every image file
    for root, dirs, files in os.walk(data_dir):
        for file in files:
            if file.lower().endswith(('.jpg', '.png', '.jpeg')):
                filepath = os.path.join(root, file)
                # Extract class information from the folder structure:
                # the parent folder name is the class name
                parts = filepath.split(os.sep)
                class_name = parts[-2]
                if class_name == 'Benign':
                    binary_label = 0
                    subtype_label = -1  # No subtype for benign images
                elif class_name == '[Malignant] Pre-B':
                    binary_label = 1
                    subtype_label = 0
                elif class_name == '[Malignant] Pro-B':
                    binary_label = 1
                    subtype_label = 1
                elif class_name == '[Malignant] early Pre-B':
                    binary_label = 1
                    subtype_label = 2
                else:
                    continue  # Skip unknown classes
                data.append([filepath, binary_label, subtype_label])
    df = pd.DataFrame(data, columns=['filepath', 'binary_label', 'subtype_label'])
    df.to_csv(csv_filename, index=False)
    print(f"CSV file saved to {csv_filename}")
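Before splitting, it can help to confirm that the CSV holds the class counts you expect. The following is a minimal sketch, assuming the column names and label codes written by create_csv; summarize_labels and SUBTYPE_NAMES are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Hypothetical mapping that mirrors the label codes used by create_csv.
SUBTYPE_NAMES = {-1: "Benign", 0: "Pre-B", 1: "Pro-B", 2: "early Pre-B"}


def summarize_labels(csv_filename):
    """Return a dict of image counts per class for a CSV written by create_csv."""
    df = pd.read_csv(csv_filename)
    # A row is consistent when "binary_label is 0" and "subtype_label is -1"
    # are either both true (Benign) or both false (Malignant).
    consistent = (df["binary_label"] == 0) == (df["subtype_label"] == -1)
    if not consistent.all():
        raise ValueError("binary_label and subtype_label disagree for some rows")
    # Translate numeric subtype codes into readable class names and count them.
    counts = df["subtype_label"].map(SUBTYPE_NAMES).value_counts()
    return counts.to_dict()
```

A strongly skewed result here is a hint that stratified splitting in the next step matters for this dataset.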
Step 4.2: Split Data into Train, Validation, and Test Sets
- Split Data: Split data into training, validation, and test sets using stratified splitting.
- Training Set: Used to train the model.
- Validation Set: Used to tune hyperparameters and to monitor for overfitting during training.
- Test Set: Used to evaluate the final model's performance.
- Stratified Splitting: Ensures each class is represented proportionally in all subsets.
- Function: split_data(csv_file, train_csv, val_csv, test_csv, val_size=0.1, test_size=0.1, random_state=42)
- Purpose: Keeps the class distribution consistent across the three subsets, so validation and test metrics are measured on the same class mix the model was trained on.
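The body of split_data is not shown here. The following is a minimal sketch of one way it could be written, assuming scikit-learn's train_test_split, the column names produced by create_csv, and that val_size and test_size are fractions of the full dataset. Stratifying on subtype_label is also an assumption: that column alone distinguishes all four classes (-1 for Benign, 0/1/2 for the malignant subtypes).

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(csv_file, train_csv, val_csv, test_csv,
               val_size=0.1, test_size=0.1, random_state=42):
    """Stratified train/validation/test split of the labels CSV (a sketch)."""
    df = pd.read_csv(csv_file)

    # Step 1: hold out the test set, preserving class proportions.
    # Assumption: subtype_label is the stratification key.
    train_val_df, test_df = train_test_split(
        df,
        test_size=test_size,
        stratify=df["subtype_label"],
        random_state=random_state,
    )

    # Step 2: carve the validation set out of the remainder.
    # val_size is a fraction of the full dataset, so rescale it
    # to a fraction of what is left after removing the test set.
    val_fraction = val_size / (1.0 - test_size)
    train_df, val_df = train_test_split(
        train_val_df,
        test_size=val_fraction,
        stratify=train_val_df["subtype_label"],
        random_state=random_state,
    )

    # Write each subset to its own CSV, without the pandas index column.
    train_df.to_csv(train_csv, index=False)
    val_df.to_csv(val_csv, index=False)
    test_df.to_csv(test_csv, index=False)
    return train_df, val_df, test_df
```

Under these assumptions, the default arguments leave roughly 80% of the rows for training and about 10% each for validation and testing. Stratified splitting needs at least two images in every class at each step; otherwise train_test_split raises an error.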
split_data('dataset_labels.csv', 'train_labels.csv', 'val_labels.csv', 'test_labels.csv')