Data Preparation

Step 4.1: Generate CSV File

  • Generate CSV: Create a CSV file that contains paths to images and their labels (Benign or Malignant, and subtype for Malignant).
    • Automation: Labels are derived from folder names, which automates labeling and reduces human error.
  • Function: create_csv(data_dir, csv_filename)
    • Iterates through the dataset directory and extracts information such as image paths and labels.
    • Purpose: Provides an organized structure for efficiently loading data during training.
  import os

  import pandas as pd

  def create_csv(data_dir, csv_filename):
      """Walk data_dir and write one CSV row per image: path, binary label, subtype label."""
      data = []
      for root, dirs, files in os.walk(data_dir):
          for file in files:
              if file.lower().endswith(('.jpg', '.png', '.jpeg')):
                  filepath = os.path.join(root, file)
                  # The immediate parent folder name is the class name
                  parts = filepath.split(os.sep)
                  class_name = parts[-2]

                  if class_name == 'Benign':
                      binary_label = 0
                      subtype_label = -1
                  elif class_name == '[Malignant] Pre-B':
                      binary_label = 1
                      subtype_label = 0
                  elif class_name == '[Malignant] Pro-B':
                      binary_label = 1
                      subtype_label = 1
                  elif class_name == '[Malignant] early Pre-B':
                      binary_label = 1
                      subtype_label = 2
                  else:
                      continue  # Skip unknown classes

                  data.append([filepath, binary_label, subtype_label])

      df = pd.DataFrame(data, columns=['filepath', 'binary_label', 'subtype_label'])
      df.to_csv(csv_filename, index=False)
      print(f"CSV file saved to {csv_filename}")
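
Because create_csv silently skips folders whose names it does not recognize, it can be worth checking the class counts in the generated file before training. A minimal sketch, assuming the column names written above; summarize_labels is a hypothetical helper, not part of the original pipeline:

```python
import pandas as pd


def summarize_labels(csv_filename):
    """Return the number of images per (binary_label, subtype_label) pair.

    Hypothetical helper: a class that is missing from the result usually
    means its folder name did not match any branch in create_csv.
    """
    df = pd.read_csv(csv_filename)
    return df.groupby(['binary_label', 'subtype_label']).size()
```

Calling summarize_labels('dataset_labels.csv') should list four label pairs, one for Benign and one for each malignant subtype.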

Step 4.2: Split Data into Train, Validation, and Test Sets

  • Split Data: Split data into training, validation, and test sets using stratified splitting.
    • Training Set: Used to train the model.
    • Validation Set: Used to tune hyperparameters and detect overfitting during training.
    • Test Set: Used to evaluate the final model's performance.
  • Stratified Splitting: Ensures each class is represented proportionally in all subsets.
    • Function: split_data(csv_file, train_csv, val_csv, test_csv, val_size=0.1, test_size=0.1, random_state=42)
    • Purpose: Preserves the class proportions of the full dataset in every subset, so no class is under-represented in the validation or test set and the reported metrics are less biased.
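
The body of split_data is not shown here. The following is a minimal sketch of how it could look, assuming scikit-learn's train_test_split and assuming stratification on subtype_label (which is distinct for every class: -1 for Benign, 0/1/2 for the malignant subtypes). Both choices, and the returned DataFrames, are assumptions rather than the original implementation:

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(csv_file, train_csv, val_csv, test_csv,
               val_size=0.1, test_size=0.1, random_state=42):
    """Write stratified train/validation/test CSVs and return the three DataFrames."""
    df = pd.read_csv(csv_file)

    # val_size and test_size are fractions of the full dataset; convert them
    # to row counts so the second split does not need a rescaled fraction.
    n_test = int(round(test_size * len(df)))
    n_val = int(round(val_size * len(df)))

    # Assumption: stratify on subtype_label, which identifies all four classes.
    train_val_df, test_df = train_test_split(
        df, test_size=n_test, stratify=df['subtype_label'],
        random_state=random_state)
    train_df, val_df = train_test_split(
        train_val_df, test_size=n_val,
        stratify=train_val_df['subtype_label'],
        random_state=random_state)

    train_df.to_csv(train_csv, index=False)
    val_df.to_csv(val_csv, index=False)
    test_df.to_csv(test_csv, index=False)
    return train_df, val_df, test_df
```

With the default sizes this yields roughly an 80/10/10 split. Holding out the test set first keeps it untouched by any later decision about the validation split.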
split_data('dataset_labels.csv', 'train_labels.csv', 'val_labels.csv', 'test_labels.csv')