Deep Learning Classification Pipeline
This project uses a deep neural network to classify files as either benign or malicious. I implemented a complete machine learning pipeline that preprocesses file features, splits the data into training and test sets, and trains a custom PyTorch model with binary cross-entropy (BCE) loss.
Below is the main execution script from my project. It handles the full lifecycle of the model: loading the dataset, creating the train/test split, initializing the MalwareClassifier model, running the training loop, and finally evaluating and saving the trained model.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

from TrainingDataset import MalwareDataset, MalwareClassifier, preprocess_data, train, evaluate
def main():
    # 1. Load and preprocess data (the scaler is returned so future
    # samples can be scaled the same way; it is unused in this script)
    print("Loading and preprocessing data...")
    features, labels, scaler = preprocess_data(r'/dataset/malware_dataset.csv')

    # Convert to tensors. Labels must be floats: nn.BCELoss expects
    # floating-point targets, so a LongTensor would raise a dtype error.
    features = torch.FloatTensor(features)
    labels = torch.FloatTensor(labels)

    # 2. Split data (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )

    # 3. Create datasets and loaders
    train_dataset = MalwareDataset(X_train, y_train)
    test_dataset = MalwareDataset(X_test, y_test)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

    # 4. Initialize model
    input_size = X_train.shape[1]
    model = MalwareClassifier(input_size)

    # 5. Set up training. BCELoss assumes the model's final layer is a
    # sigmoid producing probabilities in [0, 1].
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # 6. Train the model
    print("Training model...")
    train(model, train_loader, criterion, optimizer, epochs=10)

    # 7. Evaluate on the held-out test set
    accuracy = evaluate(model, test_loader)
    print(f"\nTest Accuracy: {accuracy*100:.2f}%")

    # 8. Save model weights
    torch.save(model.state_dict(), 'malware_classifier.pt')
    print("Model saved as malware_classifier.pt")


if __name__ == "__main__":
    main()
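The helper classes and functions above are imported from TrainingDataset and are not shown in the listing. The sketch below is a minimal, illustrative reconstruction of what they might look like, inferred only from how main() calls them: the hidden layer sizes, dropout rate, and per-epoch loss reporting are assumptions, and preprocess_data (CSV loading, label extraction, and feature scaling) is omitted. The one hard constraint is that the model must end in a sigmoid, since nn.BCELoss expects probabilities.

import torch
import torch.nn as nn
from torch.utils.data import Dataset


class MalwareDataset(Dataset):
    """Wraps pre-tensorized features and labels for DataLoader batching."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


class MalwareClassifier(nn.Module):
    """Feed-forward binary classifier ending in a sigmoid, as BCELoss requires."""

    def __init__(self, input_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128),  # hidden sizes are illustrative assumptions
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),                # output is a probability in [0, 1]
        )

    def forward(self, x):
        return self.net(x).squeeze(1)    # shape (batch,) to match the float labels


def train(model, loader, criterion, optimizer, epochs):
    """Standard training loop: one optimizer step per mini-batch."""
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}/{epochs} - loss: {total_loss / len(loader):.4f}")


def evaluate(model, loader):
    """Returns accuracy on the loader using a 0.5 probability threshold."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for features, labels in loader:
            preds = (model(features) >= 0.5).float()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total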
The model was evaluated on a held-out test set (20% of the data). Using the Adam optimizer with a learning rate of 0.001, training converged within the 10 epochs, and the script reports the final test accuracy before saving the weights.
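Because only the state dict is saved, the saved weights are reloaded by reconstructing the model first. The snippet below is a minimal sketch of that inference path; new_features is a hypothetical array of samples that must already be scaled with the same scaler returned by preprocess_data, and input_size must match the feature count used at training time.

import torch

# Rebuild the architecture, then load the trained weights into it
model = MalwareClassifier(input_size)
model.load_state_dict(torch.load('malware_classifier.pt'))
model.eval()  # disable dropout for deterministic inference

with torch.no_grad():
    probs = model(torch.FloatTensor(new_features))  # probabilities in [0, 1]
    preds = (probs >= 0.5).long()                   # 1 = malicious, 0 = benign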