Deep Learning Classification Pipeline
This project uses a deep neural network to classify files as either benign or malicious. I implemented a complete machine learning pipeline that preprocesses file features, splits the data into training and test sets, and trains a custom PyTorch model with binary cross-entropy (BCE) loss.
Below is the main execution script from my project. It handles the full lifecycle of the model: loading the dataset, creating the train/test split, initializing the MalwareClassifier model, running the training loop, and finally evaluating and saving the trained model.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

from TrainingDataset import MalwareDataset, MalwareClassifier, preprocess_data, train, evaluate
def main():
    # 1. Load and preprocess data (the scaler is returned so future
    # samples can be scaled the same way; it is unused in this script)
    print("Loading and preprocessing data...")
    features, labels, scaler = preprocess_data(r'/dataset/malware_dataset.csv')

    # Convert to tensors. Labels must be floats: nn.BCELoss expects
    # floating-point targets, so a LongTensor would raise a dtype error.
    features = torch.FloatTensor(features)
    labels = torch.FloatTensor(labels)

    # 2. Split data (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )

    # 3. Create datasets and loaders
    train_dataset = MalwareDataset(X_train, y_train)
    test_dataset = MalwareDataset(X_test, y_test)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

    # 4. Initialize model
    input_size = X_train.shape[1]
    model = MalwareClassifier(input_size)

    # 5. Set up training. BCELoss assumes the model's final layer is a
    # sigmoid producing probabilities in [0, 1].
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # 6. Train the model
    print("Training model...")
    train(model, train_loader, criterion, optimizer, epochs=10)

    # 7. Evaluate on the held-out test set
    accuracy = evaluate(model, test_loader)
    print(f"\nTest Accuracy: {accuracy*100:.2f}%")

    # 8. Save model weights
    torch.save(model.state_dict(), 'malware_classifier.pt')
    print("Model saved as malware_classifier.pt")


if __name__ == "__main__":
    main()
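The helper classes and functions above are imported from TrainingDataset and are not shown in the listing. The sketch below is a minimal, illustrative reconstruction of what they might look like, inferred only from how main() calls them: the hidden layer sizes, dropout rate, and per-epoch loss reporting are assumptions, and preprocess_data (CSV loading, label extraction, and feature scaling) is omitted. The one hard constraint is that the model must end in a sigmoid, since nn.BCELoss expects probabilities.

import torch
import torch.nn as nn
from torch.utils.data import Dataset


class MalwareDataset(Dataset):
    """Wraps pre-tensorized features and labels for DataLoader batching."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


class MalwareClassifier(nn.Module):
    """Feed-forward binary classifier ending in a sigmoid, as BCELoss requires."""

    def __init__(self, input_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128),  # hidden sizes are illustrative assumptions
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),                # output is a probability in [0, 1]
        )

    def forward(self, x):
        return self.net(x).squeeze(1)    # shape (batch,) to match the float labels


def train(model, loader, criterion, optimizer, epochs):
    """Standard training loop: one optimizer step per mini-batch."""
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}/{epochs} - loss: {total_loss / len(loader):.4f}")


def evaluate(model, loader):
    """Returns accuracy on the loader using a 0.5 probability threshold."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for features, labels in loader:
            preds = (model(features) >= 0.5).float()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total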
The model was evaluated on a held-out test set (20% of the data). Using the Adam optimizer with a learning rate of 0.001, training converged within the 10 epochs, and the script reports the final test accuracy before saving the weights.
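Because only the state dict is saved, the saved weights are reloaded by reconstructing the model first. The snippet below is a minimal sketch of that inference path; new_features is a hypothetical array of samples that must already be scaled with the same scaler returned by preprocess_data, and input_size must match the feature count used at training time.

import torch

# Rebuild the architecture, then load the trained weights into it
model = MalwareClassifier(input_size)
model.load_state_dict(torch.load('malware_classifier.pt'))
model.eval()  # disable dropout for deterministic inference

with torch.no_grad():
    probs = model(torch.FloatTensor(new_features))  # probabilities in [0, 1]
    preds = (probs >= 0.5).long()                   # 1 = malicious, 0 = benign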