Real-Time Phishing URL Detection Using Fine-Tuned DistilRoBERTa

D Sakthivel

doi:10.34293/sijash.v13iS2-i2-Jan.10523

D Sakthivel Assistant Professor, Department of Computer Technology, KG College of Arts and Science, Coimbatore, Tamil Nadu, India

Keywords: Phishing URL Detection, DistilRoBERTa, Transformer Models, Real-Time Security, Byte Pair Encoding, Deep Learning, Cybersecurity, URL Classification, Contextual Feature Learning, Anomaly Detection

Abstract

The widespread adoption of internet-based services has led to a substantial rise in phishing attacks, which threaten both user privacy and the broader cybersecurity landscape. Attackers craft malicious URLs to trick users into disclosing confidential information (login credentials, banking details, and personal identifiers), resulting in financial harm and identity compromise. Conventional detection strategies, such as static blocklist filtering and rule-driven feature engineering, cannot keep up with dynamically generated or previously unseen (zero-day) phishing URLs. To address these shortcomings, this study introduces a real-time phishing URL detection framework built on DistilRoBERTa, a lightweight transformer model distilled from RoBERTa. The system treats each URL as a raw text sequence and applies Byte Pair Encoding (BPE) tokenization to capture both lexical and structural cues. By fine-tuning DistilRoBERTa with an attached classification head, the model learns contextual relationships between URL components without relying on manually engineered features. The framework is designed to run on standard CPU-only hardware, making it viable for real-world deployment. Experiments were conducted on a curated benchmark dataset of 12,000 labeled URLs, partitioned into training and test subsets. Model effectiveness was measured using Accuracy, Precision, Recall, F1-score, ROC-AUC, and confusion matrix analysis. The model achieves an accuracy of 90.55%, a recall of 98.61%, and an F1-score of 92.30%, indicating that it identifies phishing URLs reliably while keeping false negatives low. Benchmarking against classical machine learning models, deep neural architectures, and the full RoBERTa model shows that DistilRoBERTa offers a favorable balance between detection performance and runtime efficiency. Real-time inference experiments further support the system's suitability for deployment in browser security extensions, intrusion detection pipelines, and network monitoring infrastructures.
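The threshold-based metrics named above all follow from the four cells of a binary confusion matrix. The following is a minimal sketch of those relationships, assuming phishing is the positive class. The function names and the confusion counts are hypothetical and purely illustrative; they are not taken from the study's experiments. ROC-AUC is omitted because it requires per-URL scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute Accuracy, Precision, Recall, and F1 from confusion-matrix counts.

    Assumes phishing is the positive class:
      tp = phishing URLs flagged as phishing
      fp = legitimate URLs flagged as phishing
      fn = phishing URLs missed (classified as legitimate)
      tn = legitimate URLs classified as legitimate
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for P, given recall R and the F1-score."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("recall and F1 are inconsistent (need F1 < 2 * recall)")
    return f1 * recall / denominator


# Hypothetical counts for illustration only (not the paper's confusion matrix).
example = classification_metrics(tp=90, fp=20, fn=10, tn=80)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Applied to the reported recall (0.9861) and F1-score (0.9230), `implied_precision` gives roughly 0.8675. This is a derived figure, not one stated in the abstract, and it assumes both values describe the phishing class on the same test split.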