HTTP Attack Classification Models

A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs.
Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories.


Task

  • Task: Multi-class Text Classification
  • Domain: Network Security / Intrusion Detection
  • Input: Raw HTTP request string (method, path, headers, body)
  • Output: One of 9 attack type labels
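
If the logs store the request as separate fields, they have to be joined back into one string before classification. The sketch below shows one way to do that. `build_request` is a hypothetical helper, and the CRLF layout is an assumption modeled on the payloads in the Usage section; the exact serialization used for training is not stated here.

```python
def build_request(method, path, headers=None, body="", version="HTTP/1.1"):
    """Serialize request fields into a single raw HTTP request string.

    Assumption: request line, then headers, then body, joined with CRLF,
    mirroring the example payloads. Check this against the training data.
    """
    lines = [f"{method} {path} {version}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    raw = "\r\n".join(lines) + "\r\n"
    if body:
        # In HTTP/1.1 a blank line separates the headers from the body.
        raw += "\r\n" + body
    return raw


request = build_request("GET", "/../../../../etc/passwd", {"Host": "10.0.0.1"})
print(repr(request))
# 'GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n'
```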

Attack Types

| Class | Description | Common Indicators |
|---|---|---|
| Vulnerability_Scan | Automated scanning for known vulnerabilities | sqlmap, nikto, nmap user-agents; repeated probing patterns |
| System_Cmd_Execution | OS command injection attempts | \|, ;, &&, wget, curl, /bin/sh, boot.ini |
| HOST_Scan | Network host discovery and port scanning | Minimal headers, bare GET /, nmap scripting engine |
| Path_Disclosure | Directory traversal and file path exposure | ../, ..%2F, /etc/passwd, /etc/shadow, /proc/ |
| SQL_Injection | SQL injection in query parameters | UNION SELECT, OR 1=1, --, ', boolean-based blind patterns |
| Cross_Site_Scripting | XSS payload injection | <script>, onerror=, javascript:, alert(), prompt() |
| Automatically_Searching_Infor | Automated information gathering | Crawlers, /_vti_pvt/, robots.txt, sitemap.xml probing |
| Leakage_Through_NW | Sensitive file access via network | Access to config files, logs, backups (.ico, .conf, .bak) |
| Directory_Indexing | Browsing exposed directory listings | Trailing / on directory paths, source/workspace/src paths |

Models

| File | Model | Feature Extraction | Test Accuracy | Notes |
|---|---|---|---|---|
| tdidf-svc.joblib | TF-IDF + LinearSVC | word, default | 87.4% | Best generalization |
| xgb_char.joblib | TF-IDF + XGBoost | char, ngram(1,2), max_features=1024 | 88.5% | Best local accuracy |
| xgb_word.joblib | TF-IDF + XGBoost | word, NLTK tokenizer | 86.7% | |
| lgb_model.joblib | TF-IDF + LightGBM | word, NLTK tokenizer | 86.5% | |
| rf_nltk.joblib | TF-IDF + RandomForest | word, NLTK, n_estimators=1000 | 84.8% | |
| rf_gridsearch.joblib | TF-IDF + RandomForest | word, GridSearchCV best | 83.6% | best: max_depth=None, n_estimators=150 |
| rf_basic.joblib | TF-IDF + RandomForest | word, default | 83.3% | |
| catboost.joblib | TF-IDF + CatBoost | word, NLTK tokenizer | 83.0% | |
| multinomial_nb.joblib | CountVectorizer + MultinomialNB | word, default | 67.5% | Baseline |
| lstm_bidirectional.h5 | BiLSTM | Keras Tokenizer, maxlen=216 | 85.2% | Requires Keras/TF |
| textcnn_model.h5 | TextCNN | Keras Tokenizer, maxlen=256 | 86.1% | Requires Keras/TF |

Usage

Preprocessing

import urllib.parse

def preprocess(payload: str) -> str:
    return urllib.parse.unquote_plus(payload)
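
As a quick illustration of what this step does, the sketch below runs `preprocess` on a made-up percent-encoded request line. The sample strings are assumptions for demonstration only. Note that `unquote_plus` also turns `+` into a space and decodes a single layer of encoding per call.

```python
import urllib.parse


def preprocess(payload: str) -> str:
    return urllib.parse.unquote_plus(payload)


# Hypothetical encoded request line, not taken from the dataset.
encoded = "GET /search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E&x=a+b HTTP/1.1"
decoded = preprocess(encoded)
print(decoded)
# GET /search?q=<script>alert(1)</script>&x=a b HTTP/1.1

# Double-encoded input needs a second pass to be fully decoded.
once = preprocess("%252e%252e%252f")
twice = preprocess(once)
print(once, twice)
# %2e%2e%2f ../
```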

sklearn-based models (joblib)

Applies to: tdidf-svc.joblib, xgb_char.joblib, xgb_word.joblib, lgb_model.joblib, rf_*.joblib, catboost.joblib, multinomial_nb.joblib

Each file is a scikit-learn Pipeline with the vectorizer and classifier bundled together, so raw text can be passed directly.

import joblib

model = joblib.load("xgb_char.joblib")

payloads = [
    "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n",
    "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n",
]
predictions = model.predict(payloads)   # returns a NumPy array
print(predictions.tolist())
# Expected: ['Path_Disclosure', 'SQL_Injection']

Keras-based models (.h5)

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
import joblib

model = load_model("lstm_bidirectional.h5")       # or textcnn_model.h5
tokenizer = joblib.load("tokenizer.joblib")        # must be saved separately during training

payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"]
sequences = tokenizer.texts_to_sequences(payloads)
padded = pad_sequences(sequences, maxlen=216)      # maxlen=256 for TextCNN

pred = model.predict(padded)
label_idx = np.argmax(pred, axis=1)
print(label_idx)
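
The Keras models return class indices rather than label names. A minimal sketch for mapping them back follows. It assumes the indices follow the sorted label order that scikit-learn's `LabelEncoder` would produce; that ordering is not confirmed here and should be checked against the training code. `CLASS_NAMES` and `decode_predictions` are hypothetical names.

```python
# Assumption: label indices follow sorted order (LabelEncoder behavior).
# Verify against the encoder actually used during training.
CLASS_NAMES = sorted([
    "Vulnerability_Scan", "System_Cmd_Execution", "HOST_Scan",
    "Path_Disclosure", "SQL_Injection", "Cross_Site_Scripting",
    "Automatically_Searching_Infor", "Leakage_Through_NW",
    "Directory_Indexing",
])


def decode_predictions(probabilities):
    """Return a (label, confidence) pair for each row of class probabilities."""
    results = []
    for row in probabilities:
        scores = [float(p) for p in row]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((CLASS_NAMES[best], scores[best]))
    return results


# Made-up softmax output for one request (9 probabilities).
example = [[0.01, 0.02, 0.01, 0.01, 0.01, 0.90, 0.02, 0.01, 0.01]]
print(decode_predictions(example))
# [('Path_Disclosure', 0.9)]
```

The function accepts any iterable of rows, so the array returned by `model.predict(padded)` can be passed in directly.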

Evaluation

Per-model summary

| Model | Accuracy | Macro F1 | Weakest Class (F1) |
|---|---|---|---|
| TF-IDF + XGBoost (char) | 88.5% | 0.92 | System_Cmd_Execution (0.84) |
| TF-IDF + LinearSVC | 87.4% | not reported | System_Cmd_Execution |
| TF-IDF + XGBoost (word) | 86.7% | 0.90 | System_Cmd_Execution (0.83) |
| TF-IDF + LightGBM | 86.5% | 0.91 | System_Cmd_Execution (0.83) |
| TextCNN | 86.1% | 0.89 | System_Cmd_Execution (0.79) |
| BiLSTM | 85.2% | 0.89 | System_Cmd_Execution (0.78) |
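
Macro F1 averages the per-class F1 scores without weighting by class size, so it can sit above accuracy, as it does in the table (for example 0.92 against 88.5%). The toy sketch below shows one way that can happen: two larger classes are confused with each other while four small classes are classified perfectly. The labels and counts are invented for illustration and are not drawn from the evaluation data.

```python
def f1(tp, fp, fn):
    """F1 score from true positive, false positive, false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true, y_pred):
    """Return accuracy, macro F1, and a dict of per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        per_class[label] = f1(tp, fp, fn)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    macro_f1 = sum(per_class.values()) / len(per_class)
    return accuracy, macro_f1, per_class


# Invented toy labels: "scan" and "cmd" are confused once in each direction.
y_true = ["scan"] * 4 + ["cmd"] * 4 + ["xss", "sqli", "crawl", "leak"]
y_pred = (["scan"] * 3 + ["cmd"] + ["cmd"] * 3 + ["scan"]
          + ["xss", "sqli", "crawl", "leak"])

accuracy, macro_f1, per_class = summarize(y_true, y_pred)
print(f"{accuracy:.3f} {macro_f1:.3f}")
# 0.833 0.917
```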

Per-class observations

  • Easiest classes: Automatically_Searching_Infor and Leakage_Through_NW achieve F1 ≥ 0.99 across all models. Highly distinctive tool signatures (nmap, crawlers) and file access patterns make them trivial to separate.
  • Hardest class: System_Cmd_Execution consistently scores the lowest F1 (0.75–0.84) due to pattern overlap with Vulnerability_Scan. Both classes involve probing behavior with similar HTTP structure.
  • char-level XGBoost advantage: Sub-word character n-grams capture attack-specific tokens like ../, <script>, UNION more robustly than word tokenization, especially for obfuscated payloads.

Architecture Details

BiLSTM

Embedding(22,883 vocab, dim=100, maxlen=216)
→ Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax)
  • EarlyStopping(monitor=val_accuracy, patience=3), triggered at epoch 20
  • Saved: lstm_bidirectional.h5 (28 MB)

TextCNN

Embedding(20,000 vocab, dim=128, maxlen=256)
→ Conv1D(128, kernel=3) ─┐
→ Conv1D(128, kernel=4) ─┼→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax)
→ Conv1D(128, kernel=5) ─┘
Total params: 2.86M
  • EarlyStopping(monitor=val_loss, patience=3), triggered at epoch 5
  • Saved: textcnn_model.h5 (33 MB)
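
The 2.86M parameter total can be checked by hand from the TextCNN layer sizes. The sketch below does the arithmetic, assuming standard Keras layers with bias terms and no trainable layers beyond those listed. The helper function names are made up for this example.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim


def conv1d_params(in_channels, filters, kernel):
    # Kernel weights plus one bias per filter.
    return kernel * in_channels * filters + filters


def dense_params(inputs, units):
    # Weight matrix plus one bias per unit.
    return inputs * units + units


embedding = embedding_params(20_000, 128)
convs = sum(conv1d_params(128, 128, k) for k in (3, 4, 5))
hidden = dense_params(3 * 128, 256)   # Concat(384) -> Dense(256)
output = dense_params(256, 9)         # Dense(256) -> Dense(9, softmax)

total = embedding + convs + hidden + output
print(total)
# 2857865
```

Pooling, concatenation, and dropout add no parameters, so the total rounds to the 2.86M given above. The embedding table accounts for about 2.56M of it.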

Key Findings

  • The best TF-IDF models outperform the deep learning models on HTTP attack data (88.5% for char-level XGBoost vs. 86.1% for TextCNN): attack patterns rely on decisive keywords (UNION SELECT, ../, <script>, wget). Bag-of-words representations capture these directly, while sequential models can be distracted by irrelevant header noise.
  • char-level features beat word-level (88.5% vs. 86.7% for XGBoost): Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., %3Cscript%3E vs <script>).
  • Class imbalance effect: Vulnerability_Scan dominates the data at 37.5%, so models tend to over-predict this class for ambiguous samples.
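
The character n-gram point can be made concrete with a small overlap measurement. The sketch below is a simplification, not the vectorizer used by the models: it compares `<script>` with its URL-encoded form, once as whole tokens and once as character 1- and 2-grams. `char_ngrams` and `jaccard` are hypothetical helpers, and no lowercasing or TF-IDF weighting is applied.

```python
def char_ngrams(text, n_min=1, n_max=2):
    """Return the set of character n-grams with n_min <= n <= n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard similarity between two sets of features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


plain = "<script>"
encoded = "%3Cscript%3E"

# Whole-token view: each string is one feature, so nothing is shared.
word_overlap = jaccard({plain}, {encoded})
# Character view: the n-grams inside "script" are common to both forms.
char_overlap = jaccard(char_ngrams(plain), char_ngrams(encoded))

print(f"{word_overlap:.2f} {char_overlap:.2f}")
# 0.00 0.46
```

The shared n-grams are what let a char-level model score an encoded payload close to its plain form even when decoding is incomplete.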

Environment

| Item | Value |
|---|---|
| Python | 3.12 |
| scikit-learn | 1.x |
| XGBoost | 3.2.0 |
| LightGBM | 4.6.0 |
| CatBoost | 1.2.10 |
| TensorFlow / Keras | 2.x |

License

MIT License
