HTTP Attack Classification Models

A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs.
Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories.

Task

Task: Multi-class Text Classification
Domain: Network Security / Intrusion Detection
Input: Raw HTTP request string (method, path, headers, body)
Output: One of 9 attack type labels

Attack Types

Class	Description	Common Indicators
`Vulnerability_Scan`	Automated scanning for known vulnerabilities	sqlmap, nikto, nmap user-agents; repeated probing patterns
`System_Cmd_Execution`	OS command injection attempts	`|`, `;`, `&&`, `wget`, `curl`, `/bin/sh`, `boot.ini`
`HOST_Scan`	Network host discovery and port scanning	Minimal headers, bare `GET /`, nmap scripting engine
`Path_Disclosure`	Directory traversal and file path exposure	`../`, `..%2F`, `/etc/passwd`, `/etc/shadow`, `/proc/`
`SQL_Injection`	SQL injection in query parameters	`UNION SELECT`, `OR 1=1`, `--`, `'`, boolean-based blind patterns
`Cross_Site_Scripting`	XSS payload injection	`<script>`, `onerror=`, `javascript:`, `alert()`, `prompt()`
`Automatically_Searching_Infor`	Automated information gathering	Crawlers, `/_vti_pvt/`, `robots.txt`, `sitemap.xml` probing
`Leakage_Through_NW`	Sensitive file access via network	Access to config files, logs, backups (`.ico`, `.conf`, `.bak`)
`Directory_Indexing`	Browsing exposed directory listings	Trailing `/` on directory paths, source/workspace/src paths

Models

File	Model	Feature Extraction	Test Accuracy	Notes
`tdidf-svc.joblib`	TF-IDF + LinearSVC	word, default	87.4%	Best generalization
`xgb_char.joblib`	TF-IDF + XGBoost	char, ngram(1,2), max_features=1024	88.5%	Best local accuracy
`xgb_word.joblib`	TF-IDF + XGBoost	word, NLTK tokenizer	86.7%
`lgb_model.joblib`	TF-IDF + LightGBM	word, NLTK tokenizer	86.5%
`rf_nltk.joblib`	TF-IDF + RandomForest	word, NLTK, n_estimators=1000	84.8%
`rf_gridsearch.joblib`	TF-IDF + RandomForest	word, GridSearchCV best	83.6%	best: max_depth=None, n_estimators=150
`rf_basic.joblib`	TF-IDF + RandomForest	word, default	83.3%
`catboost.joblib`	TF-IDF + CatBoost	word, NLTK tokenizer	83.0%
`multinomial_nb.joblib`	CountVectorizer + MultinomialNB	word, default	67.5%	Baseline
`lstm_bidirectional.h5`	BiLSTM	Keras Tokenizer, maxlen=216	85.2%	Requires Keras/TF
`textcnn_model.h5`	TextCNN	Keras Tokenizer, maxlen=256	86.1%	Requires Keras/TF

Usage

Preprocessing

import urllib.parse

def preprocess(payload: str) -> str:
    return urllib.parse.unquote_plus(payload)
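
As a quick illustration of what this step does, the sketch below runs `preprocess` on a URL-encoded XSS payload. The request strings are made up for this example. Note that `unquote_plus` also turns `+` into a space and decodes only one layer, so a double-encoded payload is still partly encoded after a single call.

```python
import urllib.parse

def preprocess(payload: str) -> str:
    # Same function as above: decode %XX escapes and turn '+' into spaces.
    return urllib.parse.unquote_plus(payload)

# Hypothetical encoded request line, for illustration only.
encoded = "GET /search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E&page=a+b HTTP/1.1"
decoded = preprocess(encoded)
print(decoded)
# GET /search?q=<script>alert(1)</script>&page=a b HTTP/1.1

# Double-encoded input: one call removes only the outer layer.
double_encoded = "%253Cscript%253E"
once = preprocess(double_encoded)   # "%3Cscript%3E"
twice = preprocess(once)            # "<script>"
```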

sklearn-based models (joblib)

Applies to: tdidf-svc.joblib, xgb_char.joblib, xgb_word.joblib, lgb_model.joblib, rf_*.joblib, catboost.joblib, multinomial_nb.joblib

Each file is a scikit-learn Pipeline that bundles the vectorizer with the classifier, so request strings can be passed straight to predict() with no separate vectorization step.

import joblib

model = joblib.load("xgb_char.joblib")

payloads = [
    "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n",
    "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n",
]
predictions = model.predict(payloads)
print(predictions)
# ['Path_Disclosure', 'SQL_Injection']
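
To make the "bundled Pipeline" idea concrete, here is a minimal sketch of how such an object could be assembled. It is not the original training code: the four training strings are invented, the vectorizer settings are borrowed from the `xgb_char.joblib` row of the Models table, and LinearSVC stands in as the classifier to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set, for illustration only.
train_texts = [
    "GET /../../etc/passwd HTTP/1.1",
    "GET /..%2F..%2Fetc/shadow HTTP/1.1",
    "GET /item?id=1 UNION SELECT user,pass FROM users HTTP/1.1",
    "GET /login?u=admin' OR 1=1-- HTTP/1.1",
]
train_labels = [
    "Path_Disclosure",
    "Path_Disclosure",
    "SQL_Injection",
    "SQL_Injection",
]

# Vectorizer and classifier live in one object, so predict() accepts raw strings.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 2), max_features=1024)),
    ("clf", LinearSVC()),  # stand-in classifier, an assumption of this sketch
])
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(["GET /../../../etc/passwd HTTP/1.1"])
print(predictions)
```

Saving this object with `joblib.dump(pipeline, path)` would give a single file that loads and predicts the same way as the snippet above.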

Keras-based models (.h5)

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
import joblib

model = load_model("lstm_bidirectional.h5")       # or textcnn_model.h5
tokenizer = joblib.load("tokenizer.joblib")        # must be saved separately during training

payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"]
sequences = tokenizer.texts_to_sequences(payloads)
padded = pad_sequences(sequences, maxlen=216)      # maxlen=256 for TextCNN

pred = model.predict(padded)             # shape (n_samples, 9): one softmax score per class
label_idx = np.argmax(pred, axis=1)      # integer class indices, not label names
print(label_idx)
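
The snippet above prints integer indices, so a mapping back to label names is still needed. The sketch below shows one way to do it. The class order is an assumption: it uses alphabetical order, which is what scikit-learn's LabelEncoder would produce, but the order actually used in training is not stated here and should be checked. `CLASS_NAMES` and `decode_predictions` are hypothetical names.

```python
import numpy as np

# ASSUMPTION: alphabetical class order (LabelEncoder-style). Verify against training.
CLASS_NAMES = sorted([
    "Vulnerability_Scan", "System_Cmd_Execution", "HOST_Scan",
    "Path_Disclosure", "SQL_Injection", "Cross_Site_Scripting",
    "Automatically_Searching_Infor", "Leakage_Through_NW", "Directory_Indexing",
])

def decode_predictions(probs: np.ndarray) -> list[str]:
    """Map an (n_samples, 9) score matrix to label names."""
    idx = np.argmax(probs, axis=1)
    return [CLASS_NAMES[i] for i in idx]

# Fake softmax output standing in for model.predict(padded).
fake_probs = np.zeros((2, 9))
fake_probs[0, 0] = 1.0
fake_probs[1, 8] = 1.0

labels = decode_predictions(fake_probs)
print(labels)
# ['Automatically_Searching_Infor', 'Vulnerability_Scan']
```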

Evaluation

Per-model summary

Model	Accuracy	Macro F1	Weakest Class (F1)
TF-IDF + XGBoost (char)	88.5%	0.92	System_Cmd_Execution (0.84)
TF-IDF + LinearSVC	87.4%	—	System_Cmd_Execution
TF-IDF + XGBoost (word)	86.7%	0.90	System_Cmd_Execution (0.83)
TF-IDF + LightGBM	86.5%	0.91	System_Cmd_Execution (0.83)
TextCNN	86.1%	0.89	System_Cmd_Execution (0.79)
BiLSTM	85.2%	0.89	System_Cmd_Execution (0.78)
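
For readers unfamiliar with the Macro F1 column: it is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of size, and the value can differ noticeably from accuracy. The dependency-free sketch below computes it on a made-up set of eight predictions; the numbers are illustrative and unrelated to the results in the table.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Made-up example: one System_Cmd_Execution sample is mistaken for Vulnerability_Scan.
y_true = ["SQL_Injection"] * 2 + ["Vulnerability_Scan"] * 4 + ["System_Cmd_Execution"] * 2
y_pred = ["SQL_Injection"] * 2 + ["Vulnerability_Scan"] * 5 + ["System_Cmd_Execution"] * 1

scores = f1_per_class(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, macro, accuracy)
```

In this toy case a single confusion pulls the smaller class down to F1 = 2/3 while the larger class only drops to 8/9, which is the same mechanism behind the weakest-class column.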

Per-class observations

Easiest classes: Automatically_Searching_Infor and Leakage_Through_NW achieve F1 ≥ 0.99 across all models. Their highly distinctive tool signatures (nmap, crawlers) and file access patterns make them trivial to separate.
Hardest class: System_Cmd_Execution consistently scores the lowest F1 (0.75–0.84) due to pattern overlap with Vulnerability_Scan. Both classes involve probing behavior with similar HTTP structure.
char-level XGBoost advantage: Sub-word character n-grams capture attack-specific tokens like ../, <script>, UNION more robustly than word tokenization, especially for obfuscated payloads.
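
The character n-gram point can be illustrated without any ML library. In the sketch below, the payloads are invented, `word_tokens` mirrors scikit-learn's default word pattern (`\b\w\w+\b`, lowercased), and the obfuscation is the nested-keyword trick sometimes used to slip past filters that strip keywords once. The NLTK tokenizer used by some of the models may split text differently.

```python
import re

def word_tokens(text: str) -> set[str]:
    # Approximates TfidfVectorizer's default word analyzer.
    return set(re.findall(r"\b\w\w+\b", text.lower()))

def char_ngrams(text: str, n_min: int = 1, n_max: int = 2) -> set[str]:
    # All character n-grams with n_min <= n <= n_max.
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }

plain = "id=1 union select password"
obfuscated = "id=1 ununionion selselectect password"  # made-up evasion example

# Word level: the attack keywords vanish from the obfuscated payload.
shared_words = word_tokens(plain) & word_tokens(obfuscated)
print(shared_words)

# Character level: every bigram of the keywords is still present.
union_bigrams = char_ngrams("union", 2, 2)
select_bigrams = char_ngrams("select", 2, 2)
obfuscated_bigrams = char_ngrams(obfuscated, 2, 2)
print(union_bigrams <= obfuscated_bigrams, select_bigrams <= obfuscated_bigrams)
```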

Architecture Details

BiLSTM

Embedding(22,883 vocab, dim=100, maxlen=216)
→ Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax)

EarlyStopping(monitor=val_accuracy, patience=3), which stopped training at epoch 20
Saved: lstm_bidirectional.h5 (28 MB)

TextCNN

Embedding(20,000 vocab, dim=128, maxlen=256)
→ Conv1D(128, kernel=3) ─┐
→ Conv1D(128, kernel=4) ──→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax)
→ Conv1D(128, kernel=5) ─┘
Total params: 2.86M

EarlyStopping(monitor=val_loss, patience=3), which stopped training at epoch 5
Saved: textcnn_model.h5 (33 MB)
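
The 2.86M figure can be reproduced from the layer sizes listed above. The arithmetic below assumes that the three Conv1D branches each read the embedding output in parallel and that all layers use bias terms, which matches the diagram but is not spelled out in it.

```python
VOCAB, EMBED_DIM = 20_000, 128
FILTERS, KERNELS = 128, (3, 4, 5)
DENSE_UNITS, NUM_CLASSES = 256, 9

# Embedding: one EMBED_DIM-sized vector per vocabulary entry.
embedding = VOCAB * EMBED_DIM

# Conv1D: kernel_size * in_channels * filters weights, plus one bias per filter.
convs = sum(k * EMBED_DIM * FILTERS + FILTERS for k in KERNELS)

# Global max pooling keeps one value per filter; three branches are concatenated.
concat_width = FILTERS * len(KERNELS)

# Dense layers: inputs * units weights, plus one bias per unit.
dense = concat_width * DENSE_UNITS + DENSE_UNITS
output = DENSE_UNITS * NUM_CLASSES + NUM_CLASSES

total = embedding + convs + dense + output
print(concat_width, total)
# 384 2857865
```

Roughly 90% of the parameters sit in the embedding table (2,560,000 of 2,857,865).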

Key Findings

The best TF-IDF models outperform deep learning on HTTP attack data (88.5% and 87.4% vs 86.1% and 85.2% test accuracy): attack patterns rely on decisive keywords (UNION SELECT, ../, <script>, wget). Bag-of-words representations capture these directly, while sequential models can be distracted by irrelevant header noise.
char-level features beat word-level with XGBoost (88.5% vs 86.7%): Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., %3Cscript%3E vs <script>).
Class imbalance effect: Vulnerability_Scan dominates at 37.5% of samples, and models tend to over-predict this class for ambiguous samples.
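
To put the 37.5% figure in perspective, the sketch below computes the weight the majority class would receive under the common "balanced" heuristic, n_samples / (n_classes * class_count), which is what scikit-learn applies for `class_weight="balanced"`. This is shown only as a reference point; nothing here indicates that the models above were trained with class weights. `balanced_weight` is a hypothetical helper name.

```python
NUM_CLASSES = 9
MAJORITY_SHARE = 0.375  # Vulnerability_Scan share reported above

def balanced_weight(class_share: float, num_classes: int = NUM_CLASSES) -> float:
    # n_samples / (n_classes * class_count), rewritten in terms of the class share.
    return 1.0 / (num_classes * class_share)

majority_weight = balanced_weight(MAJORITY_SHARE)    # about 0.296
uniform_weight = balanced_weight(1 / NUM_CLASSES)    # 1.0 for a class at exactly 1/9

# A model that always predicts the majority class would score 37.5% accuracy.
majority_baseline_accuracy = MAJORITY_SHARE

print(round(majority_weight, 3), round(uniform_weight, 3), majority_baseline_accuracy)
```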

Environment

Item	Value
Python	3.12
scikit-learn	1.x
XGBoost	3.2.0
LightGBM	4.6.0
CatBoost	1.2.10
TensorFlow / Keras	2.x

License

MIT License
