Text Classification
Scikit-learn
Joblib
Keras
English
cybersecurity
http-attack-detection
intrusion-detection
web-security
tfidf
xgboost
lightgbm
HTTP Attack Classification Models
A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs.
Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories.
Task
- Task: Multi-class Text Classification
- Domain: Network Security / Intrusion Detection
- Input: Raw HTTP request string (method, path, headers, body)
- Output: One of 9 attack type labels
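As a rough illustration of the input format, the sketch below assembles a raw request string from its parts. The helper name `build_request` is hypothetical, and the exact serialization used for the training data is not stated here, so treat the line endings and header layout as assumptions.

```python
def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.1 request string (hypothetical helper)."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # A blank line separates the header block from the optional body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


raw = build_request("GET", "/search?q=test", {"Host": "example.com"})
print(repr(raw))
```

The resulting string is what would be passed to a model's `predict` call.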
Attack Types
| Class | Description | Common Indicators |
|---|---|---|
| Vulnerability_Scan | Automated scanning for known vulnerabilities | sqlmap, nikto, nmap user-agents; repeated probing patterns |
| System_Cmd_Execution | OS command injection attempts | \|, ;, &&, wget, curl, /bin/sh, boot.ini |
| HOST_Scan | Network host discovery and port scanning | Minimal headers, bare GET /, nmap scripting engine |
| Path_Disclosure | Directory traversal and file path exposure | ../, ..%2F, /etc/passwd, /etc/shadow, /proc/ |
| SQL_Injection | SQL injection in query parameters | UNION SELECT, OR 1=1, --, ', boolean-based blind patterns |
| Cross_Site_Scripting | XSS payload injection | <script>, onerror=, javascript:, alert(), prompt() |
| Automatically_Searching_Infor | Automated information gathering | Crawlers, /_vti_pvt/, robots.txt, sitemap.xml probing |
| Leakage_Through_NW | Sensitive file access via network | Access to config files, logs, backups (.ico, .conf, .bak) |
| Directory_Indexing | Browsing exposed directory listings | Trailing / on directory paths, source/workspace/src paths |
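To make the indicator column concrete, here is a minimal substring-matching sketch for three of the classes. It is only an illustration of what the indicators look like in a request; it is not how the published models classify, and the names `INDICATORS` and `matching_classes` are hypothetical.

```python
# Indicator substrings copied from the table above (lowercased).
INDICATORS = {
    "Path_Disclosure": ["../", "..%2f", "/etc/passwd", "/etc/shadow", "/proc/"],
    "SQL_Injection": ["union select", "or 1=1"],
    "Cross_Site_Scripting": ["<script>", "onerror=", "javascript:", "alert(", "prompt("],
}


def matching_classes(request):
    """Return the classes whose indicator substrings occur in the request."""
    lowered = request.lower()
    return [
        name
        for name, needles in INDICATORS.items()
        if any(needle in lowered for needle in needles)
    ]


print(matching_classes("GET /../../../../etc/passwd HTTP/1.1"))
print(matching_classes("GET /search?q=' OR 1=1-- HTTP/1.1"))
```

A rule list like this breaks down on overlapping or obfuscated payloads, which is where learned classifiers become useful.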
Models
| File | Model | Feature Extraction | Test Accuracy | Notes |
|---|---|---|---|---|
| tdidf-svc.joblib | TF-IDF + LinearSVC | word, default | 87.4% | Best generalization |
| xgb_char.joblib | TF-IDF + XGBoost | char, ngram(1,2), max_features=1024 | 88.5% | Best local accuracy |
| xgb_word.joblib | TF-IDF + XGBoost | word, NLTK tokenizer | 86.7% | |
| lgb_model.joblib | TF-IDF + LightGBM | word, NLTK tokenizer | 86.5% | |
| rf_nltk.joblib | TF-IDF + RandomForest | word, NLTK, n_estimators=1000 | 84.8% | |
| rf_gridsearch.joblib | TF-IDF + RandomForest | word, GridSearchCV best | 83.6% | best: max_depth=None, n_estimators=150 |
| rf_basic.joblib | TF-IDF + RandomForest | word, default | 83.3% | |
| catboost.joblib | TF-IDF + CatBoost | word, NLTK tokenizer | 83.0% | |
| multinomial_nb.joblib | CountVectorizer + MultinomialNB | word, default | 67.5% | Baseline |
| lstm_bidirectional.h5 | BiLSTM | Keras Tokenizer, maxlen=216 | 85.2% | Requires Keras/TF |
| textcnn_model.h5 | TextCNN | Keras Tokenizer, maxlen=256 | 86.1% | Requires Keras/TF |
Usage
Preprocessing
import urllib.parse

def preprocess(payload: str) -> str:
    # Decode percent-encoding and turn '+' into spaces before classification.
    return urllib.parse.unquote_plus(payload)
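The effect of this decoding step can be seen on an encoded XSS payload. The sketch below also includes a hypothetical `decode_fully` helper that repeats the decoding for double-encoded input. That extra step is an assumption for illustration: the `preprocess` function above applies a single pass.

```python
import urllib.parse


def decode_fully(payload: str, max_rounds: int = 5) -> str:
    """Apply unquote_plus until the string stops changing (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = urllib.parse.unquote_plus(payload)
        if decoded == payload:
            break
        payload = decoded
    return payload


# Single pass, equivalent to preprocess().
single = urllib.parse.unquote_plus("/search?q=%3Cscript%3Ealert(1)%3C/script%3E")
# Double-encoded input needs two passes to reach the plain form.
double = decode_fully("/search?q=%253Cscript%253E")
print(single)
print(double)
```

Note that `unquote_plus` also converts a literal `+` to a space, and that inference-time preprocessing should match whatever was applied during training.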
sklearn-based models (joblib)
Applies to: tdidf-svc.joblib, xgb_char.joblib, xgb_word.joblib, lgb_model.joblib, rf_*.joblib, catboost.joblib, multinomial_nb.joblib
Each file is a scikit-learn Pipeline with the vectorizer and classifier bundled together, so raw text can be passed directly.
import joblib

model = joblib.load("xgb_char.joblib")

payloads = [
    "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n",
    "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n",
]
predictions = model.predict(payloads)
print(predictions)
# ['Path_Disclosure', 'SQL_Injection']
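For readers unfamiliar with bundled pipelines, the following sketch shows how a vectorizer and classifier can be combined, saved with joblib, and reloaded. It uses toy data and default settings; the actual training code and hyperparameters of the published files are not shown in this card, so this is an assumption-laden illustration rather than a reproduction.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data for illustration only; not taken from the training set.
texts = [
    "GET /../../etc/passwd HTTP/1.1",
    "GET /../../../etc/shadow HTTP/1.1",
    "GET /item?id=1 UNION SELECT password FROM users HTTP/1.1",
    "GET /item?id=1 OR 1=1 HTTP/1.1",
]
labels = ["Path_Disclosure", "Path_Disclosure", "SQL_Injection", "SQL_Injection"]

# Vectorizer and classifier live in one object, so predict() accepts raw text.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
pipeline.fit(texts, labels)

# Round-trip through joblib, as a user of the published files would.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)
preds = restored.predict(["GET /../etc/passwd HTTP/1.1"])
print(preds)
```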
Keras-based models (.h5)
import joblib
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model("lstm_bidirectional.h5")  # or textcnn_model.h5
# The fitted Keras Tokenizer must be saved separately during training.
tokenizer = joblib.load("tokenizer.joblib")

payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"]
sequences = tokenizer.texts_to_sequences(payloads)
padded = pad_sequences(sequences, maxlen=216)  # maxlen=256 for TextCNN

pred = model.predict(padded)  # softmax scores, one row per payload
label_idx = np.argmax(pred, axis=1)  # integer class indices, not label names
print(label_idx)
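The Keras models return class indices rather than label names. A minimal sketch of mapping them back is given below. The label order here is an assumption (alphabetical, as scikit-learn's `LabelEncoder` would produce); the real order must come from the encoder used during training, and `decode_predictions` is a hypothetical helper.

```python
# ASSUMPTION: alphabetical label order; verify against the training-time encoder.
LABELS = sorted([
    "Vulnerability_Scan", "System_Cmd_Execution", "HOST_Scan",
    "Path_Disclosure", "SQL_Injection", "Cross_Site_Scripting",
    "Automatically_Searching_Infor", "Leakage_Through_NW", "Directory_Indexing",
])


def decode_predictions(probabilities):
    """Return (label, score) for each row of softmax scores."""
    results = []
    for row in probabilities:
        idx = max(range(len(row)), key=row.__getitem__)
        results.append((LABELS[idx], row[idx]))
    return results


# One fake softmax row whose largest score sits at index 5.
example = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.1, 0.0, 0.0]]
print(decode_predictions(example))
```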
Evaluation
Per-model summary
| Model | Accuracy | Macro F1 | Weakest Class (F1) |
|---|---|---|---|
| TF-IDF + XGBoost (char) | 88.5% | 0.92 | System_Cmd_Execution (0.84) |
| TF-IDF + LinearSVC | 87.4% | not reported | System_Cmd_Execution |
| TF-IDF + XGBoost (word) | 86.7% | 0.90 | System_Cmd_Execution (0.83) |
| TF-IDF + LightGBM | 86.5% | 0.91 | System_Cmd_Execution (0.83) |
| TextCNN | 86.1% | 0.89 | System_Cmd_Execution (0.79) |
| BiLSTM | 85.2% | 0.89 | System_Cmd_Execution (0.78) |
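Macro F1 averages the per-class F1 scores with equal weight, so it can differ noticeably from accuracy when classes are imbalanced. The sketch below computes it from scratch on a toy example; the labels and predictions are made up for illustration and are not results from these models.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one System_Cmd_Execution-style confusion out of four samples.
y_true = ["Vulnerability_Scan"] * 3 + ["System_Cmd_Execution"]
y_pred = ["Vulnerability_Scan"] * 2 + ["System_Cmd_Execution"] * 2
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```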
Per-class observations
- Easiest classes: Automatically_Searching_Infor and Leakage_Through_NW achieve F1 ≥ 0.99 across all models; highly distinctive tool signatures (nmap, crawlers) and file access patterns make them trivial to separate.
- Hardest class: System_Cmd_Execution consistently scores the lowest F1 (0.75–0.84) due to pattern overlap with Vulnerability_Scan. Both classes involve probing behavior with similar HTTP structure.
- char-level XGBoost advantage: Sub-word character n-grams capture attack-specific tokens like ../, <script>, UNION more robustly than word tokenization, especially for obfuscated payloads.
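The difference between the two feature types can be illustrated in plain Python. The sketch below contrasts character n-grams of length 1 to 2 (the range listed for xgb_char.joblib) with a word tokenizer that mirrors scikit-learn's default token pattern. Both helpers are hypothetical stand-ins, not the vectorizers used in the published pipelines.

```python
import re


def char_ngrams(text, n_min=1, n_max=2):
    """All character n-grams with n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def word_tokens(text):
    """Tokens of two or more word characters (scikit-learn's default pattern)."""
    return re.findall(r"(?u)\b\w\w+\b", text)


print(char_ngrams("../"))
print(word_tokens("../../etc/passwd"))
```

The traversal marker `../` survives as character n-grams but disappears entirely under word tokenization, leaving only `etc` and `passwd`.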
Architecture Details
BiLSTM
Embedding(22,883 vocab, dim=100, maxlen=216)
→ Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax)
- EarlyStopping(monitor=val_accuracy, patience=3), triggered at epoch 20
- Saved: lstm_bidirectional.h5 (28 MB)
TextCNN
Embedding(20,000 vocab, dim=128, maxlen=256)
→ Conv1D(128, kernel=3) ─┐
→ Conv1D(128, kernel=4) ─┼→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax)
→ Conv1D(128, kernel=5) ─┘
Total params: 2.86M
- EarlyStopping(monitor=val_loss, patience=3), triggered at epoch 5
- Saved: textcnn_model.h5 (33 MB)
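The quoted TextCNN parameter total can be checked by hand from the layer sizes listed above. The sketch assumes every Conv1D and Dense layer has a bias term and that no other trainable layers exist; the helper names are illustrative.

```python
def conv1d_params(in_channels, filters, kernel):
    """Weights plus one bias per filter."""
    return kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


embedding = 20_000 * 128  # vocab size x embedding dim
convs = sum(conv1d_params(128, 128, k) for k in (3, 4, 5))
dense = dense_params(384, 256) + dense_params(256, 9)  # Concat(384) feeds Dense(256)
total = embedding + convs + dense
print(total, round(total / 1e6, 2))
```

This gives 2,857,865 parameters, which rounds to the 2.86M stated for the model. The embedding table accounts for roughly 90% of the total.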
Key Findings
- TF-IDF outperforms deep learning on HTTP attack data: the best TF-IDF models (88.5% and 87.4%) beat TextCNN (86.1%) and BiLSTM (85.2%). Attack patterns rely on decisive keywords (UNION SELECT, ../, <script>, wget). Bag-of-words representations capture these directly, while sequential models can be distracted by irrelevant header noise.
- char-level features beat word-level: Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., %3Cscript%3E vs <script>).
- Class imbalance effect: Vulnerability_Scan dominates at 37.5%, and models tend to over-predict this class for ambiguous samples.
Environment
| Item | Value |
|---|---|
| Python | 3.12 |
| scikit-learn | 1.x |
| XGBoost | 3.2.0 |
| LightGBM | 4.6.0 |
| CatBoost | 1.2.10 |
| TensorFlow / Keras | 2.x |
License
MIT License