Skip to content

πŸ›‘οΈ Enterprise-grade AI security framework protecting LLMs from prompt injection attacks using ML-powered detection

License

Notifications You must be signed in to change notification settings

khanovico/prompt-guard

Repository files navigation

πŸ›‘οΈ Guardrail: Advanced AI Security Framework for GenAI Applications

Python 3.12+ License: Apache 2.0 Security Vector Search

Enterprise-grade AI security solution that protects Generative AI applications from prompt injection attacks, malicious inputs, and adversarial exploits using advanced machine learning techniques.

🎯 Overview

Guardrail is a production-ready security framework designed to safeguard Large Language Models (LLMs) and Generative AI applications from sophisticated attacks. Built with enterprise scalability in mind, it combines multiple detection strategies to provide comprehensive protection against prompt injection, adversarial inputs, and malicious content.

πŸš€ Key Features

  • πŸ” Multi-Layer Detection: Combines semantic similarity, anomaly detection, entropy analysis, and AI-powered validation
  • ⚑ Dual Execution Modes: Pipeline (sequential) and Mixture of Experts (parallel) approaches
  • πŸ”„ Flexible Vector Storage: Support for both FAISS (local) and MongoDB Atlas (cloud) vector databases
  • 🎯 High Accuracy: Trained on real-world prompt injection datasets with proven effectiveness
  • πŸ”§ Modular Architecture: Easy to extend and customize for specific use cases
  • βš™οΈ Production Ready: Optimized for high-throughput, low-latency environments

πŸ—οΈ Architecture

Core Components

1. Semantic Similarity Engine

  • Model: all-MiniLM-L6-v2 sentence transformer
  • Purpose: Detects malicious prompts by comparing against known attack patterns
  • Storage: FAISS index or MongoDB vector search
  • Performance: Sub-millisecond similarity scoring

2. Anomaly Detection System

  • Algorithm: One-Class Support Vector Machine (OCSVM)
  • Features: TF-IDF vectorization of text inputs
  • Training: Pre-trained on malicious prompt datasets
  • Output: Anomaly score with confidence metrics

3. Entropy Analysis Module

  • Method: Shannon entropy calculation on tokenized text
  • Purpose: Identifies suspicious patterns in text complexity
  • Threshold: Configurable upper bound for entropy scores
  • Use Case: Detects obfuscated or encoded malicious content

4. AI-Powered Validation

  • Model: DeBERTa-v3-base fine-tuned for prompt injection detection
  • Provider: ProtectAI's specialized security model
  • Capabilities: Binary classification (INJECTION vs. NORMAL)
  • Accuracy: High precision in detecting sophisticated attacks

5. Input Sanitization

  • Function: Detects invisible Unicode characters
  • Coverage: Zero-width spaces, directional markers, and control characters
  • Purpose: Prevents character-level obfuscation attacks

Execution Strategies

Pipeline Mode (Sequential)

Input β†’ Sanitize β†’ Similarity β†’ Anomaly β†’ Entropy β†’ Validation β†’ Decision
  • Advantage: Early termination on first violation
  • Use Case: High-security environments requiring strict blocking
  • Performance: Optimized for speed with configurable thresholds

Mixture of Experts Mode (Parallel)

Input β†’ [Sanitize, Similarity, Anomaly, Entropy, Validation] β†’ Weighted Decision
  • Advantage: Comprehensive analysis with weighted scoring
  • Use Case: Research and analysis scenarios
  • Performance: Parallel execution with ensemble decision making

πŸ”§ Technical Specifications

System Requirements

Component Minimum Recommended
Python 3.12+ 3.12+
RAM 4 GB 8 GB+
Storage 500 MB 2 GB+
CPU 64-bit Multi-core
GPU Optional CUDA-compatible

Dependencies

Core Dependencies

  • sentence-transformers==2.2.2: Semantic embeddings and similarity
  • faiss-cpu==1.7.4: High-performance vector similarity search
  • scikit-learn: Machine learning algorithms (OCSVM)
  • nltk: Natural language processing and entropy calculation
  • transformers: Hugging Face model integration
  • torch: PyTorch for deep learning inference

Vector Storage Options

  • FAISS: Local vector database with optimized similarity search
  • MongoDB Atlas: Cloud-based vector search with enterprise features

Development Tools

  • uv: Fast Python package manager and environment management
  • ruff: High-performance Python linter and formatter
  • datasets: Hugging Face datasets for testing and evaluation

Performance Metrics

Metric Pipeline Mode Mixture of Experts
Latency < 50ms < 100ms
Throughput 1000+ req/s 500+ req/s
Accuracy 95%+ 97%+
False Positive Rate < 2% < 1%

πŸ›‘οΈ Security Capabilities

Attack Vectors Protected

1. Prompt Injection Attacks

  • SQL Injection-style prompts: "Ignore previous instructions and..."
  • Role manipulation: "You are now a different AI..."
  • System prompt extraction: Attempts to reveal internal instructions
  • Jailbreak techniques: Various methods to bypass safety measures

2. Adversarial Inputs

  • Character-level obfuscation: Invisible Unicode characters
  • Text encoding tricks: Base64, URL encoding, etc.
  • Semantic manipulation: Sophisticated language patterns
  • Context poisoning: Malicious context injection

3. Malicious Content

  • Harmful instructions: Requests for dangerous activities
  • Privacy violations: Attempts to extract sensitive information
  • System exploitation: Commands to modify system behavior
  • Data exfiltration: Attempts to access unauthorized data

Detection Methods

Semantic Similarity Detection

  • Vector embeddings: Converts text to high-dimensional vectors
  • Similarity scoring: Cosine similarity against known malicious patterns
  • Threshold-based: Configurable similarity thresholds
  • Real-time updates: Dynamic pattern database updates

Anomaly Detection

  • One-Class SVM: Trained on normal behavior patterns
  • Feature extraction: TF-IDF vectorization of text features
  • Outlier detection: Identifies deviations from normal patterns
  • Confidence scoring: Probability-based anomaly scores

Entropy Analysis

  • Information theory: Shannon entropy calculation
  • Token analysis: Word-level entropy measurement
  • Pattern recognition: Identifies suspicious text complexity
  • Threshold monitoring: Configurable entropy limits

AI Validation

  • Specialized model: DeBERTa fine-tuned for security
  • Binary classification: INJECTION vs. NORMAL classification
  • Confidence scoring: High-precision attack detection
  • Continuous learning: Model updates for new threats

🏒 Enterprise Features

Scalability

  • Horizontal scaling: Stateless design for load balancing
  • Vertical scaling: GPU acceleration support
  • Database scaling: MongoDB Atlas integration for large-scale deployments
  • Caching: Vector similarity caching for improved performance

Monitoring & Observability

  • Comprehensive logging: Detailed execution traces
  • Performance metrics: Latency and throughput monitoring
  • Security analytics: Attack pattern analysis
  • Health checks: System status monitoring

Integration Capabilities

  • REST API: Standard HTTP interface
  • Python SDK: Native Python integration
  • Docker support: Containerized deployment
  • Kubernetes: Orchestration-ready

Compliance & Governance

  • Audit trails: Complete request/response logging
  • Data privacy: No sensitive data storage
  • Configurable policies: Custom security rules
  • Multi-tenant support: Isolated environments

πŸ”¬ Research & Development

Dataset Integration

  • SPML Chatbot Prompt Injection: Real-world attack dataset
  • Custom datasets: Support for domain-specific training
  • Continuous evaluation: Regular model performance assessment
  • A/B testing: Framework for comparing detection strategies

Model Training

  • Transfer learning: Leveraging pre-trained models
  • Fine-tuning: Domain-specific model adaptation
  • Ensemble methods: Combining multiple detection approaches
  • Active learning: Continuous model improvement

Performance Optimization

  • Vector quantization: Optimized similarity search
  • Model compression: Reduced memory footprint
  • Batch processing: Efficient bulk operations
  • GPU acceleration: CUDA support for inference

πŸ“Š Benchmarks & Evaluation

Test Datasets

  • SPML Prompt Injection Dataset: 1000+ malicious and benign prompts
  • Custom evaluation sets: Domain-specific test cases
  • Adversarial examples: Sophisticated attack patterns
  • Performance benchmarks: Latency and accuracy measurements

Evaluation Metrics

  • Accuracy: Overall classification performance
  • Precision: True positive rate for malicious detection
  • Recall: Coverage of actual attacks
  • F1-Score: Balanced performance metric
  • Latency: Response time measurements
  • Throughput: Requests per second

Results

  • Detection Accuracy: 95%+ on real-world datasets
  • False Positive Rate: < 2% in production environments
  • Latency: < 50ms average response time
  • Scalability: 1000+ concurrent requests

🀝 Contributing

We welcome contributions from the security and AI communities! Please see our contributing guidelines for:

  • Security research: New attack vector detection
  • Performance optimization: Faster inference and training
  • Model improvements: Better accuracy and coverage
  • Documentation: Enhanced guides and examples
  • Testing: Comprehensive test coverage

πŸ“„ License

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.

πŸ”— Links

Guardrail - Protecting the future of AI, one prompt at a time. πŸ›‘οΈ

About

πŸ›‘οΈ Enterprise-grade AI security framework protecting LLMs from prompt injection attacks using ML-powered detection

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

  •  
  •