Securing Tomorrow’s AI World Today: Llama Guard Defensive Strategies for LLM Application
Introduction
We’ve established robust security measures for traditional web applications and cloud platforms, the rise of Generative AI and Large Language Models (LLMs) has introduced a new set of challenges.
Gone are the days of simply plugging pre-trained models into your app and calling it a day. Today’s developers must grapple with unique vulnerabilities like jailbreak prompt,prompt injection, insecure output handling, and model poisoning. These threats can have dire consequences, from spreading misinformation to compromising sensitive data.
An example attack scenario of jailbreak prompt. Refer full paper
Blog Updated
26-Jan-2024 :- GitHub Code Added to easily setup in your system.
Get ready to:
- Understand the critical security threats lurking within your LLM models.
- Discover practical tools and techniques to protect your applications from malicious attacks.
- Deep-dive into specific solutions like Llama Guard and its customization vulnerability template implementation steps.
- Stay ahead of the curve with future installments covering additional tools and strategies.
Critical Security Threats on LLM Model
The Open Web Application Security Project (OWASP) has identified the top 10 critical vulnerabilities specific to LLM applications, providing a valuable roadmap for developers and organizations. Let’s take a closer look at these vulnerabilities:
For in-depth details, explore the OWSAP website to fortify your understanding of security in LLM applications.
Tools and Solution
In the realm of security, a myriad of open-source tools awaits exploration, listed in the below section.
Paid Tools:
- Perspective API
- OpenAI Content Moderation API
- Azure Content Safety API
Open Source Tools/Models:
- PurpleLlama (Llama Guard Model)
- NeMo-Guardrails
- Plexiglass
- Rebuff
- Garak
- LLMFuzzer
- LLM Guard
- Vigil
Let’s begin by delving into the intricacies of Llama Guard, an essential tool for enhancing security measures in LLM models. The standout feature of Llama Guard lies in its ability to easily incorporate vulnerability details into the Taxonomy template, allowing for seamless customization based on specific needs.
As I actively address security issues in LLM models, stay tuned for upcoming blogs where I will provide more comprehensive details about additional tools and their practical applications.
What is Llama Guard?
Llama Guard developed by Facebook Meta team. It’s open source Model.
- Built on a powerful LLM (Llama2–7b): This ensures it has the natural language processing ability needed to understand complex inputs and outputs.
- Input/output safeguard model: It can analyze both prompts and responses, making it a comprehensive security solution.
- Safe/unsafe classification: Provides a clear assessment of potential risks associated with LLM outputs.
- Detailed explanation for unsafe outputs: The taxonomy category violations breakdown is incredibly helpful for identifying specific areas of concern.
- Fine-tuning with custom datasets: The ability to adapt the model to your specific needs and vulnerabilities creates a truly personalized security solution.
- Dataset: The dataset comprises of 13,997 prompts and responses annotated datasets.
Llama guard default risk guidelines
Llama Guard covers following categories by default.
- Violence&Hate (ex: race, color, religion, national origin, sexual orientation, gender, gender identity, or disability)
- Sexual Content (i.e., erotic)
- Guns & Illegal weapons (ex: explosives, biological agents, or chemical weapons)
- Regulated or Controlled substances (ex.illegal drugs, tobacco, alcohol, or cannabis)
- Suicide & Self Harm (ex: by providing instructions or information on methods of self-harm).
- Criminal Planning (ex: statements that encourage violence should be considered violating under Violence)
We can add,edit and delete using default template.
Model Performance
Meta team compared Llama Guard model with OpenAI API and Perspective API.
Why Llama Guard so powerful?
- The tools we have now don’t really tell the difference between checking the safety of what the user is doing and what the AI is doing. These are two separate jobs because users usually ask for help, while the AI gives them answers.
- Those paid tool sticks to a set of unchanging rules. So, it’s hard to make it work with new rules.
- Also, above tool you have to pay for only lets you use it through an API. This means you can’t tweak it to fit your specific needs very well.
Implementation Steps
We are going to use Hugging face Transformers library to run Llama Guard model.
Step 1: Login to Hugging Face Hub
log into the Hugging Face Hub to download the model.
from huggingface_hub import login
login()
Step 2: Import Necessary library
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
Step 3: Define Model details
model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16
Step 4: Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)
Step 5: Define Scan function
def scan(chat):
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
prompt_len = input_ids.shape[-1]
return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
Now, let’s understand each variable:
chat
: This is the input parameter for the function, representing a conversation. It seems to be a list of dictionaries where each dictionary has a "role" (e.g., "user" or "assistant") and "content" (the actual message).input_ids
: This variable is created by applying the chat template using the tokenizer. It converts the conversation into PyTorch tensors, and thereturn_tensors="pt"
specifies that PyTorch tensors should be returned.output
: This variable is the result of generating a response using the pre-trained language model (model.generate
). Themax_new_tokens=100
limits the length of the generated response, andpad_token_id=0
specifies the padding token.prompt_len
: The length of the input prompt is determined byinput_ids.shape[-1]
, which retrieves the size along the last dimension of the input_ids tensor.return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
: The final result is obtained by decoding the generated output. It skips special tokens and only returns the generated content.
Step 6: Call the function with prompt
scan([
{"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
{"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
])
Step 7: Output
Llama Guard outputs a binary classification of ‘safe’ or ‘unsafe’. If the response is ‘unsafe’, it’s accompanied by a category number (e.g., ‘unsafe 01’). This number corresponds to a specific taxonomy category identified as violated, providing valuable insight into the nature of the potential risk.
Exciting news!!!
Here’s the GitHub Code repo to run Streamlit web application that will offer a seamless playground for exploring the Llama Guard model. Here, you can effortlessly add or edit the Llama Guard template as per your needs.
Let’s connect on LinkedIn and GitHub to stay in the loop on this project.
Feel free to drop a comment with any questions or if you need any assistance.