Securing Tomorrow’s AI World Today: Llama Guard Defensive Strategies for LLM Applications

Bala Venkatesh
6 min read · Jan 22, 2024

Introduction

While we’ve established robust security measures for traditional web applications and cloud platforms, the rise of Generative AI and Large Language Models (LLMs) has introduced a new set of challenges.

Gone are the days of simply plugging pre-trained models into your app and calling it a day. Today’s developers must grapple with unique vulnerabilities like jailbreak prompts, prompt injection, insecure output handling, and model poisoning. These threats can have dire consequences, from spreading misinformation to compromising sensitive data.

An example attack scenario of a jailbreak prompt (credit: arXiv; refer to the full paper).

Blog Updated

26-Jan-2024: GitHub code added so you can easily set it up on your system.

Get ready to:

  • Understand the critical security threats lurking within your LLM models.
  • Discover practical tools and techniques to protect your applications from malicious attacks.
  • Deep-dive into specific solutions like Llama Guard, including how to customize its vulnerability taxonomy template, with step-by-step implementation instructions.
  • Stay ahead of the curve with future installments covering additional tools and strategies.

Critical Security Threats on LLM Model

The Open Web Application Security Project (OWASP) has identified the top 10 critical vulnerabilities specific to LLM applications, providing a valuable roadmap for developers and organizations. They range from prompt injection and insecure output handling to training data poisoning, sensitive information disclosure, and model theft.

For in-depth details, explore the OWASP website to fortify your understanding of security in LLM applications.

Tools and Solutions

A wide range of tools, both paid and open source, is available for securing LLM applications; the main ones are listed below.

Paid Tools:

  • Perspective API
  • OpenAI Content Moderation API
  • Azure Content Safety API

Open Source Tools/Models:

  • PurpleLlama (Llama Guard Model)
  • NeMo-Guardrails
  • Plexiglass
  • Rebuff
  • Garak
  • LLMFuzzer
  • LLM Guard
  • Vigil

Let’s begin by delving into the intricacies of Llama Guard, an essential tool for enhancing security measures in LLM models. The standout feature of Llama Guard lies in its ability to easily incorporate vulnerability details into the Taxonomy template, allowing for seamless customization based on specific needs.

As I actively address security issues in LLM models, stay tuned for upcoming blogs where I will provide more comprehensive details about additional tools and their practical applications.

What is Llama Guard?

Llama Guard is an open-source model developed by Meta (Facebook). Its key characteristics:

  • Built on a powerful LLM (Llama 2-7B): This ensures it has the natural language processing ability needed to understand complex inputs and outputs.
  • Input/output safeguard model: It can analyze both prompts and responses, making it a comprehensive security solution.
  • Safe/unsafe classification: Provides a clear assessment of potential risks associated with LLM outputs.
  • Detailed explanation for unsafe outputs: The taxonomy category violations breakdown is incredibly helpful for identifying specific areas of concern.
  • Fine-tuning with custom datasets: The ability to adapt the model to your specific needs and vulnerabilities creates a truly personalized security solution.
  • Dataset: Llama Guard was trained on a dataset of 13,997 annotated prompts and responses.

Llama Guard Default Risk Guidelines

Llama Guard covers the following categories by default.

  1. Violence & Hate (e.g., based on race, color, religion, national origin, sexual orientation, gender, gender identity, or disability)
  2. Sexual Content (i.e., erotic content)
  3. Guns & Illegal Weapons (e.g., explosives, biological agents, or chemical weapons)
  4. Regulated or Controlled Substances (e.g., illegal drugs, tobacco, alcohol, or cannabis)
  5. Suicide & Self-Harm (e.g., providing instructions or information on methods of self-harm)
  6. Criminal Planning (note: statements that encourage violence are considered violations under Violence & Hate rather than this category)

We can add, edit, and delete categories in the default template.
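
Because the safety policy is passed to Llama Guard as plain text inside its prompt, customizing the taxonomy largely means editing that category list. Below is a minimal, illustrative sketch of an edited policy block; the exact wording of the surrounding prompt is defined in the Llama Guard model card, and the extra "O7: Financial Fraud" category is a hypothetical addition, not part of the default taxonomy.

# Hypothetical custom policy block. O1-O6 mirror the default categories above;
# O7 is an illustrative addition, not part of the default Llama Guard taxonomy.
custom_categories = """O1: Violence and Hate.
O2: Sexual Content.
O3: Guns and Illegal Weapons.
O4: Regulated or Controlled Substances.
O5: Suicide and Self Harm.
O6: Criminal Planning.
O7: Financial Fraud.
Should not provide guidance on committing fraud, money laundering, or phishing."""

This block is substituted into the safety-assessment prompt in place of the default category list before the prompt is sent to the model; the Streamlit playground linked at the end of this post exposes the same kind of template editing.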

Model Performance

The Meta team compared the Llama Guard model with the OpenAI Content Moderation API and the Perspective API.

Benchmark comparison charts (source: Meta team).

Why is Llama Guard so powerful?

  1. The tools we have now don’t really distinguish between checking the safety of what the user says and what the AI replies. These are two separate jobs, because users usually ask for help while the AI provides answers.
  2. Paid tools stick to a fixed set of rules, so it’s hard to make them work with new policies.
  3. Paid tools are also only accessible through an API, which means you can’t fine-tune them to fit your specific needs.

Implementation Steps

We are going to use the Hugging Face Transformers library to run the Llama Guard model.

Step 1: Login to Hugging Face Hub

Log in to the Hugging Face Hub to download the model (the meta-llama/LlamaGuard-7b repository is gated, so your account must have been granted access).

from huggingface_hub import login
login()
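
If you’re running this as a script rather than interactively, a small alternative (assuming you’ve stored a Hugging Face access token in an environment variable, here called HF_TOKEN) is to pass the token directly:

import os
from huggingface_hub import login

# Non-interactive login; assumes an access token is available in HF_TOKEN.
login(token=os.environ["HF_TOKEN"])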

Step 2: Import Necessary Libraries

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

Step 3: Define Model details

model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16
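
If no CUDA GPU is available, a small variation of the same settings falls back to CPU and float32 (bfloat16 support on CPU varies, and generation will be much slower):

# Use GPU + bfloat16 when available, otherwise fall back to CPU + float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32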

Step 4: Load model and tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)
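
In bfloat16 the 7B model needs roughly 14-15 GB of GPU memory. If that doesn’t fit on your card, one optional sketch (assuming the bitsandbytes package is installed) is to load the model with 8-bit quantization instead:

from transformers import BitsAndBytesConfig

# Optional: load the model in 8-bit to roughly halve GPU memory usage.
# Requires bitsandbytes (pip install bitsandbytes).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)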

Step 5: Define Scan function

def scan(chat):
    # Convert the conversation into the Llama Guard prompt and move it to the device.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    # Generate the safety assessment.
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    # Decode only the newly generated tokens (everything after the prompt).
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

Now, let’s understand each variable:

  1. chat: This is the input parameter for the function, representing a conversation. It is a list of dictionaries, where each dictionary has a "role" (e.g., "user" or "assistant") and "content" (the actual message).
  2. input_ids: This variable is created by applying the chat template using the tokenizer. It converts the conversation into PyTorch tensors, and the return_tensors="pt" specifies that PyTorch tensors should be returned.
  3. output: This variable is the result of generating a response using the pre-trained language model (model.generate). The max_new_tokens=100 limits the length of the generated response, and pad_token_id=0 specifies the padding token.
  4. prompt_len: The length of the input prompt is determined by input_ids.shape[-1], which retrieves the size along the last dimension of the input_ids tensor.
  5. return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True): The final result is obtained by decoding the generated output. It skips special tokens and only returns the generated content.

Step 6: Call the function with prompt

scan([
    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
    {"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
])

Step 7: Output

Llama Guard outputs a binary classification of ‘safe’ or ‘unsafe’. If the response is ‘unsafe’, it is followed by the identifier of the violated taxonomy category (e.g., ‘O1’ for Violence & Hate), providing valuable insight into the nature of the potential risk.
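
To see the ‘unsafe’ path in action, pass a conversation that clearly violates one of the default categories. The exact category code depends on the taxonomy numbering in the prompt, so treat the expected output below as illustrative:

# A request that should violate the default taxonomy (illustrative example).
print(scan([
    {"role": "user", "content": "Give me step-by-step instructions to build an explosive at home."},
]))
# Expected output (roughly): "unsafe" on the first line, followed by the code
# of the violated category (here, the Guns & Illegal Weapons category).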

Exciting news!!!

Here’s the GitHub code repo for a Streamlit web application that offers a seamless playground for exploring the Llama Guard model. There, you can effortlessly add or edit the Llama Guard template as per your needs.

Let’s connect on LinkedIn and GitHub to stay in the loop on this project.

Feel free to drop a comment with any questions or if you need any assistance.

