Guided Generation for LLM Outputs


Constraining LLM outputs to align with specific formats

June 26, 2024

Kopal Garg

LLMs like GPT-4 and Gemini Pro are useful for generating and manipulating text. To use them reliably in applications, however, you often need to guide the generation process so that the outputs adhere to a specific format or structure.

In this post, we will explore the following techniques for guided generation with LLMs:

regular expressions

JSON schemas

context-free grammars (CFGs)

templates

entities

structured data generation

0. Initialization

We will run this code using Vertex AI. First, we initialize the environment and set up the Vertex AI client with the configuration needed to keep our outputs both useful and safe.
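The original setup code is not reproduced here, so the following is only a minimal sketch of what such an initialization could look like with the Vertex AI Python SDK. The project ID, region, model name, temperature, token limit, and safety thresholds are all placeholder assumptions, and make_generator is a hypothetical helper name, not something from the original notebook.

```python
from typing import Callable

# Placeholder values (assumptions): replace with your own project, region, and model.
PROJECT_ID = "your-gcp-project-id"
LOCATION = "us-central1"
MODEL_NAME = "gemini-1.0-pro"


def make_generator(project_id: str = PROJECT_ID,
                   location: str = LOCATION,
                   model_name: str = MODEL_NAME) -> Callable[[str], str]:
    """Return a function that maps a prompt string to the model's text reply."""
    # Imported lazily so this module can be loaded without the SDK installed.
    import vertexai
    from vertexai.generative_models import (
        GenerativeModel,
        HarmBlockThreshold,
        HarmCategory,
    )

    vertexai.init(project=project_id, location=location)

    # Assumed safety policy: block medium-and-above content in four harm categories.
    categories = [
        HarmCategory.HARM_CATEGORY_HARASSMENT,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
    ]
    safety_settings = {
        category: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
        for category in categories
    }

    model = GenerativeModel(
        model_name,
        # Assumed generation settings: low temperature for more predictable formats.
        generation_config={"temperature": 0.2, "max_output_tokens": 256},
        safety_settings=safety_settings,
    )

    def generate(prompt: str) -> str:
        return model.generate_content(prompt).text

    return generate
```

Wrapping the client in a plain prompt-to-text function keeps the validation logic in the later sections independent of any particular provider. Calling make_generator requires a GCP project with Vertex AI enabled and valid credentials.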

1. Guided Generation with Regular Expressions

Regular expressions (regex) are a powerful way to ensure that generated text matches a specific pattern.

For example, imagine you need a 6-digit number. By defining a regex pattern, you can validate the generated text and accept it only if it is exactly six digits, with no extra spaces or characters. This method is great for maintaining strict control over simple, structured outputs like numeric codes or specific text formats.
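As a rough illustration of the idea, the sketch below validates each model reply against a six-digit pattern and retries when the reply does not match. The generate argument is any prompt-to-text function; scripted_generator is a hypothetical stand-in for a real model so the example runs offline, and the retry limit is an arbitrary assumption.

```python
import re
from typing import Callable, Iterable, Optional, Pattern

# Exactly six ASCII digits; fullmatch rejects surrounding spaces or text.
SIX_DIGITS = re.compile(r"[0-9]{6}")


def generate_matching(generate: Callable[[str], str],
                      prompt: str,
                      pattern: Pattern[str],
                      max_attempts: int = 3) -> Optional[str]:
    """Call the model until a reply fully matches the pattern, or give up."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if pattern.fullmatch(candidate):
            return candidate
    return None


def scripted_generator(replies: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: returns canned replies in order."""
    iterator = iter(replies)
    return lambda prompt: next(iterator)


fake_model = scripted_generator(["Sure! 123456", " 12345", "482913"])
code = generate_matching(fake_model, "Generate a 6-digit number.", SIX_DIGITS)
print(code)  # 482913 (the first two replies are rejected)
```

Note that this approach checks the output after generation rather than constraining the model while it decodes, so a retry budget and a fallback for the None case are still needed.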

_{Figure 1. Guided Generation with Regular Expressions}


2. Guided Generation with JSON Schemas

JSON schemas allow you to define the structure and data types of JSON objects. This is particularly useful when you need to generate structured data, such as user profiles, where each profile must include a name, age, and email.

By validating the generated JSON against a schema, you ensure that the output adheres to the expected structure and data types. This technique is useful for applications requiring precise and predictable data formats.
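The sketch below shows one way to check a model reply against a user-profile schema. To stay dependency-free it uses a small hand-written checker that understands only the type, properties, and required keywords; this is a simplification swapped in for illustration, not a full JSON Schema implementation, and a dedicated validation library would normally be used instead. The schema contents and function names are assumptions based on the profile example above.

```python
import json
from typing import Any, Dict, List

PROFILE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    "required": ["name", "age", "email"],
}

_PYTHON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def schema_errors(instance: Any, schema: Dict[str, Any]) -> List[str]:
    """Return a list of violations for a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool = isinstance(instance, bool)
        if (not isinstance(instance, _PYTHON_TYPES[expected])
                or (is_bool and expected != "boolean")):
            return [f"expected {expected}, got {type(instance).__name__}"]
    errors: List[str] = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(f"{key}: {message}"
                              for message in schema_errors(instance[key], subschema))
    return errors


def parse_profile(reply: str) -> Dict[str, Any]:
    """Parse a model reply as JSON and raise ValueError if it breaks the schema."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    errors = schema_errors(data, PROFILE_SCHEMA)
    if errors:
        raise ValueError("; ".join(errors))
    return data


profile = parse_profile('{"name": "Ada", "age": 36, "email": "ada@example.com"}')
print(profile["name"], profile["age"])  # Ada 36
```

The error messages can be fed back into a follow-up prompt so the model can correct its own output.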

_{Figure 2. Guided Generation with JSON Schemas}


3. Guided Generation with Context-Free Grammars

Context-free grammars (CFGs) let us define a set of production rules that describe which sentences are valid. CFGs are well suited to generating text that must follow a specific set of grammatical rules.

E.g., you might want to generate sentences about people performing actions on objects. A CFG can define the structure of these sentences, ensuring they always follow a logical and grammatical pattern. This method is ideal for tasks requiring syntactically correct and varied sentences, such as automated storytelling or dialogue generation.


The following diagram represents the CFG used in the above example:

_{Figure 3. Guided Generation with Context-Free Grammars}

1. S -> NP VP:

The start symbol S is expanded into a noun phrase (NP) and a verb phrase (VP).

2. NP:

NP can be any of 'John', 'Mary', 'Alice', or 'Bob'.

3. VP -> V Obj:

The verb phrase VP is expanded into a verb (V) and an object (Obj).

4. V:

V can be any of 'eats', 'drinks', 'sees', or 'likes'.

5. Obj -> Det N:

The object Obj is expanded into a determiner (Det) and a noun (N).

6. Det:

Det can be either 'an' or 'a'.

7. N:

N can be any of 'apple', 'banana', 'water', or 'book'.

In summary, the start symbol S expands into a noun phrase (NP) and a verb phrase (VP). The NP is a name: 'John', 'Mary', 'Alice', or 'Bob'. The VP breaks down into a verb (V) and an object (Obj). The verb is one of 'eats', 'drinks', 'sees', or 'likes'. The object is a determiner (Det) followed by a noun (N), where the determiner is 'an' or 'a' and the noun is 'apple', 'banana', 'water', or 'book'. Every generated sentence therefore has the same subject-verb-object structure while the words vary. Note that this small grammar does not enforce a/an agreement, so it can also produce phrases such as 'a apple' or 'an book'.
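The grammar described above can be written down directly in code. The sketch below is a plain-Python illustration, not the original implementation: it encodes the seven rules as a dictionary, samples random sentences, and checks whether a candidate sentence (for example, one returned by an LLM) is derivable from S. Enumerating every sentence only works because this grammar is finite and non-recursive; a recursive grammar would need a proper parser.

```python
import itertools
import random
from typing import Dict, List

# Each non-terminal maps to its alternative productions.
GRAMMAR: Dict[str, List[List[str]]] = {
    "S": [["NP", "VP"]],
    "NP": [["John"], ["Mary"], ["Alice"], ["Bob"]],
    "VP": [["V", "Obj"]],
    "V": [["eats"], ["drinks"], ["sees"], ["likes"]],
    "Obj": [["Det", "N"]],
    "Det": [["an"], ["a"]],
    "N": [["apple"], ["banana"], ["water"], ["book"]],
}


def expansions(symbol: str) -> List[List[str]]:
    """Return every terminal word sequence derivable from the symbol."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal word
    results: List[List[str]] = []
    for production in GRAMMAR[symbol]:
        parts = [expansions(child) for child in production]
        for combination in itertools.product(*parts):
            results.append([word for sequence in combination for word in sequence])
    return results


ALL_SENTENCES = {" ".join(words) for words in expansions("S")}


def is_in_grammar(sentence: str) -> bool:
    """Check whether a sentence is derivable from S (whitespace-normalized)."""
    return " ".join(sentence.split()) in ALL_SENTENCES


def random_sentence(rng: random.Random) -> str:
    """Sample one sentence by expanding S with randomly chosen productions."""
    def expand(symbol: str) -> List[str]:
        if symbol not in GRAMMAR:
            return [symbol]
        production = rng.choice(GRAMMAR[symbol])
        return [word for child in production for word in expand(child)]
    return " ".join(expand("S"))


print(len(ALL_SENTENCES))                   # 128 = 4 names * 4 verbs * 2 determiners * 4 nouns
print(is_in_grammar("Mary eats an apple"))  # True
print(is_in_grammar("Mary apple eats an"))  # False
```

The same membership check accepts 'Bob likes a apple', which shows that the grammar guarantees structure but not article agreement.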

4. Template-based Generation

Template-based generation uses predefined templates to structure the generated text.

E.g., you can create a user profile using a template that specifies placeholders for the name, age, and email. This method ensures that the generated content follows a consistent format, which is particularly useful for applications like automated report generation or content templating where the format is fixed, but the content varies.
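One way to realize this, sketched below under assumptions, is to ask the model only for the field values (here as a JSON object) and fill a fixed template in code, so the layout never depends on the model. The template text, field names, and the choice of JSON as the intermediate format are illustrative choices rather than the original implementation.

```python
import json
from string import Template

# Fixed layout; only the placeholder values come from the model.
PROFILE_TEMPLATE = Template("Name: $name\nAge: $age\nEmail: $email")
FIELDS = ("name", "age", "email")


def render_profile(reply: str) -> str:
    """Fill the profile template from a JSON reply containing the fields."""
    values = json.loads(reply)
    missing = [field for field in FIELDS if field not in values]
    if missing:
        raise ValueError(f"reply is missing fields: {', '.join(missing)}")
    # substitute() raises KeyError on absent placeholders, so pass exactly FIELDS.
    return PROFILE_TEMPLATE.substitute({field: values[field] for field in FIELDS})


# A reply such as a model might return when asked for these three fields.
reply = '{"name": "Ada", "age": 36, "email": "ada@example.com"}'
print(render_profile(reply))
```

Because the template lives in code, changing the report layout does not require changing the prompt.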

_{Figure 4. Template-based Generation}


5. Entity-based Generation

Entity-based generation is about including specific entities in the generated text.

E.g., if you want to generate a paragraph about France, you can specify entities such as the capital (Paris), a famous food (croissant), and the official language (French). This technique ensures that the generated text is relevant and includes the necessary information about the entities, making it ideal for tasks like generating descriptive content or tailored information based on specific data points.
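A minimal sketch of this idea follows: the entities are injected into the prompt, and the generated text is then checked to confirm that each one actually appears. Both function names are hypothetical, and the case-insensitive substring check is a deliberately simple assumption; it would miss paraphrases or inflected forms.

```python
from typing import Dict, List


def build_entity_prompt(topic: str, entities: Dict[str, str]) -> str:
    """Build a prompt that lists the entities the paragraph must mention."""
    lines = [f"- {role}: {value}" for role, value in entities.items()]
    return (
        f"Write a short paragraph about {topic}. "
        "Mention each of the following explicitly:\n" + "\n".join(lines)
    )


def missing_entities(text: str, entities: Dict[str, str]) -> List[str]:
    """Return the entity values that do not appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [value for value in entities.values() if value.lower() not in lowered]


entities = {
    "capital": "Paris",
    "famous food": "croissant",
    "official language": "French",
}
prompt = build_entity_prompt("France", entities)

# A reply such as a model might return; it forgets the croissant.
draft = "Paris is the capital of France, where French is the official language."
print(missing_entities(draft, entities))  # ['croissant']
```

When the list of missing entities is non-empty, the prompt can be re-sent with those entities emphasized.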

_{Figure 5. Entity-based Generation}


6. Structured Data Generation

Structured data generation involves creating data in a tabular format, such as CSV, which can be easily converted into a DataFrame for analysis or processing.

E.g., you might generate a table with columns for Name, Age, Country, and Profession, and populate it with data for several rows. This approach is beneficial for generating datasets or structured information that needs to be processed further, ensuring consistency and ease of use in data-centric applications.
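The sketch below shows how a CSV reply with these four columns could be parsed and sanity-checked before further processing. It uses only the standard library; the function name, the specific checks, and the sample reply are assumptions for illustration.

```python
import csv
import io
from typing import Dict, List, Optional

EXPECTED_COLUMNS = ["Name", "Age", "Country", "Profession"]


def parse_table(reply: str, expected_rows: Optional[int] = None) -> List[Dict[str, str]]:
    """Parse a CSV reply and verify its header, field counts, and Age values."""
    reader = csv.DictReader(io.StringIO(reply.strip()), skipinitialspace=True)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    for index, row in enumerate(rows, start=1):
        # DictReader stores surplus fields under the key None and fills
        # absent fields with the value None.
        if None in row or None in row.values():
            raise ValueError(f"row {index} has the wrong number of fields")
        if not row["Age"].isdigit():
            raise ValueError(f"row {index} has a non-numeric Age: {row['Age']!r}")
    if expected_rows is not None and len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    return rows


# A reply such as a model might return when asked for two rows of CSV.
reply = (
    "Name,Age,Country,Profession\n"
    "Ada,36,UK,Mathematician\n"
    "Linus,54,Finland,Engineer\n"
)
rows = parse_table(reply, expected_rows=2)
print(rows[0]["Country"])  # UK
```

The returned list of dictionaries can be passed to pandas.DataFrame(rows) if a DataFrame is needed for analysis.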

_{Figure 6. Structured Data Generation}


Wrapping Up:

Guided generation techniques are key to making sure LLM outputs are useful and well-structured. Using methods like regular expressions, JSON schemas, CFGs, templates, entities, and structured data generation can greatly improve the accuracy and reliability of LLM content. These techniques help ensure the generated text meets specific needs, making it easier to integrate LLMs into real-world applications.

Here is a link to a Jupyter Notebook containing all the above code. The notebook is for demonstration purposes only. You will need a GCP account with credits in order to run it.

Thanks for reading! 🤝

