
Data Generator

Eren Chenyang Zhao 赵晨阳 edited this page Apr 6, 2023 · 4 revisions

In addition to the user's system description, I believe the data generator should receive a few more parameters.

Model_config

Which LLM should the user be able to select? At present, I am only using OpenAI's models, which are available via an API. I'm uncertain how to access LLMs such as PaLM-E or BLOOM.

Additionally, there may be a minor issue with the GPT-4 API. OpenAI has given me a preview API with an 8,000-token limit, so for the most part I rely on ChatGPT Pro to use GPT-4. I'm unsure how to call GPT-4 the same way as davinci (GPT-3).
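As far as I know, the difference is that GPT-4 is only exposed through the chat completions endpoint rather than the plain completions endpoint used for davinci, so the prompt has to be wrapped in a list of messages. A minimal sketch (assuming the legacy openai Python SDK; generate_text_chat and build_messages are hypothetical helper names):

```python
def build_messages(prompt):
    """Wrap a plain completion-style prompt in the chat message format
    that GPT-4's endpoint expects."""
    return [{"role": "user", "content": prompt}]

def generate_text_chat(prompt, model="gpt-4"):
    # Deferred import so build_messages works even without the SDK installed.
    import openai
    response = openai.ChatCompletion.create(
        model=model,
        messages=build_messages(prompt),
        max_tokens=1024,
        temperature=0.5,
    )
    # Chat responses carry the text under .message.content,
    # not .text as in the completions endpoint.
    return response.choices[0].message.content.strip()
```

This is only a sketch of the API shape, not something I have verified against the preview access.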

Here is an example call to GPT-3:

import openai
openai.api_key = "YOUR_API_KEY"

def generate_text(prompt):
    model_engine = "text-davinci-003"
    response = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        max_tokens=1024,
        n=1,
        stop=None,
        temperature=0.5,
    )
    return response.choices[0].text.strip()

prompt = "Write a short story about a cat and a mouse."
response = generate_text(prompt)
print(response)

system_description and few_shot_examples

I am unsure about the format of the system_description. From my understanding, after the Input Parser parses the user's prompt and a few examples, the Input Spec passes the system_description and few_shot_examples to the Unlabeled Data Generator.

To confirm the format of the system_description and few_shot_examples, let's take the Movie Sentiment Classification Task IMDB (https://huggingface.co/datasets/imdb) as an example. The system_description would be a string that serves as a meta prompt. For instance, it could provide instructions on how to complete the task or offer background information on the data. On the other hand, the few_shot_examples would be a small set of input-output pairs that illustrate the expected behavior of the system. In the context of the IMDB dataset, these examples could consist of a few movie reviews labeled as positive or negative.

system_description = "I'd like to teach my children about movie sentiment classification. I plan to provide them with a set of comments and their associated sentiment, where 0 represents a negative sentiment and 1 represents a positive sentiment. By doing this, I hope to help my children develop their analytical and critical thinking skills while also introducing them to the fascinating world of natural language processing. Can you just give me a relevant example without giving me the label?"

few_shot_examples = ["I would put this at the top of my list of films in the category of unwatchable trash! There are films that are bad, but the worst kind are the ones that are unwatchable but you are suppose to like them because they are supposed to be good for you! The sex sequences, so shocking in its day, couldn't even arouse a rabbit. The so called controversial politics is strictly high school sophomore amateur night Marxism. The film is self-consciously arty in the worst sense of the term. The photography is in a harsh grainy black and white. Some scenes are out of focus or taken from the wrong angle. Even the sound is bad! And some people call this art?<br /><br />. The sentiment result is 0.", 

"Its not the cast. A finer group of actors, you could not find. Its not the setting. The director is in love with New York City, and by the end of the film, so are we all! Woody Allen could not improve upon what Bogdonovich has done here. If you are going to fall in love, or find love, Manhattan is the place to go. No, the problem with the movie is the script. There is none. The actors fall in love at first sight, words are unnecessary. In the director's own experience in Hollywood that is what happens when they go to work on the set. It is reality to him, and his peers, but it is a fantasy to most of us in the real world. So, in the end, the movie is hollow, and shallow, and message-less. The sentiment result is 0.",

"If the crew behind \"Zombie Chronicles\" ever read this, here's some advice guys: <br /><br />1. In a \"Twist Ending\"-type movie, it's not a good idea to insert close-ups of EVERY DEATH IN THE MOVIE in the opening credits. That tends to spoil the twists, y'know...? <br /><br />2. I know you produced this on a shoestring and - to be fair - you worked miracles with your budget but please, hire people who can actually act. Or at least, walk, talk and gesture at the same time. Joe Haggerty, I'm looking at you...<br /><br />3. If you're going to set a part of your movie in the past, only do this if you have the props and costumes of the time.<br /><br />4. Twist endings are supposed to be a surprise. Sure, we don't want twists that make no sense, but signposting the \"reveal\" as soon as you introduce a character? That's not a great idea.<br /><br />Kudos to the guys for trying, but in all honesty, I'd rather they hadn't...<br /><br />Only for zombie completists. The sentiment result is 0."]

# Note that the system_description above is handcrafted by me

And I will use this prompt as the input to the API:

prompt = system_description
if few_shot_examples is not None:
    for example in few_shot_examples:
        prompt += "\n" + example
The assembled prompt is simply the system_description followed by the three few-shot examples above, joined by newlines.
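The concatenation loop can also be written more idiomatically with str.join; build_prompt is just an illustrative name for the same logic:

```python
def build_prompt(system_description, few_shot_examples=None):
    """Assemble the final prompt: the system description followed by
    each few-shot example on its own line."""
    parts = [system_description]
    if few_shot_examples:
        parts.extend(few_shot_examples)
    return "\n".join(parts)
```

Either form produces the same string; join avoids repeated string concatenation.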

I pass it to ChatGPT-3.5:

Sure, here's an example comment without the label:

"The cinematography in this film is absolutely breathtaking. Each shot is like a work of art, and the way the camera captures the scenery is truly remarkable. The soundtrack is also amazing, perfectly complementing the visuals to create an immersive experience. However, the pacing of the film could have been better. Some scenes drag on for too long, and it feels like the story loses momentum at times."

Updates and TODOs

Updates

  • model_config

This is something the user can configure! PaLM-E is out of the question, but we could use GPT-3, GPT-4, or PaLM through paid APIs. We also have an internal API in the LTI for Bloom-175, which we could use for comparison. I think we can just choose one API (e.g., GPT-3.5-Turbo) for our initial open-source release and then experiment with more LLMs in our paper.

  • num_examples, random_seed

    The parameters num_examples and random_seed are optional. Although we need to provide them to the generator, we may consider hard-coding them instead of requiring the user to input them.

  • store_path

    This parameter is optional. By default, we return a datasets.DatasetDict object to the user. However, if the user provides a valid store_path parameter of type str, we will store the resulting DatasetDict object at the specified path.

  • diverse/distinct data points

    How can we generate diverse data points? One possible method is to increase the sampling temperature and generate multiple examples. Another option is to "condition" each batch of generated examples on specific parameters.

    I am also intrigued by this issue. While working on the interface yesterday, I thought about it, and to my surprise, all 13 of the example outputs that ChatGPT generated were distinct from one another.

    Conditional generation has the potential to produce noisier outputs, so it would require a re-labeler LLM to ensure accuracy. For now, simply tuning the temperature may be the preferable alternative. However, a better solution may exist, and I intend to review the five shared papers and examine their code in the hope of finding one.
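A cheap complement to either approach would be to sample freely and filter duplicates after the fact. A minimal sketch (deduplicate is a hypothetical helper, not part of the current codebase):

```python
def deduplicate(examples):
    """Drop repeated generations by comparing a normalized form
    (lowercased, whitespace-collapsed) while preserving order."""
    seen = set()
    unique = []
    for ex in examples:
        key = " ".join(ex.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```

This only catches exact or near-exact repeats; semantic near-duplicates would need something stronger, such as embedding similarity.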

  • response parse

ChatGPT-3.5 responded to me with this:

Sure, here's an example comment without the label:

"The cinematography in this film is absolutely breathtaking. Each shot is like a work of art, and the way the camera captures the scenery is truly remarkable. The soundtrack is also amazing, perfectly complementing the visuals to create an immersive experience. However, the pacing of the film could have been better. Some scenes drag on for too long, and it feels like the story loses momentum at times."

We only need the example without the prefix (i.e., "Sure, here's an example comment without the label:"). Therefore, we may need a post-processing step to remove such prefixes and suffixes. This could be challenging, since the prefixes and suffixes may change over time. One solution is to add the instruction "Please provide your example without any prefix or suffix" to our prompt, but ChatGPT may not strictly follow this instruction and could still give unexpected responses.

Thus our meta prompt should be like this:

system_description = "I'd like to teach my children about movie sentiment classification. I plan to provide them with a set of comments and their associated sentiment, where 0 represents a negative sentiment and 1 represents a positive sentiment. By doing this, I hope to help my children develop their analytical and critical thinking skills while also introducing them to the fascinating world of natural language processing."

# This is the user's input

meta_prompt = "Please only provide your example without anything else."

Continuing with the same example as before, the result is still not ideal.

Example: "The acting in this film is superb. The storyline is engaging and keeps you on the edge of your seat. I would highly recommend this movie to anyone looking for a thrilling experience." The sentiment result is 1.

The result from ChatGPT still contains some prefixes and suffixes. Perhaps we can simply extract the unconditionally generated example with the following:

# Extract the text between the first and last double quote
response[response.find('"') + 1 : response.find('"', response.find('"') + 1)]
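A slightly more defensive variant of that slicing could use a regular expression and fall back to the raw response when no quoted span is present (extract_example is a hypothetical helper name):

```python
import re

def extract_example(response):
    """Pull the quoted example out of a chatty model response.

    Falls back to the stripped response when no quoted span is found,
    so a well-behaved reply is passed through unchanged.
    """
    match = re.search(r'"(.*)"', response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()
```

The bare slicing expression returns an empty or garbled string when the response contains no quotes, whereas this version degrades gracefully.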

TODOs

  • Refactor the existing codebase.
  • Add clear and concise descriptions.
  • Document any pending tasks in the description lines.
  • Search for ways to generate diverse/distinct data points.
  • Search for ways to parse the responses from LLMs.

Reminder

  • PromptSpec

    I am unfamiliar with using PromptSpec, which I believe is a parser designed to extract natural language instructions and few-shot examples from user input.

    To be more specific, system_description describes how the data should be used, while the hard-coded meta_prompt specifies the format of the example, such as "Let's think step by step" or "Please only provide your example without anything else."

    I'm still pondering how we can effectively reconstruct any form of data into a string with few_shot_examples, which is currently a list of strings. For now, I've omitted PromptSpec from our DatasetGenerator and I'm separately passing in the system_description and few_shot_examples.
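One possible way to reconstruct structured data into the list-of-strings format that few_shot_examples currently uses, at least until PromptSpec defines a canonical serialization, might look like this (format_examples and the label suffix are illustrative assumptions, not an agreed interface):

```python
def format_examples(pairs, label_suffix="The sentiment result is {}."):
    """Flatten (text, label) pairs into the plain strings that
    few_shot_examples currently expects, appending the label sentence."""
    return [f"{text} {label_suffix.format(label)}" for text, label in pairs]
```

This matches the "... The sentiment result is 0." pattern used in the IMDB examples above, but the suffix template would need to change per task.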

  • DatasetDict

    I've made some changes to the output of our interface to make it more user-friendly. Instead of a tuple, the output is now a DatasetDict, which is a dictionary with the keys train, validation, and test.

Here is a simple usage:

# Example usage

# Define the system description
system_description = "Generate examples of text summarization for news articles"

# Define the model configuration
model_config = {
    "engine": "davinci",
    "max_tokens": 256,
    "n": 1,
    "stop": "\n",
}

# Define the few-shot examples
few_shot_examples = ["Example 1: Summarize the following news article...", "Example 2: Summarize the following news article..."]

# Create an instance of the DatasetGenerator class
dataset_generator = DatasetGenerator(
    system_description=system_description,
    model_config=model_config,
    few_shot_examples=few_shot_examples,
    output_dir="my_dataset",
    num_train_examples=5000,
    num_val_examples=1500,
    num_test_examples=500,
    random_seed=42,
)

# Generate the dataset
dataset = dataset_generator.generate_datasets()

# Access the generated dataset
train_dataset = dataset["train"]
val_dataset = dataset["validation"]
test_dataset = dataset["test"]
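Internally, generate_datasets might carve the generated examples into the three splits deterministically using random_seed. The following is only a sketch of that idea (split_examples is a hypothetical helper, not the actual implementation):

```python
import random

def split_examples(examples, num_train, num_val, num_test, random_seed=42):
    """Deterministically shuffle and slice generated examples into the
    train/validation/test splits backing the returned DatasetDict."""
    assert num_train + num_val + num_test <= len(examples)
    shuffled = examples[:]  # copy so the caller's list is untouched
    random.Random(random_seed).shuffle(shuffled)
    return {
        "train": shuffled[:num_train],
        "validation": shuffled[num_train:num_train + num_val],
        "test": shuffled[num_train + num_val:num_train + num_val + num_test],
    }
```

Seeding a dedicated random.Random instance keeps the split reproducible without touching the global random state.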