Enhancement: Add convenience token-counting functions to this package #250

pamelafox · 2024-02-07T18:39:43Z

We have implemented a lot of logic around token counting for ChatCompletion requests, and it feels like the logic should go in a separate package. I'm wondering if tiktoken would be an appropriate spot, given the logic all depends on tiktoken?

Specifically, I'm thinking of this sort of code, which is based off cookbooks:

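import base64
import math
import re
from collections.abc import Mapping
from io import BytesIO

import tiktoken
from PIL import Image  # Pillow, used by get_image_dims below


def num_tokens_from_messages(message: Mapping[str, object], model: str) -> int: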
    """
    Calculate the number of tokens required to encode a message.
    Args:
        message (Mapping): The message to encode, in a dictionary-like object.
        model (str): The name of the model to use for encoding.
    Returns:
        int: The total number of tokens required to encode the message.
    Example:
        message = {'role': 'user', 'content': 'Hello, how are you?'}
        model = 'gpt-3.5-turbo'
        num_tokens_from_messages(message, model)
        output: 11
    """

    encoding = tiktoken.encoding_for_model(get_oai_chatmodel_tiktok(model))
    num_tokens = 2  # For "role" and "content" keys
    for value in message.values():
        if isinstance(value, list):
            for item in value:
                num_tokens += len(encoding.encode(item["type"]))
                if item["type"] == "text":
                    num_tokens += len(encoding.encode(item["text"]))
                elif item["type"] == "image_url":
                    # "detail" is optional in the request, so fall back to "auto" when it is absent
                    num_tokens += calculate_image_token_cost(item["image_url"]["url"], item["image_url"].get("detail", "auto"))

        elif isinstance(value, str):
            num_tokens += len(encoding.encode(value))
        else:
            raise ValueError(f"Could not encode unsupported message value type: {type(value)}")
    return num_tokens



def get_image_dims(image):
    if re.match(r"data:image\/\w+;base64", image):
        image = re.sub(r"data:image\/\w+;base64,", "", image)
        image = Image.open(BytesIO(base64.b64decode(image)))
        return image.size
    else:
        raise ValueError("Image must be a base64 string.")


def calculate_image_token_cost(image, detail="auto"):
    # Constants
    LOW_DETAIL_COST = 85
    HIGH_DETAIL_COST_PER_TILE = 170
    ADDITIONAL_COST = 85

    if detail == "auto":
        # assume high detail for now
        detail = "high"

    if detail == "low":
        # Low detail images have a fixed cost
        return LOW_DETAIL_COST
    elif detail == "high":
        # Calculate token cost for high detail images
        width, height = get_image_dims(image)
        # Check if resizing is needed to fit within a 2048 x 2048 square
        if max(width, height) > 2048:
            # Resize the image to fit within a 2048 x 2048 square
            ratio = 2048 / max(width, height)
            width = int(width * ratio)
            height = int(height * ratio)
        # Further scale down to 768px on the shortest side
        if min(width, height) > 768:
            ratio = 768 / min(width, height)
            width = int(width * ratio)
            height = int(height * ratio)
        # Calculate the number of 512px squares
        num_squares = math.ceil(width / 512) * math.ceil(height / 512)
        # Calculate the total token cost
        total_cost = num_squares * HIGH_DETAIL_COST_PER_TILE + ADDITIONAL_COST
        return total_cost
    else:
        # Invalid detail option
        raise ValueError("Invalid value for detail parameter. Use 'low', 'high', or 'auto'.")

We also have full tests for that code.
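To make the high-detail arithmetic above easier to check by hand, here is a minimal sketch that takes pixel dimensions directly instead of decoding an image. The name `image_token_cost_from_dims` is hypothetical (it is not part of tiktoken or of the code above), and the constants are simply copied from `calculate_image_token_cost`.

```python
import math

# Constants copied from calculate_image_token_cost above
LOW_DETAIL_COST = 85
HIGH_DETAIL_COST_PER_TILE = 170
ADDITIONAL_COST = 85


def image_token_cost_from_dims(width: int, height: int, detail: str = "high") -> int:
    """Hypothetical helper: same arithmetic as above, but from raw pixel dimensions."""
    if detail == "low":
        # Low detail images have a fixed cost
        return LOW_DETAIL_COST
    if detail not in ("high", "auto"):
        raise ValueError("Invalid value for detail parameter. Use 'low', 'high', or 'auto'.")

    # Step 1: shrink to fit within a 2048 x 2048 square
    longest = max(width, height)
    if longest > 2048:
        ratio = 2048 / longest
        width, height = int(width * ratio), int(height * ratio)

    # Step 2: shrink so the shortest side is at most 768px
    shortest = min(width, height)
    if shortest > 768:
        ratio = 768 / shortest
        width, height = int(width * ratio), int(height * ratio)

    # Step 3: count 512px tiles and add the fixed overhead
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * HIGH_DETAIL_COST_PER_TILE + ADDITIONAL_COST


print(image_token_cost_from_dims(1024, 1024))  # 768 x 768 after scaling, 4 tiles -> 765
print(image_token_cost_from_dims(2048, 4096))  # 768 x 1536 after scaling, 6 tiles -> 1105
```

Like the original, this treats "auto" as "high", which is an assumption rather than documented behavior.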

Would that be appropriate for tiktoken, or is it already in a separate package? It seems like it'd be helpful to be packaged up for easier community re-use. Thanks!


kartikagrawal2503 · 2024-02-11T06:22:32Z

It would be great to have a cost calculator, or at least a token calculator, within tiktoken for prompts as well as for messages in ChatCompletion.

stephenasuncionDEV · 2024-03-18T18:54:21Z

I agree, it seems like it's constantly changing.

pamelafox · 2024-03-24T21:23:51Z

@stephenasuncionDEV I'm curious, have you seen a change in the logic needed for the calculation I put above? Just want to make sure I didn't miss an announcement.

pamelafox · 2024-04-06T22:32:07Z

For now, since I have a need to use this functionality across multiple projects, I've put it in a small package:
https://github.com/pamelafox/openai-messages-token-helper

For example:

from openai_messages_token_helper import count_tokens_for_image

image = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEA..."
num_tokens = count_tokens_for_image(image)

I'll happily move to tiktoken or openai if the functionality gets moved to one of those packages, though.
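Since the image helpers in this thread only accept a base64 data URI, here is a small sketch of how raw image bytes could be turned into one and back using only the standard library. `to_data_uri` and `from_data_uri` are hypothetical names for illustration; they are not part of tiktoken or openai-messages-token-helper.

```python
import base64
import re

DATA_URI_PREFIX = re.compile(r"^data:image/\w+;base64,")


def to_data_uri(image_bytes: bytes, subtype: str = "png") -> str:
    """Encode raw image bytes as a data URI of the form data:image/<subtype>;base64,<payload>."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{subtype};base64,{encoded}"


def from_data_uri(data_uri: str) -> bytes:
    """Strip the data URI prefix and decode the base64 payload back to bytes."""
    if not DATA_URI_PREFIX.match(data_uri):
        raise ValueError("Image must be a base64 data URI.")
    return base64.b64decode(DATA_URI_PREFIX.sub("", data_uri))


# The 8-byte PNG signature; a real image would come from reading a file in binary mode
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)  # data:image/png;base64,iVBORw0KGgo=
```

The resulting string has the same shape as the `image` value in the example above, so it could be passed to a function that expects that format.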
