Conversation

Contributor

@fdubut fdubut commented Jan 22, 2026

Description

This PR adds support for image editing/remixing in OpenAIImageTarget, specifically the ability to pass one text prompt (always required) along with 1-16 images. It also fixes a handful of minor inconsistencies in the class parameters (different OpenAI models support different values).
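For illustration, a rough usage sketch; only OpenAIImageTarget, Message, and send_prompt_async appear in this PR, while the piece construction (MessagePiece, message_pieces, role) is an assumption made for the sake of the example:

# Hypothetical sketch; the exact message/piece construction in PyRIT may differ.
image_target = OpenAIImageTarget()  # endpoint, model, and key assumed to come from the environment

message = Message(
    message_pieces=[
        MessagePiece(role="user", original_value="Blend these photos into one scene", data_type="text"),
        MessagePiece(role="user", original_value="photo_1.png", data_type="image_path"),
        MessagePiece(role="user", original_value="photo_2.png", data_type="image_path"),
    ]
)
response = await image_target.send_prompt_async(message=message)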

Tests and Documentation

  • Updated unit tests to cover image editing/remixing and the new input type requirements.
  • Added an example of image editing/remixing to the target documentation.
  • Split up the multi-modal target notebook into four notebooks (one per output modality).

doc/_toc.yml Outdated
- file: code/targets/5.4_openai_tts_target
- file: code/targets/6_rate_limiting
- file: code/targets/7_http_target
- file: code/targets/8_openai_responses_target
Contributor

Wouldn't this also belong in the same category? At that point almost everything is in there, so we might as well not have an additional level 🙂

Contributor Author

Fair point. Yeah, maybe we can flatten the hierarchy and have 1. chat target, 2. responses target, 3. image target, 4. video target, 5. TTS target (and then renumber the other targets). Alternatively, since the multi-level sectioning works, we could have one group for all OpenAI targets with a subpage for each. Which do you prefer? I'll update the TOC accordingly.

Contributor

I'm a fan of a flat structure 🙂 Opinions may diverge here, though.

@rlundeen2 , thoughts?

Contributor Author

I flattened it in the upcoming commit; we can always add a small section later if needed.

Contributor

@romanlutz romanlutz left a comment

This is really good! Let us know what you think about the comments 🙂

image_files = []
for img_path in image_paths:
    try:
        image_files.append(open(img_path, "rb"))
Contributor

I'm unable to look it up right now, but we might have images stored in the cloud that won't be loadable this way (I think). Roman TODO: look this up

Contributor

Looked it up. The normalizer takes care of this so you should be using the serializer.

img_serializer = data_serializer_factory(
    category="prompt-memory-entries",  # folder name
    value=image_path,  # local or remote image path
    data_type="image_path"
)
image_base64 = await img_serializer.read_data_base64()

Or, if you need bytes, just use read_data():
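image_bytes = await img_serializer.read_data()  # raw bytes instead of base64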

Let me know if this works for you!

Contributor Author

Nice! I updated the code. Looks like the OpenAI API expects a tuple with file name and MIME type instead of just bytes, so I had to add a couple of extra lines. For the file name, I made a conscious choice to just generate a UUID, for simplicity and privacy (instead of sending the full image path). We can revise if needed.
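Roughly along these lines (a simplified sketch rather than the exact diff; the .png extension and MIME type are assumed here for illustration):

import uuid

# The OpenAI SDK accepts (filename, file_bytes, mime_type) tuples for image uploads;
# a random UUID as the file name keeps the local path out of the request.
image_bytes = await img_serializer.read_data()
image_files.append((f"{uuid.uuid4()}.png", image_bytes, "image/png"))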

            return [response]
        return response

    async def _send_edit_request_async(self, message: Message) -> Message:
Contributor

I'm wondering if this is meant to be a different target (with just one method being different). It is a different API, after all.

My worry: what if there's a different image API tomorrow that takes text + image + image (let's say to remix image 1 together with image 2 based on text)? Then we would not be able to accommodate that well.

Option: name it OpenAIImageEditTarget (or Remix? Whatever the API is called is best).

Contributor Author

From a technical point of view these are two endpoints of the image API, but from a user point of view they are conceptually very close: you are generating an image based on inputs, which can be text only or text + image (similar to the single chat target that takes either). In my own red teaming, I like being able to switch seamlessly from one to the other based purely on the input, rather than creating two different targets.

Also, I'm not sure I fully understand the second point. The updated target already supports text + multiple (up to 16) images 😊

Contributor

Right, I meant something different (which is the point, I suppose): the two images are combined in the way the text outlines. For example, "take the house from image 1, and the tree from image 2, ..."

If more image APIs pop up, this could come back to bite us. Tagging @rlundeen2 to determine if I'm just being paranoid 😆

Contributor Author

I am wondering if the OpenAI API supports such a scenario by referring to the file names instead of an ordinal... I'm not sure; I'll try it on Monday.
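Something like this is what I have in mind (completely untested; the file names and prompt below are hypothetical):

# Untested idea: give each upload a meaningful file name and reference it by name in the prompt.
house_bytes = open("house.png", "rb").read()
tree_bytes = open("tree.png", "rb").read()
image_files = [
    ("house.png", house_bytes, "image/png"),
    ("tree.png", tree_bytes, "image/png"),
]
prompt = "Take the house from house.png and put it next to the tree from tree.png."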

    await image_target.send_prompt_async(message=request)
finally:
    if os.path.isfile(image_piece.original_value):
        os.remove(image_piece.original_value)
Contributor

We also need integration tests, ideally for the new functionality (both Entra and API key based) but also for the various models. I can help identify what we have so far.

Contributor Author

That would be great! I have never touched the integration tests...
