A common feature in T2I generation is to skip the last layer (the final transformer block) of the CLIP text-encoding model and use the penultimate layer's output instead.
This "distorts" the text encoding slightly, which SD users have discovered works to their benefit when prompting with common English words like "banana", "car", "anime", "woman", "tree", etc.
Being able to select between a CLIP skip 2 text encoder and the default text encoder would be an appreciated feature for perchance users.
For exotic tokens like emojis, or other tokens with high IDs in vocab.json, the unmodified CLIP configuration (CLIP skip 1) is far superior.
But for "boring normal English word" prompts, CLIP skip 2 will often improve the output.
This issue shows how one can load an SD1.5 CLIP text encoder configured for CLIP skip 2:
https://github.com/huggingface/diffusers/issues/3212
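A minimal sketch of that approach with diffusers and transformers, assuming an SD1.5 checkpoint id (shown here as "runwayml/stable-diffusion-v1-5" as an example) and the trick from the issue above, i.e. dropping the last of the text encoder's 12 hidden layers, which is roughly what "CLIP skip 2" means in A1111 terms:

```python
import torch
from transformers import CLIPTextModel
from diffusers import StableDiffusionPipeline

# SD1.5's CLIP ViT-L text encoder has 12 transformer layers.
# Loading it with num_hidden_layers=11 drops the final layer,
# so the penultimate layer's output is what reaches the UNet ("CLIP skip 2").
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id, swap for whichever SD1.5 checkpoint is used
    subfolder="text_encoder",
    num_hidden_layers=11,
    torch_dtype=torch.float16,
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a woman standing under a tree, anime style").images[0]
image.save("clip_skip_2.png")
```

(Recent diffusers releases also expose a `clip_skip` argument on the pipeline call itself, which avoids reloading the text encoder, but the snippet above follows the linked issue.)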
//—//
Sidenote: Personally, I'd love to see the
text prompt -> tokenizer -> embedding -> text-encoding -> image generation
pipeline split into separate modules on perchance.
So instead of sending text to the perchance server, the user could send an embedding (many are available for download online), a text+embedding mix, or a text encoding configured for either CLIP skip 1 or 2, and get an image back.
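For reference, diffusers already supports something like this on the client side via the `prompt_embeds` argument; a minimal sketch, assuming a StableDiffusionPipeline (`pipe`) is already loaded as in the earlier snippet:

```python
import torch

prompt = "a car parked under a banana tree"

# Tokenize and encode the prompt ourselves instead of letting the pipeline do it.
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids.to(pipe.device)

with torch.no_grad():
    prompt_embeds = pipe.text_encoder(tokens)[0]  # shape (1, 77, 768) for SD1.5

# The pipeline accepts the precomputed encoding directly,
# bypassing its own tokenizer/text-encoder step.
image = pipe(prompt_embeds=prompt_embeds).images[0]
```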
The CLIP model is unique in that it can create both text and image encodings in the same space. By checking the cosine similarity between text and image encodings, you can derive a text prompt for any given input image that, when prompted, will generate "that kind of image".
Note that in either of these cases there won't be a human-readable text prompt attached to the image; the pipeline is a "one-way process".
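To illustrate the text/image-encoding side of CLIP, here is a minimal sketch of scoring candidate prompt phrases against an image by cosine similarity, assuming the openai/clip-vit-large-patch14 checkpoint (the model SD1.5's text encoder comes from) and a hypothetical input.png:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("input.png")  # hypothetical input image
candidates = ["a banana", "a car", "an anime woman", "a tree"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image
# encoding and each candidate text encoding.
scores = outputs.logits_per_image.softmax(dim=-1)[0]
print(candidates[scores.argmax().item()])  # the phrase that best matches the image
```

A CLIP-interrogator style tool repeats this over a large vocabulary of phrases and stitches the best matches into a prompt.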
//—//
The main thing to consider here is adding a CLIP skip 2 option, as I think a lot of the "standard" text-to-image generators on perchance would benefit from having it.