extractTextEmbeddings

Extract text embeddings from search text using CLIP network text encoder

Since R2026a

    Description

    Add-On Required: This feature requires the Computer Vision Toolbox Model for OpenAI CLIP Network add-on.

    textEmbeddings = extractTextEmbeddings(clip,text) extracts text embeddings from the search text text using the text encoder of a Contrastive Language-Image Pre-Training (CLIP) network, clip, by running a forward pass on the neural network.

    Note

    This functionality requires Deep Learning Toolbox™.

    textEmbeddings = extractTextEmbeddings(clip,text,Name=Value) specifies options using one or more name-value arguments. For example, MiniBatchSize=32 limits the batch size to 32 text strings.

    Examples

    Create a pretrained CLIP network with a ViT-B/16 backbone.

    clip = clipNetwork("vit-b-16");

    Define a text search term.

    search = "A photo of a children's book.";

    Extract the text embeddings for the search term using the CLIP network encoder.

    textEmbeddings = extractTextEmbeddings(clip,search);

    Display the size of the text embeddings.

    size(textEmbeddings)
    ans = 1×2
    
       512     1
    
    

    Input Arguments

    clip

    CLIP network, specified as a clipNetwork object.

    text

    Input text, specified as a B-element string array or a datastore containing B strings. B is the number of strings in the batch. You must specify the text in English using ASCII characters. The function automatically pads or truncates each text input so that it contains exactly 77 tokens.
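As a sketch of batched input, assuming the CLIP add-on and Deep Learning Toolbox are installed, a three-element string array produces one embedding column per string:

```matlab
% Assumes the Computer Vision Toolbox Model for OpenAI CLIP Network
% add-on is installed.
clip = clipNetwork("vit-b-16");

% A batch of B = 3 search strings (English, ASCII).
text = ["a photo of a dog", ...
        "a photo of a cat", ...
        "a watercolor painting of a city"];

% Each string is padded or truncated to 77 tokens internally.
textEmbeddings = extractTextEmbeddings(clip,text);
size(textEmbeddings)   % 512-by-3 for the ViT-B/16 backbone
```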

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: extractTextEmbeddings(clip,text,MiniBatchSize=32) limits the batch size to 32 text strings.

    MiniBatchSize

    Number of strings in each batch, specified as a positive integer. Larger batch sizes reduce processing time, but require more memory.
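For a large batch, a smaller MiniBatchSize bounds memory use at the cost of more forward passes; a sketch with a hypothetical set of 1000 captions:

```matlab
clip = clipNetwork("vit-b-16");

% Hypothetical batch of 1000 caption strings.
captions = "a photo of object " + string(1:1000);

% Process 32 strings per forward pass to limit memory use.
textEmbeddings = extractTextEmbeddings(clip,captions,MiniBatchSize=32);
```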

    Hardware resource on which to run the network, specified as "auto", "gpu", or "cpu". The table shows the valid hardware resource values.

    Resource   Action
    "auto"     Use a GPU if it is available. Otherwise, use the CPU.
    "gpu"      Use the GPU. To use a GPU, you must have Parallel Computing Toolbox™ and a CUDA®-enabled NVIDIA® GPU. If a suitable GPU is not available, the function returns an error. For information about the supported compute capabilities, see GPU Computing Requirements (Parallel Computing Toolbox).
    "cpu"      Use the CPU.

    Output Arguments

    textEmbeddings

    Text embeddings extracted from the CLIP model encoder, returned as a 512-by-B or 768-by-B matrix, depending on the value of the backbone argument of the clipNetwork object.

    Image Encoder Backbone (backbone Value)   Text Embeddings Format
    "vit-b-16"                                512-by-B matrix
    "vit-l-14" or "resnet50"                  768-by-B matrix
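A typical downstream use of the returned columns is comparing search terms by cosine similarity. A minimal sketch, assuming the ViT-B/16 backbone (512-by-B output):

```matlab
clip = clipNetwork("vit-b-16");
emb = extractTextEmbeddings(clip, ...
    ["a photo of a dog","a photo of a puppy"]);

% Normalize each 512-by-1 column, then compare with a dot product.
emb = emb ./ vecnorm(emb);
similarity = emb(:,1)' * emb(:,2);   % scalar in [-1, 1]
```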

    Version History

    Introduced in R2026a