NVIDIA Model Card

Last updated: May 2024

Model Details

  • Model summary: Generative AI by Getty Images generates images consistent with input text prompts. The model was trained on high-resolution images and metadata from Getty Images' vast creative library; all training data is owned or licensed. We block both prompts and generations that would produce visuals that create legal risk or are considered offensive. This model is safe for commercial use. Getty Images represents and warrants that necessary model and property releases have been obtained to avoid infringement of third-party intellectual property rights. The model is a custom architecture and supports images up to 4K resolution using super-resolution techniques. Additionally, the model strives to promote diversity and representation of people through a diverse training dataset and custom model design.
  • Model Name: Generative AI by Getty Images
  • Model release date: September 2023
  • Model version: Getty Images, Edify Image v2.1
  • References: This model is based on large-scale text-to-image diffusion models.
    • [1] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B. and Karras, T., 2022. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324.
    • [2] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T. and Ho, J., 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, pp.36479-36494.
    • [3] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. and Chen, M., 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

Training Details

  • Hardware: NVIDIA A100 GPUs
  • Model Architecture: Based on diffusion architecture
    • Architecture Type: Convolutional Neural Network (CNN)
    • Network Architecture: U-Net-based

Inputs

  • Input Type(s): Text, Image
  • Input Format(s): Raw Text, JPG
  • Input Parameter(s): 1D
  • Other Properties Related to Input: Max 77 text words
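The 77-word limit above can be checked on the client side before a prompt is submitted. The following is a minimal sketch, assuming "words" means whitespace-separated words (the model's own tokenizer may count differently). The names `MAX_PROMPT_WORDS` and `check_prompt_length` are hypothetical and not part of any Getty Images or NVIDIA API.

```python
MAX_PROMPT_WORDS = 77  # limit stated in this model card


def check_prompt_length(prompt: str, max_words: int = MAX_PROMPT_WORDS) -> bool:
    """Return True if the prompt has at most `max_words` whitespace-separated words.

    Assumption: a "word" is any run of non-whitespace characters. The service
    may count prompt length differently.
    """
    return len(prompt.split()) <= max_words


if __name__ == "__main__":
    prompt = "a golden retriever running on a beach at sunset"
    print(check_prompt_length(prompt))
```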

Outputs

  • Output Type(s): Image
  • Output Format: Red, Green, Blue (RGB)
  • Output Parameter(s): 2D
  • Other Properties Related to Output: Configurable output sizes: 1024x1024, 1024x768, 1024x576, 576x1024, 768x1024 for 1K resolution; 4096x4096, 4096x3072, 4096x2304, 2304x4096, 3072x4096 for 4K resolution.
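Each 4K size listed above is exactly four times the corresponding 1K size on each side. The sketch below encodes the listed sizes and that relationship. The 4x factor is an observation drawn from the listed numbers, not a description of how the super-resolution stage works internally, and the function names are hypothetical.

```python
# Output sizes (width, height) as listed in this model card.
SIZES_1K = [(1024, 1024), (1024, 768), (1024, 576), (576, 1024), (768, 1024)]
SIZES_4K = [(4096, 4096), (4096, 3072), (4096, 2304), (2304, 4096), (3072, 4096)]


def is_supported_size(width: int, height: int) -> bool:
    """Return True if (width, height) is one of the listed 1K or 4K sizes."""
    return (width, height) in SIZES_1K or (width, height) in SIZES_4K


def upscaled_size(width: int, height: int, factor: int = 4) -> tuple[int, int]:
    """Map a listed 1K size to the 4K size with the same aspect ratio.

    Assumption: the 4K size is the 1K size scaled by `factor` on each side,
    which holds for every pair listed in the card.
    """
    if (width, height) not in SIZES_1K:
        raise ValueError(f"{width}x{height} is not a listed 1K size")
    return (width * factor, height * factor)


if __name__ == "__main__":
    for w, h in SIZES_1K:
        print(f"{w}x{h} -> {'x'.join(map(str, upscaled_size(w, h)))}")
```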

Software Integration

  • Runtime Engine(s): Not Applicable (N/A)
  • Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere
  • Supported Operating System(s): Linux

Inference

  • Engine: TensorRT, Triton
  • Test Hardware: NVIDIA A100

Model Use

The model is intended for commercially safe, photorealistic image generation for creation and ideation. Users of the model are expected to act responsibly and are subject to the terms and conditions expressed in the Getty Images Site Terms of Use, the Getty Images Content License Agreement, and the applicable AI Image Generation Subscription Agreement, which prohibit illegal and certain other uses.

Data and Performance

  • Dataset: Licensed or owned high-resolution photography, illustrations, and still images from Getty Images' vast Creative Asset library, paired with a detailed visual description for each asset. The descriptions and metadata attributes were curated and crafted by Getty Images photographers and professional content editors. You can review this collection and metadata at gettyimages.com and istock.com.
  • Creator Compensation: Getty Images compensates contributors on an ongoing basis. This includes cases where contributors’ content is used as training data for AI. On an annual recurring basis, we will share in the revenues generated from Generative AI by Getty Images with contributors whose content was used to train the AI Generator, allocating both a pro rata share in respect of every file and a share based on traditional licensing revenue.
  • Quality: The model especially excels at commercially viable content, photorealistic people, and compelling creative concepts.
  • Performance: On average, the model generates 4 images in 11 seconds.
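The performance figure above works out to 2.75 seconds per image. The sketch below shows that arithmetic and a rough estimate for larger requests. It assumes images are produced in sequential batches of 4 at the stated average; the card does not say how requests are actually scheduled, and the function names are hypothetical.

```python
import math

BATCH_SECONDS = 11.0  # average time stated in this model card
BATCH_IMAGES = 4      # images produced in that time


def seconds_per_image(batch_seconds: float = BATCH_SECONDS,
                      batch_images: int = BATCH_IMAGES) -> float:
    """Average time per image implied by the stated figure."""
    return batch_seconds / batch_images


def estimated_seconds(num_images: int,
                      batch_seconds: float = BATCH_SECONDS,
                      batch_images: int = BATCH_IMAGES) -> float:
    """Rough total time for `num_images`.

    Assumption: images come in sequential batches of `batch_images`, and a
    partial batch takes as long as a full one.
    """
    batches = math.ceil(num_images / batch_images)
    return batches * batch_seconds


if __name__ == "__main__":
    print(seconds_per_image())
    print(estimated_seconds(10))
```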

Limitations

  1. People and object deformations: While the model addresses common issues in generative models, such as malformed limbs and hands and disproportionate object sizes, through careful design choices and custom loss functions, it can still occasionally produce images with malformed or disfigured human parts or objects.
  2. Offensive: The model might create unrealistic and potentially offensive representations of humans by merging independent features learned during training. We attempt to block many of these instances through prompt blocking and output blocking.
  3. Bias: While the model implements measures to generate more diverse representations of humans, the training dataset has some imbalances in the distribution of human attributes such as gender and ethnicity in relation to occupational roles, so outputs can be biased with respect to these attributes. Our custom prompting and custom model design aim to combat these biases, but they may still occasionally arise.
  4. Not safe for work: The model is supplemented by a language model which analyzes and filters text prompts, and an image filter that screens for inappropriate outputs. However, both these models can mistakenly filter “safe” prompts and images and may fail to filter unsafe prompts or images. This can arise from expertly designed adversarial input prompts or inherent limitations within the models.
  5. Contemporary: The training data covers up to May 2023 and only includes descriptions in English.
  6. Text: The model does not perform well at generating text in outputs.
  7. Prompt adherence: The model is weaker at generating outputs that adhere to the intent of long prompts with highly detailed and precise attribute descriptions of subjects.
  8. Fantastical and cinematic illustrations: The model’s strength is photorealism. As such, it does not perform well on fantastical or cinematic illustrative styles.

Please send model questions and comments to api@gettyimages.com or https://www.nvidia.com/en-us/support/submit-security-vulnerability/