Categories of generative AI models
Generative AI models are usually categorized by output modality and architecture. Large language models (LLMs) generate text by repeatedly predicting a probability distribution over the next token given the preceding context and selecting a token from it; they handle question answering, text generation, coding, reasoning, and conversation. Diffusion models generate images, audio, or video by learning to reverse a gradual noise-addition process; they are currently the dominant approach for image synthesis. Multimodal models accept or produce more than one modality (text, images, and sometimes audio) in a single model, enabling tasks like image captioning, visual question answering, and image-guided generation. Each category has different training requirements, inference characteristics, and failure modes.
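The next-token loop described for LLMs can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `LOGITS` table is a hypothetical stand-in for a trained network and conditions only on the previous token, whereas a real LLM computes its scores from the whole preceding context. Greedy decoding (always taking the most probable token) is used for simplicity; sampling from the distribution is a common alternative.

```python
import math

# Hypothetical toy "model": scores (logits) for the next token given only
# the previous token. A real LLM derives these from the full context.
LOGITS = {
    "the": {"cat": 2.0, "dog": 1.0, "sat": -1.0},
    "cat": {"sat": 3.0, "the": 0.0, "dog": -2.0},
    "sat": {"the": 1.5, "cat": 0.0, "dog": 0.0},
    "dog": {"sat": 2.5, "the": 0.5, "cat": -1.0},
}


def softmax(scores):
    """Turn raw scores into a probability distribution over tokens."""
    highest = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def generate(start, steps):
    """Greedy decoding: append the most probable next token at each step."""
    tokens = [start]
    for _ in range(steps):
        probs = softmax(LOGITS[tokens[-1]])
        tokens.append(max(probs, key=probs.get))
    return tokens


print(generate("the", 3))
```

Each generated token becomes part of the context for the next prediction, which is why an early mistake can carry through the rest of the output.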
How they differ from discriminative models
Discriminative models classify or predict: given an input, which category does it belong to, or what value should be predicted? Generative models produce: given a conditioning input, what new content should be generated? The distinction affects how they are evaluated and deployed. Discriminative models are evaluated against labeled test sets with clear right or wrong answers. Generative model outputs are harder to score: an image may be free of artifacts but aesthetically poor, and a text response may be fluent but factually wrong. As a result, quality assurance for generative AI is more complex than for comparable discriminative systems and often has to combine several criteria rather than rely on a single accuracy score.
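The evaluation contrast can be sketched in a few lines. The function names and the specific checks below are hypothetical and deliberately crude: `accuracy` scores a classifier against labels, while `evaluate_generation` reports separate surface-form and factual-coverage signals because no single number captures the quality of generated text. Real generative evaluation usually relies on richer methods such as human review or model-based grading.

```python
def accuracy(predictions, labels):
    """Discriminative evaluation: fraction of predictions matching the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate_generation(text, required_facts):
    """Hypothetical multi-criterion check for a generated answer.

    Returns separate signals instead of one score: a surface-form check
    and the share of required facts (plain substrings) found in the text.
    """
    stripped = text.strip()
    found = sum(fact.lower() in stripped.lower() for fact in required_facts)
    return {
        "well_formed": bool(stripped) and stripped.endswith((".", "!", "?")),
        "fact_coverage": found / max(len(required_facts), 1),
    }


# A classifier is simply right or wrong on each labeled example.
print(accuracy(["spam", "ham", "spam"], ["spam", "ham", "ham"]))

# A generated sentence can pass the form check while failing on facts.
print(evaluate_generation("The Eiffel Tower is in Berlin.", ["Paris"]))
```

The second call shows the case from the text: the sentence is well formed, yet its fact coverage is zero.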
Foundation models and fine-tuning
Foundation models are large generative models pre-trained on broad datasets that can be adapted to specific tasks through fine-tuning, prompting, or retrieval augmentation. The pre-training investment is enormous but is amortized across many downstream applications, so most organizations adapt an existing foundation model rather than training one from scratch. Fine-tuning continues training a foundation model on task-specific data to improve performance, adjust style, or embed domain knowledge. Retrieval augmentation adds information retrieved from an external knowledge base to the model's input at inference time, which reduces hallucination on knowledge-intensive tasks without retraining.
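A minimal sketch of retrieval augmentation, under stated assumptions: the knowledge base is a hypothetical list of short strings, retrieval is plain word overlap rather than the embedding search most production systems use, and the model call itself is left out. The sketch only shows how retrieved text is placed into the prompt at inference time.

```python
# Hypothetical knowledge base; in practice this would be a document store.
KNOWLEDGE_BASE = [
    "The warranty period for the X100 is 24 months.",
    "Support is available on weekdays from 9 to 17.",
    "The X100 ships with a USB-C cable.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of words, dropping basic punctuation."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())


def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


print(build_prompt("How long is the warranty for the X100?", KNOWLEDGE_BASE))
```

Because the knowledge lives outside the model, updating an entry in the knowledge base changes what the model sees on the next request, with no retraining involved.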