Architecture

Vision Transformer (ViT)

Definition

An adaptation of the transformer architecture to image data that splits each image into fixed-size patches and treats every patch as a token. When pretrained on large datasets, ViTs match or outperform convolutional networks on many computer vision benchmarks, and they are used in medical imaging, satellite analysis, and industrial quality control.
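To make "patches as tokens" concrete, here is a minimal NumPy sketch of the patch-splitting step only. The function name `image_to_patch_tokens`, the 224x224 input size, and the 16-pixel patch size are illustrative assumptions, not part of the definition above; a full ViT would also apply a learned linear projection and add position embeddings, which are omitted here.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per patch token.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Separate each spatial axis into (grid index, offset within the patch).
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # Bring the two grid axes to the front so each patch is contiguous.
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a token sequence and each patch into a vector.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

# Assumed example input: a 224x224 RGB image filled with distinct values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patch_tokens(image, patch_size=16)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, 16*16*3 values each
```

Each row of `tokens` plays the same role that a word embedding plays in a language model: the transformer receives a sequence of 196 vectors rather than a 2D grid of pixels.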

Related Terms

Transformer

The dominant neural network architecture for language, vision, and multimodal AI, introduced in the 2017 paper "Attention Is All You Need." Transformers use self-attention to process all tokens in a sequence in parallel, which makes training on internet-scale data practical. Nearly every major LLM in use today is built on this architecture.
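The self-attention step mentioned above can be sketched in a few lines of NumPy. This is a single-head, unmasked version under simplifying assumptions: the function name `self_attention`, the weight matrices `w_q`, `w_k`, `w_v`, and the toy sizes are all hypothetical, and the weights are random rather than learned.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x has shape (num_tokens, d_model); each weight matrix has shape
    (d_model, d_head). Returns the attended output and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token scores every other token in a single matrix product,
    # which is what lets the whole sequence be processed in parallel.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over each row of scores.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Assumed toy sizes: 4 tokens, model width 8, head width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```

Each row of `weights` sums to 1 and describes how much one token draws on every token in the sequence, itself included.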

Computer Vision

A field of AI that enables machines to interpret visual information such as images, video, and sensor feeds. It underpins applications ranging from quality-control cameras on factory floors to facial recognition and autonomous vehicle perception.

Multimodal AI

AI systems that process and reason across several types of data at once, such as text, images, audio, and video. Multimodal models enable richer enterprise applications, for example document understanding that combines tables, charts, and prose.
