Architecture

Vision Transformer (ViT)

Definition

An adaptation of the transformer architecture to image data that splits an image into fixed-size patches and treats each patch as a token. When pre-trained on large datasets, ViTs match or outperform convolutional networks on many computer vision benchmarks, and they are used in medical imaging, satellite analysis, and industrial quality control.
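The "patches as tokens" idea can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows, and the helper name `image_to_patches` is hypothetical, not taken from any library. A real ViT would additionally pass each flattened patch through a learned linear embedding and add position information, which this sketch omits.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order. Each patch is a flat list of
    patch_size * patch_size pixel values, i.e. one "token" before embedding.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size window into one vector.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 single-channel image with pixel values 0..15 (illustrative data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
```

Here the 4x4 toy image becomes four tokens of four values each. By the same arithmetic, a 224x224 image cut into 16x16 patches would give 14 x 14 = 196 tokens.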

Related Terms

Transformer

The dominant neural network architecture for language, vision, and multimodal AI, introduced in the 2017 paper "Attention Is All You Need." Transformers use self-attention to process all tokens in a sequence in parallel rather than one at a time, which makes training on internet-scale data practical and underlies nearly every major LLM in use today.
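A rough sketch of self-attention may make the mechanism more concrete. This is a simplified single-head version that assumes identity projections (queries, keys, and values are all the raw token vectors). Real transformers use learned projection matrices and multiple heads, and the function names below are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (an assumption).

    Each output row is a weighted average of all token vectors, with weights
    given by the softmax of the scaled dot products against the query token.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Three toy 2-dimensional tokens (illustrative data).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

The loop here runs one query at a time for readability, but each output row depends only on the input tokens and not on any other output row. That independence is why the computation can be expressed as matrix multiplications and run for all tokens in parallel.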

Computer Vision

A field of AI that enables machines to interpret and understand visual information from the world, such as images, video, and sensor feeds. It underpins applications from quality-control cameras on factory floors to facial recognition and autonomous vehicle perception.

Multimodal AI

AI systems that process and reason across multiple types of data at once, such as text, images, audio, and video. Multimodal models enable richer enterprise applications such as document understanding that combines tables, charts, and prose.

Understanding the terminology is only the first step; putting it into practice is the second.

Book a Physical AI fit consultation to explore how these AI concepts translate to your specific industry and business challenges.
