Architecture

Vision Transformer (ViT)

Definition

An adaptation of the transformer architecture to image data that splits an image into fixed-size patches and treats each patch as a token. When pretrained on large datasets, ViTs match or outperform convolutional networks on many computer vision benchmarks, and they are used in medical imaging, satellite analysis, and industrial quality control.
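The "patches as tokens" idea in the definition above can be illustrated with a minimal NumPy sketch. The function name `image_to_patch_tokens`, the 224x224 input, and the 16-pixel patch size are assumptions chosen for illustration. A real ViT would also apply a learned linear projection and add position embeddings, both omitted here.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Separate each spatial axis into (patch index, offset inside the patch).
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two patch indices to the front: (rows, cols, patch, patch, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each row a flattened vector of pixel values.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

# Illustrative input: a synthetic 224x224 RGB image (assumed size).
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patch_tokens(image, patch_size=16)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

Each row of `tokens` plays the role that a word embedding plays in a language transformer, which is why the rest of the architecture can stay largely unchanged.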

Related Terms

Transformer

The dominant neural network architecture for language, vision, and multimodal AI, introduced in the 2017 paper "Attention Is All You Need". Transformers use self-attention to relate every token to every other token and process them in parallel, which makes training on internet-scale data practical. They power nearly every major LLM in use today.
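As a rough illustration of the self-attention step mentioned above, the following sketch computes single-head scaled dot-product attention in NumPy. The function name `self_attention`, the toy dimensions, and the random weight matrices are assumptions for illustration only. Production transformers use multiple heads, an output projection, and trained weights.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, d_model) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token scores every other token in one matrix product.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over each row of scores.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of all value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
num_tokens, d_model, d_head = 4, 8, 8  # toy sizes, chosen for illustration
x = rng.normal(size=(num_tokens, d_model))
# Random placeholders standing in for learned projection matrices.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)   # (4, 8)
print(weights.shape)  # (4, 4)
```

Because the scores for all token pairs come from a single matrix multiplication, no token has to wait for the previous one to be processed, which is the parallelism the definition refers to.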

Computer Vision

A field of AI that enables machines to interpret and understand visual information from the world—images, video, and sensor feeds. It underpins applications from quality-control cameras on factory floors to facial recognition and autonomous vehicle perception.

Multimodal AI

AI systems that process and reason across multiple types of data simultaneously—text, images, audio, and video. Multimodal models enable richer enterprise applications such as document understanding that combines tables, charts, and prose.

Need help understanding AI?

Book a Physical AI fit call to discuss how these AI concepts apply to your industry and challenges.
