Deep Learning ASIC Accelerator
What is a deep learning accelerator?
A deep learning accelerator is an AI computing device, and accelerators broadly fall into three categories: GPUs, FPGAs, and ASICs. An accelerator is an auxiliary computing device that complements the CPU, the main computing device in a computer system. Because the CPU alone is limited in handling high-end graphics, the GPU emerged as a graphics accelerator. As explained in the previous article, graphics processing and deep learning have something in common: both are dominated by large numbers of simple operations (mostly multiplications and additions) executed in parallel. This is why the GPU is also used as a deep learning accelerator. An FPGA (field-programmable gate array) is, as the name suggests, a programmable chip. The flexibility of reconfiguring the chip to match rapidly evolving deep learning models is an advantage, but it comes at the cost of lower efficiency. An ASIC (application-specific integrated circuit), the main topic below, is a processor specialized for deep learning operations. Because ASICs are specialized for their target application, they are more efficient than GPUs and FPGAs but less flexible. In this post, we will look at the structure and features of deep learning ASICs in detail, focusing on the Google TPU.
Google TPU
Google's first-generation TPU, an ASIC developed for inference, was introduced in 2016, and the second and third generations have been released since then. In particular, a paper [1] describing the first-generation architecture in detail was published, which makes it easier to analyze the characteristics of a deep learning ASIC. To understand how the TPU is optimized for AI computation, we first need to look at deep learning computation itself. Take the CNN (convolutional neural network), the algorithm mainly used for object recognition. The main stages of a CNN are convolution, activation, and pooling. Convolution extracts the features of an image by sliding filters over it; most image recognition models based on CNNs carry this filter information as their weights. After the convolution, the result goes through an activation that determines whether the data is passed on to the next stage. This structure mimics how information is transmitted in our brain, where a neuron decides whether or not to forward a signal to the next level. The last, optional stage is pooling. Pooling keeps only the largest (or smallest) pixel value among neighboring pixels, which can be regarded as a way of compressing information; in the picture below, only the largest value in each color-coded area is kept. Looking at the Google TPU block diagram with this in mind, you can see that it contains a matrix multiplication unit in charge of convolution, an activation unit, and pooling and normalization units. In other words, the Google TPU is designed around the stages of the deep learning pipeline itself.
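To make the three stages above concrete, here is a minimal NumPy sketch of a single CNN layer: a convolution, a ReLU-style activation, and max pooling. The shapes and filter values are purely illustrative assumptions and do not correspond to any real model or to the TPU's internal implementation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a filter over the image and sum element-wise products (valid padding, stride 1)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation: pass positive values on, block the rest."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Keep only the largest value in each size x size window (information compression)."""
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.max(x[i * size:(i + 1) * size, j * size:(j + 1) * size])
    return out

image = np.random.rand(8, 8)    # toy input "image"
kernel = np.random.rand(3, 3)   # toy convolution filter (the learned weights)
feature_map = max_pool(relu(conv2d(image, kernel)))
print(feature_map.shape)        # (3, 3)
```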
Another characteristic of the TPU is its systolic-array matrix multiplication unit, which increases the data reuse rate. "Systolic" originally refers to the contraction of the heart that pumps blood through the body; for a processor, it means that data flows rhythmically through the processing elements. The data reuse rate is how many times a piece of data fetched from memory is used in computation, and for deep learning processors it is critical: the processor stalls during the latency of a DRAM access, reducing the amount of work done in a given time, and fetching data from DRAM also costs a great deal of energy. Designers of deep learning accelerators therefore aim to reuse each piece of data brought in from DRAM as many times as possible. In the matrix multiplication unit shown below, the deep learning model data (the weights) flows from top to bottom, while the image data, including feature maps, moves from left to right. Each piece of data can take part in up to 256 operations, and each processing element performs one multiply and one add on the data passing through it. Next to the matrix multiply unit there is a "systolic data setup" block, which schedules the data input to match the weight flow of the deep learning model; such scheduling is required for the TPU to operate effectively while executing tens of thousands of operations at once.
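The dataflow idea can be sketched with a simplified logical model of a weight-stationary systolic matrix multiplication: each weight is loaded once and stays in its processing element, and every input row that streams through reuses all of them. The timing skew and the 256x256 size of the real TPU array are omitted; this is only a sketch of the reuse principle, not the TPU's actual hardware.

```python
import numpy as np

def systolic_matmul(x, w):
    """Logical model of a weight-stationary systolic array.

    Each processing element (PE) at position (k, n) holds weight w[k, n] permanently,
    so every weight fetched from DRAM is reused by every input row that streams through.
    Activations enter from the left; partial sums are passed downward and accumulate.
    (The cycle-by-cycle timing skew of a real array is not modeled here.)
    """
    m_dim, k_dim = x.shape
    k_dim2, n_dim = w.shape
    assert k_dim == k_dim2
    y = np.zeros((m_dim, n_dim))
    for m in range(m_dim):              # one input row streams through the array
        partial = np.zeros(n_dim)       # partial sums flowing down each column
        for k in range(k_dim):          # row k of PEs
            for n in range(n_dim):      # PE (k, n): one multiply and one add
                partial[n] += x[m, k] * w[k, n]
        y[m] = partial
    return y

x = np.random.rand(4, 3)   # activations / feature-map rows
w = np.random.rand(3, 5)   # stationary weights held in the PEs
print(np.allclose(systolic_matmul(x, w), x @ w))   # True
```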
From a memory point of view, the Google TPU has a large amount of on-chip SRAM: a 24MB SRAM buffer stores the intermediate results generated while performing inference. Taking CNN image recognition as an example, the many feature maps generated by each layer are stored in this SRAM. As mentioned before, saving and loading data in DRAM costs time and energy, so the idea is to keep results in a large SRAM buffer and access DRAM as little as possible. In addition, the first-generation TPU used DDR3 DRAM to store the deep learning model (the weights). As shown in Google's paper, the bandwidth of this relatively slow DRAM cannot keep up with the TPU's computation speed, which limits the overall performance of the system. For this reason, from the second-generation TPU onward, HBM, which has the highest bandwidth among current memories, is used instead. Although Google has not officially disclosed the details, it is estimated that the second and third generations increased the number of TPU cores while keeping the basic architecture of the first generation.
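A back-of-the-envelope calculation shows why the DRAM interface becomes the bottleneck. The figures below are rough approximations (a peak rate on the order of what is reported in [1], plus an assumed HBM bandwidth), so treat them as order-of-magnitude illustrations rather than exact specifications.

```python
# Roofline-style check: how many operations must be performed per byte fetched from DRAM
# to keep the matrix unit busy. All numbers are approximate, illustrative assumptions.
peak_ops_per_s = 92e12          # ~92 tera-operations/s peak (8-bit MACs), order reported in [1]
ddr3_bw_bytes_per_s = 30e9      # assumed order of magnitude for the DDR3 weight bandwidth
hbm_bw_bytes_per_s = 600e9      # assumed order of magnitude for an HBM stack

required_reuse_ddr3 = peak_ops_per_s / ddr3_bw_bytes_per_s
required_reuse_hbm = peak_ops_per_s / hbm_bw_bytes_per_s
print(f"ops per DRAM byte needed with DDR3: ~{required_reuse_ddr3:,.0f}")
print(f"ops per DRAM byte needed with HBM:  ~{required_reuse_hbm:,.0f}")
# A workload whose arithmetic intensity falls below this ratio is bandwidth-bound,
# which is why moving the weights from DDR3 to HBM lifts the performance ceiling.
```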
Goya and Gaudi by Habana Labs
Another deep learning ASIC company is the Israeli startup Habana Labs, which was acquired by Intel in late 2019. Before the acquisition, Habana launched Gaudi for training and Goya for inference. The core of Gaudi's deep learning computation is the tensor processor core (TPC). According to Habana, the TPC has a VLIW SIMD architecture. VLIW (very long instruction word) is a structure that drives multiple execution units with a single instruction by packing several independent operations into one long instruction word. SIMD (single instruction, multiple data) is a structure in which one instruction performs the same operation on many data elements at once, i.e. on vectors. The VLIW SIMD combination appears to be a strategy to reduce instruction overhead and to increase the efficiency of the vector (matrix) computation that makes up most of a deep learning workload. The main difference between Gaudi and Goya lies in the memory interface: Gaudi, built for training on large amounts of data where memory throughput is critical, is equipped with HBM2, while Goya, built for inference where latency matters relatively more, uses DDR4. Of course, in addition to HBM2 or DDR4, scratchpad-type on-chip memory is included to reduce the latency of memory access.
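The SIMD half of this idea, amortizing instruction overhead over many data elements, can be illustrated with plain NumPy. This is of course not Habana's actual instruction set, only the general principle of issuing one vector operation instead of one instruction per element.

```python
import numpy as np
import time

a = np.random.rand(100_000).astype(np.float32)
b = np.random.rand(100_000).astype(np.float32)

# Scalar style: one operation issued per element (high instruction overhead).
t0 = time.perf_counter()
out_scalar = np.empty_like(a)
for i in range(len(a)):
    out_scalar[i] = a[i] * b[i]
t1 = time.perf_counter()

# SIMD style: a single vector operation covers all elements at once.
out_vector = a * b
t2 = time.perf_counter()

print(np.allclose(out_scalar, out_vector))               # True
print(f"scalar loop: {t1 - t0:.4f}s, vector op: {t2 - t1:.5f}s")
```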
IPU by Graphcore
The last company to look at is Graphcore in the UK. Graphcore's IPU (intelligence processing unit), an ASIC for deep learning, features about 300MB of scratchpad-type SRAM. The description of the architecture here is based on a white paper [2] written by researchers at Citadel. The IPU is composed of 1216 tiles; each tile contains one AMP and 256KB of scratchpad memory. The AMP is a VLIW SIMD processor, like Habana's TPC, designed to perform vector (matrix) operations in parallel. The defining characteristic of the IPU is this scratchpad memory, roughly 300MB in total. According to Graphcore, placing the scratchpad memory right next to the processor reduces the latency of data access and improves overall performance. However, if the combined size of the deep learning model and its intermediate values exceeds the scratchpad capacity, data must be brought in from DRAM just as in existing processors, which is a real limitation. In addition, it is not easy to distribute data properly across the memory and processors of 1216 tiles and to schedule the computation so that the processors are fully utilized. Solving this requires library support from Graphcore, but we have not yet confirmed what level of support Graphcore actually provides.
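A rough capacity check makes the limitation concrete. The tile count and per-tile scratchpad size come from the description above; the model and activation sizes in the example are hypothetical, chosen only for illustration.

```python
# Rough check of whether a model fits in on-chip scratchpad memory.
tiles = 1216
scratchpad_per_tile_bytes = 256 * 1024
total_scratchpad_bytes = tiles * scratchpad_per_tile_bytes      # ~304 MB in total

def fits_on_chip(num_weights, peak_activation_values, bytes_per_value=2):
    """True if the weights plus the largest live set of activations fit in scratchpad."""
    needed = (num_weights + peak_activation_values) * bytes_per_value
    return needed, needed <= total_scratchpad_bytes

# Hypothetical example: a ~25M-parameter model in FP16 with ~10M live activation values.
needed, fits = fits_on_chip(25_000_000, 10_000_000)
print(f"scratchpad: {total_scratchpad_bytes / 1e6:.0f} MB, "
      f"needed: {needed / 1e6:.0f} MB, fits: {fits}")
# A much larger model (hundreds of millions of parameters) would spill to DRAM,
# which is exactly the limitation described above.
```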
How's the future?
All three companies we have looked at promote the efficiency and performance of their ASICs by comparing them with NVIDIA's GPUs, the flagship of traditional deep learning processors. Except for the Google TPU, which is already deployed in real services, these companies are still at an early stage, and it remains to be seen how much of the existing NVIDIA market they can win for themselves.
[1] N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," Proceedings of the 44th International Symposium on Computer Architecture (ISCA), 2017.
[2] Z. Jia et al., "Dissecting the Graphcore IPU Architecture via Microbenchmarking," technical report, 2019.



