Tesla has revealed specifications of its current supercomputing cluster, which may be one of the most powerful in the world.
Since 2019, Tesla CEO Elon Musk has periodically tweeted about Dojo, a ‘neural network exaflop supercomputer.’
While that system is still in development, the automaker this week discussed a powerful precursor it already uses for autonomous driving training.
Andrej Karpathy, senior director of AI at Tesla, shared details of the as-yet-unnamed system this week at the 2021 Conference on Computer Vision and Pattern Recognition (CVPR 2021).
“This is a massive supercomputer,” Karpathy said. “I actually believe that in terms of flops this is roughly the number five supercomputer in the world, so it’s actually a fairly significant computer here.”
According to Karpathy, the system – the company’s third HPC cluster – has 720 nodes, each powered by eight 80GB Nvidia A100 GPUs, for 5,760 A100s in total. It also includes ten petabytes of “hot tier” NVMe storage with a transfer rate of 1.6 terabytes per second.
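As a quick sanity check on those figures, the sketch below tallies the GPU count and an aggregate peak-throughput estimate; the per-GPU peaks are Nvidia's published A100 datasheet values, not numbers Tesla confirmed.

```python
# Back-of-the-envelope totals from the figures quoted above, combined with
# Nvidia's published A100 peak numbers (datasheet values, not Tesla-confirmed).

NODES = 720
GPUS_PER_NODE = 8
A100_BF16_TFLOPS = 312          # dense Tensor Core peak per A100
A100_FP64_TENSOR_TFLOPS = 19.5  # FP64 Tensor Core peak per A100

total_gpus = NODES * GPUS_PER_NODE                               # 5,760
peak_bf16_pflops = total_gpus * A100_BF16_TFLOPS / 1_000         # ~1,797 PFLOPS
peak_fp64_pflops = total_gpus * A100_FP64_TENSOR_TFLOPS / 1_000  # ~112 PFLOPS

print(f"GPUs: {total_gpus:,}")
print(f"Peak BF16 tensor throughput: ~{peak_bf16_pflops:,.0f} PFLOPS")
print(f"Peak FP64 tensor throughput: ~{peak_fp64_pflops:,.0f} PFLOPS")
```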
HPCWire calculates that, based on previous benchmarking of A100 performance on Nvidia’s own 63-petaflops Selene supercomputer, 720 eight-A100 nodes could yield around 81.6 Linpack petaflops, which would place the machine fifth on the most recent Top500 list.
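The scaling behind that estimate can be reproduced roughly as follows; the Selene figures used here (about 560 eight-A100 nodes and roughly 63.5 Linpack petaflops) are assumptions drawn from public Top500 data, and real Linpack results depend heavily on interconnect and tuning.

```python
# Rough linear extrapolation from Selene's Linpack result to a 720-node
# cluster of eight-A100 nodes. Selene's node count and Rmax are approximate
# values from public Top500 listings, not figures from Tesla or HPCWire.

selene_nodes = 560          # eight-A100 nodes in Selene (approx.)
selene_rmax_pflops = 63.5   # Selene's Linpack Rmax (approx.)
tesla_nodes = 720

estimated_rmax = selene_rmax_pflops * tesla_nodes / selene_nodes
print(f"Estimated Linpack: ~{estimated_rmax:.1f} PFLOPS")  # ≈ 81.6 PFLOPS
```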
Much of the company’s supercomputing work is focused on fully autonomous driving. The training data – one million ten-second videos from each of the eight cameras on the sampled Teslas, each running at 36 frames per second – totals 1.5 petabytes.
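Those figures also imply a rough per-frame size, sketched below; the reading that each of the one million clips carries all eight camera streams is our assumption, and compression or label data could shift the numbers considerably.

```python
# Rough sizing of the 1.5 PB training set from the figures above. Assumes
# each of the one million clips includes all eight camera streams (our
# interpretation of the quoted figures, not a Tesla-confirmed breakdown).

clips = 1_000_000
seconds_per_clip = 10
fps = 36
cameras = 8
dataset_bytes = 1.5e15  # 1.5 petabytes

total_frames = clips * seconds_per_clip * fps * cameras  # ~2.88 billion
avg_bytes_per_frame = dataset_bytes / total_frames

print(f"Total frames: {total_frames:,.0f}")
print(f"Average size per frame: ~{avg_bytes_per_frame / 1e3:.0f} kB")  # ~520 kB
```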
Karpathy said work on Project Dojo is still ongoing, but the company isn’t ready to reveal more details. Musk has previously said the system will use chips and a computing architecture developed in-house and optimized for neural network training, rather than a GPU cluster.
Source: datacenterdynamics.com