1. Evaluate the Financial Costs Involved

Cost Estimate (Renting vs. Purchasing Compute Hardware)

First, let's go over the financial costs (computational and otherwise) involved. We'll use Meta's Llama 2 large language model as a baseline:

The 7B (billion) parameter version of Llama 2 required approximately 180,000 GPU hours to train.

The 70B (billion) parameter model is 10 times as large, and it took roughly 10 times as much compute: about 1.7 million GPU hours.

For our example, we'll assume a hypothetical Llama 2 10B model, i.e. a 10-billion-parameter model, takes on the order of 100,000 GPU hours to train.

At that rate, a 100-billion-parameter Llama 2 model would take approximately 1 million GPU hours to train.
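To make the scaling rule explicit, here is a minimal Python sketch of the linear relationship assumed above. The 10,000-hours-per-billion-parameters rate is just this article's rounded anchor (100,000 hours for a 10B model), not an official figure:

```python
# A minimal sketch of the rounded scaling rule used above: training GPU hours
# are assumed to grow linearly with parameter count, anchored at
# ~100,000 GPU hours for a 10B-parameter model. This is a deliberate
# simplification; Meta's reported figures for Llama 2 are higher per parameter.
GPU_HOURS_PER_BILLION_PARAMS = 10_000  # assumed: 100,000 hours / 10B params

def estimate_gpu_hours(params_billions: float) -> float:
    """Estimate training GPU hours under the linear-scaling assumption."""
    return params_billions * GPU_HOURS_PER_BILLION_PARAMS

print(f"{estimate_gpu_hours(10):,.0f}")   # 100,000 GPU hours for 10B
print(f"{estimate_gpu_hours(100):,.0f}")  # 1,000,000 GPU hours for 100B
```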

So how can we translate these GPU hours into a dollar amount?

There are two options:

Option #1

We can rent the compute (GPUs) from any of the big cloud providers, such as Google, Amazon, or Microsoft. Renting an Nvidia A100, the kind of GPU used to train Llama 2-class models, costs approximately $1 to $2 per GPU per hour.

So for a 10-billion-parameter model, training alone will cost on the order of $150,000 (100,000 GPU hours at roughly $1.50 per hour). For the 100-billion-parameter model, the cost comes to around $1.5 million.
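As a quick sanity check on that arithmetic, here is a small sketch that multiplies the assumed GPU hours by a rental rate of $1.50 per GPU hour (an assumed midpoint of the $1 to $2 range quoted above):

```python
# Rental-cost sanity check: GPU hours x hourly rate.
RATE_USD_PER_GPU_HOUR = 1.50  # assumed midpoint of the $1-$2 range

def rental_cost(gpu_hours: float, rate: float = RATE_USD_PER_GPU_HOUR) -> float:
    """Cost of renting cloud GPUs for a given number of GPU hours."""
    return gpu_hours * rate

print(f"10B model:  ${rental_cost(100_000):,.0f}")    # ~$150,000
print(f"100B model: ${rental_cost(1_000_000):,.0f}")  # ~$1,500,000
```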

Option #2

The second option is to buy the hardware. In that case, we have to take into account the price of the GPUs and the host hardware. Let's say an A100 costs about $10,000 and we build a 1,000-GPU cluster. The hardware alone is going to cost around $10 million.

But that's not the only cost. Running a cluster like this for weeks consumes a tremendous amount of energy, so we also have to factor in the energy bill. For example, training a 100-billion-parameter model consumes about 1,000 megawatt-hours of energy. At roughly $100 per megawatt-hour, the marginal (energy) cost of a training run is around $100,000. Summing it up, the total cost of training a 100-billion-parameter LLM on purchased hardware comes to approximately $10.1 million.
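Putting the purchase option together, here is a short sketch under the same assumed figures (1,000 A100s at $10,000 each, plus 1,000 MWh of energy at $100/MWh):

```python
# Buy-option total: hardware (one-time) plus energy for a single training run.
A100_UNIT_PRICE_USD = 10_000      # assumed price per A100
CLUSTER_SIZE_GPUS = 1_000         # assumed cluster size
ENERGY_MWH = 1_000                # energy for one 100B-parameter training run
ENERGY_PRICE_USD_PER_MWH = 100    # assumed electricity price

hardware_cost = A100_UNIT_PRICE_USD * CLUSTER_SIZE_GPUS  # $10,000,000
energy_cost = ENERGY_MWH * ENERGY_PRICE_USD_PER_MWH      # $100,000
total_cost = hardware_cost + energy_cost                 # $10,100,000

print(f"Hardware: ${hardware_cost:,}")
print(f"Energy:   ${energy_cost:,}")
print(f"Total:    ${total_cost:,}")
```

Note that, unlike renting, the hardware is a one-time cost that can be amortized across multiple training runs; only the energy is a per-run marginal cost.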

This enormous cost explains why only a handful of tech giants take on the risk of training new LLMs so frequently. OpenAI's GPT-4, reportedly a 1.76-trillion-parameter model, is estimated to have cost over $100 million to train. Not many companies can afford this.
