Well, let's do some order-of-magnitude calculations: A single 1 kW B200 GPU will set you back $50k, and as NVIDIA claims[0], it can do 125 tokens per second with Llama 4. Let's imagine you can use it for 36 months, at a DC, cooling and electricity price of 20 cents per kWh. That's $4.3E-6 per token for the card and $4E-7 per token for DC and power, together $4.7E-6 per token.
Let's say you are a power user, so your queries and responses are complex and numerous: say 1000 tokens per query+response and 1 query every 10 minutes of an 8h workday. That's 48k tokens per workday; at 20 workdays per month, that's 960k tokens per month.
So the cost (not sales price!) for those 960k tokens (roughly 1M) a month should be about $4.50.
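As a sanity check, here's the same back-of-the-envelope arithmetic as a small sketch (all inputs are the assumptions above, nothing measured):

    # rough per-token cost for a single B200, using the assumed figures above
    gpu_price_usd   = 50_000      # purchase price
    gpu_power_kw    = 1.0         # board power
    kwh_price_usd   = 0.20        # DC, cooling and electricity
    lifetime_months = 36
    tokens_per_sec  = 125         # claimed Llama 4 rate from [0]

    lifetime_tokens = lifetime_months * 30 * 24 * 3600 * tokens_per_sec
    card_per_token  = gpu_price_usd / lifetime_tokens                        # ~4.3e-6
    power_per_token = gpu_power_kw * kwh_price_usd / 3600 / tokens_per_sec   # ~4.4e-7
    total_per_token = card_per_token + power_per_token                       # ~4.7e-6

    monthly_tokens = 1000 * 6 * 8 * 20        # 1k tokens/query, 6/h, 8 h/day, 20 days
    print(total_per_token * monthly_tokens)   # ~4.5 USD per month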
Now you can go over the numbers again and think about where they might be wrong: Maybe a typical query is more than 1000 tokens. Maybe power users issue more queries. You might very well multiply by a factor of 10 here. Nvidia getting greedier with new GPU prices? Add 50%. Data center and power cost too conservative, network and storage also important? Add 50%. 3 years of use for a GPU too long, because the field is very quickly adopting ever larger models? Add 50%. Usage factor not 100%, but lower, say a more realistic 50%? Double the cost. Llama 4 not good enough, need a more advanced model? That may produce a lot fewer tokens per GPU-hour, but numbers are hard to come by.
With that, it's easy to imagine that one might still lose money at $200 per month.
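Stacking those corrections multiplicatively shows how quickly the $4.50 baseline blows past $200; a sketch, where every factor is just the guess from the paragraph above:

    base_monthly_cost = 4.5   # USD, from the estimate above

    corrections = {
        "bigger / more frequent queries": 10,
        "pricier GPUs":                   1.5,
        "DC, power, network, storage":    1.5,
        "shorter GPU lifetime":           1.5,
        "50% utilisation":                2,
    }

    factor = 1.0
    for f in corrections.values():
        factor *= f              # 10 * 1.5 * 1.5 * 1.5 * 2 = 67.5

    print(base_monthly_cost * factor)   # ~300 USD per month, before any model upgrade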
For comparison, Azure sells OpenAI models priced per 1M tokens[1], which maps easily onto the above monthly cost.
[0] https://developer.nvidia.com/blog/blackwell-breaks-the-1000-...
[1] https://azure.microsoft.com/en-us/pricing/details/cognitive-...
> as NVIDIA claims[0], it can do 125 tokens per second
The claim is per user. With batching, it is MUCH higher (72x)
I'm not so sure there. The factor of 72 is coincidentally also the number of GPUs in a full GB200 DGX rack[1]. The phrasing "and it reaches 72,000 TPS/server at our highest throughput configuration" also hints at something being fishy here. They carefully use the phrase "node" earlier and "server" later, without specifying what they mean by a "server". Also, for that 72,000 figure, there is no mention of batching at all.
The very short article [2] linked in [0], which is supposed to be the independent source of those numbers, also doesn't specify any details to that effect.
In general, I've learned to treat Nvidia numbers very carefully. They are well known for misrepresenting apples-to-oranges-to-elephants figures, such as comparing FP16, FP8 and FP4 FLOPS against each other, thereby grossly overstating the performance advantages of their new architectures[3].
[0] https://developer.nvidia.com/blog/blackwell-breaks-the-1000-...
[1] https://www.nvidia.com/en-eu/data-center/dgx-gb200/
> NVIDIA DGX™ GB200 is purpose-built for training and inferencing trillion-parameter generative AI models. Designed as a rack-scale solution, each liquid-cooled rack features 36 NVIDIA GB200 Grace Blackwell Superchips – 36 NVIDIA Grace CPUs and 72 Blackwell GPUs
[2] https://www.linkedin.com/feed/update/urn:li:activity:7331470...
[3] https://dev.to/maximsaplin/nvidias-1000x-performance-boost-c...
> The factor of 72 is coincidentally also the number of GPUs in a full GB200 DGX rack
Sure, but the article already gives 1000 TPS/user for an 8-GPU node, and the rack contains 9 such nodes - i.e. 9x more GPUs, not 72x - so the 72,000 TPS/server simply being a multiple of 72 seems like a red herring.
But yeah, I agree that 72x seems high - although only 9x seems low, given that vLLM shows over 20x speedups with continuous batching. I guess there are a lot of variables.
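FWIW, the two readings of "server" imply very different batching gains; this is just rearranging the numbers quoted above:

    tps_per_user   = 1_000    # claimed, single user on an 8-GPU node
    tps_per_server = 72_000   # claimed "highest throughput configuration"

    # if "server" means one 8-GPU node: 72x from batching alone
    print(tps_per_server / tps_per_user)        # 72.0

    # if "server" means the full 72-GPU rack (9 nodes): 8,000 TPS per node,
    # i.e. only 8x the per-user rate from batching
    print(tps_per_server / 9 / tps_per_user)    # 8.0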
You're not factoring the input tokens into the equation, which are 90% of the price with a tool like Cline.
My queries are like 30,000 tokens input for 50 tokens output
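To illustrate how lopsided that is, a quick sketch with hypothetical per-token prices (the $/M rates below are made up for illustration, not any provider's actual pricing):

    input_tokens, output_tokens = 30_000, 50
    input_usd_per_m, output_usd_per_m = 3.00, 15.00   # hypothetical prices per 1M tokens

    input_cost  = input_tokens  / 1e6 * input_usd_per_m    # $0.090
    output_cost = output_tokens / 1e6 * output_usd_per_m   # $0.00075
    print(input_cost / (input_cost + output_cost))         # ~0.99: input dominates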
I didn't really distinguish between those token types, since I didn't want to complicate the calculation too much. But yes, you are right, there is a large difference between input and output tokens.
Input management is the biggest difference between monthly plans like Copilot and pay-per-use like Cline. Monthly plans hide the input from the user and try to minimize it while pay-per-use plans visibly use the full context window.
Mentally modelling the pricing as being determined by output doesn't match reality in my experience.
It's also fundamentally different economics since input is gated by VRAM capacity while output is gated by compute.
There are also the training costs, which are a massive capital investment that needs to be spread over all subscribers. It depends on the training schedule as well: for each training window, its costs need to be spread over the subscribers during that window.
It's good that it scales down with a higher number of paying subscriptions (each pays a smaller share of the training costs).
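As a toy example of that amortization (every number below is hypothetical, just to show the shape of the calculation):

    training_cost_usd = 100_000_000   # hypothetical cost of one training run
    window_months     = 12            # hypothetical useful life of that model
    subscribers       = 5_000_000     # hypothetical paying subscribers

    print(training_cost_usd / window_months / subscribers)   # ~1.67 USD per subscriber per month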
Companies generally charge whatever price they think will optimize their profit. This is quite unrelated to what the service costs to run.
There is also market separation in play: For the base service you only charge cost plus a small margin. For higher service levels you charge higher profit margins, even though the additional service does not cost that much more to provide.
Best example: flight seats. Economy class fills the plane, but business and first class are the money makers [1].
[1] https://www.youtube.com/watch?v=BzB5xtGGsTc
Sure, and it wouldn't be surprising to see different pricing for realtime AI API usage vs slower (overnight, etc) response times (to fill up the seats and keep the hardware occupied). It remains to be seen how the dynamics of this work out though - it's a function of the cost of increasing capacity vs customer demand at various price points and service levels.
LLM pricing seems to still very much be up in the air though - models getting more efficient, serving hardware getting more efficient, use cases evolving, and not all providers operating with the same business model (e.g. Meta, maybe China).
Yes, you see this all over the place. For example, Home and Professional versions of software products:
https://successfulsoftware.net/2013/02/28/how-i-increased-sa...
Sure, and early adopters can usually expect to pay more.
> This price is quite unrelated to what the service costs to run.
Well. It's noteworthy when the price is lower than the cost.
It's not that rare. But it is noteworthy, as it's not sustainable.
The optimal price is the optimal price, regardless of what the service costs to run.
But, yes, if the cost to run the service is X and the optimal price is <X, you have a problem.
The joke used to go "We lose money on every sale, but we make it up in volume", but funnily enough, this financial logic adds up when you make it up with investor money.
No it's not. Assume there is demand for 5000 units at $1 and 2000 units at $2. If it costs $0 to produce, then the profit is $5000 vs $4000, so you should price at $1. If it costs 90c to produce, then the profit is $500 vs $2200, so you should price at $2.
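The same point as a tiny sketch, using those demand numbers:

    demand = {1.00: 5000, 2.00: 2000}   # price -> units sold at that price

    def best_price(unit_cost):
        # pick the price that maximizes total profit
        return max(demand, key=lambda p: (p - unit_cost) * demand[p])

    print(best_price(0.00))   # 1.0 -> profit $5000 vs $4000
    print(best_price(0.90))   # 2.0 -> profit  $500 vs $2200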
I was talking about software (the OP example), where there is a big fixed cost and (generally) a very low per unit cost. But I probably should have made that clearer. Things do indeed change when you have a significant per unit cost (e.g. manufactured goods).
Except that AI companies have different unit economics from SaaS businesses, requiring a rethinking of fundamentals, as laid out in this a16z post: https://a16z.com/the-new-business-of-ai-and-how-its-differen...
The per unit cost here is electricity. Those GPUs suck down enough power during inference that it is not negligible.
Depends on the type of software. Some software have significant support costs.
AWS and Azure do charge per usage, but not in line with electricity usage. They're well above that.
>The optimal price is the optimal price, regardless of what the service costs to run.
such insight
Is it that noteworthy? That has been the entire Silicon Valley playbook for the past... decade?
Grow at all costs with VC money, grab the market by offering unsustainably cheap prices, and when you have a monopoly, offer a slightly better or slightly worse version of the previous service (but hey, with an app that runs React!), at a slightly higher or same-as-before price (turns out profitability is important!), with much, much worse working conditions for everyone involved (VC needs their money back, it has to come from somewhere!).
Then look around confused when there is insane wealth inequality and social unrest.
money well spent if you ask me
Well, since I know that a lot of people are actually creating businesses based on chatbots, $200/month is probably an acceptable price.
The article says it's a money loser, though, so I suspect that a lot of AI-based businesses run just fine from the lower-tier price point.
They might want to consider adding an “in-between” pricing tier.
Are they creating profitable businesses?
I suspect so, but probably because they run on the $20/month subscription, and charge a lot more than that.
Because there are customers willing to pay $200/month and more...
I sell SaaS software that's easily six figures per month. I think there's a confusion between professional pricing and "Pro" as the upper tier of an individual service.
I also sell SaaS software, though in the seven-figures-per-month range. But besides that, I agree with you.
What is the point of this comment? Just to brag?
Just waiting for Bezos to jump on to that one and end it.
I can't readily find the HN post, but the math posted was something like 'Each prompt costs a bottle of water'. Now think about the logistics required to get that amount of water. AI usage currently does not scale well.
Calling ChatGPT a "chat bot" in 2025 isn't technically incorrect but it is like calling a male human assistant a "chat guy".
It costs $200 because the chatty little bot knows a surprising number of things amazingly well, and does decent work pretty darn fast.
It's trained on our collective data, so it should be owned collectively. Wasn't ChatGPT a non-profit once?
Sure, the data is out there: if you have billions of dollars, you can set up your own data center, gather the same data, hire the engineers, pay the networking and inference costs, and make it free to everyone.
If it had stayed a non-profit, would people have donated enough to keep it in business? Not enough people are willing to donate to keep a browser maker in business.
It costs $200 because you probably need 50 cents to cover its costs, and because people are happy to pay $199.50 for the value it provides.
Whether that value is worth the money is a different discussion, that is rarely held with big tech offerings.
> probably 50 cents to cover its costs
Right now it absolutely costs more than the subscription price.
The technology is nascent and takes kilowatts of power to run. It doesn't look like there are any more fundamental breakthroughs coming either, and we can now only hope for Moore's-law-pace improvements until someone comes up with a better trick than the one LLMs are using.
Add it to the pile, along with crypto, of software with an enormous cost to run and questionable utility when we're supposed to be in the middle of a climate crisis.
https://archive.is/XYUmD
Why is no one in this thread saying the real reason: it's meant for business customers who are using LLMs in a professional context, sending sometimes five figures of tokens per prompt, for 8 hours a day, every day. And while business users are not particularly price-sensitive, they also don't want the surprise huge bill that you could get with usage-based pricing.
And we don't even know how high the pricing will get once they're out of the competitive customer-acquisition phase and into the steady-state dependent-customers phase.
I assume that over time pricing will converge to the cost of service provision plus a reasonable (50%?) profit margin. As long as the profit margin remains high, it encourages more competition.
Of course there is a large cost in building a SOTA model in the first place, and maybe building your own datacenter(s) for inference too, but compare to something like semiconductor manufacturing, where upfront costs are also very high yet profit margins are still reasonable, e.g. ~40% for TSMC, who make chips for NVIDIA, AMD, Apple ... As long as there is a possibility of competition (primarily Samsung in this case), profit margins will be held in check.
Enterprises and public administrations are showering everything AI with money. AI is the new COVID. Why did a single surgical face mask cost $5 in 2021?
Because they offer $200 worth of value?
Price and value are far from the same thing
Indeed. But value is a subjective matter, and if a company believes they're receiving the requisite amount of value from their expenditure, then that's all that really matters, isn't it?
It costs $200 because they didn't pay for the terabytes of data they trained their model on.