Hacker News
We reduced the AWS costs of our streaming data pipeline (taloflow.ai)
142 points by cloudfalcon on May 28, 2020 | 83 comments


Hmm. This looks to me like a lot of the savings were realized by moving away from managed services into a scenario where there’s more operator overhead. The AWS bill gets lower, but what about the cost of the engineering work?


Does anyone else find the costs associated with running well-tested, well-developed systems overblown? If you know how to adjust some basic parameters, you can solve for 99% of use cases (adjust memory and similar knobs).

Examples I can think of are RabbitMQ and Cassandra. But in general, we have some really battle-tested software these days that has become simpler to configure and run over time. People seem scared to run their own these days.


I vouched for this comment because it’s a valid point and I’m not sure why it was killed.

I happen to disagree strongly, though: lots of engineers in my experience undervalue the work of systems administrators and underestimate the effort needed to operationalize any technology.

Running your own is absolutely fine if you are willing to keep your stack small and invest time learning the tools you pick. But there are still horror stories of people thinking snapshots are backups, turning the wrong knobs and turning off fsync on their databases, ...


Yeah, exactly. And unless you are at FB scale you can just run a single Docker container and never really have to worry (granted you know how to use Docker).

Most small startups are actually the ones who don’t really need SaaS services.


> Yeah, exactly. And unless you are at FB scale you can just run a single Docker container and never really have to worry

This has not been the case at multiple employers and/or consulting clients.

If you're providing software to an enterprise, this will almost never fly. That single Docker container will have an outage when basically anything happens: the container dies, systemd fails to restart it, the node dies, a network switch dies, the data center has basically any major issue, etc.

I think your comment brings value but is probably biased by your own experience of running a consumer-to-consumer startup.


I think you're misinterpreting my comment. I meant specifically most small-time startups, not small-time startups deploying enterprise apps. If you're deploying enterprise apps then, by definition, you're for all intents and purposes at "FB scale."

A lot of SaaS products promise infinite scalability, a need that often never materializes for most small-time startups.


Sometimes.

But developers are part of this problem too. There are plenty of times where I see devs immediately reach for tools instead of learning just a little bit more about what they already have. My favorite example is when folks want to add a NoSQL db into the mix on top of a traditional db. Not because there's a real performance need, but because for their use case it is 'easier'. Never mind that their problem could probably have been solved by just writing their own SQL instead of trusting a garbage ORM...
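To make that concrete, here's a toy sketch (hypothetical table and data, using Python's built-in sqlite3) of the kind of hand-written aggregate that sends people reaching for a second datastore:

```python
import sqlite3

# Toy example: a hand-written SQL aggregate. Table and data are
# hypothetical; the point is that one plain query often replaces
# both the ORM gymnastics and the "let's add a NoSQL db" urge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 10.0), ("alice", 20.0)],
)

# One declarative statement: group, sum, filter, sort.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total"
    " FROM orders GROUP BY customer"
    " HAVING total > 25"
    " ORDER BY total DESC"
).fetchall()
print(rows)  # [('alice', 50.0)]
```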


This is probably a tradeoff in a lot of AWS-related stuff; you pay a premium for convenience. But depending on your workload it can pay for itself fairly quickly; AWS bills can go up quite fast, whereas personnel costs are fairly predictable.


False equivalence. The engineer will be doing more than just cloud work.

This comparison is the #1 flawed sales tactic the cloud companies use to convince you you're saving money.


> False equivalence. The engineer will be doing more than just cloud work.

> This comparison is the #1 flawed sales tactic the cloud companies use to convince you you're saving money

Time is a limited quantity, and time spent managing Postgres backups (for example) is time not spent doing other (possibly more meaningful/impactful _to the business_) work.


How much liability can you claim against AWS if there's an issue with their RDS backups?


How much liability can you claim against Cloud Employee if there's an issue with your RDS backups?


What is the chance that Amazon causes an issue with your RDS backup, versus a Cloud Employee?

The answer is definitely not clear to me at all

EDIT: no sarcasm, I legitimately don't know which I would choose as a biz owner


I would personally always have an off-platform backup to fall back on, as protection against the platform going down, accidental damage to data or malicious damage to data. Snapshots in cold storage too.


Liability? Probably none. See section "11. Limitations of Liability." here: https://aws.amazon.com/agreement/

RDS SLAs are here (they don't mention backups though, so not sure how that's handled): https://aws.amazon.com/rds/sla/

Not a lawyer or anything, but my layman's understanding is that you're voluntarily opting in to waiving liability when you sign up for AWS and accept the terms and conditions, and instead of liability, you agree to accept service credits if SLAs are not met.


What is involved in managing backups? Isn't that just a cronjob?


First, you need to write the cronjob. But what goes in there? You need to decide exactly how you're going to make a backup, and the process may differ by what's being backed up. Ideally you want a quiescent snapshot, but the way you do that varies by application. What if the application is a distributed application, in which case you need to synchronize the snapshot process among all its nodes? What if it's a master-replica design, where the node that runs the cron job may vary based on the current topology?

And if you need some sort of cluster-aware lock to coordinate backups among different peers, you'll need to decide which system works for you, implement it, and maintain that as a separate system. And if that needs to be upgraded, figure out a bulletproof process for upgrading it while it's still being used as a coordinator.

Then, you need to ensure there's storage for the backup. You need to decide what kind of storage you're going to use, make sure you've got enough space, figure out how to encrypt the storage (very important in secure environments), and how to protect the storage using authn/authz. And lots of environments have retention and storage lifecycle policies - you don't want to put the old backups on the expensive fast media; you want them on the cheap slow media. And some environments make you dispose of old data, so you have to figure out how to age it out without ever losing the backups you want to keep.

Finally, you need to make sure the backups you create are valid and usable. So you'll want to build an automated regression testing procedure to ensure that every time you make a change (regardless of how minor) to the system being backed up or the backup process, that you end up with usable backups.
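That last point is the one people skip most often. Here's a minimal sketch of "restore into scratch and compare," with Python's stdlib sqlite3 standing in for a real database (every name here is illustrative):

```python
import sqlite3

# Sketch of "backup, restore into scratch, compare" - the minimum
# bar for trusting a backup. In-memory SQLite stands in for the
# real database; all names are illustrative.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER, name TEXT)")
src.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
src.commit()

# The "backup": dump the database to SQL text (a real cron job would
# ship this to encrypted storage with a retention policy).
dump = "\n".join(src.iterdump())

# The verification: restore into a throwaway database and compare.
scratch = sqlite3.connect(":memory:")
scratch.executescript(dump)
src_count = src.execute("SELECT COUNT(*) FROM users").fetchone()[0]
restored = scratch.execute("SELECT COUNT(*) FROM users").fetchone()[0]
assert restored == src_count, "backup did not round-trip"
print("backup verified:", restored, "rows")
```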

(Disclaimer: I work for AWS, but opinions expressed here are my own and not necessarily those of my employer.)


You make it sound like there aren’t cookbooks for many of these scenarios and that the company will have to invent these scripts and procedures by hand.

Yes it is work, but this company’s whole reason for being is to save AWS spend, so I assume they have patterns they employ for their clients regularly that achieve their SLO.


There are original-definition cookbooks and yet it still costs me time to provision my own lunch vs using the managed service of my corner restaurant.


Yes, managed services are better in many cases.


Sometimes there are cookbooks, but they are of varying quality and often don't have dedicated resources to maintain them, so I would use them with great caution. You also have to implement them and often maintain the underlying infrastructure.

But I was really responding to the brusque naiveté of the "just write a cronjob" response.


“managed services” don’t necessarily save you the headache of making sure backups are usable. But other points are valid.


There was a time when I used to think the same; then I found my backups were corrupted (or had stopped because they ran out of space, etc.) just at the moment I needed them.


I once worked at an IT shop that worked closely with the construction industry. A new sports stadium was being built and we were doing panoramic photos during each stage of construction, and rendering them in a web app where the facilities team could “peel back the layers” and see what was behind the wall or under the floor all the way down to the foundation. This was... 15 years ago? So it was pretty neat technology and a little less ubiquitous than today.

Well, our storage server barfed and the data was gone. Went to restore from backups, all the hourly tar files were there... but were zero bytes.

We looked at the backup script the engineer had put together and it was one of those classic “didn’t give the right parameter to have tar recurse” type bugs. Unfortunately we lost all the photos of the foundation and much of the photos of the electric being run. Oops.
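The failure mode is easy to reproduce in an analogous toy version with Python's tarfile (file names made up); the moral is the verification function at the end, which that backup script never had:

```python
import os
import tarfile
import tempfile

# Analogous toy version of the bug: archive a directory without
# recursion and you get a "backup" containing zero actual files.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "photos"))
with open(os.path.join(root, "photos", "foundation.jpg"), "w") as f:
    f.write("irreplaceable pixels")

bad = os.path.join(root, "bad.tar")
good = os.path.join(root, "good.tar")
with tarfile.open(bad, "w") as t:   # like forgetting tar's recursion flag
    t.add(os.path.join(root, "photos"), arcname="photos", recursive=False)
with tarfile.open(good, "w") as t:  # default: recurse into the directory
    t.add(os.path.join(root, "photos"), arcname="photos")

def backed_up_files(path):
    """The check the backup script never had: count real files inside."""
    with tarfile.open(path) as t:
        return sum(1 for m in t.getmembers() if m.isfile())

print(backed_up_files(bad), backed_up_files(good))  # 0 1
```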


One wonders why tar even has an option (and a default!) to not recurse. What would be the use case for that?

The "normal" use case seems to be recursive archival. Sounds like somebody chose the wrong default... it would be an interesting software archaeology project to figure out where this "feature" originated.


Not for a reliable solution. For example, assume you have a master and a replica database for reliability: what happens if the master, where the cron runs, fails? Do you remember to set the cron up on the replica? From my experience, having worked on backup software, over 10% of servers that need to be backed up are not.

System reliability is hard, and the cloud makes that easier.


You need a lot of storage, and you need to make sure the backup is readable (view the files' contents or try a restore).

The number one backup solution nowadays is AWS S3, because it's easy-to-use unlimited storage.

How does a company handle backups without S3? Usually they don't. It would require employees to buy machines/SANs with tens of TB of storage and maintain them (weeks of ordering, plus travelling to the datacenter once in a while). It's too much hassle, so never mind.


Easy to use unlimited storage is a sure recipe for not finding what you actually need, restoring the wrong backup, etc.

Unless you take your DR plans seriously, the cloud doesn't eliminate risk, it just changes it.

The place I work at forces a failover on a monthly basis, and does a full-on offsite DR exercise twice a year.

I'm sure it took time to set it all up, but now that it's there it takes almost no effort to continue.


The cloud eliminates the most common risks, that is ops + developers simply giving up on backups because there is nowhere to store them, and not being able to access them anyway.

Typical new sysadmin in large corp: The backup storage is full and backups have been failing since before I joined, should we do something about it?

Oh we raised tickets to request more disks. They will take months to arrive if they ever pass approvals.


Backups are a solved problem and have been for decades.

Can you think of a better example?


They’re a solved problem in the sense that the tools exist, but there is still an ongoing operational cost to correctly set up, secure, monitor and test them – and there are plenty of examples of expensive failures or security breaches caused by people thinking it was an easy, solved problem.

Deciding whether you get enough benefit from doing that yourself is a classic business trade off which any experienced engineer should consider.


Yes. I’ve found that the amount you have to learn to use a managed service often equals or exceeds the amount you have to learn to run something on EC2 or on-prem. The automation/management costs of AWS or equivalent are a lot higher than people think and not significantly different from the costs to learn Linux and enough networking to do an “old-fashioned” deploy.


Much of that rings true, but I find some of the cloud abstractions can help to make the steady-state ops time required lower (especially for a side project where you really don’t want to deal with life interruptions).


The cloud was supposed to help you get rid of all these pesky sysadmins. Imagine the savings!

Now the cloud is so complicated that you have to hire "devops". It's the same people as before, with a higher salary.


I don't have a dog in this race; I'm not partnered with any cloud provider. I will say that based upon what the article discusses, they save a bunch of money on AWS Glue by... running their own ETL pipeline inside of ECS instead. What's the maintenance burden of that decision? It's certainly not zero.


That depends on the scale and other particulars that are only shared/known inside the company.

In my experience, beyond a certain scale it simply doesn't make any sense to use managed services anymore. There is a high initial upfront cost in development hours and hardware that is amortized over a very long time after, and this upfront cost is partially paid for by the reduced cloud bill.


That also depends on the markup.

EKS on EC2 costs the EC2 price plus a flat fee for the control plane, so that might make more sense than running your own Kubernetes on EC2 (although I have no experience to say how much time this actually saves you).

Managed Kafka costing 2x the cost of EC2 infrastructure? Probably not worth it.
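The break-even is just arithmetic; here's a back-of-the-envelope with completely made-up numbers:

```python
# Back-of-the-envelope break-even for a managed-service markup.
# Every number below is invented purely for illustration.
ec2_monthly = 3000.0        # raw EC2 cost of the cluster, $/month
markup = 2.0                # managed service bills 2x the EC2 cost
managed_premium = ec2_monthly * (markup - 1)  # extra $/month for managed

engineer_hourly = 100.0     # fully loaded engineer cost, $/hour
ops_hours_saved = 10.0      # ops time the managed service saves, hours/month
diy_ops_cost = engineer_hourly * ops_hours_saved

print(managed_premium, diy_ops_cost)  # 3000.0 1000.0
# At these numbers the premium exceeds the ops cost it saves, so
# self-managing wins; triple the ops hours (or shrink the cluster)
# and the conclusion flips.
```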


Most ETL processes should be declarative and configured in text files, like you configure CloudFormation or Terraform. Once you have that, the execution piece is relatively straightforward. I think a lot of issues and costs with ETL come from poor architecture decisions. For instance, we all made the mistake of running extract processes with too many transformations. An extract should be an extract; transformation should come later.
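A toy sketch of what "declarative" can mean here: the pipeline is plain data, the runner is generic, and every step name and field below is hypothetical:

```python
# Toy declarative pipeline: the steps are data (like a
# CloudFormation/Terraform file), the runner is generic.
# Step names and fields are entirely hypothetical.
PIPELINE = [
    {"op": "rename", "from": "ts", "to": "timestamp"},
    {"op": "drop",   "field": "debug_info"},
]

def run_pipeline(rows, steps):
    """Apply each declared step to a list of dict records."""
    for step in steps:
        if step["op"] == "rename":
            rows = [{(step["to"] if k == step["from"] else k): v
                     for k, v in r.items()} for r in rows]
        elif step["op"] == "drop":
            rows = [{k: v for k, v in r.items() if k != step["field"]}
                    for r in rows]
    return rows

out = run_pipeline([{"ts": 1, "debug_info": "x", "val": 7}], PIPELINE)
print(out)  # [{'timestamp': 1, 'val': 7}]
```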


It’s a “false equivalence” or “flawed sales tactic” to suggest planning using total costs? That’s what both engineers and business people are supposed to do - and reflexively attacking it really does not cast your motives in a good light.


It's a false equivalence to suggest that managed services have zero staff costs, and that using a DIY database has a staff cost measured in whole FTEs.


That could be the answer to the question which was actually asked but if you read the thread again, notice that you’re arguing against a claim nobody made.


Scroll up, my dude:

> False equivalence. The engineer will be doing more than just cloud work.

> This comparison is the #1 flawed sales tactic the cloud companies use to convince you you're saving money

"False equivalence. The engineer will be doing more than just cloud work" -> "It's a false equivalence to suggest that [...] using a DIY database has a staff cost measured in whole FTEs."

Hey, maybe AWS should launch some kind of ML-powered reading-comprehension-as-a-service?


Yes, do scroll up — note that the portions you quoted were the strawmen which nojito tossed out, not the original question, and perhaps ponder whether accusing someone else of not reading for comprehension is adding anything to the conversation.


While I don't entirely disagree I think it should be made clear that both can be true, even at the same time. To spin up a 9 node managed Elasticsearch cluster load-balanced across two regions takes a competent engineer with practice roughly a couple of hours, or twenty to thirty minutes if they were smart and terraformed it out previously. Now there's a whole host of potential problems that come along with using that managed Elasticsearch cluster too (no access to "cluster mode", no tunability, etc). But if those potential problems don't apply to you and a very vanilla ES cluster suits your use case then you're fine.

Alternatively that practiced engineer could have spun up a self-managed ES cluster in a couple of DCs in about the same time, but now has the obligation to maintain those servers (patching, etc.). Maybe that marginal cost is damn near zero - chef has been deployed to all instances and enforces patching and there's already good security monitoring in place, etc. The cost of that engineer managing that box, as with a managed ES in AWS, is practically nothing.

TL;DR: as in all cases, it depends.


Agreed - there is a tradeoff that must factor in many things: engineer competency or the ability to get competent engineers, state of the product itself (maybe Elasticsearch as a service was an interim step in a longer term vision), complexity of the managed service itself, integratability (is that a word?) into other AWS services, maturity of the managed service, and probably a few other things I'm missing.

We've seen our teams go both from managed to non-managed and non-managed to managed with relative success. To give a sense of scale, across all of our accounts we spend way north of $3 million/month at AWS, so this has happened within our realm quite a few times. The short, unsatisfying answer is that _it depends_. We have an internal policy from the suits that "if there's a managed version, use it," but most of our teams are thankfully smart enough not to take that at face value and to do their own analysis.


> integratability

interoperability?


Thanks :)


My personal favorite is "Move off of AWS MSK". No big deal - we just fire up some Kafka brokers and ZooKeeper nodes in ECS! All we gotta do is run several more supporting services to keep the cluster healthy and deal with the nightmare of Apache security ourselves.

As far as I'm concerned MSK is cheap - one broker is priced at roughly the same price as two equivalent EC2 instances. And you don't have to worry about ZooKeeper at all!


Hi Corey! I'm the author of the blog post - I definitely agree with you. Nine times out of ten, engineering teams underestimate the cost of engineering work, as well as the opportunity cost of managing non-core functionality internally or moving away from managed services.

For us, the pipeline was actually easier to build with Flink than with Glue because of the restrictions that Amazon placed on Glue, and that factored into our decision.


If you're already paying the cost (both engineering-time and compute-wise) for EMR, I can't imagine it takes more effort to create a new Flink job than a new Glue job?

The advantage of Glue, or the corresponding serverless GCP ETL option (Dataflow), is that it's serverless and elastic, but it sounds like their workload wasn't applicable.


Unfortunately that's not nearly how AWS works. AWS breaks down everything and charges you for it separately. Flink and Glue are entirely different animals.


Can you explain more? From what I understand, if you're already running a Flink cluster on AWS and you have capacity for another job, you aren't charged more, no?

I haven't used Glue, but it seems like it's able to do stream processing on Kinesis and dumping to S3 or whatnot, so it seems like there's overlap with using EMR running Flink?


I find it highly entertaining that a two-year-old company founded on the basis of helping slash cloud spending found so much waste in its own AWS spend. This is not an example of dogfooding, but an example of sheer incompetence and massive technical debt.

I'd really like to start seeing a series of blog posts from companies who are running extremely lean and efficient tech environments by utilizing cloud in an intelligent manner and avoiding the expensive and unnecessary bullshit that's so prevalent today. The ones that can brag "How we run a $4M/yr SaaS on $40k/yr of AWS spend!" are far more interesting than "How we stopped incinerating millions of VC money by simply turning off shit we didn't need"


Two-year-old companies have limited resources. It might have been a deliberate trade-off to focus on work that produces value for the customers.

Maybe the blog post would have been "How we run a $1M/yr SaaS on $40k/yr of AWS spend!" instead of $4M?


Back when AWS started, there were articles about the work of mastering scalability and performance for the modern web, but as things matured, we somehow ended up with a much larger heap of literature around AWS cost optimization.


In some sense this is a good problem to have. With on-prem you used to have very limited resources to start with, so cost efficiency was a baked-in requirement. With cloud providers you seem to have limitless resources, and the new problem of cost optimization arises.

Admittedly there’s difference between optimizing fully-controlled resources and cloud provider managed services. For one, low visibility into cloud service internals makes such optimization harder.


Engineering is always balancing capabilities and costs. Before AWS existed there were plenty of stories about people over-buying to handle peak loads, optimizing workloads to fit a particular budget (especially during the dotcom era when VCs stopped underwriting huge sales for Sun, et al.), etc.

Cloud services gave new options for variable use and reallocating management costs but they also did something which most places were not used to: expose every detail as an itemized bill. That makes costs more visible than they’d been for most organizations which is good in the sense that people can make architectural decisions with pretty detailed numbers but bad in that many CIOs get sticker shock unless they’d done a well above average job calculating on-premise TCO.


We ended up here because Amazon can't scale. It's just uncool to admit you have to notice the pink elephant. Why? I don't know. Maybe it has to do with cred in engineering teams or for engineering teams in the broader org structure.

But the problem with AWS, with a lot of the "cloud", is the pitch that remote centralization of a service scales ad infinitum. It's still subject to the same constraints as self-managed, even if those constraints appear at a higher limit.

The greatest constraint is the per-unit pricing. You buy self-managed, you have huge upfront and period costs, but with remote, you see the $.03/MB price and assume that variable cost is more manageable over the long run. And it is... until price changes, overhead changes, bandwidth changes, or worse, accessibility changes. And suddenly, what you had cost-effective scaling on 18 months ago now has a massive deficit affixed to it. Because that's how most people used the platform... or because removing A or B features reduced maintenance costs or freed up bandwidth.

AWS is an experiment. Does it work in many or even most use cases? Yes. For now.

I love engineers. A lot. In fact, being in sales, I would give up a deal with an engineering team unless I knew for sure my ROI basis was solid. That said, I do know sales and marketing rhetoric. And having spent hundreds of hours in meetings with product, marketing and dev professionals, I wish I could record the stress-induced breakdowns I've seen in engineers and executives who had everything running buttery, "and then [provider] pushed [update]..." and they then have executives breathing on the back of their neck 16 hours a day, entire teams offline or unable to do basic tasks, etc. I just want to play that shit to people and say, "This is why you don't overpromise."


I am curious about the actual cost in $! Managing your own kafka or observability infra is expensive, you need a team to do this.

A 67% reduction doesn't tell the whole truth. They have more services to manage now, which means they need more people and more time to do this.

Saving $10k on your AWS bill by hiring two more engineers is not cost-effective.


Of course it is.

Where did we get the idea that engineers are hired to do only one thing?

This has never ever been the case in my experience.

Also, Kafka being hard to manage is just not the case. A quick look at the many small companies and startups running their own clusters shows otherwise.


Engineers are hired to deliver and produce value. Tooling can be a part of it but if you can outsource something which is not your source of income, you have to do it. Engineering time is more valuable.

I also know many startups and small companies investing 5 people and 6 months to get an observability platform up and running when they could just get Datadog or New Relic for half the price... and that's not taking into account outages and updates to the platform.

I remember a recent Uber blog post on how they moved from build tool A to build tool B, and a couple of weeks later, 3,000 people were laid off. It's important to spend development time on revenue streams.

This is some nice piece of advice https://nav.al/build-a-team-that-ships

"Outsource everything that isn’t core. Resist the urge to pick up that last dollar. Founders do Customer Service."


>Where did we get the idea that engineers are hired to do only one thing?

At a certain size or number of self-run services, they very well might be. I used to be the guy who did the setup for these sorts of self-managed solutions and ran them day to day. In some shops the workload was high enough that we needed multiple people like me doing it. Or a whole team. Doing DevOps-style management of them just let us do it with fewer people - it certainly didn't make it feasible for developers to do the day-to-day management of these services and still write code.


"Eliminate unused EC2 instances" -27% of cost

Haha, so they cleaned up the internal IT/DevOps mess, called it a day, and then blogged about it.


Eh, you're pulling a quote out of context. It's a 27% reduction in EC2 use, and EC2 was only 18.5% of the total. So this only accounted for ~5% of total savings.


I mean if a quarter of your EC2 instances were unused, that is absolutely an internal devops / IT mess.

The whole point of AWS is to use services on demand; it's like buying 133 conference tickets for your 100 person company.


> The whole point of AWS is to use services on demand

That's a decade-old misconception about how people actually use AWS.

Most servers I've seen in AWS are permanent.

In fact, it's an anti-pattern to wait until you need more capacity to scale up, since those servers may not be available, especially in newer instance families.

Even if the needed instances are available, ASGs typically don't react the way you expect without a lot of experimentation (i.e. outages). An example: if traffic increases load, your health check may consider the servers unhealthy and start killing them, creating a death spiral.
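The death spiral is easy to see in a toy simulation (all numbers invented): a "deep" health check that times out under load makes the autoscaler kill loaded-but-alive instances, which only loads the survivors more:

```python
# Toy death-spiral simulation; every number is made up.
total_load = 1100           # requests/sec across the fleet
capacity_per_node = 100     # req/sec one instance serves before timing out
nodes = 10                  # 10 * 100 = 1000 capacity: a 10% overload

kills = []
while nodes > 0 and total_load / nodes > capacity_per_node:
    nodes -= 1              # health check times out -> instance terminated
    kills.append(nodes)     # survivors now carry even more load each

print(kills)  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
# A 10% overload, plus a health check that conflates "slow" with
# "dead", takes out the entire fleet.
```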


This is why service health endpoints should be very carefully designed, with the proper balance between health inspection and performance penalty.

That being said, it's true that an ALB doesn't offer throttling capability like a true reverse proxy such as HAProxy, where you can cap the number of concurrent requests and give your backend a chance to avoid death by overload.

I wish there were a way for ASGs to at least make the distinction between an unhealthy instance and an overloaded one.


I think there are different levels of "on demand". In this case, I think this is a much slower level of "on demand".

I see two primary use cases of cloud:

1) You're a startup or just need something small, and want to focus on building your MVP, instead of messing around with colocated Linux servers. Cloud is much more expensive than those, but you don't care because you don't really need all that much, or maybe you are VC backed and have unlimited money.

2) You're a large company with broken internal processes. You can get a server in the company datacenter in three months after seven approvals (since it's capex), or you can spin up an EC2 instance. You don't care about cost since you have unlimited money.

Those are kind of medium-scale "on demand" - not "I need 100 new servers right this minute" but "I need a server in ten minutes instead of 'when I get to buy one' or 'in three months and 37 forms'".

In both cases, you're throwing money away, because time is more important for you than the extra money cloud costs.


More like ordering 133 lunches every day for your 100 employees and dumping 33 in the trash.


Fun fact, 30-40% of the food in our supply chain is wasted. So, we're actually not too far from that. Sure, the meals aren't directly wasted by the company and rather through the supply chain. But, the waste is still there and is surprisingly high IMO.

Source: https://www.usda.gov/foodwaste/faqs


And doing this for months. Without noticing.

Honestly, this isn’t ultimately engineering’s fault. This is a SaaS business; someone in their company is responsible for the COGS KPI. For that person to either not notice an increase in COGS, or not be aggressively incentivizing engineering to reduce COGS, is a giant red flag.


The article did say that the motivation for doing so was because AWS credits were running out .. why prematurely optimize a free resource? :)


Ouch, this thread ruined some poor soul's promotion/pay-rise narrative.


Perhaps more like ordering 100 lunches for your 100 employees and forgetting that 20 of them moved to night shift?


I don't disagree with that at all. By all means, we should be doing our best not to waste resources.

Just stating that the waste was only 5% of the total savings.


It seems reasonable to make some of these cost comparisons more visible.

i.e. if working on a new product or feature, to understand upfront that "this managed service is x% more than the bare-bones option", etc.

essentially turning alchemy into a science


AWS offers a cost calculator for just that purpose; they offer 'easier' products if you can't be arsed to dive into AWS costs and technologies yourself.

I think a lot of people make the mistake of assuming AWS is just an easy off-the-shelf thing you can just grab, but if you use it seriously it's a full-time job and its own expertise.

Source: I've done some AWS certifications, never was able to put them into practice though. I've also worked in multiple organizations that migrated to AWS, they all had a full-time team of people managing it.

It's a full-time, specialist job and you can't just palm it off to your engineers as a background thing.


The initial pie chart seems to indicate that either AWS glue is significantly overpriced, or that they were doing something wrong.


As with all things AWS the more "magic" there is to it, the more expensive it is.


A huge part of why I always try to build applications as platform-agnostic as possible.

If I make a .NET service or site, I know (with the tools I use) that I can deploy it on any Linux or Windows machine without issue. I can take it anywhere I can run any software.

Sure, I may need more glue for certain scenarios, but you know that you can move as soon as a provider shows its fangs.


Speaking from experience - not a bad idea.


Interesting idea. Does anyone do this for Azure?


We went through a similar process with GCP, which was annoying since GCP was sold as being cheaper than AWS.



