Original author: Bilal Aslam
Summary: Today, Bilal Aslam, co-founder and Chief Product Officer of Appuri, shares the painful experience he and his team had with Amazon ECS.
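At Appuri, our ETL pipeline, API and UI are made up of a large number of small, single-purpose services. We started with large, monolithic repositories and gradually moved to a microservices pattern, not out of any philosophical bias but because it fits the way we work. It has its pros and cons, but by and large microservices have worked well for us. We are not here to discuss microservices today, though. We are here to tell you about our painful experience with Amazon EC2 Container Service (ECS), and how we pulled back from the brink by moving to Kubernetes.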
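In June 2015, we started looking for a PaaS on which to deploy our services. I wanted a managed service built around Docker that still left us a degree of control. As AWS customers, we considered Amazon Elastic Beanstalk and the brand-new Amazon EC2 Container Service (ECS). ECS appealed to us for several reasons:
- It launches Docker containers quickly and easily.
- ECS is aware of multiple availability zones (AZs).
- It supports rolling deploys, which promise deployments with zero downtime.
- API clients. Every AWS service has API clients for all the languages we use.
- ECS works with ordinary EC2 instances. We do not have to learn a new PaaS; we just install the ECS agent on any EC2 instance running Amazon Linux and have it join an ECS cluster.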
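Our first impression on seeing an ECS demo was that it was missing many key features:
- No service discovery. In ECS, the substitute for service discovery is an internal load balancer. This is the only way to run a network-accessible service in ECS: even with a single instance, you have to run an ELB. For a microservice architecture, this adds cost with every service you deploy.
- Mediocre CLI. Compared with competitors such as Kubernetes, the ECS CLI is mediocre. You can scale from the command line (aws ecs update-service --desired-count N), but the ECS CLI is not very powerful.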
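In the nearly one year that we used ECS, we followed every feature announcement, took part in opening GitHub issues, and so on. In the end, we gave up on ECS for the following reasons:
- The ECS agent disconnects frequently, which makes it impossible to launch new containers. ECS installs an agent on every EC2 instance to interact with the Amazon API and with Docker. This agent disconnects often, causing deployments to fail, which is fatal for our services. The GitHub issue tracking the problem has been closed, yet it keeps happening: on our clusters it occurs at least twice a day. Despite our best efforts, we have not been able to find the root cause. To my knowledge, the ECS team has still not resolved it.
- Lack of attention to GitHub issues. Many feature and customer requests filed as GitHub issues have received no attention from the Amazon ECS team.
- Bad architecture. ECS lacks many of the fundamentals that modern deployment and operations infrastructure needs.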
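After much grumbling about ECS, we decided to try Kubernetes (k8s). Two weeks in, we are very happy. This open source project is well suited to deployments and operations at scale. From the CLI to service discovery and configuration management, it is a pleasure to use. We did hit one odd issue where kube-proxy was not routing traffic correctly, but a restart fixed it and it has not recurred. So far, we have not regretted the choice.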
Here at Appuri, we have a large number of small, single-purpose services that make up our ETL pipeline, API and UI. We started from large, monolithic repos and gradually migrated to this microservices pattern, not because of any philosophical bias but because it fit our work style. By and large, this has worked well with all the known pros and cons of microservices. But I’m not here to debate microservices. I’m here to tell you about our nightmare on Amazon EC2 Container Service (ECS) and how we saved ourselves by moving to Kubernetes.
NOTE: In general, we love AWS. Also, your mileage with ECS may vary. For example, Segment had a great experience with ECS and apparently none of our complaints.
There’s also the wonderful Convox project which contains a lot of great workflows on top of ECS. When we started using ECS, Convox wasn’t far enough along to meet our needs.
And so, it begins, with a love of managed services
I love managed services. For example, we don’t run our own Postgres server – we use Amazon RDS. We also don’t run our own hypervisor or bare metal servers, we use Amazon EC2. With managed services, you trade control for peace of mind and, in an ideal world, you can focus on building differentiated value add. Everyone wins. In fact, we have had exactly this experience with most managed services.
In June 2015, we started looking into a PaaS where we could deploy our services. I wanted to stay close to Docker, but maintain a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the shiny new Amazon EC2 Container Service (ECS).
Amazon ECS fit the bill because of several promises:
- With ECS, you simply launch Docker containers.
- ECS is aware of multiple availability zones (AZs). As long as EC2 instances are set up in multiple AZs, ECS will try to distribute containers to maintain high availability.
- You can do rolling deploys. Neato, deployments with zero downtime!
- API clients. All AWS services have (sadly auto-generated) API clients for all languages we use.
- ECS works with vanilla EC2 instances. This is a nice plus, as we don’t have to learn a new PaaS – just install the ECS agent on any plain old EC2 instance running Amazon Linux and have it join an ECS cluster.
First impression: wow, it’s missing a LOT of stuff.
My first impression on seeing an ECS demo was how much it was missing. We use a lot of AWS services and are well-aware of how Amazon releases incremental updates. That’s all good, we do that, too. However, it was sad to see that these key features were missing:
- No service discovery. In ECS, the recommended way to do service discovery is to use internal load balancers. This is a bigger deal than it sounds, because an internal ELB is the only way to run a network-accessible service in ECS: even with a single instance, you HAVE to run an ELB for the service to be discoverable. For a microservice architecture, this adds cost with every service you deploy, even though you add no hardware.
- No central config. ECS doesn’t have a way to pass configuration to services (i.e. Docker containers) other than with environment variables. Great, how do I pass the same environment variable to every service? Copy and paste it. We considered setting up Consul, but instead decided to stick with native ECS environment variables to start using the service.
- Mediocre CLI. Compared to competitors like Kubernetes, ECS has a mediocre CLI at best. You can scale from the command line (aws ecs update-service --desired-count N) but the ECS CLI is just not very powerful.
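One way to avoid the copy-and-paste described under "No central config" is to keep the shared settings in one place and merge them into each container definition before registering it. The sketch below is a minimal illustration in Python, not something ECS provides: with_shared_env, SHARED_ENV and the two example services are hypothetical names, and the only ECS detail it relies on is that a container definition carries its variables as an "environment" list of name/value pairs.

```python
from copy import deepcopy


def with_shared_env(container_def, shared_env):
    """Return a copy of a container definition with shared variables added.

    Variables already set on the container take precedence over the shared
    defaults, so a single service can still override a common value.
    """
    merged = dict(shared_env)
    for entry in container_def.get("environment", []):
        merged[entry["name"]] = entry["value"]

    result = deepcopy(container_def)  # leave the caller's definition untouched
    result["environment"] = [
        {"name": name, "value": value} for name, value in sorted(merged.items())
    ]
    return result


# Hypothetical shared settings and services, for illustration only.
SHARED_ENV = {"LOG_LEVEL": "info", "STATSD_HOST": "statsd.internal"}

api = {
    "name": "api",
    "image": "example/api:1",
    "environment": [{"name": "LOG_LEVEL", "value": "debug"}],
}
etl = {"name": "etl", "image": "example/etl:1"}

api_def = with_shared_env(api, SHARED_ENV)
etl_def = with_shared_env(etl, SHARED_ENV)
```

The merged definitions could then be passed as the container definitions when registering a task definition. This only removes the duplication; the values are still plain environment variables.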
Despite these missing features, we decided to move ahead.
I have made a huge mistake
Our first “oh crap” moment with ECS in production was when we noticed that it was leaking environment variables to CloudTrail, and on to DataDog and other third party services that consume CloudTrail events and logs. ECS, like a good AWS citizen, logs events to CloudTrail. When you start a new service, it logs the service definition including environment variables to CloudTrail!
We opened a forum post, and the response from the team wasn’t on target. Apparently they don’t believe in treating environment variables as sensitive quantities.
Now, we could have built yet more infrastructure to encrypt secrets using Amazon Key Management Service (KMS) and decrypt them at service start – in fact, this is exactly what Convox does. But why would we build this infrastructure when there was so much more interesting work in our domain to do?
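For readers wondering what "decrypt them at service start" might look like, here is a rough sketch in Python. It is not Convox's implementation and not something we ran: the "kms:" prefix convention, the function names and the DB_PASSWORD variable are all assumptions made for illustration. The decrypt step is passed in as a function so that the logic can be exercised without AWS; kms_decrypt shows how it could be backed by the boto3 KMS client's decrypt call, which needs boto3 and credentials when it is actually invoked.

```python
import base64

# Hypothetical convention: encrypted values are stored as "kms:<base64 ciphertext>".
ENCRYPTED_PREFIX = "kms:"


def decrypt_environment(environ, decrypt):
    """Return a copy of environ with every "kms:"-prefixed value decrypted.

    `decrypt` is any callable that takes ciphertext bytes and returns
    plaintext bytes. Values without the prefix pass through unchanged.
    """
    resolved = {}
    for name, value in environ.items():
        if value.startswith(ENCRYPTED_PREFIX):
            ciphertext = base64.b64decode(value[len(ENCRYPTED_PREFIX):])
            resolved[name] = decrypt(ciphertext).decode("utf-8")
        else:
            resolved[name] = value
    return resolved


def kms_decrypt(ciphertext):
    """Decrypt with AWS KMS (requires boto3 and AWS credentials when called)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    return boto3.client("kms").decrypt(CiphertextBlob=ciphertext)["Plaintext"]


def fake_decrypt(ciphertext):
    """Stand-in for KMS used in the demonstration: reverses the bytes."""
    return ciphertext[::-1]


# Demonstration with the stand-in; a real service would pass kms_decrypt
# and os.environ instead.
example_env = {
    "DB_PASSWORD": ENCRYPTED_PREFIX + base64.b64encode(b"terces").decode("ascii"),
    "PORT": "8080",
}
resolved_env = decrypt_environment(example_env, fake_decrypt)
```

With this approach only ciphertext appears in the service definition, so only ciphertext would reach CloudTrail. The cost is exactly the extra infrastructure we did not want to build.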
What killed ECS for us
We ran ECS in production for nearly a year. In that time, we watched every single feature announcement, participated in opening GitHub issues and so on. Finally, we gave up on ECS when two issues remained unaddressed:
- ECS agent disconnects periodically, making it impossible to launch new containers. Recall that ECS works by installing an agent on every EC2 instance that’s part of an ECS cluster. This agent interacts with the Amazon API as well as Docker. This agent has a horrible tendency to disconnect, and when this happens your deployments will fail – this kills your services. This problem is tracked in this GitHub issue and despite it being a closed issue, we have seen it happen repeatedly. It happens at least twice a day on our clusters and despite our best efforts, we haven’t been able to nail the root cause. To my knowledge, it remains unaddressed by the ECS team.
A search of our Slack history turns up just some of the times we’ve seen this problem happen. It became so pervasive that we started restarting agents periodically to get around the failure.
You know you’re going crazy when you restart a service every hour to fix its bugs.
- Lack of traction on GitHub issues. This issue is an example of how many features and customer requests remain unaddressed. It has been the most-commented feature request for a year and remains unaddressed. Incidentally, we hit this issue as well.
- Bad architecture. I expect modern deployment and operations infrastructure to support 12 factor apps in a meaningful, robust way. ECS simply lacks the fundamentals.
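Going back to the agent disconnects in the first item: rather than restarting every agent on a timer, one could at least detect which instances are affected. The sketch below, in Python, is an assumption about how that might be done and not what we ran. It relies on the agentConnected flag that the ECS DescribeContainerInstances API reports for each container instance; the function names and the sample data are hypothetical.

```python
def disconnected_agents(response):
    """Return the EC2 instance IDs whose ECS agent is reported as disconnected.

    `response` has the shape of a DescribeContainerInstances result: a dict
    with a "containerInstances" list whose items carry "ec2InstanceId" and
    "agentConnected".
    """
    return [
        instance["ec2InstanceId"]
        for instance in response.get("containerInstances", [])
        if not instance.get("agentConnected", False)
    ]


def check_cluster(cluster):
    """Query ECS for one cluster (requires boto3 and AWS credentials).

    Simplification: only the first page of container instances is checked.
    """
    import boto3  # imported lazily so the pure function above runs without it

    ecs = boto3.client("ecs")
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    return disconnected_agents(described)


# Hypothetical sample data in the shape of the API response.
sample_response = {
    "containerInstances": [
        {"ec2InstanceId": "i-aaa", "agentConnected": True},
        {"ec2InstanceId": "i-bbb", "agentConnected": False},
    ]
}
stale = disconnected_agents(sample_response)
```

A scheduled job could call check_cluster and restart the agent only on the instances it returns. That narrows the workaround but does nothing about the root cause.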
Adios ECS, hello Kubernetes
After much grumbling at ECS, we decided to try out Kubernetes (k8s). Having flipped the switch in production two weeks ago, we are delighted. It seems that the contributors to this open source project really thought through deployments and operations at scale. From the CLI to service discovery and configuration management, it has been a pleasure to use. We ran into an odd issue with kube-proxy not routing traffic correctly, but a restart fixed the issue and it hasn’t cropped up since. We haven’t looked back!