
A-Talk | Sky Computing: A New Exploration of Computing as a Utility

蚂蚁技术AntTech
2024-08-23

A-TALK

Seizing the historic opportunity of digital transformation and the trend of technology-led innovation, Ant Group formally established the Ant Group Research Institute in mid-2021 to pursue exploratory research on frontier science and technology. The institute comprises six laboratories, among them the Database Lab, Graph Computing Lab, Privacy-Preserving Computing Lab, Compiler Lab, and Visual Intelligence Lab.


The A-Talk column comes from the Ant Group Research Institute's forum for conversations with leading experts on frontier technology — a place to explore the frontier, where learning knows no bounds. In this article, Professor Ion Stoica shares his latest research on Sky Computing.


“Spark and Ray are two of the most closely watched open-source projects: the former, from AMPLab, is a fast, general-purpose engine designed for large-scale data processing; the latter, from RISELab, is a new-generation high-performance distributed computing framework. Behind both stands one key figure: Ion Stoica. Ion Stoica is a co-creator of Spark and Ray, co-founder and executive chairman of Databricks and Anyscale, a professor of computer science at UC Berkeley, and principal investigator of RISELab.
Ant Group has a long-standing, deep partnership with RISELab: it has contributed actively to Ray's development and deployed Ray at scale inside Ant. Sky Computing is Professor Ion Stoica's newest technical vision, a fresh exploration of the idea of computing as a public utility: it aims to turn today's cloud ecosystem into a shared compute platform, weaving individual clouds into a unified network. We look forward to seeing this vision realized in the near future.”
— Chen Wenguang (陈文光), Director of the Ant Group Research Institute




SKY LAB  



Today I am going to talk about a new lab we are starting, the Sky Computing Lab, and give some context. It is the latest in a long line of Berkeley systems labs, a lineage David Patterson began more than four decades ago. Over the years these labs produced many influential technologies, such as RISC processors, RAID (redundant arrays of independent disks), and NOW (networks of workstations) — which together form the foundation of almost every data center today.


The two most recent labs are AMPLab and RISELab, both of which I have been involved with. They are known for producing a small number of widely used software artifacts: Mesos, Apache Spark, and Alluxio from AMPLab, and Ray, Modin, the MC² secure-computation platform, and the Clipper serving platform from RISELab. Some of you use these projects and have contributed to them — in particular to building out Ray — and thank you for that.


Sky Computing's basic mission is to let applications migrate services and resources transparently across clouds and on-premises clusters.

The rest of the talk has three or four parts: What is Sky? Why do we believe Sky can happen, and how? What would its impact be — if Sky happens, then what? I will also talk about some very early experience with Sky.

One way to think about it: Sky is like the Internet for clouds. In the first Internet demo, in November 1977, researchers sent packets from Menlo Park to the Information Sciences Institute at the University of Southern California, south of LA — that is, from the Bay Area to Los Angeles.


Those packets traveled across two continents — North America and Europe — over three different networks: first a packet radio network, then the ARPANET, and finally a satellite network. The beautiful thing about the Internet was that it abstracted these different networks away. Even though they used different technologies and different protocols to carry packets between the nodes within each network, all of that was hidden from the endpoints — from the end applications and the end users.


That, at a minimum, is what we hope Sky can do. Let me give another, more concrete example.

Suppose you have a machine learning pipeline — a simple one with a data processing stage, a training stage, and a serving stage — and two requirements. First, the data you want to train on contains confidential information, such as personally identifiable information (PII), which must be processed securely. Second, you want to minimize cost — why not? Assume that to process the confidential data and strip out the PII you use Opaque, another system we developed, which requires Intel SGX.


Sky will try to abstract this choice of cloud away. Here is what you can do today if you look for a possible solution among the public clouds: only Azure, through its Azure Confidential Computing service, offers SGX processors. So for data processing — securely removing the sensitive information — you would use Azure. For the next stage you might want Google's TPUs, so you would use Google Cloud (GCP). And for serving you might want Inferentia, Amazon's new inference chip, so you would use AWS. In other words, each stage might best run on a different cloud.


Let's make this concrete. Suppose we use a dataset of user reviews from Amazon and fine-tune BERT on it for a number of epochs. The reviews contain user identifiers — the information we want to strip out — and once the model is trained we perform 10,000 queries. Again, we want the data to stay confidential throughout the job. Given that, if we are forced to use a single cloud it has to be Azure, because the constraint means the first stage must run there; that gives us a total cost and running time for doing everything on Azure.


Now, if you use Sky and have the flexibility to run different stages on different clouds, what do you get? In this particular case, the cost drops by about 60%, and the job also runs about 47% faster. With a consumption-based cost model this is natural: if something runs for a shorter time on the same resources, it also costs less, so cost and time are nicely correlated. This is just to give you a sense of the potential.
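To make the idea behind these numbers tangible, here is a minimal sketch of per-stage placement. All prices and offerings below are entirely hypothetical; the point is only that assigning each stage to the cheapest cloud satisfying its hardware constraint is exactly how a multi-cloud plan can beat any single cloud.

```python
# Each pipeline stage is matched to the cheapest cloud that satisfies its
# hardware constraint. All prices below are made up for illustration.

STAGE_REQS = {
    "process": "sgx",          # strip PII inside SGX enclaves
    "train": "tpu",            # BERT fine-tuning on TPUs
    "serve": "inferentia",     # low-cost inference
}

# Hypothetical $/hour prices for the hardware each cloud offers.
OFFERINGS = {
    "azure": {"sgx": 1.20},
    "gcp": {"tpu": 4.50},
    "aws": {"inferentia": 0.90, "tpu": 6.00},
}

def place(stage):
    """Return the cheapest cloud offering the hardware this stage needs."""
    need = STAGE_REQS[stage]
    candidates = [(prices[need], cloud)
                  for cloud, prices in OFFERINGS.items()
                  if need in prices]
    price, cloud = min(candidates)
    return cloud

plan = {stage: place(stage) for stage in STAGE_REQS}
print(plan)  # {'process': 'azure', 'train': 'gcp', 'serve': 'aws'}
```

Note that "train" lands on GCP even though AWS also offers TPU-class pricing here, simply because GCP is cheaper for that stage — the placement falls out of the constraint-plus-cost comparison.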


Now let me pop up a level. Broadly, we want to move from the current state on the left — clouds that are silos, each with its many proprietary services — to the world on the right, Sky Computing, which has roughly three main components.

First, the compatibility set: a set of common services that may be implemented by multiple clouds. Second, the intercloud broker, the key component of Sky Computing, which is responsible for abstracting the clouds away: an application runs across multiple clouds through the broker, and may be oblivious to which clouds are used to run its different components. The third component is peering between clouds. This one is not strictly necessary today, but we believe we will evolve toward a world where clouds have arrangements to exchange data freely. Today, getting data out of a cloud is expensive because egress fees are high, so free peering is desirable — but, again, not necessary.


Going back for a moment: note that in the application above we did account for the egress cost of moving data between clouds, but in that case the compute cost dwarfed it.
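A back-of-envelope sketch of why this can hold — all numbers below are invented for illustration, not taken from the experiment:

```python
# Typical egress pricing is on the order of $0.09/GB, so moving modest
# intermediate data between clouds is cheap next to a large compute saving.

egress_cents_per_gb = 9        # ~$0.09/GB, illustrative
data_gb = 50                   # intermediate data moved between stages
compute_saving = 40.0          # $ saved by training on a cheaper cloud

egress_cost = egress_cents_per_gb * data_gb / 100   # dollars
print(egress_cost)  # 4.5 -- far less than the $40 saved on compute
```

When the data volume grows or the compute saving shrinks, the balance flips, which is exactly the data-movement concern Skylark (discussed later in the talk) targets.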

At this point you may ask: how is Sky different from other forms of multi-cloud?

After all, you hear more and more companies say they are multi-cloud. What does that usually mean? It means different workloads from different teams use different clouds — what we call partitioned multi-cloud. For example, one team uses Synapse on Azure for data processing, another team perhaps uses Vertex AI on Google GCP for machine learning, and yet another uses Redshift on AWS. That is one multi-cloud model.

Another, increasingly common model is what you might call portable multi-cloud: the same application runs on different clouds — Snowflake, for databases, is one example. Note, though, that even if the application runs on multiple clouds, it is not cloud-transparent. Consider how you sign up for a Snowflake account: before creating the account you must pick the cloud, and even the region, where it will live.


Over the years there have also been efforts to provide a uniform layer across clouds — more of an infrastructure-as-a-service approach, exposing a fairly low-level API on all clouds. Google's Anthos is one example, and Microsoft Azure has some similar efforts. Typically this sits at a low level, such as container or VM orchestration.


Sky differs in several ways. First, it aims for transparency with respect to the clouds — it abstracts the clouds away.


Second is the compatibility set I mentioned earlier. It contains services at different layers of the software stack, and these services do not need to run on all clouds: they can run on two clouds, or even on a single cloud.

Let me give some examples of what is in this compatibility set. One category is hosted open-source services — think Kubernetes. Today every major public cloud offers a hosted version of Kubernetes: AKS, GKE, and EKS.

Another example is Apache Spark. Every cloud offers a hosted version — in fact some offer more than one: Azure has HDInsight and Synapse, GCP has Dataproc, AWS has EMR. Those are provided by the clouds themselves, but third-party companies also offer hosted open-source projects — Databricks for Apache Spark, Confluent for Apache Kafka. These are all examples of the compatibility set.

So far all my examples have been open source, but the compatibility set is not limited to open source. Third parties can offer proprietary multi-cloud services — Snowflake, which I mentioned earlier, is one.

Or you can have BigQuery or SageMaker, services offered by a single cloud. Again, you have a whole sea of services, some running on one cloud and some on several; and of course, when a service runs on multiple clouds, you get a choice about where to run your application or its components.


The intercloud broker is the core part of Sky: it creates a two-sided market between all these available cloud services and users' applications. And by the way, there can be more than one intercloud broker; brokers might specialize in different kinds of applications.

What is inside an intercloud broker? A broker is quite complex; I will show only a few of the more important components. First, there is a service catalog, which lists all the services available across clouds, perhaps with their costs and with instructions for how to start, manage, and tear down each service.

Second, there is the user, who submits a job along with a description of it and her preferences — say, minimize cost or latency. The job specification itself can take the form of a DAG, a directed acyclic graph; think of something like the Airflow workflow manager.
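As a sketch of what such a DAG-style job specification might look like — the field names and schema here are illustrative, not the broker's actual interface:

```python
# A toy job spec: user preferences plus a DAG of stages, mirroring the
# three-stage ML pipeline from earlier in the talk.

job = {
    "preferences": {"objective": "minimize_cost", "deadline_hours": 12},
    "stages": {
        "process": {"needs": "sgx",        "after": []},
        "train":   {"needs": "tpu",        "after": ["process"]},
        "serve":   {"needs": "inferentia", "after": ["train"]},
    },
}

def topo_order(stages):
    """Return stage names in dependency order (the graph must be acyclic)."""
    order, done = [], set()
    def visit(name):
        if name in done:
            return
        for dep in stages[name]["after"]:
            visit(dep)
        done.add(name)
        order.append(name)
    for name in stages:
        visit(name)
    return order

print(topo_order(job["stages"]))  # ['process', 'train', 'serve']
```

An optimizer would walk this order, matching each stage's `needs` against the service catalog before deciding where to place it.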


The intercloud broker also has an optimizer — a control-plane component. The optimizer takes the job description and the user's preferences, looks up the services in the catalog that can run the job's different parts, and then partitions the job, possibly running different parts on different clouds. And then, of course, there is billing, which can work in two ways. One is direct to the user: the user already has accounts with the different clouds and simply gives the broker credentials and access rights to those accounts, so the broker can run the user's workloads in them on her behalf.

Alternatively, the broker can have its own billing component: the user holds an account with the broker, and the broker holds individual accounts with the clouds. The clouds then charge the broker, and the broker charges the user. In both cases the broker can take a fee for its service.

That is a high-level sketch of the Sky Computing proposal. I'm sure you have many questions by now; let me try to answer a few. The first is usually: why should this happen at all? The clouds will not push for Sky — if anything they will oppose it, because it may commoditize them. And there have been several past efforts to abstract the clouds away; some were even called Sky, a decade or so ago. So why believe we will succeed, or at least have a chance to?


Here are my three conjectures. First, the compatibility set is growing quickly. Second, Sky can start with no help from the existing clouds — we need nothing from them. And finally, once we start, we believe market forces will do the rest.


First, the compatibility set is growing quickly, and, as I mentioned, open-source software is largely driving that growth. Moreover, all the "actors" want it, at least to some degree: customers want it, third parties want it, and even the clouds themselves want it.


As I mentioned, open-source software is driving it. Open source now dominates many layers of the software stack: either you get out-of-the-box services based on open source that you can use on different clouds, or you run the open-source software in the cloud yourself — and every one of these open-source projects offers a reasonably easy way to run it in the cloud.

And as I said, all the actors want it — customers, third parties, and clouds. For customers it is easy to see why. First, you hear more and more about data and operational sovereignty: not only do some countries require that data be processed within their own borders, on that country's territory, but there are also more and more regulations requiring that data be processed in data centers operated by nationals of that country.

Customers also want to leverage best-of-breed services and hardware. More and more clouds come with their own hardware and their own services, and customers want access to all of it.

Customers also want to aggregate resources across clouds. It turns out that in many cases companies cannot obtain enough resources in a single cloud — GPUs, for example. So today some of them build ad hoc solutions: they sign contracts with multiple clouds and run their workloads on GPUs spread across them.


Obviously, customers can also reduce cost and latency, and one of the most important benefits is avoiding lock-in. A company spending hundreds of millions of dollars on the cloud — and there are quite a few of them — does not want to be locked into a single vendor.


Second, third parties want the compatibility set. If you are a third party providing a cloud service, being able to offer it on multiple clouds gives you two immediate advantages. First, you reach more customers — the customers of every cloud. Second, you can compete better with the public clouds, because part of your value is precisely that your service runs on multiple clouds and spares your customers cloud lock-in. Such services are naturally part of the compatibility set.


Finally — and this is perhaps less obvious — the clouds themselves drive the compatibility set. One reason, as I mentioned, is hosted versions of open-source projects: if an open-source project is popular, almost every cloud will sooner or later offer a service based on it.

But there are subtler reasons too. First, clouds offer their own stacks on other clouds, because they want to incentivize developers to build on their software. Google, for instance, effectively says: it's fine if you run on AWS, as long as you build your application on top of Google Anthos — which you can think of as an extended version of Google's own stack.

The reasoning: as Google, I believe I can offer the best instantiation of Anthos, so sooner or later you will come to Google — and it will be easy, because your application already runs on Google's stack. There is an even subtler reason: over time, some of the clouds' APIs converge and become comparable. Why is that?

Because, say, I am Microsoft and I want to win a customer away from AWS. I can of course offer better pricing and other perks, but one barrier is how hard it is for that customer to port their applications to my services. So it is in my interest to be able to tell the customer: if you are using that service on AWS, I have something quite similar — you only need a few small changes. That is how the APIs end up looking a bit alike.

Also, Sky needs no help from the clouds: it can start today with existing services; we need nothing from them. One more important point — a question we are asked very often: fine, maybe you can support batch jobs, but what about other workloads, like microservices? Our answer is that we do not try to support everything from day one. We focus on a few use cases that are important and perhaps easy for us, and take it from there. If we succeed, we may expand; if we don't, at least we fail fast.

And by the way, we are already building. Specifically, we are prototyping an intercloud broker we call SkyML. It targets training and hyperparameter tuning, and its users are our own AI students, who have credits on multiple clouds.


Finally, once started, we think a virtuous cycle — a flywheel — will develop. You start with the initial compatibility set and build your early services on it. By definition, many of these services and applications become part of the compatibility set, because they run on Sky and can therefore run on multiple clouds. As they join, the compatibility set keeps growing. Moreover, if these are important workloads, more clouds will offer the interfaces needed to compete for them — making the compatibility set bigger still, which makes it easier to write more sophisticated services, and so on.

Now let me clear up some common misconceptions about what Sky is not.


First, Sky does not try to define a uniform standard API for all clouds. That is simply too hard — it is unclear whether it is even feasible. A cloud's API is probably at least ten times larger than an operating system's. For operating systems, people tried to provide a standard layer, as in the UNIX world, and largely failed; there were partial successes, such as subsets of the UNIX API, but no truly big one. Likewise, at least some clouds — the leading ones — have no incentive to support a uniform standard, for fear of commoditization. It would also require a huge, lengthy standardization effort that, frankly, academic institutions probably lack the resources and time to carry out.


Nor does Sky try to impose standards, even for individual services. The way to think about a service in most cases — there are always exceptions — is like code: you don't say "I want to use the Spark API"; you say "I want Spark 3.2, Kubernetes 1.8". In other words, these services are like libraries in today's programming languages.


Writing a Sky application should feel like writing a program, with "library" replaced by "service". Today, developers carry this burden: when you write a program, you must pick the appropriate libraries at the appropriate versions, resolve the dependencies, and manage the conflicts. It is not pleasant — many people call it "dependency hell".

But it works; it is entirely feasible. The same holds here: at least at the beginning, we expect developers to specify services explicitly — the service version, perhaps some configuration parameters, whether to enable an optimizer, and so on — much as they would in a configuration file. The intercloud broker then takes over: it is responsible for instantiating the service instances and managing their lifecycle.
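Continuing the library analogy, here is a hedged sketch of "specifying services like pinned dependencies". The catalog contents are invented for illustration; the point is that once versions are pinned, finding a cloud that satisfies them is an ordinary dependency-resolution problem.

```python
# Pinned service requirements, resolved against a (made-up) per-cloud catalog
# of which service versions each cloud can host.

required = {"spark": "3.2", "kubernetes": "1.8"}

catalog = {
    "aws": {"spark": {"3.1", "3.2"}, "kubernetes": {"1.8"}},
    "gcp": {"spark": {"3.2"},        "kubernetes": {"1.7"}},
}

def satisfying_clouds(required, catalog):
    """Clouds offering every required service at the required version."""
    return sorted(
        cloud for cloud, services in catalog.items()
        if all(ver in services.get(svc, set()) for svc, ver in required.items())
    )

print(satisfying_clouds(required, catalog))  # ['aws']
```

Here GCP is ruled out only by its Kubernetes version — exactly the kind of conflict the broker, rather than the application writer, would be expected to detect.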


Sky does not try to support every application from day one. In that sense it resembles serverless computing, which started by supporting a few key applications well and is now expanding to more. Likewise, we start with a very few easy but useful applications and hope to expand from there.

What if Sky happens?


I think it will have a lot of positive impact. First, we believe it will lead to specialized clouds and accelerate innovation. Today, if you enter the market wanting to offer a cloud service, you face a problem: most customers keep most of their applications on a single cloud, so to get a seat at the table you must offer a long list of table-stakes services just to be in the conversation. For the same reason, Sky would make it easier to integrate on-premises and edge clouds.


With Sky, you can build a specialized cloud that is the best at one thing, and Sky will send you the application components that need your service — if you really are the best. So you can have compute-optimized clouds, like NVIDIA's or Cirrascale's; both companies operate their own clouds (here is NVIDIA's announcement, and here is Cirrascale's). Incidentally, they also offer specialized chips — Cirrascale offers Graphcore hardware, for instance — chips that are not available on the public clouds.

Then you could have another cloud optimized for storage — best in class for enterprise storage — and between the two, as I said, a free peering arrangement.

New chip vendors could also gain a foothold; today their position is very difficult unless they run a cloud business themselves. Imagine a company like Cerebras, which builds its own chips: they could install their chips in servers and place those servers in Equinix — essentially a colocation company that owns data centers, two or three hundred of them around the world. You put your servers there, advertise the service you can provide — say, training — and Sky Computing sends that work your way. You earn some revenue, you grow, and so on.


And of course the public clouds are part of this too, as are edge and private clouds.


We also think Sky will accelerate cloud adoption, because it addresses customer concerns such as data and operational sovereignty and lock-in. We believe the effect could be much like how the Internet accelerated the networking industry.


This is not a zero-sum game — the market will grow bigger, faster. Sky will also accelerate software platforms. Take Microsoft, a major software producer: with Sky, it could offer its software on other clouds, if it chose to.


Finally, here is why we study Sky: we are researchers, and we believe it will shape many research topics, much as the Internet shaped networking research.


Here are some of those research topics; I won't go through them, but they really touch everything. More and more applications are moving to the cloud, and if Sky succeeds, many of them will become part of Sky. Sky also brings new challenges: it presents multiple trust domains and multiple failure domains, and systems will look quite different depending on whether an application's components run in the same data center, in the same cloud, or across clouds.


So far I have talked about multi-cloud; now let me argue that Sky matters even for a single cloud. This is a slide I showed earlier on why customers want Sky, but now with the focus on one cloud. You can still satisfy data and operational sovereignty, as long as the cloud has data centers in the relevant country. You can still exploit the best services and hardware within that cloud, because even within a single cloud accelerators are scarce: they are not available in every region. Google does not have TPUs in every region, right? To use one, you must run your software in a region that has it. You can also aggregate resources across regions rather than across clouds — for example, picking up preemptible (spot) instances in many regions — and reduce cost and latency, say by finding a region with enough spot capacity. The only benefit you don't get, of course, is avoiding lock-in.
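A toy sketch of that single-cloud reasoning — picking a region by accelerator availability and spot price. The regions, availability flags, and prices below are all invented for illustration:

```python
# Within one cloud, choose the cheapest spot-instance region that also has
# the accelerator the job needs. All availability and prices are made up.

REGIONS = {
    "us-central1":  {"tpu": True,  "spot_price": 1.10},
    "us-east1":     {"tpu": False, "spot_price": 0.80},
    "europe-west4": {"tpu": True,  "spot_price": 0.95},
}

def pick_region(need_tpu):
    """Cheapest region by spot price, restricted to TPU regions if needed."""
    eligible = {name: info for name, info in REGIONS.items()
                if info["tpu"] or not need_tpu}
    return min(eligible, key=lambda name: eligible[name]["spot_price"])

print(pick_region(need_tpu=True))   # europe-west4
print(pick_region(need_tpu=False))  # us-east1
```

The TPU constraint rules out the globally cheapest region, so the answer changes with the job's requirements — the same constraint-driven placement as in the multi-cloud case, just one level down.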



First, SkyML. For now SkyML basically serves our own students — we know them well. You have credits from multiple clouds, you want to train or tune a model, and you submit a job — a training job. Perhaps you have a private cluster; you try there first, because it costs you nothing extra. Otherwise, the system tries in the background to run deeper in the cloud — say on AWS, finding a region with enough preemptible (spot) instances, because spot instances are cheaper — or, failing that, on Azure.

Or perhaps you want TPUs, in which case Google is the more economical choice. With SkyML, all of this is arranged for you in the background. As a user you submit a job with a simple expectation: just have it done by nine in the morning, when I wake up and start work. Incidentally — and perhaps unsurprisingly — all of this infrastructure is built on Ray.


That, roughly, is the SkyML intercloud broker. There is one more box here, for provisioning: you provision resources in a cloud — say a cluster — and run on top of it. We currently use Ray for this, though obviously other tools could be used.

That is essentially it: again, it builds on Ray's flexible scheduling and so forth. We have built a prototype, we have our first users, and we plan to release it in the coming weeks — as open source, of course. (The results I mentioned earlier were obtained with SkyML.) There is one aspect I have not yet addressed: when people hear "multi-cloud", they usually worry about data gravity. Moving large amounts of data across clouds is expensive and slow, and that is a legitimate concern.


To address it, we developed another system, called Skylark. With Skylark, we want to move data between clouds efficiently and economically.

Sometimes, when you move data across clouds — from one cloud in one region to another cloud in another region, halfway around the globe — it is both very expensive and very slow.


To mitigate the latency and cost, we use several techniques. One is overlay routing: broadly, you find an intermediate point such that going through it actually yields higher throughput than the direct path. The first example may not be obvious — going from the UAE via Mumbai, India to Northern Virginia in the US may not beat going from the UAE to Northern Virginia directly — so consider another. Suppose I want to move data from AWS US-West to Azure US-East. It turns out that going AWS US-West → AWS US-East → Azure US-East is better than moving the data directly. Why? Because in this case the AWS and Azure data centers in the East are essentially side by side, connected to the same networks, so that last hop is very fast.
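The AWS-to-Azure example can be sketched as a bottleneck-bandwidth comparison. All bandwidth figures below are invented; the point is only that a relayed path wins whenever its slowest link still beats the direct link:

```python
# Overlay routing: a two-hop path's throughput is limited by its slowest
# link, yet it can still exceed the direct path. Bandwidths (Gbps) invented.

DIRECT_GBPS = {("aws-west", "azure-east"): 1.0}
HOP_GBPS = {
    ("aws-west", "aws-east"): 5.0,    # fast intra-AWS backbone
    ("aws-east", "azure-east"): 8.0,  # adjacent data centers, same networks
}

def path_bw(*links):
    """Throughput of a multi-hop path = its bottleneck link."""
    return min(links)

direct = DIRECT_GBPS[("aws-west", "azure-east")]
relayed = path_bw(HOP_GBPS[("aws-west", "aws-east")],
                  HOP_GBPS[("aws-east", "azure-east")])

best = "relay via aws-east" if relayed > direct else "direct"
print(best, relayed)  # relay via aws-east 5.0
```

Skylark generalizes this idea by searching over candidate relay regions rather than checking a single hand-picked one.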


We also allocate multiple VMs in each region, giving us multiple parallel connections and a way around per-connection bandwidth throttling. And there is tier selection: for example, AWS's hot-potato routing tier is roughly 40% cheaper than its cold-potato tier. Cold-potato routing means the cloud keeps your data on its own network for as long as possible before handing it off; hot-potato routing hands the packets off as soon as it can.


The results look quite good — these are all preliminary. Compared with AWS's own DataSync, even for transfers between regions within AWS, Skylark delivers much better performance — up to a 4.6x improvement.



In summary, we believe Sky is the logical next step in the evolution of the cloud. Many questions about how to get there remain open, of course, but through this work we hope to bring that day closer.


SKY LAB

So today I am going to talk about kind of a new lab we are starting, and then to provide some context, here it is called Sky Computing Lab. And this is the latest, you know, in a long line of work Berkeley’s series labs. This lab should be based on the system side, was started by the David Patterson, and as you know, during the work doing that four to five years ago. And this lab gave a lot of influential technologies along the years. like RISC processor, Ray resource, redundant array of independent disks. Now they've got four stations, which is basically their foundation of, you know, almost every data center today. 


And most recently, now it's AMPLab and RISE Lab. You see the two most recent labs I have been involved with, and these labs are characterized by producing some, you know, less and most of all popular, software artifacts, including MESOS, APACHE Spark, ALLUXIO in the AMPLab, and in RISE Lab,Ray, MODIN, mc2 and this platform for security, and Quipper, this was a serving platform. So again, you know, some of these projects you use, and thanks for that, thank you for that, and also you contributed to them, making the platform to Ray.


So Sky Computing mission is basically to make it transparent for application to your services and resources across clouds and on-prem clusters.


The rest of the talk, I'm going to have has three, four parts. What is Sky? Why and how we believe Sky will happen? What is the impact? If Sky happens, then what? I'll also talk of a bit about very early experience with Sky.


So one way to think about, Sky is like Internet for the clouds.


And this is a first Internet demo in November, 1977. And in this demo, people are, you know, the researchers are sending packets data from Menlo Park to Information Science Institute here at the University of South California, south of LA .So from Beria to LA, and these packets were travelling over two continents, north America and Europe, and over three different networks. Packet Radio Network, ARPANET, which was a network and then the Satellite Network, And the beautiful thing about this the Internet was it's abstracting away these different networks, despite the fact that these networks have different technologies and they are using different maybe protocols to carry the packets between the nodes with the niche network, they were abstracted away to the entry and to the end appliction, to the end users.


So the thing which, you know we hope at least to do with Sky And here, just to provide  some more example, and to be more concrete. Assume that you have a machine learning pipeline and a simple machine learning pipeline, so you have a data processing stage, a training stage, a serving stage. And then assume that you have these two requirements. First, you need to process the data you want to train on, contains confidential information, like personal identifiable information. And also you want to minimize the cost. Though why not? So assume that in order to process the confidential data and remove this, this PII information, you are going to use Opatch. It's another system we develop which requires the use of SGX, Intel’s SGX.


And Sky again, will try to abstract away this cloud. So now, the way you can, you know you want to seek the possible, possible solution here, because among the public clouds, say you are for now use only the public clouds, Azure, with their service, Azure Confidential Computing, is the only one to provide SGX processors. then you'd use for data processing, to securely to remove the information, the sensitive information, is going to use Azure, right? And then for the next stage, you may use it because you may want to use Google's TPUs. So in this case, you are going to use Google GCP. And maybe for serving you may want to use Inferentia, the new service on Amazon. So you are going to use AWS, ok? So for this, in this stage, you can use a different cloud.


So let's make this more concrete. So here let's assume that we use a user review data set. This is from Amazon. This is what we are using. And we are using BERT. We are going to fine tune BERT on using content epochs, using this review dataset. We assume that this review dataset, the reviews contain the user identifier. So this is the information we want to strip away, then once we train our model, we are going to perform 10,000 queries, ok? So it's again. We want to protect the data confidentially, that's going to happen on our job. Now, so, because of that, if we are forced to use only a single cloud, that has to be Azure, because the first stage has to run in Azure because of the constraints. And this is kind of the cost, the total cost for Azure. So in this case, it's Azure, end of time, running time. Now, if you use Sky,


and you have the flexibility for different stages around different clouds, then what you get, in this particular case, you get the costs reduction by some 60%. and also running faster, it's 47% faster. Obviously, in this case, you know that you know the base goes to consumption base, this cost model. if you have something which is running for a shorter time, intervals using the same resources is going also to cost less. So you have very nice correlation between the cost and the time, ok? So that's basically just for to give you a sense.


Now let me pop up one level. So basically, what we want here is to go from this kind of left hand side, where you have today, clouds each other, it's a silo. each cloud comes with its own proprietary services and so forth, it has many proprietary services. we want to move to this world, in which of Sky Computing, in which basically, So why I need this compatibility set The compatibility set is a set of public services which are possible implemented by multiple clouds. And then the intercloud broker. This is a key component of Sky Computing, and this is the one which is in charge of abstracting away the clouds, an application runs on multiple, using the Internet cloud broker. It may be oblivious of about which clouds are going to be used to run different of its components. A third component of cloud is peering this is not necessary. So today,We believe that we are going to evolve toward a world, in which, basically, you are going to have arrangements between different clouds so they can exchange the data, you know, freely  As you know today, to get data out of a cloud is quite expensive because the egress costs  Now free peering is desired, but is not necessary.


I just, let me go back here. notice here that actually, in this application, we did take into account the egress costs of moving the data between the clouds, but in this case, the compute cost dwarves the egress cost of moving the data.


Now, at this point, you may ask yourself about, how is Sky different from other multi-cloud?


After all, you hear more and more today about more and more companies being multi-cloud now, in general, What this means when people say, or different companies say I'm a multi-cloud. What this means that there are different work loads from different teams, which uses different clouds So this is what we called partitioned multi-cloud. For instance, here there is a team using the Synapse on Azure for some data processing. Maybe there is another team using machine learning on using Vertex AI, on Google GCP, and you know, another team doing using Redshift  So that's one model.


Another model. It's what you call portable multi-cloud. And you see that also more and more. In this case, you have the same application running on different clouds. So one example is Snowflake for database. And the main point here, although you have the application running on multiple clouds, is not cloud transparent In particular, this is about how you sign, for instance, for a Snowflake account. Notice that before you create your account you need to pick the cloud, then even you need to pick the region where you are going to create your account.


There are also quite a few efforts have been over the years to provide a uniform layer across clouds. This is a little bit more infrastructure in the service. And this kind of is pretty low level API on all clouds. And like, for instance, there are efforts from Google, is Anthos, there are some efforts from Microsoft and Azure as you can see like that. Typically, this is a low level, like container orchestration kind of service, or VM orchestration.


So Sky is different in several reasons, in several ways. First, it drives, you know, it aims to be, you know it provides transparency. with respect to the clouds. so it abstracts away the cloud.


The second one I mentioned earlier to you about compatibility set. This compatibility set contains services and a different layers of the softer stack. And these services do not need to run on all clouds. They can be services that run on two clouds, even services than run on a single cloud.


So let me give you some examples what is in this compatibility set. One examples are this kind of more open-source, hosted services. Like, for instance, think about Kubernetes. Like today, every of the major public clouds, they provide a hosted version of Kubernetes, like in particular, AKS, GKS and EKS.


Another example is Apache Spark. Every cloud provides a hosted version. Actually, some of them provide more than a hosted version of Apache Spark. Like, you know, for instance, Azure provides HDInsight and Synapse and GCP, Data Proc, AWS, EMR. This is provided by the clouds themselves.


But there are also third-party companies which provides hosted open-source projects, like Databricks, Apache Spark and Confluent, Apache Kafka, This is example about ? compatibility set. 


So far, everything I mentioned is kind of open-source. But it's not only open-source. You can have third-party providing multi-cloud services, like I mentioned earlier, like Snowflake. 


These are proprietary services, or you can have BigQuery or SageMaker. So again, services which are provided by a single cloud So again, you have this kind of sea of services, and some of the services are going to run on one or multiple clouds. But of course if you have a service running on multiple clouds, then you have a choice about where to run your application or component of your application.


So the intercloud broker is this the core part of Sky, and it creates a two-sided market between all these services which are available, cloud services, and the user's applications, OK? And by the way, there can be more than one intercloud broker. Maybe you may have intercloud broker, which are specialized for different applications.


Now, what is within an intercloud broker? An intercloud broker is pretty complex. Here I'm not showing you all the components, a few components which maybe are more important. First, you have a service catalogue. In the service catalogue, you are going to have all the services which are available on multiple clouds, maybe with a cost associated to them, maybe with instructions about how to start the service, manage the service, how to tear down a service, and things like that.


Then you say you have a user which submits a job, and with that provides classification of a job as well as her preferences, like minimize a cost, latency and support. And one particular job specification you can seem to start with, it can be like a DAG, that is, Directed Acyclic Graph. you can think about this like, workflow manager, Airflow Something like that.


Now, the intercloud also has an optimizer, which is, again a control center component, this optimizer takes the description of the job from the user with the preferences, and then it looks for services which are available in the service, there are the service catalogue to run different components of this job, and then is going to partition the job and is going to run different components on, possibly on different clouds. And then, of course, you have billing And billing, you can happen in two ways. One, it's direct to the user. one way to think about it is like. The user has already accounts with different clouds, and to basically just in the intercloud broker provides credentials and access rise to its own accounts on different clouds. So the intercloud broker can run the user workload on the users' behalf in these accounts.


Or you can have a billing a components, in this case, the user has an account to the intercloud broker, and the intercloud broker has account, individual accounts with different clouds. So, therefore, the intercloud broker is going to be charged by the clouds, and then the broker charge the user In both cases the intercloud broker can get a fee for the service, ok?


So that's kind of and I'm sure there are a little bit of the high level what the Sky Computing proposal is about. I'm sure that you have a lot of questions by now, and next I'm trying to maybe answer a few of them, and then I'll be happy to take a lot of question at the end. I think we have 15 min for questions. So now one question is about, typically, you ask about, why should this happen? You know the clouds will not run this to happen. And they are going to be opposed, because it may commoditize them. So, and there are several efforts in the past which tries to, to provide this kind of to abstract away the cloud Actually. Some of them are even called Sky, like a decade ago or so. So why do you believe we are going to succeed, or at least have a chance to succeed?


Here are the reason. I have three conjectures. First, the compatibility set is growing quickly. The second is that Sky can start with no help with existing clouds, So it's no help we need. And once we start, the market forces, we believe, will do the rest.


So compatibility set is growing quickly, and in large part, like I mentioned earlier, open-source software is driving it. But furthermore, all the “actors” want it, at least at some degree. Customers want it, Third-parties want it. Even clouds themselves want it.


Like I mentioned, open-source software is driving it like open-source dominates now at many layers of the software stack, and either you have out of box services based on the open-sources you can just use in different clouds, like I mentioned earlier, or you can run yourself this, this open-source in the cloud, the service. Every of these open-source provides reasonably easy way to run it in the cloud.


Now, I said that also, all these actors want it, customers, third-parties and clouds. For customer, it's pretty easy, right? Why they want that? It's first of all. You see it now today, more and more you hear about a data and operation sovereignty, and in which basically right now, is not only, there are some constraints that some countries want to process the data within their own borders, right, on the territory of that country, but also, there are more and more now regulations which basically say that the data should be processed in a data center which is operated by the nationals of that country.


And they want this. They want to leverage the best-of-breed services and hardware. Now, more and more clouds, they come with their own hardware and their different services. So the customers want to offer to have access to all of these hardware and services.


But another thing is aggregating the resources across clouds. So it turns out that in many companies not many, but in many cases, the companies, for instance, are not able to give enough resources in a single cloud. For instance, GPUs, right? And therefore, actually, today, some of them is doing some ACO solutions to get to have contracts with multiple clouds, and they have GPUs in different clouds in order to run their workloads.


Obviously, you can reduce a cost and latency, and one of the most important aspects is to avoid lock-in. So if you are a company and you spend hundreds of millions of dollars on the cloud, there are quite a few of these companies, you do not want to be locked in a single cloud.


The other one, the third-party want it, because if you are a third-party and you are providing cloud services, if you can provide your service on multiple clouds, gives you two immediate advantages. One, you reach more customers, because you can reach customers of all, you know, of every cloud. And you also can better compete with public clouds, because now one of your value that you are providing the service on multiple clouds, so you avoid lock-in for your customers, with respect to the clouds, ok? And these services are naturally part of the compatibility set.


And finally, clouds themselves drive it. maybe this little bit not obvious, but we think about,, one of the things is that, like I mentioned, is providing hosted versions of the open source projects. If there are open-source projects, there are popular ones, almost every cloud, sooner or later, will provide a service by the phase of that open-source projects. But it's another more sub, um you know there are other subtle things. One, they also provide their own stack on other clouds. And the idea here is that they want to incentivize developers to build on their own software, even if they say, I'm a Google, it's ok if you run in AWS, as long as you ran on top of you build your application on top of Google Anthos, which is, you can think about Google Analytics plus plus. Because as a Google, I believe that I am going to provide the best instantiation of Google Anthos, so sooner or later, you're going to come to Google  And it will be much easier to come to Google, because your application already is going to come to work on Google, which brings us to the one, to the reason which is more subtle. It's in time. Some of the API is kind of comparative. And why is that? Because, for instance, Say I am Microsoft, And then I want to get a customer of AWS. In order to get the customer, of course, I can provide, I can say, you know good pricing and things like that, or whatever, But one of the barriers would be for that customer to realize these applications now to use as your services. So it's now it's an interest of how you're basically turning to the customer, Hey, if you are using this service on AWS, I have something quite similar. It doesn't require you to do this or that  Just a small few changes, So that's kind of how they become a little bit more alike.


Also, Sky needs no help from their clouds. It can now start today with existing services. We don't need anything from the cloud. And also, it's a very important aspect, because another question we are asked very often, Well, but you are, you know maybe you can support best jobs, but what about these other jobs, like you know Microservices or whatever? And our answer to that is basically saying, Look, we do not try to support everything from day one. We just focus on some important and may be easy use cases for us, And we take it from there. If you are successful there, maybe you are going to expand that. If we are not successful, at least, we fail fast.


And by the way, we are already building. I am going to say a little bit more about it and what we are building, We are, a prototype of intercloud broker, We are calling it MSC, Oh, not not MSC, sorry, SkyML. And basically we are targeting on training and hyperparameter tuning. And the users are our own AI students, which are using, which have credits from multiple clouds.


And finally, once started, we think that we are going to develop this kind of a Virtuous Cycle, Flywheel, Because you know like in the following sense, you starts with original compatibility set, then you are going to start build your early services. By definition, many of these services and applications are going to be part of the compatibility set, because they are going to run on Sky so they can run on multiple clouds. And now, if these services are, you know, so again, by becoming part of the of the compatibility set, compatibility set is going to grow. But also, if these are important workloads, more clouds will provide the interfaces so they can compete for these workloads, which now is a bigger compatibility set, you have more, you are easier to write more sophisticated services and so on, ok?


Now, let me very clear about what Sky is not, alright?


So first of all, Sky doesn't try to define a uniform standard API for all clouds. It's just too hard. and it's unclear whether it is even feasible. The Cloud API is probably ten times at least larger than the operating system APIs. For operating systems, people try to have a layer, like in the UNIX industry, you know And they failed. There is some other success for this is the subset of the UNIX API, but not really huge success. Clouds also are not, at least some of them, the leading clouds are not incentivized to support a uniform standard for fear of commoditization, And it would require a huge and lengthy standardization effort, which frankly, as academic institution probably don't have resources and the time to do it.


Also, Sky doesn't try to impose standards, even standards for some services. So the way to think about, about the service in many cases, there are exceptions. there are always exceptions. But you can think about the API as a code  So basically you don't say, I want to use Spark API and so forth. You just say, I want to use Spark 3.2, Kubernetes 1.8. So it's very similar the way to think about these services, like libraries in today’s programming languages.


So really, I would think writing a Sky application should be similar to writing a program; you can just replace "library" with "service". Today, developers are responsible for this: when you write a program, you include the appropriate libraries with the appropriate versions, resolve the dependencies, and manage the conflicts. And yes, that's not pleasant; many people refer to it as dependency hell. But it's working; it's feasible, right? The same thing here. We expect the developers, at least at the beginning, to specify the services: the version of the service, maybe some configuration parameters (say, whether some optimizer is enabled or not), and the application writer manages the conflicts. The intercloud layer is responsible for the rest; as I said, the intercloud broker is responsible for instantiating the service instances and managing their lifetime.
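As an illustration only (none of the names below come from a real Sky API; they are hypothetical), a service specification could look much like a dependency file, with pinned versions and configuration, which the broker checks against what each cloud advertises:

```python
# Hypothetical sketch: services declared like pinned library dependencies.
# The spec, catalog contents, and cloud names are all made up for illustration.

# The application writer pins services and versions, as in a requirements file.
app_spec = {
    "services": [
        {"name": "spark", "version": "3.2", "config": {"optimizer": True}},
        {"name": "kubernetes", "version": "1.8"},
    ]
}

# A toy "broker" view of which service versions each cloud advertises.
cloud_catalogs = {
    "cloud_a": {"spark": ["3.1", "3.2"], "kubernetes": ["1.8"]},
    "cloud_b": {"spark": ["3.2"]},
}

def clouds_satisfying(spec, catalogs):
    """Return the clouds offering every requested service at the pinned version."""
    ok = []
    for cloud, services in catalogs.items():
        if all(s["version"] in services.get(s["name"], [])
               for s in spec["services"]):
            ok.append(cloud)
    return ok

print(clouds_satisfying(app_spec, cloud_catalogs))  # ['cloud_a']
```

The point of the sketch is only the shape of the interaction: the developer pins service versions as if they were libraries, and the broker does the matching.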


Like I mentioned, Sky doesn't try to support all applications from the beginning. It's similar to serverless in some sense: serverless started by supporting a few key applications well, and now it's expanding to more applications. The same thing here. We are starting with a very few easy but useful applications, and then we hope to expand.


What if Sky happens?


Well, I think it will have a lot of positive impact. First of all, we believe it will lead to specialized clouds and accelerate innovation. You see, today most people and most of their applications are single-cloud, so if you come in and want to provide a cloud service, for you to have a seat at the table you need to provide a lot of table-stakes services just to be in the conversation. And for the same reason, it will make it easier to integrate on-prem and edge clouds.


Now, what Sky allows is that you can build a specialized cloud which is good at something, and then Sky will send you the components of the applications that require your service, if you are the best at it. So for instance, you can have compute-optimized clouds, like NVIDIA or Cirrascale; both of them have their own clouds. This is the announcement about the NVIDIA cloud, and this is Cirrascale. By the way, they also have specialized chips; for instance, Cirrascale offers Graphcore chips, some of which are not available on the public clouds.


Then you can have another cloud which is optimized for storage, best in class when it comes to storage for the enterprise. And between these clouds I have mentioned, you can have free peering agreements.


You can also give a seat at the table to new chip vendors. Today it is very hard for a new chip vendor, because unless they are in a cloud, their business is difficult. But here you can imagine that someone like Cerebras, which develops these kinds of wafer-scale chips, can install their servers at a company like Equinix, which is basically a physical data-center colocation company with something like 200 or 300 locations over the globe. So you can put your servers there, then advertise the service you provide, for instance training, and Sky Computing will send those workloads over to you, so you can get some revenue and grow from there, things like that.


Of course, public clouds are part of the conversation here, and even edge and private clouds are part of the conversation.


We also think it will accelerate cloud adoption, because it removes some of the customers' concerns, like data and operational sovereignty, lock-in, and things like that. So we hope it's very similar to how the Internet accelerated the growth of the networking industry.


So it's not a zero-sum game; the pie will get bigger faster. We also expect to accelerate the growth of software platforms. A company like Microsoft is a big software producer, and this would allow them to put their software on other clouds if they wish.


And finally, that's why we are doing this. We are researchers, and we believe it will impact many research topics, similar to the impact the Internet had on networking research.


I'm not going to go into the details of the research topics, but really all the facets are affected. The way to think about it is that more and more applications are going to the cloud, and if Sky is successful, many of these applications are going to go to the Sky. And Sky provides new challenges: it has multiple trust domains and multiple failure domains, and the system will be much more complex, because parts of the application run in the same data center, parts in the same cloud, parts across clouds, and things like that.


But finally, I want to note that so far I have talked about multi-cloud; Sky is also very relevant for a single cloud. This is a slide I showed earlier about why customers want this to happen, but now I'm focusing on a single cloud. You can still satisfy data and operational sovereignty, as long as the cloud has a data center in that particular country. You can still use it to leverage best-of-breed services and hardware within a single cloud, because even in a single cloud, some accelerators are not available in all regions. For instance, Google has TPUs, but not in all regions, so if you want to take advantage of them, you have to run your software in those regions. You can also aggregate resources, not across clouds but across multiple regions; you can get spot instances and so forth across many regions. It can also reduce cost and latency; for instance, you can find the region with enough spot instances. The only thing it doesn't do, for sure, is avoid lock-in.


So finally, over the last few minutes, let me tell you a little bit about our early experience.


So first of all, there is SkyML.


SkyML basically started as research aimed at our students, because we understand them well. You have credits from multiple clouds, and you want to train or tune your models, so you submit a training job to Sky. Ideally, if you have a private cluster, you want to try there first, because it doesn't cost you anything. If not, Sky is going to try, under the hood, to run on a cloud like AWS and find a region with enough spot instances, because spot instances are cheaper. If not, it can go to Azure. Or maybe you prefer TPUs, because you are going to be much more cost-effective if you use TPUs, so it can go to Google, and things like that. So SkyML does all this orchestration under the hood, and as a user you just submit the job and, hopefully, say, "I just need it to be done by 9 a.m., when I wake up." By the way, all of this infrastructure, and maybe you are not going to be surprised, is built on top of Ray.
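The placement logic described above can be sketched roughly as follows. This is a toy policy, not the actual SkyML code; the cluster names, prices, and capacity numbers are all made up:

```python
# Toy sketch of the placement policy described above: try the free private
# cluster first, otherwise pick the cheapest public option that fits the job.
# All names and numbers are illustrative, not from the real system.

candidates = [
    # (name, hourly price in $, accelerators currently available)
    ("private-cluster", 0.0, 0),      # free, but currently full
    ("aws:us-west-2:spot", 0.9, 8),
    ("azure:eastus:spot", 1.1, 16),
    ("gcp:us-central1:tpu", 0.7, 8),
]

def place_job(accels_needed, candidates):
    """Return the cheapest candidate with enough free accelerators, or None."""
    feasible = [c for c in candidates if c[2] >= accels_needed]
    if not feasible:
        return None                        # nothing fits; wait or retry later
    # The free private cluster wins whenever it has capacity, since its
    # price is 0; otherwise the cheapest spot/TPU region is chosen.
    return min(feasible, key=lambda c: c[1])

print(place_job(8, candidates))  # ('gcp:us-central1:tpu', 0.7, 8)
```

A real broker would also weigh the user's deadline ("done by 9 a.m.") and live spot availability, but the cheapest-feasible-resource idea is the same.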


I'm going to go through this quickly; this is the SkyML intercloud broker. I have another box here about provisioning: you provision the resources in the cloud, the cluster, in this case a Ray cluster, and run on top of it. Obviously you can use other things than Ray; Ray is what we are using for now.


So ok, that's basically what it is. For now, again, it's built on top of Ray, with flexible scheduling and so forth. We do have a prototype, we do have the first handful of users, and we plan to release it over the next couple of weeks as open source, obviously.


Now, another thing, which I didn't touch on: typically when people hear about multi-cloud, one concern they have, one thing they argue about, is data gravity, right? It's expensive, and it takes a long time, to move a lot of data from one cloud to another, and that makes you concerned.


So for that, we developed another system called Skylark. With Skylark, we want to move data efficiently and cost-effectively between clouds, ok?


Sometimes, when you want to go between two clouds, from one region in one cloud to another region in a different cloud halfway across the world, it can be quite expensive and also take a long time.


So, in order to alleviate the problems of latency and cost, we use a variety of techniques. One is overlay routing. Overlay routing basically says that you may find an intermediate point such that, if you go through the intermediate point, you actually get higher throughput than if you go directly. This is maybe not an obvious one: in this example, going from UAE to India (Mumbai) and then to North Virginia in the US may be faster than going from UAE directly to North Virginia. But here is another example: say I want to move data from AWS west to Azure east. It turns out that instead of moving the data directly, it's better to go from AWS west to AWS east, and from AWS east to Azure east. Why? In this case, it turns out that the data centers for the eastern regions of AWS and Azure are basically side by side, so they are connected to the same networks and things like that. It's very fast.
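The relay-selection idea can be sketched as follows. The throughput numbers are invented for illustration (a real system would measure them); the only real point is that a two-hop path's throughput is the bottleneck of its two links, and a relay wins when that bottleneck beats the direct path:

```python
# Toy sketch of overlay route selection: compare the direct path against
# every two-hop overlay path through a relay region. Throughput numbers
# (Gbps) are made up; a real system would measure or estimate them.

direct_gbps = {("aws-west", "azure-east"): 1.0}
link_gbps = {
    ("aws-west", "aws-east"): 5.0,
    ("aws-east", "azure-east"): 4.0,   # the two east regions sit side by side
    ("aws-west", "gcp-east"): 2.0,
    ("gcp-east", "azure-east"): 1.5,
}

def best_route(src, dst, relays):
    """Return (throughput, path); a two-hop path is limited by its slower link."""
    best = (direct_gbps[(src, dst)], [src, dst])
    for r in relays:
        first = link_gbps.get((src, r))
        second = link_gbps.get((r, dst))
        if first and second:
            bottleneck = min(first, second)
            if bottleneck > best[0]:
                best = (bottleneck, [src, r, dst])
    return best

print(best_route("aws-west", "azure-east", ["aws-east", "gcp-east"]))
# (4.0, ['aws-west', 'aws-east', 'azure-east'])
```

Here routing through AWS east quadruples throughput over the direct path, which mirrors the AWS-west-to-Azure-east example above.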


We also allocate multiple VMs per region, to get around the bandwidth throttling applied to individual connections. And there is also network tier selection: for instance, on AWS, if you use hot-potato routing, it is probably 40% cheaper, a lower fee, than cold-potato. Cold-potato means the cloud tries to keep the data as long as possible in its own network before going out; hot-potato means it gets rid of the packets as fast as possible.


The results are pretty promising; these are all preliminary results. Comparing with AWS's own DataSync, Skylark provides better performance on moving data even within AWS, between different regions: up to a 4.6x improvement in data transfers.


And the same holds if you move the data between clouds, comparing with the GCP data transfer tool, which is a similar tool provided by Google.


So in summary, we believe that Sky is the next logical step in the evolution of the cloud. Of course, many questions remain open on how to get there, and with these efforts, we hope to make it happen sooner.


