Ant Group's automated chaos engineering platform ChaosMeta is officially open source

ChaosMeta Introduction

ChaosMeta is a chaos engineering platform designed for cloud-native environments and automated drills. It is the open source version of XMonkey, Ant Group's internal chaos engineering platform, and embodies the methodology, technical capabilities, and product capabilities that Ant Group has accumulated over many years of large-scale, company-wide red-blue attack and defense drills. Shaped by years of complex fault drill scenarios inside the company, XMonkey is an important platform for Ant Group's R&D, testing, quality, and SRE staff to replay historical faults and uncover potential system risks. We hope this experience can be discussed and evolved together with the industry, which is why we open sourced ChaosMeta.

ChaosMeta is designed as a one-stop drill solution that covers the complete chaos engineering lifecycle and helps users quickly discover potential risks in business applications and systems. It also has a built-in "risk catalog" that condenses Ant Group's years of experience in the field of technical risk: a summary of the basic, general risks in various technical fields. With the lifecycle platform capabilities as technical support and the "risk catalog" as theoretical support, ChaosMeta can evolve toward automated chaos engineering.

Core Features

▌Chaos Engineering Lifecycle

Most mainstream chaos engineering projects in the industry focus on how to create faults. Engineers who run drills regularly know that each drill still has the following pain points: checking whether the current environment meets the preconditions of the drill (Admission Check), whether business traffic is present (Traffic Injection), whether the effect of the injected fault matches expectations (Fault Measurement), whether the business service recovers within the preset time (Recovery Measurement), and analyzing and summarizing the risk points. Based on the current state of the industry and the analysis of these problems, combined with Ant Group's years of experience in chaos engineering, ChaosMeta proposes the Chaos Engineering Lifecycle model. It provides technical support for stages including "Admission Check", "Traffic Injection", "Fault Injection", "Fault Measurement", "Recovery Measurement", and "Injection Recovery", and forms the technical basis for automated chaos engineering.
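To make the lifecycle stages concrete, here is a minimal Python sketch of a drill runner. It is an illustration only and not ChaosMeta's implementation: the `Drill` class, the `run_drill` function, and every stage callable are hypothetical names, and the ordering (recovery of the injection before the recovery measurement, with recovery always executed) is an assumption about how the stages would typically be sequenced.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Drill:
    """One drill, described by a callable per lifecycle stage (all hypothetical)."""
    admission_check: Callable[[], bool]
    inject_traffic: Callable[[], None]
    inject_fault: Callable[[], None]
    measure_fault: Callable[[], bool]
    recover_injection: Callable[[], None]
    measure_recovery: Callable[[], bool]
    log: List[str] = field(default_factory=list)


def run_drill(drill: Drill) -> str:
    """Run the stages in order and return 'skipped', 'passed' or 'failed'."""
    drill.log.append("admission_check")
    if not drill.admission_check():
        return "skipped"  # environment does not meet the preconditions

    drill.log.append("inject_traffic")
    drill.inject_traffic()

    fault_ok = False
    try:
        drill.log.append("inject_fault")
        drill.inject_fault()
        drill.log.append("measure_fault")
        fault_ok = drill.measure_fault()
    finally:
        # Always undo the injection, even if a stage above raised.
        drill.log.append("recover_injection")
        drill.recover_injection()

    drill.log.append("measure_recovery")
    recovered = drill.measure_recovery()
    return "passed" if fault_ok and recovered else "failed"


# Toy usage: the "system" is a dict whose flag is flipped by the fault.
state = {"fault": False}
drill = Drill(
    admission_check=lambda: True,
    inject_traffic=lambda: None,
    inject_fault=lambda: state.update(fault=True),
    measure_fault=lambda: state["fault"],
    recover_injection=lambda: state.update(fault=False),
    measure_recovery=lambda: not state["fault"],
)
result = run_drill(drill)
```

Putting the recovery step in a `finally` block reflects a common design choice in drill tooling: a failed measurement should never leave the injected fault in place.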

▌Risk Catalog

Every year Ant Group holds large-scale, company-wide red-blue attack and defense drills covering all of its businesses, and many businesses also run round-the-clock (7x24) drills and routine monthly drills. Drill targets include cloud products, Kubernetes, Operator applications, databases (OceanBase, MySQL, etc.), middleware (message queues, distributed scheduling, configuration centers, etc.), and business applications (trading systems, marketing systems, etc.). For each type of target, a set of the most common risks can be abstracted. Message queues, for example, face risks such as message accumulation, message loss, out-of-order messages, and unstable networks on dependent parties; transaction systems face risks around distributed transactions, amount consistency, and concurrency conflicts. The "Risk Catalog" is a risk encyclopedia that Ant Group has refined over the years on a large-scale financial Internet architecture. We will build the parts that are generally applicable to the open source world into the ChaosMeta project as the theoretical basis for automated chaos engineering.
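As a rough illustration of what a catalog keyed by target type could look like, the Python sketch below stores the two examples from the paragraph above in a dictionary. The structure, the key names, and the `risks_for` helper are assumptions made for this example; the actual format of ChaosMeta's risk catalog is not described here.

```python
from typing import Dict, List

# Entries paraphrase the examples in the text; the layout is an assumption.
RISK_CATALOG: Dict[str, List[str]] = {
    "message_queue": [
        "message accumulation",
        "message loss",
        "out-of-order messages",
        "unstable network on dependent parties",
    ],
    "transaction_system": [
        "distributed transactions",
        "amount consistency",
        "concurrency conflicts",
    ],
}


def risks_for(component_type: str) -> List[str]:
    """Return a copy of the common risks for a component type, or an empty list."""
    return list(RISK_CATALOG.get(component_type, []))


mq_risks = risks_for("message_queue")
```

Each entry could then be mapped to one or more drills, so that choosing a target type yields a ready-made list of experiments.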

▌Abundant cloud-native fault injection capabilities

Large-scale, high-frequency drills have also driven the construction of a wide range of fault injection capabilities. In addition to the system resource faults, kernel faults, network faults, and JVM injection that are common in the industry, we also provide a rich set of cloud-native fault injection capabilities. For example:

  • accumulating a large number of pending pods in the Kubernetes cluster to drag down the scheduling system;
  • injecting a validating admission webhook into the creation process of a resource object to prolong validation and slow down creation;
  • injecting a mutating admission webhook to mutate specific fields;
  • building a large number of Watch and List connections to increase the load on the APIServer when it calls back Operators.
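The first item in the list can be illustrated with plain Kubernetes objects. The Python sketch below generates a batch of Pod manifests whose `nodeSelector` is assumed to match no node, so the scheduler leaves them in the Pending state. This is not how ChaosMeta implements the fault; the label key `drill.example.com/no-such-node`, the function names, and the pod name prefix are hypothetical, and such a manifest should only be applied to a test cluster.

```python
import json
from typing import Any, Dict, List


def unschedulable_pod(name: str, namespace: str = "default") -> Dict[str, Any]:
    """Build a Pod manifest that no node can satisfy, so it stays Pending."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"drill": "pending-pods"},
        },
        "spec": {
            # Hypothetical label that is assumed to exist on no node.
            "nodeSelector": {"drill.example.com/no-such-node": "true"},
            "containers": [
                {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
            ],
        },
    }


def pending_pod_list(count: int, prefix: str = "pending-drill") -> Dict[str, Any]:
    """Wrap `count` unschedulable pods in a single v1 List object."""
    items: List[Dict[str, Any]] = [
        unschedulable_pod(f"{prefix}-{i}") for i in range(count)
    ]
    return {"apiVersion": "v1", "kind": "List", "items": items}


manifest_json = json.dumps(pending_pod_list(3), indent=2)
```

The resulting JSON can be saved to a file and applied with `kubectl apply -f`; the shared `drill=pending-pods` label makes cleanup possible with a single label-selector delete.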

Cloud Native Architecture

ChaosMeta's core platform capabilities are implemented on the Operator development model, so it supports cloud-native environments natively. ChaosMeta has a three-layer design; see the official documentation for details:

  1. The user layer consists of the chaosmeta-platform component. Its main job is to lower the barrier to entry by providing a visual interface for platform functions such as planning, orchestration, experiment configuration, and experiment record details;
  2. The engine layer contains ChaosMeta's core platform capabilities, such as remote injection, orchestration, and measurement, as well as the implementation of some cloud-native fault capabilities;
  3. The kernel layer implements single-machine fault injection. Its main component is chaosmetad, which can run as a resident HTTP service or be executed from the command line. It is also packaged as a DaemonSet component (chaosmeta-daemonset), so it can be flexibly matched to drill platforms with different needs.

[Figure: ChaosMeta architecture diagram]

Roadmap

ChaosMeta's roadmap follows two main tracks: platform capabilities and fault injection capabilities. The current main task is to finish adapting the main capabilities shown in the architecture diagram and release them in the open source version.

▌Platform Capability

The future evolution of ChaosMeta platform capabilities is divided into three stages.

Phase 1 - Manual Configuration: The goal is to open source all the components in the architecture diagram. At that point, ChaosMeta will support the complete chaos engineering lifecycle and enter the initial stage of automated chaos engineering, with the "risk catalog" serving as a theoretical reference. A drill is configured manually once and can then be run automatically many times.

Phase 2 - Automation: At this stage the "risk catalog" will play a larger role. For each class of application it will give not only the risks but also the corresponding prevention and emergency recommendations and a score for each item. ChaosMeta will package the "risk catalog" into risk "health check" suites for common components, enabling a one-click "physical examination": the user inputs the target application's information and directly receives a risk score and a risk analysis report.
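One way to picture the "physical examination" output is a weighted score over catalog items. The Python sketch below is purely hypothetical: the `RiskItem` fields, the weights, the advice strings, and the scoring formula (weighted share of risks the application survived, on a 0 to 100 scale) are all assumptions for illustration, since the scoring method is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RiskItem:
    name: str
    weight: float  # hypothetical importance of the risk
    passed: bool   # whether the application survived the matching drill
    advice: str    # prevention or emergency recommendation (made up here)


def risk_report(app: str, items: List[RiskItem]) -> Dict[str, object]:
    """Return a score from 0 to 100 plus advice for every failed item."""
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item in items if item.passed)
    score = round(100 * earned / total, 1) if total else 0.0
    findings = [
        {"risk": item.name, "advice": item.advice}
        for item in items
        if not item.passed
    ]
    return {"application": app, "score": score, "findings": findings}


report = risk_report(
    "demo-mq",
    [
        RiskItem("message loss", 3.0, True, "enable acknowledgements"),
        RiskItem("message accumulation", 1.0, False, "alert on consumer lag"),
    ],
)
```

In this toy input the application survives the heavier risk and fails the lighter one, so the report lists one finding alongside the overall score.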

Phase 3 - Intelligence: Explore ways to combine chaos engineering with artificial intelligence.

▌Fault Injection Capability

The following is only a classification of fault capabilities. For the specific atomic fault capabilities provided, please refer to the fault capability description. You are welcome to submit issues to request new capabilities; requests with higher demand are given priority.

[Figure: classification of fault injection capabilities]

Join ChaosMeta Community

As an open project, ChaosMeta embraces the open source R&D model, and we are committed to making the ChaosMeta community open and creative. In the future, all R&D, discussion, and other related work will be carried out transparently in the community. We welcome any form of participation, including but not limited to questions, code contributions, technical discussions, and feature suggestions. We look forward to the community's ideas and feedback to move the project forward.

  • Project GitHub address: https://github.com/traas-stack/chaosmeta
  • Official documentation: https://chaosmeta.gitbook.io/chaosmeta-en
  • Twitter: AntChaosMeta
  • Slack group: https://app.slack.com/client/T057ERYMS8J/C057883SM38?geocode=zh-cn
  • DingTalk group: 21765030887