ChaosMeta Introduction
ChaosMeta is a chaos engineering platform designed for cloud-native environments and automated drills. It is the open source version of XMonkey, Ant Group's internal chaos engineering platform, and embodies the methodology, technical capabilities, and product capabilities that Ant Group has accumulated over many years of large-scale, company-level red-blue attack and defense drills. Driven by years of complex fault drill scenarios inside the company, XMonkey has built up a great deal of distinctive experience in chaos engineering. It is an important platform on which Ant Group's R&D, testing, quality, SRE, and other personnel replay historical faults and uncover potential system risks. We hope this experience can be discussed and evolved together with the industry, so we open sourced ChaosMeta.
ChaosMeta is designed to be a one-stop drill solution that covers the complete chaos engineering life cycle and helps users quickly discover potential risks in business applications and systems. It also has a built-in "risk catalog" that condenses Ant Group's many years of experience in the field of technical risk into a summary of the basic, general risks in various technical fields. With the platform capabilities of the chaos engineering life cycle as technical support and the "risk catalog" as theoretical support, ChaosMeta can evolve towards automated chaos engineering.
Core Features
▌Chaos Engineering Lifecycle
Mainstream chaos engineering projects in the industry focus mainly on how to create faults. Engineers who regularly run drills know that every drill still has the following pain points: checking whether the current environment meets the drill's preconditions (Admission Check), whether business traffic is sufficient (Traffic Injection), whether the effect of the injected fault matches expectations (Fault Measurement), and whether the business service recovers within the preset time (Recovery Measurement), and then analyzing and summarizing the risk points. Based on the current state of the industry and the analysis of these problems, combined with Ant Group's years of experience in chaos engineering, ChaosMeta proposes a Chaos Engineering Lifecycle model. It provides technical support for stages such as "Admission Check", "Traffic Injection", "Fault Injection", "Fault Measurement", "Recovery Measurement", and "Injection Recovery", laying the technical foundation for automated chaos engineering.
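The lifecycle stages above can be pictured as an ordered pipeline. The following is only a minimal sketch, not ChaosMeta's implementation: the stage order simply follows the order in which the text lists the stages, and `run_drill`, `DrillResult`, and the handler callables are hypothetical names introduced for illustration. One design assumption is made explicit in the code: once a fault has been injected, "Injection Recovery" is always attempted, even if an intermediate stage fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Stage names in the order the text lists them (an assumption about ordering).
STAGES = (
    "Admission Check",
    "Traffic Injection",
    "Fault Injection",
    "Fault Measurement",
    "Recovery Measurement",
    "Injection Recovery",
)


@dataclass
class DrillResult:
    passed: bool
    log: List[Tuple[str, bool]] = field(default_factory=list)


def run_drill(handlers: Dict[str, Callable[[], bool]]) -> DrillResult:
    """Run the stages in order and stop at the first failure.

    Hypothetical policy: if the fault was injected, Injection Recovery
    still runs after a failure so the target is not left broken.
    """
    result = DrillResult(passed=True)
    injected = False
    for stage in STAGES:
        must_clean_up = stage == "Injection Recovery" and injected
        if not result.passed and not must_clean_up:
            continue  # skip remaining stages after a failure
        ok = bool(handlers[stage]())
        result.log.append((stage, ok))
        if stage == "Fault Injection" and ok:
            injected = True
        if not ok:
            result.passed = False
    return result
```

Each handler is a zero-argument callable that returns `True` on success, so a real drill could plug in an environment check, a load generator, or a metric query without changing the pipeline itself.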
▌Risk Catalog
Every year Ant Group holds large-scale, company-level red-blue attack and defense drills covering all of its businesses, and many businesses also run round-the-clock (7x24) drills and regular monthly drills. Drill targets include cloud products, Kubernetes, Operator applications, databases (OceanBase, MySQL, etc.), middleware (message queues, distributed scheduling, configuration centers, etc.), and business applications (trading systems, marketing systems, etc.). For each type of application, some of the most common risks can always be abstracted. Message queues, for example, face risks such as message accumulation, message loss, out-of-order messages, and unstable network connections to relying parties; transaction systems face risks around distributed transactions, amount consistency, and concurrency conflicts. The "Risk Catalog" is a "Risk Encyclopedia" that Ant Group has refined over the years in the context of a large-scale financial Internet architecture. We will build the parts that are common to the open source world into the ChaosMeta project as the theoretical basis of automated chaos engineering.
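One simple way to picture a risk catalog is as a mapping from an application type to its common risks. The sketch below is an illustration only: it uses just the two examples given in the text, and `RISK_CATALOG` and `risks_for` are hypothetical names, not part of ChaosMeta.

```python
from typing import Dict, List

# Hypothetical in-memory catalog; entries are taken from the examples in the text.
RISK_CATALOG: Dict[str, List[str]] = {
    "message queue": [
        "message accumulation",
        "message loss",
        "out-of-order messages",
        "unstable network connections to relying parties",
    ],
    "transaction system": [
        "distributed transactions",
        "amount consistency",
        "concurrency conflicts",
    ],
}


def risks_for(application_type: str) -> List[str]:
    """Return the known common risks for an application type (empty if unknown)."""
    key = application_type.strip().lower()
    return list(RISK_CATALOG.get(key, []))
```

A lookup such as `risks_for("Message Queue")` returns the four message queue risks, which a drill planner could then turn into concrete experiments.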
▌Abundant cloud-native fault injection capabilities
Large-scale, high-frequency drills have also driven the development of a wide range of fault injection capabilities. In addition to the system resource faults, kernel faults, network faults, JVM injection, and similar capabilities that are common in the industry, we also provide a wealth of cloud-native fault injection capabilities. For example:
- accumulating a large number of pending pods in the Kubernetes cluster, dragging down the scheduling system;
- injecting a validating admission webhook into the creation process of a resource object, prolonging the verification time and slowing down creation;
- injecting a mutating admission webhook to mutate specific fields;
- building a large number of List and Watch connections to increase the load on the API server when it calls back Operators.
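To make the first example concrete, the sketch below generates Pod manifests that the scheduler cannot place, so they stay in the Pending state. This is an assumption about one possible way to produce pending pods, not a description of how ChaosMeta does it: the node selector label, the `drill` label, and the function names are all hypothetical, and the code only builds dictionaries without contacting any cluster.

```python
from typing import Dict, List


def pending_pod_manifest(name: str, namespace: str = "default") -> Dict:
    """Build a Pod manifest that no node is expected to satisfy."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Hypothetical label so the drill's pods can be found and deleted.
            "labels": {"drill": "pending-pod-flood"},
        },
        "spec": {
            # Assumption: no node carries this label, so scheduling never succeeds.
            "nodeSelector": {"drill.example.com/unschedulable": "true"},
            "containers": [
                {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
            ],
        },
    }


def build_pending_pods(count: int, prefix: str = "pending-drill") -> List[Dict]:
    """Return `count` manifests with unique names."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return [pending_pod_manifest(f"{prefix}-{i}") for i in range(count)]
```

The manifests can be serialized with `json.dumps` and applied to a test cluster; the shared label makes cleanup a single label-selector delete.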
Cloud Native Architecture
ChaosMeta's core platform capabilities are implemented with the Operator development model, so it supports cloud-native environments natively. ChaosMeta has a three-layer design; see the official documentation for details:
- The user layer consists of the chaosmeta-platform components. Its main task is to lower the barrier to entry by providing a visual interface for platform functions such as planning, orchestration, experiment configuration, and experiment record details;
- The engine layer includes ChaosMeta's core platform capabilities such as remote injection, orchestration, and measurement, as well as the implementation of some cloud-native fault capabilities;
- The kernel layer mainly implements single-machine fault injection capabilities. Its main component is chaosmetad, which can run either as a resident HTTP service or through command-line execution. It is also packaged as a DaemonSet component (chaosmeta-daemonset), so it can flexibly fit drill platforms with different needs.
RoadMap
ChaosMeta's roadmap follows two main routes: platform capabilities and fault injection capabilities. The current main task is to finish adapting the main capabilities shown in the architecture diagram and release them in the open source version.
▌Platform Capability
The future evolution of ChaosMeta platform capabilities is divided into three stages.
Phase 1 - Manual Configuration: The goal is to open source all the components in the architecture diagram. At that point ChaosMeta can support the complete chaos engineering life cycle and enter the initial stage of automated chaos engineering, with the "risk catalog" serving as a theoretical reference. Configure once manually, then run automatically many times.
Phase 2 - Automation: At this stage, the "risk catalog" will play a greater role. It provides not only the risks of a class of applications but also the corresponding prevention and emergency recommendations and a score for each item. ChaosMeta will integrate the "risk catalog" into risk checkup packages for general components, enabling a one-click "physical examination": enter the target application's information and directly receive a risk score and a risk analysis report.
Phase 3 - Intelligence: Explore ways to combine chaos engineering with artificial intelligence.
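As a rough illustration of the Phase 2 "physical examination" idea, the sketch below turns a list of scored risk items into a risk score and a short report. The scoring formula (weight of mitigated items divided by total weight, scaled to 100) and all names here are assumptions made for this example; the text does not specify how ChaosMeta will compute scores.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RiskItem:
    name: str
    weight: float          # hypothetical per-item score weight
    mitigated: bool        # whether the drill showed the risk is handled
    recommendation: str    # prevention or emergency advice for this risk


def risk_report(application: str, items: List[RiskItem]) -> Dict:
    """Return a 0-100 score and the recommendations for unmitigated risks."""
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("items must carry a positive total weight")
    covered = sum(item.weight for item in items if item.mitigated)
    findings = [
        f"{item.name}: {item.recommendation}"
        for item in items
        if not item.mitigated
    ]
    return {
        "application": application,
        "score": round(100 * covered / total, 1),
        "findings": findings,
    }
```

For example, an application with one mitigated risk of weight 3 and one unmitigated risk of weight 2 scores 60.0, and the report lists only the unmitigated risk with its recommendation.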
▌Fault Injection Capability
The following is only a classification of fault capabilities. For the specific atomic fault capabilities provided, see the fault capability description. You are welcome to submit issues proposing new capabilities; those in higher demand are given priority.
[Figure: classification of fault injection capabilities]
Join ChaosMeta Community
As an open project, we embrace the open source R&D model and are committed to making the ChaosMeta community open and creative. In the future, all R&D, discussion, and other related work will be carried out transparently in the community. We welcome any form of participation, including but not limited to questions, code contributions, technical discussions, and feature requests. We look forward to receiving ideas and feedback from the community to move the project forward.
- Project GitHub address: https://github.com/traas-stack/chaosmeta
- Official documentation: https://chaosmeta.gitbook.io/chaosmeta-en
- Twitter: AntChaosMeta
- Slack group: https://app.slack.com/client/T057ERYMS8J/C057883SM38?geocode=zh-cn
- DingTalk Group: 21765030887