From 91ba24d595c34d7cd9a6ddfe095457c3a03ca256 Mon Sep 17 00:00:00 2001 From: Spellchaser Date: Fri, 15 Sep 2017 12:38:33 -0400 Subject: [PATCH 1/7] Update design.md MD Heading fixes --- docs/design.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/design.md b/docs/design.md index 5f29fd6..7328a78 100644 --- a/docs/design.md +++ b/docs/design.md @@ -1,4 +1,4 @@ -#SIMOORG - HIGH LEVEL DESIGN +# SIMOORG - HIGH LEVEL DESIGN This document describes high level design of Simoorg: Linkedin’s Failure Inducing Framework. The rationale behind developing Simoorg is to have a simple yet powerful and extensible failure inducing framework. Simoorg is written in Python - Linkedin's lingua franca for solving operational challenges. @@ -11,7 +11,7 @@ Key points of Simoorg are: * Comprehensive logging to help SREs and developers to get valuable insights about how their application of choice reacts to failures. * Support of heterogeneous infrastructure by introducing flexible execution handlers. New execution handlers are easy to plug in with minimal efforts. -##From a bird's eye view +## From a bird's eye view Simoorg’s main job is to induce and revert failures against a service of your choice. The failures are induced based on the scheduler plugin type you wish to use. Simoorg comes with a non-deterministic scheduler configured, which generates failures at a random time. Although the failures are generated at a random time, you can still set a few limitations like: total run duration and min/max gap between failures. Each failure is followed by a revert, ensuring that the cluster we operate against is back to a clean state. Simoorg ensures logging of important metrics like failure name, impact and the time of the impact to help SREs and developers to reason about fault tolerance of their application of choice. 
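The scheduling behaviour described above (random trigger times bounded by a total run duration and a min/max gap between failures) can be sketched roughly as follows. This is an illustrative sketch only: the function and parameter names are invented for this example and are not Simoorg's actual scheduler API.

```python
import random
import time

def generate_plan(failures, total_run_duration, min_gap, max_gap):
    """Hypothetical sketch of non-deterministic plan generation.

    Produces a list of single-item dicts mapping a randomly chosen
    failure name to a randomly spaced trigger time, while honoring
    a total run duration and a min/max gap between failures.
    """
    plan = []
    start = time.time()
    end = start + total_run_duration
    t = start
    while True:
        # Each failure fires between min_gap and max_gap seconds
        # after the previous one.
        t += random.uniform(min_gap, max_gap)
        if t >= end:
            break
        plan.append({random.choice(failures): t})
    return plan
```

A run such as `generate_plan(["graceful_stop", "simulate_gc"], 3600, 60, 300)` would then yield randomly timed failures spread across one hour, each at least 60 and at most 300 seconds apart.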
@@ -33,7 +33,7 @@ In the subsequent paragraphs we will cover important components and talk about h * Topology * Api Server -###Moirai +### Moirai Moirai is a single threaded process that monitors and manages individual Atropos instances using standard UNIX IPC mechanism and python queues. It also provides entry points for the Api Server to retrieve information about the various services being tested. Moirai takes configs directory path as an input argument and bootstraps the framework by reading configuration files in the configs directory. Configs directory contains: @@ -48,7 +48,7 @@ Here each Atropos can communicate specific information to Moirai, with the help ![High level Design](/docs/images/high_level.jpg) -###Atropos +### Atropos Upon initialization, each Atropos instance reads one Fate Book ([link][/docs/configs.rst]) and depending on the destiny defined in the Fate Book sleeps until requirements are met. Once requirements are met, Atropos induces a random failure, waits for the specified interval and reverts to bring the cluster to a clean state. There are two types of requirements to be met before inducing a failure: @@ -60,12 +60,12 @@ Each Atropos instance has its own instance of a Scheduler which is in charge of Apart from the Scheduler, each Atropos instance has its own instance of a Handler, Logger and Journal. The high level diagram reflecting Atropos and its components is as follows: ![Atropos Components](/docs/images/atr1.png) -###Scheduler +### Scheduler A Scheduler generates a failure plan and keeps track of time. Currently Simoorg ships only with a Non-deterministic scheduler. The Non-deterministic scheduler randomly generates dispatch times and associates them with random failures. We refer to this sequence of timestamp and failures internally as a Plan. Once generated, the Plan is passed to Atropos. -###Handler +### Handler Each failure definition should have a handler associated with it. 
A Handler is referred to by its name within a failure definition and is responsible for inducing and reverting failures. The table below lists supported handlers and handlers planned to be available in future: @@ -77,7 +77,7 @@ AWS|AWS API calls|not supported|TBD| Rackspace|Rackspace API|not supported|TBD| -###Journal +### Journal Each Observer has a separate Journal instance. The Journal is responsible for: @@ -85,11 +85,11 @@ Each Observer has a separate Journal instance. The Journal is responsible for: * Persisting the current state of Atropos to support session resumption * Resuming state after a crash -###Logger +### Logger Each Atropos has a separate Logger instance. The Logger is used to log and store arbitrary messages spit out at various points of Plan execution. -###HealthCheck +### HealthCheck Healthcheck is an optional component that allows you to control the damage inflicted against your service. If enabled, Atropos kicks off the healthcheck logic defined in the Fate Book before inducing a failure. The Healthcheck component needs to return success in order for the failure run. Otherwise the Scheduler skips the current failure cycle. This ensures that we are not aggravating any existing issues and lets the cluster fire self-healing routines and recover. If a healthcheck is not defined, failures will be induced as scheduled assuming the cluster was able to recover. @@ -109,7 +109,7 @@ The best practice is to leverage your current monitoring system to identify the We also ship a simple kafka HealthCheck out of the box. This plugin considers a cluster to be healthy if the under replicated partition count is zero for all the nodes in cluster. The plugin also depends on the kafka topology config file to get information about the cluster. -###Topology +### Topology The Topology component is responsible for identifying and keeping the list of nodes that constitute your service. In most cases this is just a list of servers present in your cluster. 
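A static topology of the kind just described can be sketched as a small class that keeps the node list and hands back a random member. This is an illustrative sketch with invented method names, not the plugin interface Simoorg actually defines.

```python
import random

class StaticTopologySketch:
    """Hypothetical sketch of a static topology component: it keeps
    the list of nodes that constitute the service and can hand a
    random node back to the caller."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def get_all_nodes(self):
        # The full list of servers present in the cluster.
        return list(self.nodes)

    def get_random_node(self):
        # A single randomly chosen node for failure induction.
        return random.choice(self.nodes)
```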
The Topology component is also responsible for choosing a random node from the list and handing it over to Atropos. We ship a static topology and Kafka topology plugins with our source code. @@ -128,7 +128,7 @@ Another example of topology is Kafka topology. It is a custom Topology component * RANDOM_LEADER - Where the node is a leader for a random topic and a random partition * LEADER - Where the node is a leader for a specific topic and a specific partition (if you skip the partition it randomly selects a partition) -###Api Server +### Api Server Simoorg provides a simple API interface based on Flask. The API server communicates with Moirai process through linux FIFOs, so it is necessary that the Api Server is started on the same server as the Moirai process. The API endpoints currently supported by our systems are From 1601df458d019e53c782decbef4780cd13e1caba Mon Sep 17 00:00:00 2001 From: Spellchaser Date: Fri, 15 Sep 2017 12:39:40 -0400 Subject: [PATCH 2/7] Update user_guide.md Fix extra code ` Corrected MD headings --- docs/user_guide.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/docs/user_guide.md b/docs/user_guide.md index a23d5d8..d3435fe 100644 --- a/docs/user_guide.md +++ b/docs/user_guide.md @@ -1,13 +1,13 @@ -#Introduction +# Introduction This document describes the process of setting up and running simoorg against an application cluster. 
-##Installation +## Installation The system requirements for Simoorg are as follows OS: Any Linux distribution Python Version : Python-2.6 Additional Python Modules: multiprocessing, yaml, paramiko Simoorg is currently distributed via pip, so to install the package please run the following command -```` +``` (sudo) pip install simoorg ``` If you want to work with the latest code, please run the following commands @@ -25,13 +25,13 @@ Once you have confirmed that the tests have passed, you can install the code by ``` If you are planning to use ssh handler plugin to induce failures against a specific service cluster, please ensure that the user you are using to run simoorg have Passwordless SSH access to all the nodes in the cluster. You should also ensure that any failure scripts you plan to use are already present on all the nodes in the target service cluster. -##Basic Usage +## Basic Usage Simoorg is started using the command *simoorg* which takes the path to your config directory as the only argument. Please check the config document ([link](/docs/config.md)) to better understand the configuration files. The sample config directory packaged with the product can be used to set up your configs. ``` Ex: simoorg ~/configs/ ``` -##Usage Example +## Usage Example In this section of the document, we will be describing how to use Simoorg against a kafka cluster. For this examples we will be running three predefined failures (graceful stop, ungraceful stop and simulate full GC) on random nodes in the cluster using the Shell script handler plugin. We will be executing the failures in a random manner using the non deterministic scheduler. We will also be using the Kafka Topology plugin and Kafka HealthCheck plugin. Both of these plugins are packaged with the product and are ready to use out of the box. 
Before we start , we need to make sure that all the required failure scripts (the ones required for these failure scenario is present in the repo under Simoorg/failure_scripts/base/) are present on all the broker nodes in the kafka cluster. Let’s assume that the script is present in the location ~/test/failure_scripts/base/ on the kafka brokers, we will need this path later when we are updating our configurations. @@ -182,4 +182,3 @@ Where ~/kafka_configs/ is the path to your failure inducer configs. For longer r gunicorn 'simoorg.Api.MoiraiApiServer:create_app("~/kafka_configs/api.yaml")' ``` Where api.yaml should contain a valid path for the named pipe used by both the api server and Simoorg. Our current implementation of api, relies on the simoorg process to retrieve all information and do not serve any data once the process is dead. Please check the design doc to better understand the various REST API endpoints - From 7a06dc7e72914308a08ae7efead44cc610960145 Mon Sep 17 00:00:00 2001 From: Spellchaser Date: Fri, 15 Sep 2017 12:40:27 -0400 Subject: [PATCH 3/7] Update config.md MD heading fixes --- docs/config.md | 34 +++++++++++++++------------------- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/docs/config.md b/docs/config.md index 632a91d..87df1c7 100644 --- a/docs/config.md +++ b/docs/config.md @@ -1,4 +1,4 @@ -#SIMOORG - CONFIG FILES +# SIMOORG - CONFIG FILES This document provides a quick overview of the various configuration files currently used by Simoorg. Simoorg expects path to the config directory as the first console argument and a standard simoorg config directory should have the following structure ``` configs/ @@ -18,7 +18,7 @@ configs/ Next we will go through each one of these configurations in details -##API CONFIG api.yaml +## API CONFIG api.yaml This is our main api config file, this needs to be passed to both Moirai process and Api server as well. 
This is a yaml file, which is mainly used to store input named pipe location for Moirai process. It may be used to contain more config items in the future as the api functionalities are extended. @@ -29,14 +29,14 @@ moirai_input_fifo: '/tmp/moirai.fifo' ``` -##FATE BOOKS fate_books/* +## FATE BOOKS fate_books/* Fate Book is a collection of configurations used to to describe failures to be induced against your service. Each service should have a unique Fate Book associated with it. Upon starting up, Simoorg scans configs/fate_books subdirectory for files with .yaml extension. Each qualified file is treated as Fate Book and used to instantiate observers that are watching and executing failures based on the conditions defined in a Fate Book. Fate Books are human readable and can be edited using a conventional editor. -###Fate Book Format +### Fate Book Format Format of the Fate Books are chosen to be YAML for its simplicity yet being capable to formally describe nested objects in a human readable form. -###Fate Book Contents +### Fate Book Contents Each service that needs to receive failure commands from the Failure Inducer, has to have a Fate Book associated with it. Below there is a sample Fate Book for an example service (called test-service) ```yaml @@ -107,16 +107,16 @@ failures: ``` -###Fate Book Sections +### Fate Book Sections Next we take a closer look at the various sections of the fatebook -####service: +#### service: Required : Yes Default: None The value for service key is used to uniquely identify the service being specified in that fate book. Simoorg enforces that no two fate books can have the same value for the service key. -####topology: +#### topology: Required : Yes All values related to topology plugin should be stored under this section. 
We expect only two values under this key, they are as follows @@ -130,7 +130,7 @@ The name of the topology plugin should be same as plugin class (please check the topology_config : Any plugin specific values should be added to this section.Simoorg expects the config to be contained inside the main config directory and the path provided here is relative to the config root -####logger +#### logger Required : Yes Contains the logging related information, we expect it to contain the following keys @@ -150,7 +150,7 @@ This key is used to enable console logging log_level : Simoorg expects the value for this key to be "WARNING", "INFO", "VERBOSE" or "DEBUG" -####healthcheck +#### healthcheck Required : Yes In this section we list all of our health check related configs. the various keys we expect in this section are as follows @@ -168,7 +168,7 @@ Depends on what plugin you use. In case of Defaulthealthcheck this is the absolu plugin_config : Place to specify any plugin specific configurations, Currently is None Default Health check plugin. -####destiny +#### destiny Required : Yes This section is responsible for listing all the scheduler specific information. We expect the following keys to be present under the destiny section @@ -187,7 +187,7 @@ scheduler_plugin| the name of the scheduler plugin| Yes | None| Please check the plugins document to better understand the plugin names. 
In addition to the keys listed above, the "scheduler_plugin" key could also contain any plugin specific config, also the failure name given in "scheduler_plugin"->failures->"failure_name" should have a valid failure definition in the failures sections of the fate book -####failures +#### failures This section includes a list of failure definition and each item in the list should contain the following keys Key name | Description | Mandatory | Default | @@ -204,11 +204,10 @@ restor_handler->args | The args passed to the handler during failure revert | Ye wait_seconds | The wait seconds between failure induction and failure revert | Yes | None | -###Plugin Configs -================= +### Plugin Configs These are config files that may be specific to some plugin. Since these configs are closely related to the plugins, we will mainly be covering configs for the plugins that are shipped out of the box. -####Handler Configs +#### Handler Configs For any handler plugin (lets assume the handler name is test_handler), we expect the config to be located in the path config/plugins/handler/test_handler/test_handler.yaml, the config contents greatly depends on the specific handler.The ShellScriptHandler plugin file for example, looks like this : @@ -217,7 +216,7 @@ For any handler plugin (lets assume the handler name is test_handler), we expect host_key_path: ~/.ssh/known_hosts ``` -####Topology Configs +#### Topology Configs The location of the topology plugin is usually provided under the topology section of the fate book. Again the content of this configuration file depends heavily on the specific plugin.But here are two sample configuration files for StaticTopology and KafkaTopology plugins respectively. 
In StaticTopology we list all the servers present in the service under the key node ``` # file: configs/plugins/topology/static/topo.yaml @@ -261,6 +260,3 @@ kafka_host_resolution: LEADER: {Topic: "Topic1"} ``` - - - From 15a7fb0c9daef7a97078b237f923e35c9a8e978c Mon Sep 17 00:00:00 2001 From: Spellchaser Date: Fri, 15 Sep 2017 12:41:01 -0400 Subject: [PATCH 4/7] Update index.md --- docs/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/index.md b/docs/index.md index b7e73ed..a68733c 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1 +1 @@ -#SIMOORG +# SIMOORG From 7300d2f73d703f290d39da98faa3d70adf8916ee Mon Sep 17 00:00:00 2001 From: Spellchaser Date: Fri, 15 Sep 2017 12:42:41 -0400 Subject: [PATCH 5/7] Update low_level.md MD heading and link fixes --- docs/low_level.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/low_level.md b/docs/low_level.md index 718e33b..715de38 100644 --- a/docs/low_level.md +++ b/docs/low_level.md @@ -1,7 +1,7 @@ -#LOW LEVEL FAILURES +# LOW LEVEL FAILURES Libfui provides an easy way to induce low level failures to any POSIX call in your application. To be able to use low level failures against POSIX calls, we require the application to be started under the control of libfiu. The best practice is to use these failures either on your staging/dev clusters or run on select nodes from your production cluster. -Please check the libfiu website (https://blitiri.com.ar/p/libfiu/) to understand how to build and install libfiu on your servers. Once the libfiu packages are installed, please restart your application under the control of libfiu. You can achieve this using the fiu-run command ( see https://blitiri.com.ar/p/libfiu/doc/man-fiu-run.html ), the command should look something like the following +Please check the [libfiu website](https://blitiri.com.ar/p/libfiu/) to understand how to build and install libfiu on your servers. 
Once the libfiu packages are installed, please restart your application under the control of libfiu. You can achieve this using the [fiu-run command](https://blitiri.com.ar/p/libfiu/doc/man-fiu-run.html), the command should look something like the following ``` fiu-run -x -c $COMMAND ``` From bffd492637560db4feb7e303c2a302c51b702797 Mon Sep 17 00:00:00 2001 From: Spellchaser Date: Fri, 15 Sep 2017 12:43:40 -0400 Subject: [PATCH 6/7] Update plugins.md Header MD fixes --- docs/plugins.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/plugins.md b/docs/plugins.md index ecf575c..26fcb37 100644 --- a/docs/plugins.md +++ b/docs/plugins.md @@ -1,7 +1,7 @@ -#How to create a new plugin: +# How to create a new plugin: In simoorg, we have four types of pluggable component namely Topology, Healthcheck, Scheduler and Handler. Even though we ship a few standard plugins of each category, we understand that it will not meet the requirements of all the potential customers. So one our guiding design principles has been to ensure that system is easily extensible. So in this document, we will be detailing the various steps to be taken to create a new plugin. -##Topology +## Topology First we start with the the topology plugin. Simoorg relies on the topology plugin to retrieve information about the individual nodes of a service. The arguments that are passed to any topology plugin is *Args:* input_file - the config file to be read by the plugin @@ -56,7 +56,7 @@ kafka_host_resolution: node_type_7: RANDOM_BROKER: {Topic: "Topic3"} -```` +``` * This class reads the config file and loads it in memory data structure. At the time of failure induction, it returns a random host (broker host name) to the caller method. 
The selection of this host depends upon the kind of node selected @@ -67,7 +67,7 @@ In the above Kafka Topology plugin example, it is possible to modify the config Path to KafkaTopology plugin : simoorg.plugins.topology.KafkaTopology.KafkaTopology -##HealthCheck : +## HealthCheck : Healthcheck plugin is responsible for checking the health of the target cluster. *Args:* script - Any external script to be used by the plugin @@ -92,7 +92,7 @@ Let’s take an example of *KafkaHealthCheck plugin* : If users want to use a shell script, that will do the HealthCheck on the target cluster, they can use the DefaultHealtCheck plugin in the fate book and pass it the customized shell_script. The DefaultHealthCheck plugin like KafkaHealthCheck plugin implements the check() method that will return true if the target cluster is healthy, else false otherwise. -##Scheduler: +## Scheduler: The Scheduler plugin is responsible for creating the plans that an atropos process will be following. A plan as received by atropos should be a list of single item dictionaries, where the dictionary has the failure name as the key and the trigger time as the value. *Args:* destiny_object - A dictionary containing the contents of the plugin key of the destiny @@ -116,7 +116,7 @@ Let us consider the example of NonDeterministicScheduler plugin: There are a number of fully implemented methods in BaseScheduler, that you can use in your implementation to better access the destiny object. 
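The plan format described above, a list of single-item dictionaries mapping a failure name to its trigger time, can be illustrated with a minimal standalone scheduler. The class name and config keys here are hypothetical, and a real plugin would subclass Simoorg's BaseScheduler rather than stand alone.

```python
import random
import time

class ConstantGapSchedulerSketch:
    """Hypothetical scheduler sketch that emits a plan in the shape
    atropos expects: a list of single-item dicts whose key is the
    failure name and whose value is the trigger time."""

    def __init__(self, destiny_object):
        # Invented config keys for illustration only.
        self.failures = destiny_object["failures"]
        self.gap_seconds = destiny_object["gap_seconds"]
        self.count = destiny_object["count"]

    def get_plan(self):
        start = time.time()
        # One randomly chosen failure per slot, spaced a fixed gap apart.
        return [{random.choice(self.failures): start + i * self.gap_seconds}
                for i in range(1, self.count + 1)]
```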
-##Handler +## Handler Handler is the plugin responsible for actually inducing and reverting the failures *Args:* config_dir - This is the path to the simoorg config directory From eb1da8efc090a59b17e8d9d0b05c5cd512bd6d4b Mon Sep 17 00:00:00 2001 From: Spellchaser Date: Fri, 15 Sep 2017 12:45:54 -0400 Subject: [PATCH 7/7] Update user_guide.md Link fix --- docs/user_guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide.md b/docs/user_guide.md index d3435fe..2a9a950 100644 --- a/docs/user_guide.md +++ b/docs/user_guide.md @@ -26,7 +26,7 @@ Once you have confirmed that the tests have passed, you can install the code by If you are planning to use ssh handler plugin to induce failures against a specific service cluster, please ensure that the user you are using to run simoorg have Passwordless SSH access to all the nodes in the cluster. You should also ensure that any failure scripts you plan to use are already present on all the nodes in the target service cluster. ## Basic Usage -Simoorg is started using the command *simoorg* which takes the path to your config directory as the only argument. Please check the config document ([link](/docs/config.md)) to better understand the configuration files. The sample config directory packaged with the product can be used to set up your configs. +Simoorg is started using the command *simoorg* which takes the path to your config directory as the only argument. Please check the [config document](/docs/config.md) to better understand the configuration files. The sample config directory packaged with the product can be used to set up your configs. ``` Ex: simoorg ~/configs/ ```