When we design software, we normally put a lot of effort into writing high-quality code. But that’s not enough. Good software should also take care of its ecosystem: testing, deployment, networking, and so on. One of the most important aspects is configuration management.
Good configuration management should allow the software to be executed in any environment without changing the code. It helps Ops manage all the tedious settings, gives them a view of what can happen during the process, and even allows them to change the behavior at runtime.
The most common configuration includes credentials to the database or an external service, the hostname of the deployed server, dynamic parameters, etc.
In this article, I want to share with you some good practices of configuration management and how we can implement them in Python. If you have more ideas, please leave your comments below.
When do we need a separate configuration file?
Before writing any configuration file, we should ask ourselves why we need an external file at all. Can’t we just make the values constants in the code? The well-known Twelve-Factor App methodology has answered this question for us:
A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials. Note that this definition of “config” does not include internal application config, such as config/routes.rb in Rails, or how code modules are connected in Spring. This type of config does not vary between deploys, and so is best done in the code.
It recommends that any environment-dependent parameters, such as database credentials, live outside the code (the Twelve-Factor App itself suggests environment variables; an external file serves the same purpose). Everything else is just a normal constant in the code. Another use case I see a lot is storing dynamic variables in an external file, for instance a blacklist or whitelist, a number within a certain range (e.g. a timeout), or some free text. These variables may be the same in every environment, but keeping them in a configuration file makes the software much more flexible and easier to edit. However, if the file grows too large, we might consider moving it to a database instead.
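As a rough illustration of keeping environment-dependent values out of the code, the sketch below builds database settings from environment variables instead of hard-coded constants. The variable names (DB_HOST, DB_PORT, DB_USER, DB_PASSWORD) and the DatabaseSettings class are made up for this example; a plain dict stands in for the real environment so the snippet runs anywhere.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseSettings:
    host: str
    port: int
    user: str
    password: str


def load_database_settings(environ=os.environ):
    """Build settings from environment variables (hypothetical names)."""
    return DatabaseSettings(
        host=environ.get("DB_HOST", "127.0.0.1"),
        port=int(environ.get("DB_PORT", "5432")),
        user=environ["DB_USER"],          # no default: must be provided
        password=environ["DB_PASSWORD"],  # never hard-coded in the repo
    )


# A dict stands in for os.environ in this demo
settings = load_database_settings(
    {"DB_USER": "xiaoxu", "DB_PASSWORD": "secret", "DB_PORT": "6543"}
)
print(settings.host, settings.port)
```

With this layout the codebase contains no credentials, which is exactly what the litmus test above asks for.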
Which format of the configuration file should I use?
In fact, there are no constraints on the format of the configuration file as long as the code can read and parse it. But there are some good practices.
The most common and standardized formats are YAML, JSON, TOML and INI. A good configuration file should meet at least these 3 criteria:
Easy to read and edit: It should be text-based and structured in a way that is easy to understand. Even non-developers should be able to read it.
Allow comments: A configuration file is not read only by developers. Comments are extremely important in production, when non-developers try to understand the process and modify the software’s behavior. Writing comments is a way to quickly explain certain things, making the config file more expressive.
Easy to deploy: The configuration file should be accepted by all operating systems and environments. It should also be easy to ship to the server via a CDaaS pipeline.
Maybe you still don’t know which one is better. But if you think about it in the context of Python, then the answer would be YAML or INI. YAML and INI are well accepted by most Python programs and packages. INI is probably the most straightforward solution, with only one level of hierarchy. However, INI has no data types: every value is read as a string.
[APP]
ENVIRONMENT = test
DEBUG = True
# Only accept True or False
[DATABASE]
USERNAME = xiaoxu
PASSWORD = xiaoxu
HOST = 127.0.0.1
PORT = 5432
DB = xiaoxu_database
The same configuration in YAML looks like this. As you can see, YAML supports nested structures quite well (like JSON). Besides, YAML natively encodes data types such as string, integer, float, boolean, list and dictionary.
APP:
  ENVIRONMENT: test
  DEBUG: True
  # Only accept True or False
DATABASE:
  USERNAME: xiaoxu
  PASSWORD: xiaoxu
  HOST: 127.0.0.1
  PORT: 5432
  DB: xiaoxu_database
JSON is very similar to YAML and is extremely popular as well. However, it’s not possible to add comments in JSON. I use JSON a lot for internal config inside the program, but not when I want to share the config with other people.
{
  "APP": {
    "ENVIRONMENT": "test",
    "DEBUG": true
  },
  "DATABASE": {
    "USERNAME": "xiaoxu",
    "PASSWORD": "xiaoxu",
    "HOST": "127.0.0.1",
    "PORT": 5432,
    "DB": "xiaoxu_database"
  }
}
TOML, on the other hand, is similar to INI, but supports more data types and has a defined syntax for nested structures. It’s used a lot by Python packaging tools like pip and Poetry (via pyproject.toml). But if the config file has too many nested structures, YAML is a better choice. The following file looks like INI, but every string value has quotes.
[APP]
ENVIRONMENT = "test"
DEBUG = true
# Only accept True or False
[DATABASE]
USERNAME = "xiaoxu"
PASSWORD = "xiaoxu"
HOST = "127.0.0.1"
PORT = 5432
DB = "xiaoxu_database"
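To make the data type difference concrete, here is a small sketch that parses the same two settings from an INI string and from a JSON string using only the standard library. The inline strings are stand-ins for real files.

```python
import configparser
import json

INI_TEXT = """
[DATABASE]
HOST = 127.0.0.1
PORT = 5432
"""

JSON_TEXT = '{"DATABASE": {"HOST": "127.0.0.1", "PORT": 5432}}'

ini_config = configparser.ConfigParser()
ini_config.read_string(INI_TEXT)
json_config = json.loads(JSON_TEXT)

ini_port = ini_config["DATABASE"]["PORT"]
json_port = json_config["DATABASE"]["PORT"]

# INI hands back a string, JSON hands back an integer
print(type(ini_port).__name__, type(json_port).__name__)
# str int

# INI needs an explicit conversion to get a number back
print(ini_config["DATABASE"].getint("PORT") == json_port)
# True
```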
So far, I’ve explained WHY and WHAT. In the next sections, I will show you the HOW.
Option1: YAML/JSON — Simply read an external file
As usual, we start with the most basic approach: simply creating an external file and reading it. Python has a built-in json package, and YAML files can be parsed with the third-party PyYAML package (imported as yaml). As you can see from the code below, both return the same dict object, so attribute access is the same for both files.
Read
For security reasons, it is recommended to use yaml.safe_load() instead of yaml.load() to avoid code injection through the configuration file.
import json

import yaml  # third-party package: PyYAML


def read_json(file_path):
    with open(file_path, "r") as f:
        return json.load(f)


def read_yaml(file_path):
    with open(file_path, "r") as f:
        return yaml.safe_load(f)


assert read_json("data/sample.json") == read_yaml("data/sample.yaml")
Validation
Both packages raise a FileNotFoundError for a non-existing file. YAML raises different exceptions for a non-YAML file and an invalid YAML file, while JSON raises JSONDecodeError for both.
import pytest


def test_validation_json():
    with pytest.raises(FileNotFoundError):
        read_json(file_path="source/data/non_existing_file.json")
    # One pytest.raises block per call: nothing after the first
    # raising statement inside a block is executed.
    with pytest.raises(json.decoder.JSONDecodeError):
        # only show the first error
        read_json(file_path="source/data/sample_invalid.json")
    with pytest.raises(json.decoder.JSONDecodeError):
        read_json(file_path="source/data/sample_invalid.yaml")


def test_validation_yaml():
    with pytest.raises(FileNotFoundError):
        read_yaml(file_path="source/data/non_existing_file.yaml")
    with pytest.raises(yaml.scanner.ScannerError):
        # only show the first error
        read_yaml(file_path="source/data/sample_invalid.yaml")
    with pytest.raises(yaml.parser.ParserError):
        # only show the first error
        read_yaml(file_path="source/data/sample_invalid.json")
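Since the parsers only check syntax, an application usually wants one clear error for every way the file can be unusable. The following is a minimal sketch of that idea for JSON, using the standard library only. ConfigError, load_json_config and the required section names are hypothetical choices for this example.

```python
import json
import os
import tempfile


class ConfigError(Exception):
    """Raised when the configuration file cannot be used (hypothetical)."""


def load_json_config(file_path, required_sections=("APP", "DATABASE")):
    try:
        with open(file_path, "r") as f:
            config = json.load(f)
    except FileNotFoundError as exc:
        raise ConfigError(f"Config file not found: {file_path}") from exc
    except json.JSONDecodeError as exc:
        raise ConfigError(
            f"Invalid JSON in {file_path} (line {exc.lineno}): {exc.msg}"
        ) from exc
    # Syntax is fine; now check that the expected sections are present
    missing = [name for name in required_sections if name not in config]
    if missing:
        raise ConfigError(f"Missing sections in {file_path}: {missing}")
    return config


# Demo with a temporary file that lacks the DATABASE section
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w") as f:
        json.dump({"APP": {"DEBUG": True}}, f)
    try:
        load_json_config(path)
    except ConfigError as error:
        message = str(error)
        print(message)
```

The caller then only has to handle a single exception type at startup.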
Option2: configparser - Python built-in package
From here onwards, I will introduce packages designed for configuration management. We start with a Python built-in package: configparser.
configparser is primarily used for reading and writing INI files, but it also accepts a dictionary or an iterable file object as input. Each INI file consists of multiple sections, each containing multiple key-value pairs. Below is an example of accessing the fields.
Read
import configparser


def read_ini(file_path):
    config = configparser.ConfigParser()
    config.read(file_path)
    # keys are lower-cased by configparser
    for section in config.sections():
        for key in config[section]:
            print((key, config[section][key]))


read_ini("source/data/sample.ini")
# ('environment', 'test')
# ('debug', 'True')
# ('username', 'xiaoxu')
# ('password', 'xiaoxu')
# ('host', '127.0.0.1')
# ('port', '5432')
# ('db', 'xiaoxu_database')
configparser doesn’t guess data types in the config file, so every value is stored as a string. But it provides a few methods to convert a string to the correct data type. The most interesting one is the Boolean conversion, as it recognizes Boolean values from 'yes'/'no', 'on'/'off', 'true'/'false' and '1'/'0'.
As mentioned before, it can also read from a dictionary using read_dict(), from a string using read_string(), or from an iterable file object using read_file().
import configparser


def read_ini_extra(file_path, dict_obj=None):
    config = configparser.ConfigParser()
    if dict_obj:
        config.read_dict(dict_obj)
    else:
        config.read(file_path)
    debug = config["APP"].getboolean("DEBUG")
    print(type(debug))
    # <class 'bool'>
    name = config.get('APP', 'NAME', fallback='NAME is not defined')
    print(name)
    return debug


# read ini file
read_ini_extra(file_path="source/data/sample.ini")
# read dict obj
config_json = read_json(file_path="source/data/sample.json")
read_ini_extra(file_path=None, dict_obj=config_json)
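The conversion helpers mentioned above can be tried without any file by feeding read_string() an inline INI snippet. The keys VERBOSE, TIMEOUT and RETRIES are invented for this illustration.

```python
import configparser

config = configparser.ConfigParser()
config.read_string(
    """
[APP]
DEBUG = yes
VERBOSE = off
TIMEOUT = 2.5

[DATABASE]
PORT = 5432
"""
)

app = config["APP"]
debug = app.getboolean("DEBUG")      # 'yes' is read as True
verbose = app.getboolean("VERBOSE")  # 'off' is read as False
timeout = app.getfloat("TIMEOUT")
port = config.getint("DATABASE", "PORT")
# RETRIES is not in the file, so the fallback is returned
retries = config.getint("DATABASE", "RETRIES", fallback=3)

print(debug, verbose, timeout, port, retries)
# True False 2.5 5432 3
```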
Validation
Validation with configparser is not as straightforward as with YAML and JSON. First, it doesn’t raise a FileNotFoundError if the file doesn’t exist. Instead, a KeyError is raised later, when the code tries to access a key.
Besides, the package “ignores” indentation errors. As in the example below, if you have an extra tab or space before “DEBUG”, you get a wrong value for both ENVIRONMENT and DEBUG.
Nevertheless, configparser is able to raise a ParsingError that reports multiple errors at once (see the last test case). This helps us to solve problems in one shot.
import pytest


def test_validation_configparser():
    # doesn't raise FileNotFoundError, but raises KeyError
    # when it tries to access a key
    with pytest.raises(KeyError):
        read_ini_extra(file_path="source/data/non_existing_file.ini")

    # [APP]
    # ENVIRONMENT = test
    #   DEBUG = True
    # doesn't raise an exception for wrong indentation
    debug = read_ini_extra(
        file_path="source/data/sample_wrong_indentation.ini"
    )
    print(debug)
    # None
    # However, config["APP"]["ENVIRONMENT"] will return 'test\nDEBUG = True'

    # [APP]
    # ENVIRONMENT = test
    # DEBUG True
    # [DATABASE]
    # USERNAME = xiaoxu
    # PASSWORD xiaoxu
    with pytest.raises(configparser.ParsingError):
        debug = read_ini_extra(
            file_path="source/data/sample_wrong_key_value.ini"
        )
    # show all the errors
    # configparser.ParsingError: Source contains parsing errors: 'source/data/sample_wrong_key_value.ini'
    # [line 3]: 'DEBUG True\n'
    # [line 8]: 'PASSWORD xiaoxu\n'
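One way to close these gaps is a thin wrapper that fails early. The sketch below relies on the fact that ConfigParser.read() returns the list of files it managed to parse; read_ini_strict and its `required` argument are hypothetical names for this example.

```python
import configparser
import os
import tempfile


def read_ini_strict(file_path, required=None):
    """Read an INI file, failing early on missing files or keys.

    `required` maps section names to the keys expected in them
    (a convention invented for this sketch).
    """
    config = configparser.ConfigParser()
    # read() returns the list of files it managed to parse
    if not config.read(file_path):
        raise FileNotFoundError(f"Config file not found: {file_path}")
    for section, keys in (required or {}).items():
        if not config.has_section(section):
            raise KeyError(f"Missing section [{section}]")
        missing = [key for key in keys if not config.has_option(section, key)]
        if missing:
            raise KeyError(f"Missing keys in [{section}]: {missing}")
    return config


# A file with the indentation problem: DEBUG is swallowed by ENVIRONMENT
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_wrong_indentation.ini")
    with open(path, "w") as f:
        f.write("[APP]\nENVIRONMENT = test\n  DEBUG = True\n")
    try:
        read_ini_strict(path, required={"APP": ["ENVIRONMENT", "DEBUG"]})
        error_message = None
    except KeyError as error:
        error_message = error.args[0]
        print(error_message)
```

The indented DEBUG line is treated as a continuation of the ENVIRONMENT value, so the key check catches what the parser silently accepts.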
Option3: python-dotenv — Make configurations as environment variables
Now we move to third-party libraries. So far, I have left out one type of configuration file: .env. The variables inside a .env file are loaded as environment variables by python-dotenv and can be accessed with os.getenv.
A .env file basically looks like this. The default path is the root folder of your project.
ENVIRONMENT=test
DEBUG=true
USERNAME=xiaoxu
PASSWORD=xiaoxu
HOST=127.0.0.1
PORT=5432
Read
It is extremely easy to use. With the override parameter, you can decide whether to override variables that already exist in the environment.
import os
from dotenv import load_dotenv
load_dotenv()
print(os.getenv('DEBUG'))
# true
load_dotenv(override=True)
# override existing variable in the environment
Validation
However, python-dotenv doesn’t validate the .env file. If you have a .env file like this and try to access DEBUG, you will get None without any exception.
# .env
ENVIRONMENT=test
DEBUG
# load.py
load_dotenv()
print('DEBUG' in os.environ.keys())
# False
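Because nothing fails when a variable is missing, it can help to check the required names explicitly right after loading. The helper below is a standard-library sketch, not part of python-dotenv; require_env is a hypothetical name, and a plain dict stands in for os.environ.

```python
import os


def require_env(names, environ=os.environ):
    """Return the requested variables, or fail listing every missing one."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}


# In a real application this would run right after load_dotenv().
# Here a dict plays the role of the environment: DEBUG is absent,
# just like after loading the .env file shown above.
fake_environ = {"ENVIRONMENT": "test"}
try:
    require_env(["ENVIRONMENT", "DEBUG"], environ=fake_environ)
except RuntimeError as error:
    message = str(error)
    print(message)
# Missing required environment variables: DEBUG
```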
Option4: Dynaconf — Powerful settings configuration for Python
Dynaconf is a very powerful settings library for Python that supports multiple file formats: YAML, JSON, INI, TOML and Python. It can automatically load a .env file and supports custom validation rules. In short, it covers pretty much all the functionality of the previous 3 options and even goes beyond that. For example, you can store an encrypted password and use a custom loader to decrypt it. It also integrates nicely with Flask, Django and pytest. I will not cover all the functionality in this article; for more details, please refer to the documentation.
Read
Dynaconf uses .env to find all the settings files and populates the settings object with their fields. If two settings files define the same variable, the value from the file loaded last wins.
# .env
# ROOT_PATH_FOR_DYNACONF="config/"
# SETTINGS_FILE_FOR_DYNACONF="['settings.ini', 'settings2.ini']"
from dynaconf import settings
print(settings["DB"])
# xiaoxu_database
Validation
One of the most interesting features to me is the custom validator. As mentioned before, configparser doesn’t validate INI files strictly enough, but this can be achieved with Dynaconf. In this example, I check whether certain keys exist in the file and whether a certain key has the correct value. If you read from a YAML or TOML file, which support multiple data types, you can even check whether a number is in a certain range.
# settings.ini
# [default]
# ENVIRONMENT = test
# DEBUG = True
# USERNAME = xiaoxu
# PASSWORD = xiaoxu
# HOST = 127.0.0.1
# PORT = 5432
# DB = xiaoxu_database
# [production]
# DEBUG = False
from dynaconf import settings, Validator

settings.validators.register(
    Validator('ENVIRONMENT', 'DEBUG', 'USERNAME', must_exist=True),
    Validator('PASSWORD', must_exist=False),
    Validator('DEBUG', eq='False', env='production'),
)
# Fire the validator
settings.validators.validate()
Integration with Pytest
Another interesting feature is the integration with pytest. The settings for unit testing are normally different from those of other environments. You can use FORCE_ENV_FOR_DYNACONF to let the application read a different section of your settings file, or use monkeypatch to replace a specific key-value pair in the settings.
import pytest
from dynaconf import settings


@pytest.fixture(scope="session", autouse=True)
def set_test_settings():
    settings.configure(FORCE_ENV_FOR_DYNACONF="testing")


def test_dynaconf(monkeypatch):
    monkeypatch.setattr(settings, 'HOST', 'localhost')
Refresh the config during Runtime
Dynaconf also supports reload(), which cleans and re-executes all the loaders. This is helpful if you want the application to reload the settings file during runtime. For example, the application could automatically reload the settings whenever the config file has been modified.
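The "reload when the file changes" idea can be sketched without Dynaconf by comparing the file's modification time on each access. ReloadingJsonConfig is a hypothetical class written for this illustration; in a Dynaconf application, the point where the file is re-read is where a call to reload() would go.

```python
import json
import os
import tempfile


class ReloadingJsonConfig:
    """Re-read a JSON file whenever its modification time changes."""

    def __init__(self, file_path):
        self.file_path = file_path
        self._mtime = None
        self._data = {}

    def get(self, key, default=None):
        mtime = os.path.getmtime(self.file_path)
        if mtime != self._mtime:  # first access, or file changed on disk
            with open(self.file_path, "r") as f:
                self._data = json.load(f)
            self._mtime = mtime
        return self._data.get(key, default)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "settings.json")
    with open(path, "w") as f:
        json.dump({"TIMEOUT": 10}, f)
    config = ReloadingJsonConfig(path)
    before = config.get("TIMEOUT")

    # Someone edits the file while the application is running
    with open(path, "w") as f:
        json.dump({"TIMEOUT": 30}, f)
    # Force a different timestamp in case both writes land in the same tick
    new_time = os.path.getmtime(path) + 1
    os.utime(path, (new_time, new_time))
    after = config.get("TIMEOUT")

print(before, after)
# 10 30
```

Polling the modification time is the simplest approach; a file-watching library would avoid the check on every access, at the cost of an extra dependency.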
Option5: Hydra - Simplify the development by dynamically creating a hierarchical configuration
The last option is much more than just a file loader. Hydra is a framework developed by Facebook for elegantly configuring complex applications.
Besides reading, writing and validating config files, Hydra also comes with a strategy to simplify the management of multiple config files, lets you override values through the command line interface, creates a snapshot of each run, and more.
Read
Here is the basic use of Hydra. +APP.NAME=hydra adds a new field to the config, while APP.NAME=hydra1.1 overrides an existing field.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_name="config")
def my_app(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    my_app()

# python3 source/hydra_basic.py +APP.NAME=hydra
# APP:
#   ENVIRONMENT: test
#   DEBUG: true
#   NAME: hydra
Validation
Hydra integrates nicely with @dataclass to perform basic validations such as type checking and read-only fields. But it doesn’t support the __post_init__ method for advanced value checking like the one described in my previous article.
from dataclasses import dataclass

import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import MISSING, OmegaConf


# @dataclass(frozen=True) would make the fields read-only
@dataclass
class MySQLConfig:
    driver: str = "mysql"
    host: str = "localhost"
    port: int = 3306
    user: str = MISSING
    password: str = MISSING


@dataclass
class Config:
    db: MySQLConfig = MISSING


cs = ConfigStore.instance()
cs.store(name="config", node=Config)
cs.store(group="db", name="mysql", node=MySQLConfig)


@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: Config) -> None:
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    my_app()
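One possible workaround for the missing __post_init__ support, sketched here with the standard library only, is to re-validate the loaded values in a plain dataclass of your own. MySQLSettings and validate_db_config are hypothetical names; in a Hydra application the input dict could come from something like OmegaConf.to_container(cfg.db), which is an assumption about how you would wire it in.

```python
from dataclasses import dataclass


@dataclass
class MySQLSettings:
    host: str = "localhost"
    port: int = 3306
    user: str = "root"

    def __post_init__(self):
        # Value checks that go beyond simple type checking
        if not 1 <= self.port <= 65535:
            raise ValueError(f"port must be in 1..65535, got {self.port}")
        if not self.user:
            raise ValueError("user must not be empty")


def validate_db_config(values):
    """Build the validated dataclass from a plain dict of config values."""
    return MySQLSettings(**values)


ok = validate_db_config({"host": "127.0.0.1", "port": 3306, "user": "xiaoxu"})

try:
    validate_db_config({"port": 70000})
    error_message = None
except ValueError as error:
    error_message = str(error)
    print(error_message)
# port must be in 1..65535, got 70000
```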
Config group
Hydra introduces a concept called config group. The idea is to group configs of the same type and choose one of them at execution time. For example, you can have a group “database” with one config for Postgres and another one for MySQL.
When it gets more complex, you might have a layout like this in your program (an example from the Hydra documentation):
├── conf
│   ├── config.yaml
│   ├── db
│   │   ├── mysql.yaml
│   │   └── postgresql.yaml
│   └── schema
│       ├── school.yaml
│       ├── support.yaml
│       └── warehouse.yaml
└── my_app.py
If you want to benchmark your application with different combinations of db and schema, you can run:
python my_app.py db=postgresql schema=school
More
Hydra supports parameter sweeps with --multirun, which launches multiple jobs with different config files from a single command. For instance, for the previous example, we can run:
python my_app.py schema=warehouse,support,school db=mysql,postgresql -m
Hydra then launches 6 jobs, one for each combination:
[2019-10-01 14:44:16,254] - Launching 6 jobs locally
[2019-10-01 14:44:16,254] - Sweep output dir : multirun/2019-10-01/14-44-16
[2019-10-01 14:44:16,254] - #0 : schema=warehouse db=mysql
[2019-10-01 14:44:16,321] - #1 : schema=warehouse db=postgresql
[2019-10-01 14:44:16,390] - #2 : schema=support db=mysql
[2019-10-01 14:44:16,458] - #3 : schema=support db=postgresql
[2019-10-01 14:44:16,527] - #4 : schema=school db=mysql
[2019-10-01 14:44:16,602] - #5 : schema=school db=postgresql
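The 6 jobs are simply the Cartesian product of the two lists passed on the command line. The snippet below is not Hydra code; it only reproduces the enumeration with itertools to show where the combinations and their order come from.

```python
from itertools import product

schemas = ["warehouse", "support", "school"]
dbs = ["mysql", "postgresql"]

# One job per combination, in the same order as the log above
jobs = [f"schema={schema} db={db}" for schema, db in product(schemas, dbs)]

for index, overrides in enumerate(jobs):
    print(f"#{index} : {overrides}")
```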
Conclusion
In this article, I’ve talked about configuration management in Python in terms of WHY, WHAT and HOW. Depending on the use case, a complex tool or framework isn’t always better than a simple package. No matter which one you choose, you should always think about its readability, its maintainability and how to spot errors as early as possible. In fact, a config file is just another type of code.
Source: Towards Data Science - Xiaoxu Gao
The Tech Platform