Python Development Guidelines
Table of contents
- Python resources
- Python flavors
- Python, pip, PyPI
- Virtual environments
- Deploying your code as part of a larger system
- Data classes and named tuples
- Type hints and mypy
- Unit testing
This document is greatly inspired by several other resources, the most notable being:
Google Python Style Guide
The Google Python Style Guide is a valuable reference for python developers: have a look at it.
The book Fluent Python by Luciano Ramalho (O'Reilly) is a good reference to get insights on how python works, learn to leverage its features, and get to write better python code. The second edition should be out in 2021.
Python is a language specification having several different implementations. Usually, when we talk of Python we refer to the most famous of its implementations: CPython. But others are available as well and it is good to know about them.
According to Wikipedia "CPython is the reference implementation of the Python programming language. Written in C and Python, CPython is the default and most widely used implementation of the language". It offers maximum compatibility with Python packages and C extension modules.
CPython is both a compiler and an interpreter: at run time it compiles Python files (.py) into intermediate bytecode files (.pyc) as they are first imported, and then interprets the latter, executing bytecode instructions in the CPython virtual machine.
Unlike Java, where compilation is performed beforehand, the compilation takes place at run time.
Overall, we say that CPython is interpreted.
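To make the compile-then-interpret flow more tangible, here is a minimal sketch that uses the built-in compile() function and the standard dis module. The source snippet and the function name add are made up for illustration; the exact opcodes listed depend on the Python version in use.

```python
import dis

# Hypothetical snippet, used only for illustration.
SOURCE = "def add(a, b):\n    return a + b\n"

# Step 1: compile the source text into a code object holding bytecode.
code_obj = compile(SOURCE, filename="<example>", mode="exec")

# Step 2: let the CPython virtual machine execute that bytecode.
namespace = {}
exec(code_obj, namespace)
add = namespace["add"]

# The function carries its own bytecode, which dis can list.
instruction_names = [ins.opname for ins in dis.get_instructions(add)]

print(type(add.__code__.co_code))  # the raw bytecode is a bytes object
print(instruction_names)           # opcode names vary between Python versions
print(add(2, 3))
```

For imported modules CPython performs the same compilation step implicitly and caches the result in a .pyc file.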
There are many alternative implementations; here we list only some of the most relevant.
- PyPy: a drop-in replacement for CPython that uses just-in-time (JIT) compilation to translate frequently executed Python code into native machine code at run time, performing optimization on the go. This leads to a significant speed-up for pure Python code (4.4x on average, according to https://speed.pypy.org/, but speed-ups of 50x have been reported on specific tasks).
Two main drawbacks are that CPython extensions (C/C++) either do not work or incur some overhead, and that PyPy does not support the syntax features of the most recent Python versions. Both facts limit compatibility with the Python ecosystem when using PyPy.
- Jython: written in Java, compiles Python code to Java bytecode and runs it on the JVM.
- IronPython: written in C#, compiles Python code to CLI code and runs it on the CLR, thus targeting Microsoft .NET Framework and Mono.
- Cython: a superset of Python language introducing additional C-inspired syntax aiming to reach C-like performance on execution.
Cython provides a compiler to translate Cython code into C/C++ code, which serves different purposes:
- Ease the creation of extension modules for CPython: Cython can wrap the C/C++ code it produces with Python interface code, so that it can be imported as a regular Python module.
- Convert whole Python programs to C/C++, which you can then compile into regular executable files. Note that the generated code still relies on the Python runtime library, so the interpreter is embedded in the executable rather than eliminated.
Python, pip, PyPI
From now on, when talking about Python we mean CPython unless otherwise specified. Python often comes pre-installed on your machine.
There are two major versions of Python: 2 and 3. Python 2 is legacy and you should choose Python 3; still, both versions may be installed on your machine. Check with python --version (python.exe on Windows): if you get a 2.X.Y version, then Python 3 is probably available under a separate command name (later examples in this guide use python3).
Python Package Index
Python has a rich ecosystem including tons of 3rd party packages you can install and use as dependencies in your code.
The Python Package Index, abbreviated as PyPI and also known as the Cheese Shop, is the official third-party software repository for Python. PyPI primarily hosts Python packages in the form of archives called sdists (source distributions) or precompiled "wheels." Each package is identified by its name, which must be globally unique. Unlike Java, Python does not use reversed Internet domain names: package names are plain (e.g. flask).
PyPI allows users to search for packages by keywords or by filters against their metadata. A single entry on PyPI can store, aside from just a package and its metadata, previous releases of the package, precompiled wheels (including dll/so files), as well as different forms for different operating systems and Python versions.
Note: the previous paragraphs are derived from Wikipedia
Pip Installs Packages
Python bundles the package manager pip which allows you to install and remove the packages that you need as dependencies in your code.
pip uses PyPI as the default source for packages and their dependencies.
If python is Python 2, use pip to install packages for it and pip3 to install packages for Python 3.
With pip you can:
- install a package (optionally at a specific version):
pip install flask==1.0.0
this will also recursively install any dependency of flask
- upgrade a package to a newer/newest version:
pip install -U flask
- list all packages in your virtual env (now including flask and all its dependencies):
pip list
- write all installed packages to a text file (now including flask and all its dependencies):
pip freeze > requirements.txt
- install all packages listed in a requirements file:
pip install -r requirements.txt
- remove a package:
pip uninstall flask
this will only remove flask: all its dependencies will stay
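If you need the list of installed packages from inside Python rather than from the shell, the standard importlib.metadata module (Python 3.8+) can give a rough approximation of what pip freeze prints. The sketch below is only an illustration: the function name frozen_requirements is made up, and pip itself applies extra logic (for example for editable installs) that is not reproduced here.

```python
from importlib import metadata
from typing import List


def frozen_requirements() -> List[str]:
    """Return 'name==version' strings for the distributions visible to this interpreter."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in frozen_requirements():
    print(line)
```

Run inside an activated virtual env, this lists only what is installed in that environment.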
You don't want to mess up your system's python installation by adding and removing libraries with pip for your projects: this is a guaranteed way to mess up relevant features in your machine depending on python. Don't do that. Also, you may need different versions of python and dependencies in different projects: the solution is to have a dedicated python environment for each of your projects, including python, pip, and the required dependencies. In Python, such environments are called virtual environments or virtual envs; you will create a dedicated one for each project.
There exist several tools that allow the creation of environments.
Let's start with Anaconda, a platform including a distribution of Python and R, aiming to make it simple to obtain full-featured environments for data science. Anaconda comes in different flavors (regular Anaconda and smaller Miniconda) and allows you to create complex virtual environments with many libraries in a simplified manner:
- lots of data science libraries are added by default in all environments, taking care of compatibility issues in transitive dependencies
- optimized versions (compared to the regular ones available on PyPI) of several libraries are used
- can create a virtual env for any python version even if you don't have it installed in your machine
The main drawback of Anaconda is that its licensing changed, and commercial use may now require a paid license (check the current terms). So if you want virtual envs for free, you'll have to use something different.
All other tools for creating virtual envs require the version of Python you want to base your virtual env on to be installed on your machine. Again, you don't want to mess with your default Python 3.7 just to be able to create a virtual env with Python 3.10.
PyEnv is a multiplatform tool that helps you install and switch between multiple versions of Python without impacting your system's Python installation. With PyEnv you can set a global Python version and then override it by specifying different versions for different folders.
It does not create virtual envs directly, but you can consider it a prerequisite to reduce the pain in virtual env creation.
Note: for Windows, you have to use the port pyenv-win
venv & virtualenv
venv and virtualenv are two command-line tools for creating virtual envs. venv is a simpler version of virtualenv and has been bundled with Python since version 3.3, so you don't need to install it. In contrast, virtualenv is faster and more extensible, can create envs for arbitrarily installed Python versions, can be upgraded separately from Python using pip, and can be used with Python versions lower than 3.3.
For example, using venv you can create a virtual environment based on your current python version:
python3 -m venv <path_to_new_virtual_environment>
and then activate it:
and finally deactivate:
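For reference, on POSIX shells a venv is typically activated with source <path_to_new_virtual_environment>/bin/activate (on Windows the script lives under Scripts\activate instead) and left again with the deactivate command. The same venv module can also be driven from Python code, which may be handy in setup scripts. The sketch below is a minimal illustration under stated assumptions: the helper name create_env and the folder name demo_env are made up, and with_pip=False is used only to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir: Path) -> Path:
    """Create a virtual environment and return the path of its pyvenv.cfg file."""
    # with_pip=False keeps the sketch fast and offline; the command line
    # `python3 -m venv` installs pip into the new environment by default.
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo_env")  # hypothetical environment name
    cfg_exists = cfg.exists()
    print(cfg.read_text())  # records the base interpreter the env was created from
```

The pyvenv.cfg file is what marks a folder as a virtual environment for the interpreter.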
Reproducible virtual environment
You can either create the virtual env in a subfolder of your project such as ./venv/<project_name> or gather all your virtual envs in a single dedicated folder.
In any case, you don't want to add the virtual env folder to your git repository since it may easily grow large even for small projects.
Still, you want to commit the required info to allow you to recreate the exact same virtual environment.
The easiest way to do this is to commit a requirements.txt in your repository with all the dependencies you need, so that they can be pip-installed in a new virtual env. You will also need to specify the Python version you are using, for example in your README.md, as this is not specified in requirements.txt.
However, this is not going to scale to a medium-large project, as you can easily end up in dependency hell. For example, whenever you pip-install package A, which has package B as a sub-dependency, B is installed/updated as well. The version chosen for B is compatible with A's requirements, but may disregard the fact that B is also a sub-dependency of C (another of your project's main dependencies), which may in turn limit B's admissible versions. So as you install A you may break the installation of C and vice versa. It is possible to handle this manually with pip, but it is complicated and error-prone.
You want a tool that automatically solves the dependency graph and determines admissible versions for each transitive dependency. Two popular choices are pipenv and poetry.
While the first is merely a tool to manage virtual environments, poetry also takes care of packaging Python projects (being an alternative to the default Python packaging framework, setuptools). There is a growing consensus that poetry improves dependency management over pipenv, resulting in a better user experience and environment reproducibility, so for new projects it is worth starting with poetry.
The benefits of adopting poetry go beyond virtual environment management; another main advantage is that it allows you to consolidate the configuration of many different development tools in a single TOML file. The details of poetry usage are beyond the scope of this document, but you can refer to the official documentation and the many tutorials available online.
Virtual environments in notebooks
At some point, you may want to use a specific virtual env as a kernel in a jupyter notebook. This is how to do it:
- activate the virtual environment
- wrap the environment as a kernel that can be used from jupyter. You can choose any kernel name (why not make it equal to the environment name?)
pip install jupyterlab
pip install ipykernel
ipython kernel install --user --name=<ker_name>
- run jupyter lab
then open the notebook file and assign the kernel
Module vs script
A Python file is a text file, usually with .py as its file extension. Semantically it can be either a script or a module. A script is a file where the business code is at root level, intended to be executed to produce a result. For example the script:
import os
os.makedirs("./bar", exist_ok=True)
print("Created folder bar")
creates a folder and prints that it did so.
Conversely, a module is a Python file that defines classes and/or functions for other Python files to use. It is intended to be imported by other Python files (either scripts or other modules). You don't expect a module to do work directly like a script, because that root-level code would be "silently" executed the first time it gets imported.
Usually, python applications have an entry point file. This is semantically more similar to a script, but usually the main code is put inside an if that guarantees that it gets executed only if the file is run directly (and does not if it gets imported by another python file). A hello world entry point file looks like this:
def main():
    # application main function, doing the real stuff
    ...

if __name__ == "__main__":
    main()
It is worth noting that the variable __name__ is initialized by Python for each Python file and it can assume the following values:
- "__main__" if the file is executed as a Python application, such as python -m my_file
- the module full name, if the file is imported by another file.
For example, assume you have the following structure in your project root:
main.py
a/b/inner.py
and assume that the content of the two files is:
# main.py
from a.b import inner

if __name__ == "__main__":
    inner.tell_me_your_name()

# inner.py
def tell_me_your_name():
    print(__name__)
Running python main.py will print a.b.inner, the full name of the imported module.
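The behaviour of __name__ can be checked without touching your project: the sketch below recreates the a/b/inner.py layout in a temporary folder and imports it. This is only an illustration; the temporary folder plays the role of the project root, and tell_me_your_name returns the name instead of printing it so that the value can be inspected.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Recreate the hypothetical layout: <root>/a/b/inner.py
    pkg_dir = Path(root) / "a" / "b"
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "inner.py").write_text(
        "def tell_me_your_name():\n    return __name__\n"
    )

    sys.path.insert(0, root)  # the temp folder plays the role of the project root
    importlib.invalidate_caches()
    try:
        inner = importlib.import_module("a.b.inner")
        imported_name = inner.tell_me_your_name()
    finally:
        sys.path.remove(root)
        # forget the temporary modules so the demo leaves no trace
        for mod in ("a.b.inner", "a.b", "a"):
            sys.modules.pop(mod, None)

print(imported_name)  # a.b.inner
print(__name__)       # "__main__" when this file is run directly
```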
Imports are the way you reuse the code of one python module (i.e. a python file) in another python file.
You do it using the reserved word import, which can be used in several different ways in combination with from (to reduce imported name length) and as (to create aliases for imported symbols).
You can import a package, a module, or a symbol within a module (a class, a function, or a global variable).
When you execute a file that imports a module, the imported module code is executed.
You can find a valuable description of the python import mechanism here.
Circular imports and deferred imports
You should avoid circular imports, as they are likely to result in an ImportError at runtime, often with little information to help you locate where the error is. You can use pycycle to find and fix circular imports.
Imports usually are placed at the beginning of the file, but you can defer them to wherever you want, even inside a function/method; in that case, the availability of the imported symbols is limited to the scope. Generally, that's not a good idea as it makes your code less readable, but sometimes it may help to overcome circular imports.
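As a minimal sketch of a deferred import, the function below imports the standard json module only when it is called; the function name to_json is made up for illustration. The imported name is local to the function body and is not visible at module level.

```python
def to_json(record: dict) -> str:
    """Serialize a record, importing json only when the function is called."""
    import json  # deferred import: the name json is local to this function

    return json.dumps(record, sort_keys=True)


print(to_json({"b": 2, "a": 1}))  # {"a": 1, "b": 2}
```

In a circular-import situation, moving one of the two imports inside the function that needs it delays it until both modules are fully initialized.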
Folder vs package
In your project, python files reside in a hierarchy of folders defining the namespace for each file, so that you can have two my_file.py in different paths and you can import both in a third file, using the full name for each and thus avoiding collisions.
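In Python, a regular package is just a folder that also contains an __init__.py file, which is executed when the package is imported. Since Python 3.3 a folder without __init__.py can also be imported (as a namespace package), but it cannot run any initialization code.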
Let's assume your project contains the files:
main.py
my_package/__init__.py
my_package/my_module.py
my_package/sub_folder/another_module.py
In main.py you can:
- import the package: import my_package, which executes my_package/__init__.py
- use my_module through the name of the already imported package, provided the submodule itself has been imported (e.g. with import my_package.my_module):
my_var = my_package.my_module.my_func()
- similarly use another_module:
a_var = my_package.sub_folder.another_module.a_func()
Also, in case my_package/__init__.py imports another_module like this:
from my_package.sub_folder import another_module
then, once you import my_package, you can use another_module as if it was just inside my_package:
a_var = my_package.another_module.a_func()
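The package behaviour described above can be tried out in isolation. The following sketch builds the same file layout in a temporary folder and imports it; the bodies of my_func and a_func are made up, and the __init__.py content is the hypothetical one that re-exports another_module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical file contents for the layout discussed in the text.
FILES = {
    "my_package/__init__.py": "from my_package.sub_folder import another_module\n",
    "my_package/my_module.py": "def my_func():\n    return 'from my_module'\n",
    "my_package/sub_folder/another_module.py": "def a_func():\n    return 'from another_module'\n",
}

with tempfile.TemporaryDirectory() as root:
    for rel_path, content in FILES.items():
        target = Path(root) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    sys.path.insert(0, root)  # the temp folder plays the role of the project root
    importlib.invalidate_caches()
    try:
        import my_package            # runs my_package/__init__.py
        import my_package.my_module  # makes the submodule reachable as an attribute

        my_var = my_package.my_module.my_func()
        a_var = my_package.sub_folder.another_module.a_func()
        # The re-export in __init__.py exposes the very same module object.
        same_module = my_package.another_module is my_package.sub_folder.another_module
    finally:
        sys.path.remove(root)
        for name in [m for m in sys.modules if m == "my_package" or m.startswith("my_package.")]:
            del sys.modules[name]

print(my_var, a_var, same_module)
```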
Looking for modules
It's important to understand how the Python interpreter looks for modules to import.
According to python documentation:
When a module named spam is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named spam.py in a list of directories given by the variable sys.path. sys.path is initialized from these locations:
- The directory containing the input script (or the current directory when no file is specified).
- PYTHONPATH (a list of directory names, with the same syntax as the shell variable PATH).
- The installation-dependent default.
Note: on file systems that support symlinks, the directory containing the input script is calculated after the symlink is followed. In other words, the directory containing the symlink is not added to the module search path.
After initialization, Python programs can modify sys.path. The directory containing the script being run is placed at the beginning of the search path, ahead of the standard library path. This means that scripts in that directory will be loaded instead of modules of the same name in the library directory. This is an error unless the replacement is intended. See section Standard Modules for more information.
So by default modules and packages are searched for in a list of folders that includes:
- the folder of the entry point file you are executing (usually your project root folder)
- the folders where the python standard library is located for the current virtual environment (this is also the location where pip installs additional libraries).
So, when writing your import statements, you only have to care to express paths of project resources relative to the folder containing the project entry point. In case your code needs to be executed from different entry points, located in different folders (e.g. if your code gets copied elsewhere to be deployed as part of a larger system), you can tweak your imports in at least 3 different ways:
- exporting the PYTHONPATH variable, adding the base folder your import statements expect, before launching the application
- in your Python entry point, import sys and then, before all the other import statements, append the folder they are based on to sys.path. For example, it could be the folder containing the entry point file itself. You can get it with:
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
- in case there are just a couple of possible base folders for execution, you could duplicate your import statements, providing two versions of each. Put the first block in a try: and the second block in an except ImportError:. This is just as awful as it looks, but it is sometimes effective in convincing the IDE to see the dependencies and stop flagging lots of nonexistent errors.
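The second option (extending sys.path from the entry point) can be wrapped in a small helper so that the folder is added only once. This is a sketch under stated assumptions: the helper name add_base_folder is made up, and a hypothetical main.py path is used in place of __file__ so that the example also runs in an interactive session.

```python
import os
import sys


def add_base_folder(entry_point_file: str) -> str:
    """Append the folder containing the given file to sys.path (once) and return it."""
    base_dir = os.path.dirname(os.path.realpath(entry_point_file))
    if base_dir not in sys.path:
        sys.path.append(base_dir)
    return base_dir


# In a real entry point you would pass __file__ here; a hypothetical path is
# used so that the sketch does not depend on how it is launched.
base = add_base_folder(os.path.join(os.getcwd(), "main.py"))
print(base in sys.path)  # True
```

Call the helper before any import statement that depends on the added folder.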
When you specify imports of project files relative to the base folder in sys.path, they are called absolute imports.
You can also have relative imports, where the paths are relative to the current file (the one containing that import statement).
You can tell a relative import by its path starting with a dot. All imports in the previous example are absolute. Relative imports look like this:
from . import bar  # importing module/package bar, residing in the same folder as the current file
from ..baz import foo  # importing baz.foo starting from the parent dir of the current file
from ... import a_module  # importing a module located two folders up from the current file
Relative imports cannot be used in scripts and entry point files as they rely on the module's __name__, which is always "__main__" for the file being executed.
You can even mix relative and absolute imports in the same file (but don't do this).
Import best practices
The import mechanism may be difficult to master. To reduce the problems, it is a good idea to follow the Google Python Style Guide, quoting and adapting:
- Use import x for importing packages and modules.
- Use from x import y where x is the package prefix and y is the module name with no prefix.
- Use from x import y as z if two modules named y are to be imported or if y is an inconveniently long name.
- Use import y as z only when z is a standard abbreviation (e.g., np for numpy).
- Use import statements for packages and modules only, not for individual classes or functions. Imports from the typing_extensions module and the six.moves module are exempt from this rule. In practice, though, importing single classes/functions is useful as it makes your code more compact.
- Use absolute imports. Import each module using the full-path name of the module.
- Do not use relative imports. Even if the module is in the same package, use the full package name. This helps prevent unintentionally importing a package twice and makes the import more readable.
Deploying your code as part of a larger system
In the ideal case, you have a repository containing a project which is either an application or a library for other applications to use. In the first case, your code can just be executed as-is, while in the second you will probably package it to a python wheel for distribution. But often things are nastier. It may be the case that your code is to be deployed as a part of a larger system. Maybe it gets executed by a CI/CD to perform integration tests, and then brought to a production environment. You may not even have visibility or control over the different conditions your code will be used in (which may encompass: file system structure, virtual environments, env variables, hardware/software/network resources).
It's impossible to provide a general solution to address all the problems you may face in similar conditions, but it may be worth listing some alternatives which affect how your code gets to be relocated elsewhere out of your repository. Let's review some alternatives:
Just copy the code
Of course, this is a bad idea and you should not do this. If you just copy your source code into a subfolder of a larger code base, it is really difficult to maintain consistent version control (unless you commit it as part of the larger code base, but then, why have the smaller repo in the first place?).
You could configure the larger codebase to use your repo as a git subrepo. This adds burden in git operations but has the significant advantage of not cluttering the larger repo with all your code, and at the same time, it provides a way to lock specific versions of the larger codebase to a specific version of your project.
Bundle your project as a library and pip-install it
This is probably the best solution you can have; you can use either setuptools or poetry to build your project. It's the only one that eliminates the eventuality of import errors due to code relocation as your project will be pip-installed in the target virtual environment and imported from there. It's also very easy to tie the larger system to use a specific version of your code by just versioning the artifacts and referring to a specific version in the outer requirements file. There are also some disadvantages:
- It gets slightly more difficult to try and fix your code in the integrated system as it is not co-located with the other code anymore, but that's a minor issue.
- If the process of building your code into an artifact is performed by a CD pipeline, you are subject to its time and availability: iterations might get slower.
- If your code gets deployed to some private python index you have the additional complexity and costs of setting up and maintaining such a service (e.g. artifactory).
Data classes and named tuples
When you handle records of data, each record made of the same fields (possibly having different types), in python it is tempting to create your record as a list, tuple or dict. This is generally a bad practice because your code will be more difficult to read and understand:
- in tuples and lists, you reference elements by position, so the semantic of each element is not clear
- dicts improve over this, as the key of each element makes them descriptive...
- but still, dicts can contain an arbitrary number of elements, with arbitrary keys (not only strings: the only requirement on keys is for them to be hashable).
Dicts are therefore much more flexible than a record data structure needs to be, so it is better not to use them to store records.
Python offers two kinds of types to create record objects, depending on whether you want them to be mutable or immutable. These are:
- Data classes provide a simplified syntax to declare classes that represent mutable data objects. The dataclasses module is part of the Python standard library, you just need to import it.
- Named tuples can be defined via a factory function and then instantiated as regular classes. They are tuple-derived types where each element has a name, so they are a good fit for immutable data objects (also providing hashability).
Unfortunately, serialization and deserialization (via json or pickle/dill) of both data classes and named tuples are not straightforward, so you may end up implementing some custom code to handle them properly.
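The sketch below shows both kinds of record side by side, together with one possible way to round-trip them through JSON by converting to plain dicts first. The record types Measurement and Point and their fields are made up for illustration; nested or non-JSON-native fields would need extra handling that is not shown here.

```python
import json
from collections import namedtuple
from dataclasses import asdict, dataclass


@dataclass
class Measurement:  # hypothetical mutable record
    sensor: str
    value: float
    unit: str = "C"


# Hypothetical immutable record, created with the namedtuple factory function.
Point = namedtuple("Point", ["x", "y"])

m = Measurement(sensor="s1", value=21.5)
m.value = 22.0  # data classes are mutable by default

p = Point(1.0, 2.0)
lookup = {p: "first point"}  # named tuples are hashable, so they can be dict keys

# Both kinds of record can be turned into plain dicts before JSON serialization.
m_json = json.dumps(asdict(m))
p_json = json.dumps(p._asdict())

# Deserialization feeds the decoded dict back to the constructor.
m_back = Measurement(**json.loads(m_json))
p_back = Point(**json.loads(p_json))
print(m_back, p_back)
```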
Type hints and mypy
Python is a dynamically typed language: the type of a variable is determined at runtime and can change during the execution (you can reassign a variable with an object of arbitrary type). This allows for maximum flexibility, but makes your code much more error-prone if compared to statically typed languages where the compiler performs the type checking for you and ensures all types match.
In Python, all this should be handled via carefully crafted code documentation to foster awareness. Still, it is quite difficult to make a docstring readable and exhaustive when corner cases are concerned.
That's why type hints are a best practice in python development. Type hints are type annotations you can add to your python code. They are ignored by the interpreter at run time, but make your code more readable and can be analyzed by type checkers. Type hints are optional, so you can annotate your source code even just partially (maybe you won't annotate private functions and local variables), but you should provide type hints in public functions and methods (both on parameters and return value).
You can set up your IDE to perform live type checking based on type hints, and you can make explicit use of the mypy type checker (either manually or as a pre-commit hook). You should do both.
Writing type hints
The typing module of the standard library provides support for type hints.
- annotate simple types: int, float, str, None
- annotate mappings and sequences with no further details: dict, list, set, tuple
- annotate mappings and sequences with additional details by importing classes from typing: Mapping, Sequence, Dict, List, Tuple (e.g. Dict[str, float] for a dict with string keys and float values)
- prefer more generic types (e.g. Mapping, Sequence) in function parameters and more specific types for return values (e.g. Dict, List)
- for functions that do not return any value, annotate the return type with None
- for functions that do not return at all (e.g. always raise an exception) use NoReturn
- use typing.Any to allow for any type
- use Union[Type1, Type2] to specify that either of the listed types is accepted
- use Callable[[Arg1Type, Arg2Type], ReturnType] to annotate a parameter/return value/variable being a function
- use the TypeVar factory function to declare a type variable, to parametrize generics.
- use the NewType function to create new derived types.
- use type aliases to simplify complex type signatures, e.g.
from collections.abc import Sequence

ConnectionOptions = dict[str, str]
Address = tuple[str, int]
Server = tuple[Address, ConnectionOptions]

def broadcast_message(message: str, servers: Sequence[Server]) -> None:
    ...
- (seldom) use the cast function from typing to force a specific type on a variable from a certain point of the code on, to solve type checking issues (when mypy finds errors and you have no time to invest in properly fixing the type hints).
- use TypedDict to replace plain dicts if the set of keys is predetermined. Do this only in preexisting codebases making large use of dicts: for new projects prefer data classes and named tuples
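The following sketch puts several of the items listed above into one small module. All function and type names (average, group_by_key, describe, fail, UserId, Number) are made up for illustration; the annotations have no effect at run time and only matter to a type checker such as mypy.

```python
from typing import Callable, Dict, List, Mapping, NewType, NoReturn, Sequence, TypeVar, Union, cast

Number = Union[int, float]       # type alias for "int or float"
UserId = NewType("UserId", int)  # distinct type for the checker, plain int at run time
T = TypeVar("T")                 # type variable used to write generic functions


def average(values: Sequence[Number]) -> float:
    """Generic parameter type (Sequence): lists and tuples are both accepted."""
    return sum(values) / len(values)


def group_by_key(items: Sequence[T], key: Callable[[T], str]) -> Dict[str, List[T]]:
    """Specific return type (Dict, List): callers know exactly what they get."""
    groups: Dict[str, List[T]] = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return groups


def describe(scores: Mapping[UserId, Number]) -> None:
    """Returns nothing, hence the None annotation."""
    for user_id, score in scores.items():
        print(f"user {user_id}: {score}")


def fail(message: str) -> NoReturn:
    """Never returns normally: it always raises."""
    raise RuntimeError(message)


raw: object = "42"
text = cast(str, raw)  # no run-time effect, only tells the checker to treat raw as str

print(average((1, 2, 3.0)))
print(group_by_key(["apple", "avocado", "banana"], key=lambda word: word[0]))
describe({UserId(7): 9.5})
```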
- Install mypy in your virtual environment with pip install mypy, or better, add it as a developer dependency in your requirements files / setup.py / poetry toml file
- If using Python < 3.8, also install the package typing-extensions to get the backport of the latest typing features
- If using pre-commit hooks you may consider adding mypy to them
- If using vscode, configure it to perform type checking via mypy: add the entry "python.linting.mypyEnabled": true to the .vscode/settings.json file under your project (create it if it does not exist). Type checking errors will be shown in the problems panel.
- You can specify options for vscode to run mypy in a setup.cfg file in the root of your project: add a section for mypy in it, e.g.:
[mypy]
ignore_missing_imports = True
allow_redefinition = True
- You can also run mypy manually from a terminal on source code and test code. e.g.:
python -m mypy ./src --allow-redefinition --ignore-missing-imports
Python's default unit testing framework is implemented in the unittest module of the standard library. Despite this, the de facto standard for unit testing in Python is pytest. You can pip-install it and then use it as a command-line tool or configure your IDE to use it.
Some reasonable guidelines to keep your tests ordered:
- Put your test files in the tests/ folder
- Put the test for the file src/a/b/c.py in the file tests/a/b/test_c.py
- Put extra files required by
- To test the function my_func() create a function test_my_func() (or as many as you need, adding suffixes to the function name)
- To test the method my_method() of the class MyClass, create a function test_my_class_my_method() (or as many as you need, adding suffixes to the function name); inside it instantiate MyClass, call my_method() on the instance and assert based on the outcome
- Use assert for basic tests:
from a.b.c import my_func

def test_my_func():
    res = my_func()
    assert isinstance(res, str), "my_func should return a string"
    assert res == "hello", "my_func should return 'hello'"
- Refer to the documentation of pytest for testing more advanced conditions such as values being almost equal or functions raising a specific exception
- If testing functions depending on AWS services via boto3, use moto to mock the services you need
- Run pytest from your project root as
python -m pytest -ra tests/
- To run a specific test pass the full filename:
python -m pytest -ra tests/a/b/test_c.py
- You can parametrize your test functions for pytest to map them over a list of (input, output) tuples.
import pytest
from my_square import my_square  # assuming my_square() lives in a module of the same name

data = (
    (0, 0),
    (1, 1),
    (2, 4),
    (3, 9),
)

@pytest.mark.parametrize('n, n2', data)
def test_my_square(n, n2):
    assert my_square(n) == n2
- Similarly, you can use pytest decorators to exclude specific tests based on determined conditions or group your tests in different logical sets and execute only a specific group (based on command-line options)
- Install and use pytest-cov to monitor your test coverage.
The most standard way to create consistent documentation in Python is to use the tool Sphinx. With Sphinx you write a few documentation files (such as your documentation entry point) using the reStructuredText markup language, whereas most of the documentation is built from the docstrings you put in the code. In Python there are several popular formats for docstrings, the most notable being:
- reST format, the official one
- google format, more concise than the previous one
- numpydoc format, based on google format
Since all these formats are supported by sphinx, just pick the one your team likes the most and then be consistent in using it.
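As an illustration of what such a docstring looks like, here is a small function documented in the google format. The function scale and its parameters are made up for this example; the section names (Args, Returns, Raises) are the ones the google format uses.

```python
from typing import Iterable, List


def scale(values: Iterable[float], factor: float = 2.0) -> List[float]:
    """Multiply every value by a constant factor.

    Args:
        values: Numbers to scale.
        factor: Multiplier applied to each value.

    Returns:
        A list with the scaled values, in the original order.

    Raises:
        TypeError: If an element does not support multiplication by ``factor``.
    """
    return [value * factor for value in values]


print(scale([1, 2, 3]))  # [2.0, 4.0, 6.0]
```

The type information lives in the annotations, so the docstring does not need to repeat it.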
To configure your project for sphinx, you can refer to this tutorial.
In your sphinx's conf.py you may want to use the following:
# useful extensions to provide sphinx with the capability of parsing docstrings
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
    "sphinx_autodoc_typehints",
    "sphinx.ext.autosummary",
]

# to build html documentation using the popular ReadTheDocs template
html_theme = "sphinx_rtd_theme"
The most basic debugging tool in Python is the print(...) function, which prints a line to the standard output. Despite being trivial, it will often serve you well enough.
The second step is the proper logging system you should always have in your projects. Python offers you a log4j-like logging framework via the logging module. So you can:
- configure it either via code or configuration file
- direct your log on console or files
- have rolling handling for log files
- have hierarchical handling of loggers and appenders
- choose the proper log level for each of your log commands
This is often the most convenient way to get insights into the runtime behavior of multi-threaded/multi-process systems, or of systems where you don't have a console available (either GUI-based application or background processes).
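A minimal sketch of a code-based logging configuration follows, covering a console handler, a rolling file handler and the logger hierarchy. The logger names myapp and myapp.db, the log format, the size limits and the temporary log folder are all assumptions chosen for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def build_logger(log_path: str) -> logging.Logger:
    """Configure a hypothetical 'myapp' logger writing to console and to a rolling file."""
    logger = logging.getLogger("myapp")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicated handlers if called twice

    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    console = logging.StreamHandler()  # writes to stderr by default
    console.setLevel(logging.INFO)
    console.setFormatter(formatter)

    # Start a new file after about 1 MB, keeping at most 3 old files.
    rolling = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=1_000_000, backupCount=3, encoding="utf-8"
    )
    rolling.setLevel(logging.DEBUG)
    rolling.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(rolling)
    return logger


log_file = os.path.join(tempfile.mkdtemp(), "myapp.log")
build_logger(log_file)

# A child logger has no handlers of its own: its records propagate to the
# handlers of "myapp" through the dot-separated hierarchy.
child = logging.getLogger("myapp.db")
child.debug("connection opened")  # below INFO: reaches the file only
child.warning("slow query")       # reaches both console and file

for handler in logging.getLogger("myapp").handlers:
    handler.flush()
```

In application modules the usual pattern is logging.getLogger(__name__), so that the logger hierarchy mirrors the package hierarchy.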
For a more full-fledged debugging experience you can rely on your IDE, which provides debug configurations, breakpoints, step-by-step execution, and so on.
The professional version of PyCharm supports remote debugging as well.
When you operate in a textual environment and need to debug code that you can edit and you need something more powerful than just printing/logging, you can use the python debugger pdb manually.
The most basic usage is usually powerful enough:
- import pdb in a Python file
- call pdb.set_trace() later in the file to set up a break at that very point
- at runtime the program will stop, providing you a Python console to interact with variables and functions
- type continue in the console to resume the execution (until the next set_trace gets hit)
Since in cpython an application, or one of its dependencies, can rely on C/C++ extensions, the extension may try to access an invalid memory location.
You cannot handle this event with Python try/except. The application will just crash, printing something like "Segmentation fault (core dumped)".
To try and isolate the mischievous extension you can use the faulthandler module. This will print to console a stack trace (limited to the python perimeter) when the segmentation fault occurs, even if it happens in a secondary thread.
Add these lines to your application entry point:
import faulthandler
import sys

faulthandler.enable(file=sys.stderr, all_threads=True)
To find the point where the fault happens in C/C++ code you have to inspect the core dump file. It's a binary file named "core" that is produced in the folder from which you are running the Python app (or under /var/crash/ on Linux). Since it is not created by default, you have to enable it. On a Linux machine try the following:
sudo vim /proc/sys/fs/suid_dumpable
and overwrite the default value 2 with 1
sudo service apport start
ulimit -c unlimited
this is not permanent: you have to run it in every new terminal you open.
Instructions for the actual inspection of a core file are beyond the scope of this document.