Basics

Variables

To check if a local variable is defined, we can use the locals function:

if 'my_variable' in locals():
    print('Variable exists')

Conditions and boolean context

Comparison operators

Python uses the standard set of comparison operators (==, !=, <, >, <=, >=).

They are functionally similar to C++ operators: they can be overloaded and the semantic meaning of == is equality, not identity (in contrast to Java).

Automatic conversion to bool

Unlike in other languages, any expression can be used in a boolean context in Python, as there are rules for converting any type to bool. The following statement is valid, for example:

s = 'hello'
if s: 
    print(s)

The code above prints 'hello', as the variable s evaluates to True.

Any object in Python evaluates to True, with the exception of:

  • False
  • None
  • numerically zero values (e.g., 0, 0.0)
  • standard library types that are empty (e.g., empty string, list, dict)

The automatic conversion to bool in boolean context has some counterintuitive consequences. The following conditions are not equivalent:

s = 'hello'

if s: # s evaluates to True

if s == True: # the result of s == True is False, and False evaluates to False

Checking the type

To check the exact type:

if type(<VAR>) is <TYPE>:
# e.g.
if type(o) is str:

To check the type in the polymorphic way, including the subtypes:

if isinstance(<VAR>, <TYPE>):
# e.g.
if isinstance(o, str):

Built-in data types

Numbers

Python has the following numeric types:

  • int - integer
  • float - floating point number

The int type is unlimited, i.e., it can represent any integer number. The float type is limited by the machine precision, i.e., it can represent only a finite number of real numbers.

Check if a float number is integer

To check whether a float number is an integer, we can use the is_integer method:
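For example:

```python
x = 3.0
y = 3.5
x_is_int = x.is_integer()  # True: 3.0 has no fractional part
y_is_int = y.is_integer()  # False
```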

Check if a number is NaN

To check whether a number is NaN, we can use the math.isnan function or the numpy.isnan function:
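A minimal example with the standard library variant:

```python
import math

is_nan = math.isnan(float('nan'))  # True
is_num = math.isnan(1.5)           # False
```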

Rounding

To round a number, use the round function.

For rounding up, use the math.ceil function.

For rounding down, use the math.floor function.
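For example:

```python
import math

nearest = round(3.7)    # 4
up = math.ceil(3.2)     # 4
down = math.floor(3.8)  # 3
tie = round(2.5)        # 2 - note: round() uses banker's rounding for ties
```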

Strings

Strings in Python can be enclosed in single or double quotes (equivalent). The triple quotes can be used for multiline strings.

String formatting

The string formatting can be done in several ways:

  • using the f prefix to string literal: f'{<VAR>}'
  • using the format method: '{}'.format(<VAR>)

Each variable can be formatted individually; for that, Python has a string formatting mini-language.

The format is specified after the : character (e.g., f'{47:4}' sets the width of the number 47 to 4 characters). Most of the format specifiers have default values, so we can omit them (e.g., f'{47:4}' is equivalent to f'{47:4d}').

The following are the most common options:
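A few of the most commonly used specifiers (a non-exhaustive illustration), shown with f-strings:

```python
n = 47
pi = 3.14159

width = f'{n:6}'      # '    47': minimum width 6, numbers are right-aligned
padded = f'{n:06}'    # '000047': zero-padded
left = f'{n:<6}'      # '47    ': left-aligned (> for right, ^ for center)
prec = f'{pi:.2f}'    # '3.14': two decimal places
sep = f'{1234567:,}'  # '1,234,567': thousands separator
```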

To use the character { and } in the string, we have to escape them using double braces: {{ and }}.

String methods

  • capitalize: capitalize the first letter of the string
  • lower: convert the string to lowercase
  • upper: convert the string to uppercase
  • strip: remove leading and trailing whitespace
  • lstrip: remove leading whitespace
  • rstrip: remove trailing whitespace

Enumerations

For enumerations, we can use the enum module. The basic syntax is:

from enum import Enum

class MyEnum(Enum):
    VALUE1 = 1
    VALUE2 = 2
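Members can then be accessed by name, and looked up by value:

```python
from enum import Enum

class MyEnum(Enum):
    VALUE1 = 1
    VALUE2 = 2

member = MyEnum.VALUE1
name = member.name    # 'VALUE1'
value = member.value  # 1
by_value = MyEnum(2)  # lookup by value: MyEnum.VALUE2
```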

Collections and generators

Python has several built-in data structures, most notably list, tuple, dict, and set. These are less efficient than comparable structures in other languages, but they are very convenient to use.

Also, there is a special generator type. It does not store the data; it is only a convenient way to access data generated by some function.

Generator

Python wiki

Generators are mostly used in iteration; we can iterate over them the same way as lists.

To get the first item of the generator, we can use the next function:

g = (x for x in range(10))
first = next(g) # 0

To create a generator function (a function that returns a generator), we can use the yield keyword. The following function returns a generator that yields the numbers 1, 2, and 3:

def gen():
    yield 1
    yield 2
    yield 3

The length of a generator is not known in advance. To get the length, we have to iterate the generator first, for example using len(list(<generator>)).

Tuple

Tuples are meant to store a fixed sequence of values. They are immutable.

The tuple literal is a comma-separated list of values in round braces:

t = (1, 2, 3)

Dictionary

Official Manual

Dictionaries are initialized using curly braces ({}) and the : operator:

d = {
    'key1': 'value1',
    'key2': 'value2',
    ...
}

Two dictionaries can be merged using the | operator (Python 3.9+):

d3 = d1 | d2 

Set

Documentation

Sets are initialized using curly braces ({}) or the set function:

s = {1, 2, 3}
s = set([1, 2, 3])

To add elements to the set, we use either the add for a single element or the update for multiple elements. In both cases, a union of the set and the new elements is computed, i.e., no exception is raised if an element is already in the set.
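For example:

```python
s = {1, 2, 3}
s.add(3)             # already present: no error, set unchanged
s.add(4)
s.update([4, 5, 6])  # union with the new elements
# s is now {1, 2, 3, 4, 5, 6}
```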

Comprehensions

In addition to literals, Python has a convenient way of creating basic data structures: the comprehensions. The basic syntax is:

<struct var> = 
    <op. brace> <member var expr.> for <member var> in <iterable><cl. brace>

As for literals, we use square braces ([]) for lists, curly braces ({}) for sets, and curly braces with colons for dictionaries. In contrast, we get a generator expression when using round braces (()), not a tuple.

We can also use the if keyword to filter the elements:

a = [it for it in range(10) if it % 2 == 0] # [0, 2, 4, 6, 8]

Sorting

Official Manual

For sorting, you can use the sorted function.

Instead of using comparators, Python has a different concept of key functions for custom sorting. The key function is a function that is applied to each element before sorting. For any expected object, the key function should return a value that can be compared.
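For example, sorting strings case-insensitively by passing str.lower as the key function:

```python
words = ['banana', 'Apple', 'cherry']
result = sorted(words, key=str.lower)  # compare lowercased values
# result == ['Apple', 'banana', 'cherry']
```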

Complex sorting using tuples

If we need to apply some complex sorting, we can use tuples as the key function return value. Tuples have comparison operators defined; the implementation is as follows:

  • elements are compared one by one
  • on first non-equal element, the comparison result is returned

This way, we can implement a complex sorting that would normally require several conditions by storing the condition results in the tuple.
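For example, sorting people by age and, for equal ages, by name:

```python
people = [('Bob', 30), ('Alice', 25), ('Carol', 30)]
# key returns (age, name): ages are compared first, names break ties
result = sorted(people, key=lambda p: (p[1], p[0]))
# [('Alice', 25), ('Bob', 30), ('Carol', 30)]
```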

Slices

Many Python data structures support slicing: selecting a subset of elements. The syntax is:

<object>[<start>:<end>:<step>]

The start is inclusive, and the end is exclusive.

The step is optional and defaults to 1. The start and end are also optional, defaulting to the beginning and the end of the sequence.

Instead of omitting the start and end, we can use the None keyword:

a = [1, 2, 3, 4, 5]
a[None:3] # [1, 2, 3]

Sometimes, it is not possible to use the slice syntax:

  • when we need to use a variable for the step or,
  • when the object uses the slice syntax for something else, e.g., for selecting columns in a Pandas dataframe.

In such cases, we can use the slice object:

a[0:10:2] 
s = slice(0, 10, 2)
a[s] # equivalent

Here, the parameters can be omitted as well. We can select everything by using slice(None), which is equivalent to slice(None, None, None).

Copying collections

If we copy a complex collection (e.g., a list of dictionaries), we typically want to create a deep copy so that the original collection is not modified. We can use the copy module for that:

import copy

a = [{'a': 1}, {'b': 2}]
b = copy.deepcopy(a)

Date and time

Python documentation

The base object for date and time is datetime.

datetime construction

The datetime object can be directly constructed from the parts:

from datetime import datetime

d = datetime(2022, 12, 20, 22, 30, 0) # 2022-12-20 22:30:00

The time part can be omitted.

We can parse a datetime from a string using the strptime function:

d = datetime.strptime('2022-05-20 18:00', '%Y-%m-%d %H:%M')

For all possible time formats, check the strftime cheatsheet

Accessing the parts of datetime

The datetime object has the following attributes:

  • year
  • month
  • day
  • hour
  • minute
  • second

We can also query the day of the week using the weekday() method. The day of the week is represented as an integer, where Monday is 0 and Sunday is 6.

Intervals

There is also a dedicated object for time intervals named timedelta. It can be constructed from parts (seconds to days); all parts are optional.

We can obtain a timedelta by subtracting a datetime from another datetime:

d1 = datetime.strptime('2022-05-20 18:00', '%Y-%m-%d %H:%M')
d2 = datetime.strptime('2022-05-20 18:30', '%Y-%m-%d %H:%M')
interval = d2 - d1 # 30 minutes

We can also add a timedelta object to, or subtract it from, a datetime object:

from datetime import datetime, timedelta

d = datetime.strptime('2022-05-20 18:00', '%Y-%m-%d %H:%M')
interval = timedelta(hours=1)
d2 = d + interval # '2022-05-20 19:00'

Converting to Unix timestamp

To convert a datetime object to unix timestamp, we can use the timestamp method. It returns the number of seconds since the epoch (1.1.1970 00:00:00). Note however, that the timestamp is computed based on the datetime object's timezone, or your local timezone if the datetime object has no timezone information.
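For example, with an explicit UTC timezone the result does not depend on the local timezone:

```python
from datetime import datetime, timezone

d = datetime(1970, 1, 1, 1, 0, 0, tzinfo=timezone.utc)
ts = d.timestamp()  # 3600.0: one hour after the epoch
```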

Named tuples

Apart from the standard tuple, Python has a named tuple class that can be created using the collections.namedtuple function. In a named tuple, each member has a name and can be accessed using the dot operator:

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)
print(p.x) # 1

Functions

Argument unpacking

If we need to conditionally execute a function with different sets of parameters (assuming the function has optional/default parameters), we can avoid multiple function calls inside the branching tree by using argument unpacking.

Suppose we have a function with three optional parameters: a, b, c. If we skip only the last n parameters, we can use a list for the parameters and unpack it using *:

def call_me(a='', b=False, c=0):
    ...

l = ['param A', True]
call_me(*l) # calls the function with a = 'param A' and b = True

If we need to skip some parameters in the middle, we have to use a dict and unpack it using **:

d = {'c': 142}
call_me(**d) # calls the function with c = 142

String formatting

To format Python strings, we can use the format method of the string or the equivalent f-string:

a = 'world'
message = "Hello {}".format(a)
message = f"Hello {a}" # equivalent

If we need special formatting for a variable, we can specify it after the : character, as in the following example, which pads the number from the left:

uid = 47
message = "Hello user {:0>4d}".format(uid) # "Hello user 0047"
message = f"Hello user {uid:0>4d}" # equivalent

More formatting options can be found in the Python string formatting cookbook.

Classes

Official Manual

Classes in Python are defined using the class keyword:

class MyClass:
    ...

Unlike in other languages, we only declare the function members; other members are defined in the constructor or even later.

Constructor

The constructor is a special function named __init__. Usually, non-function members are declared in the constructor:

class MyClass:
    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.c = 0
        self.d = None

Check if an object contains a member

To check whether an object contains a member, we can use the hasattr function:

if hasattr(obj, 'member'):
    ...

Constructor overloading

Python does not support function overloading, including for the constructor. That is unfortunate, as default arguments are a less powerful mechanism. For other functions, we can substitute overloading with a function with a different name. However, for the constructor, we need a different approach.

The cleanest way is to use a class method as a constructor. Example:

class MyClass:
    def __init__(self, a, b = 0):
        self.a = a
        self.b = b
        self.c = 0
        self.d = None

    @classmethod
    def from_a(cls, b):
        return cls(0, b)

Importing

In python, we can import whole modules as:

import <module>

Also, we can import specific functions, classes, or variables from the module:

from <module> import <name>

Note that when importing a variable, we import a reference to its current value. It therefore becomes out of sync with the original if the original variable is reassigned, so importing non-constant variables is not recommended.

The module path can be absolute or relative (starting with .). Absolute imports are recommended, as they are more robust and less error-prone.

Resolving absolute module paths

If the path is absolute, it is resolved as follows:

  1. The already imported modules are searched
  2. The built-in modules are searched
  3. The module is searched in the import path which is a list of directories stored in the sys.path variable. The sys.path variable typically contains the following directories:
    • the directory of the script that is executed ('' in case of the interactive shell),
    • the directories in the PYTHONPATH environment variable,
    • the standard library directories (e.g., /usr/lib/python3.9), and
    • the site-packages directory.

Resolving relative module paths

Relative imports can only be used in packages (directories with an __init__.py file). The relative path may start with

  • .: relative to the current package,
  • ..: relative to the parent package.

Imports in tests

The tests are located outside the main package, so we cannot use the absolute import starting with the package name. One option is to use relative imports. But a better option is to use absolute imports starting from the project root. We can do that, because test suites like pytest add the project root to the sys.path variable.

The project root is typically determined automatically by the test suite, e.g. by searching for the setup.py file. Therefore, if the tests directory is located in the same directory as the setup.py file, we can import as follows:

import tests.common

Exceptions

documentation

Syntax:

try:
    <code that can raise exception>
except <ERROR TYPE> as <ERROR VAR>:
    <ERROR HANDLING>
finally:
    <code that is executed always>

The except and finally blocks are optional. In other words, we can handle errors without having any default cleanup code, and we can have cleanup code without handling errors.

Raising exceptions

To raise an exception, we can use the raise keyword:

raise ValueError('message')

Sometimes, we just want to re-raise an exception after some partial exception handling. In such cases, we can use the raise keyword without arguments:

try:
    ...
except:
    ...
    raise

Assertions

In Python, assertions are executed by default. They can be disabled by running python with the -O or -OO flag.

The syntax is:

assert <condition>, <message>

Filesystem

There are three commonly used ways to work with the filesystem in Python: plain string manipulation, the os.path module, and the pathlib module.

The following code compares these approaches for path concatenation:

import os
from pathlib import Path

# string path concatenation
a = "C:/workspace"
b = "project/file.txt"
c = f"{a}/{b}"

# os.path concatenation
a = "C:/workspace"
b = "project/file.txt"
c = os.path.join(a, b)

# pathlib concatenation
a = Path("C:/workspace")
b = Path("project/file.txt")
c = a / b

As pathlib is the most modern approach, we will use it in the following examples. Apart from the pathlib documentation, there is also a cheat sheet available on GitHub.

Path editing

Computing relative path

To prevent mistakes, it is better to compute relative paths between directories than to hard-code them. Fortunately, there are methods we can use for that.

If the desired relative path is a child of the start path, we can simply use the relative_to method of the Path object:

a = Path("C:/workspace")
b = Path("C:/workspace/project/file.txt")
rel = b.relative_to(a) # rel = 'project/file.txt'

However, if we need to go back in the file tree, we need a more sophisticated function from os.path:

a = Path("C:/Users")
b = Path("C:/workspace/project/file.txt")
rel = os.path.relpath(b, a) # rel = '../workspace/project/file.txt'

Get parent directory

We can use the parent property of the Path object:

p = Path("C:/workspace/project/file.txt")
parent = p.parent # 'C:\\workspace\\project'

Absolute and canonical path

We can use the absolute method of the Path object to get the absolute path. To get the canonical path, we can use the resolve method.
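A small sketch of the difference, using a relative path containing '..':

```python
from pathlib import Path

p = Path('project/../project/file.txt')
abs_p = p.absolute()  # prefixes the cwd, but keeps the '..' component
canon = p.resolve()   # also eliminates '..' and resolves symlinks
```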

Splitting paths and working with path parts

To read the file extension, we can use the suffix property of the Path object. The property returns the extension with the dot.

To change the extension, we can use the with_suffix method:

p = Path("C:/workspace/project/file.txt")
p = p.with_suffix('.csv') # 'C:\\workspace\\project\\file.csv'

To remove the extension, just use the with_suffix method with an empty string.

We can split the path into parts using the parts property:

p = Path("C:/workspace/project/file.txt")
parts = p.parts # ('C:\\', 'workspace', 'project', 'file.txt')

To find the index of some specific part, we can use the index method:

p = Path("C:/workspace/project/file.txt")
index = p.parts.index('project') # 2

Later, we can use the index to manipulate the path:

p = Path("C:/workspace/project/file.txt")
index = p.parts.index('project') # 2
p = Path(*p.parts[:index]) # 'C:\\workspace'

Changing path separators

To change the path separators to forward slashes, we can use the as_posix method:

p = Path(r"C:\workspace\project\file.txt")
p = p.as_posix() # 'C:/workspace/project/file.txt'

Using ~ as the home directory in paths

Normally, the ~ character is not recognized as the home directory in Python paths. To enable this, we can use the expanduser method:

p = Path("~/project/file.txt")
p = p.expanduser() # 'C:\\Users\\user\\project\\file.txt'

Working directory

  • os.getcwd() - get the current working directory
  • os.chdir(<path>) - set the current working directory

Iterating over files

The pathlib module provides a convenient way to iterate over files in a directory. The particular methods are:

  • iterdir - iterate all files and directories in a directory
  • glob - iterate over files in a single directory, using a filter
  • rglob - iterate over files in a directory and all its subdirectories, using a filter

The order of the returned files is not guaranteed (it depends on the operating system). When we need a specific order, we have to store the results in a list and sort it.

Single directory iteration

Using pathlib, we can iterate over files using a filter with the glob method:

p = Path("C:/workspace/project")
for filepath in p.glob('*.txt'): # iterate over all txt files in the project directory
    ...

The old way is to use the os.listdir method:

p = Path("C:/workspace/project")
for filename in os.listdir(p):
    if filename.endswith('.txt'):
        filepath = p / filename

Recursive iteration

Using pathlib, we can iterate over files using a filter with the rglob method:

p = Path("C:/workspace/project")
for filepath in p.rglob('*.txt'): # iterate over all txt files in the project directory and all its subdirectories
    ...

The old way is to use the os.walk method:

p = Path("C:/workspace/project")
for root, dirs, files in os.walk(p):
    for filename in files:
        if filename.endswith('.txt'):
            filepath = Path(root) / filename

Iterate only directories/files

There is no specific filter for files/directories, but we can use the is_file or is_dir method to filter out directories:

p = Path("C:/workspace/project")
for filepath in p.glob('*'):
    if filepath.is_file():
        # do something

Use more complex filters

Unfortunately, the glob and rglob methods do not support more complex filters (like regex). However, we can easily apply the regex filter manually:

import re

p = Path("C:/workspace/project")
for filepath in p.glob('*'):
    if not re.match(r'^config.yaml$', filepath.name):
        # do something

Get the path to the current script

Path(__file__).resolve().parent

Checking write permissions for a directory

Unfortunately, most of the methods for checking write permissions are not reliable outside Unix systems. The most reliable way is to try to create a file in the directory:

def can_write(p):
    test_file = p / 'test.txt'
    try:
        with open(test_file, 'w'):
            pass
        test_file.unlink() # delete the test file, not the directory
        return True
    except PermissionError:
        return False

Other methods like os.access or using tempfile module are not reliable on Windows (see e.g.: https://github.com/python/cpython/issues/66305).

Creating directories

To create a directory, we can use the mkdir method of the Path object:

p = Path("C:/workspace/project")
p.mkdir()

Important parameters:

  • parents: if set to True, the directory will be created even if the parent directories do not exist. Default is False.
  • exist_ok: if set to True, no error is raised if the directory already exists. Default is False.
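For example, creating a nested directory inside a temporary base directory:

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
target = base / 'a' / 'b'
target.mkdir(parents=True, exist_ok=True)  # creates the missing 'a' as well
target.mkdir(parents=True, exist_ok=True)  # repeated call: no error
```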

Copying files and directories

For copying files and directories, we can use the shutil module. The most used method is copy2, which copies the file with all metadata:

import shutil

p1 = Path("C:/workspace/project/file.txt")
p2 = Path("C:/workspace/project/file2.txt")

shutil.copy2(p1, p2)

The copy2 method can also copy into a directory:

p1 = Path("C:/workspace/project/file.txt")
p2 = Path("C:/workspace/project2")

shutil.copy2(p1, p2) # the new file will be 'C:/workspace/project2/file.txt'

Other methods and the comparison are described in a SO question.

Deleting files and directories

To delete a file, we can use the unlink method of the Path object:

p = Path("C:/workspace/project/file.txt")
p.unlink()

For deleting directories, we can use the rmdir method:

p = Path("C:/workspace/project")
p.rmdir()

However, the rmdir method can delete only empty directories. To delete a directory with content, we can use the shutil module:

p = Path("C:/workspace/project")
shutil.rmtree(p)

Deleting Windows read-only files (i.e. Access Denied error)

On Windows, all the delete methods can fail because many files and directories are read-only. This is not a problem for most applications, but it breaks the Python delete methods. One way to solve this is to handle the error and change the attribute in the handler. Example for shutil:

import os
import stat
import shutil

p = Path("C:/workspace/project")
shutil.rmtree(p, onerror=lambda func, path, _: (os.chmod(path, stat.S_IWRITE), func(path)))

Working with temporary files

The tempfile module provides a convenient way to work with temporary files.

To create a temporary file, we can use the NamedTemporaryFile function:

import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(<data>)

The temporary file object supports both reading and writing. However, before reading what we wrote, we must return the file pointer to the beginning of the file:

with tempfile.NamedTemporaryFile(mode='w+') as f:
    f.write('data')
    f.seek(0)
    data = f.read()

I/O

For simple file operations, we can use the open function. A simple file read is done as follows:

with open('file.txt', 'r') as f:
    data = f.read()

A simple file write is done as follows:

with open('file.txt', 'w') as f:
    f.write('data')

By default, the open function opens the file in text mode. To open the file in binary mode, we have to use the b flag:

with open('file.txt', 'rb') as f:
    data = f.read()

CSV

Official Manual

The csv module provides a Python interface for working with CSV files. The basic usage is:

import csv

with open('file.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        # do something

Reader parameters:

  • delimiter - the delimiter character

JSON

Official Manual

To read a JSON file:

import json

with open('file.json', 'r') as f:
    data = json.load(f)

To write a JSON file:

import json

data = {'a': 1, 'b': 2}

with open('file.json', 'w') as f:
    json.dump(data, f)

Important parameters:

  • indent: the number of spaces used for indentation. This also enables other pretty-printing functionalities, like newlines after each element.
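For example, using json.dumps (the string variant of json.dump):

```python
import json

data = {'a': 1}
compact = json.dumps(data)           # '{"a": 1}'
pretty = json.dumps(data, indent=2)  # multi-line, indented output
```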

Custom serialization

The json module can serialize only basic types. If we need to serialize custom objects, we have to provide a custom serialization class. We then supply the class to the cls parameter of the dump function.

The serialization class is usually a subclass of the JSONEncoder class. The class has to implement the default method, which is called for each object that cannot be serialized by the standard serialization methods. Example:

import json

class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, MyCustomClass):
            return obj.to_json()
        return super().default(obj)

HDF5

HDF5 is a binary file format for storing large amounts of data. The h5py module provides a Python interface for working with HDF5 files.

An example of reading a dataset from an HDF5 file on SO

Command line arguments

The sys module provides access to the command line arguments. They are stored in the argv list with the first element being the name of the script.
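A minimal sketch:

```python
import sys

script_name = sys.argv[0]  # the script path (may be '' in an interactive shell)
arguments = sys.argv[1:]   # the actual command line arguments
```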

INI files

The configparser module provides a Python interface for working with INI files. The basic usage is:

import configparser

config = configparser.ConfigParser()
config.read('file.ini')

value = config['section']['key']

If we do not have sections in the INI file, we have to:

  • use the allow_unnamed_section argument of the ConfigParser: config = configparser.ConfigParser(allow_unnamed_section=True)
  • use configparser.UNNAMED_SECTION in place of the section name: value = config[configparser.UNNAMED_SECTION]['key']

Logging

Official Manual

Logging itself is done using the logging module methods:

logging.info("message")
logging.warning("message %s", "with parameter")

A simple logging configuration:

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[
        logging.FileHandler("log.txt"),
        logging.StreamHandler()
    ]
)

Note that this configuration can be done only once. Therefore, it should not be done in a library as it prevents the user from configuring the logging.

To set the level for a specific logger, we use the setLevel method: logger.setLevel(logging.DEBUG). We can also use a string representation of the level, e.g., logger.setLevel('DEBUG').
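For example (the logger name 'my_module' is an arbitrary placeholder):

```python
import logging

logger = logging.getLogger('my_module')
logger.setLevel('DEBUG')  # the string form is equivalent to logging.DEBUG
level_name = logging.getLevelName(logger.level)  # 'DEBUG'
```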

To check the level of the logger, we can use the isEnabledFor method:

if logger.isEnabledFor(logging.DEBUG):
    ...

This can be useful for avoiding expensive computations needed just for logging if the logging level is set to a higher level.

Type hints

Official Manual

Type hints are a useful way to help the IDE and other tools to understand the code so that they can provide better support (autocompletion, type checking, refactoring, etc.). The type hints are not enforced at runtime, so they do not affect the performance of the code.

We can specify the type of a variable using the : operator:

a: int = 1

Apart from the basic types, we can also use the typing module to specify more complex types:

from typing import List, Dict, Tuple, Set, Optional, Union, Any

a: List[int] = [1, 2, 3]

We can also specify the type of a function argument and return value:

def foo(a: int, b: str) -> List[int]:
    return [a, len(b)]

Type hints in loops

The type of the loop variable is usually inferred by IDE from the type of the iterable. However, this sometimes fails, e.g., for zip objects. In such cases, we need to specify the type of the loop variable. However, we cannot use the : directly in the loop, but instead, we have to declare the variable before the loop:

for a: int in ... # error

a: int
for a in ... # ok

Circular type hints

Unfortunately, Python currently does not support circular type hints. However, it should be possible to use circular type hints since Python 3.14.

There are two types of circular type hints:

  • we need to refer to the type while defining it. For that, we use the Self type:

    class MyClass:
        def get_me(self) -> Self:
            return self

  • two or more types refer to each other. For that, use a string representation of the type:

    class ClassA:
        def __init__(self, b: 'ClassB'):
            self.b = b

    class ClassB:
        def set_a(self, a: ClassA):
            self.a = a

Common type hints

Language types

  • None
  • Numeric types:
    • int,
    • float,
  • bool
  • str
  • Collection types:
    • list, or List[<type>],
    • tuple, or Tuple[<type>, ...],
    • set, or Set[<type>],
    • dict, or Dict[<key type>, <value type>],
  • Iterables:
    • Iterable[<type>] - any iterable
    • Sequence[<type>] - iterable with random access ([] operator)
  • Any - any type
  • Union[<type>, ...] - any of the specified types
  • Optional[<type>] - the specified type or None
  • Callable[[<arg type>, ...], <return type>] - a function with specified arguments and return type

Pandas types

Pandas does not provide type hints. We can use the types themselves, but this is only partially useful: we can use Series as a hint, but we cannot specify the inner type (e.g., Series[int]). For that, we can use a wrapper library called pandera:

from pandera.typing import Series

a: Series[int]

Official documentation for Pandera data types

Calling external programs

To call an external program, we use the subprocess module. Most of the time, we use the run function:

import subprocess

subprocess.run(['ls', '-l'])

Important parameters:

  • check: if set to True, the function raises an exception if the return code is not 0. Default is False.
  • text, or universal_newlines: if set to True, the function returns the output as a string. Default is None.
  • env: A dictionary with the environment variables.
    • Note that the default environment obtained from the parent process is not extended, but replaced by the provided dictionary. Therefore, if we want to extend the environment, we have to initialize the dictionary with the parent environment: env = os.environ.copy()

However, the subprocess.run method has some limitations. Notably, it cannot both capture and stream the output. To achieve this, and some other advanced features, we have to use the subprocess.Popen class.

subprocess.Popen

The subprocess.Popen class provides more control over the process. The basic usage is:

p = subprocess.Popen(['ls', '-l']) # start the process

# now we can communicate with the process, stream the output, etc.

p.wait() # wait for the process to finish

# now we can get the return code, continue with the code, etc.

Loading resources

Resources can be loaded using the importlib.resources module. This way, we can handle files but also resources stored in an archive.

The basic usage is:

import importlib.resources

file = importlib.resources.files('package').joinpath('file.txt')

# send the file to a function expecting a filesystem path;
# as_file is a context manager that extracts archived resources if needed

with importlib.resources.as_file(file) as path:
    my_function(path)

Numpy

Data types

Documentation of basic data types

Date and time

Documentation

Numpy uses the datetime64 data type for date and time. This type has an internal resolution, which can be anything from nanoseconds to years. The resolution is displayed in the type name, e.g., datetime64[ns], and it is determined

  • automatically from the input data if dtype is not specified or specified as datetime64 or
  • by the dtype parameter if specified as datetime64[<resolution>].
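For example:

```python
import numpy as np

auto = np.array(['2022-12-20', '2022-12-21'], dtype='datetime64')
# resolution inferred from the input strings: days -> datetime64[D]
explicit = np.array(['2022-12-20'], dtype='datetime64[s]')
# resolution forced to seconds -> datetime64[s]
```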

Initialization

We can create the new array as:

  • zero-filled: np.zeros(<shape>, <dtype>)
  • ones-filled: np.ones(<shape>, <dtype>)
  • empty: np.empty(<shape>, <dtype>)
  • filled with a constant: np.full(<shape>, <value>, <dtype>)

Sorting

For sorting, we use the sort function.

There is no parameter to set the sorting order; we have to use a trick for that:

a = np.array([1, 2, 3, 4, 5])
a[::-1].sort() # sort in reverse order

Export to CSV

To export the numpy array to CSV, we can use the savetxt function:

np.savetxt('file.csv', a, delimiter=',')

By default, the function saves values in scientific float notation, even if the values are integers. To save the values as integers, we can use the fmt parameter:

np.savetxt('file.csv', a, delimiter=',', fmt='%i')

Useful array properties:

  • size: number of array items
    • unlike len, it counts all items in the multi-dimensional array
  • itemsize: memory (bytes) needed to store one item in the array
  • nbytes: array size in bytes. Should be equal to size * itemsize.
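The properties above can be illustrated on a small array:

```python
import numpy as np

a = np.zeros((2, 3), dtype=np.int32)
# size counts all items, len only the first dimension;
# itemsize is 4 bytes for int32, and nbytes == size * itemsize
```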

Useful functions

Regular expressions

In Python, regex patterns are not compiled by default; they are stored as plain strings (typically raw strings) and compiled explicitly with re.compile, or on the fly by functions such as re.search.

The basic syntax for regex search is:

import re

result = re.search(<pattern>, <string>)
if result: # pattern matches
    group = result.group(<group index>) # get the given group

The 0th group is the whole match, as usual.
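A concrete example with two groups:

```python
import re

result = re.search(r'(\d+)-(\d+)', 'pages 12-34')
whole = result.group(0)   # the whole match
first = result.group(1)   # the first group
```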

To substitute the matched pattern, we can use the sub function:

pattern = re.compile(r'(\d+)')
result = pattern.sub(r'[\1]', '123') # '[123]'

Sometimes, we need to place digits right after a group reference. In such cases, we have to use the \g<group index> notation:

pattern = re.compile(r'(\d+)')
result = pattern.sub(r'\g<1>2025', '123') # '1232025'

Lambda functions

Lambda functions in Python have the following syntax:

lambda <input parameters>: <return value>

Example:

f = lambda x: x**2

Only a single expression can be used in the lambda function, so we need standard functions for more complex logic (temporary variables, loops, etc.).
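A common use is a throwaway key function, e.g. for sorting:

```python
pairs = [(1, 'banana'), (2, 'apple')]
pairs.sort(key=lambda p: p[1])  # sort by the second tuple element
```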

Decorators

Decorators are a special type of function that can be used to modify other functions.

When we write an annotation with the name of a decorator function above another function, the annotated function is decorated. It means that when we call the annotated function, a wrapper function is called instead. The wrapper function is the function returned by the decorator: the function with the same name as the annotation.

If we also want to keep the original function's behavior, we have to pass the function to the decorator and call it inside the wrapper function.

In the following example, we create a dummy decorator that keeps the original function's behavior:

def decorator(func):
    def wrapper():
        result = func()
        return result
    return wrapper

@decorator
def my_func():
    result = ... # do something
    return result

Decorator with arguments

If the original function has arguments, we have to pass them to the wrapper function. Example:

def decorator(func):
    def wrapper(param_1, param_2):
        result = func(param_1, param_2)
        return result
    return wrapper

@decorator
def my_func(param_1, param_2):
    result = ... # do something
    return result
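To decorate functions with arbitrary signatures, the wrapper is usually written with *args/**kwargs; functools.wraps additionally preserves the decorated function's name and docstring:

```python
import functools

def decorator(func):
    @functools.wraps(func)  # keep func's __name__ and __doc__
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@decorator
def add(a, b):
    """Return the sum of a and b."""
    return a + b
```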

Singletons

SO question

There are several ways to implement a singleton in Python. The most common are:

  • using a module-level variable
  • using the __new__ method in combination with a base class
  • using a metaclass
  • using a decorator

Module-level variable

The simplest way is to use a module-level variable. Note that if the singleton has to be initialized from the outside, the initialization has to be done in a singleton method, not in the constructor!

class Singleton:
    def __init__(self):
        self.initialized = False

    def init(self, init_param):
        if not self.initialized:
            self.initialized = True
            # do the initialization

singleton = Singleton()

# initialization
singleton.init(init_param)
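For comparison, the __new__-based approach mentioned in the list above can be sketched as:

```python
class Singleton:
    _instance = None

    def __new__(cls):
        # create the instance only on the first call; later calls reuse it
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```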

Testing with pytest

Pytest is a simple testing framework for Python. It uses the assert statement for testing. The tests are defined in functions with the test_ prefix.

Fixtures

Fixtures are used to set up the environment for more than one test. If defined in the conftest.py file, they are available for all tests in the project.

Fixtures are defined using the @pytest.fixture decorator. The fixture can be used in the test function by passing the fixture name as an argument. The fixture has the following structure:

@pytest.fixture
def my_fixture():
    # code for setting up the environment
    yield # the test is executed here
    # clean up code

Mocking

For mocking, we can use the pytest-mock package. After installation, we can use the mocker fixture in any test function.
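mocker.patch mirrors the standard unittest.mock.patch (pytest-mock mainly adds automatic undoing of patches after each test). A minimal sketch of the underlying mechanism, using a hypothetical Client class:

```python
from unittest import mock

class Client:
    def fetch(self):
        raise RuntimeError('real network call')  # what we want to avoid in tests

# inside the with block, Client.fetch is replaced by a mock returning 'mocked'
with mock.patch.object(Client, 'fetch', return_value='mocked'):
    result = Client().fetch()
```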

Capturing output

Documentation

To capture the output, we can use the capsys fixture:

def test_output(capsys):
    print('hello')
    captured = capsys.readouterr()
    assert captured.out == 'hello\n'

Similarly, we can inspect the standard error output using the captured.err attribute.

Jupyter

Memory

Most of the time, when the memory allocated by the notebook is larger than expected, it is caused by some library objects (plots, tables, ...). However, sometimes the culprit can be forgotten user objects. To list all user objects, from the largest:

import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)

Reloading modules with autoreload

When modules are imported, they are not reloaded unless the kernel is restarted. In Python scripts, this does not matter, we just execute the script again. However, when working with notebooks, it may be inconvenient to restart the kernel and re-run all necessary cells just because of a small change in an imported module. Instead, we can use the autoreload extension.

First, we have to load the extension:

%load_ext autoreload

Then, we configure the autoreload with %autoreload <mode>. The most common modes are:

  • now (default): reload all modules immediately (if not excluded by the %aimport magic)
    • This is useful especially if the automatic reloading does not work as expected.
  • 0, off: disable autoreload
  • 1, explicit: reload modules that were imported using the %aimport magic every time before executing the Python code
  • 2, all: reload all modules (except those excluded by %aimport) every time before executing the Python code
  • 3, complete: same as 2, but also add any new objects in the module

Plotting

There are several libraries for plotting in Python. The most common are:

  • matplotlib
  • plotly

In the table below, we can see a comparison of the most common plotting libraries:

Functionality                                        Matplotlib  Plotly
real 3D plots                                        no          yes
detailed legend styling (padding, round corners...)  yes         no

Matplotlib

Official Manual

Saving figures

To save a figure, we can use the savefig function. The savefig function has to be called before the show function, otherwise the figure will be empty.

Docstrings

For documenting Python code, we use docstrings, special string literals surrounded by three quotation marks: """ docstring """

Unlike in other languages, there are multiple styles for docstring content. The most common are:

  • Epytext:

    """
    @param <param name>: <param description>
    @return: <return description>
    """

  • Google:

    """
    Args:
        <param name>: <param description>

    Returns:
        <return description>
    """

  • Numpy:

    """
    Parameters
    ----------
    <param name> : <param type>
        <param description>

    Returns
    -------
    <return type>
        <return description>
    """

  • reStructuredText:

    """
    :param <param name>: <param description>
    :return: <return description>
    """
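For example, a small function documented in the Google style:

```python
def add(a, b):
    """Add two numbers.

    Args:
        a: the first addend
        b: the second addend

    Returns:
        The sum of a and b.
    """
    return a + b
```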

Progress bars

For displaying progress bars, we can use the tqdm library. It is very simple to use:

from tqdm import tqdm
for i in tqdm(range(100)):
    ...

Important parameters:

  • desc: description of the progress bar

TQDM in Jupyter

When using tqdm in Jupyter, the basic progress bar may not work (it may print other logs repeatedly). In such cases, we can change the import to:

from tqdm.notebook import tqdm

If the code can be called both from Jupyter and from the console, we can use the tqdm.autonotebook module, which picks the appropriate implementation automatically:

from tqdm.autonotebook import tqdm

PostgreSQL

When working with PostgreSQL databases, we usually use either psycopg2 or SQLAlchemy.

psycopg2

documentation

To connect to a database:

con = psycopg2.connect(<connection string>)

After running this code, a new session is created in the database; this session is handled by the con object.

Operations on the database are then done as follows:

  1. create a cursor object, which represents a database transaction: cur = con.cursor()
  2. execute any number of SQL commands: cur.execute(<sql>)
  3. commit the transaction: con.commit()

SQLAlchemy

Connection documentation

SQLAlchemy works with engine objects that represent the application's connection to the database. The engine object is created using the create_engine function:

from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/dbname')

A simple SELECT query can be executed using the following code (newer versions of SQLAlchemy require raw SQL strings to be wrapped in text):

from sqlalchemy import text

with engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM table"))
    ...

With modifying statements, the situation is more complicated, as SQLAlchemy uses transactions by default. Therefore, we need to commit the transaction. There are two ways to do that:

  • using the commit method of the connection object:

    with engine.connect() as connection:
        connection.execute(text("INSERT INTO table VALUES (1, 2, 3)"))
        connection.commit()

  • creating a new block for the transaction using the begin method of the connection object:

    with engine.connect() as connection:
        with connection.begin():
            connection.execute(text("INSERT INTO table VALUES (1, 2, 3)"))

    • this option also has a shortcut: the begin method of the engine object:

      with engine.begin() as connection:
          connection.execute(text("INSERT INTO table VALUES (1, 2, 3)"))

Note that the old execute method of the engine object is not available anymore in newer versions of SQLAlchemy.

Statements with parameters

Sometimes it is desirable to use parameters in the SQL statements:

  • it prevents SQL injection in case of user input,
  • the provided parameters are automatically escaped and quoted, so we don't have to handle that in the SQL statement.

The syntax is:

from sqlalchemy import text

connection.execute(
    text("INSERT INTO table VALUES (:param1, :param2)"),
    {'param1': 1, 'param2': 2}
)

In older versions of SQLAlchemy, the parameters could also be passed as keyword arguments to execute; this is no longer supported.

Executing statements without transaction

By default, SQLAlchemy executes SQL statements in a transaction. However, some statements (e.g., CREATE DATABASE) cannot be executed in a transaction. To execute such statements, we have to use the execution_options method:

with sqlalchemy_engine.connect() as conn:
    conn.execution_options(isolation_level="AUTOCOMMIT")
    conn.execute(text("<sql>"))
    # no commit needed: in AUTOCOMMIT mode, each statement is committed immediately

Getting the affected rowcount

The result object returned by the execute method has the rowcount attribute that contains the number of affected rows.

Executing multiple statements at once

To execute multiple statements at once, for example when executing a script, it is best to use the execute method of the psycopg2 cursor object. Moreover, to safely handle errors, it is best to catch exceptions and manually roll back the transaction in case of an error:

conn = psycopg2.connect(<connection string>)
cursor = conn.cursor()
try:
    cursor.execute(<sql>)
    conn.commit()
except Exception as e:
    conn.rollback()
    raise e
finally:
    cursor.close()
    conn.close()

Working with GIS

When working with GIS data, we usually replace the pandas library with its GIS extension called geopandas.

For more, see the pandas manual.

Geocoding

For geocoding, we can use the Geocoder library.

Complex data structures

KDTree

documentation

KDTree can be found in the scipy library.

Geometry

There are various libraries for working with geometry in Python.

Downloading files

To download files from the internet, we can use the requests library. The basic usage is:

import requests

response = requests.get(<url>)
response.raise_for_status() # fail early on HTTP error codes

with open(<filename>, 'wb') as f:
    f.write(response.content)