Basics¶
Variables¶
To check if a local variable is defined, we can use the locals function:
if 'my_variable' in locals():
    print('Variable exists')
Conditions and boolean context¶
Comparison operators¶
Python uses the standard set of comparison operators (==, !=, <, >, <=, >=).
They are functionally similar to C++ operators: they can be overloaded, and the semantic meaning of == is equality, not identity (in contrast to Java).
Automatic conversion to bool¶
Unlike in other languages, any expression can be used in a boolean context in Python, as there are rules for converting any type to bool. The following statement is valid, for example:
s = 'hello'
if s:
    print(s)
The code above prints 'hello', as the variable s evaluates to True.
Any object in Python evaluates to True, with the exception of:
- False
- None
- numerically zero values (e.g., 0, 0.0)
- standard library types that are empty (e.g., empty string, list, dict)
The automatic conversion to bool in a boolean context has some counter-intuitive consequences. The following conditions are not equal:
s = 'hello'
if s:          # s evaluates to True
if s == True:  # the result of s == True is False, so this branch is not taken
Checking the type¶
To check the exact type:
if type(<VAR>) is <TYPE>:
# e.g.
if type(o) is str:
To check the type in the polymorphic way, including the subtypes:
if isinstance(<VAR>, <TYPE>):
# e.g.
if isinstance(o, str):
Built-in data types¶
Numbers¶
Python has the following numeric types:
- int - integer
- float - floating point number
The int type is unlimited, i.e., it can represent any integer number. The float type is limited by the machine precision, i.e., it can represent only a finite number of real numbers.
Check if a float number is integer¶
To check whether a float number is an integer, we can use the float.is_integer method:
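A minimal sketch:

```python
# float.is_integer returns True when the float has no fractional part
whole = (3.0).is_integer()       # True
fractional = (3.5).is_integer()  # False
```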
Check if a number is NaN¶
To check whether a number is NaN, we can use the math.isnan function or the numpy.isnan function:
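A short example with the standard library variant; note that comparing NaN with == does not work:

```python
import math

nan_value = float('nan')
is_nan = math.isnan(nan_value)        # True
# NaN is not equal to itself, so == cannot be used to detect it
naive_check = nan_value == nan_value  # False
```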
Rounding¶
To round a number, use the round function.
For rounding up, use the math.ceil function.
For rounding down, use the math.floor function.
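A small comparison of the three functions; note that round uses banker's rounding, so ties go to the nearest even number:

```python
import math

r1 = round(2.4)         # 2
r2 = round(2.5)         # 2 - banker's rounding: the tie goes to the even number
up = math.ceil(2.1)     # 3
down = math.floor(2.9)  # 2
```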
Strings¶
Strings in Python can be enclosed in single or double quotes (equivalent). The triple quotes can be used for multiline strings.
String formatting¶
The string formatting can be done in several ways:
- using the f prefix of a string literal: f'{<VAR>}'
- using the format method: '{}'.format(<VAR>)
Each variable can be formatted; for that, Python has a string formatting mini-language. The format is specified after the : character (e.g., f'{47:4}' sets the width of the number 47 to 4 characters). Most of the format specifiers have default values, so we can omit them (e.g., f'{47:4}' is equivalent to f'{47:4d}').
The following are the most common options:
- the fill character, the alignment (<, >, ^), and the width
- the precision (e.g., .2)
- the type (e.g., d for integer, f for fixed-point, e for scientific, s for string)
To use the characters { and } in the string, we have to escape them using double braces: {{ and }}.
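For example, the doubled braces produce literal braces in the result:

```python
value = 47
s = f'{{width: {value}}}'  # the doubled braces produce literal { and }
```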
String methods¶
- capitalize: capitalize the first letter of the string
- lower: convert the string to lowercase
- upper: convert the string to uppercase
- strip: remove leading and trailing whitespace
- lstrip: remove leading whitespace
- rstrip: remove trailing whitespace
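The methods above in action (they all return a new string, leaving the original unchanged):

```python
s = '  hello World  '
stripped = s.strip()         # 'hello World'
cap = stripped.capitalize()  # 'Hello world' - it also lowercases the rest
upper = stripped.upper()     # 'HELLO WORLD'
lower = stripped.lower()     # 'hello world'
left = s.lstrip()            # 'hello World  '
```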
Enumerations¶
For enumerations, we can use the enum
module. The basic syntax is:
from enum import Enum

class MyEnum(Enum):
    VALUE1 = 1
    VALUE2 = 2
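Members can then be accessed by name or looked up by value; a minimal sketch with a hypothetical Color enum:

```python
from enum import Enum

class Color(Enum):
    RED = 1
    GREEN = 2

member = Color.RED
name = member.name    # 'RED'
value = member.value  # 1
by_value = Color(2)   # look up a member by its value
```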
Collections and generators¶
Python has several built-in data structures, most notably list, tuple, dict, and set. These are less efficient than comparable structures in other languages, but they are very convenient to use.
Also, there is a special generator type. It does not store the data; it is only a convenient way to access data generated by some function.
Generator¶
Generators are mostly used in iteration; we can iterate over them the same way as lists.
To get the first item of the generator, we can use the next function:
g = (x for x in range(10))
first = next(g) # 0
To create a generator function (a function that returns a generator), we can use the yield keyword. The following function returns a generator that yields the numbers 1, 2, and 3:
def gen():
    yield 1
    yield 2
    yield 3
The length of a generator is not known in advance; to get the length, we have to exhaust the generator first, for example using len(list(<generator>)).
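Note that this consumes the generator, so a second pass yields nothing:

```python
g = (x for x in range(10))
length = len(list(g))  # 10
# the generator is now exhausted; iterating again yields nothing
remaining = list(g)    # []
```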
Tuple¶
Tuples are meant to store a fixed sequence of values. They are immutable.
The tuple literal is a comma-separated list of values in round braces:
t = (1, 2, 3)
Dictionary¶
Dictionaries are initialized using curly braces ({}) and the : operator:
d = {
    'key1': 'value1',
    'key2': 'value2',
    ...
}
Two dictionaries can be merged using the | operator:
d3 = d1 | d2
Set¶
Sets are initialized using curly braces ({}) or the set function:
s = {1, 2, 3}
s = set([1, 2, 3])
To add elements to the set, we use either the add method for a single element or the update method for multiple elements. In both cases, a union of the set and the new elements is computed, i.e., no exception is raised if an element is already in the set.
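Both methods in action:

```python
s = {1, 2, 3}
s.add(3)             # element already present: no error, set unchanged
s.add(4)             # single element
s.update([4, 5, 6])  # multiple elements at once
# s is now {1, 2, 3, 4, 5, 6}
```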
Comprehensions¶
In addition to literals, Python has a convenient way of creating basic data structures: the comprehensions. The basic syntax is:
<struct var> = <op. brace> <member var expr.> for <member var> in <iterable> <cl. brace>
As for literals, we use square braces ([]) for lists, curly braces ({}) for sets, and curly braces with colons for dictionaries. In contrast, we get a generator expression when using round braces (()), not a tuple.
We can also use the if keyword to filter the elements:
a = [it for it in range(10) if it % 2 == 0] # [0, 2, 4, 6, 8]
Sorting¶
For sorting, you can use the sorted function.
Instead of using comparators, Python has a different concept of key functions for custom sorting. The key function is applied to each element before sorting; for any expected object, it should return a value that can be compared.
Complex sorting using tuples¶
If we need to apply some complex sorting, we can use tuples as the key function return value. Tuples have comparison operators defined; the implementation is as follows:
- elements are compared one by one
- on the first non-equal element, the comparison result is returned
This way, we can implement a complex sorting that would normally require several conditions by storing the condition results in the tuple.
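For example, with hypothetical (name, age) records, we can sort by age first and break ties by name:

```python
# hypothetical records: (name, age)
people = [('bob', 30), ('alice', 25), ('carol', 30)]
# primary condition: age ascending; secondary condition: name ascending
result = sorted(people, key=lambda p: (p[1], p[0]))
# [('alice', 25), ('bob', 30), ('carol', 30)]
```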
Slices¶
Many Python data structures support slicing: selecting a subset of elements. The syntax is:
<object>[<start>:<end>:<step>]
The start is inclusive, while the end is exclusive.
The step is optional and defaults to 1. The start is also optional and defaults to 0.
Instead of omitting the start and end, we can use the None keyword:
a = [1, 2, 3, 4, 5]
a[None:3] # [1, 2, 3]
Sometimes, it is not possible to use the slice syntax:
- when we need to use a variable for the step, or
- when the object uses the slice syntax for something else, e.g., for selecting columns in a Pandas dataframe.
In such cases, we can use the slice object:
a[0:10:2]
s = slice(0, 10, 2)
a[s] # equivalent
Here, the parameters can be omitted as well. We can select everything by using slice(None), which is equivalent to slice(None, None, None).
Copying collections¶
If we copy a complex collection (e.g., a list of dictionaries), we typically want to create a deep copy so that the original collection is not modified. We can use the copy
module for that:
import copy
a = [{'a': 1}, {'b': 2}]
b = copy.deepcopy(a)
Date and time¶
The base object for date and time is datetime.
datetime construction¶
The datetime object can be directly constructed from the parts:
from datetime import datetime
d = datetime(2022, 12, 20, 22, 30, 0) # 2022-12-20 22:30:00
The time part can be omitted.
We can load a datetime from a string using the strptime function:
d = datetime.strptime('2022-05-20 18:00', '%Y-%m-%d %H:%M')
For all possible time formats, check the strftime cheatsheet.
Accessing the parts of datetime¶
The datetime object has the following attributes:
- year
- month
- day
- hour
- minute
- second
We can also query the day of the week using the weekday()
method. The day of the week is represented as an integer, where Monday is 0 and Sunday is 6.
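The attributes and the weekday method in a short example:

```python
from datetime import datetime

d = datetime(2022, 12, 20, 22, 30)
parts = (d.year, d.month, d.day, d.hour, d.minute)  # (2022, 12, 20, 22, 30)
dow = d.weekday()  # 1 - 2022-12-20 was a Tuesday
```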
Intervals¶
There is also a dedicated object for time intervals named timedelta. It can be constructed from parts (seconds to days); all parts are optional.
We can obtain a timedelta by subtracting a datetime from another datetime:
d1 = datetime.strptime('2022-05-20 18:00', '%Y-%m-%d %H:%M')
d2 = datetime.strptime('2022-05-20 18:30', '%Y-%m-%d %H:%M')
interval = d2 - d1 # 30 minutes
We can also add or subtract a timedelta object to or from a datetime object:
from datetime import datetime, timedelta

d = datetime.strptime('2022-05-20 18:00', '%Y-%m-%d %H:%M')
interval = timedelta(hours=1)
d2 = d + interval # '2022-05-20 19:00'
Converting to Unix timestamp¶
To convert a datetime object to a Unix timestamp, we can use the timestamp method. It returns the number of seconds since the epoch (1970-01-01 00:00:00 UTC). Note, however, that the timestamp is computed based on the datetime object's timezone, or your local timezone if the datetime object has no timezone information.
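With an explicit UTC timezone, the result is unambiguous:

```python
from datetime import datetime, timezone

# one day after the epoch, with an explicit UTC timezone
d = datetime(1970, 1, 2, tzinfo=timezone.utc)
ts = d.timestamp()  # 86400.0 - exactly one day in seconds
```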
Named tuples¶
Apart from the standard tuple, Python has a named tuple class that can be created using the collections.namedtuple function. In a named tuple, each member has a name and can be accessed using the dot operator:
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)
print(p.x) # 1
Functions¶
Argument unpacking¶
If we need to conditionally execute a function with a different set of parameters (assuming the function has optional/default parameters), we can avoid multiple function calls inside the branching tree by using argument unpacking.
Suppose we have a function with three optional parameters: a, b, c.
If we skip only the last n parameters, we can use a list for the parameters and unpack it using *:
def call_me(a=0, b=0, c=0):
    ...

l = ['param A', True]
call_me(*l) # calls the function with a = 'param A' and b = True
If we need to skip some parameters in the middle, we have to use a dict and unpack it using **:
d = {'c': 142}
call_me(**d) # calls the function with c = 142
String formatting¶
To format Python strings, we can use the format method of the string or the equivalent f-string:
a = 'world'
message = "Hello {}".format(a)
message = f"Hello {a}" # equivalent
If we need special formatting for a variable, we can specify it after the :, as in the following example, which pads the number from the left:
uid = 47
message = "Hello user {:0>4d}".format(uid) # 'Hello user 0047'
message = f"Hello user {uid:0>4d}" # equivalent
More formatting options can be found in the Python string formatting cookbook.
Classes¶
Classes in Python are defined using the class
keyword:
class MyClass:
    ...
Unlike in other languages, we only declare the function members; other members are declared in the constructor or even later.
Constructor¶
The constructor is a special function named __init__
. Usually, non-function members are declared in the constructor:
class MyClass:
    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.c = 0
        self.d = None
Check if an object contains a member¶
To check whether an object contains a member, we can use the hasattr
function:
if hasattr(obj, 'member'):
    ...
Constructor overloading¶
Python does not support function overloading, including for the constructor. That is unfortunate, as default arguments are a less powerful mechanism. For other functions, we can substitute overloading with a function with a different name. However, for the constructor, we need a different approach.
The cleanest way is to use a class method as a constructor. Example:
class MyClass:
    def __init__(self, a, b=0):
        self.a = a
        self.b = b
        self.c = 0
        self.d = None

    @classmethod
    def from_a(cls, b):
        return cls(0, b)
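The alternative constructor is then called on the class itself (repeating the class above in condensed form):

```python
class MyClass:
    def __init__(self, a, b=0):
        self.a = a
        self.b = b

    @classmethod
    def from_a(cls, b):
        return cls(0, b)

obj = MyClass.from_a(42)  # a defaults to 0, b is set to 42
```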
Importing¶
In Python, we can import whole modules as:
import <module>
Also, we can import specific functions, classes, or variables from the module:
from <module> import <name>
Note that when importing a variable, we import a reference to its current value. It will therefore become out of sync with the original variable if the original is reassigned, so importing non-constant variables is not recommended.
The module path can be absolute or relative (starting with .). Absolute imports are recommended, as they are more robust and less error-prone.
Resolving absolute module paths¶
If the path is absolute, it is resolved as follows:
- The already imported modules are searched.
- The built-in modules are searched.
- The module is searched in the import path, which is a list of directories stored in the sys.path variable. The sys.path variable typically contains the following directories:
    - the directory of the script that is executed ('' in case of the interactive shell),
    - the directories in the PYTHONPATH environment variable,
    - the standard library directories (e.g., /usr/lib/python3.9), and
    - the site-packages directory.
Resolving relative module paths¶
Relative imports can only be used in packages (directories with an __init__.py file). The relative path may start with:
- .: relative to the current module,
- ..: relative to the parent module.
Imports in tests¶
The tests are located outside the main package, so we cannot use an absolute import starting with the package name. One option is to use relative imports. But a better option is to use absolute imports starting from the project root. We can do that because test suites like pytest add the project root to the sys.path variable.
The project root is typically determined automatically by the test suite, e.g., by searching for the setup.py file. Therefore, if the tests directory is located in the same directory as the setup.py file, we can import as follows:
import tests.common
Exceptions¶
Syntax:
try:
    <code that can raise exception>
except <ERROR TYPE> as <ERROR VAR>:
    <ERROR HANDLING>
finally:
    <code that is executed always>
The except and finally blocks are optional. In other words, we can handle errors without having any cleanup code, and we can have cleanup code without handling errors.
Raising exceptions¶
To raise an exception, we can use the raise keyword:
raise ValueError('message')
Sometimes, we just want to re-raise an exception after some partial exception handling. In such cases, we can use the raise keyword without arguments:
try:
    ...
except:
    ...
    raise
Assertions¶
In Python, assertions are executed by default. They can be disabled by running Python with the -O or -OO flag.
The syntax is:
assert <condition>, <message>
Filesystem¶
There are three commonly used ways to work with the filesystem in Python: plain strings, the os.path module, and the pathlib module.
The following code compares these approaches for path concatenation:
import os
from pathlib import Path

# string path concatenation
a = "C:/workspace"
b = "project/file.txt"
c = f"{a}/{b}"

# os.path concatenation
a = "C:/workspace"
b = "project/file.txt"
c = os.path.join(a, b)

# pathlib concatenation
a = Path("C:/workspace")
b = Path("project/file.txt")
c = a / b
As pathlib is the most modern approach, we will use it in the following examples. Apart from the pathlib documentation, there is also a cheat sheet available on GitHub.
Path editing¶
Computing relative path¶
To prevent mistakes, it is better to compute relative paths between directories than to hard-code them. Fortunately, there are methods we can use for that.
If the desired relative path is a child of the start path, we can simply use the relative_to method of the Path object:
a = Path("C:/workspace")
b = Path("C:/workspace/project/file.txt")
rel = b.relative_to(a) # rel = 'project/file.txt'
However, if we need to go back in the file tree, we need a more sophisticated method from os.path:
a = Path("C:/Users")
b = Path("C:/workspace/project/file.txt")
rel = os.path.relpath(b, a) # rel = '../workspace/project/file.txt'
Get parent directory¶
We can use the parent property of the Path object:
p = Path("C:/workspace/project/file.txt")
parent = p.parent # 'C:\\workspace\\project'
Absolute and canonical path¶
We can use the absolute method of the Path object to get the absolute path. To get the canonical path (with symlinks resolved), we can use the resolve method.
Splitting paths and working with path parts¶
To read the file extension, we can use the suffix property of the Path object. The property returns the extension including the dot.
To change the extension, we can use the with_suffix method:
p = Path("C:/workspace/project/file.txt")
p = p.with_suffix('.csv') # 'C:\\workspace\\project\\file.csv'
To remove the extension, just use the with_suffix method with an empty string.
We can split the path into parts using the parts property:
p = Path("C:/workspace/project/file.txt")
parts = p.parts # ('C:\\', 'workspace', 'project', 'file.txt')
To find the index of a specific part, we can use the index method of the parts tuple:
p = Path("C:/workspace/project/file.txt")
index = p.parts.index('project') # 2
Later, we can use the index to manipulate the path:
p = Path("C:/workspace/project/file.txt")
index = p.parts.index('project') # 2
p = Path(*p.parts[:index]) # 'C:\\workspace'
Changing path separators¶
To change the path separators to forward slashes, we can use the as_posix method:
p = Path(r"C:\workspace\project\file.txt")
p = p.as_posix() # 'C:/workspace/project/file.txt'
Using ~ as the home directory in paths¶
Normally, the ~ character is not recognized as the home directory in Python paths. To expand it, we can use the expanduser method:
p = Path("~/project/file.txt")
p = p.expanduser() # 'C:\\Users\\user\\project\\file.txt'
Working directory¶
- os.getcwd() - get the current working directory
- os.chdir(<path>) - set the current working directory
Iterating over files¶
The pathlib module provides a convenient way to iterate over files in a directory. The particular methods are:
- iterdir - iterate over all files and directories in a directory
- glob - iterate over files in a single directory, using a filter
- rglob - iterate over files in a directory and all its subdirectories, using a filter
The order of the results is not guaranteed. When we need a specific order, we have to store the results in a list and sort it.
Single directory iteration¶
Using pathlib, we can iterate over files using a filter with the glob
method:
p = Path("C:/workspace/project")
for filepath in p.glob('*.txt'): # iterate over all txt files in the project directory
    ...
The old way is to use the os.listdir
method:
p = Path("C:/workspace/project")
for filename in os.listdir(p):
    if filename.endswith('.txt'):
        filepath = p / filename
Recursive iteration¶
Using pathlib, we can iterate over files using a filter with the rglob
method:
p = Path("C:/workspace/project")
for filepath in p.rglob('*.txt'): # iterate over all txt files in the project directory and all its subdirectories
    ...
The old way is to use the os.walk
method:
p = Path("C:/workspace/project")
for root, dirs, files in os.walk(p):
    for filename in files:
        if filename.endswith('.txt'):
            filepath = Path(root) / filename
Iterate only directories/files¶
There is no specific filter for files/directories, but we can use the is_file or is_dir method to filter out the unwanted entries:
p = Path("C:/workspace/project")
for filepath in p.glob('*'):
    if filepath.is_file():
        # do something
Use more complex filters¶
Unfortunately, the glob and rglob methods do not support more complex filters (like regexes). However, we can easily apply a regex filter manually:
import re

p = Path("C:/workspace/project")
for filepath in p.glob('*'):
    if not re.match(r'^config.yaml$', filepath.name):
        # do something
Get the path to the current script¶
Path(__file__).resolve().parent
Checking write permissions for a directory¶
Unfortunately, most of the methods for checking write permissions are not reliable outside Unix systems. The most reliable way is to try to create a file in the directory:
p = Path("C:/workspace/project")
try:
    test_file = p / 'test.txt'
    with open(test_file, 'w') as f:
        pass
    test_file.unlink()
    return True
except PermissionError:
    return False
except:
    raise # re-raise the exception
Other methods like os.access
or using tempfile
module are not reliable on Windows (see e.g.: https://github.com/python/cpython/issues/66305).
Creating directories¶
To create a directory, we can use the mkdir method of the Path object:
p = Path("C:/workspace/project")
p.mkdir()
Important parameters:
- parents: if set to True, the directory will be created even if the parent directories do not exist. Default is False.
- exist_ok: if set to True, no error is raised if the directory already exists. Default is False.
Copying files and directories¶
For copying files and directories, we can use the shutil module. The most used function is copy2, which copies the file together with its metadata:
import shutil
p1 = Path("C:/workspace/project/file.txt")
p2 = Path("C:/workspace/project/file2.txt")
shutil.copy2(p1, p2)
The copy2 function can also copy into a directory:
p1 = Path("C:/workspace/project/file.txt")
p2 = Path("C:/workspace/project2")
shutil.copy2(p1, p2) # the new file will be 'C:/workspace/project2/file.txt'
Other methods and their comparison are described in a SO question.
Deleting files and directories¶
To delete a file, we can use the unlink method of the Path object:
p = Path("C:/workspace/project/file.txt")
p.unlink()
For deleting directories, we can use the rmdir method:
p = Path("C:/workspace/project")
p.rmdir()
However, the rmdir method can delete only empty directories. To delete a directory with its content, we can use the shutil module:
p = Path("C:/workspace/project")
shutil.rmtree(p)
Deleting Windows read-only files (i.e. Access Denied error)¶
On Windows, all the delete methods can fail because many files and directories are read-only. This is not a problem for most applications, but it breaks the Python delete methods. One way to solve this is to handle the error and change the attribute in the handler. Example for shutil:
import os
import stat
import shutil
p = Path("C:/workspace/project")
shutil.rmtree(p, onerror=lambda func, path, _: (os.chmod(path, stat.S_IWRITE), func(path)))
Working with temporary files¶
The tempfile module provides a convenient way to work with temporary files.
To create a temporary file, we can use the NamedTemporaryFile function:
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(<data>)
Unlike normal files, we can both read and write the temporary file using a single file object. However, we must return the file pointer to the beginning of the file. The same works for regular files opened in the 'w+' mode:
with open('file.txt', 'w+') as f:
    f.write('data')
    f.seek(0)
    data = f.read()
I/O¶
For simple file operations, we can use the open function. A simple file read is done as follows:
with open('file.txt', 'r') as f:
    data = f.read()
A simple file write is done as follows:
with open('file.txt', 'w') as f:
    f.write('data')
By default, the open function opens the file in text mode. To open the file in binary mode, we have to use the b flag:
with open('file.txt', 'rb') as f:
    data = f.read()
CSV¶
The csv module provides a Python interface for working with CSV files. The basic usage is:
import csv

with open('file.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        # do something
Reader parameters:
- delimiter - the delimiter character
JSON¶
To read a JSON file:
import json
with open('file.json', 'r') as f:
    data = json.load(f)
To write a JSON file:
import json
data = {'a': 1, 'b': 2}
with open('file.json', 'w') as f:
    json.dump(data, f)
Important parameters:
- indent: the number of spaces used for indentation. Setting it also enables other pretty-printing functionality, like newlines after each element.
Custom serialization¶
The json module can serialize only basic types. If we need to serialize custom objects, we have to provide a custom serialization class. We then supply the class to the cls parameter of the dump function.
The serialization class is usually a subclass of the JSONEncoder class. The class has to implement the default method, which is called for each object that cannot be serialized by the standard serialization methods. Example:
import json

class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, MyCustomClass):
            return obj.to_json()
        return super().default(obj)
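The encoder is then passed via the cls parameter. A complete sketch, with a hypothetical MyCustomClass providing the assumed to_json method:

```python
import json

class MyCustomClass:
    # hypothetical class with a to_json method
    def __init__(self, x):
        self.x = x

    def to_json(self):
        return {'x': self.x}

class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        # called only for objects the standard encoder cannot handle
        if isinstance(obj, MyCustomClass):
            return obj.to_json()
        return super().default(obj)

serialized = json.dumps({'item': MyCustomClass(1)}, cls=MyEncoder)
# '{"item": {"x": 1}}'
```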
HDF5¶
HDF5 is a binary file format for storing large amounts of data. The h5py module provides a Python interface for working with HDF5 files.
An example of reading a dataset from an HDF5 file is on SO.
Command line arguments¶
The sys module provides access to the command line arguments. They are stored in the argv list, with the first element being the name of the script.
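A minimal sketch of reading them (all elements are plain strings):

```python
import sys

script_name = sys.argv[0]  # the path of the executed script
arguments = sys.argv[1:]   # the actual arguments, always as strings
```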
INI files¶
The configparser module provides a Python interface for working with INI files. The basic usage is:
import configparser

config = configparser.ConfigParser()
config.read('file.ini')
value = config['section']['key']
If we do not have sections in the INI file, we have to:
- use the allow_unnamed_section argument of the ConfigParser: config = configparser.ConfigParser(allow_unnamed_section=True)
- use configparser.UNNAMED_SECTION in place of the section name: value = config[configparser.UNNAMED_SECTION]['key']
Logging¶
The logging itself is then done using the logging
module methods:
logging.info("message")
logging.warning("message %s", "with parameter")
A simple logging configuration:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[
        logging.FileHandler("log.txt"),
        logging.StreamHandler()
    ]
)
Note that this configuration can be done only once. Therefore, it should not be done in a library, as it would prevent the user from configuring the logging.
To set the level for a specific logger, we use the setLevel method: logger.setLevel(logging.DEBUG). We can also use a string representation of the level, e.g., logger.setLevel('DEBUG').
To check the level of the logger, we can use the isEnabledFor method:
if logger.isEnabledFor(logging.DEBUG):
    ...
This can be useful for avoiding expensive computations needed just for logging if the logging level is set to a higher level.
Type hints¶
Type hints are a useful way to help the IDE and other tools to understand the code so that they can provide better support (autocompletion, type checking, refactoring, etc.). The type hints are not enforced at runtime, so they do not affect the performance of the code.
We can specify the type of a variable using the : operator:
a: int = 1
Apart from the basic types, we can also use the typing module to specify more complex types:
from typing import List, Dict, Tuple, Set, Optional, Union, Any
a: List[int] = [1, 2, 3]
We can also specify the type of a function argument and return value:
def foo(a: int, b: str) -> List[int]:
    return [a, len(b)]
Type hints in loops¶
The type of the loop variable is usually inferred by the IDE from the type of the iterable. However, this sometimes fails, e.g., for zip objects. In such cases, we need to specify the type of the loop variable. However, we cannot use the : syntax directly in the loop; instead, we have to declare the variable before the loop:
for a: int in ...  # error
a: int
for a in ...  # ok
Circular type hints¶
Unfortunately, Python currently does not support circular type hints directly. However, it should be possible to use circular type hints since Python 3.14.
There are two types of circular type hints:
- we need to refer to the type while defining it. For that, we use the Self type:
from typing import Self

class MyClass:
    def get_me(self) -> Self:
        return self
- two or more types refer to each other. For that, use a string representation of the type:
class ClassA:
    def __init__(self, b: 'ClassB'):
        self.b = b

class ClassB:
    def set_a(self, a: ClassA):
        self.a = a
Common type hints¶
Language types¶
- None
- numeric types: int, float, bool
- str
- collection types: list or List[<type>], tuple or Tuple[<type>, ...], set or Set[<type>], dict or Dict[<key type>, <value type>]
- iterables: Iterable[<type>] - any iterable; Sequence[<type>] - an iterable with random access (the [] operator)
- Any - any type
- Union[<type>, ...] - any of the specified types
- Optional[<type>] - the specified type or None
- Callable[[<arg type>, ...], <return type>] - a function with the specified arguments and return type
Pandas types¶
Pandas does not provide type hints. We can use the types themselves, but this is only partially useful: we can use Series as a hint, but we cannot specify the inner type (e.g., Series[int]). For that, we can use a wrapper library called pandera:
from pandera.typing import Series

a: Series[int]
Official documentation for Pandera data types
Calling external programs¶
To call an external program, we use the subprocess module. Most of the time, we use the run function:
import subprocess

subprocess.run(['ls', '-l'])
Important parameters:
- check: if set to True, the function raises an exception if the return code is not 0. Default is False.
- text (or universal_newlines): if set to True, the function returns the output as a string instead of bytes. Default is None.
- env: a dictionary with the environment variables. Note that the environment inherited from the parent process is not extended but replaced by the provided dictionary. Therefore, if we want to extend the environment, we have to initialize the dictionary with a copy of the parent environment: env = os.environ.copy()
However, the subprocess.run function has some limitations. Notably, it cannot both capture and stream the output. To achieve this, and some other advanced features, we have to use the subprocess.Popen class.
subprocess.Popen
¶
The subprocess.Popen class provides more control over the process. The basic usage is:
p = subprocess.Popen(['ls', '-l']) # start the process
# now we can communicate with the process, stream the output, etc.
p.wait() # wait for the process to finish
# now we can get the return code, continue with the code, etc.
Loading resources¶
Resources can be loaded using the importlib.resources module. This way, we can handle plain files but also resources stored in an archive.
The basic usage is:
import importlib.resources

resource = importlib.resources.files('package').joinpath('file.txt')
# materialize the resource as a real file and pass its path
# to the function expecting a file path
with importlib.resources.as_file(resource) as path:
    my_function(path)
Numpy¶
Data types¶
Documentation of basic data types
Date and time¶
Numpy uses the datetime64 data type for date and time. This type has an internal resolution, which can be anything from nanoseconds to years. The resolution is displayed in the type name, e.g., datetime64[ns], and it is determined:
- automatically from the input data, if dtype is not specified or is specified as datetime64, or
- by the dtype parameter, if specified as datetime64[<resolution>].
Initialization¶
We can create a new array as:
- zero-filled: np.zeros(<shape>, <dtype>)
- ones-filled: np.ones(<shape>, <dtype>)
- empty (uninitialized): np.empty(<shape>, <dtype>)
- filled with a constant: np.full(<shape>, <value>, <dtype>)
Sorting¶
For sorting, we use the sort function.
There is no way to set the sorting order; we have to use a trick instead:
a = np.array([1, 2, 3, 4, 5])
a[::-1].sort() # sort in reverse order
Export to CSV¶
To export a numpy array to CSV, we can use the savetxt function:
np.savetxt('file.csv', a, delimiter=',')
By default, the function saves the values in a float format, even if the values are integers. To save the values as integers, we can use the fmt parameter:
np.savetxt('file.csv', a, delimiter=',', fmt='%i')
Useful array properties¶
- size: the number of array items; unlike len, it counts all items in a multi-dimensional array
- itemsize: the memory (in bytes) needed to store one item of the array
- nbytes: the array size in bytes; should be equal to size * itemsize
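The properties above, for a small float64 array:

```python
import numpy as np

a = np.zeros((2, 3), dtype=np.float64)
n = a.size          # 6 - all items, across both dimensions
first_dim = len(a)  # 2 - len only counts the first dimension
item = a.itemsize   # 8 - bytes per float64 value
total = a.nbytes    # 48 - equals size * itemsize
```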
Useful functions¶
Regular expressions¶
In Python, regex patterns are not compiled by default; therefore, we can use plain strings to store them.
The basic syntax for a regex search is:
import re

result = re.search(<pattern>, <string>)
if result: # the pattern matches
    group = result.group(<group index>)
The 0th group is the whole match, as usual.
To substitute the matched pattern, we can use the sub
function:
pattern = re.compile(r'(\d+)')
result = pattern.sub(r'[\1]', '123') # '[123]'
Sometimes, we need to place digits right after the group reference. In such cases, a plain \<group index> reference would be ambiguous, so we have to use the \g<group index>
notation:
pattern = re.compile(r'(\d+)')
result = pattern.sub(r'\g<1>2025', '123') # '1232025'
Lambda functions¶
Lambda functions in python have the following syntax:
lambda <input parameters>: <return value>
Example:
f = lambda x: x**2
Only a single expression can be used in the lambda function, so we need standard functions for more complex logic (temporary variables, loops, etc.).
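A typical use is a short key function, e.g., for sorting:

```python
words = ['banana', 'fig', 'apple']

# the lambda computes the sort key for each item
words_by_length = sorted(words, key=lambda w: len(w))
print(words_by_length)  # ['fig', 'apple', 'banana']
```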
Decorators¶
Decorators are a special type of function that can be used to modify other functions.
When we write an annotation with the name of a function above another function, the annotated function is decorated. It means that when we call the annotated function, a wrapper function is called instead. The wrapper function is the function returned by the decorater: the function with the same name as the annotation.
If we want to also keep the original function functionality, we have to pass the function to the decorator and call it inside the wrapper function.
In the following example, we create a dummy decorator that keeps the original function functionality: Example:
def decorator(func):
    def wrapper():
        result = func()
        return result
    return wrapper

@decorator
def my_func():
    result = ...  # do something
    return result
Decorator with arguments¶
If the original function has arguments, we have to pass them to the wrapper function. Example:
def decorator(func):
    def wrapper(param_1, param_2):
        result = func(param_1, param_2)
        return result
    return wrapper

@decorator
def my_func(param_1, param_2):
    result = ...  # do something
    return result
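If the decorator should work with any signature, a common general pattern is to forward *args and **kwargs; functools.wraps additionally preserves the metadata of the original function (the add function below is just an illustration):

```python
import functools

def decorator(func):
    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return result
    return wrapper

@decorator
def add(a, b):
    return a + b
```

Without functools.wraps, add.__name__ would be 'wrapper' instead of 'add'.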
Singletons¶
There are several ways how to implement a singleton in Python. The most common are:
- using a module-level variable
- using the __new__ method, possibly in combination with a base class
- using a metaclass
- using a decorator
Module-level variable¶
The simplest way is to use a module-level variable. Note that if the singleton has to be initialized from the outside, the initialization has to be done in a singleton method, not in the constructor!
class Singleton:
    def __init__(self):
        self.initialized = False

    def init(self, init_param):
        if not self.initialized:
            self.initialized = True
            # do the initialization

singleton = Singleton()

# initialization
singleton.init(init_param)
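__new__ method¶
The __new__ approach from the list above can be sketched as follows (a minimal illustration, without the base-class generalization):

```python
class Singleton:
    _instance = None

    def __new__(cls):
        # create the instance only on the first call
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```

Singleton() then always returns the same object.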
Testing with pytest¶
Pytest is a simple testing framework for Python. It uses the assert
statement for testing. The tests are defined in functions with the test_
prefix.
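A minimal example: pytest collects any function with the test_ prefix and reports its failed asserts:

```python
def add(a, b):
    return a + b

def test_add():
    assert add(1, 2) == 3
```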
Fixtures¶
Fixtures are used to set up the environment for more than one test. If defined in the conftest.py
file, they are available for all tests in the project.
Fixtures are defined using the @pytest.fixture
decorator. The fixture can be used in the test function by passing the fixture name as an argument. The fixture has the following structure:
@pytest.fixture
def my_fixture():
    # code for setting up the environment
    yield  # the test is executed here
    # clean up code
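A fixture can also yield a value, which is then passed to the test through the argument with the fixture's name (a small sketch with illustrative names):

```python
import pytest

@pytest.fixture
def numbers():
    data = [1, 2, 3]  # set up
    yield data        # the test runs here, receiving data
    data.clear()      # clean up

def test_sum(numbers):
    assert sum(numbers) == 6
```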
Mocking¶
For mocking, we can use the pytest-mock
package. After installation, we can use the mocker
fixture in any test function.
Capturing output¶
To capture the output, we can use the capsys
fixture:
def test_output(capsys):
    print('hello')
    captured = capsys.readouterr()
    assert captured.out == 'hello\n'
Similarly, we can inspect the standard error output using the captured.err
attribute.
Jupyter¶
Memory¶
Most of the time, when the memory allocated by the notebook is larger than expected, it is caused by some library objects (plots, tables...). Sometimes, however, the cause is forgotten user objects. To list all user objects, from the largest:
import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)
Reloading modules with autoreload¶
When modules are imported, they are not reloaded unless the kernel is restarted. In Python scripts, this does not matter: we just execute the script again. However, when working with notebooks, it may be inconvenient to restart the kernel and rerun all necessary cells just because of a small change in an imported module. Instead, we can use the autoreload
extension.
First, we have to load the extension:
%load_ext autoreload
Then, we configure the autoreload with %autoreload <mode>. The most common modes are:
- now (default): reload all modules immediately (if not excluded by the %aimport magic)
  - This is useful especially if the automatic reloading does not work as expected.
- 0, off: disable autoreload
- 1, explicit: reload modules that were imported using the %aimport magic every time before executing the Python code
- 2, all: reload all modules (except those excluded by %aimport) every time before executing the Python code
- 3, complete: same as 2, but also add any new objects in the module
Plotting¶
There are several libraries for plotting in Python. The most common are:
matplotlib
plotly
In the table below, we can see a comparison of the most common plotting libraries:
| Functionality | Matplotlib | Plotly |
| --- | --- | --- |
| real 3D plots | no | yes |
| detailed legend styling (padding, round corners...) | yes | no |
Matplotlib¶
Saving figures¶
To save a figure, we can use the savefig
function. The savefig
function has to be called before the show
function, otherwise the figure will be empty.
Docstrings¶
For documenting Python code, we use docstrings, special comments surrounded by three quotation marks: """ docstring """
Unlike in other languages, there are multiple styles for docstring content. The most common are:
- Epytext
"""
@param <param name>: <param description>
@return: <return description>
"""
- Google
"""
Args:
    <param name>: <param description>
Returns:
    <return description>
"""
- Numpy
"""
Parameters
----------
<param name> : <param type>
    <param description>
Returns
-------
<return type>
    <return description>
"""
- reStructuredText
"""
:param <param name>: <param description>
:return: <return description>
"""
Progress bars¶
For displaying progress bars, we can use the tqdm
library. It is very simple to use:
from tqdm import tqdm

for i in tqdm(range(100)):
    ...
Important parameters:
- desc: description of the progress bar
TQDM in Jupyter¶
When using tqdm
in Jupyter, the basic progress bar may not work (it may print other logs repeatedly). In such cases, we can change the import to:
from tqdm.notebook import tqdm
If the code can be called both from Jupyter and from the console, we can use the tqdm.autonotebook module, which selects the appropriate progress bar automatically.
PostgreSQL¶
When working with PostgreSQL databases, we usually use either
- the psycopg2 adapter, or
- the sqlalchemy library.
psycopg2¶
To connect to a database:
con = psycopg2.connect(<connection string>)
After running this code, a new session is created in the database; this session is handled by the con
object.
Operations on the database are then done as follows:
- create a cursor object, which represents a database transaction
cur = con.cursor()
- execute any number of SQL commands
cur.execute(<sql>)
- commit the transaction
con.commit()
SQLAlchemy¶
SQLAlchemy works with engine objects that represent the application's connection to the database. The engine object is created using the create_engine
function:
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/dbname')
A simple SELECT
query can be executed using the following code:
with engine.connect() as connection:
    result = connection.execute("SELECT * FROM table")
    ...
With modifying statements, the situation is more complicated, as SQLAlchemy uses transactions by default. Therefore, we need to commit the transaction. There are two ways how to do that:
- using the commit method of the connection object
with engine.connect() as connection:
    connection.execute("INSERT INTO table VALUES (1, 2, 3)")
    connection.commit()
- creating a new block for the transaction using the begin method of the connection object
with engine.connect() as connection:
    with connection.begin():
        connection.execute("INSERT INTO table VALUES (1, 2, 3)")
  - this option also has a shortcut: the begin method of the engine object
with engine.begin() as connection:
    connection.execute("INSERT INTO table VALUES (1, 2, 3)")
Note that the old execute
method of the engine object is not available anymore in newer versions of SQLAlchemy.
Statements with parameters¶
Sometimes it is desirable to use parameters in the SQL statements:
- it prevents SQL injection in case of user input,
- the provided parameters are automatically escaped, so we don't have to worry about quoting.
The parameters are marked with :
in the SQL statement.
The syntax is:
connection.execute("INSERT INTO table VALUES (:param1, :param2)", param1=1, param2=2)
# or
connection.execute("INSERT INTO table VALUES (:param1, :param2)", {'param1': 1, 'param2': 2})
Executing statements without transaction¶
By default, SQLAlchemy executes SQL statements in a transaction. However, some statements (e.g., CREATE DATABASE
) cannot be executed in a transaction. To execute such statements, we have to use the execution_options
method:
with sqlalchemy_engine.connect() as conn:
    conn.execution_options(isolation_level="AUTOCOMMIT")
    conn.execute("<sql>")
    conn.commit()
Getting the affected rowcount¶
The result object returned by the execute
method has the rowcount
attribute that contains the number of affected rows.
Executing multiple statements at once¶
To execute multiple statements at once, for example when executing a script, it is best to use the execute
method of the psycopg2 connection object. Moreover, to safely handle errors, it is best to catch the exceptions and manually rollback the transaction in case of an error:
conn = psycopg2.connect(<connection string>)
cursor = conn.cursor()
try:
    cursor.execute(<sql>)
    conn.commit()
except Exception as e:
    conn.rollback()
    raise e
finally:
    cursor.close()
    conn.close()
Working with GIS¶
When working with GIS data, we usually replace the pandas
library with its GIS extension called geopandas
.
For more, see the pandas manual.
Geocoding¶
For geocoding, we can use the Geocoder library.
Complex data structures¶
KDTree¶
KDTree can be found in the scipy
library.
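A small sketch of a nearest-neighbor query (assuming scipy is installed):

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tree = KDTree(points)

# distance to and index of the nearest point
dist, idx = tree.query([0.1, 0.1])
print(idx)  # 0, i.e., the point [0.0, 0.0]
```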
Geometry¶
There are various libraries for working with geometry in Python:
- scipy.spatial: for basic geometry operations
- shapely: for geometric objects and operations on them
- geopandas: for GIS data
Downloading files¶
To download files from the internet, we can use the requests
library. The basic usage is:
import requests

response = requests.get(<url>)
with open(<filename>, 'wb') as f:
    f.write(response.content)