Main principles

Pandas extensively uses the term axis. In Pandas, axis 0 is vertical (rows) and axis 1 is horizontal (columns).

Pandas Data Types

Documentation of basic data types

Mainly, pandas uses the numpy data types.

Object

If pandas does not recognize the type of the column, or there are multiple types in the column, it uses the object type. While this may sound like a convenient fallback, it causes many problems, so be sure to avoid object type columns at all costs. Typically, the problem arises when we try to apply a vector operation to the column:

  • we round a column with a mix of floats and ints: fail (loop of ufunc does not support argument 0 of type float which has no callable rint method)
  • we need to apply string functions, but the column contains numbers as well

The solution is usually:

  1. fill the missing values with the fillna function
  2. convert the column to str type using the astype function
  3. apply string functions to clear the data
  4. convert the column to the desired type
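
A minimal sketch of this cleanup, assuming a column col that mixes numbers and strings and should end up as an integer column:

df['col'] = df['col'].fillna(0)               # 1. fill the missing values
df['col'] = df['col'].astype(str)             # 2. convert the column to str
df['col'] = df['col'].str.replace(',', '')    # 3. clear the data with string functions
df['col'] = df['col'].astype(int)             # 4. convert to the desired type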

Categorical data

Sometimes, it can be useful to treat a column as a categorical variable instead of a string or a number. For that, we can use the CategoricalDtype class.

Typically, we:

  1. define the categorical type: cat_type = pd.CategoricalDtype(categories=['a', 'b', 'c'])

  2. use the categorical type to, e.g., convert a column to the categorical type: df['col'] = df['col'].astype(cat_type)

Important parameters of the CategoricalDtype constructor:

  • categories: By default, the categories are inferred from the data. However, we can specify the categories explicitly using the categories parameter. When using this parameter, each value is converted to the matching category specified in the list. Note that the category must be an exact match, including the data type, case (when using strings), etc. Unmatched values are converted to NaN.
  • ordered: if True, the categories are ordered in the order of the categories parameter.

Reading categorical data from a file

We can read data into categories using the dtype parameter. However, this way, we cannot specify the categories explicitly. To do that, we need to read the data into the correct data type, and then convert the column/index to the categorical type after the dataframe is created.
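
A sketch of this workflow, assuming a hypothetical file and column name:

df = pd.read_csv('data.csv', dtype={'col': str})                          # read with a plain type first
cat_type = pd.CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
df['col'] = df['col'].astype(cat_type)                                    # convert after the dataframe is created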

Datetime

Pandas has a special type for datetime values. One of its dangerous properties is that zero parts of the datetime are truncated both when displaying and on export:

df = pd.DataFrame({'date': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:00:00'])})
print(df)
# output:
#         date
# 0 2021-01-01
# 1 2021-01-01

Creating a DataFrame

The DataFrame class has a constructor that supports multiple formats of input data as well as many configuration parameters. Therefore, for most formats of input data, we can create a dataframe using the constructor. However, we can also create a dataframe using the from_* functions, and for some formats, these functions are the only way to create a dataframe.

From a dictionary

When having a dictionary, we can choose between two options: the constructor and the from_dict function.

The required syntax depends on the shape of the dictionary with respect to the required dataframe.

Keys are column names, values are list of column values

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# or equivalently
df = pd.DataFrame.from_dict({'col1': [1, 2], 'col2': [3, 4]})

Note that the values of the dictionary have to be lists. If we have a dictionary with values that are not lists (i.e., only one row), we have to use the orient parameter to specify the orientation of the data and then transpose the dataframe:

d = {'col1': 1, 'col2': 2}

df = pd.DataFrame.from_dict(d, orient='index').T
# or equivalently
df = pd.DataFrame([d], columns=d.keys())

Keys are indices, values are values of a single column

df = pd.DataFrame.from_dict({'row1': 1, 'row2': 2}, orient='index', columns=['Values'])

Keys are indices, values are values of single row

df = pd.DataFrame.from_dict({'row1': [1, 2], 'row2': [3, 4]}, orient='index')

Keys are one column, values are another column

d = {'row1 col1': 'row1 col2', 'row2 col1': 'row2 col2'}

df = pd.DataFrame(d.items(), columns=['col1', 'col2'])

# or equivalently
df = pd.DataFrame({'col1': d.keys(), 'col2': d.values()})

From a list of dictionaries

df = pd.DataFrame([{'col1': 1, 'col2': 3}, {'col1': 2, 'col2': 4}])

From a list of lists

df = pd.DataFrame([[1, 3], [2, 4]], columns=['col1', 'col2'])

Creating a zero or constant-filled dataframe

To create a dataframe filled with a constant, we can use the dataframe constructor and pass the constant as the first (data) argument:

df = pd.DataFrame(0, index=range(10), columns=['col1', 'col2'])

Generating the index

As displayed in the above example, we can generate a numerical index using the range function. However, there are more options:

  • date index with date_range
    • pd.date_range(<start date>, <end date>, freq=<frequency>)
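
For example, a daily date index used in the constructor:

df = pd.DataFrame(0, index=pd.date_range('2021-01-01', '2021-01-10', freq='D'), columns=['col1', 'col2'])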

Determining the data type

By default, Pandas infers the data type of the columns from the content. However, this has some limitations:

  • The data are processed line by line to avoid excessive memory usage. Therefore, if the data contains a value not compatible with the type inferred from the first batch of data, the previously processed data have to be processed again.
  • If the data are not complete or valid, the column is typically inferred as the object type instead of an error being reported.

To specify the data type of the columns we proceed according to the desired data type:

  • for datetime data, we use the parse_dates parameter of the read_csv function
  • for other data types, we use the dtype parameter of the read_csv function.
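
A sketch combining both, assuming hypothetical file and column names:

df = pd.read_csv('data.csv', dtype={'id': int, 'name': str}, parse_dates=['created_at'])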

Obtaining info about dataset

For a DataFrame df:

  • column names: df.columns
  • column types: df.dtypes
  • number of rows: len(df)

Iteration

Standard Iteration

https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

The following approaches iterate without modifying the dataframe, ordered from the fastest to the slowest.

Vector operations
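
The fastest option is to avoid explicit iteration entirely and operate on whole columns, e.g.:

df['new_col'] = df['col1'] + df['col2']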

List Comprehensions
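
When no vector operation is available, a list comprehension over the column values is usually the next fastest option. A sketch, assuming some scalar function f:

df['new_col'] = [f(a, b) for a, b in zip(df['col1'], df['col2'])]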

Apply

The apply function can be used to apply a function to each row or column of the dataframe. For iterating over rows, we need to set the axis parameter to 1. Example:

df['new_col'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)

itertuples()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html

Returns dataframe rows as pandas named tuples with index as the first member of the tuple.

iterrows()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html

Returns a tuple (index, row), where row is a Series.

  • it does not preserve the dtype

items()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.items.html

Iterates over columns as (column name, Series) pairs.

Iteration with modification

For modification, the best strategy is to:

  1. select what we want to modify (see selection)
  2. modify the selection with the assignment operator. The right side of the assignment operator can be the result of an iteration.
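
A minimal sketch of this strategy, assuming hypothetical column names:

mask = df['col1'] > 0                              # 1. select what we want to modify
df.loc[mask, 'col2'] = df.loc[mask, 'col1'] * 2    # 2. assign to the selection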

Filtration

filtered = df[df['max_delay'] == x]

or equivalently:

filtered = df[df.max_delay == x]

Filtration by Multiple Columns

Example:

filtered = df[(df['max_delay'] == x) & (df['exp_length'] == y)]

Accept multiple values

For that, we can use the isin function:

filtered = df[df['max_delay'].isin([x, y])]

Using the query function

The query function can be used for more complicated filters. It is more flexible and the syntax is less verbose. Note that Python variables are referenced with the @ prefix inside the query string. The above filter can be rewritten as:

filtered = df.query('max_delay == @x and exp_length == @y')

Filtering Series

A series can be filtered even more simply than a dataframe:

s = df['col']
sf = s[s <= 10] # now we have a Series with the values from df['col'] that are at most 10

Useful filter functions

  • null values: <column selection>.isnull()
  • non null/nan values: <column selection>.notnull()
  • filtering using the string value: <column selection>.str.<string function>
  • filtering dates: <column selection>.dt.<date function>
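
For example (hypothetical column names):

df[df['name'].notnull()]
df[df['name'].str.startswith('A')]
df[df['date'].dt.year == 2021]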

Selection

If we want to select a part of the dataframe (a set of rows and columns) independently of the values of the dataframe (for that, see filtration), we can use these methods:

  • loc: select by index, works for both rows and columns
  • iloc: select by position, works for both rows and columns
  • []: select by index, works only for columns

There are also other methods that work for selection but do not work for setting values, such as:

  • xs: select by label, works for both rows and columns

The return type of the selection is determined by the number of selected rows and columns. For a single row or column, the result is a series; for multiple rows and columns, the result is a dataframe. If we want to get a dataframe for a single row or column, we can use the [] operator with a list of values:

df[['col1']]
# or 
df.loc[['row1']]
# or
df.iloc[[0]]

loc

The operator loc has many possible input parameters, the most common syntax is

df.loc[<row selection>, <column selection>]

Each selection has the form <start label>:<end label>. For the whole column, we therefore use:

df.loc[:, <column name>]

Difference between array operator on dataframe and on loc

Both methods can be used both for getting and setting the column:

a = df['col']
# or equivalently
a = df.loc[:, 'col']

df2['col'] = a
# or equivalently
df2.loc[:, 'col'] = a

The difference between these two methods becomes apparent when we use a chained selection, i.e., selecting from a selection. The loc method selects the appropriate rows and columns in one step, so we know that we still refer to the original dataframe. The array operator selections, on the other hand, are separate operations, and therefore the result can refer to a temporary:

dfmi.loc[:, ('one', 'second')] = value # we set a value of a part of dfmi
dfmi['one']['second'] = value # can be dangerous, we can set value to a temporary

This problem is indicated by a SettingWithCopyWarning. Sometimes it is not obvious that we use a chain of array operator selections, e.g.:

sel = df[['a', 'b']]
.....
sel['a'] =  ... # we possibly edit a temporary!

For more, see the documentation.

iloc

The iloc method works similarly to the loc method, but it uses the position instead of the label. To select more values, we can use the slice syntax

df.iloc[<start position>:<end position>:<step>, <column slicing...>]

Be aware that if the iloc operator selects by a single value (e.g.: df.iloc[3]), it returns the single row as a series. To get a dataframe slice, we need to use a list of values (e.g.: df.iloc[[3]]).

Selecting all columns but one

If we do not mind copying the dataframe, we can use the drop function.

Otherwise, we can use the loc method and supply the filtered column labels obtained using the columns property:

df.loc[:, df.columns != '<column to skip>']

Multi-index selection

documentation

When selecting from a dataframe with a multi-index, things get a bit more complicated. There are three ways to select from a multi-index dataframe:

  • using loc with slices: simple, but verbose
  • using loc with IndexSlice object: more readable, but requires the IndexSlice object to be created first
  • using the xs function: neat, but does not support all the features, e.g., it does not support ranges

Using loc

The general loc usage is the same as for a single index dataframe:

df.loc[<row selection>, <column selection>]

However, each selection is now a tuple, where each element of the tuple corresponds to one level of the multi-index:

df.loc[(<row selection level 1>, <row selection level 2>, ...), (<column selection level 1>, <column selection level 2>, ...)]

The <row selection> can be a specific value, a list of values, or a slice. Note that we have to use the slice function, as pandas uses the standard slice syntax for something else.

We can skip lower levels to select all values from those levels. However, we cannot skip upper levels. If we want to select all values from the upper level, we need to use the slice(None) for that level:

df.loc[(slice(None), slice(15, 30)), ...]

Note that for multi-index slicing, the index needs to be sorted. If it is not, we can use the sort_index function.

pandas slicing documentation

Using IndexSlice for more readable syntax

We can obtain the same result with a more readable syntax using the IndexSlice object:

idx = pd.IndexSlice
dft.loc[idx[:, 15:30], ...]

Handling the too many indexers error

Sometimes, when using the loc method, the selection can fail with the too many indexers error, because it is ambiguous whether we select by rows or by columns. In that case, we can either

  • use the axis parameter to specify the axis to select from: df.loc(axis=0)[<row selection>]

  • or use the IndexSlice instead.

Using xs

The xs function can be used to select from a multi-index dataframe. However, slices (ranges) are not supported. Example:

df.xs(15, level=1) # selects all rows with level 1 equal to 15

Select row with a maximum value in a column

To get the index of the row with the maximum value in a column, we can use the idxmax function:

df['col'].idxmax()

Then we can use the loc method to get the row.
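
Combined in one expression:

row = df.loc[df['col'].idxmax()]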

Selecting a single value (cell, scalar)

When we select a single value from a dataframe, the result is sometimes a series, especially when we use a filtration. To get a scalar, we can use the item() method:

df.loc[<row>, <column>].item()

Sorting

For sorting the dataframe, we can use the sort_values function. The first argument is the list of columns to sort by, starting with the most important column. Example:

df.sort_values(['col1', 'col2'])

If we want to use a custom sorting function, we can use the key argument. The key function should

  • satisfy the classical python sorting interface (see Python manual) and
  • be a vector function, i.e., instead of returning a single sort key for a given value, it should return a vector of sort keys for a given vector of values.

Example:

def key_fn(column: pd.Series):
    # sort by the length of each value
    return column.map(len)

df.sort_values('col', key=key_fn)

Working with columns

Adding a column

The preferable way is to use the assign function:

# adds a column named 'instance_length' with constant value
result_df_5_nyc_mv.assign(instance_length = 5) 

Multiple columns can be added at once:

trips = trips.assign(dropoff_datetime = 0, dropoff_GPS_lon = 0, dropoff_GPS_lat = 0, pickup_GPS_lon = 0, pickup_GPS_lat = 0)

Rename a column

To rename a column, we can use the pandas rename function:

df.rename(columns={'<old name 1>': '<new name 1>', '<old name 2>': '<new name 2>'})

# or equivalently
df.rename({'<old name 1>': '<new name 1>', '<old name 2>': '<new name 2>'}, axis='columns')

Rename a Series (column)

The column name in the series object is actually the name of the series. To rename the series, we can use the rename function, or we can set the name property of the series:

s.rename('<new name>')
# or equivalently
s.name = '<new name>'

Index

The index of a dataframe df can be accessed by df.index. Standard range operations can be applied to the index.

Selecting just a single index level from a multi-index

If we want to select just a single index level, we can use the get_level_values function:

df.index.get_level_values(<level>)

Note however, that this function returns duplicated values when there are multiple values in other levels. To get unique values, we can use the unique function.

There is also another method that returns unique values: the levels property:

df.index.levels[<level>]

However, this way, we can get outdated values, as the levels are not always updated when the index is changed. To get the updated values, we need to call the remove_unused_levels method after each change of the index.

Renaming the index

The Index.rename function can be used for that.

Replacing the index

We have several options how to replace the index with a new index:

  • Using columns as a new index: for that, we can use the set_index function.
  • Using an existing index to create a new index: for that, we can use the reindex function.
  • Creating index from scratch

reindex

The first parameter is the new index. Example:

df.reindex(df.index + 1) # creates a new index by adding 1 to the old index

Important parameters:

  • fill_value: the value to use for missing values. By default, the missing values are filled with NaN.
  • level: the level to reindex in case of a multi-index.

Note that the reindex function can only be used for unique values. If we have a multi-index with duplicate values on the level we want to reindex, we need to create a new index from scratch.

Creating index from scratch

To create an index from scratch, we just assign the index to the dataframe index property:

df.index = pd.Index([1, 2, 3, 4, 5])

We can also assign a range directly to the index:

df.index = range(5)

Creating multi-index

documentation

There is a MultiIndex constructor that can be used to create a multi-index. However, most of the time, it is more convenient to use the dedicated factory functions: from_arrays, from_tuples, from_product, and from_frame.

Important parameters for both the constructor and the factory functions:

  • names: the names of the index levels
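
For example, using the from_product factory function:

index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['letter', 'number'])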

Shifting a time/date index

To shift a time or date or datetime index, we can use the shift function with the freq parameter.

The freq parameter is important, as it changes the mode of operation of the shift function:

  • if freq is not specified, the index stays the same and the data are shifted by the specified number of periods.
  • if freq is specified, the index is shifted by the specified number of periods while the data stay in the same position.

Example:

df.index = pd.date_range('2021-01-01', periods=10, freq='1h')
df.shift(1, freq='1h') # shifts the index by 1 hour
df.shift(1) # shifts the data by 1 hour against the index

The shift function is not defined for a MultiIndex! In that case, we need to recreate the shifted index level manually:

df.index = pd.MultiIndex.from_product([['one', 'two'], pd.date_range('2021-01-01', periods=10, freq='1h')])
new_index = df.index.levels[1] + pd.Timedelta(hours=1)
df.index = df.index.set_levels(new_index, level=1)

Aggregation

Analogously to SQL, pandas has a groupby function for aggregating rows. The usage is as follows:

group = df.groupby(<columns>) # returns a groupby object grouped by the columns
selection = group[<columns>] # we can select only some columns from the groupby object
aggregation = selection.<aggregation function> # we apply an aggregation function to the selected columns

We can skip the selection step and apply the aggregation function directly to the groupby object. This way, the aggregation function is applied to all columns.

Example (sum):

df.groupby('col').sum()

Sums the results for each group (column by column)

To get a count, we can call the size function:

df.groupby('col').size()

Note that unlike in SQL, the aggregation function does not have to return a single value. It can return a series or a dataframe. In that case, the result is a dataframe with the columns corresponding to the returned series/dataframe. In other words, the aggregation does not have to actually aggregate the data, it can also transform it.

Aggregate functions

For the aggregate function, we can use one of the prepared aggregation functions. Classical functions (single value per group):

  • sum
  • mean
  • median
  • min
  • max
  • count

Transformation functions (value for each row):

  • cumsum: cumulative sum
  • diff: difference between the current and the previous row.
    • the periods parameter specifies which row to use for the difference. By default, it is the previous row (periods=1). For the next row, use periods=-1, but note that the result is then negated; we can use the abs function to get the absolute value, as shown below.
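
For example, the absolute difference to the next row within each group (hypothetical column names):

df.groupby('group')['value'].diff(periods=-1).abs()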

Custom aggregate functions

Also, there are more general aggregate functions:

  • agg: a function that is useful for applying different functions to different columns, and
  • apply: the most flexible function that can be used for custom aggregation and transformation operations.

These two functions have different interfaces for the custom aggregation functions they call. These are summarized in the following table:

property | agg | apply
can just transform the data | no | yes
can use data from one column in another column | no | yes
applied to | each specified column | the whole dataframe representing a single group
output | dataframe | scalar, series, or dataframe
can use multiple aggregate functions | yes | no

agg

Example:

df.groupby('col').agg({'col1': 'sum', 'col2': 'mean'})

apply

The apply function takes a custom function as an argument. That custom aggregation function:

  • takes a DataFrame/Series (depending on the source object) as the first argument
    • this dataframe/series contains the data for the group (all columns)
    • the key of each particular group can be accessed using the name attribute of the dataframe/series
  • returns a Series, DataFrame, or a scalar
    • when a scalar is returned, the result is a series with the scalar value for each group
    • we do not have to reduce the data to a single value or a single row, we can just transform the data arbitrarily.

The process works as follows:

  1. The dataframe is split into groups according to the groupby function.
  2. The custom function is applied to each group.
  3. The results are combined into a single dataframe.

In other words, the custom function only sees the dataframe/series representing the group, not the whole dataframe/series. The grouping and the combining of aggregate results are done by the apply function.
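
A minimal sketch of a custom apply function, assuming hypothetical column names:

def summarize(group: pd.DataFrame) -> pd.Series:
    # group contains all columns of one group; group.name holds the group key
    return pd.Series({'key': group.name, 'total': group['value'].sum()})

df.groupby('group').apply(summarize)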

Time aggregation

documentation

We can also aggregate by time. For that, we need an index or column with datetime values. We use the resample function. Example:

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, index=pd.date_range('2021-01-01', periods=10, freq='1h'))

df.resample('1D').sum()

# results in
#             col1
# 2021-01-01     55

The groups are created by the following logic:

  • the groups are aligned to the index of the dataframe (not to the records)
  • each group has the duration specified by the rule parameter
  • each group is labeled by the start of the group
    • e.g. for rule='1D', the group is labeled by the start of the day, if rule='1h', then 12:00 means 12:00-13:00
  • records are assigned to the group based on their datetime value
  • the records that exactly match the start/end of the group are assigned depending on the closed parameter
    • left (default): the record is assigned to the group with the matching start
    • right: the record is assigned to the group with the matching end

There are two possible results of resampling (both may appear in the same dataframe):

  • Downsampling: multiple records in the same time group.
  • Upsampling: no values in the group. The missing values are filled with NaN.
    • note that the NaN automatically changes the data type of the column to float, and it is not reverted by filling the missing values later. Therefore, when upsampling, we have to manually change the data type of each column to the original data type after filling the values.
    • We can use the ffill function to fill in the missing values. Example: df.resample('1h').sum().ffill()

Joins

Similarly to SQL, Pandas has a way to join two dataframes. There are two functions for that:

  • merge: the most general function that has the behavior known from SQL
  • join: a more specialized function

The following table lists the most important differences between the two functions:

property | merge | join
default join type | inner | left
join right table via | column (default) or index (right_index=True) | index
join left table via | column (default) or index (left_index=True) | index, or column (on=key_or_keys)
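
For example, joining on a hypothetical key column:

merged = pd.merge(left, right, on='key')      # inner join on the key column of both tables
joined = left.join(right, on='key')           # left join: left table via column, right table via index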

There is also a module-level pd.merge function. All merge and join methods are just wrappers around this function.

The indexes are lost after the join (if not used for the join). To keep an index, we can store it as a column before the join.

Appending and Concatenating data

In pandas, there is a concat function that can be used to concatenate data:

pd.concat([df1, df2])

It can concatenate dataframes or series, and it can concatenate vertically (by rows, the default) or horizontally (by columns).

By default, the indices from both input parameters are preserved. To reset the index, we can use the ignore_index parameter. Alternatively, to preserve one of the indices, we can set the index of the other dataframe to the index of the first dataframe before the concatenation using the set_index function.
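
For example:

pd.concat([df1, df2], ignore_index=True)    # by rows, with a fresh 0..n-1 index
pd.concat([df1, df2], axis=1)               # by columns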

I/O

csv

For reading csv files, we can use the read_csv function. Important params:

  • sep: separator
  • header: row number to use as column names. If None, no header is used
  • skiprows: number of rows to skip from the beginning
  • delim_whitespace: if True, the whitespace is used as a separator. The sep parameter is ignored in that case. This is a way how to read a file with variable number of whitespaces between columns.

For export, we can use the to_csv method:

df.to_csv(<file name> [, <other params>])

Useful parameters:

  • index: if False, the index is not exported
  • index_label: the name of the index column

Json

For exporting to json, we can use the to_json function.

By default, the data are exported as a list of columns. To export the data as a list of rows, we can use the orient parameter:

df.to_json(<file name>, orient='records')

Other important parameters:

  • indent: the number of spaces to use for indentation

Excel

For reading excel files, we can use the read_excel function. Important params:

  • sheet_name: the name or index of the sheet to read. By default, the first sheet is read; if set to None, all sheets are read into a dict of dataframes.

Insert dataframe into db

We can use the to_sql method for that:

df.to_sql(<table name>, <sql alchemy engine> [, <other params>])

Important params:

  • to append, not replace existing records: if_exists='append'
  • do not import dataframe index: index=False

For larger datasets, it is important not to insert everything at once and to track the progress. The following code does exactly that:

from tqdm import tqdm

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

chunksize = max(1, int(len(data) / 1000))  # ~0.1% of the data per chunk
with tqdm(total=len(data)) as pbar:
    for cdf in chunker(data, chunksize):
        cdf.to_sql(<table name>, <sqlalchemy_engine>, if_exists='append')  # append so later chunks do not fail
        pbar.update(chunksize)

If the speed is slow, it can be caused by a low upload speed of your internet connection. Note that due to the SQL syntax, the size of the SQL strings may be much larger than the size of the dataframe.

Latex export

Currently, the DataFrame.to_latex function is deprecated. The Styler class should be used for latex exports instead. You can get a Styler from a DataFrame using the style property. The usual workflow is:

  1. Create a Styler object from the dataframe using the style property,
  2. Apply the desired formatting to the styler object,
  3. Export DataFrame to latex using the Styler.to_latex method.

Keep in mind that the style property returns a new Styler object each time it is accessed, so you need to store the Styler in a variable or chain the calls; otherwise, the formatting is lost. Example:

# wrong, the format is not applied
df.style.format(...)
df.style.to_latex(...)

# correct: temp var
s = df.style.format(...)
s.to_latex(...)

# correct: chain calls
df.style.format(...).to_latex(...)

Finally, to produce a nice output, you should use print to display the latex string:

print(styler.to_latex())

Formatting the index: columns and row labels

The format of the column and row labels is configured by the format_index function. Important parameters:

  • axis: 0 for rows, 1 for columns (cannot be both)
  • escape: by default, the index is not escaped, to do so, we need to set escape to 'latex'.
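
For example, to escape the column labels for latex:

styler = df.style.format_index(escape='latex', axis=1)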

Formatting or changing the values

The values are formatted by the format function. Important parameters:

  • escape: by default, the values are not escaped, to do so, we need to set escape to 'latex'.
  • na_rep: the string to use for missing values
  • precision: the number of decimal places to use for floats
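
For example:

styler = df.style.format(escape='latex', na_rep='--', precision=2)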

Replacing values

To replace some values with something else for presentation, we can also use the format function. For example, to change the boolean presentation in column col, we call:

df.style.format({'col': lambda x: 'yes' if x else 'no'})

Highlighting min/max values

For highlighting the min/max values, we can use the highlight_min and highlight_max functions. Important parameters:

  • subset: the columns in which the highlighting should be applied
  • props: the css properties to apply to the highlighted cells
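
A sketch, assuming a hypothetical column and bold highlighting (converted to latex by convert_css=True on export):

styler = df.style.highlight_max(subset=['col1'], props='font-weight: bold;')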

Hiding some columns, rows, or indices

For hiding some columns, rows, or indices, we can use the hide function. Important Parameters:

  • axis: 0 for hiding row indices (default), 1 for hiding column names
  • level: the level of the multi-index to hide (default is all levels)
  • subset: the columns or rows to hide (default is all columns or rows)

When used without the subset parameter, the hide function hides the whole index. To hide just a selected row or column from the data, the subset parameter has to be used.

By default, the hide function operates on the row index. To hide a specific column instead:

df.style.hide([<column name>], axis=1)

Changing the header (column labels)

There is no equivalent to the header parameter of the old to_latex function in the new style system. Instead, it is necessary to change the column names of the dataframe.

Exporting to latex

For the export, we use the to_latex function. Important parameters:

  • convert_css: if True, the css properties are converted to latex commands
  • multirow_align: the alignment of the multirow cells. Options are t, c, b
  • hrules: if set to True, horizontal lines are added to the table, specifically at the top, at the bottom, and between the header and the body. Note that these hrules are realized as the \toprule, \midrule, and \bottomrule commands from the booktabs package, so the package has to be imported.
  • clines: configuration for hlines between rows. It is a string composed of two parts divided by ; (e.g.: skip-last;data). The parts are:
    • whether to skip last row or not (skip-last or all)
    • whether to draw the lines between indices or the whole rows (index or data)
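
For example:

print(df.style.to_latex(hrules=True, convert_css=True, multirow_align='t', clines='skip-last;data'))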

Displaying the dataframe in console

We can display the dataframe in the console or in the log just by printing it or passing it to the logging call, because it implements the __repr__ method. Sometimes, however, the default display parameters are not sufficient. In that case, we can use the set_option function to change the display parameters:

pd.set_option('display.max_rows', 1000)

Important parameters:

  • display.max_rows: the maximum number of rows to display
  • display.max_columns: the maximum number of columns to display
  • display.max_colwidth: the maximum width of a column

Other useful functions

  • drop_duplicates to quickly drop duplicate rows based on a subset of columns.
  • factorize to encode series values as a categorical variable, i.e., it assigns a different number to each unique value in the series.
  • pivot_table: a function that can aggregate and transform a dataframe in one step. With this function, one can create a pivot table, but also a lot more.
  • cut: function that can be used to discretize a continuous variable into bins.

pivot_table

The pivot_table function does a lot of things at once:

  • it aggregates the data
  • it transforms the data
  • it sorts the data due to reindexing

Although this function is very powerful, there are also many pitfalls. The most important ones are:

  • column data type change for columns with missing values

Column data type change for columns with missing values

The transformation often creates row-column combinations that do not exist in the original data. These are filled with NaN values. But some data types do not support NaN values, and as a result, the data type of the columns with missing values is changed to float. Possible solutions:

  • we can use the fill_value parameter to fill the missing values with some value that is supported by the data type (e.g. -1 for integers)
  • we can use the dropna parameter to drop the rows with missing values
  • we can change the data type of the columns with missing values prior to calling the pivot_table function. For example, the pandas integer data types support NaN values.
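
A sketch of a pivot_table call with the fill_value workaround, assuming hypothetical column names:

df.pivot_table(index='row_key', columns='col_key', values='value', aggfunc='sum', fill_value=-1)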

to_datetime

The to_datetime function can convert various inputs to datetime. It can be applied to both scalars and vectors. Important parameters:

  • unit: the unit of the input, e.g., s for seconds.
  • origin: the origin of the input, e.g., unix for unix timestamps. It can be also any specific datetime object.
  • format: the format of the input, e.g., %Y-%m-%d for 2021-01-01.
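
For example:

pd.to_datetime('2021-01-01', format='%Y-%m-%d')
pd.to_datetime(1609459200, unit='s', origin='unix')   # 2021-01-01 00:00:00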

squeeze

The squeeze function removes unnecessary dimensions from a dataframe or series. It is useful when we want to convert a dataframe with a single column to a series, or a series with a single value to a scalar.
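
For example:

s = df[['col']].squeeze()              # single-column dataframe -> series
value = pd.Series([42]).squeeze()      # single-value series -> scalar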

Geopandas

Geopandas is a GIS add-on to pandas, an equivalent to PostGIS. Unfortunately, it currently supports only one geometry column per table.

Do not ever copy-paste geometries from a jupyter notebook, as the displayed coordinates are rounded! Use the to_wkt function instead.

Create a geodataframe from CSV

Geopandas has its own way of reading csv files; however, it requires a very specific csv format, so it is usually easier to first import the csv to pandas and then create a geopandas dataframe from the pandas dataframe.

Converting pandas Dataframe to geopandas Dataframe

The geopandas dataframe constructor accepts pandas dataframes; we just need to specify the geometry column and the coordinate system:

gdf = gpd.GeoDataFrame(
    <PANDAS DATAFRAME>,
    geometry=gpd.points_from_xy(<X COLUMN>, <Y COLUMN>),
    crs=<SRID>
)

Create geodataframe from a shapefile

To load data from shapefiles, execute

gdf = gpd.read_file(<PATH TO FOLDER WITH SHAPEFILES>)

Working with the geometry

The geometry can be accessed using the geometry property of the geodataframe.

Splitting multi-geometry columns

If the geometry column contains multi-geometries, we can split them into separate rows using the explode function:

gdf = gdf.explode()

Insert geodataframe into db

Preprocessing

Before inserting a geodataframe into the database, we need to process it a little bit:

  1. set the SRID: gdf.set_crs(epsg=<SRID>, allow_override=True, inplace=True)
  2. set the geometry: gdf.set_geometry('geom', inplace=True)
  3. select, rename, or add columns so that the resulting geodataframe matches the corresponding database table. This process is the same as when working with pandas.

Simple insertion

When the data are in the correct format and we don't need any customization for the db query, we can use the to_postgis method:

gdf.to_postgis(<TABLE NAME>, <SQL ALCHEMY CONNECTION>, if_exists='append')

Customized Insertion: geoalchemy

If we need some special insert statement, we cannot rely on the geodataframe to_postgis function, as it is not flexible enough. The pandas to_sql function is more flexible; however, it has trouble working with geodata. The easiest option is therefore to use geoalchemy, the database wrapper used in geopandas (an extension of sqlalchemy, which is the database wrapper used by pandas).

First, we need to create the insert statement. The example here uses a modification for handling duplicate records.

meta = sqlalchemy.MetaData()    # create a collection for geoalchemy database  
                                # objects

table = geoalchemy2.Table(
    '<TABLE NAME>', meta, autoload_with=<SQL ALCHEMY CONNECTION>)
insert_statement = (
    sqlalchemy.dialects.postgresql.insert(table).on_conflict_do_nothing())

In the above example, we create a geoalchemy representation of a table, and then we use this representation to create a customized insert statement (the on_conflict_do_nothing is the specialty here). Note that we use the special PostgreSQL insert statement instead of the standard SQLAlchemy insert statement.

Second, we need to prepare the data as a list of dictionary entries:

list_to_insert = [
    {'id': 0, 'geom': <GEOM>, ...},
    {'id': 0, 'geom': <GEOM>, ...},
    ....
]

Note that the geometry in the geodataframe is in the shapely format. Therefore, we need to convert it using the geoalchemy from_shape function:

geoalchemy2.shape.from_shape(<GEOMETRY>, srid=<SRID>)

Finally, we can execute the query using an sqlalchemy connection:

sqlalchemy_connection.execute(insert_statement, list_to_insert)

Exporting geodataframe to a file

The geodataframe can be exported to a file using the to_file method. The format is determined by the file extension.

Pandas in Jupyter

Displaying more rows or columns

To display more rows or columns in Jupyter than the default, we can wrap the display call in the pandas option_context with the desired values:

# instead of
df

# write
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display(df)