Main principles

Pandas extensively uses the term axis. In Pandas, axis 0 is vertical (rows) and axis 1 is horizontal (columns).

Pandas Data Types

Documentation of basic data types

Mainly, pandas uses the numpy data types.

Object

If pandas does not recognize the type of the column, or there are multiple types in the column, it uses the object type. While this may sound like a convenient fallback, it causes many problems, so be sure to avoid object type columns at all costs. Typically, the problem arises when we try to apply a vector operation to the column:

  • we round a column with a mix of floats and ints: fail (loop of ufunc does not support argument 0 of type float which has no callable rint method)
  • we need to apply string functions, but the column contains numbers as well

The solution is usually:

  1. fill the missing values with the fillna function
  2. convert the column to str type using the astype function
  3. apply string functions to clear the data
  4. convert the column to the desired type
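
A minimal sketch of this cleanup, assuming a column col that mixes numbers and strings and should end up as an integer column:

df['col'] = df['col'].fillna(0)               # 1. fill the missing values
df['col'] = df['col'].astype(str)             # 2. convert the column to str
df['col'] = df['col'].str.replace(',', '')    # 3. clear the data with string functions
df['col'] = df['col'].astype(int)             # 4. convert to the desired type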

Categorical data

Sometimes, it can be useful to treat a column as a categorical variable instead of a string or a number. For that, we can use the CategoricalDtype class.

Typically, we:

  1. define the categorical type: cat_type = pd.CategoricalDtype(categories=['a', 'b', 'c'])

  2. use the categorical type to, e.g., convert a column to the categorical type: df['col'] = df['col'].astype(cat_type)

Important parameters of the CategoricalDtype constructor:

  • categories: By default, the categories are inferred from the data. However, we can specify the categories explicitly using the categories parameter. When using this parameter, each value is converted to the matching category specified in the list. Note that the category must be an exact match, including the data type, case (when using strings), etc. Unmatched values are converted to NaN.
  • ordered: if True, the categories are ordered in the order of the categories parameter.

Reading categorical data from a file

We can read data into categories using the dtype parameter. However, this way, we cannot specify the categories explicitly. To do that, we need to read the data into the correct data type, and then convert the column/index to the categorical type after the dataframe is created.
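
A sketch of this workflow, assuming a hypothetical file and column name:

df = pd.read_csv('data.csv', dtype={'col': str})                          # read with a plain type first
cat_type = pd.CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
df['col'] = df['col'].astype(cat_type)                                    # convert after the dataframe is created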

Datetime

Pandas has a special type for datetime values. One of its dangerous properties is that zero parts of the datetime are truncated both when displaying and on export:

df = pd.DataFrame({'date': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:00:00'])})
print(df)
# output:
#         date
# 0 2021-01-01
# 1 2021-01-01

Creating a DataFrame

The DataFrame class has a constructor that supports multiple formats of input data as well as many configuration parameters. Therefore, for most formats of input data, we can create a dataframe using the constructor. However, we can also create a dataframe using the from_* functions, and for some formats, these functions are the only way to create a dataframe.

From a dictionary

When having a dictionary, we can choose between two options: the constructor and the from_dict function.

The required syntax depends on the shape of the dictionary with respect to the required dataframe.

Keys are column names, values are list of column values

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# or equivalently
df = pd.DataFrame.from_dict({'col1': [1, 2], 'col2': [3, 4]})

Note that the values of the dictionary have to be lists. If we have a dictionary with values that are not lists (i.e., only one row), we have to use the orient parameter to specify the orientation of the data and then transpose the dataframe:

d = {'col1': 1, 'col2': 2}

df = pd.DataFrame.from_dict(d, orient='index').T
# or equivalently
df = pd.DataFrame([d], columns=d.keys())

Keys are indices, values are values of a single column

df = pd.DataFrame.from_dict({'row1': 1, 'row2': 2}, orient='index', columns=['Values'])

Keys are indices, values are values of single row

df = pd.DataFrame.from_dict({'row1': [1, 2], 'row2': [3, 4]}, orient='index')

Keys are one column, values are another column

d = {'row1 col1': 'row1 col2', 'row2 col1': 'row2 col2'}

df = pd.DataFrame(d.items(), columns=['col1', 'col2'])

# or equivalently
df = pd.DataFrame({'col1': d.keys(), 'col2': d.values()})

From a list of dictionaries

df = pd.DataFrame([{'col1': 1, 'col2': 3}, {'col1': 2, 'col2': 4}])

From a list of lists

df = pd.DataFrame([[1, 3], [2, 4]], columns=['col1', 'col2'])

Creating a zero or constant-filled dataframe

To create a dataframe filled with a constant, we can use the dataframe constructor and pass the constant as the first (data) argument:

df = pd.DataFrame(0, index=range(10), columns=['col1', 'col2'])

Generating the index

As displayed in the above example, we can generate a numerical index using the range function. However, there are more options:

  • date index with date_range
    • pd.date_range(<start date>, <end date>, freq=<frequency>)
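
For example, a daily date index used in the constructor:

df = pd.DataFrame(0, index=pd.date_range('2021-01-01', '2021-01-10', freq='D'), columns=['col1', 'col2'])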

Determining the data type

By default, Pandas infers the data type of the columns from the content. However, this has some limitations:

  • The data are processed line by line to avoid excessive memory usage. Therefore, if the data contains a value not compatible with the type inferred from the first batch of data, the previously processed data have to be processed again.
  • If the data are not complete or valid, the column is typically inferred as the object type instead of an error being reported.

To specify the data type of the columns we proceed according to the desired data type:

  • for datetime data, we use the parse_dates parameter of the read_csv function
  • for other data types, we use the dtype parameter of the read_csv function.
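
A sketch combining both, assuming hypothetical file and column names:

df = pd.read_csv('data.csv', dtype={'id': int, 'name': str}, parse_dates=['created_at'])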

Obtaining info about dataset

For a DataFrame df:

  • column names: df.columns
  • column types: df.dtypes
  • number of rows: len(df)

Iteration

Standard Iteration

https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

The following approaches iterate without modifying the dataframe, ordered from the fastest to the slowest.

Vector operations
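
The fastest option is to avoid explicit iteration entirely and operate on whole columns, e.g.:

df['new_col'] = df['col1'] + df['col2']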

List Comprehensions
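
When no vector operation is available, a list comprehension over the column values is usually the next fastest option. A sketch, assuming some scalar function f:

df['new_col'] = [f(a, b) for a, b in zip(df['col1'], df['col2'])]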

Apply

The apply function can be used to apply a function to each row or column of the dataframe. For iterating over rows, we need to set the axis parameter to 1. Example:

df['new_col'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)

itertuples()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html

Returns dataframe rows as pandas named tuples with index as the first member of the tuple.

iterrows()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html

Returns a tuple (index, row), where row is a Series.

  • it does not preserve the dtype

items()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.items.html

Iterates over columns as (column name, Series) pairs.

Iteration with modification

For modification, the best strategy is to:

  1. select what we want to modify (see selection)
  2. modify the selection with the assignment operator. The right side of the assignment operator can be the result of an iteration.
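
A minimal sketch of this strategy, assuming hypothetical column names:

mask = df['col1'] > 0                              # 1. select what we want to modify
df.loc[mask, 'col2'] = df.loc[mask, 'col1'] * 2    # 2. assign to the selection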

Filtration

filtered = df[df['max_delay'] == x]

or equivalently:

filtered = df[df.max_delay == x]

Filtration by Multiple Columns

Example:

filtered = df[(df['max_delay'] == x) & (df['exp_length'] == y)]

Accept multiple values

For that, we can use the isin function:

filtered = df[df['max_delay'].isin([x, y])]

Using the query function

The query function can be used for more complicated filters. It is more flexible and the syntax is less verbose. Note that Python variables are referenced with the @ prefix inside the query string. The above filter can be rewritten as:

filtered = df.query('max_delay == @x and exp_length == @y')

Filtering Series

A series can be filtered even more simply than a dataframe:

s = df['col']
sf = s[s <= 10] # now we have a Series with the values from df['col'] that are at most 10

Useful filter functions

  • null values: <column selection>.isnull()
  • non null/nan values: <column selection>.notnull()
  • filtering using the string value: <column selection>.str.<string function>
  • filtering dates: <column selection>.dt.<date function>
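
For example (hypothetical column names):

df[df['name'].notnull()]
df[df['name'].str.startswith('A')]
df[df['date'].dt.year == 2021]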

Selection

If we want to select a part of the dataframe (a set of rows and columns) independently of the values of the dataframe (for that, see filtration), we can use these methods:

  • loc: select by index, works for both rows and columns
  • iloc: select by position, works for both rows and columns
  • []: select by index, works only for columns

There are also other methods that work for selection but do not work for setting values, such as:

  • xs: select by label, works for both rows and columns

The return type of the selection is determined by the number of selected rows and columns. For a single row or column, the result is a series; for multiple rows and columns, the result is a dataframe. If we want to get a dataframe for a single row or column, we can use the [] operator with a list of values:

df[['col1']]
# or 
df.loc[['row1']]
# or
df.iloc[[0]]

loc

The operator loc has many possible input parameters, the most common syntax is

df.loc[<row selection>, <column selection>]

Each selection has the form <start label>:<end label>. For the whole column, we therefore use:

df.loc[:, <column name>]

Difference between array operator on dataframe and on loc

Both methods can be used both for getting and setting the column:

a = df['col']
# or equivalently
a = df.loc[:, 'col']

df2['col'] = a
# or equivalently
df2.loc[:, 'col'] = a

The difference between these two methods becomes apparent when we use a chained selection, i.e., selecting from a selection. The loc method selects the appropriate rows and columns in one step, so we know that we still refer to the original dataframe. The array operator selections, on the other hand, are separate operations, and therefore the result can refer to a temporary:

dfmi.loc[:, ('one', 'second')] = value # we set a value of a part of dfmi
dfmi['one']['second'] = value # can be dangerous, we can set value to a temporary

This problem is indicated by a SettingWithCopyWarning. Sometimes it is not obvious that we use a chain of array operator selections, e.g.:

sel = df[['a', 'b']]
.....
sel['a'] =  ... # we possibly edit a temporary!

For more, see the documentation.

iloc

The iloc method works similarly to the loc method, but it uses the position instead of the label. To select more values, we can use the slice syntax

df.iloc[<start position>:<end position>:<step>, <column slicing...>]

Be aware that if the iloc operator selects by a single value (e.g.: df.iloc[3]), it returns the single row as a series. To get a dataframe slice, we need to use a list of values (e.g.: df.iloc[[3]]).

Selecting all columns but one

If we do not mind copying the dataframe, we can use the drop function.

Otherwise, we can use the loc method and supply the filtered column labels obtained using the columns property:

df.loc[:, df.columns != '<column to skip>']

Multi-index selection

documentation

When selecting from a dataframe with a multi-index, things get a bit more complicated. There are three ways to select from a multi-index dataframe:

  • using loc with slices: simple, but verbose
  • using loc with IndexSlice object: more readable, but requires the IndexSlice object to be created first
  • using the xs function: neat, but does not support all the features, e.g., it does not support ranges

Using loc

The general loc usage is the same as for a single index dataframe:

df.loc[<row selection>, <column selection>]

However, each selection is now a tuple, where each element of the tuple corresponds to one level of the multi-index:

df.loc[(<row selection level 1>, <row selection level 2>, ...), (<column selection level 1>, <column selection level 2>, ...)]

The <row selection> can be a specific value, a list of values, or a slice. Note that we have to use the slice function, as pandas uses the standard slice syntax for something else.

We can skip lower levels to select all values from those levels. However, we cannot skip upper levels. If we want to select all values from the upper level, we need to use the slice(None) for that level:

df.loc[(slice(None), slice(15, 30)), ...]

Note that for multi-index slicing, the index needs to be sorted. If it is not, we can use the sort_index function.

pandas slicing documentation

Using IndexSlice for more readable syntax

We can obtain the same result with a more readable syntax using the IndexSlice object:

idx = pd.IndexSlice
dft.loc[idx[:, 15:30], ...]

Handling the too many indexers error

Sometimes, when using the loc method, the selection can fail with the too many indexers error, because it is ambiguous whether we select by rows or by columns. In that case, we can either

  • use the axis parameter to specify the axis to select from: df.loc(axis=0)[<row selection>]

  • or use the IndexSlice instead.

Using xs

The xs function can be used to select from a multi-index dataframe. However, slices (ranges) are not supported. Example:

df.xs(15, level=1) # selects all rows with level 1 equal to 15

Select row with a maximum value in a column

To get the index of the row with the maximum value in a column, we can use the idxmax function:

df['col'].idxmax()

Then we can use the loc method to get the row.
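
Combined in one expression:

row = df.loc[df['col'].idxmax()]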

Selecting a single value (cell, scalar)

When we select a single value from a dataframe, the result is sometimes a series, especially when we use a filtration. To get a scalar, we can use the item() method:

df.loc[<row>, <column>].item()

Sorting

For sorting the dataframe, we can use the sort_values function. The first argument is the list of columns to sort by, starting with the most important column. Example:

df.sort_values(['col1', 'col2'])

If we want to use a custom sorting function, we can use the key argument. The key function should

  • satisfy the classical python sorting interface (see Python manual) and
  • be a vector function, i.e., instead of returning a single sort key for a given value, it should return a vector of sort keys for a given vector of values.

Example:

def key_fn(column: pd.Series):
    # sort by the length of each value
    return column.map(len)

df.sort_values('col', key=key_fn)

Working with columns

Adding a column

The preferable way is to use the assign function:

# adds a column named 'instance_length' with constant value
result_df_5_nyc_mv.assign(instance_length = 5) 

Multiple columns can be added at once:

trips = trips.assign(dropoff_datetime = 0, dropoff_GPS_lon = 0, dropoff_GPS_lat = 0, pickup_GPS_lon = 0, pickup_GPS_lat = 0)

Rename a column

To rename a column, we can use the pandas rename function:

df.rename(columns={'<old name 1>': '<new name 1>', '<old name 2>': '<new name 2>'})

# or equivalently
df.rename({'<old name 1>': '<new name 1>', '<old name 2>': '<new name 2>'}, axis='columns')

Rename a Series (column)

The column name in the series object is actually the name of the series. To rename the series, we can use the rename function, or we can set the name property of the series:

s.rename('<new name>')
# or equivalently
s.name = '<new name>'

Index

The index of a dataframe df can be accessed by df.index. Standard range operations can be applied to the index.

Selecting just a single index level from a multi-index

If we want to select just a single index level, we can use the get_level_values function:

df.index.get_level_values(<level>)

Note however, that this function returns duplicated values when there are multiple values in other levels. To get unique values, we can use the unique function.

There is also another method that returns unique values: the levels property:

df.index.levels[<level>]

However, this way, we can get outdated values, as the levels are not always updated when the index is changed. To get the updated values, we need to call the remove_unused_levels method after each change of the index.

Renaming the index

The Index.rename function can be used for that.

Replacing the index

We have several options how to replace the index with a new index:

  • Using columns as a new index: for that, we can use the set_index function.
  • Using an existing index to create a new index: for that, we can use the reindex function.
  • Creating index from scratch

reindex

The first parameter is the new index. Example:

df.reindex(df.index + 1) # creates a new index by adding 1 to the old index

Important parameters:

  • fill_value: the value to use for missing values. By default, the missing values are filled with NaN.
  • level: the level to reindex in case of a multi-index.

Note that the reindex function can only be used for unique values. If we have a multi-index with duplicate values on the level we want to reindex, we need to create a new index from scratch.

Creating index from scratch

To create an index from scratch, we just assign the index to the dataframe index property:

df.index = pd.Index([1, 2, 3, 4, 5])

We can also assign a range directly to the index:

df.index = range(5)

Creating multi-index

documentation

There is a MultiIndex constructor that can be used to create a multi-index. However, most of the time, it is more convenient to use the dedicated factory functions: from_arrays, from_tuples, from_product, and from_frame.

Important parameters for both the constructor and the factory functions:

  • names: the names of the index levels
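
For example, using the from_product factory function:

index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['letter', 'number'])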

Shifting a time/date index

To shift a time or date or datetime index, we can use the shift function with the freq parameter.

The freq parameter is important, as it changes the mode of operation of the shift function:

  • if freq is not specified, the index stays the same and the data are shifted by the specified number of periods.
  • if freq is specified, the index is shifted by the specified number of periods while the data stay in the same position.

Example:

df.index = pd.date_range('2021-01-01', periods=10, freq='1h')
df.shift(1, freq='1h') # shifts the index by 1 hour
df.shift(1) # shifts the data by 1 hour against the index

The shift function is not defined for a MultiIndex! In that case, we need to recreate the shifted index level manually:

df.index = pd.MultiIndex.from_product([['one', 'two'], pd.date_range('2021-01-01', periods=10, freq='1h')])
new_index = df.index.levels[1] + pd.Timedelta(hours=1)
df.index = df.index.set_levels(new_index, level=1)

Aggregation

Analogously to SQL, pandas has a groupby function for aggregating rows. The usage is as follows:

group = df.groupby(<columns>) # returns a groupby object grouped by the columns
selection = group[<columns>] # we can select only some columns from the groupby object
aggregation = selection.<aggregation function> # we apply an aggregation function to the selected columns

We can skip the selection step and apply the aggregation function directly to the groupby object. This way, the aggregation function is applied to all columns.

Example (sum):

df.groupby('col').sum()

Sums the results for each group (column by column)

To get a count, we can call the size function:

df.groupby('col').size()

Note that unlike in SQL, the aggregation function does not have to return a single value. It can return a series or a dataframe. In that case, the result is a dataframe with the columns corresponding to the returned series/dataframe. In other words, the aggregation does not have to actually aggregate the data, it can also transform it.

Aggregate functions

For the aggregate function, we can use one of the prepared aggregation functions. Classical functions (single value per group):

  • sum
  • mean
  • median
  • min
  • max
  • count

Transformation functions (value for each row):

  • cumsum: cumulative sum
  • diff: difference between the current and the previous row.
    • the periods parameter specifies which row to use for the difference. By default, it is the previous row (periods=1). For the next row, use periods=-1, but note that the result is then negated; we can use the abs function to get the absolute value, as shown below.
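
For example, the absolute difference to the next row within each group (hypothetical column names):

df.groupby('group')['value'].diff(periods=-1).abs()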

Custom aggregate functions

Also, there are more general aggregate functions:

  • agg: a function that is useful for applying different functions to different columns, and
  • apply: the most flexible function that can be used for custom aggregation and transformation operations.

These two functions have different interfaces for the custom aggregation functions they call. These are summarized in the following table:

property | agg | apply
can just transform the data | no | yes
can use data from one column in another column | no | yes
applied to | each specified column | the whole dataframe representing a single group
output | dataframe | scalar, series, or dataframe
can use multiple aggregate functions | yes | no

agg

Example:

df.groupby('col').agg({'col1': 'sum', 'col2': 'mean'})

apply

The apply function takes a custom function as an argument. That custom aggregation function:

  • takes a DataFrame/Series (depending on the source object) as the first argument
    • this dataframe/series contains the data for the group (all columns)
    • the key of each particular group can be accessed using the name attribute of the dataframe/series
  • returns a Series, DataFrame, or a scalar
    • when a scalar is returned, the result is a series with the scalar value for each group
    • we do not have to reduce the data to a single value or a single row, we can just transform the data arbitrarily.

The process works as follows:

  1. The dataframe is split into groups according to the groupby function.
  2. The custom function is applied to each group.
  3. The results are combined into a single dataframe.

In other words, the custom function only sees the dataframe/series representing the group, not the whole dataframe/series. The grouping and the combining of aggregate results are done by the apply function.
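
A minimal sketch of a custom apply function, assuming hypothetical column names:

def summarize(group: pd.DataFrame) -> pd.Series:
    # group contains all columns of one group; group.name holds the group key
    return pd.Series({'key': group.name, 'total': group['value'].sum()})

df.groupby('group').apply(summarize)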

Time aggregation

documentation

We can also aggregate by time. For that, we need an index or column with datetime values. We use the resample function. Example:

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, index=pd.date_range('2021-01-01', periods=10, freq='1h'))

df.resample('1D').sum()

# results in
#             col1
# 2021-01-01     55

The groups are created by the following logic:

  • the groups are aligned to the index of the dataframe (not to the records)
  • each group has the duration specified by the rule parameter
  • each group is labeled by the start of the group
    • e.g. for rule='1D', the group is labeled by the start of the day, if rule='1h', then 12:00 means 12:00-13:00
  • records are assigned to the group based on their datetime value
  • the records that exactly match the start/end of the group are assigned depending on the closed parameter
    • left (default): the record is assigned to the group with the matching start
    • right: the record is assigned to the group with the matching end

There are two possible results of resampling (both may appear in the same dataframe):

  • Downsampling: multiple records in the same time group.
  • Upsampling: no values in the group. The missing values are filled with NaN.
    • note that the NaN automatically changes the data type of the column to float, and it is not reverted by filling the missing values later. Therefore, when upsampling, we have to manually change the data type of each column to the original data type after filling the values.
    • We can use the ffill function to fill in the missing values. Example: df.resample('1h').sum().ffill()

Joins

Similarly to SQL, Pandas has a way to join two dataframes. There are two functions for that:

  • merge: the most general function that has the behavior known from SQL
  • join: a more specialized function

The following table lists the most important differences between the two functions:

property | merge | join
default join type | inner | left
join right table via | column (default) or index (right_index=True) | index
join left table via | column (default) or index (left_index=True) | index, or column (on=key_or_keys)
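
For example, joining on a hypothetical key column:

merged = pd.merge(left, right, on='key')      # inner join on the key column of both tables
joined = left.join(right, on='key')           # left join: left table via column, right table via index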

There is also a module-level pd.merge function. All merge and join methods are just wrappers around this function.

The indexes are lost after the join (if not used for the join). To keep an index, we can store it as a column before the join.

Appending and Concatenating data

In pandas, there is a concat function that can be used to concatenate data:

pd.concat([df1, df2])

It can concatenate dataframes or series, and it can concatenate vertically (by rows, the default) or horizontally (by columns).

By default, the indices from both input parameters are preserved. To reset the index, we can use the ignore_index parameter. Alternatively, to preserve one of the indices, we can set the index of the other dataframe to the index of the first dataframe before the concatenation using the set_index function.
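
For example:

pd.concat([df1, df2], ignore_index=True)    # by rows, with a fresh 0..n-1 index
pd.concat([df1, df2], axis=1)               # by columns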

I/O

csv

For reading csv files, we can use the read_csv function. Important params:

  • sep: separator
  • header: row number to use as column names. If None, no header is used
  • skiprows: number of rows to skip from the beginning
  • delim_whitespace: if True, the whitespace is used as a separator. The sep parameter is ignored in that case. This is a way how to read a file with variable number of whitespaces between columns.

For export, we can use the to_csv method:

df.to_csv(<file name> [, <other params>])

Useful parameters:

  • index: if False, the index is not exported
  • index_label: the name of the index column

Json

For exporting to json, we can use the to_json function.

By default, the data are exported as a list of columns. To export the data as a list of rows, we can use the orient parameter:

df.to_json(<file name>, orient='records')

Other important parameters:

  • indent: the number of spaces to use for indentation

Excel

For reading excel files, we can use the read_excel function. Important params:

  • sheet_name: the name or index of the sheet to read. By default, the first sheet is read; if set to None, all sheets are read into a dict of dataframes.

Insert dataframe into db

We can use the to_sql method for that:

df.to_sql(<table name>, <sql alchemy engine> [, <other params>])

Important params:

  • to append, not replace existing records: if_exists='append'
  • do not import dataframe index: index=False

For larger datasets, it is important not to insert everything at once and to track the progress. The following code does exactly that:

from tqdm import tqdm

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

chunksize = max(1, int(len(data) / 1000))  # ~0.1% of the data per chunk
with tqdm(total=len(data)) as pbar:
    for cdf in chunker(data, chunksize):
        cdf.to_sql(<table name>, <sqlalchemy_engine>, if_exists='append')  # append so later chunks do not fail
        pbar.update(chunksize)

If the speed is slow, it can be caused by a low upload speed of your internet connection. Note that due to the SQL syntax, the size of the SQL strings may be much larger than the size of the dataframe.

Latex export

Currently, the DataFrame.to_latex function is deprecated. The Styler class should be used for latex exports instead. You can get a Styler from a DataFrame using the style property. The usual workflow is:

  1. Create a Styler object from the dataframe using the style property,
  2. Apply the desired formatting to the styler object,
  3. Export DataFrame to latex using the Styler.to_latex method.

Keep in mind that the style property returns a new Styler object each time it is accessed, so you need to store the Styler in a variable or chain the calls; otherwise, the formatting is lost. Example:

# wrong, the format is not applied
df.style.format(...)
df.style.to_latex(...)

# correct: temp var
s = df.style.format(...)
s.to_latex(...)

# correct: chain calls
df.style.format(...).to_latex(...)

Finally, to produce a nice output, you should use print to display the latex string:

print(styler.to_latex())

Formatting the index: columns and row labels

The format of the column and row labels is configured by the format_index function. Important parameters:

  • axis: 0 for rows, 1 for columns (cannot be both)
  • escape: by default, the index is not escaped, to do so, we need to set escape to 'latex'.
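
For example, to escape the column labels for latex:

styler = df.style.format_index(escape='latex', axis=1)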

Formatting or changing the values

The values are formatted by the format function. Important parameters:

  • escape: by default, the values are not escaped, to do so, we need to set escape to 'latex'.
  • na_rep: the string to use for missing values
  • precision: the number of decimal places to use for floats
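
For example:

styler = df.style.format(escape='latex', na_rep='--', precision=2)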

Replacing values

To replace some values with something else for presentation, we can also use the format function. For example, to change the boolean presentation in column col, we call:

df.style.format({'col': lambda x: 'yes' if x else 'no'})

Highlighting min/max values

For highlighting the min/max values, we can use the highlight_min and highlight_max functions. Important parameters:

  • subset: the columns in which the highlighting should be applied
  • props: the css properties to apply to the highlighted cells
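
A sketch, assuming a hypothetical column and bold highlighting (converted to latex by convert_css=True on export):

styler = df.style.highlight_max(subset=['col1'], props='font-weight: bold;')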

Hiding some columns, rows, or indices

For hiding some columns, rows, or indices, we can use the hide function. Important Parameters:

  • axis: 0 for hiding row indices (default), 1 for hiding column names
  • level: the level of the multi-index to hide (default is all levels)
  • subset: the columns or rows to hide (default is all columns or rows)

When used without the subset parameter, the hide function hides the whole index. To hide just a selected row or column from the data, the subset parameter has to be used.

By default, the hide function operates on the row index. To hide a specific column instead:

df.style.hide([<column name>], axis=1)

Changing the header (column labels)

There is no equivalent to the header parameter of the old to_latex function in the new style system. Instead, it is necessary to change the column names of the dataframe.

Exporting to latex

For the export, we use the to_latex function. Important parameters:

  • convert_css: if True, the css properties are converted to latex commands
  • multirow_align: the alignment of the multirow cells. Options are t, c, b
  • hrules: if set to True, horizontal lines are added to the table, specifically at the top, at the bottom, and between the header and the body. Note that these hrules are realized as the \toprule, \midrule, and \bottomrule commands from the booktabs package, so the package has to be imported.
  • clines: configuration for hlines between rows. It is a string composed of two parts divided by ; (e.g.: skip-last;data). The parts are:
    • whether to skip last row or not (skip-last or all)
    • whether to draw the lines between indices or the whole rows (index or data)
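
For example:

print(df.style.to_latex(hrules=True, convert_css=True, multirow_align='t', clines='skip-last;data'))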

Displaying the dataframe in console

We can display the dataframe in the console or in the log just by printing it or passing it to the logging call, because it implements the __repr__ method. Sometimes, however, the default display parameters are not sufficient. In that case, we can use the set_option function to change the display parameters:

pd.set_option('display.max_rows', 1000)

Important parameters:

  • display.max_rows: the maximum number of rows to display
  • display.max_columns: the maximum number of columns to display
  • display.max_colwidth: the maximum width of a column

Other useful functions

  • drop_duplicates to quickly drop duplicate rows based on a subset of columns.
  • factorize to encode series values as a categorical variable, i.e., it assigns a different number to each unique value in the series.
  • pivot_table: a function that can aggregate and transform a dataframe in one step. With this function, one can create a pivot table, but also a lot more.
  • cut: function that can be used to discretize a continuous variable into bins.

pivot_table

The pivot_table function does a lot of things at once:

  • it aggregates the data
  • it transforms the data
  • it sorts the data due to reindexing

Although this function is very powerful, there are also many pitfalls. The most important ones are:

  • column data type change for columns with missing values

Column data type change for columns with missing values

The transformation often creates row-column combinations that do not exist in the original data. These are filled with NaN values. But some data types do not support NaN values, and as a result, the data type of the columns with missing values is changed to float. Possible solutions:

  • we can use the fill_value parameter to fill the missing values with some value that is supported by the data type (e.g. -1 for integers)
  • we can use the dropna parameter to drop the rows with missing values
  • we can change the data type of the columns with missing values prior to calling the pivot_table function. For example, the pandas integer data types support NaN values.
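
A sketch of a pivot_table call with the fill_value workaround, assuming hypothetical column names:

df.pivot_table(index='row_key', columns='col_key', values='value', aggfunc='sum', fill_value=-1)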

to_datetime

The to_datetime function can convert various inputs to datetime. It can be applied to both scalars and vectors. Important parameters:

  • unit: the unit of the input, e.g., s for seconds.
  • origin: the origin of the input, e.g., unix for unix timestamps. It can be also any specific datetime object.
  • format: the format of the input, e.g., %Y-%m-%d for 2021-01-01.
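
For example:

pd.to_datetime('2021-01-01', format='%Y-%m-%d')
pd.to_datetime(1609459200, unit='s', origin='unix')   # 2021-01-01 00:00:00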

squeeze

The squeeze function removes unnecessary dimensions from a dataframe or series. It is useful when we want to convert a dataframe with a single column to a series, or a series with a single value to a scalar.
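
For example:

s = df[['col']].squeeze()              # single-column dataframe -> series
value = pd.Series([42]).squeeze()      # single-value series -> scalar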

Geopandas

Geopandas is a GIS add-on to pandas, an equivalent to PostGIS. Unfortunately, it currently supports only one geometry column per table.

Do not ever copy-paste geometries from a jupyter notebook, as the displayed coordinates are rounded! Use the to_wkt function instead.

Create a geodataframe from CSV

Geopandas has its own way of reading csv files; however, it requires a very specific csv format, so it is usually easier to first import the csv to pandas and then create a geopandas dataframe from the pandas dataframe.

Converting pandas Dataframe to geopandas Dataframe

The geopandas dataframe constructor accepts pandas dataframes; we just need to specify the geometry column and the coordinate system:

gdf = gpd.GeoDataFrame(
    <PANDAS DATAFRAME>,
    geometry=gpd.points_from_xy(<X COLUMN>, <Y COLUMN>),
    crs=<SRID>
)

Create geodataframe from a shapefile

To load data from shapefiles, execute

gdf = gpd.read_file(<PATH TO FOLDER WITH SHAPEFILES>)

Working with the geometry

The geometry can be accessed using the geometry property of the geodataframe.

Splitting multi-geometry columns

If the geometry column contains multi-geometries, we can split them into separate rows using the explode function:

gdf = gdf.explode()

Insert geodataframe into db

Preprocessing

Before inserting a geodataframe into the database, we need to process it a little bit:

  1. set the SRID: gdf.set_crs(epsg=<SRID>, allow_override=True, inplace=True)
  2. set the geometry: gdf.set_geometry('geom', inplace=True)
  3. select, rename, or add columns so that the resulting geodataframe matches the corresponding database table. This process is the same as when working with pandas.

Simple insertion

When the data are in the correct format and we don't need any customization for the db query, we can use the to_postgis method:

gdf.to_postgis(<TABLE NAME>, <SQL ALCHEMY CONNECTION>, if_exists='append')

Customized Insertion: geoalchemy

If we need some special insert statement, we cannot rely on the geodataframe to_postgis function, as it is not flexible enough. The pandas to_sql function is more flexible; however, it has trouble working with geodata. The easiest option is therefore to use geoalchemy, the database wrapper used in geopandas (an extension of sqlalchemy, which is the database wrapper used by pandas).

First, we need to create the insert statement. The example here uses a modification for handling duplicate records.

meta = sqlalchemy.MetaData()    # create a collection for geoalchemy database  
                                # objects

table = geoalchemy2.Table(
    '<TABLE NAME>', meta, autoload_with=<SQL ALCHEMY CONNECTION>)
insert_statement = (
    sqlalchemy.dialects.postgresql.insert(table).on_conflict_do_nothing())

In the above example, we create a geoalchemy representation of a table, and then we use this representation to create a customized insert statement (the on_conflict_do_nothing is the specialty here). Note that we use the special PostgreSQL insert statement instead of the standard SQLAlchemy insert statement.

Second, we need to prepare the data as a list of dictionary entries:

list_to_insert = [
    {'id': 0, 'geom': <GEOM>, ...},
    {'id': 0, 'geom': <GEOM>, ...},
    ....
]

Note that the geometry in the geodataframe is in the shapely format. Therefore, we need to convert it using the geoalchemy from_shape function:

geoalchemy2.shape.from_shape(<GEOMETRY>, srid=<SRID>)

Finally, we can execute the query using an sqlalchemy connection:

sqlalchemy_connection.execute(insert_statement, list_to_insert)

Exporting geodataframe to a file

The geodataframe can be exported to a file using the to_file method. The format is determined by the file extension.

Pandas in Jupyter

Displaying more rows or columns

To display more rows or columns in Jupyter than the default, we can wrap the display call in the pandas option_context with the desired values:

# instead of
df

# write
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display(df)