Main principles¶
Pandas extensively uses the term axis. In Pandas, axis 0 is vertical (rows) and axis 1 is horizontal (columns).
Pandas Data Types¶
Documentation of basic data types
Mainly, pandas uses the numpy data types.
Object¶
If pandas does not recognize the type of a column, or there are multiple types in the column, it uses the `object` type. Although this may sound like a convenient solution, it causes many problems, so be sure to avoid `object` type columns at all costs. Typically, the problem arises when we try to apply a vector operation to the column:

- we round a column with a mix of floats and ints: fail (`loop of ufunc does not support argument 0 of type float which has no callable rint method`)
- we need to apply string functions, but the column contains numbers as well

The solution is usually:

- fill the missing values with the `fillna` function
- convert the column to the `str` type using the `astype` function
- apply string functions to clean the data
- convert the column to the desired type
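A minimal sketch of this workflow, assuming an illustrative column `col` that mixes numbers, strings with a unit suffix, and missing values:
import pandas as pd

df = pd.DataFrame({'col': ['1 kg', 2, None, '3 kg']})

df['col'] = df['col'].fillna(0)                            # fill the missing values
df['col'] = df['col'].astype(str)                          # convert everything to str
df['col'] = df['col'].str.replace(' kg', '', regex=False)  # clean the data with string functions
df['col'] = df['col'].astype(int)                          # convert to the desired type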
Categorical data¶
Sometimes, it can be useful to treat a column as a categorical variable instead of a string or a number. For that, we can use the `CategoricalDtype` class.
Typically, we:

- define the categorical type: `cat_type = pd.CategoricalDtype(categories=['a', 'b', 'c'])`
- use the categorical type to, e.g., convert a column to the categorical type: `df['col'] = df['col'].astype(cat_type)`
Important parameters of the `CategoricalDtype` constructor:

- `categories`: By default, the categories are inferred from the data. However, we can specify the categories explicitly using the `categories` parameter. When using this parameter, each value is converted to the matching category specified in the list. Note that the category must be an exact match, including the data type, case (when using strings), etc. Unmatched values are converted to `NaN`.
- `ordered`: if `True`, the categories are ordered in the order of the `categories` parameter.
Reading categorical data from a file¶
We can read data into categories using the `dtype` parameter. However, this way, we cannot specify the categories explicitly. To do that, we need to read the data into the correct data type and then convert the column/index to the categorical type after the dataframe is created, as sketched below.
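A possible sketch, assuming a file `data.csv` with a column `col`:
cat_type = pd.CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

df = pd.read_csv('data.csv', dtype={'col': str})  # read the column as plain strings first
df['col'] = df['col'].astype(cat_type)            # then convert it to the explicit categorical type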
Datetime¶
Pandas has a special type for datetime values. One of its dangerous properties is that zero parts of the datetime are truncated, both when displaying and on export:
df = pd.DataFrame({'date': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:00:00'])})
print(df)
# output:
#         date
# 0 2021-01-01
# 1 2021-01-01
Creating a DataFrame¶
The `DataFrame` class has a constructor that supports multiple formats of input data as well as many configuration parameters. Therefore, for most formats of input data, we can create a dataframe using the constructor. However, we can also create a dataframe using the `from_*` functions, and for some formats, these functions are the only way to create a dataframe.
From a dictionary¶
When having a dictionary, we can choose between two options: the constructor and the `from_dict` function.
The required syntax depends on the shape of the dictionary with respect to the required dataframe.
Keys are column names, values are list of column values¶
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# or equivalently
df = pd.DataFrame.from_dict({'col1': [1, 2], 'col2': [3, 4]})
Note that the values of the dictionary have to be lists. If we have a dictionary with values that are not lists (i.e., only one row), we have to use the `orient` parameter to specify the orientation of the data and then transpose the dataframe:
d = {'col1': 1, 'col2': 2}
df = pd.DataFrame.from_dict(d, orient='index').T
# or equivalently
df = pd.DataFrame([d], columns=d.keys())
Keys are indices, values are values of a single column¶
df = pd.DataFrame.from_dict({'row1': 1, 'row2': 2}, orient='index', columns=['Values'])
Keys are indices, values are values of single row¶
df = pd.DataFrame.from_dict({'row1': [1, 2], 'row2': [3, 4]}, orient='index')
Keys are one column, values are another column¶
d = {'row1 col1': 'row1 col2', 'row2 col1': 'row2 col2'}
df = pd.DataFrame(d.items(), columns=['col1', 'col2'])
# or equivalently
df = pd.DataFrame({'col1': d.keys(), 'col2': d.values()})
From a list of dictionaries¶
df = pd.DataFrame([{'col1': 1, 'col2': 3}, {'col1': 2, 'col2': 4}])
From a list of lists¶
df = pd.DataFrame([[1, 3], [2, 4]], columns=['col1', 'col2'])
Creating a zero or constant-filled dataframe¶
To create a dataframe filled with a constant, we can use the dataframe constructor and pass the constant as the first (data) argument:
df = pd.DataFrame(0, index=range(10), columns=['col1', 'col2'])
Generating the index¶
As displayed in the above example, we can generate a numerical index using the `range` function. However, there are more options:

- date index with `date_range`: `pd.date_range(<start date>, <end date>, freq=<frequency>)`
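For example (illustrative values):
idx = pd.date_range('2021-01-01', '2021-01-02', freq='1h')  # hourly index covering one day
df = pd.DataFrame(0, index=idx, columns=['col1', 'col2'])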
Determining the data type¶
By default, Pandas infers the data type of the columns from the content. However, this has some limitations:

- The data are processed in chunks to avoid excessive memory usage. Therefore, if the data contain a value not compatible with the type inferred from the first chunk, the previously processed data have to be processed again.
- If the data are not complete or valid, the data type is typically inferred as `object`, instead of reporting an error.
To specify the data type of the columns, we proceed according to the desired data type:

- for datetime data, we use the `parse_dates` parameter of the `read_csv` function
- for other data types, we use the `dtype` parameter of the `read_csv` function.
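A minimal sketch, assuming a file `data.csv` with columns `id`, `name`, and `created`:
df = pd.read_csv(
    'data.csv',
    dtype={'id': int, 'name': str},  # explicit types for non-datetime columns
    parse_dates=['created'],         # parse this column as datetime
)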
Obtaining info about dataset¶
For a DataFrame `df`:

- column names: `df.columns`
- column types: `df.dtypes`
- number of rows: `len(df)`
Iteration¶
Standard Iteration¶
https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
Iteration without modifying the dataframe. The methods are listed from the fastest to the slowest.
Vector operations¶
List Comprehensions¶
Apply¶
The `apply` function can be used to apply a function to each row or column of the dataframe. For iterating over rows, we need to set the `axis` parameter to 1. Example:
df['new_col'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)
itertuples()¶
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html
Returns dataframe rows as pandas named tuples with index as the first member of the tuple.
iterrows()¶
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
Returns a tuple `(index, data)` for each row. It does not preserve the dtypes.
items()¶
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.items.html Iterates over columns
Iteration with modification¶
For modification, the best strategy is to:

- select what we want to modify (see selection)
- modify the selection with the assignment operator. The right side of the assignment operator can be the result of an iteration (see the sketch below).
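A minimal sketch (the column names `cancelled`, `delay`, `col1`, and `col2` are illustrative):
df.loc[df['cancelled'], 'delay'] = 0              # vectorized modification of a filtered selection
df.loc[:, 'col2'] = [x * 2 for x in df['col1']]   # assignment where the right side comes from an iteration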
Filtration¶
filtered = df[df['max_delay'] == x]
or equivalently:
filtered = df[df.max_delay == x]
Filtration by Multiple Columns¶
Example:
filtered = df[(df['max_delay'] == x) & (df['exp_length'] == y)]
Accept multiple values¶
For that, we can use the `isin` function:
filtered = df[df['max_delay'].isin([x, y])]
Using the query function¶
The query function can be used for more complicated filters. It is more flexible and the syntax is less verbose. The above filter can be rewritten as (note the `@` prefix for referencing Python variables inside the query string):
filtered = df.query('max_delay == @x and exp_length == @y')
Filtering Series¶
A series can be filtered even more simply than a dataframe:
s = df['col']
sf = s[s <= 10] # now we have a Series with values from df['col'] less than 10
Useful filter functions¶
- null values: `<column selection>.isnull()`
- non null/nan values: `<column selection>.notnull()`
- filtering using the string value: `<column selection>.str.<string function>`
- filtering dates: `<column selection>.dt.<date function>`
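For example (illustrative column names):
df[df['name'].isnull()]                       # rows with a missing name
df[df['name'].str.startswith('A', na=False)]  # rows where the name starts with 'A'
df[df['created'].dt.year == 2021]             # rows with a 2021 timestamp in 'created'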
Selection¶
If we want to select a part of the dataframe (a set of rows and columns) independently of the values of the dataframe (for that, see filtration), we can use these methods:
- `loc`: select by index, works for both rows and columns
- `iloc`: select by position, works for both rows and columns
- `[]`: select by index, works only for columns

There are also other methods that work for selection but do not work for setting values, such as:

- `xs`: select by label, works for both rows and columns
The return type of the selection is determined by the number of selected rows and columns. For a single row or column, the result is a series; for multiple rows and columns, the result is a dataframe. If we want to get a dataframe for a single row or column, we can use the `[]` operator with a list of values:
df[['col1']]
# or
df.loc[['row1']]
# or
df.iloc[[0]]
loc¶
The `loc` operator has many possible input parameters; the most common syntax is
df.loc[<row selection>, <column selection>]
Each selection has the form `<start label>:<end label>`. For the whole column, we therefore use:
df.loc[:, <column name>]
Difference between the array operator on the dataframe and on loc¶
Both methods can be used both for getting and setting the column:
a = df['col']
# or equivalently
a = df.loc[:, 'col']
df2['col'] = a
# or equivalently
df2.loc[:, 'col'] = a
The difference between these two methods is apparent when we want to use a chained selection, i.e., selecting from a selection. While `loc` selects the appropriate rows and columns in one step, so we know that we still refer to the original dataframe, the array operator operations are separate, and therefore, the resulting value can refer to a temporary:
dfmi.loc[:, ('one', 'second')] = value # we set a value of a part of dfmi
dfmi['one']['second'] = value # can be dangerous, we can set value to a temporary
This problem is indicated by a `SettingWithCopy` warning. Sometimes it is not obvious that we use a chain of array operator selections, e.g.:
sel = df[['a', 'b']]
.....
sel['a'] = ... # we possibly edit a temporary!
For more, see the documentation.
iloc¶
The `iloc` method works similarly to the `loc` method, but it uses the position instead of the label. To select more values, we can use the slice syntax:
df.iloc[<start position>:<end position>:<step>, <column slicing...>]
Be aware that if the `iloc` operator selects by a single value (e.g., `df.iloc[3]`), it returns the single row as a series. To get a dataframe slice, we need to use a list of values (e.g., `df.iloc[[3]]`).
Selecting all columns but one¶
If we do not mind copying the dataframe, we can use the `drop` function.
Otherwise, we can use the `loc` method and supply the filtered column labels obtained using the `columns` property:
df.loc[:, df.columns != '<column to skip>']
Multi-index selection¶
When selecting from a dataframe with a multi-index, things get a bit more complicated. There are three ways to select from a multi-index dataframe:

- using `loc` with slices: simple, but verbose
- using `loc` with an `IndexSlice` object: more readable, but requires the `IndexSlice` object to be created first
- using the `xs` function: neat, but does not support all the features, e.g., it does not support ranges
Using loc¶
The general `loc` usage is the same as for a single-index dataframe:
df.loc[<row selection>, <column selection>]
However, each selection is now a tuple, where each element of the tuple corresponds to one level of the multi-index:
df.loc[(<row selection level 1>, <row selection level 2>, ...), (<column selection level 1>, <column selection level 2>, ...)]
The `<row selection>` can be a specific value, a list of values, or a slice. Note that we have to use the `slice` function, as pandas uses the standard slice syntax for something else.
We can skip lower levels to select all values from those levels. However, we cannot skip upper levels. If we want to select all values from an upper level, we need to use `slice(None)` for that level:
df.loc[(slice(None), slice(15, 30)), ...]
Note that for multi-index slicing, the index needs to be sorted. If it is not, we can use the `sort_index` function.
Using IndexSlice for more readable syntax¶
We can obtain the same result with a more readable syntax using the `IndexSlice` object:
idx = pd.IndexSlice
dft.loc[idx[:, 15:30], ...]
Handling the too many indexers error¶
Sometimes, when using the `loc` method, the selection can fail with the `too many indexers` error, because it is ambiguous whether we select by rows or by columns. In that case, we can either

- use the `axis` parameter to specify the axis to select from: `df.loc(axis=0)[<row selection>]`
- or use the `IndexSlice` instead.
Using xs¶
The `xs` function can be used to select from a multi-index dataframe. However, slices (ranges) are not supported. Example:
df.xs(15, level=1) # selects all rows with level 1 equal to 15
Select row with a maximum value in a column¶
To get the index of the row with the maximum value in a column, we can use the `idxmax` function:
df['col'].idxmax()
Then we can use the `loc` method to get the row.
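Put together:
best_row = df.loc[df['col'].idxmax()]  # the row with the maximum value in 'col'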
Selecting a single value (cell, scalar)¶
When we select a single value from a dataframe, the result is sometimes a series, especially when we use a filtration. To get a scalar, we can use the `item()` method:
df.loc[<row>, <column>].item()
Sorting¶
For sorting the dataframe, we can use the `sort_values` function. The first argument is the list of columns to sort by, starting with the most important column. Example:
df.sort_values(['col1', 'col2'])
If we want to use a custom sorting function, we can use the `key` argument. The key function should:

- satisfy the classical Python sorting interface (see the Python manual), and
- be a vector function, i.e., instead of returning a single position for a given value, it should return a vector of positions for a given vector of values.
Example:
def key_fn(column: pd.Series):
    return [len(x) for x in column]

df.sort_values('col', key=key_fn)
Working with columns¶
Adding a column¶
The preferable way is to use the `assign` function:
# adds a column named 'instance_length' with constant value
result_df_5_nyc_mv.assign(instance_length = 5)
Multiple columns can be added at once:
trips = trips.assign(dropoff_datetime = 0, dropoff_GPS_lon = 0, dropoff_GPS_lat = 0, pickup_GPS_lon = 0, pickup_GPS_lat = 0)
Rename a column¶
To rename a column, we can use the pandas `rename` function:
df.rename(columns={'<old name 1>': '<new name 1>', '<old name 2>': '<new name 2>'})
# or equivalently
df.rename({'<old name 1>': '<new name 1>', '<old name 2>': '<new name 2>'}, axis='columns')
Rename a Series (column)¶
The column name in the series object is actually the name of the series. To rename the series, we can use the `rename` function, or we can set the `name` property of the series:
s.rename('<new name>')
# or equivalently
s.name = '<new name>'
Index¶
The index of a dataframe `df` can be accessed by `df.index`. Standard range operations can be applied to the index.
Selecting just a single index level from a multi-index¶
If we want to select just a single index level, we can use the `get_level_values` function:
df.index.get_level_values(<level>)
Note, however, that this function returns duplicated values when there are multiple values in other levels. To get unique values, we can use the `unique` function.
There is also another method that returns unique values: the `levels` property:
df.index.levels[<level>]
However, this way, we can get outdated values, as the values are not always updated when the index is changed. To get the updated values, we need to call the `remove_unused_levels` method after each change of the index.
Renaming the index¶
The `Index.rename` function can be used for that.
Replacing the index¶
We have several options for how to replace the index with a new one:

- Using columns as a new index: for that, we can use the `set_index` function.
- Using an existing index to create a new index: for that, we can use the `reindex` function.
- Creating the index from scratch.
reindex¶
The first parameter is the new index. Example:
df.reindex(df.index + 1) # creates a new index by adding 1 to the old index
Important parameters:

- `fill_value`: the value to use for missing values. By default, the missing values are filled with `NaN`.
- `level`: the level to reindex in case of a multi-index.

Note that the `reindex` function can only be used for unique values. If we have a multi-index with duplicate values on the level we want to reindex, we need to create a new index from scratch.
Creating index from scratch¶
To create an index from scratch, we just assign the index to the dataframe `index` property:
df.index = pd.Index([1, 2, 3, 4, 5])
We can also assign a range directly to the index:
df.index = range(5)
Creating multi-index¶
There is a `MultiIndex` constructor that can be used to create a multi-index. However, most of the time, it is more convenient to use dedicated factory functions:

- `MultiIndex.from_arrays`: creates a multi-index from an array of arrays (e.g., a list of lists).
- `MultiIndex.from_tuples`
- `MultiIndex.from_product`: creates a multi-index from the cartesian product of the given iterables. Example:

df.index = pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']])
# results in
# MultiIndex([
#     ('one', 'a'),
#     ('one', 'b'),
#     ('two', 'a'),
#     ('two', 'b'),
# ])
Important parameters for both the constructor and the factory functions:

- `names`: the names of the index levels
Shifting a time/date index¶
To shift a time, date, or datetime index, we can use the `shift` function with the `freq` parameter.
The `freq` parameter is important, as it changes the mode of operation of the `shift` function:

- if `freq` is not specified, the index stays the same and the data are shifted by the specified number of periods.
- if `freq` is specified, the index is shifted by the specified number of periods while the data stay in the same position.
Example:
df.index = pd.date_range('2021-01-01', periods=10, freq='1h')
df.shift(1, freq='1h') # shifts the index by 1 hour
df.shift(1) # shifts the data by 1 hour against the index
The `shift` function is not defined for a `MultiIndex`! In that case, we need to create the shifted index level manually:
df.index = pd.MultiIndex.from_product([['one', 'two'], pd.date_range('2021-01-01', periods=10, freq='1h')])
new_index = df.index.levels[1] + pd.Timedelta(hours=1)
df.index = df.index.set_levels(new_index, level=1)
Aggregation¶
Analogously to SQL, pandas has a `groupby` function for aggregating rows. The usage is as follows:
group = df.groupby(<columns>) # returns a groupby object grouped by the columns
selection = group[<columns>] # we can select only some columns from the groupby object
aggregation = selection.<aggregation function> # we apply an aggregation function to the selected columns
We can skip the `selection` step and apply the aggregation function directly to the groupby object. This way, the aggregation function is applied to all columns.
Example (sum):
df.groupby('col').sum()
Sums the results for each group (column by column)
To get a count, we can call the `size` function:
df.groupby('col').size()
Note that unlike in SQL, the aggregation function does not have to return a single value. It can return a series or a dataframe. In that case, the result is a dataframe with the columns corresponding to the returned series/dataframe. In other words, the aggregation does not have to actually aggregate the data, it can also transform it.
Aggregate functions¶
For the aggregate function, we can use one of the prepared aggregation functions. Classical functions (single value per group):

- `sum`
- `mean`
- `median`
- `min`
- `max`
- `count`
Transformation functions (value for each row):

- `cumsum`: cumulative sum
- `diff`: difference between the current and the previous row.
    - the `periods` parameter specifies which row to use for the difference. By default, it is the previous row (`periods=1`). For the next row, use `periods=-1`, but note that the result is then negative. We can use the `abs` function to get the absolute value.
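For example (assuming an illustrative `value` column):
df['delta'] = df.groupby('col')['value'].diff()                       # current minus previous row within each group
df['delta_next'] = df.groupby('col')['value'].diff(periods=-1).abs()  # absolute difference to the next row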
Custom aggregate functions¶
Also, there are more general aggregate functions:

- `agg`: a function that is useful for applying different functions to different columns
- `apply`: the most flexible function that can be used for custom aggregation and transformation operations.
These two functions have different interfaces for the custom aggregation functions they call. The differences are summarized in the following table:

| property | agg | apply |
|---|---|---|
| can just transform the data | no | yes |
| can use data from one column in another column | no | yes |
| applied to | each specified column | the whole dataframe representing a single group |
| output | dataframe | scalar, series, or dataframe |
| can use multiple aggregate functions | yes | no |
agg¶
Example:
df.groupby('col').agg({'col1': 'sum', 'col2': 'mean'})
apply¶
The `apply` function takes a custom function as an argument. That custom aggregation function:

- takes a DataFrame/Series (depending on the source object) as the first argument
    - this dataframe/series contains the data for the group (all columns)
    - the key of each particular group can be accessed using the `name` attribute of the dataframe/series
- returns a Series, DataFrame, or a scalar
    - when a scalar is returned, the result is a series with the scalar value for each group
    - we do not have to reduce the data to a single value or a single row; we can just transform the data arbitrarily.
The process works as follows:

- The dataframe is split into groups according to the `groupby` function.
- The custom function is applied to each group.
- The results are combined into a single dataframe.
In other words, the custom function only sees the dataframe/series representing the group, not the whole dataframe/series. The grouping and the combining of the aggregate results are done by the `apply` function.
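A minimal sketch of a custom function passed to `apply` (the column names `col` and `value` are illustrative):
def summarize(group: pd.DataFrame):
    # group contains only the rows of one group; group.name holds the group key
    return pd.Series({'key': group.name, 'total': group['value'].sum(), 'rows': len(group)})

df.groupby('col').apply(summarize)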
Time aggregation¶
We can also aggregate by time. For that, we need an index or column with datetime values. We use the `resample` function. Example:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, index=pd.date_range('2021-01-01', periods=10, freq='1h'))
df.resample('1D').sum()
# results in
# col1
# 2021-01-01 55
The groups are created by the following logic:

- the groups are aligned to the index of the dataframe (not to the records)
- each group has the duration specified by the `rule` parameter
- each group is labeled by the start of the group
    - e.g., for `rule='1D'`, the group is labeled by the start of the day; if `rule='1h'`, then `12:00` means 12:00-13:00
- records are assigned to the groups based on their datetime value
    - the records that exactly match the start/end of a group are assigned depending on the `closed` parameter
        - `left` (default): the record is assigned to the group with the matching start
        - `right`: the record is assigned to the group with the matching end
There are two possible results of resampling (both may appear in the same dataframe):

- Downsampling: multiple records in the same time group.
- Upsampling: no values in the group. The missing values are filled with `NaN`.
    - note that the `NaN` automatically changes the data type of the column to `float`, and this is not reverted by filling the missing values later. Therefore, when upsampling, we have to manually change the data type of each column back to the original data type after filling the values.
    - We can use the `ffill` function to fill in the missing values. Example: `df.resample('1h').sum().ffill()`
Joins¶
Similarly to SQL, Pandas has a way to join two dataframes. There are two functions for that:

- `merge`: the most general function that has the behavior known from SQL
- `join`: a more specialized function
The following table lists the most important differences between the two functions:

| property | merge | join |
|---|---|---|
| default join type | inner | left |
| join right table via | column (default) or index (`right_index=True`) | index |
| join left table via | column (default) or index (`left_index=True`) | index, or column (`on=key_or_keys`) |
There is also a static `pd.merge` function. All `merge` and `join` methods are just wrappers around this function.
The indexes are lost after the join (if not used for the join). To keep an index, we can store it as a column before the join.
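For example, joining on a column vs. via the right table's index (illustrative dataframes):
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

merged = pd.merge(left, right, on='key')              # inner join on the 'key' column
joined = left.join(right.set_index('key'), on='key')  # left join, right table joined via its index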
Appending and Concatenating data¶
In pandas, there is a `concat` function that can be used to concatenate data:
pd.concat([df1, df2])
It can concatenate dataframes or series, and it can concatenate vertically (by rows, the default) or horizontally (by columns).
By default, the indices from both input parameters are preserved. To reset the index, we can use the `ignore_index` parameter. Alternatively, to preserve one of the indices, we can set the index of the other dataframe to the index of the first dataframe before the concatenation using the `set_index` function.
I/O¶
csv¶
For reading csv files, we can use the `read_csv` function. Important params:

- `sep`: separator
- `header`: row number to use as column names. If `None`, no header is used
- `skiprows`: number of rows to skip from the beginning
- `delim_whitespace`: if `True`, whitespace is used as the separator. The `sep` parameter is ignored in that case. This is a way to read a file with a variable number of whitespace characters between columns.
For export, we can use the `to_csv` method:
df.to_csv(<file name> [, <other params>])
Useful parameters:

- `index`: if `False`, the index is not exported
- `index_label`: the name of the index column
Json¶
For exporting to json, we can use the `to_json` function.
By default, the data are exported as a list of columns. To export the data as a list of rows, we can use the `orient` parameter:
df.to_json(<file name>, orient='records')
Other important parameters:

- `indent`: the number of spaces to use for indentation
Excel¶
For reading excel files, we can use the `read_excel` function. Important params:

- `sheet_name`: the name of the sheet to read. The default (`0`) reads the first sheet; if `None`, all sheets are read (as a dict of dataframes).
Insert dataframe into db¶
We can use the `to_sql` method for that:
df.to_sql(<table name>, <sql alchemy engine> [, <other params>])
Important params:

- to append, not replace, existing records: `if_exists='append'`
- do not import the dataframe index: `index=False`
For larger datasets, it is important not to insert everything at once, while also tracking the progress. The following code does exactly that:
from tqdm import tqdm

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

chunksize = int(len(data) / 1000)  # 0.1%
with tqdm(total=len(data)) as pbar:
    for cdf in chunker(data, chunksize):
        # append so that the chunks after the first one do not fail or replace the table
        cdf.to_sql(<table name>, <sqlalchemy_engine>, if_exists='append')
        pbar.update(chunksize)
If the speed is slow, it can be caused by a low upload speed of your internet connection. Note that due to the SQL syntax, the size of the SQL strings may be much larger than the size of the dataframe.
Latex export¶
Currently, the `DataFrame.to_latex` function is deprecated. The `Styler` class should be used for latex exports instead. You can get the `Styler` from the DataFrame using the `style` property. The usual workflow is:

- Create a `Styler` object from the dataframe using the `style` property,
- Apply the desired formatting to the styler object,
- Export the DataFrame to latex using the `Styler.to_latex` method.
Keep in mind that the `style` property returns a new `Styler` object each time it is accessed, so you need to assign the styler to a variable (and keep working with that variable) or chain the calls. Example:
# wrong, the format is not applied
df.style.format(...)
df.style.to_latex(...)
# correct: temp var
s = df.style.format(...)
s.to_latex(...)
# correct: chain calls
df.style.format(...).to_latex(...)
Finally, to produce a nice output, you should use `print` to display the latex string:
print(styler.to_latex())
Formatting the index: columns and row labels¶
The format of the column and row labels is configured by the `format_index` function.
Important parameters:
- `axis`: 0 for rows, 1 for columns (cannot be both)
- `escape`: by default, the index is not escaped; to escape it, we need to set `escape` to `'latex'`.
Formatting or changing the values¶
The values are formatted by the `format` function. Important parameters:

- `escape`: by default, the values are not escaped; to escape them, we need to set `escape` to `'latex'`.
- `na_rep`: the string to use for missing values
- `precision`: the number of decimal places to use for floats
Replacing values¶
To replace some values with something else for the presentation, we can also use the `format` function. For example, to change the boolean presentation in column `col`, we call:
df.style.format({'col': lambda x: 'yes' if x else 'no'})
Highlighting min/max values¶
For highlighting the min/max values, we can use the `highlight_min` and `highlight_max` functions. Important parameters:

- `subset`: the columns in which the highlighting should be applied
- `props`: the css properties to apply to the highlighted cells
Hiding some columns, rows, or indices¶
For hiding some columns, rows, or indices, we can use the `hide` function. Important parameters:

- `axis`: 0 for hiding row indices (default), 1 for hiding column names
- `level`: the level of the multi-index to hide (default is all levels)
- `subset`: the columns or rows to hide (default is all columns or rows)
When used without the `subset` parameter, the `hide` function hides the whole index. To hide just a selected row or column from the data, the `subset` parameter has to be used.
By default, `hide` refers to the row index. To hide a column:
df.style.hide(<column name>, axis=1)
Changing the header (column labels)¶
There is no equivalent to the `header` parameter of the old `to_latex` function in the new style system. Instead, it is necessary to change the column names of the dataframe.
Exporting to latex¶
For the export, we use the `to_latex` function. Important parameters:

- `convert_css`: if `True`, the css properties are converted to latex commands
- `multirow_align`: the alignment of the multirow cells. Options are `t`, `c`, `b`
- `hrules`: if set to `True`, horizontal lines are added to the table, specifically to the top, the bottom, and between the header and the body. Note that these hrules are realized as the `\toprule`, `\midrule`, and `\bottomrule` commands from the `booktabs` package, so the package has to be imported.
- `clines`: configuration for hlines between rows. It is a string composed of two parts divided by `;` (e.g.: `skip-last;data`). The parts are:
    - whether to skip the last row or not (`skip-last` or `all`)
    - whether to draw the lines only between indices or across the whole rows (`index` or `data`)
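Putting the pieces together, a minimal sketch of an export (the formatting choices are illustrative):
s = (
    df.style
    .format(precision=2, na_rep='--', escape='latex')  # format the values
    .format_index(escape='latex', axis=1)              # escape the column labels
    .hide(axis=0)                                       # hide the row index
)
print(s.to_latex(hrules=True, convert_css=True))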
Displaying the dataframe in console¶
We can display the dataframe in the console, in a print, or in a log just by supplying the dataframe as an argument, because it implements the `__repr__` method. Sometimes, however, the default display parameters are not sufficient. In that case, we can use the `set_option` function to change the display parameters:
pd.set_option('display.max_rows', 1000)
Important parameters:

- `display.max_rows`: the maximum number of rows to display
- `display.max_columns`: the maximum number of columns to display
- `display.max_colwidth`: the maximum width of a column
Other useful functions¶
- `drop_duplicates`: quickly drops duplicate rows based on a subset of columns.
- `factorize`: encodes series values as a categorical variable, i.e., assigns a different number to each unique value in the series.
- `pivot_table`: a function that can aggregate and transform a dataframe in one step. With this function, one can create a pivot table, but also a lot more.
- `cut`: a function that can be used to discretize a continuous variable into bins.
pivot_table¶
The `pivot_table` function does a lot of things at once:

- it aggregates the data
- it transforms the data
- it sorts the data due to reindexing
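A minimal sketch (the column names `group`, `category`, and `value` are illustrative):
# mean of 'value' for each combination of 'group' (rows) and 'category' (columns)
table = df.pivot_table(index='group', columns='category', values='value', aggfunc='mean')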
Although this function is very powerful, there are also many pitfalls. The most important one is:

- column data type change for columns with missing values
Column data type change for columns with missing values¶
The transformation often creates row-column combinations that do not exist in the original data. These are filled with `NaN` values. But some data types do not support `NaN` values, and as a consequence, the data type of the columns with missing values is changed to `float`. Possible solutions:
- we can use the `fill_value` parameter to fill the missing values with some value that is supported by the data type (e.g., -1 for integers)
- we can use the `dropna` parameter to drop the rows with missing values
- we can change the data type of the columns with missing values prior to calling the `pivot_table` function. For example, the pandas nullable integer data types support `NaN` values.
to_datetime¶
The `to_datetime` function can convert various inputs to datetime. It can be applied to both scalars and vectors. Important parameters:

- `unit`: the unit of the input, e.g., `s` for seconds.
- `origin`: the origin of the input, e.g., `unix` for unix timestamps. It can also be any specific `datetime` object.
- `format`: the format of the input, e.g., `%Y-%m-%d` for `2021-01-01`.
squeeze¶
The `squeeze` function removes unnecessary dimensions from a dataframe or series. It is useful when we want to convert a dataframe with a single column to a series, or a series with a single value to a scalar.
Geopandas¶
Geopandas is a GIS addon to pandas, an equivalent of PostGIS. Unfortunately, it currently supports only one geometry column per table.
Do not ever copy-paste the geometries from a Jupyter notebook, as the coordinates are rounded! Use the `to_wkt` function instead.
Create a geodataframe from CSV¶
Geopandas can read csv files with its own `read_file` function; however, it requires a very specific csv format, so it is usually easier to first import the csv into pandas and then create the geopandas dataframe from the pandas dataframe.
Converting pandas Dataframe to geopandas Dataframe¶
The geopandas dataframe constructor accepts pandas dataframes; we just need to specify the geometry column and the coordinate system:
gdf = gpd.GeoDataFrame(
    <PANDAS DATAFRAME>,
    geometry=gpd.points_from_xy(<X COLUMN>, <Y COLUMN>),
    crs=<SRID>
)
Create geodataframe from a shapefile¶
To load data from a shapefile, execute
gdf = gpd.read_file(<PATH TO FOLDER WITH SHAPEFILES>)
Working with the geometry¶
The geometry can be accessed using the `geometry` property of the geodataframe.
Splitting multi-geometry columns¶
If the geometry column contains multi-geometries, we can split them into separate rows using the `explode` function:
gdf = gdf.explode()
Insert geodataframe into db¶
Preprocessing¶
Before inserting a geodataframe into the database, we need to process it a little bit:

- set the SRID: `gdf.set_crs(epsg=<SRID>, allow_override=True, inplace=True)`
- set the geometry: `gdf.set_geometry('geom', inplace=True)`
- select, rename, or add columns so that the resulting geodataframe matches the corresponding database table. This process is the same as when working with `pandas`.
Simple insertion¶
When the data are in the correct format and we don't need any customization of the db query, we can use the `to_postgis` method:
gdf.to_postgis(<TABLE NAME>, <SQL ALCHEMY CONNECTION>, if_exists='append')
Customized Insertion: geoalchemy¶
If we need some special insert statement, we cannot rely on the geodataframe `to_postgis` function, as it is not flexible enough. The pandas `to_sql` function is more flexible; however, it has trouble working with geodata. The easiest option is therefore to use `geoalchemy`, the database wrapper used in `geopandas` (an extension of `sqlalchemy`, which is the database wrapper used in `pandas`).
First, we need to create the insert statement. The example here uses a modification for handling duplicate elements.
meta = sqlalchemy.MetaData()  # a collection for the geoalchemy database objects
table = geoalchemy2.Table(
    '<TABLE NAME>', meta, autoload_with=<SQL ALCHEMY CONNECTION>)
insert_statement = sqlalchemy.dialects.postgresql.insert(table).on_conflict_do_nothing()
In the above example, we create a `geoalchemy` representation of the table and then use this representation to create a customized insert statement (the `on_conflict_do_nothing` is the specialty here). Note that we use a special PostgreSQL insert statement instead of the standard SQLAlchemy insert statement.
Second, we need to prepare the data as a list of dictionary entries:
list_to_insert = [
{'id': 0, 'geom': <GEOM>, ...},
{'id': 0, 'geom': <GEOM>, ...},
....
]
Note that the geometry in the geodataframe is in the shapely format. Therefore, we need to convert it using the geoalchemy `from_shape` function:
geoalchemy2.shape.from_shape(<GEOMETRY>, srid=<SRID>)
Finally, we can execute the query using an sqlalchemy
connection:
sqlalchemy_connection.execute(insert_statement, list_to_insert)
Exporting geodataframe to a file¶
The geodataframe can be exported to a file using the `to_file` method. The format is determined by the file extension.
Pandas in Jupyter¶
Displaying more rows or columns¶
To display more rows or columns in Jupyter than the default, we have to set the pandas `option_context` to the desired values:
# instead of
df
# write
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
display(df)