SQL is a standard for relational databases. There is no complete implementation of the standard, and the standard itself is not free (you have to pay for it). Therefore, it is sometimes hard to say what exactly is SQL and what is not.

Operators

Comparison operators:

  • =: equal
  • <> or !=: not equal
  • >: greater than
  • <: less than
  • >=: greater or equal
  • <=: less or equal

String operators:

  • ||: concatenation
  • LIKE: pattern matching

Literals

In SQL, there are three basic types of literals:

  • numeric: 1, 1.2, 1.2e3
  • string: 'string', "string"
  • boolean: true, false

When we need a constant value for other types, we usually use either a constructor function or a specifically formatted string literal.

String literals can also span multiple lines; we do not need any operator to split the line (unlike in Python, and despite the fact that JetBrains IDEs add a string concatenation operator automatically on newline).
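For example, the following is a single string literal spanning two lines:

```sql
SELECT 'first line
second line' AS multiline_text;
```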

NULL in SQL


In SQL NULL is a marker for missing values.

When using NULL as an operand, it propagates, meaning that the operation returns NULL:

SELECT 1 + NULL; -- returns NULL
SELECT 'The value is ' || NULL; -- returns NULL

This may be counterintuitive, especially in the case of comparison operators:

SELECT 1 = NULL; -- returns NULL
SELECT 1 != NULL; -- returns NULL
SELECT NULL = NULL; -- returns NULL

The only exception is logical operators, which return a value whenever the result is determined regardless of the NULL operand:

SELECT TRUE OR NULL; -- returns TRUE
SELECT FALSE AND NULL; -- returns FALSE

To solve the problem with comparison operators, SQL has special operators IS NULL and IS NOT NULL:

SELECT 1 IS NULL; -- returns FALSE
SELECT 1 IS NOT NULL; -- returns TRUE
SELECT NULL IS NULL; -- returns TRUE
SELECT NULL IS NOT NULL; -- returns FALSE

Basically, for any a that evaluates to NULL, a IS NULL returns TRUE, while a IS NOT NULL returns FALSE. For any other value of a, the opposite is true.

WITH

A statement for defining variables pointing to temporary data that can be used in the related SELECT statement. Usage:

WITH <var> 
AS (<sql that assigns data to var>) 
<sql that use the variable>

Note that the variable has to appear in the FROM clause!

Multiple variables in the WITH statement should be delimited by a comma:

WITH <var>  AS (<sql that assigns data to var>),
    <var 2> AS (<sql that assigns data to var 2>)
<sql that use the variables>
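A concrete sketch with two variables (the tables and columns are hypothetical):

```sql
WITH recent_orders AS (
    SELECT * FROM orders WHERE created_at >= DATE '2024-01-01'
),
big_recent_orders AS (
    SELECT * FROM recent_orders WHERE total > 100
)
SELECT customer_id, count(1) AS order_count
FROM big_recent_orders
GROUP BY customer_id;
```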

SELECT

The most common SQL statement. Syntax:

SELECT <EXPRESSION> [AS <ALIAS>][, <EXPRESSION 2> [AS <ALIAS 2>] ...]

The most common expression is just a column name.

Selecting row number

We can select row numbers using the function ROW_NUMBER() in the SELECT statement:

SELECT 
    ...
    ROW_NUMBER() OVER([<PARTITIONING AND NUMBERING ORDER>]) AS <RESULT COLUMN NAME>,
    ...

Inside the OVER clause, we can specify the order of the row numbering. Note, however, that this does not order the result; for that, we use the ORDER BY statement. If we want the row numbering to correspond to the row order in the result, we can leave the OVER clause empty.
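A sketch of per-group numbering (the employees table and its columns are hypothetical):

```sql
SELECT
    name,
    ROW_NUMBER() OVER (
        PARTITION BY department   -- numbering restarts for each department
        ORDER BY salary DESC      -- 1 = the highest salary in the department
    ) AS salary_rank
FROM employees;
```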

Select unique rows

To select unique rows, we can use the DISTINCT keyword:

SELECT DISTINCT <column name> FROM ...

Counting selected rows

The count() function can be used to count the selection. The standard syntax is:

SELECT count(1) FROM ...

Count distinct

To count distinct values in a selection we can use:

SELECT count(DISTINCT <column name>) FROM...

Select from another column if the specified column is NULL

We can use a replacement column using the coalesce function:

SELECT coalesce(<primary column>, <secondary column>)

The secondary column value will be used if the primary column value in the row is NULL.
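For example, assuming hypothetical display_name and username columns:

```sql
SELECT coalesce(display_name, username) AS shown_name
FROM users;  -- username is shown when display_name is NULL
```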

UNION

UNION and UNION ALL are important keywords that enable merging query results vertically, i.e., appending the rows of one query to the result set of another.

The difference between them is that UNION discards duplicate rows, while UNION ALL keeps them.

The UNION statement appends one SELECT statement to another, but some clauses that appear to be part of a SELECT need to stay outside (i.e., be specified just once for the whole union), namely ORDER BY and LIMIT. In contrast, the GROUP BY and HAVING clauses stay inside each individual SELECT.
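For example (the customers and suppliers tables are hypothetical):

```sql
SELECT name FROM customers
UNION ALL
SELECT name FROM suppliers
ORDER BY name  -- applies to the whole union, not just the second SELECT
LIMIT 10;
```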

Select hardcoded values

For selecting hardcoded values, we can use the VALUES statement. The syntax is:

VALUES (<value 1>, <value 2>, ...), (<value 1>, <value 2>, ...), ...

Full example of selecting hardcoded values:

SELECT * FROM (VALUES (1, 'a'), (2, 'b')) AS t (id, label)

JOIN

Classical syntax:

JOIN <table name> [[AS] <ALIAS>] ON <CONDITION>

The alias is obligatory if there are duplicate names (e.g., we are joining one table twice).

Types

  • INNER (default): when there is nothing to join, the row is discarded
  • [LEFT/RIGHT/FULL] OUTER: when there is nothing to join, the missing values are set to null
  • CROSS: creates cartesian product between tables

OUTER JOIN

The OUTER JOIN has three subtypes:

  • LEFT: joins right table to left, fills the missing rows on right with null
  • RIGHT: joins left to right, fills the missing rows on left with null
  • FULL: both LEFT and RIGHT JOIN are performed

Properties

  • When there are multiple matching rows, all of them are matched (i.e., it creates duplicate rows)
  • If you want to filter the tables before the join, you need to specify the condition inside the ON clause
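The second property can be illustrated with a pair of hypothetical tables: putting the filter on orders inside the ON clause keeps customers that have no matching order, whereas the same filter in a WHERE clause would discard them, effectively turning the outer join into an inner one.

```sql
-- hypothetical tables customers(id, name) and orders(id, customer_id, status)
SELECT customers.name, orders.id
FROM customers
LEFT OUTER JOIN orders
    ON orders.customer_id = customers.id
    AND orders.status = 'pending';  -- filter applied before the join
```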

Join Only One Row

Joining a specific row

Sometimes, we want to join only one row from many fulfilling some condition. One way to do that is to use a subquery:

SELECT * FROM a
JOIN (
    SELECT * FROM b ORDER BY column_in_b DESC LIMIT 1
) AS best_b ON TRUE

This query joins table b to table a, using the row with the highest column_in_b.

Note, however, that all rows from a will be joined to the same row from b. To use a different row from b depending on a, we need to reference a inside the subquery to filter b accordingly. The following query, which attempts to do exactly that, is invalid:

SELECT * FROM a
JOIN (
    SELECT * FROM b
    WHERE column_in_b <= a.column_in_a
    ORDER BY column_in_b DESC
    LIMIT 1
) AS best_b ON TRUE

The problem here is that the subquery cannot reference tables that appear earlier in the outer FROM clause. Luckily, in most widely used database systems, there is a magical keyword LATERAL that enables exactly that:

SELECT * FROM a
JOIN LATERAL (
    SELECT * FROM b
    WHERE column_in_b <= a.column_in_a
    ORDER BY column_in_b DESC
    LIMIT 1
) AS best_b ON TRUE

Join random row

To join a random row, we can use the RANDOM function in ordering, e.g.:

SELECT * FROM a
JOIN (
    SELECT * FROM b ORDER BY RANDOM() LIMIT 1
) AS random_b ON TRUE

However, this subquery is evaluated just once, hence we have the same problem as with the first example in this section: to every row in a, we are joining the same random row from b.

To force the database system to execute the subquery for each row in a, we need to reference a in the subquery, even with some useless filter (and, of course, we need a LATERAL join for that):

SELECT * FROM a
JOIN LATERAL (
    SELECT * FROM b WHERE a.column_in_a IS NOT NULL ORDER BY RANDOM() LIMIT 1
) AS random_b ON TRUE

Joining any row to find active rows

Sometimes, we need to join any matching row from B to find the set of active (referenced) rows in A. For example, we need to find the set of customers with pending orders.

If the average number of matches in B is low, we proceed with a normal join and then aggregate the results or use DISTINCT. However, sometimes the average number of matching rows in B can be very high (e.g., finding all countries with at least one McDonald's). For those cases, the LATERAL join is again the solution:

SELECT *
    FROM zones
    JOIN LATERAL (
        SELECT id AS request_id 
        FROM demand
        WHERE zones.id = demand.destination
        LIMIT 1
    ) demand ON TRUE

For more information about the trade-offs of this solution, check the SO answer.

Getting All Combinations of Rows

This can be done using the CROSS JOIN, e.g.:

SELECT * FROM table_a CROSS JOIN table_b

The problem arises when you want to filter one of the tables before joining, because CROSS JOIN does not support the ON clause (see more in the Oracle docs).

Then you can use the equivalent INNER JOIN:

SELECT * FROM table_a INNER JOIN table_b ON true

Here you replace the true value with your intended condition.

Inverse JOIN

Sometimes, it is useful to find all rows in table A that have no match in table B. The usual approach is to use a LEFT JOIN and filter out the non-null rows after the join:

SELECT ... FROM tableA
    LEFT JOIN tableB ON tableB.a_id = tableA.id
WHERE tableB.id IS NULL

Joining a table on multiple options

There are situations where we want to join a single table on multiple possible matches. For example, we want to match all people having their birthday or name day on the same day as some promotion is happening. The following query is a straightforward solution:

SELECT *
FROM promo
    JOIN people ON promo.date = people.birthday OR promo.date = people.name_day

However, as we explain in the performance chapter, using OR in SQL is almost never a good solution. The usual way of getting rid of OR is to use IN:

SELECT *
FROM promo
    JOIN people ON promo.date IN (people.birthday, people.name_day)

Nevertheless, this query can still lead to problems, despite being more compact. When we use any complex condition while joining tables, we risk poor performance. In general, it is better to use just simple column-to-column matches, even when there are more joins as a result. If you have performance problems, consult the "Replacing OR" section in the "Performance Optimization" chapter.

GROUP BY

wiki

GROUP BY is used to aggregate data. Usual syntax:

GROUP BY <LIST OF COLUMNS>

Note that we need to use some aggregate function for all columns in the SELECT clause that are not present in the GROUP BY list. On the other hand, we can use columns not present in the SELECT in the GROUP BY statement.

Using aggregate functions for the whole result set while using GROUP BY

If a query contains a GROUP BY statement, all aggregate functions (e.g., count, avg) are applied to groups, i.e., computed for each row of the result set. If we need to apply an aggregate function to the whole result set while using GROUP BY, we need to specify it using the OVER clause:

SELECT count(1) OVER () as result_count
...

This will add a column with the total count to each row of the result. If we do not need the actual groups, but only their count, we can add a LIMIT 1 statement.
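A sketch of the combination (the employees table and its columns are hypothetical): the plain count gives the size of each group, while the windowed count gives the number of groups in the whole result set.

```sql
SELECT
    department,
    count(1) AS department_size,          -- aggregate: rows per group
    count(1) OVER () AS department_count  -- window: rows in the whole result
FROM employees
GROUP BY department;
```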

Get any row from each group

Sometimes, we need a value from a non-grouped column, but we do not care which one. The reason can be, for example, that we know the values are the same for all rows in the group. There is no dedicated aggregate function for this case, but we can use a simple one such as MIN or MAX.
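For example, assuming a hypothetical employees table where all rows of a group share the same department_name, MIN simply picks that shared value:

```sql
SELECT
    department_id,
    min(department_name) AS department_name,  -- any row of the group would do
    count(1) AS employee_count
FROM employees
GROUP BY department_id;
```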

Window functions

Sometimes, we need an aggregate function that somehow uses two different columns (e.g., the value of column A for the row where column B is largest). For that, we cannot use classical aggregation; we rather have to use a window function.

A window function is a function that for each row returns a value computed from one or multiple rows.

Syntactically, we recognize a window function by the OVER clause that determines the rows used as an input for the function. Functions with the same name can exist as aggregate and window functions.

Window functions are evaluated after the GROUP BY clause and aggregate functions.

Specifying the range

  • (): the whole result set
  • (PARTITION BY <column set>): the rows with the same values for that set of columns

We can also order the rows within the selected range using ORDER BY inside the parentheses.

In some SQL dialects (e.g., PostgreSQL), there are even more sophisticated ways to specify the range for window functions.
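As a sketch, the motivating case above (the value of one column for the row where another column is largest) can be solved with the first_value window function; the employees table and its columns are hypothetical:

```sql
SELECT DISTINCT
    department,
    first_value(name) OVER (
        PARTITION BY department ORDER BY salary DESC
    ) AS top_earner  -- name of the best-paid employee in the department
FROM employees;
```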

ORDER BY

By default, there is no guarantee that the result will be in any particular order. To sort the result, we need to add an ORDER BY statement at the end of the query:

ORDER BY <LIST OF COLUMNS>

INSERT

Standard syntax for the INSERT statement is

INSERT INTO <table>
(<col_1>, <col_2>,...)
VALUES (<val_1>, <val_2>,...)

If we fill all columns and we are confident about the column ordering, we can omit the column list:

INSERT INTO <table>
VALUES (<val_1>, <val_2>,...)

Sometimes, we need to handle the error cases, e.g., the case when the record already exists. The solutions for these cases are, however, database-specific.

INSERT SELECT

If we want to duplicate the records, we can use:

INSERT INTO <table> 
SELECT * FROM <table>  
[WHERE  <condition>]

If we need to fill some columns ourselves:

INSERT INTO <table> (<col_1>, <col_2>,...)
SELECT <col_1>, <any expression> FROM <anything supported by SELECT>
[WHERE  <condition>]

UPDATE

UPDATE query has the following structure:

UPDATE table_name
SET column1 = value1, column2 = value2, ..., columnN = valueN
WHERE <condition>

Unfortunately, the statement is designed to update N records with one set of values, but not to update N records with N sets of values. To do the latter, our only option is to select the values from another table:

UPDATE table_name
SET column1 = other_table.column1  
FROM other_table WHERE other_table.id = table_name.id

Don't forget the WHERE clause here, otherwise, you are matching the whole result set returned by the FROM clause to each row of the table.

Use the table for updating itself

If we are about to update a table using data stored in the table itself, we need to use aliases:

UPDATE nodes_ways new SET way_id = ways.osm_id
FROM nodes_ways old
JOIN ways ON old.way_id = ways.id AND old.area = ways.area
WHERE new.way_id = old.way_id AND new.area = old.area AND new.position = old.position

DELETE

To delete records from a table, we use the DELETE statement:

DELETE FROM <table_name> WHERE <condition>

If some data from another table are required for the selection of the records to be deleted, the syntax varies depending on the database engine.

EXPLAIN


Remarks:

  • to show the actual run times, we need to run EXPLAIN ANALYZE

Nodes


Node example:


->  Parallel Seq Scan on records_mon  (cost=0.00..4053077.22 rows=2074111 width=6) (actual time=398.243..74465.389 rows=7221902 loops=2)
    Filter: ((trip_length_0_1_mi = '0'::double precision) AND (trip_length_1_2_mi = '0'::double precision) AND (trip_length_2_3_mi = '0'::double precision) AND (trip_length_3_4_mi = '0'::double precision) AND (trip_length_4_5_mi = '0'::double precision))
    Rows Removed by Filter: 8639817
    Buffers: shared hit=157989 read=3805864

Description: in the first parentheses, there are estimated values:

  • cost: the estimated cost in arbitrary units. The format is startup cost..total cost, where the startup cost is a flat initialization cost of the node, while the total cost is the estimated cost of the whole node. Average per loop.
  • rows: the expected number of rows produced by this node. Average per loop.
  • width: the width of each row in bytes

In the second parentheses, there are measured results:

  • actual time: the measured execution time in milliseconds. The format is startup time..total time.
  • rows: the real number of rows returned.

The Rows Removed by Filter line indicates the number of rows that were filtered out. The Buffers statistic shows the number of buffers used; each buffer holds 8 KB of data.

Keys

Keys serve as a way of identifying table rows; they are unique. There are many types of keys; see the Database Star article for a terminology overview.

Primary key

The most important keys in ORM are primary keys. Each table should have a single primary key. A primary key has to be non-null.

When choosing a primary key, we can either

  • use a unique combination of database columns: a natural key
  • use an extra column: a surrogate key

If we use a natural key and it is composed of multiple columns, we call it a composite key.

The following table summarizes the advantages and disadvantages of each solution:

| Area | Property | Natural key | Composite key | Surrogate key |
|-|-|-|-|-|
| Usage | SQL joins | easy | hard | easy |
| | Changing natural key columns | hard | hard | easy |
| Performance | Extra space | none | a lot if there are reference tables, otherwise none | one extra column |
| | Space for indexes | normal | extra | normal |
| | Extra insertion time | no | a lot if there are reference tables, otherwise none | yes |
| | Join performance | suboptimal due to sparse values | even more suboptimal due to sparse values | optimal |

From the table above, we can see that using natural keys should be considered only if rows can be identified by a single column and we have strong confidence in that natural id, specifically in its uniqueness and timelessness.

Non-primary (UNIQUE) keys

Sometimes, we need to enforce the uniqueness of a set of columns that does not compose the primary key (e.g., when we use a surrogate key). We can use a non-primary key for that.

One of the differences between primary and non-primary keys is that non-primary keys can be null, and each of the null values is considered unique.
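A minimal sketch (table and column names are illustrative) combining a surrogate primary key with a non-primary (UNIQUE) key:

```sql
CREATE TABLE users (
    id INTEGER PRIMARY KEY,  -- surrogate key
    email TEXT UNIQUE        -- non-primary key; multiple NULLs are still allowed
);
```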

Indices

Indices are essential for speeding up queries containing conditions (including conditional joins).

The basic syntax for creating an index is:

CREATE INDEX <index name> ON <table name>(<column name>);

Show Table Indices

MySQL:

SHOW INDEX FROM <tablename>;

PostgreSQL

SELECT * FROM pg_indexes WHERE tablename = '<tablename>';

Show Indices From All Tables

MySQL:

SELECT DISTINCT TABLE_NAME, INDEX_NAME
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = '<schemaname>';

CREATE TABLE

The syntax is:

CREATE TABLE <TABLE NAME> (
    <COLUMN 1 NAME> <COLUMN 1 TYPE>,
    ...
    <COLUMN N NAME> <COLUMN N TYPE>
    [, PRIMARY KEY (<LIST OF KEY COLUMNS>)]
)

The primary key is optional, but usually, every table should have one. Each row has to have a unique primary key. An index is created automatically for the list of primary key columns.

ALTER TABLE

Add column generated from other columns

SQL:

ALTER TABLE <table name> ADD <column name> AS (<generator expression>);

In MySQL we have to add the data type:

ALTER TABLE <tablename> ADD <columnname> <datatype> AS (<generator expression>);

In PostgreSQL, the syntax is quite different:

ALTER TABLE <tablename> ADD <columnname> <datatype> GENERATED ALWAYS AS (<generator expression>) STORED

The advantage of the Postgres approach is that the column is set as generated, therefore it is generated for the new rows automatically.

Procedures and functions

In SQL, there are two constructs that can encapsulate SQL statements to increase reusability and readability: functions and procedures. The main difference between the two is that functions are intended to be called inline from SQL statements, while procedures cannot be used in SQL statements; instead, they serve as a wrapper around a set of SQL statements to be called repeatedly with different arguments. An extensive summary of the different capabilities of functions and procedures is on SO.

Calling a procedure

The keyword for calling a procedure differs between database systems; refer to the documentation for your system for the right keyword.

Creating a procedure

The syntax for creating a procedure differs between database systems; refer to the documentation for your system for the right syntax.

Parameters

Procedures and functions can have parameters similar to parameters in programming languages. We can use those parameters in the body of a function/procedure in the same way as a column or a constant.

Parameters can have default values, supplied after the = sign. We can test whether a default argument was supplied by testing the parameter for the default value.
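As a hedged sketch, a function with a default parameter can look like this in PostgreSQL (the function name add_vat and its parameters are made up; other systems use different syntax):

```sql
CREATE FUNCTION add_vat(price numeric, rate numeric DEFAULT 0.21)
RETURNS numeric AS $$
    SELECT price * (1 + rate);
$$ LANGUAGE SQL;

SELECT add_vat(100);        -- uses the default rate
SELECT add_vat(100, 0.10);  -- overrides the default
```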

Views

Views are basically named SQL queries stored in the database. The query is run on each view invocation, unless the view is materialized.

The syntax is:

CREATE VIEW <VIEW NAME> AS <QUERY>

Modifying the view

The view can be modified with CREATE OR REPLACE VIEW, however, existing columns cannot be changed. If you need to change existing columns, drop the view first.

Schemas

Schema in SQL is a container or namespace for tables, views, and other database objects. This means that we can have multiple objects with the same name in different schemas.

The SQL schema should not be confused with the database schema, which is a logical structure of the database.

Sadly, the concept of SQL schema is not standardized across different database systems. The following table shows how SQL schema is implemented in different systems:

| Database System | Schema concept | Default schema |
|-|-|-|
| MySQL | Databases are used as schemas | - |
| PostgreSQL | Multiple schemas in a single database, as per the SQL standard | public |
| Oracle | Each user has an associated schema (user = schema) | user name |
| SQL Server | Multiple schemas in a single database, as per the SQL standard | dbo |

Performance Optimization

When the query is slow, first inspect the following checklist:

  • Do not use OR or IN for a set of columns (see replacing OR below).
  • Check that all columns and combinations of columns used in conditions (WHERE) are indexed.
  • Check that all foreign keys are indexed.
  • Check that all joins are simple joins (column to column, or set of columns to a matching set of columns).

If nothing from the above works, try to start with a simple query and add more complex parts to find where the problem is.

If decomposing the query also does not bring light into the problem, refer to either one of the subsections below, or to the external sources.

Also, note that some IDEs limit the number of returned rows automatically, which can hide serious problems and confuse you. Remove the limit when testing the performance this way.

Replacing OR

We can slow down the query significantly using OR or IN statements if the set of available options is not constant (e.g., IN (1, 2) is mostly fine, while IN (origin, destination) can have a drastic performance impact).

To get rid of these disjunctions, we can use the UNION statement, basically duplicating the query. The resulting query will be double in size, but much faster:

SELECT people.id
FROM promo
    JOIN people ON promo.date IN (people.birthday, people.name_day);

-- can be rewritten as

SELECT people.id
FROM promo
    JOIN people ON promo.date = people.birthday
UNION
SELECT people.id
FROM promo
    JOIN people ON promo.date = people.name_day;

A specific join makes the query slow

If a single join makes the query slow, there is a great chance that the index is not used for the join. Even if the table has an index on the referenced column(s), the join can still fail to use it if we are joining not the table itself but:

  • a subquery,
  • a variable created with a WITH statement,
  • a view,
  • or a temporary table

created from the indexed table. You can solve the situation by creating a materialized view or a temporary table instead, and adding the indices to it manually. Specifically, you need to split the query into multiple queries:

  1. delete the materialized view/table if it exists
  2. create the materialized view/table
  3. create the required indices
  4. perform the actual query that utilizes the view/table

Of course, we can skip the first three steps if the materialized view is constant for all queries, which is common during the testing phase.
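The four steps above can be sketched as follows (PostgreSQL-style syntax; all object names are hypothetical):

```sql
DROP MATERIALIZED VIEW IF EXISTS active_users_mv;  -- 1. delete if it exists
CREATE MATERIALIZED VIEW active_users_mv AS        -- 2. create
    SELECT id, name FROM users WHERE active;
CREATE INDEX active_users_mv_id_idx                -- 3. create the indices
    ON active_users_mv (id);
SELECT *                                           -- 4. the actual query
FROM orders
JOIN active_users_mv ON orders.user_id = active_users_mv.id;
```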

Slow DELETE

When a delete is slow, the cause can be a missing index on the child table that references the table we are deleting from.

Other possible causes are listed in the SO answer.
