3.3 Databases and SQL

Introduction

Databases are an organized collection of data (e.g. tables, reports, ...) and are extremely useful for large amounts of data, when using plain text files to store all your information is not feasible anymore. In order to access data in such databases one would use a database management system that simplifies the access via rapid search and retrieval of data. There are different types of databases and database management systems (see here if you want to know more). But we will only shortly touch relational databases today.

Relational databases

In relational databases sets of tables are used for data representation. Each two-dimensional table holds different type of data in each column and a record in each row. Redundant data is used to link records in different tables. In the following example the Table 1 holds information about all contigs of our metagenome and the Table 2 holds information about all genes in our metagenome. Each table contains one column which must be specified as the "primary key". This means that the value in this column needs to be unique for each row to be able to distinguish the different records in the table. The second table additionally contains a column that is a so called "foreign key", originating from the primary key of Table 1. This is the column that links the data in Table 1 with the data in Table 2.

Table 1 including contig information:

contig_ID	length	GC
1	17050	55.6
2	33567	45.9
3	108000	33.7
4	598	45.7
5	7692	54.3

Table 2 including gene information:

gene_ID	contig_ID	start	stop
101	1	3	1400
102	1	1600	1453
103	1	1789	2354
104	2	55	589
105	3	1	345
106	4	1276	2

In this course you have already or will be using two programs which are connected to relational databases. These are the genome annotation system GenDB/JCoast and the metagenome visualization tool anvi’o. So each time you run them and information is displayed the programs retrieve data from the database using a database management system like e.g. MySQL or SQLite.

SQL

The programming language for interacting with the database management system is called SQL (structured query language). SQL follows a formal syntax and includes commands, functions and variables (of which some are listed below).

Creating tables in a database (CREATE TABLE)
Populating tables with data (INSERT)
Querying data (SELECT)
Connecting tables (JOIN)
Deleting data from tables (DELETE)

SELECT statement components:

Component	Purpose
SELECT	determines which columns to include in the query's result set
FROM	identifies the tables from which to draw data and how the tables should be joined
WHERE	filters out unwanted data
ORDER BY	sorts the rows of the final result set by one or more columns
LIMIT	constrain number of rows in a result set

SELECT * FROM Table1 WHERE Column1 = 2;
SELECT Column1, Column2 FROM Table1 WHERE Column1 = 2;
SELECT Column1, Column2 FROM Table1 WHERE Column2 = 'something';

Operators in WHERE clause:

Operator	Description
=	equal
!=	not equal
> or <	greater or smaller than
>= or <=	greater or smaller than or equal
BETWEEN	between an inclusive range
LIKE	search for a pattern

For text fields single quotes are required. Numeric fields should not be enclosed in quotes.

JOIN statement:

The JOIN statement is used to combine entries from two or more tables based on a common field between them. There are different types of JOINs. The most commonly used one is the INNER JOIN that returns all of the records in table 1 that have a matching record in table 2. Other types of JOINs are nicely described and visualized here.

SELECT Table1.Column1, Table2.Column2 
FROM Table1 JOIN Table2 
ON Table1.Column1 = Table2.Column1 
WHERE Table2.Column2 = 'xyz'

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3.3 Databases and SQL

Introduction

Relational databases

SQL

Clone this wiki locally