Skip to content

3.3 Databases and SQL

mc68462 edited this page Feb 19, 2017 · 1 revision

Introduction

Databases are an organized collection of data (e.g. tables, reports, ...) and are extremely useful for large amounts of data, when using plain text files to store all your information is not feasible anymore. In order to access data in such databases one would use a database management system that simplifies the access via rapid search and retrieval of data. There are different types of databases and database management systems (see here if you want to know more). But we will only shortly touch relational databases today.


Relational databases

In relational databases sets of tables are used for data representation. Each two-dimensional table holds different type of data in each column and a record in each row. Redundant data is used to link records in different tables. In the following example the Table 1 holds information about all contigs of our metagenome and the Table 2 holds information about all genes in our metagenome. Each table contains one column which must be specified as the "primary key". This means that the value in this column needs to be unique for each row to be able to distinguish the different records in the table. The second table additionally contains a column that is a so called "foreign key", originating from the primary key of Table 1. This is the column that links the data in Table 1 with the data in Table 2.

Table 1 including contig information:

contig_ID length GC
1 17050 55.6
2 33567 45.9
3 108000 33.7
4 598 45.7
5 7692 54.3

Table 2 including gene information:

gene_ID contig_ID start stop
101 1 3 1400
102 1 1600 1453
103 1 1789 2354
104 2 55 589
105 3 1 345
106 4 1276 2

In this course you have already or will be using two programs which are connected to relational databases. These are the genome annotation system GenDB/JCoast and the metagenome visualization tool anvi’o. So each time you run them and information is displayed the programs retrieve data from the database using a database management system like e.g. MySQL or SQLite.


SQL

The programming language for interacting with the database management system is called SQL (structured query language). SQL follows a formal syntax and includes commands, functions and variables (of which some are listed below).

  • Creating tables in a database (CREATE TABLE)
  • Populating tables with data (INSERT)
  • Querying data (SELECT)
  • Connecting tables (JOIN)
  • Deleting data from tables (DELETE)

SELECT statement components:

Component Purpose
SELECT determines which columns to include in the query's result set
FROM identifies the tables from which to draw data and how the tables should be joined
WHERE filters out unwanted data
ORDER BY sorts the rows of the final result set by one or more columns
LIMIT constrain number of rows in a result set
SELECT * FROM Table1 WHERE Column1 = 2;
SELECT Column1, Column2 FROM Table1 WHERE Column1 = 2;
SELECT Column1, Column2 FROM Table1 WHERE Column2 = 'something';

Operators in WHERE clause:

Operator Description
= equal
!= not equal
> or < greater or smaller than
>= or <= greater or smaller than or equal
BETWEEN between an inclusive range
LIKE search for a pattern

For text fields single quotes are required. Numeric fields should not be enclosed in quotes.

JOIN statement:

The JOIN statement is used to combine entries from two or more tables based on a common field between them. There are different types of JOINs. The most commonly used one is the INNER JOIN that returns all of the records in table 1 that have a matching record in table 2. Other types of JOINs are nicely described and visualized here.

SELECT Table1.Column1, Table2.Column2 
FROM Table1 JOIN Table2 
ON Table1.Column1 = Table2.Column1 
WHERE Table2.Column2 = 'xyz'