-
Notifications
You must be signed in to change notification settings - Fork 3
3.3 Databases and SQL
Databases are an organized collection of data (e.g. tables, reports, ...) and are extremely useful for large amounts of data, when using plain text files to store all your information is not feasible anymore. In order to access data in such databases one would use a database management system that simplifies the access via rapid search and retrieval of data. There are different types of databases and database management systems (see here if you want to know more). But we will only shortly touch relational databases today.
In relational databases sets of tables are used for data representation. Each two-dimensional table holds different type of data in each column and a record in each row. Redundant data is used to link records in different tables. In the following example the Table 1 holds information about all contigs of our metagenome and the Table 2 holds information about all genes in our metagenome. Each table contains one column which must be specified as the "primary key". This means that the value in this column needs to be unique for each row to be able to distinguish the different records in the table. The second table additionally contains a column that is a so called "foreign key", originating from the primary key of Table 1. This is the column that links the data in Table 1 with the data in Table 2.
Table 1 including contig information:
contig_ID | length | GC |
---|---|---|
1 | 17050 | 55.6 |
2 | 33567 | 45.9 |
3 | 108000 | 33.7 |
4 | 598 | 45.7 |
5 | 7692 | 54.3 |
Table 2 including gene information:
gene_ID | contig_ID | start | stop |
---|---|---|---|
101 | 1 | 3 | 1400 |
102 | 1 | 1600 | 1453 |
103 | 1 | 1789 | 2354 |
104 | 2 | 55 | 589 |
105 | 3 | 1 | 345 |
106 | 4 | 1276 | 2 |
In this course you have already or will be using two programs which are connected to relational databases. These are the genome annotation system GenDB/JCoast and the metagenome visualization tool anvi’o. So each time you run them and information is displayed the programs retrieve data from the database using a database management system like e.g. MySQL or SQLite.
The programming language for interacting with the database management system is called SQL (structured query language). SQL follows a formal syntax and includes commands, functions and variables (of which some are listed below).
- Creating tables in a database (CREATE TABLE)
- Populating tables with data (INSERT)
- Querying data (SELECT)
- Connecting tables (JOIN)
- Deleting data from tables (DELETE)
SELECT statement components:
Component | Purpose |
---|---|
SELECT | determines which columns to include in the query's result set |
FROM | identifies the tables from which to draw data and how the tables should be joined |
WHERE | filters out unwanted data |
ORDER BY | sorts the rows of the final result set by one or more columns |
LIMIT | constrain number of rows in a result set |
SELECT * FROM Table1 WHERE Column1 = 2;
SELECT Column1, Column2 FROM Table1 WHERE Column1 = 2;
SELECT Column1, Column2 FROM Table1 WHERE Column2 = 'something';
Operators in WHERE clause:
Operator | Description |
---|---|
= | equal |
!= | not equal |
> or < | greater or smaller than |
>= or <= | greater or smaller than or equal |
BETWEEN | between an inclusive range |
LIKE | search for a pattern |
For text fields single quotes are required. Numeric fields should not be enclosed in quotes.
JOIN statement:
The JOIN statement is used to combine entries from two or more tables based on a common field between them. There are different types of JOINs. The most commonly used one is the INNER JOIN that returns all of the records in table 1 that have a matching record in table 2. Other types of JOINs are nicely described and visualized here.
SELECT Table1.Column1, Table2.Column2
FROM Table1 JOIN Table2
ON Table1.Column1 = Table2.Column1
WHERE Table2.Column2 = 'xyz'