sql_dialect
This page introduces the reader to the SQL dialect supported by HRDBMS.
For starters, you probably want to create some tables.
CREATE [COLUMN] TABLE schema.table(col def, col def, ..., [PRIMARY KEY(col, col,...)]) [COLORDER(int, int, ...)] [ORGANIZATION(int, int, ...)] groupExp nodeExp devExp
Syntax in [] is optional. Column definitions are pretty much standard, as with any database. Primary keys can be defined as part of a column definition or as a separate PRIMARY KEY clause. If you use the COLORDER clause, it takes a list of integers, 1 through the number of columns in the table. The list specifies the order in which to physically store the columns, and it only makes sense for a column table. For example, COLORDER(2,3,1) says store the second column first, followed by the third column, and then the first column.
If you use the ORGANIZATION clause, it takes a list of integers as well. Again, the integers represent columns from the column definition list, with the first column having a value of 1. The ORGANIZATION clause specifies a sort order for how data should be loaded into the table. For example, ORGANIZATION(3,2) says that when loading data, sort it first by column 3 and then by column 2, and load it in that order.
The groupExp, nodeExp, and devExp clauses, explained next, tell HRDBMS how to distribute the data across the cluster. They use a rather expressive language that can result in quite a number of possibilities. I really need to write a separate document explaining this language. For now, I will just explain how to do the most basic distribution; a complete example appears after the devExp discussion below.
For groupExp, just use the value NONE.
For nodeExp, use: all,hash,{col|col|...}
You can use one or more columns, although one column is usually best if you can find one that will work. The column or columns you choose will be hashed to determine which node each row will be placed on. Multiple columns are separated with a pipe symbol, |.
Alternatively, for small tables, you can have them distributed in full to all nodes by using ANY for the nodeExp.
For devExp, use: all,hash,{col|col|...} This hash determines which device on the node the row is placed on. Never use the same set of columns for this expression that you used for nodeExp. It is OK, however, if nodeExp is a subset of devExp or vice versa.
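Putting it all together, here is a sketch of a complete statement under those rules. The TEST schema, the SALES table, its columns, and the INT type are hypothetical names for illustration, not anything HRDBMS ships with:
CREATE COLUMN TABLE TEST.SALES(ORDERID INT, CUSTID INT, AMOUNT INT, PRIMARY KEY(ORDERID)) COLORDER(2,1,3) ORGANIZATION(2) NONE all,hash,{ORDERID} all,hash,{CUSTID}
This stores CUSTID physically first, sorts incoming loads by CUSTID, hashes ORDERID to choose the node for each row, and hashes CUSTID to choose the device on that node.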
HRDBMS automatically builds indexes for you to support primary key constraints. While it is true that indexes can help (and help a good deal) for certain queries, that situation is not all that common for complex queries. So rather than creating whatever indexes you think make sense, let HRDBMS tell you!
Start with no indexes other than those for the primary keys and run your SQL statements. Check the debug log (hrdbms.log) in the install directory on the coordinators to find the coordinator that actually did the optimization for the queries. If the optimizer wanted to use an index but one did not exist, it will tell you in the debug log! It will tell you what columns it wanted an index on. Very cool.
If you need to create an index, the syntax is:
CREATE [UNIQUE] INDEX name on schema.table(col ASC|DESC, col ASC|DESC, ...)
Note that the index name does not have a schema portion. The schema of the index is always the schema of the table. Personally, I wouldn't bother with unique indexes except for the primary key indexes that HRDBMS creates for you. There is a restriction today that they are only unique within a node anyway. They are not globally unique. There is a design in place to have them be globally unique, but it has not been implemented yet.
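For example, a simple non-unique index on the hypothetical TEST.SALES table from the sketch above might look like:
CREATE INDEX SALES_CUST_IDX on TEST.SALES(CUSTID ASC)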
Now you probably want to load data. Note that there is no difference in performance between creating indexes before you load data or after you load data. Internally, it's the same code that's going to run.
LOAD REPLACE|RESUME INTO schema.table DELIMITER char FROM path
REPLACE will truncate the table before loading. RESUME will append to the table. The data to load needs to exist in files on the coordinator where the LOAD runs. Since coordinators can load balance, you need to make sure you are running on a specific coordinator. The way to do this is addressed below when I talk about using the CLI.
The file needs to be a delimited file, but the delimiter can be any single character. For example, if my delimiter were the pipe symbol, I would say:
DELIMITER | FROM ...
The path clause tells HRDBMS what files contain the data to load. This can be a single file or a wildcard pattern. If it is multiple files, they will be processed in parallel. I typically break my input data up into as many files as I have cores, so I take advantage of parallelism. The path must be surrounded by single quotes. For example,
FROM '/projects/SC14SCC/data/tpch*/dbgen/lineitem.tbl*'
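As a complete sketch, reusing that path and assuming a matching TPCH.LINEITEM table already exists:
LOAD REPLACE INTO TPCH.LINEITEM DELIMITER | FROM '/projects/SC14SCC/data/tpch*/dbgen/lineitem.tbl*'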
After you load data, make sure to collect statistics with...
RUNSTATS on schema.table
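For example, after the hypothetical load above:
RUNSTATS on TPCH.LINEITEM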
INSERT, UPDATE, DELETE, and DROP are all straightforward, except that there are known issues with UPDATE and DELETE right now. It's been a low priority for me, as all my benchmarks for my research papers just do loads and then a bunch of queries.
CREATE VIEW is straightforward, except that right now it doesn't let you give column names in the view definition. The column names are whatever is in the SELECT clause of the defining SELECT statement.
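So if you want specific column names, the workaround is to alias the columns in the defining SELECT itself. A hypothetical sketch, continuing the TEST.SALES example and assuming standard column aliases are supported:
CREATE VIEW TEST.BIG_SALES AS SELECT ORDERID AS ID, AMOUNT AS TOTAL FROM TEST.SALES WHERE AMOUNT > 1000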
All the basic SELECT functionality is there, except outer joins. Outer joins are not yet supported because HRDBMS does not yet support nulls. For now, you need to use an invalid data value, just like you would in a file. Currently, null support is a medium priority for me. Only the most basic scalar functions are there. It's simple to add more, but rather than sitting down and adding 100, I've been adding them as needed. I can't say it's SQL-92 compliant or anything like that, but it does most of what I see people do in the real world. If you need something it doesn't do, let me know. Contact info is at the bottom.
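For instance, if you loaded -1 into AMOUNT as a stand-in for null in the hypothetical TEST.SALES table, you would filter it out explicitly:
SELECT ORDERID, AMOUNT FROM TEST.SALES WHERE AMOUNT <> -1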