Johannes Tang Kristensen

Data Science / Data Engineering / Econometrics / R / Python

Building a big data stack: Running SQL queries on big data

Recently, I’ve been experimenting with various big data components and I figured it was about time, that I tried to be a bit more systematic and set-up a proper big data stack on my pc. To begin with I’m only interested in getting a big data SQL engine up and running such that I can query large amounts of data. Hopefully, down the road, I’ll then have time to add more components to the stack.

In short, the aim for now is to get the following up and running:

A few obvious questions to begin with:

Why Minio instead of just using Hadoop/HDFS?

First, there is little benefit in using HDFS when running only a single node, on the contrary it can cause headaches if you are used to just keeping the data you want to analyze on your local file system. If you go with HDFS you would need to move your data into HDFS in which case it is no longer easily accessible from other tools that are not HDFS aware. When using Minio, your data is just stored in a local directory. Note, however, that if you prefer using HDFS an alternative solution could be to mount HDFS using FUSE.

Second, when running things in the cloud I usually use Digital Ocean’s offerings including their S3-compatible Spaces, so I really want to try out Hive/Presto on S3.

Why Presto?

According to their website:

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Which is exactly what I want to do. Presto has connectors for many different systems, but in this case only the Hive connector is of interest as it is needed when data is in HDFS or S3. Hence, we obviously also need to setup Hive.

There are of course many other interesting big data SQL engines, e.g. Impala, Spark SQL, and Drill. For background on these (and more) have a look at this great post.

Prerequisites and quick-start

All the components will be running in Docker containers, so you’ll need Docker installed including Docker Compose. In the following I’ll be using Ubuntu 18.04. See my ultra quick guide to get Docker installed on Ubuntu 18.04. Further, you will need Java:

$ sudo apt install default-jre

I have collected all needed files in a repository on Github, so the first step is to clone that:

$ git clone https://github.com/johannestang/bigdata_stack
Cloning into 'bigdata_stack'...
... lots of output ...
$ cd bigdata_stack
$ cp sample.env .env

Check the configuration in .env:

DATADIR=/data
MAPBOX_API_KEY=
MINIO_ACCESS_KEY=minio
MINIO_SECRET_KEY=minio123

All storage needed for the Docker containers will be in subfolders of DATADIR, make sure it points to a suitable place. If you want to be able to use maps in Superset you need an API key from Mapbox, sign up to their service and put it here. Finally, you can specify keys for Minio.

You can now bring up the stack:

$ docker-compose up -d
... lots of output ...

Before Superset and Hue are ready to use, their databases need to be initialized, and you need to set username/password for Superset. For Hue username/password is set at first login.

$ ./scripts/init-superset.sh
Username [admin]:
User first name [admin]:
User last name [user]:
Email [admin@fab.org]:
Password:
Repeat for confirmation:
... lots of output ...
$ ./scripts/init-hue.sh
... lots of output ...

The stack should now be up and running and the following services available:

Finally, taking the stack down again is as simple as:

$ docker-compose down
... lots of output ...

Taking the stack for a test run

Before trying out the graphical interfaces let’s verify that everything is working from the command-line. We start with Hive. We can start its command-line interface, Beeline, inside the running Hive container:

$ ./scripts/beeline.sh
... lots of output ...
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.4)
Driver: Hive JDBC (version 2.3.4)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.4 by Apache Hive
0: jdbc:hive2://localhost:10000>

The container comes with a dummy dataset we can try to load into Hive and query:

0:> CREATE TABLE pokes (foo INT, bar STRING);
No rows affected (1.444 seconds)
0:> LOAD DATA LOCAL INPATH '/opt/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
No rows affected (1.374 seconds)
0:> SELECT * FROM pokes LIMIT 1;
+------------+------------+
| pokes.foo  | pokes.bar  |
+------------+------------+
| 238        | val_238    |
+------------+------------+
1 row selected (1.586 seconds)

Success! If you now open the web-interface for the namenode you will be able to see that the dataset is now stored in the HDFS volume:

HDFS browser on namenode

Since the table has now been created in the Hive meta-store we can also query it from Presto. Start the Presto command-line interface:

$ ./scripts/presto-cli.sh
presto:default> SELECT * FROM pokes LIMIT 1;
 foo |   bar
-----+---------
 238 | val_238
(1 row)

Query 20190119_133502_00003_h8xt6, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:01 [500 rows, 5.68KB] [354 rows/s, 4.02KB/s]

presto:default>

Note, that if you open the Presto web-interface you can monitor the submitted queries.

Data in S3

In order to test the setup against our S3 storage in Minio we need some data, Let’s use the Iris dataset. Either download the dataset and upload it to Minio through the web-interface (remember to create a bucket first), or do it via the command-line:

$ sudo mkdir -p /data/minio/data/datasets/iris
$ cd /data/minio/data/datasets/iris
$ sudo wget https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
$ cd -

Notice, that the first level of directories define the buckets, hence we have created a bucket called “datasets”. By going to the web-interface we can verify that the data is indeed in the bucket:

Minio Iris

In order to be able to query the data we need to map it to a table in Hive, we do that by creating an external table. Start Beeline again using ./scripts/beeline.sh and run the following query:

CREATE EXTERNAL TABLE iris (sepal_length DECIMAL, sepal_width DECIMAL, 
petal_length DECIMAL, petal_width DECIMAL, species STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://datasets/iris/'
TBLPROPERTIES ("skip.header.line.count"="1");

Notice, that we need to tell Hive to skip the first line of the file as that line contains column headers. Using either Hive or, in this case, Presto, it is now possible to query the data:

presto:default> SELECT * FROM iris LIMIT 1;
 sepal_length | sepal_width | petal_length | petal_width | species
--------------+-------------+--------------+-------------+---------
 5            | 4           | 1            | 0           | setosa
(1 row)

Query 20190119_160716_00011_h8xt6, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [150 rows, 3.63KB] [484 rows/s, 11.7KB/s]

Superset and Hue

Superset and Hue provide two convenient, but quite different, web-interfaces for querying data. Describing them in detail is beyond the scope of this post, instead I will briefly show their query interfaces.

Hue

When logging in to Hue we immediately see that the previously created tables are available for querying:

Hue Hive

Note, however, that the default choice in Hue is Hive (as seen just below the Query dropdown). If instead, we want to use the Presto engine, it must be selected in the Query dropdown:

Hue Presto

One other noteworthy feature of Hue is its File Browser which allows you to access the HDFS volume and is far superior to the web-interface that comes with the Hadoop namenode.

Superset

For Superset I have only configured Presto. In order to query the data, the SQL Lab can be used. Just select “Presto” as database and “default” as schema, then the tables are available:

Superset

Where’s the big data?

Obviously, the data used in the examples above do not qualify as being “big data”, but in principle, as long as your data is in a format Hive/Presto understands, moving towards big data is only a matter of putting bigger/more files into the S3 bucket. In practice, however, there are of course challenges which I will hopefully get a chance to touch upon in future posts.

Stacking all the components

If you’re happy with what you’ve seen so far, then there might not be any reason to read on. But if you’re interesting in the details of the stack (maybe you want to modify it to better suit your needs) then the following tries to explain some of the details.

Hadoop/Hive

There are many different Hadoop Docker images available. I ended up using images from the Big Data Europe project as they are quite simple (hence easy to modify) but still cover many different components. Basically their Hive stack appeared usable out of the box, and as long as I kept my data in HDFS it was working just fine. But trying to access data in S3 turned out to be a challenge.

I assumed that all that had to be done was to add the AWS S3 SDK jars to the Docker image, as explained here. It turns out that you have to be very careful about matching versions when choosing the jars. See discussion here. As Big Data Europe’s images used Hadoop 2.7.4, apparently the AWS SDK needed to be 1.7.4. Adding that version of the SDK appeared to work, but it turned out that one important option had not been implemented in that version: fs.s3a.path.style.access

Why is fs.s3a.path.style.access important? Well, it is not if you use AWS S3, but Minio does not support dns-style bucket naming, instead buckets are accessed as sub-directories, thus without this option the stack would not work with Minio. As it was not possible to use a newer version of the AWS SDK without updating Hadoop, that is what I had to do. I decided to “only” update to 2.8 instead of 2.9 (or 3) hoping it would be easier. Luckily it turned out to be quite straight-forward after a bit of trial-and-error while determining which jars from the AWS SDK to add. While at it, I also updated Hive. So for the Hive image, the crucial difference compared to the base image is the following:

ENV HADOOP_VERSION 2.8.5
ENV HIVE_VERSION 2.3.4
ENV AWS_VERSION=1.10.75.2
RUN curl -L http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/${AWS_VERSION}/aws-java-sdk-${AWS_VERSION}.jar -o /opt/hive/lib/aws-java-sdk.jar && \
    curl -L http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-core/${AWS_VERSION}/aws-java-sdk-core-${AWS_VERSION}.jar -o /opt/hive/lib/aws-java-sdk-core.jar && \
    curl -L http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/${AWS_VERSION}/aws-java-sdk-s3-${AWS_VERSION}.jar -o /opt/hive/lib/aws-java-sdk-s3.jar && \
    curl -L http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -o /opt/hive/lib/hadoop-aws.jar

Plus the following additions in the hadoop-hive.env configuration file:

HDFS_CONF_fs_s3a_endpoint=http://minio:9000
HDFS_CONF_fs_s3a_path_style_access=true
HDFS_CONF_fs_s3a_impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Note, keys for Minio are set in docker-compose.yml which in turn gets them from .env, e.g.:

hive-server:
  image: johannestang/hive:2.3.4-postgresql-metastore-s3
  restart: always
  env_file:
    - ./config/hadoop-hive.env
  environment:
    HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
    SERVICE_PRECONDITION: "hive-metastore:9083"
    HDFS_CONF_fs_s3a_access_key: ${MINIO_ACCESS_KEY}
    HDFS_CONF_fs_s3a_secret_key: ${MINIO_SECRET_KEY}
  ports:
    - "10000:10000"
    - "10002:10002"

Presto

The Hive stack from Big Data Europe actually already contained Presto. However, the image they use, shawnzhu/prestodb, is, at the time of writing, based on a 1 year old version of Presto, so I decided to update the image.

Further, I wanted to be able to provide the Minio keys using environment variables as for Hive above. However, as discussed in this post it is not possible to get Presto to read the keys from environment variables. So instead I added the following default configuration to etc/catalog/hive.properties:

hive.s3.aws-access-key=accesskey123
hive.s3.aws-secret-key=secretkey123
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false

and modified the start-up script such that keys, and endpoint, are updated if specified in environment variables. See the container in docker-compose.yml:

presto-coordinator:
  image: johannestang/prestodb:0.215
  restart: always
  ports:
    - "8080:8080"
  environment:
    S3_ACCESS_KEY: ${MINIO_ACCESS_KEY}
    S3_SECRET_KEY: ${MINIO_SECRET_KEY}
    S3_ENDPOINT: "http://minio:9000"

Superset

Superset was quite easy to add to the stack. I used the amancevice/superset image, see here for details on the setup. The only thing that needed to be done was to add Presto as a data source. It can be done through the web-interface, but the init-superset.sh script used above does it automatically by adding the following configuration, add-presto.yml, to the running container:

databases:
- database_name: Presto
  expose_in_sqllab: true
  sqlalchemy_uri: presto://presto-coordinator:8080/hive/default
  tables: []

and running:

docker-compose exec superset superset import_datasources -p /tmp/add-presto.yml

Hue

Hue turned out to be more difficult to get working. First, it was a matter of getting the configuration right. The configuration is in config/pseudo-distributed.ini (with the original “unconfigured” version in config/pseudo-distributed.ini.tmpl for comparison) which was changes in the following ways:

To begin with we need to tell Hue where the Hadoop and Hive instances are:

default_hdfs_superuser=root
fs_defaultfs=hdfs://namenode:8020
webhdfs_url=http://namenode:50070/webhdfs/v1
hive_server_host=hive-server

and which database it can use for storage (see docker-compose.yml for details on the database container):

[[database]]
  engine=postgresql_psycopg2
  host=hue-postgres
  port=5432
  user=hue
  password=hue
  name=hue

I then removed the parts of Hue not needed:

app_blacklist=search,hbase,impala,jobbrowser,jobsub,oozie,pig,security,spark,sqoop,zookeeper

While not strictly necessary it clears up the interface, and in particular search caused errors since it was not configured. Finally, Presto was added as a data source (and all other data sources were commented out):

[[[presto]]]
  name=PrestoDB
  interface=jdbc
  options='{"url": "jdbc:presto://presto-coordinator:8080/hive/default", 
            "driver": "com.facebook.presto.jdbc.PrestoDriver",
            "user":"", "password":""}'

For all of that to work, a few modifications to the Dockerfile were needed. Specifically, Presto JDBC and Postgres Python drivers were added:

ENV PRESTO_VERSION=0.215
ENV HUE_VERSION=4.3.0
RUN curl -L https://repo1.maven.org/maven2/com/facebook/presto/presto-jdbc/${PRESTO_VERSION}/presto-jdbc-${PRESTO_VERSION}.jar -o /usr/lib/jvm/java-8-oracle/jre/lib/ext/presto-jdbc.jar
RUN /bin/bash -c "source ./build/env/bin/activate && pip install psycopg2-binary && deactivate"

At this point, the Hive connection was working fine, but when trying to access tables through Presto the following error appeared:

Hue Bug

Thankfully, Hue is open source, and after a quick look at the source code it turned out that the JDBC interface expects the database to have a table_comment column in the tables table in the information_schema schema (and likewise a column_comment column in columns), but as we can see, this is not the case for Presto:

presto:default> DESC information_schema.tables;
    Column     |  Type   | Extra | Comment
---------------+---------+-------+---------
 table_catalog | varchar |       |
 table_schema  | varchar |       |
 table_name    | varchar |       |
 table_type    | varchar |       |
(4 rows)

Query 20190123_170247_00016_s4hbn, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:00 [4 rows, 325B] [21 rows/s, 1.68KB/s]

I’m not sure where this incompatibility stems from (if you do, I’d love to hear about it), and I didn’t want to spend alot of time investigating it, so I simply made a quick-fix which ignores those columns and patched the code in the Dockerfile:

COPY jdbc.py.diff /hue/desktop/libs/notebook/src/notebook/connectors/jdbc.py.diff
RUN cd /hue/desktop/libs/notebook/src/notebook/connectors && \
    patch < jdbc.py.diff

Final remarks

Helper scripts

The scripts directory contains some helper scripts. Some of which were used above:

  • beeline.sh: Lauch Beeline, the Hive CLI.
  • presto-cli.sh: Launch the Presto CLI.
  • init-superset.sh: Initialize the Superset database and setup Presto as data source.
  • init-hue.sh: Initialize the Hue database.

But also hadoop-client.sh which creates a temporary container with Hadoop command-line utilities and logs into it. The host filesystem is mounted as /host so it is useful it you need to e.g. move files into HDFS from the command line.

Hive on MapReduce

So far the queries have been trivial, but if we try to do a simple aggregation using Hive the following happens:

0: jdbc:hive2://localhost:10000> SELECT COUNT(*) FROM iris;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. 
Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------+
| _c0  |
+------+
| 150  |
+------+
1 row selected (1.526 seconds)

We get a deprecation warning! As my focus has been on Presto I have not looked into changing the Hive engine, but it could definitely be interesting to do at a later point.