Design Doc: Large-Scale Data Processing
-rbruns
Overview
Spark is a framework that makes it easy to parallelize many computations across many compute resources. Spark can run on top of Mesos, a framework that makes allocating compute resources very easy. Both of these frameworks can coordinate with Zookeeper to provide service discovery and high availability. This design discusses using Spark on top of Mesos, coordinated through Zookeeper.
Business Needs
Imeh and Eduardo need to process a ton of data as quickly and efficiently as possible.
Design
We have a lot of compute power. During peak traffic hours, most machines are pretty well utilized, but during “off hours”, many of those machines sit idle with plenty of compute time going to waste.
With our current infrastructure, adding new machines to run a single dedicated task is becoming expensive and inefficient. To meet our current business needs, and unknown future ones, we must be much more flexible.
One framework, Spark, allows a developer to spread expensive computations across worker nodes in a Spark cluster. This is mandatory if we want to reach our data processing goals within any reasonable amount of time.
However, Spark by itself is not a very flexible framework. It is possible to create a single Spark cluster, but resizing it dynamically is not trivial. Also, should we ever want to run a second framework like Storm, it would require setting up a second cluster.
Apache Mesos is a generalized compute cluster framework that other frameworks like Spark, Storm, and many others including Hadoop can piggyback on to allocate compute resources. A Mesos cluster is easy to expand and contract: compute resources can be added as machines become idle and removed when they are needed elsewhere.
Terminology
This section is best quoted directly from the Mesos white paper.
Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves.
The master implements fine-grained sharing across frameworks using resource offers. Each resource offer contains a list of free resources on multiple slaves. The master decides how many resources to offer to each framework according to a given organizational policy, such as fair sharing, or strict priority. To support a diverse set of policies, the master employs a modular architecture that makes it easy to add new allocation modules via a plugin mechanism. To make the master fault tolerant we use ZooKeeper…to implement the failover mechanism…
A framework running on top of Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework’s tasks. While the master determines how many resources are offered to each framework, the frameworks’ schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes to Mesos a description of the tasks it wants to run on them. In turn, Mesos launches the tasks on the corresponding slaves.
Mesos
A review of the Mesos architecture.
Assumptions
These are some of the assumptions I have made when setting up Mesos.
Service Discovery
If you want Mesos master servers to run with any sort of high availability, Zookeeper is required. It also simplifies configuration, because you can point every framework at Zookeeper to find the Mesos cluster. That said, it is possible to run Mesos without Zookeeper. Mesos only operates with a single active master server; there is no sharding or clustering capability within Mesos itself. However, HA could be implemented with clever use of HAProxy and DNS switching.
Name Discovery
Some form of name resolution must exist, either through DNS or /etc/hosts files. This is especially important when you wish to navigate the Mesos web UI or use Spark. Most frameworks (Spark included) require a hostname, or an IP that resolves to a hostname, and no hostname may resolve to 127.0.0.1.
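For example, a minimal /etc/hosts shared by every node might look like the following (the addresses are placeholders; the important part is that no machine resolves its own hostname to 127.0.0.1):
10.0.0.11 ctest1
10.0.0.12 ctest2
10.0.0.13 ctest3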
Choice of Distro
To build Mesos from source, you need a Linux distribution that ships with a GCC that implements C++11. There is also a list of dependencies; the best enumeration of these is the Ubuntu dependency list in the Mesos Dockerfile. All of them have analogues on CentOS and Fedora. Precompiled packages for Ubuntu and CentOS (which is what I used) are available from Mesosphere.
Installed Services and Software
All machines have Docker and some version of Java installed, with JAVA_HOME either provided at boot time or Java installed to a standard location.
IPTables
All machines have iptables completely flushed, allowing all traffic.
The reason for this is that many ports are randomly assigned, especially in Spark. Spark can be configured to use particular ports, but the ports the Mesos slaves use for each task are still randomly assigned. Spark will also make the jars required for a task available to all the slaves via HTTP, which is just another port that needs to be open.
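A minimal sketch of flushing iptables and defaulting every chain to accept (on CentOS 7, firewalld manages the rules, so it is stopped first; loosen this as appropriate for your network):
sudo systemctl stop firewalld
sudo systemctl disable firewalld
sudo iptables -F
sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT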
Installation
A first draft of installing Mesos and configuring Spark.
Mesos Master
To install Mesos, the assumption made was that we are using a Linux distribution that has a pre-compiled package available from Mesosphere, in particular CentOS.
First set up the repository.
sudo rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm
Then install mesos.
sudo yum install -y mesos
We can run the Mesos master directly by supplying command line arguments, which looks something like:
mesos-master --zk=zk://ctest1:2181,ctest2:2181,ctest3:2181/mesos \
--work_dir=/var/lib/mesos --quorum=1 --log_dir=/var/log/mesos \
--port=5050
To configure mesos via config files instead, we need to tell it where to find Zookeeper.
Create a file /etc/mesos/zk with the contents:
zk://ctest1:2181,ctest2:2181,ctest3:2181/mesos
Replace with the appropriate Zookeeper servers.
There is a second option, /etc/mesos-master/quorum. This number must be configured correctly to ensure the master log is replicated to backup servers. If there are n Mesos master servers, then quorum must be set to n/2 + 1 (integer division), e.g. if there are 3 Mesos masters, then quorum must be 2. The default value is 1.
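Putting this together, a sketch of configuring and starting a master with the Mesosphere package (the ctest hostnames are our placeholders; mesos-master is the systemd unit the package installs):
echo 'zk://ctest1:2181,ctest2:2181,ctest3:2181/mesos' | sudo tee /etc/mesos/zk
# 3 masters: quorum = 3/2 + 1 = 2
echo 2 | sudo tee /etc/mesos-master/quorum
sudo systemctl enable mesos-master
sudo systemctl start mesos-master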
Mesos Slave
The install process is identical to that of the master. We will make one extra configuration change: setting the containerizers flag to docker,mesos.
These Mesos slaves will also be running Spark jobs, so they have some extra required software. They must have Java installed (or a JAVA_HOME set), as well as Docker.
sudo yum install -y docker java-1.7.0-openjdk-devel
Ensure Docker is running.
systemctl enable docker
systemctl start docker
Then we can run a Mesos slave directly with:
mesos-slave --master=zk://ctest1:2181,ctest2:2181,ctest3:2181/mesos \
--containerizers=docker,mesos --log_dir=/var/log/mesos \
--work_dir=/var/run/mesos
Otherwise, if you installed the CentOS package, the slave can be configured similarly to the master using config files and managed by systemd. This is done by writing into the file /etc/mesos-slave/containerizers:
docker,mesos
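For example, in one step:
echo 'docker,mesos' | sudo tee /etc/mesos-slave/containerizers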
Now we enable and start mesos-slave (Docker was already enabled above).
systemctl enable mesos-slave
systemctl start mesos-slave
Spark
Download a Spark binary tarball. There are many prebuilt tarballs, and any of them will work. This will be required to run the Spark driver, which launches the Spark context and runs tasks on the Mesos slaves.
Spark jobs must be submitted locally, from the machine acting as the Spark driver.
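A sketch of fetching and unpacking a prebuilt tarball on the driver machine (the version and mirror URL are placeholders; any prebuilt tarball will do):
wget http://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.4.tgz
tar xzf spark-1.3.1-bin-hadoop2.4.tgz
cd spark-1.3.1-bin-hadoop2.4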
Inside the extracted tarball, there are two files that need to be configured inside the conf directory: spark-defaults.conf and spark-env.sh.
Inside spark-defaults.conf, we need to add options telling the Mesos slaves where to find the master and where to download a Spark binary tarball.
spark.master mesos://zk://ctest1:2181,ctest2:2181,ctest3:2181/mesos
spark.executor.uri http://path/to/same-tarball-downloaded.tgz
Then spark-env.sh needs to export an environment variable telling Spark where it can find the Mesos native library.
export MESOS_NATIVE_LIBRARY=/usr/lib/libmesos.so
Now this should be enough to submit a sample job to the Mesos cluster.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
lib/spark-examples*.jar 10
Spark Coarse-Grained Mode on Mesos
Inside the spark-defaults.conf file, you can set an option for coarse-grained mode in Spark. This will have Spark use all CPUs offered by Mesos immediately. By default, Spark runs in fine-grained mode, where each Spark task is a single Mesos task; resources get shared fairly, but with a little overhead. This is explained more in the Spark documentation.
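For example (spark.mesos.coarse is the property named in the Spark-on-Mesos documentation; verify it against the Spark version in use):
spark.mesos.coarse true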
Availability and Resilience
The Mesos white paper best describes how Mesos master availability is maintained:
Since all the frameworks depend on the master, it is critical to make the master fault-tolerant. To achieve this we use two techniques. First, we have designed the master to be soft state, i.e., the master can reconstruct completely its internal state from the periodic messages it gets from the slaves, and from the framework schedulers. Second, we have implemented a hot-standby design, where the master is shadowed by several backups that are ready to take over when the master fails. Upon master failure, we use ZooKeeper to select a new master from the existing backups, and direct all slaves and framework schedulers to this master. Subsequently, the new master will reconstruct the internal state from the messages it receives from slaves and framework schedulers.
Reconstructing the state depends on our choice of the value for quorum. To be able to fully reconstruct the state, quorum must be larger than half the number of master servers.
The Mesos slaves are also resilient in that they’re each expendable. If a Mesos slave dies while running a task, it gets marked as “Lost” in the Mesos master and that task will eventually get rescheduled to run on another machine.
Zookeeper
Since we are using Zookeeper to coordinate everything, it is imperative that it is running; otherwise slaves will not know who the leading master is, and so they will not get any tasks to run.
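As a quick liveness check, each Zookeeper node can be probed with the standard ruok four-letter command (a healthy node answers imok; the hostnames are our placeholder ctest machines):
for host in ctest1 ctest2 ctest3; do echo ruok | nc $host 2181; echo; done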
Open Questions and Problems
Mesos is very powerful: it provides ways to run not only computational frameworks like Spark and Hadoop, but also general tasks and even services inside Docker containers. This means you can use Mesos to coordinate running services like Cassandra, Redis, and even web application servers like Tomcat.
There is also a framework called Chronos, which essentially lets you run distributed cron jobs.
Spark Rejecting Resources
Spark will reject offers from Mesos if the number of CPUs a Spark task requires (by default 1) is more than half the CPUs available on a Mesos slave; in other words, an offer is only accepted when a slave has at least twice the CPUs a task needs. This was not clearly documented, but it should not be an issue as long as the Mesos slaves have at least 2 CPUs. Just good to keep in mind.
Slave Recovery
Still an open problem. For especially long jobs, slave recovery from accidental Mesos termination may become very important. Mesos has a plan for slave recovery, and it can be configured. However, a Mesos slave can only recover if the executor and its tasks kept running; a server reboot, for example, would not be recovered from.
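A sketch of opting a slave into recovery using the same config-file convention as above (the recover flag is described in the Mesos slave-recovery documentation; verify it against the installed version, and note that frameworks must also enable checkpointing in their FrameworkInfo for their tasks to be recoverable):
echo 'reconnect' | sudo tee /etc/mesos-slave/recover
sudo systemctl restart mesos-slave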
Networking
If running services over Mesos, one problem is handling NAT or any other networking concerns. Some frameworks exist to help tackle this, like Flannel (part of CoreOS); there is also Kubernetes, and in the near future Mesosphere DCOS.
Will Mesos be around?
There has been an explosion of frameworks that run on top of Mesos; the problem is figuring out which of them will persist and still be in use one year from now, two years from now, or beyond.
Non-root Users
To use an unprivileged (non-root) user, e.g. mesos, there are a few considerations that need to be made. An open problem is determining the best method of handling this situation. At one point, a Spark driver with no user configuration would, after submitting a job to Mesos, see all of its jobs fail because the slaves attempted to run them as the same user who submitted the job initially, e.g. rbruns. Not every machine had this user, so those jobs failed.
So throughout most of this testing, I made sure the root user existed on every machine and that was what was used. To use another user, we would just need to ensure all machines have the same user and that all Spark jobs are submitted as that user.
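A sketch of the same-user-everywhere approach (the mesos username here is a placeholder; the useradd must run on every slave as well as on the driver machine):
sudo useradd --system --create-home mesos
# then, on the driver, submit jobs as that user
sudo -u mesos ./bin/spark-submit --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10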
Logging
Many tasks will be running on randomly assigned servers. Mesos provides a rudimentary method of viewing their logs, but a more robust system of remote logging would be preferred.
Shared Storage
Currently, a Spark job cannot reference arbitrary local files. If a Spark job references a local file, every Mesos slave must have the same file at the same path specified in the job. This is not very practical, so using some sort of shared storage is preferred.
There is an abundance of ways to accomplish this, though the preferred methods seem to be HDFS or S3. Luckily, there is an easy way to set up HDFS over Mesos.
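For example, once shared storage is in place, both the application jar and its input can be referenced by globally visible URLs instead of local paths (the class name, namenode host, and paths below are placeholders):
./bin/spark-submit --class com.example.WordCount \
hdfs://namenode:8020/jars/wordcount.jar \
hdfs://namenode:8020/data/input.txt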