The Small Data Hadoop System, running on the AppAgile PaaS cloud, is based on the original Apache Hadoop and is intended to enable the development of programs for Hive and Spark. To deploy the Small Volume Hadoop environment on your AppAgile PaaS, check out the:
The Small Volume solution is intended to provide a full Hadoop ecosystem for trials, pilots, project work, or data scientists who need fast access to a Hadoop environment and do not want to spend much time on the infrastructure itself. The Hadoop environment can be scaled out to several TB of data, or transformed into a real big data cluster storing hundreds of PB of data, like our Huge Volume setup.
The Small Volume edition is an excellent place to use our “Datascience Workbench”, an upcoming repository that provides analytic tools, data streams, and services as well as engineering support. It combines the latest innovative tools with Hadoop, NoSQL, NewSQL, or other data pools to identify new trends and correlations; in short, it gives data scientists optimal support for their work. Small Volume is the first step into such an environment and provides you with a fully managed Hadoop service in the cloud.
- Hive is an Apache application that adds data warehouse functions to Hadoop.
- Apache Spark is a cluster computing framework for in-memory data processing and more.
- Spark can be used without Hadoop, but it works well with the distributed HDFS file system. It consists of five components:
  - Spark Core: the basic functions of Spark for distributed in-memory computing. There are standard APIs for Scala, Java (Scala runs on the Java Virtual Machine), Python, and R.
  - Spark SQL: queries DataFrames, which can come from Hive or, for example, from R data frames, using SQL.
  - Spark Streaming: enables data streaming for Spark. It processes data in micro-batches, so it is not a perfect match for very small data packages because of the start/stop overhead of the batches, but it is easy to use.
  - MLlib Machine Learning Library: a function library of machine learning algorithms.
  - GraphX: a framework for graph processing.
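The micro-batch behaviour mentioned for Spark Streaming can be illustrated with a small sketch. This is plain Python, not Spark's API: the function name `micro_batches`, the `(timestamp, value)` event format, and the one-second interval are all assumptions made for illustration. It only shows the idea that records are collected per time window and handed on as one batch, which is why a single small record may wait until its window closes.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-length time windows.

    Hypothetical helper for illustration only; Spark Streaming does this
    internally. An event belongs to the window with index
    floor(timestamp / batch_interval).
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Return non-empty windows in time order as (window_start, values).
    return [(index * batch_interval, batches[index]) for index in sorted(batches)]


# Four events arriving at different times (seconds), batched per second.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
for window_start, values in micro_batches(events, batch_interval=1.0):
    print(window_start, values)
```

In this sketch the event at 0.2 s is only emitted together with the one at 0.9 s, once the first window is complete. That delay is the trade-off of batch-oriented streaming described in the list above.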
The number of slave servers can be defined at start time. The HDFS storage is persistent and replicated. From the Hadoop side, the system is therefore configured like any other large Hadoop system, and programs developed here should run without problems after being transferred to other Hadoop systems, including very large clusters. The metadata is stored in a MySQL database in a dedicated container. An overview of the system is given below. There is always one MySQL pod and one master pod. The number of slave nodes can be defined at startup, as can the optional Hue pod. In the future, additional pods, for example a Kafka pod, may be used to access the system. If one of the pods crashes, it needs to be restarted by hand, but no data is lost because MySQL and the Hadoop directories of the worker nodes are persistent. The following Hadoop applications are installed:
- Spark with the following languages
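To make the pod layout described above concrete, here is a minimal Python sketch that lists the pods for a given configuration. The function name `planned_pods`, the pod names (`mysql`, `master`, `slave-N`, `hue`), and the requirement of at least one slave node are assumptions for illustration; the actual names and limits are defined by the AppAgile deployment.

```python
def planned_pods(slave_count, with_hue=False):
    """Return the pods expected for a Small Volume deployment.

    Hypothetical helper: pod names are illustrative, not the real ones.
    One MySQL pod and one master pod are always present; the number of
    slave nodes and the optional Hue pod are chosen at startup.
    """
    if slave_count < 1:
        # Assumption: a usable cluster needs at least one slave node.
        raise ValueError("at least one slave node is required")
    pods = ["mysql", "master"]
    pods += [f"slave-{i}" for i in range(1, slave_count + 1)]
    if with_hue:
        pods.append("hue")
    return pods


# Example: three slave nodes plus the optional Hue web interface.
print(planned_pods(3, with_hue=True))
```
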
Hue is optional and only necessary if a web browser interface is needed; programming at the shell level without Hue is also possible. All three Spark shells (spark, pyspark, and sparkR) are available; they start the language shells for Scala, Python, and R and connect to the Spark system. The sparksql shell is also available. For security reasons, the usual monitoring through web interfaces, e.g. YARN, is not allowed. Hue is accessed through the HTTP link provided by AppAgile. At first startup, log in as user admin with password admin123. You should change this password after the first login. From Hue you can access the notebooks, an example of which is shown below. All languages described above are supported.