After deployment of the small-volume Hadoop system, you should see a number of Pods in your environment:
Programmers and data scientists can go directly to the shell and start one of the Spark or Hive shells:
If you prefer not to use the shell, you can use Hue, which is already set up in the environment. For the initial login use:
- username: admin
- password: admin123
Internet Explorer is NOT supported by the frontend; use Firefox, Chrome, or a similar browser.
With Hue it is easy to upload/ingest data files into Hadoop: just click Upload:
For your first try, upload yellow.txt.gz.
Select the 700 KB file. It contains 30,000 Yellow Cab records from New York (June 2016), a sample of the full 11.1 million records, which amount to about 300 MB of data.
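If you want to sanity-check the sample before uploading it, you can count the rows in the gzipped file locally. This is just a convenience sketch; the filename and the expected 30,000-row count are taken from this guide:

```python
import gzip

def count_rows(path):
    """Count the records in a gzipped text file (one record per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

# For the sample file from this guide (it should report 30000):
# print(count_rows("yellow.txt.gz"))
```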
After uploading, the file will be transferred into HDFS on the slave nodes and replicated.
Now you can ingest data into Hive.
Click on Metadata Manager and select the Default database. The database should be empty. Now click the blue paper-style icon with the + in it at the top right:
Fill out the fields as shown below. Important: if you do not change this to leave the table empty, you will get an error later, because Hue currently has problems ingesting gz files directly. Click on „weiter“ (Next).
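Because Hue currently has trouble ingesting gz files directly, one possible workaround (my own suggestion, not part of the official Hue workflow) is to decompress the file locally and upload the plain-text version instead:

```python
import gzip
import shutil

def decompress(src_path, dst_path):
    """Unpack a .gz file so the plain-text version can be uploaded instead."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Usage (filenames from this guide):
# decompress("yellow.txt.gz", "yellow.txt")
```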
To load data into the table, click the icon marked blue at the top right corner. Select the yellow.txt.gz file again; it will then disappear from your directory. If you want a stats update, click the icon to the right of STATS; it should now show 30,000 rows. This takes about 10-20 seconds.
I created an example notebook that uses different techniques.
It can be uploaded as a JSON file (see example).
In Notebooks, go to the top right and click the icon with the two tags. The upload works like the upload of the gz file.
After this you can select the notebook as CeBIT example v1.
If you select it, the computations run automatically, except for the fourth example.
You will see:
1. A simple Hive SELECT statement that counts the lines; it should return 30,000.
2. A more complex Hive statement that selects the Yellow Cab vendors and counts how many trips each of them had. You can also see that there are errors in the Yellow Cab data, because some lines have no taxi vendor.
3. A Spark SQL statement that selects the pickup longitude and latitude plus the number of passengers and shows them on a map of New York. If you select a point, you see the number of passengers. Only 100 data points are shown on the map; with all 30,000 you would see nothing but blue markers.
4. A SparkR script that creates a Spark DataFrame from the Hive table, filters out NULL values, and calculates the mean fare amount by payment type. To print the result nicely, the last step sorts it by payment type.
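The notebook itself runs these queries in Hive, Spark SQL, and SparkR. To illustrate the query shapes in a self-contained way, here is a sketch using Python's sqlite3 on a few synthetic rows (not Hive, and not the notebook's code); the table and column names (yellow, vendor_id, fare_amount, payment_type) are my assumptions, not the real schema, and the map example (3) is omitted because it depends on the notebook's plotting widget:

```python
import sqlite3

# Synthetic stand-in for the Hive table from this guide; the column
# names are assumptions, not the notebook's actual schema.
rows = [
    ("CMT", 12.5, "CASH"),
    ("CMT", 7.0, "CREDIT"),
    ("VTS", 9.5, "CASH"),
    (None,  5.0, "CASH"),   # a record with no taxi vendor, as mentioned above
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yellow (vendor_id TEXT, fare_amount REAL, payment_type TEXT)")
con.executemany("INSERT INTO yellow VALUES (?, ?, ?)", rows)

# Example 1: count all lines (the notebook expects 30,000 on the real table).
print(con.execute("SELECT COUNT(*) FROM yellow").fetchone()[0])  # prints 4 here

# Example 2: trips per vendor; the NULL vendor appears as its own group,
# which is how the errors in the data become visible.
print(con.execute(
    "SELECT vendor_id, COUNT(*) FROM yellow GROUP BY vendor_id"
).fetchall())

# Example 4: mean fare amount by payment type, NULL vendors filtered out,
# sorted by payment type for readable output (what the SparkR script does).
print(con.execute(
    "SELECT payment_type, AVG(fare_amount) FROM yellow "
    "WHERE vendor_id IS NOT NULL "
    "GROUP BY payment_type ORDER BY payment_type"
).fetchall())
```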