
Big Data

HDFS

HDFS usage
Ref: YDN Hadoop. That guide was likely written before HDFS split off from Hadoop, so what is now the "hdfs" command was still called "hadoop".

hdfs dfs -ls			# list user's HDFS BASE dir (/user/$USER)
hdfs dfs -ls /		# list files from root of HDFS
hdfs dfs -mkdir /user/bofh 	# make a user's home dir; special privilege required.

hdfs dfs -put   foo  bar	# copy file/dir "foo" from unix to hdfs, naming it "bar"
				# -put stores the file as blocks, replicated across DataNodes per the replication factor.
				# -put will error if destination file "bar" already exists.
				# -put copies recursively if "foo" is a directory.
hdfs dfs -put   foo     	# copy file foo from unix.  Destination defaults to the HDFS BASE dir when not specified (untested; may be rejected).

hdfs dfs -get bar baz		# -get retrieves file/dir "bar" from HDFS to unix, saving it as "baz".
				# only -put and -get exchange files between hdfs and unix;
				# all other commands manipulate files within hdfs

hdfs dfs -setrep		# set replication level
hdfs dfs -help CMD		# get help on command

hdfs dfs -cat bar		# like cat inside HDFS
hdfs dfs -lsr 		# ls -R (recursive) inside hdfs; deprecated, use -ls -R
hdfs dfs -du path		# du: show space used under path
hdfs dfs -dus			# du -s, i.e. display summary usage
hdfs dfs -mv src dest		# move WITHIN hdfs
hdfs dfs -cp src dest		# copy WITHIN hdfs
hdfs dfs -rm path		# rm   WITHIN hdfs.  use -rmr for rm -r
hdfs dfs -touchz path		# create a zero-length file (z for zero)
hdfs dfs -test -e|z|d  path	# test: -e Exists, -z Zero length, -d Directory
hdfs dfs -stat FORMAT  path	# print stats per FORMAT, eg %n (name), %r (replication), %y (mtime)
hdfs dfs -tail -f      bar   	# tail [-f] bar (file inside HDFS)
hdfs dfs -chmod -R 750 path
hdfs dfs -chown -R OWNER path	# chown within hdfs; OWNER may be user or user:group
hdfs dfs -chgrp -R GRP   path


hadoop distcp -help		# read up on distributed cp: it starts a MapReduce job to parallelize large copies (distcp is a "hadoop" subcommand, not "hdfs")
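
A minimal distcp sketch between two clusters (the NameNode hosts nn1/nn2 and the paths are placeholders):

hadoop distcp hdfs://nn1:8020/user/bofh/src  hdfs://nn2:8020/user/bofh/dest
				# runs as a MapReduce job, copying the tree in parallel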

hdfs dfsadmin -report 		# report cluster capacity and DataNode status
hdfs dfsadmin -help

hdfs fsck PATH OPTIONS		# check health of hdfs
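
A hedged end-to-end session tying the commands above together (the file names and replication factor of 2 are illustrative):

cp /etc/hosts report.txt				# any local test file will do
hdfs dfs -put    report.txt /user/$USER/		# upload to the HDFS BASE dir
hdfs dfs -setrep 2  /user/$USER/report.txt		# request 2 replicas per block
hdfs dfs -stat   %r /user/$USER/report.txt		# confirm the replication factor
hdfs dfs -get    /user/$USER/report.txt  copy.txt	# bring a copy back to unix
hdfs fsck /user/$USER/report.txt -files -blocks		# verify block health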

Apache Hadoop

HBase

Apache Hive

Apache Spark

spark troubleshooting
http://n0156.lr3:8080/  - spark master (scheduler, monitors workers)
http://n0161.lr3:8081/  - worker process

http://n0093.lr3:4040/jobs/	- overview of job progress; one port per running application (4040, 4041, ...)
spark://n0093.lr3:7077/		- spark protocol for spark-submit to send a job to the master
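
A minimal spark-submit sketch against the master URL above (the class and jar names are hypothetical):

spark-submit --master spark://n0093.lr3:7077 \
             --class org.example.WordCount \
             --executor-memory 2G \
             wordcount.jar  hdfs:///user/bofh/input.txt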

Apache Kafka

Apache Storm

Cassandra

CouchDB, PouchDB, memcached, Couchbase

MongoDB

SciDB

    su - scidb
    scidb.py init all mycluster
    scidb.py startall mycluster
    scidb.py status   mycluster
    iquery -aq "list('arrays')" 	# list avail arrays.  [] means empty list
    iquery -q 'QUERY'			# run an AQL query; add -a to use AFL instead (as above)
    iquery -q 'create array X < x: uint64 > [ i=1:10001,1000,0, j=1:10001,1000,0]' 	# creates a test 2-D array: dims 1:10001, chunk size 1000, overlap 0
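
A hedged follow-up once the test array exists (standard AFL operators; the i+j fill expression is arbitrary):

    iquery -naq "store(build(X, i+j), X)"	# fill X with i+j; -n suppresses printing the result
    iquery -aq  "aggregate(X, sum(x))"		# sum attribute x over the whole array
    iquery -aq  "remove(X)"			# drop the array when done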
    

Docker

Apache Aurora

Node.JS

Parallel Environment

Apache Mesos

AirBnB Chronos (Mesos Framework)

CfnCluster

cfncluster is a framework that deploys and maintains HPC clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless; everything is done using CloudFormation or resources within AWS.
http://cfncluster.readthedocs.org/en/latest/getting_started.html
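
A minimal cfncluster session sketch (the cluster name mycluster is arbitrary; per the getting-started guide above):

pip install cfncluster		# install the CLI
cfncluster configure		# one-time setup: AWS region, keypair, VPC
cfncluster create mycluster	# launch the CloudFormation stack
cfncluster status mycluster	# poll stack/cluster status
cfncluster delete mycluster	# tear everything down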

MIT StarCluster

Also a way to deploy and maintain HPC clusters in AWS, but cfncluster seems to be where the action is now. See aws.html for a sample POC setup session.




Terminology

RDD, DataFrame	- Spark data abstractions: an RDD (Resilient Distributed Dataset) is an immutable, partitioned collection of records; a DataFrame adds named columns and a schema on top.
shard		- a horizontal partition of a dataset/database, spread across nodes for scalability.


Apache Parquet


scala

Links







Copyright info about this work

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. Pocket Sys Admin Survival Guide: for content that I wrote, (CC) some rights reserved. 2005,2012 Tin Ho [ tin6150 (at) gmail.com ]
Some contents are "cached" here for easy reference. Sources include man pages, vendor documents, online references, discussion groups, etc. Copyright of those are obviously those of the vendor and original authors. I am merely caching them here for quick reference and avoid broken URL problems.


