Big Data
Big data is the term for a collection of datasets so large
and complex that it becomes difficult to
process using on-hand database management tools or traditional data
processing applications.
• V’s of Big Data: Volume, Variety,
Velocity & Veracity
Hadoop Ecosystem :
Hadoop Common: The common utilities that support the
other Hadoop subprojects.
HDFS: A distributed file system that provides high
throughput access to application data.
MapReduce: A software framework for distributed
processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no
single points of failure.
HBase: A scalable, distributed database that supports
structured data storage for large tables.
Hive: A data warehouse infrastructure that provides
data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining
library.
Pig: A high-level data-flow language and execution
framework for parallel computation.
ZooKeeper: A high-performance coordination service
for distributed applications.
Hadoop Core Services :
Master Services :
1. NameNode :
The NameNode is the master node (and heartbeat receiver) of HDFS. It stores the
metadata in RAM for quick access and tracks the files across the Hadoop
cluster. If the NameNode fails, the whole of HDFS becomes inaccessible, so the
NameNode is critical for HDFS. The NameNode monitors the health of the
DataNodes, but it holds metadata only and never stores the file data itself. It
tracks all information about files, such as where each file is saved in the
cluster, the access time of each file, and which user is accessing a file at a
given time. FsImage contains the complete state of the file system namespace
since the start of the NameNode. EditLogs contain all the recent modifications
made to the file system with respect to the most recent FsImage.
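To make the metadata role concrete, here is a minimal sketch (assuming the Hadoop Java client on the classpath and a reachable cluster; the path /user/demo/file.txt is hypothetical) that asks the NameNode for a file's status:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // getFileStatus is answered from the NameNode's in-memory metadata;
        // no DataNode is contacted for this call.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/file.txt")); // hypothetical path
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Access time: " + status.getAccessTime());
        System.out.println("Owner      : " + status.getOwner());
    }
}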
2. Secondary NameNode :
The Secondary NameNode helps the primary NameNode by periodically merging the
EditLogs into the FsImage (the namespace). The merged image it keeps can be
used to restart the NameNode after a failure, but the Secondary NameNode is not
a hot standby. Because the merge is done in memory, it requires a huge amount
of memory, so the Secondary NameNode usually runs on a different machine for
memory management. In short, the Secondary NameNode is the checkpointing node
of the NameNode.
Checkpoint Node (not a master service) :
The Checkpoint Node was designed to address the NameNode's checkpointing
drawbacks. It keeps a latest-checkpoint directory with the same structure as
the NameNode's directory. To create a checkpoint of the NameNode namespace, it
downloads the FsImage and edits from the NameNode, merges them locally, and
then uploads the new image back to the NameNode.
Backup Node (not a master service) :
The Backup Node provides the same checkpointing function as the Checkpoint
Node, but it additionally interacts with the NameNode to receive an online
stream of file system edits. Because of this stream, an up-to-date copy of the
namespace is always available in the Backup Node's main memory; creating a
checkpoint only requires saving a copy of that in-memory namespace, with
nothing to download from the NameNode.
3. JobTracker :
The JobTracker manages MapReduce jobs, and its process runs on a separate
node. It receives MapReduce requests from clients and locates the input data
after talking to the NameNode. The JobTracker chooses the best TaskTracker to
execute each task and assigns slots on the TaskTrackers for execution. It
monitors all the TaskTrackers and gives status reports to the client. If the
JobTracker fails, no MapReduce job can execute and everything halts, so the
JobTracker is critical for MapReduce.
Slave Services :
1. DataNode :
A DataNode stores the actual data of HDFS and is also called a slave. If a
DataNode fails, the data stored on it is not lost, because every block is
replicated on other DataNodes. A DataNode is configured with a lot of disk
space because it stores the actual data. It performs read and write operations
as per client requests, and its operations are driven by the NameNode's
instructions.
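To illustrate this division of labour, here is a minimal read sketch (hypothetical path, same assumptions as the earlier sketch): opening the file consults the NameNode for block locations, while the bytes themselves are streamed from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode where the blocks live; the actual
        // bytes are then read directly from the DataNodes holding them.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/file.txt")); // hypothetical path
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}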
2. TaskTrackerNode :
TaskTrackers run on the DataNodes. TaskTrackers are assigned Mapper and
Reducer tasks to execute by the JobTracker. If a TaskTracker fails, the
JobTracker assigns its tasks to another node, so the MapReduce job still
completes successfully.
Blocks are nothing but the smallest contiguous location on your hard drive
where data is stored. The default size of each block is 128 MB in Apache
Hadoop 2.x (64 MB in Apache Hadoop 1.x).
Replication Management: HDFS provides a reliable way to store
huge data in a distributed environment as data blocks. The blocks are also
replicated to provide fault tolerance. The default replication factor is 3
which is again configurable. So each block is replicated three times and
stored on different DataNodes (considering the default replication factor).
Therefore, if you are storing
a file of 128 MB in HDFS using the default configuration, you will end up
occupying a space of 384 MB (3*128 MB) as the blocks will be replicated three
times and each replica will be residing on a different DataNode.
How to Configure Replication Factor and Block Size for HDFS?
Hadoop Distributed File System (HDFS) stores files as data blocks and distributes these blocks across the entire cluster. As HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated a number of times to ensure high data availability. The replication factor is a property that can be set in the HDFS configuration file that will allow you to adjust the global replication factor for the entire cluster. For each block stored in HDFS, there will be n – 1 duplicated blocks distributed across the cluster. For example, if the replication factor was set to 3 (default value in HDFS) there would be one original block and two replicas.
Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Change or add the following property to hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.
You can also change the replication factor on a per-file basis using the Hadoop FS shell.
[umeshp6655@gw02 ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory.
[umeshp6655@gw02 ~]$ hadoop fs -setrep -R -w 3 /my/dir
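The same per-file change can also be made from the Java API; a minimal sketch (hypothetical path) using FileSystem.setReplication():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent to: hadoop fs -setrep 3 /my/file
        // (returns true if the replication factor was changed)
        boolean changed = fs.setReplication(new Path("/my/file"), (short) 3);
        System.out.println("Replication changed: " + changed);
    }
}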
Hadoop Distributed File System was designed to hold and manage large amounts of data; therefore typical HDFS block sizes are significantly larger than the block sizes you would see for a traditional filesystem (for example, the filesystem on my laptop uses a block size of 4 KB). The block size setting is used by HDFS to divide files into blocks and then distribute those blocks across the cluster. For example, if a cluster is using a block size of 64 MB, and a 128-MB text file was put in to HDFS, HDFS would split the file into two blocks (128 MB/64 MB) and distribute the two chunks to the data nodes in the cluster.
Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Set the following property in hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>
hdfs-site.xml is used to configure HDFS. Changing the dfs.block.size property in hdfs-site.xml will change the default block size for all the files placed into HDFS. In this case, we set the dfs.block.size to 128 MB. Changing this setting will not affect the block size of any files currently in HDFS. It will only affect the block size of files placed into HDFS after this setting has taken effect.
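The block size can also be chosen per file at write time. A minimal sketch (hypothetical path; the buffer size and replication values are illustrative) using the long overload of FileSystem.create():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 128L * 1024 * 1024; // 134217728 bytes = 128 MB
        int bufferSize = 4096;
        short replication = 3;
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(new Path("/my/newfile"), // hypothetical path
                true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello HDFS");
        }
    }
}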
Below is the formula to calculate the HDFS Storage size required, when building a new Hadoop cluster.
H = C*R*S / (1-i) * 1.2   (the factor 1.2 is the 120% explained below)
Where :
C = Compression ratio. It depends on the type of compression used (Snappy, LZOP, …) and size of the data. When no compression is used, C=1.
R = Replication factor. It is usually 3 in a production cluster.
S = Initial size of the data that needs to be moved to Hadoop. This could be a combination of historical data and incremental data. (Here we also need to consider the growth rate of the initial data, at least for the next 3-6 months. For example, if we have 500 TB of data now, 50 TB is expected to be ingested in the next three months, and the output files from MR jobs may add at least 10% of the initial data, then we should take 600 TB as the initial data size.)
i = Intermediate data factor. It is usually 1/3 or 1/4. This is Hadoop's intermediate working space: storage dedicated to the intermediate results of Map tasks and any temporary storage used by Pig or Hive. This is a common guideline for many production applications; even Cloudera recommends 25% for intermediate results.
120% – or 1.2 times the total above. We have to allow room for the file system underlying HDFS, which is usually ext3 or ext4 and gets very, very unhappy at much above 80% fill. For example, if your cluster's total size is 1200 TB, it is recommended to use only up to 1000 TB.
Example :
C = 1, a replication factor of 3, and an intermediate factor of 0.25 = 1/4
H = 1*3*S / (1-1/4) * 1.2 = 3*S / (3/4) * 1.2 = 4*S * 1.2 = 4.8*S
With the assumptions above, the Hadoop storage is estimated to be 4.8 times the
size of the initial data (4 times from replication and intermediate space,
plus the 20% file system headroom).
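As a quick check of the arithmetic, here is a small sketch that plugs the example values into the formula:

public class ClusterSizing {
    // H = C * R * S / (1 - i) * 1.2
    static double requiredStorage(double c, double r, double s, double i) {
        return c * r * s / (1 - i) * 1.2;
    }

    public static void main(String[] args) {
        double s = 600;                            // initial data in TB (example above)
        double h = requiredStorage(1, 3, s, 0.25); // no compression, 3 replicas, 1/4 intermediate
        System.out.printf("Required raw HDFS storage: %.0f TB%n", h); // 2880 TB = 4.8 * 600
    }
}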
RACK AWARENESS :
Moving ahead, let's talk more about how HDFS places replicas and what rack
awareness is. The NameNode ensures that all the replicas of a block are not
stored on the same rack or a single rack. It follows a built-in Rack Awareness
Algorithm to reduce latency as well as provide fault tolerance. With a
replication factor of 3, the Rack Awareness Algorithm says that the first
replica of a block is stored on a local rack and the next two replicas are
stored on a different (remote) rack, but on different DataNodes within that
remote rack. If you have more replicas, the remaining replicas are placed on
random DataNodes, with no more than two replicas on the same rack, if
possible.
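As a toy illustration only (this is not Hadoop's actual BlockPlacementPolicy code, and the rack and node names are made up), the placement rule described above can be sketched like this:

import java.util.*;

// Toy model of the rack-aware placement rule described above;
// not Hadoop's real implementation.
public class RackAwarePlacement {
    static List<String> placeReplicas(Map<String, List<String>> racks,
                                      String localRack, int replicas) {
        List<String> chosen = new ArrayList<>();
        Random rnd = new Random();
        // 1st replica: a node on the writer's local rack
        chosen.add(pick(racks.get(localRack), rnd, chosen));
        if (replicas >= 2) {
            // 2nd and 3rd replicas: two different nodes on one remote rack
            List<String> remoteRacks = new ArrayList<>(racks.keySet());
            remoteRacks.remove(localRack);
            String remote = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
            chosen.add(pick(racks.get(remote), rnd, chosen));
            if (replicas >= 3) chosen.add(pick(racks.get(remote), rnd, chosen));
        }
        return chosen;
    }

    // Pick a random node from the rack that has not been chosen yet
    static String pick(List<String> nodes, Random rnd, List<String> used) {
        String n;
        do { n = nodes.get(rnd.nextInt(nodes.size())); } while (used.contains(n));
        return n;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new HashMap<>();
        racks.put("rack1", Arrays.asList("dn1", "dn2", "dn3"));
        racks.put("rack2", Arrays.asList("dn4", "dn5", "dn6"));
        racks.put("rack3", Arrays.asList("dn7", "dn8", "dn9"));
        System.out.println(placeReplicas(racks, "rack1", 3));
    }
}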
This is how an actual Hadoop production cluster looks: multiple racks
populated with DataNodes.
Advantages of Rack Awareness:
So, why do we need a Rack Awareness algorithm? The reasons are:
- To improve network performance: Communication between nodes residing on
different racks is directed via a switch, and in general there is more network
bandwidth between machines in the same rack than between machines on
different racks. So, Rack Awareness helps you reduce write traffic between
different racks, giving better write performance. You also gain increased
read performance, because a read can use the bandwidth of multiple racks.
- To prevent loss of data: We don't have to worry about the data even if an entire rack fails because of a switch failure or a power failure. And if you think about it, this makes sense: never put all your eggs in the same basket.
What is batch processing in hadoop?
Batch processing is the execution of a series of jobs in a program on a computer
without manual intervention (non-interactive). Strictly speaking, it is a
processing mode: the execution of a series of programs each on a set or
"batch" of inputs, rather than a single input (which would instead be a custom job). However, this
distinction has largely been lost, and the series of steps in a batch process
are often called a "job" or "batch job".
Hadoop
is based on batch processing of big data. This means that the data is stored
over a period of time and is then processed using Hadoop. In Spark, by
contrast, processing can take place in real time.
Block (physical) and Split (logical): a block is the physical division of data in HDFS, while an input split is the logical division that is fed to a single Mapper in MapReduce.
Hadoop MapReduce :
MapReduce is a processing technique and a programming model for distributed
computing based on Java. A job passes through the following phases (sketched
in code below):
- Input Splits
- Mapper
- Shuffling & Sorting
- Reducer
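A minimal sketch of these phases is the classic WordCount job, written against the standard org.apache.hadoop.mapreduce API (the input and output paths are placeholders supplied on the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: called once per input record; emits (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: after shuffle & sort, receives (word, [1,1,...]) and sums
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Between map() and reduce(), the framework performs the shuffle & sort phase automatically: all pairs with the same key are grouped and delivered to a single reduce() call.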