Big Data
Big data is the term for a collection of datasets so large
and complex that it becomes difficult to
process using on-hand database management tools or traditional data
processing applications.
• V’s of Big Data: Volume, Variety,
Velocity & Veracity
Hadoop Ecosystem :
Hadoop Common: The common utilities that support the
other Hadoop subprojects.
HDFS: A distributed file system that provides high
throughput access to application data.
MapReduce: A software framework for distributed
processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no
single points of failure.
HBase: A scalable, distributed database that supports
structured data storage for large tables.
Hive: A data warehouse infrastructure that provides
data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining
library.
Pig: A high-level data-flow language and execution
framework for parallel computation.
ZooKeeper: A high-performance coordination service
for distributed applications.
Hadoop Core Services :
Master Services :
1. NameNode :
The NameNode is the master node (and heartbeat receiver) of HDFS. It stores the
metadata in RAM for quick access and tracks the files across the Hadoop
cluster. If the NameNode fails, the whole of HDFS becomes inaccessible, so the
NameNode is critical for HDFS. The NameNode monitors the health of the
DataNodes, but it holds metadata only and never stores the file data itself. It
tracks all information about files, such as where each file is saved in the
cluster, the access time of each file, and which user is accessing a file at a
given time. FsImage contains the complete state of the file system namespace
since the start of the NameNode. EditLogs contain all the recent modifications
made to the file system with respect to the most recent FsImage.
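To make the metadata role concrete, here is a minimal sketch (assuming the Hadoop Java client on the classpath and a reachable cluster; the path /user/demo/file.txt is hypothetical) that asks the NameNode for a file's status:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // getFileStatus is answered from the NameNode's in-memory metadata;
        // no DataNode is contacted for this call.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/file.txt")); // hypothetical path
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Access time: " + status.getAccessTime());
        System.out.println("Owner      : " + status.getOwner());
    }
}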
2. Secondary NameNode :
The Secondary NameNode helps the primary NameNode by periodically merging the
EditLogs into the FsImage (the namespace). The merged image it keeps can be
used to restart the NameNode after a failure, but the Secondary NameNode is not
a hot standby. Because the merge is done in memory, it requires a huge amount
of memory, so the Secondary NameNode usually runs on a different machine for
memory management. In short, the Secondary NameNode is the checkpointing node
of the NameNode.
Checkpoint Node (not a master service) :
The Checkpoint Node was designed to address the NameNode's checkpointing
drawbacks. It keeps a latest-checkpoint directory with the same structure as
the NameNode's directory. To create a checkpoint of the NameNode namespace, it
downloads the FsImage and edits from the NameNode, merges them locally, and
then uploads the new image back to the NameNode.
Backup Node (not a master service) :
The Backup Node provides the same checkpointing function as the Checkpoint
Node, but it additionally interacts with the NameNode to receive an online
stream of file system edits. Because of this stream, an up-to-date copy of the
namespace is always available in the Backup Node's main memory; creating a
checkpoint only requires saving a copy of that in-memory namespace, with
nothing to download from the NameNode.
3. JobTracker :
The JobTracker manages MapReduce jobs, and its process runs on a separate
node. It receives MapReduce requests from clients and locates the input data
after talking to the NameNode. The JobTracker chooses the best TaskTracker to
execute each task and assigns slots on the TaskTrackers for execution. It
monitors all the TaskTrackers and gives status reports to the client. If the
JobTracker fails, no MapReduce job can execute and everything halts, so the
JobTracker is critical for MapReduce.
Slave Services :
1. DataNode :
A DataNode stores the actual data of HDFS and is also called a slave. If a
DataNode fails, the data stored on it is not lost, because every block is
replicated on other DataNodes. A DataNode is configured with a lot of disk
space because it stores the actual data. It performs read and write operations
as per client requests, and its operations are driven by the NameNode's
instructions.
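To illustrate this division of labour, here is a minimal read sketch (hypothetical path, same assumptions as the earlier sketch): opening the file consults the NameNode for block locations, while the bytes themselves are streamed from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode where the blocks live; the actual
        // bytes are then read directly from the DataNodes holding them.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/file.txt")); // hypothetical path
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}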
2. TaskTrackerNode :
TaskTrackers run on the DataNodes. TaskTrackers are assigned Mapper and
Reducer tasks to execute by the JobTracker. If a TaskTracker fails, the
JobTracker assigns its tasks to another node, so the MapReduce job still
completes successfully.
Blocks are nothing but the smallest contiguous location on your hard drive
where data is stored. The default size of each block is 128 MB in Apache
Hadoop 2.x (64 MB in Apache Hadoop 1.x).
Replication Management: HDFS provides a reliable way to store
huge data in a distributed environment as data blocks. The blocks are also
replicated to provide fault tolerance. The default replication factor is 3
which is again configurable. So each block is replicated three times and
stored on different DataNodes (considering the default replication factor).
Therefore, if you are storing
a file of 128 MB in HDFS using the default configuration, you will end up
occupying a space of 384 MB (3*128 MB) as the blocks will be replicated three
times and each replica will be residing on a different DataNode.
How to Configure Replication Factor and Block Size for HDFS?
Hadoop Distributed File System (HDFS) stores files as data blocks and distributes these blocks across the entire cluster. As HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated a number of times to ensure high data availability. The replication factor is a property that can be set in the HDFS configuration file that will allow you to adjust the global replication factor for the entire cluster. For each block stored in HDFS, there will be n – 1 duplicated blocks distributed across the cluster. For example, if the replication factor was set to 3 (default value in HDFS) there would be one original block and two replicas.
Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Change or add the following property to hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.
You can also change the replication factor on a per-file basis using the Hadoop FS shell.
[umeshp6655@gw02 ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory.
[umeshp6655@gw02 ~]$ hadoop fs -setrep -R -w 3 /my/dir
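The same per-file change can also be made from the Java API; a minimal sketch (hypothetical path) using FileSystem.setReplication():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent to: hadoop fs -setrep 3 /my/file
        // (returns true if the replication factor was changed)
        boolean changed = fs.setReplication(new Path("/my/file"), (short) 3);
        System.out.println("Replication changed: " + changed);
    }
}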
Hadoop Distributed File System was designed to hold and manage large amounts of data; therefore typical HDFS block sizes are significantly larger than the block sizes you would see for a traditional filesystem (for example, the filesystem on my laptop uses a block size of 4 KB). The block size setting is used by HDFS to divide files into blocks and then distribute those blocks across the cluster. For example, if a cluster is using a block size of 64 MB, and a 128-MB text file was put in to HDFS, HDFS would split the file into two blocks (128 MB/64 MB) and distribute the two chunks to the data nodes in the cluster.
Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Set the following property in hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>
hdfs-site.xml is used to configure HDFS. Changing the dfs.block.size property in hdfs-site.xml will change the default block size for all the files placed into HDFS. In this case, we set the dfs.block.size to 128 MB. Changing this setting will not affect the block size of any files currently in HDFS. It will only affect the block size of files placed into HDFS after this setting has taken effect.
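The block size can also be chosen per file at write time. A minimal sketch (hypothetical path; the buffer size and replication values are illustrative) using the long overload of FileSystem.create():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 128L * 1024 * 1024; // 134217728 bytes = 128 MB
        int bufferSize = 4096;
        short replication = 3;
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(new Path("/my/newfile"), // hypothetical path
                true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello HDFS");
        }
    }
}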
Below is the formula to calculate the HDFS Storage size required, when building a new Hadoop cluster.
H = C*R*S / (1-i) * 1.2   (the factor 1.2 is the 120% explained below)
Where :
C = Compression ratio. It depends on the type of compression used (Snappy, LZOP, …) and size of the data. When no compression is used, C=1.
R = Replication factor. It is usually 3 in a production cluster.
S = Initial size of the data that needs to be moved to Hadoop. This could be a combination of historical data and incremental data. (Here we also need to consider the growth rate of the initial data, at least for the next 3-6 months. For example, if we have 500 TB of data now, 50 TB is expected to be ingested in the next three months, and the output files from MR jobs may add at least 10% of the initial data, then we should take 600 TB as the initial data size.)
i = Intermediate data factor. It is usually 1/3 or 1/4. This is Hadoop's intermediate working space: storage dedicated to the intermediate results of Map tasks and any temporary storage used by Pig or Hive. This is a common guideline for many production applications; even Cloudera recommends 25% for intermediate results.
120% – or 1.2 times the total above. We have to allow room for the file system underlying HDFS, which is usually ext3 or ext4 and gets very, very unhappy at much above 80% fill. For example, if your cluster's total size is 1200 TB, it is recommended to use only up to 1000 TB.
Example :
C = 1, a replication factor of 3, and an intermediate factor of 0.25 = 1/4
H = 1*3*S / (1-1/4) * 1.2 = 3*S / (3/4) * 1.2 = 4*S * 1.2 = 4.8*S
With the assumptions above, the Hadoop storage is estimated to be 4.8 times the
size of the initial data (4 times from replication and intermediate space,
plus the 20% file system headroom).
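As a quick check of the arithmetic, here is a small sketch that plugs the example values into the formula:

public class ClusterSizing {
    // H = C * R * S / (1 - i) * 1.2
    static double requiredStorage(double c, double r, double s, double i) {
        return c * r * s / (1 - i) * 1.2;
    }

    public static void main(String[] args) {
        double s = 600;                            // initial data in TB (example above)
        double h = requiredStorage(1, 3, s, 0.25); // no compression, 3 replicas, 1/4 intermediate
        System.out.printf("Required raw HDFS storage: %.0f TB%n", h); // 2880 TB = 4.8 * 600
    }
}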
RACK AWARENESS :
Moving ahead, let's talk more about how HDFS places replicas and what rack
awareness is. The NameNode ensures that all the replicas of a block are not
stored on the same rack or a single rack. It follows a built-in Rack Awareness
Algorithm to reduce latency as well as provide fault tolerance. With a
replication factor of 3, the Rack Awareness Algorithm says that the first
replica of a block is stored on a local rack and the next two replicas are
stored on a different (remote) rack, but on different DataNodes within that
remote rack. If you have more replicas, the remaining replicas are placed on
random DataNodes, with no more than two replicas on the same rack, if
possible.
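As a toy illustration only (this is not Hadoop's actual BlockPlacementPolicy code, and the rack and node names are made up), the placement rule described above can be sketched like this:

import java.util.*;

// Toy model of the rack-aware placement rule described above;
// not Hadoop's real implementation.
public class RackAwarePlacement {
    static List<String> placeReplicas(Map<String, List<String>> racks,
                                      String localRack, int replicas) {
        List<String> chosen = new ArrayList<>();
        Random rnd = new Random();
        // 1st replica: a node on the writer's local rack
        chosen.add(pick(racks.get(localRack), rnd, chosen));
        if (replicas >= 2) {
            // 2nd and 3rd replicas: two different nodes on one remote rack
            List<String> remoteRacks = new ArrayList<>(racks.keySet());
            remoteRacks.remove(localRack);
            String remote = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
            chosen.add(pick(racks.get(remote), rnd, chosen));
            if (replicas >= 3) chosen.add(pick(racks.get(remote), rnd, chosen));
        }
        return chosen;
    }

    // Pick a random node from the rack that has not been chosen yet
    static String pick(List<String> nodes, Random rnd, List<String> used) {
        String n;
        do { n = nodes.get(rnd.nextInt(nodes.size())); } while (used.contains(n));
        return n;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new HashMap<>();
        racks.put("rack1", Arrays.asList("dn1", "dn2", "dn3"));
        racks.put("rack2", Arrays.asList("dn4", "dn5", "dn6"));
        racks.put("rack3", Arrays.asList("dn7", "dn8", "dn9"));
        System.out.println(placeReplicas(racks, "rack1", 3));
    }
}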
This is how an actual Hadoop production cluster looks: multiple racks
populated with DataNodes.
Advantages of Rack Awareness:
So, why do we need a Rack Awareness algorithm? The reasons are:
- To improve network performance: Communication between nodes residing on
different racks is directed via a switch, and in general there is more network
bandwidth between machines in the same rack than between machines on
different racks. So, Rack Awareness helps you reduce write traffic between
different racks, giving better write performance. You also gain increased
read performance, because a read can use the bandwidth of multiple racks.
- To prevent loss of data: We don't have to worry about the data even if an entire rack fails because of a switch failure or a power failure. And if you think about it, this makes sense: never put all your eggs in the same basket.
What is batch processing in hadoop?
Batch processing is the execution of a series of jobs in a program on a computer
without manual intervention (non-interactive). Strictly speaking, it is a
processing mode: the execution of a series of programs each on a set or
"batch" of inputs, rather than a single input (which would instead be a custom job). However, this
distinction has largely been lost, and the series of steps in a batch process
are often called a "job" or "batch job".
Hadoop
is based on batch processing of big data. This means that the data is stored
over a period of time and is then processed using Hadoop. In Spark, by
contrast, processing can take place in real time.
Block (physical) and Split (logical): a block is the physical division of data in HDFS, while an input split is the logical division that is fed to a single Mapper in MapReduce.
Hadoop MapReduce :
MapReduce is a processing technique and a programming model for distributed
computing based on Java. A job passes through the following phases (sketched
in code below):
- Input Splits
- Mapper
- Shuffling & Sorting
- Reducer
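A minimal sketch of these phases is the classic WordCount job, written against the standard org.apache.hadoop.mapreduce API (the input and output paths are placeholders supplied on the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: called once per input record; emits (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: after shuffle & sort, receives (word, [1,1,...]) and sums
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Between map() and reduce(), the framework performs the shuffle & sort phase automatically: all pairs with the same key are grouped and delivered to a single reduce() call.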