Apache Hive
Apache Hive is open source data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides an SQL-like query language called HiveQL, which handles structured data only.
By default, Hive uses an embedded Derby database for its metastore, which holds table definitions and other metadata. The table data itself is eventually stored in files. There are some specific file formats which Hive can handle, such as:
- TEXTFILE
- SEQUENCEFILE
- RCFILE
- ORCFILE
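Which of these formats is used when a CREATE TABLE statement names no format is controlled by a Hive configuration property; a minimal sketch, assuming the standard hive.default.fileformat setting (its out-of-the-box value is TextFile):

```sql
-- Inspect the current default storage format for new tables,
-- then change it so new tables default to ORC.
SET hive.default.fileformat;
SET hive.default.fileformat=ORC;
```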
File Format
A file format is a way in which information is stored or encoded in a computer file. In Hive, it refers to how records are stored inside the file. Since we are dealing with structured data, each record has a defined structure, and how records are encoded in a file defines the file format.
These file formats mainly differ in data encoding, compression rate, space usage, and disk I/O.
Hive does not verify whether the data you are loading matches the table schema. However, it does verify that the file format matches the table definition.
TEXTFILE
TEXTFILE is a common input/output format used in Hadoop. In Hive, if we define a table as TEXTFILE, it can load data from CSV (Comma Separated Values) files, files delimited by tabs or spaces, and JSON data. This means the fields in each record should be separated by a comma, space, or tab, or the record may be JSON (JavaScript Object Notation) data.
By default, if we use TEXTFILE format then each line is considered as a record.
Storing in TEXTFILE
Creating a table in Hive that stores its data in TEXTFILE:
/*---------------------------------------------------------------------------------------------------------------------*/
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
salary String, destination String)
/*----- A table comment, the row-format clauses (field terminator,
line terminator), and the stored file type follow. --------*/
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
/*---------------------------------------------------------------------------------------------------------------------*/
Load Data Statement
Generally, after creating a table in SQL, we insert data using the INSERT statement. In Hive, however, we insert data using the LOAD DATA statement.
When inserting data into Hive, it is better to use LOAD DATA for bulk records. There are two ways to load data: from the local file system and from the Hadoop file system (HDFS).
The syntax for loading data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;
-- LOCAL is an optional identifier that specifies a local path.
-- OVERWRITE is optional and overwrites the existing data in the table.
Example:
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
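The second way mentioned above, loading from the Hadoop file system, simply omits the LOCAL keyword; a hedged sketch, assuming the file has already been copied to a hypothetical HDFS path:

```sql
-- Load from HDFS: no LOCAL keyword. Note that this moves (rather
-- than copies) the file from its HDFS location into the table's
-- warehouse directory.
hive> LOAD DATA INPATH '/user/hadoop/sample.txt' OVERWRITE INTO TABLE employee;
```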
SEQUENCEFILE
We know that Hadoop performs best when we work with a small number of large files rather than a large number of small files. If a file is smaller than the typical block size in Hadoop, we consider it a small file. Many small files increase the amount of metadata, which becomes an overhead for the NameNode. Sequence files were introduced in Hadoop to solve this problem: they act as a container for storing small files.
Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to use for a given record. Sequence files are in a binary format that can be split, and their main use is to combine two or more smaller files into a single sequence file.
In Hive we can create a sequence file by specifying STORED AS
SEQUENCEFILE in the end of a CREATE TABLE statement.
There are three types of sequence files:
• Uncompressed key/value records.
• Record-compressed key/value records – only the values are compressed.
• Block-compressed key/value records – both keys and values are collected in 'blocks' separately and compressed. The block size is configurable.
Hive has its own SEQUENCEFILE reader and writer libraries for reading and writing sequence files.
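Of the three types above, block compression is typically enabled through session-level settings rather than the table DDL; a minimal sketch, assuming the standard Hive and Hadoop compression properties:

```sql
-- Enable compressed output, then pick block-level compression for
-- sequence files (valid types are NONE, RECORD, and BLOCK).
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
```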
Storing in SEQUENCEFILE
/*---------------------------------------------------------------------------------------------------------------------*/
create table olympic_sequencefile(athlete STRING, age INT, country STRING,
year STRING, closing STRING, sport STRING, gold INT, silver INT,
bronze INT, total INT)
/*----- The row-format clauses (field terminator, line terminator)
and the stored file type follow. --------*/
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
/*---------------------------------------------------------------------------------------------------------------------*/
Because LOAD DATA does not convert file formats, to load data into a SEQUENCEFILE table from an existing text-format table we need to use the following approach:
INSERT OVERWRITE TABLE olympic_sequencefile
SELECT * FROM olympic;
If the source files are already in the table's format, LOAD DATA works the same way for all file formats:
-- To overwrite the data in the table:
LOAD DATA INPATH '/home/hadoop/data/olympicseq' OVERWRITE INTO TABLE olympic_sequencefile;
-- To append data to the table:
LOAD DATA INPATH '/home/hadoop/data/olympicseq' INTO TABLE olympic_sequencefile;
RCFILE
RCFILE stands for Record Columnar File, another binary file format, which offers a high compression rate on top of row-based formats. RCFILE is used when we want to perform operations on multiple rows at a time.
RCFILEs are flat files consisting of binary key/value pairs and share many similarities with SEQUENCEFILE. RCFILE stores the columns of a table in a columnar manner: it first partitions rows horizontally into row splits, and then vertically partitions each row split in a columnar way. RCFILE stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part. This means that RCFILE uses column-oriented rather than row-oriented storage.
Column-oriented storage is very useful when performing analytics, because queries that touch only a few columns can skip reading the rest.
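For instance, a sketch of an analytic query over a hypothetical olympic_rcfile table (same schema as the sequence-file table above): since only two columns are referenced, a columnar layout lets the reader skip the other column groups in each row split.

```sql
-- Aggregate over just two columns; with RCFILE only the 'country'
-- and 'gold' column groups of each row split need to be read.
SELECT country, SUM(gold) AS total_gold
FROM olympic_rcfile
GROUP BY country;
```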
Storing in RCFILE
/*---------------------------------------------------------------------------------------------------------------------*/
create table olympic_rcfile(athlete STRING, age INT, country STRING,
year STRING, closing STRING, sport STRING, gold INT, silver INT,
bronze INT, total INT)
/*----- The row-format clauses (field terminator, line terminator)
and the stored file type follow. --------*/
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS RCFILE;
/*---------------------------------------------------------------------------------------------------------------------*/
ORCFILE
ORC stands for Optimized Row Columnar, a format that stores data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75% (e.g., a 100 GB file becomes 25 GB). As a result, the speed of data processing also increases. ORC shows better performance than the Text, Sequence, and RC file formats.
An ORC file stores row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
Storing in ORCFILE
/*---------------------------------------------------------------------------------------------------------------------*/
create table olympic_orcfile(athlete STRING, age INT, country STRING,
year STRING, closing STRING, sport STRING, gold INT, silver INT,
bronze INT, total INT)
/*----- The row-format clauses (field terminator, line terminator)
and the stored file type follow. --------*/
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC;
/*---------------------------------------------------------------------------------------------------------------------*/
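ORC's compression codec can also be tuned per table; a hedged sketch using the orc.compress table property (ZLIB is the default codec; SNAPPY and NONE are also accepted), on a hypothetical table:

```sql
-- A table that stores its data as ORC with Snappy compression,
-- chosen via a table property at creation time.
CREATE TABLE olympic_orc_snappy (athlete STRING, age INT, country STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
```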