Apache Hive
Apache Hive is open source data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides an SQL-like query language called HiveQL, which handles structured data only.
By default, Hive uses an embedded Derby database for its metastore, which holds table definitions and other metadata. The table data itself is eventually stored in files. There are some specific file formats which Hive can handle, such as:
- TEXTFILE
- SEQUENCEFILE
- RCFILE
- ORCFILE
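Which of these formats is used when a CREATE TABLE statement names no format is controlled by a Hive configuration property; a minimal sketch, assuming the standard hive.default.fileformat setting (its out-of-the-box value is TextFile):

```sql
-- Inspect the current default storage format for new tables,
-- then change it so new tables default to ORC.
SET hive.default.fileformat;
SET hive.default.fileformat=ORC;
```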
File Format
A file format is a way in which information is stored or encoded in a computer file. In Hive, it refers to how records are stored inside the file. Since we are dealing with structured data, each record has a defined structure, and how records are encoded in a file defines the file format.
These file formats mainly differ in data encoding, compression rate, space usage, and disk I/O.
Hive does not verify whether the data you are loading matches the table schema. However, it does verify that the file format matches the table definition.
TEXTFILE
TEXTFILE is a common input/output format used in Hadoop. In Hive, if we define a table as TEXTFILE, it can load data from CSV (Comma Separated Values) files, files delimited by tabs or spaces, and JSON data. This means the fields in each record should be separated by a comma, space, or tab, or the record may be JSON (JavaScript Object Notation) data.
By default, if we use TEXTFILE format then each line is considered as a record.
Storing in TEXTFILE
Creating a table in Hive that stores its data in TEXTFILE:
/*---------------------------------------------------------------------------------------------------------------------*/
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
salary String, destination String)
/*----- A table comment, the row-format clauses (field terminator,
line terminator), and the stored file type follow. --------*/
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
/*---------------------------------------------------------------------------------------------------------------------*/
Load Data Statement
Generally, after creating a table in SQL, we insert data using the INSERT statement. In Hive, however, we insert data using the LOAD DATA statement.
When inserting data into Hive, it is better to use LOAD DATA for bulk records. There are two ways to load data: from the local file system and from the Hadoop file system (HDFS).
The syntax for loading data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;
-- LOCAL is an optional identifier that specifies a local path.
-- OVERWRITE is optional and overwrites the existing data in the table.
Example:
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
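The second way mentioned above, loading from the Hadoop file system, simply omits the LOCAL keyword; a hedged sketch, assuming the file has already been copied to a hypothetical HDFS path:

```sql
-- Load from HDFS: no LOCAL keyword. Note that this moves (rather
-- than copies) the file from its HDFS location into the table's
-- warehouse directory.
hive> LOAD DATA INPATH '/user/hadoop/sample.txt' OVERWRITE INTO TABLE employee;
```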
SEQUENCEFILE
We know that Hadoop performs best when we work with a small number of large files rather than a large number of small files. If a file is smaller than the typical block size in Hadoop, we consider it a small file. Many small files increase the amount of metadata, which becomes an overhead for the NameNode. Sequence files were introduced in Hadoop to solve this problem: they act as a container for storing small files.
Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to use for a given record. Sequence files are in a binary format that can be split, and their main use is to combine two or more smaller files into a single sequence file.
In Hive we can create a sequence file by specifying STORED AS
SEQUENCEFILE in the end of a CREATE TABLE statement.
There are three types of sequence files:
• Uncompressed key/value records.
• Record-compressed key/value records – only the values are compressed.
• Block-compressed key/value records – both keys and values are collected in 'blocks' separately and compressed. The block size is configurable.
Hive has its own SEQUENCEFILE reader and writer libraries for reading and writing sequence files.
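Of the three types above, block compression is typically enabled through session-level settings rather than the table DDL; a minimal sketch, assuming the standard Hive and Hadoop compression properties:

```sql
-- Enable compressed output, then pick block-level compression for
-- sequence files (valid types are NONE, RECORD, and BLOCK).
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
```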
Storing in SEQUENCEFILE
/*---------------------------------------------------------------------------------------------------------------------*/
create table olympic_sequencefile(athlete STRING, age INT, country STRING,
year STRING, closing STRING, sport STRING, gold INT, silver INT,
bronze INT, total INT)
/*----- The row-format clauses (field terminator, line terminator)
and the stored file type follow. --------*/
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
/*---------------------------------------------------------------------------------------------------------------------*/
Because LOAD DATA does not convert file formats, to load data into a SEQUENCEFILE table from an existing text-format table we need to use the following approach:
INSERT OVERWRITE TABLE olympic_sequencefile
SELECT * FROM olympic;
If the source files are already in the table's format, LOAD DATA works the same way for all file formats:
-- To overwrite the data in the table:
LOAD DATA INPATH '/home/hadoop/data/olympicseq' OVERWRITE INTO TABLE olympic_sequencefile;
-- To append data to the table:
LOAD DATA INPATH '/home/hadoop/data/olympicseq' INTO TABLE olympic_sequencefile;
RCFILE
RCFILE stands for Record Columnar File, another binary file format, which offers a high compression rate on top of row-based formats. RCFILE is used when we want to perform operations on multiple rows at a time.
RCFILEs are flat files consisting of binary key/value pairs and share many similarities with SEQUENCEFILE. RCFILE stores the columns of a table in a columnar manner: it first partitions rows horizontally into row splits, and then vertically partitions each row split in a columnar way. RCFILE stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part. This means that RCFILE uses column-oriented rather than row-oriented storage.
Column-oriented storage is very useful when performing analytics, because queries that touch only a few columns can skip reading the rest.
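For instance, a sketch of an analytic query over a hypothetical olympic_rcfile table (same schema as the sequence-file table above): since only two columns are referenced, a columnar layout lets the reader skip the other column groups in each row split.

```sql
-- Aggregate over just two columns; with RCFILE only the 'country'
-- and 'gold' column groups of each row split need to be read.
SELECT country, SUM(gold) AS total_gold
FROM olympic_rcfile
GROUP BY country;
```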
Storing in RCFILE
/*---------------------------------------------------------------------------------------------------------------------*/
create table olympic_rcfile(athlete STRING, age INT, country STRING,
year STRING, closing STRING, sport STRING, gold INT, silver INT,
bronze INT, total INT)
/*----- The row-format clauses (field terminator, line terminator)
and the stored file type follow. --------*/
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS RCFILE;
/*---------------------------------------------------------------------------------------------------------------------*/
ORCFILE
ORC stands for Optimized Row Columnar, a format that stores data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75% (e.g., a 100 GB file becomes 25 GB). As a result, the speed of data processing also increases. ORC shows better performance than the Text, Sequence, and RC file formats.
An ORC file stores row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
Storing in ORCFILE
/*---------------------------------------------------------------------------------------------------------------------*/
create table olympic_orcfile(athlete STRING, age INT, country STRING,
year STRING, closing STRING, sport STRING, gold INT, silver INT,
bronze INT, total INT)
/*----- The row-format clauses (field terminator, line terminator)
and the stored file type follow. --------*/
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC;
/*---------------------------------------------------------------------------------------------------------------------*/
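ORC's compression codec can also be tuned per table; a hedged sketch using the orc.compress table property (ZLIB is the default codec; SNAPPY and NONE are also accepted), on a hypothetical table:

```sql
-- A table that stores its data as ORC with Snappy compression,
-- chosen via a table property at creation time.
CREATE TABLE olympic_orc_snappy (athlete STRING, age INT, country STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
```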