Big Data File Formats
Evaluation Framework: Row vs. Column
At the highest level, column-based storage is most useful for analytics queries that examine only a subset of columns over very large data sets.
If your queries require access to all or most of the columns of each row
of data, row-based storage will be better suited to your needs.
To help illustrate the differences between row- and column-based data, consider this table of basic transaction data. For each transaction, we have the customer name, the product ID, the sale amount, and the date.

Customer   Product ID   Sale Amount   Date
Emma       Prod1        100.00        2018-04-02
Liam       Prod2        79.99         2018-04-02
Noah       Prod3        19.99         2018-04-01
Olivia     Prod2        79.99         2018-04-03
Row-based storage is the simplest form of data table and is used in many
applications, from web log files to highly structured database systems like
MySQL and Oracle.
In a database, this data would be stored by row, as follows:
Emma,Prod1,100.00,2018-04-02;Liam,Prod2,79.99,2018-04-02;Noah,Prod3,19.99,2018-04-01;Olivia,Prod2,79.99,2018-04-03
Column-based data formats, as you might imagine, store data by column.
Using our transaction data as an example, in a columnar database this data
would be stored as follows:
Emma,Liam,Noah,Olivia;Prod1,Prod2,Prod3,Prod2;100.00,79.99,19.99,79.99;2018-04-02,2018-04-02,2018-04-01,2018-04-03
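To make the two layouts concrete, here is a minimal Python sketch that holds the same four transactions from the table above, first row-wise and then column-wise (plain in-memory structures, not a real storage engine):

# Row-based layout: each record is kept together, one after another.
rows = [
    {"customer": "Emma",   "product": "Prod1", "amount": 100.00, "date": "2018-04-02"},
    {"customer": "Liam",   "product": "Prod2", "amount": 79.99,  "date": "2018-04-02"},
    {"customer": "Noah",   "product": "Prod3", "amount": 19.99,  "date": "2018-04-01"},
    {"customer": "Olivia", "product": "Prod2", "amount": 79.99,  "date": "2018-04-03"},
]

# Column-based layout: all values of one column are kept together.
columns = {
    "customer": ["Emma", "Liam", "Noah", "Olivia"],
    "product":  ["Prod1", "Prod2", "Prod3", "Prod2"],
    "amount":   [100.00, 79.99, 19.99, 79.99],
    "date":     ["2018-04-02", "2018-04-02", "2018-04-01", "2018-04-03"],
}

# An analytics query such as "total sales" needs only the amount column,
# so the columnar layout touches a single list instead of every record.
total_sales = sum(columns["amount"])

# A query that needs the whole record (e.g. one customer's transaction)
# is more natural against the row layout.
liam = next(r for r in rows if r["customer"] == "Liam")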
Evaluation Framework: Schema Evolution
When we talk about “schema” in a database context, we are really talking
about its organization—the tables, columns, views, primary keys, relationships,
etc. When we talk about schemas in the context of an individual dataset or data
file, it’s helpful to simplify schema further to the individual attribute level
(column headers in the simplest use case). The schema will store the definition
of each attribute and its type. Unless your data is guaranteed to never change,
you’ll need to think about schema evolution, or how your data schema changes
over time. How will your file format manage fields that are added or deleted?
One of the most important considerations when selecting a data format is
how it manages schema evolution. When evaluating schema evolution specifically,
there are a few key questions to ask of any data format:
- How easy is it to update a schema (such as adding a field, removing or renaming a field)?
- How will different versions of the schema “talk” to each other?
- Is it human-readable? Does it need to be?
- How fast can the schema be processed?
- How does it impact the size of data?
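To make the first two questions concrete, here is a minimal sketch using the fastavro Python library (the record and field names are illustrative only). It writes data with an old schema and reads it back with a newer schema that adds a field with a default value:

import io
from fastavro import writer, reader, parse_schema

# Version 1 of the schema: a transaction has a customer and an amount.
schema_v1 = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

# Version 2 adds a "currency" field with a default value.
schema_v2 = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
})

# Write with the old schema...
buf = io.BytesIO()
writer(buf, schema_v1, [{"customer": "Emma", "amount": 100.00}])

# ...and read with the new schema: the missing field is filled from its default.
buf.seek(0)
for record in reader(buf, reader_schema=schema_v2):
    print(record)  # {'customer': 'Emma', 'amount': 100.0, 'currency': 'USD'}

Because the reader resolves the old data against the new schema, the two versions can “talk” to each other without any if-else logic in the application.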
Evaluation Framework: Splittability
Datasets are commonly composed of hundreds to thousands of files, each
of which may contain thousands to millions of records or more. Furthermore,
these file-based chunks of data are often being generated continuously.
Processing such datasets efficiently usually requires breaking the job up into
parts that can be farmed out to separate processors. In fact, large-scale
parallelization of processing is key to performance. Your choice of file format
can critically affect the ease with which this parallelization can be
implemented.
Row-based formats, such as Avro, can be split along row boundaries, as
long as the processing can proceed with one record at a time. If groups of
records related by some particular column value are required for processing,
out-of-the-box partitioning may be more challenging for row-based data stored
in random order.
A column-based format will be more amenable to splitting into separate
jobs if the query calculation is concerned with a single column at a time. The
columnar formats we discuss in this paper are row-columnar, which means they
take a batch of rows and store that batch in columnar format. These batches
then become split boundaries.
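As a rough illustration of row groups as split boundaries, the following sketch (assuming the pyarrow library; transactions.parquet is a hypothetical file) processes a Parquet file one row group at a time. Each iteration of the loop is an independent unit of work that could just as easily be farmed out to a separate worker:

import pyarrow.parquet as pq

pf = pq.ParquetFile("transactions.parquet")  # hypothetical file

# Each row group is an independent, columnar batch of rows.
for i in range(pf.num_row_groups):
    batch = pf.read_row_group(i, columns=["amount"])  # read only the column we need
    print(i, batch.num_rows, sum(batch.column("amount").to_pylist()))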
Evaluation Framework: Compression
Data compression reduces the amount of information needed for the
storage or transmission of a given set of data. It reduces the resources required
to store and transmit data, typically saving time and money. Compression achieves this by encoding frequently repeating data more compactly, usually at the source before the data is stored and/or transmitted. In its simplest sense, any reduction in the size of a data file can be called data compression.
Columnar data can achieve better compression rates than row-based data.
Storing values by column, with the same type next to each other, allows you to
do more efficient compression on them than if you’re storing rows of data. For
example, storing all dates together in memory allows for more efficient
compression than storing data of various types next to each other—such as
string, number, date, string, date.
While compression may save on storage costs, it is important to also
consider compute costs and resources. Chances are, at some point you will want
to decompress that data for use in another application. Decompression is not
free—it incurs compute costs. If and how you compress the data will be a
function of how you want to optimize the compute costs vs. storage costs for
your given use case.
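As a quick way to see that trade-off on your own data, the sketch below (again assuming pyarrow; the table and file names are made up) writes the same table with several codecs and prints the resulting file sizes. Heavier codecs generally produce smaller files at the cost of more CPU time to compress and decompress:

import os
import pyarrow as pa
import pyarrow.parquet as pq

# A small, repetitive table; real gains show up on much larger data,
# but the mechanics are the same.
table = pa.table({
    "customer": ["Emma", "Liam", "Noah", "Olivia"] * 10_000,
    "amount":   [100.00, 79.99, 19.99, 79.99] * 10_000,
    "date":     ["2018-04-02", "2018-04-02", "2018-04-01", "2018-04-03"] * 10_000,
})

# Write the same data with different codecs and compare file sizes.
for codec in ["none", "snappy", "gzip", "zstd"]:
    path = f"transactions_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")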
APACHE AVRO: A ROW-BASED FORMAT
Apache Avro was released by the Hadoop working group in 2009. It is a
row-based format that is highly splittable. The key feature of Avro is
that the schema travels with the data. The data definition is stored in JSON
format while the data is stored in binary format, minimizing file size and
maximizing efficiency. Avro features robust support for schema evolution by
managing added fields, missing fields, and fields that have changed. This
allows old software to read the new data and new software to read the old
data—a critical feature if your data has the potential to change.
We understand this intuitively—as soon as you’ve finished what you’re
sure is the master schema to end all schemas, someone will come up with a new
use case and request to add a field. This is especially true for big,
distributed systems in large corporations. With Avro’s capacity to manage
schema evolution, it’s possible to update components independently, at
different times, with low risk of incompatibility. This saves applications from
having to write if-else statements to process different schema versions, and
saves the developer from having to look at old code to understand old schemas.
Because the schema is stored in a human-readable JSON header,
it's easy to see all the fields that you have available.
Avro can support many different programming languages. Because the
schema is stored in JSON while the data is in binary, Avro is a relatively
compact option for both persistent data storage and wire transfer. Avro is
typically the format of choice for write-heavy workloads, since it is easy to
append new rows.
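As a minimal illustration of the schema travelling with the data, the sketch below (using the fastavro Python library; the file and field names are illustrative) writes a few records and then reads them back without supplying a schema, because the reader recovers it from the file header:

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "product", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "date", "type": "string"},
    ],
})

records = [
    {"customer": "Emma", "product": "Prod1", "amount": 100.00, "date": "2018-04-02"},
    {"customer": "Liam", "product": "Prod2", "amount": 79.99,  "date": "2018-04-02"},
]

# The schema is written into the file header; the records follow in binary.
with open("transactions.avro", "wb") as out:
    writer(out, schema, records)

# A reader needs no schema up front: it is recovered from the file itself.
with open("transactions.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)  # the JSON schema stored in the header
    for record in avro_reader:
        print(record)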
APACHE PARQUET: A COLUMN-BASED FORMAT
Parquet was developed by Cloudera and Twitter (and inspired by Google’s
Dremel query system) to serve as an optimized columnar data store on Hadoop.
Because data is stored by columns, it can be highly compressed and splittable
(for the reasons noted above). Parquet is commonly used with Apache Impala, an
analytics database for Hadoop. Impala is designed for low latency and high
concurrency queries on Hadoop.
The column metadata for a Parquet file is stored at the end of the
file, which allows for fast, one-pass writing. Metadata can include information
such as data types, the compression/encoding scheme used (if any), statistics,
element names, and more.
Parquet is especially adept at analyzing wide datasets with many
columns. Each Parquet file contains binary data organized by “row group.” For
each row group, the data values are organized by column. This enables the
compression benefits that we described above. Parquet is a good choice for
read-heavy workloads.
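For example, with pyarrow a read-heavy analytics query can load just the columns it needs from a (hypothetical) Parquet file and ignore the rest of a wide table entirely:

import pyarrow.parquet as pq

# Read only two columns out of a potentially very wide table.
table = pq.read_table("transactions.parquet", columns=["date", "amount"])

# e.g. total sales, computed from the columns we actually loaded
print(table.num_rows, sum(table.column("amount").to_pylist()))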
Schema evolution is supported in the Parquet format and is generally not an issue. However, not all systems that prefer Parquet support schema evolution optimally. For example, consider a columnar store like Impala: it is hard for that data store to support schema evolution, because the database needs to hold two versions of the schema (old and new) for a table.
APACHE ORC: A ROW-COLUMNAR FORMAT
Optimized Row Columnar (ORC) format was first developed at Hortonworks
to optimize storage and performance in Hive, a data warehouse for
summarization, query and analysis that lives on top of Hadoop. Hive is designed
for queries and analysis, and uses the query language HiveQL (similar to SQL).
ORC files are designed for high performance when Hive is reading,
writing, and processing data. ORC stores row data in columnar format. This
row-columnar format is highly efficient for compression and storage. It allows
for parallel processing across a cluster, and the columnar format allows for
skipping of unneeded columns for faster processing and decompression. ORC files
can store data more efficiently without compression than compressed text files.
Like Parquet, ORC is a good option for read-heavy workloads.
This advanced level of compression is possible because of its index system. ORC files contain groups of row data called “stripes”, along with a lightweight index recorded every 10,000 rows. The stripes are the data building blocks and are independent of each other, which means queries can skip directly to the stripes they need. Within each stripe, the reader can focus only on the columns required. The file footer includes descriptive statistics for each column within a stripe, such as count, sum, min, max, and whether null values are present.
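A rough sketch with pyarrow's ORC module (assuming a recent pyarrow build with ORC support; the file name is made up) shows stripes and column selection in action:

import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "customer": ["Emma", "Liam", "Noah", "Olivia"],
    "amount":   [100.00, 79.99, 19.99, 79.99],
})
orc.write_table(table, "transactions.orc")

# An ORC file is divided into independent stripes; a reader can jump to a
# single stripe and pull out only the columns it needs.
f = orc.ORCFile("transactions.orc")
print(f.nstripes)                              # number of stripes in the file
stripe = f.read_stripe(0, columns=["amount"])  # one stripe, one column
print(stripe.column(0).to_pylist())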
ORC is designed to maximize storage and query efficiency. According to
the Apache Foundation, “Facebook uses ORC to save tens of petabytes in their
data warehouse and demonstrated that ORC is significantly faster than RC File
or Parquet.”
Similar to Parquet, schema evolution is supported by the ORC file
format, but its efficacy is dependent on what the data store supports. Recent
advances have been made in Hive that allow for appending columns, type
conversion, and name mapping.
Comparison between Avro, ORC & Parquet Files

Format    Storage layout   Best suited for                Schema evolution
Avro      Row-based        Write-heavy workloads          Robust; the schema travels with the data
ORC       Row-columnar     Read-heavy workloads (Hive)    Supported; depends on the data store
Parquet   Columnar         Read-heavy, wide analytics     Supported; depends on the data store