Big Data File Formats
Evaluation Framework: Row vs. Column
At the highest level, column-based storage is most useful for analytics queries that examine only a subset of columns over very large data sets.
If your queries require access to all or most of the columns of each row
of data, row-based storage will be better suited to your needs.
To help illustrate the differences between row- and column-based data, consider this table of basic transaction data. For each transaction, we have the customer name, the product ID, the sale amount, and the date.

Customer   Product ID   Sale Amount   Date
Emma       Prod1        100.00        2018-04-02
Liam       Prod2        79.99         2018-04-02
Noah       Prod3        19.99         2018-04-01
Olivia     Prod2        79.99         2018-04-03
Row-based storage is the simplest form of data table and is used in many
applications, from web log files to highly structured database systems like
MySQL and Oracle.
In a database, this data would be stored by row, as follows:
Emma,Prod1,100.00,2018-04-02;Liam,Prod2,79.99,2018-04-02;Noah,Prod3,19.99,2018-04-01;Olivia,Prod2,79.99,2018-04-03
Column-based data formats, as you might imagine, store data by column.
Using our transaction data as an example, in a columnar database this data
would be stored as follows:
Emma,Liam,Noah,Olivia;Prod1,Prod2,Prod3,Prod2;100.00,79.99,19.99,79.99;2018-04-02,2018-04-02,2018-04-01,2018-04-03
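To make the two layouts concrete, here is a minimal Python sketch that holds the same four transactions from the table above, first row-wise and then column-wise (plain in-memory structures, not a real storage engine):

# Row-based layout: each record is kept together, one after another.
rows = [
    {"customer": "Emma",   "product": "Prod1", "amount": 100.00, "date": "2018-04-02"},
    {"customer": "Liam",   "product": "Prod2", "amount": 79.99,  "date": "2018-04-02"},
    {"customer": "Noah",   "product": "Prod3", "amount": 19.99,  "date": "2018-04-01"},
    {"customer": "Olivia", "product": "Prod2", "amount": 79.99,  "date": "2018-04-03"},
]

# Column-based layout: all values of one column are kept together.
columns = {
    "customer": ["Emma", "Liam", "Noah", "Olivia"],
    "product":  ["Prod1", "Prod2", "Prod3", "Prod2"],
    "amount":   [100.00, 79.99, 19.99, 79.99],
    "date":     ["2018-04-02", "2018-04-02", "2018-04-01", "2018-04-03"],
}

# An analytics query such as "total sales" needs only the amount column,
# so the columnar layout touches a single list instead of every record.
total_sales = sum(columns["amount"])

# A query that needs the whole record (e.g. one customer's transaction)
# is more natural against the row layout.
liam = next(r for r in rows if r["customer"] == "Liam")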
Evaluation Framework: Schema Evolution
When we talk about “schema” in a database context, we are really talking
about its organization—the tables, columns, views, primary keys, relationships,
etc. When we talk about schemas in the context of an individual dataset or data
file, it’s helpful to simplify schema further to the individual attribute level
(column headers in the simplest use case). The schema will store the definition
of each attribute and its type. Unless your data is guaranteed to never change,
you’ll need to think about schema evolution, or how your data schema changes
over time. How will your file format manage fields that are added or deleted?
One of the most important considerations when selecting a data format is
how it manages schema evolution. When evaluating schema evolution specifically,
there are a few key questions to ask of any data format:
- How easy is it to update a schema (such as adding a field, removing or renaming a field)?
- How will different versions of the schema “talk” to each other?
- Is it human-readable? Does it need to be?
- How fast can the schema be processed?
- How does it impact the size of data?
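To make the first two questions concrete, here is a minimal sketch using the fastavro Python library (the record and field names are illustrative only). It writes data with an old schema and reads it back with a newer schema that adds a field with a default value:

import io
from fastavro import writer, reader, parse_schema

# Version 1 of the schema: a transaction has a customer and an amount.
schema_v1 = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

# Version 2 adds a "currency" field with a default value.
schema_v2 = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
})

# Write with the old schema...
buf = io.BytesIO()
writer(buf, schema_v1, [{"customer": "Emma", "amount": 100.00}])

# ...and read with the new schema: the missing field is filled from its default.
buf.seek(0)
for record in reader(buf, reader_schema=schema_v2):
    print(record)  # {'customer': 'Emma', 'amount': 100.0, 'currency': 'USD'}

Because the reader resolves the old data against the new schema, the two versions can “talk” to each other without any if-else logic in the application.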
Evaluation Framework: Splittability
Datasets are commonly composed of hundreds to thousands of files, each
of which may contain thousands to millions of records or more. Furthermore,
these file-based chunks of data are often being generated continuously.
Processing such datasets efficiently usually requires breaking the job up into
parts that can be farmed out to separate processors. In fact, large-scale
parallelization of processing is key to performance. Your choice of file format
can critically affect the ease with which this parallelization can be
implemented.
Row-based formats, such as Avro, can be split along row boundaries, as
long as the processing can proceed with one record at a time. If groups of
records related by some particular column value are required for processing,
out-of-the-box partitioning may be more challenging for row-based data stored
in random order.
A column-based format will be more amenable to splitting into separate
jobs if the query calculation is concerned with a single column at a time. The
columnar formats we discuss in this paper are row-columnar, which means they
take a batch of rows and store that batch in columnar format. These batches
then become split boundaries.
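As a rough illustration of row groups as split boundaries, the following sketch (assuming the pyarrow library; transactions.parquet is a hypothetical file) processes a Parquet file one row group at a time. Each iteration of the loop is an independent unit of work that could just as easily be farmed out to a separate worker:

import pyarrow.parquet as pq

pf = pq.ParquetFile("transactions.parquet")  # hypothetical file

# Each row group is an independent, columnar batch of rows.
for i in range(pf.num_row_groups):
    batch = pf.read_row_group(i, columns=["amount"])  # read only the column we need
    print(i, batch.num_rows, sum(batch.column("amount").to_pylist()))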
Evaluation Framework: Compression
Data compression reduces the amount of information needed for the
storage or transmission of a given set of data. It reduces the resources required
to store and transmit data, typically saving time and money. Compression achieves this by encoding frequently repeating data more compactly, usually at the source before the data is stored and/or transmitted. In its simplest sense, any reduction in the size of a data file can be called data compression.
Columnar data can achieve better compression rates than row-based data.
Storing values by column, with the same type next to each other, allows you to
do more efficient compression on them than if you’re storing rows of data. For
example, storing all dates together in memory allows for more efficient
compression than storing data of various types next to each other—such as
string, number, date, string, date.
While compression may save on storage costs, it is important to also
consider compute costs and resources. Chances are, at some point you will want
to decompress that data for use in another application. Decompression is not
free—it incurs compute costs. If and how you compress the data will be a
function of how you want to optimize the compute costs vs. storage costs for
your given use case.
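As a quick way to see that trade-off on your own data, the sketch below (again assuming pyarrow; the table and file names are made up) writes the same table with several codecs and prints the resulting file sizes. Heavier codecs generally produce smaller files at the cost of more CPU time to compress and decompress:

import os
import pyarrow as pa
import pyarrow.parquet as pq

# A small, repetitive table; real gains show up on much larger data,
# but the mechanics are the same.
table = pa.table({
    "customer": ["Emma", "Liam", "Noah", "Olivia"] * 10_000,
    "amount":   [100.00, 79.99, 19.99, 79.99] * 10_000,
    "date":     ["2018-04-02", "2018-04-02", "2018-04-01", "2018-04-03"] * 10_000,
})

# Write the same data with different codecs and compare file sizes.
for codec in ["none", "snappy", "gzip", "zstd"]:
    path = f"transactions_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")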
APACHE AVRO: A ROW-BASED FORMAT
Apache Avro was released by the Hadoop working group in 2009. It is a
row-based format that is highly splittable. The key feature of Avro is
that the schema travels with the data. The data definition is stored in JSON
format while the data is stored in binary format, minimizing file size and
maximizing efficiency. Avro features robust support for schema evolution by
managing added fields, missing fields, and fields that have changed. This
allows old software to read the new data and new software to read the old
data—a critical feature if your data has the potential to change.
We understand this intuitively—as soon as you’ve finished what you’re
sure is the master schema to end all schemas, someone will come up with a new
use case and request to add a field. This is especially true for big,
distributed systems in large corporations. With Avro’s capacity to manage
schema evolution, it’s possible to update components independently, at
different times, with low risk of incompatibility. This saves applications from
having to write if-else statements to process different schema versions, and
saves the developer from having to look at old code to understand old schemas.
Because the schema is stored in a human-readable JSON header,
it's easy to see all the fields that you have available.
Avro can support many different programming languages. Because the
schema is stored in JSON while the data is in binary, Avro is a relatively
compact option for both persistent data storage and wire transfer. Avro is
typically the format of choice for write-heavy workloads, since it is easy to
append new rows.
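As a minimal illustration of the schema travelling with the data, the sketch below (using the fastavro Python library; the file and field names are illustrative) writes a few records and then reads them back without supplying a schema, because the reader recovers it from the file header:

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "product", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "date", "type": "string"},
    ],
})

records = [
    {"customer": "Emma", "product": "Prod1", "amount": 100.00, "date": "2018-04-02"},
    {"customer": "Liam", "product": "Prod2", "amount": 79.99,  "date": "2018-04-02"},
]

# The schema is written into the file header; the records follow in binary.
with open("transactions.avro", "wb") as out:
    writer(out, schema, records)

# A reader needs no schema up front: it is recovered from the file itself.
with open("transactions.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)  # the JSON schema stored in the header
    for record in avro_reader:
        print(record)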
APACHE PARQUET: A COLUMN-BASED FORMAT
Parquet was developed by Cloudera and Twitter (and inspired by Google’s
Dremel query system) to serve as an optimized columnar data store on Hadoop.
Because data is stored by columns, it can be highly compressed and splittable
(for the reasons noted above). Parquet is commonly used with Apache Impala, an
analytics database for Hadoop. Impala is designed for low latency and high
concurrency queries on Hadoop.
The column metadata for a Parquet file is stored at the end of the
file, which allows for fast, one-pass writing. Metadata can include information
such as data types, the compression/encoding scheme used (if any), statistics,
element names, and more.
Parquet is especially adept at analyzing wide datasets with many
columns. Each Parquet file contains binary data organized by “row group.” For
each row group, the data values are organized by column. This enables the
compression benefits that we described above. Parquet is a good choice for
read-heavy workloads.
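For example, with pyarrow a read-heavy analytics query can load just the columns it needs from a (hypothetical) Parquet file and ignore the rest of a wide table entirely:

import pyarrow.parquet as pq

# Read only two columns out of a potentially very wide table.
table = pq.read_table("transactions.parquet", columns=["date", "amount"])

# e.g. total sales, computed from the columns we actually loaded
print(table.num_rows, sum(table.column("amount").to_pylist()))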
Schema evolution is supported in the Parquet format and is generally not an issue. However, not all systems that prefer Parquet support schema evolution optimally. For example, consider a columnar store like Impala: it is hard for that data store to support schema evolution, because the database needs to hold two versions of the schema (old and new) for a table.
APACHE ORC: A ROW-COLUMNAR FORMAT
Optimized Row Columnar (ORC) format was first developed at Hortonworks
to optimize storage and performance in Hive, a data warehouse for
summarization, query and analysis that lives on top of Hadoop. Hive is designed
for queries and analysis, and uses the query language HiveQL (similar to SQL).
ORC files are designed for high performance when Hive is reading,
writing, and processing data. ORC stores row data in columnar format. This
row-columnar format is highly efficient for compression and storage. It allows
for parallel processing across a cluster, and the columnar format allows for
skipping of unneeded columns for faster processing and decompression. ORC files
can store data more efficiently without compression than compressed text files.
Like Parquet, ORC is a good option for read-heavy workloads.
This advanced level of compression is possible because of its index system. ORC files contain groups of row data called “stripes”, along with a lightweight index recorded every 10,000 rows. The stripes are the data building blocks and are independent of each other, which means queries can skip directly to the stripes they need. Within each stripe, the reader can focus only on the columns required. The file footer includes descriptive statistics for each column within a stripe, such as count, sum, min, max, and whether null values are present.
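A rough sketch with pyarrow's ORC module (assuming a recent pyarrow build with ORC support; the file name is made up) shows stripes and column selection in action:

import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "customer": ["Emma", "Liam", "Noah", "Olivia"],
    "amount":   [100.00, 79.99, 19.99, 79.99],
})
orc.write_table(table, "transactions.orc")

# An ORC file is divided into independent stripes; a reader can jump to a
# single stripe and pull out only the columns it needs.
f = orc.ORCFile("transactions.orc")
print(f.nstripes)                              # number of stripes in the file
stripe = f.read_stripe(0, columns=["amount"])  # one stripe, one column
print(stripe.column(0).to_pylist())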
ORC is designed to maximize storage and query efficiency. According to
the Apache Foundation, “Facebook uses ORC to save tens of petabytes in their
data warehouse and demonstrated that ORC is significantly faster than RC File
or Parquet.”
Similar to Parquet, schema evolution is supported by the ORC file
format, but its efficacy is dependent on what the data store supports. Recent
advances have been made in Hive that allow for appending columns, type
conversion, and name mapping.
Comparison between Avro, ORC & Parquet Files

Format    Storage layout   Best suited for                Schema evolution
Avro      Row-based        Write-heavy workloads          Robust; the schema travels with the data
ORC       Row-columnar     Read-heavy workloads (Hive)    Supported; depends on the data store
Parquet   Columnar         Read-heavy, wide analytics     Supported; depends on the data store