Hive Concepts (SerDe, Partitions, and Sort By vs Order By)

🗂️ Hive SerDe

SerDe is short for Serializer/Deserializer. Hive uses the SerDe 
interface for IO. The interface handles both serialization and 
deserialization, and it also interprets the serialized data as 
individual fields for processing.

A SerDe allows Hive to read data from a table and write it 
back out to HDFS in any custom format. Anyone can write their 
own SerDe for their own data formats.

Serialization

The process of converting an object in memory into bytes that can be stored in a file or transmitted over a network.

Deserialization

The process of converting those bytes back into an object in memory.
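
To make the two definitions concrete, here is a minimal Python sketch of a serialize/deserialize round trip. It is not Hive's SerDe interface (real SerDes are Java classes); the `Row` schema and the function names are hypothetical and chosen only for illustration. The Ctrl-A delimiter mirrors the default field separator Hive uses for plain text tables.

```python
from typing import Tuple

# Hypothetical schema for this sketch: (id, name, salary)
Row = Tuple[int, str, float]

# Ctrl-A, the default field delimiter for Hive text tables
FIELD_DELIM = "\x01"


def serialize(row: Row) -> bytes:
    """Turn an in-memory row into bytes that could be written to a file."""
    return FIELD_DELIM.join(str(field) for field in row).encode("utf-8")


def deserialize(data: bytes) -> Row:
    """Turn stored bytes back into a typed row, one field at a time."""
    id_text, name, salary_text = data.decode("utf-8").split(FIELD_DELIM)
    return int(id_text), name, float(salary_text)


row = (1, "alice", 5200.5)
stored = serialize(row)        # bytes on disk or on the wire
restored = deserialize(stored)  # back to an in-memory row
```

The deserialize step is where the "interpret as individual fields" part happens: the raw bytes carry no types, so the reader has to know the schema to rebuild the row.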

Built-in SerDes

  1. Avro
  2. ORC
  3. RegEx
  4. Thrift
  5. Parquet
  6. CSV
  7. JsonSerDe

📁 Hive Partitions

Static Partition

In static partitioning, we decide up front how many partitions the table will have and manually specify the value of each partition when loading data.

Dynamic Partition

Dynamic partitioning gives us flexibility: Hive creates the partitions automatically based on the values in the data we insert into the table.
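
The difference between the two modes can be sketched in a few lines of Python. This is only a conceptual model, not how Hive is implemented: the `sales` table, the `country` partition column, and both function names are assumptions made for the example. The `column=value` directory naming does follow the layout Hive uses for partitioned tables.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical schema for this sketch: (name, amount, country)
Row = Tuple[str, int, str]


def static_partition_path(table: str, country: str) -> str:
    """Static: the caller names the partition value explicitly."""
    return f"{table}/country={country}"


def dynamic_partition(table: str, rows: List[Row]) -> Dict[str, List[Tuple[str, int]]]:
    """Dynamic: the partition is derived from each row's country column."""
    partitions: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name, amount, country in rows:
        # The partition value lives in the path, so it is dropped from the row.
        partitions[f"{table}/country={country}"].append((name, amount))
    return dict(partitions)


rows = [("alice", 10, "US"), ("bob", 20, "IN"), ("carol", 30, "US")]

static_path = static_partition_path("sales", "US")  # we chose "US" ourselves
dynamic_layout = dynamic_partition("sales", rows)    # values come from the data
```

In the static case the number of partitions is whatever we choose to load; in the dynamic case it is however many distinct values show up in the partition column.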

✅ Hive Sort by vs Order by

Sort by

  1. Uses multiple reducers to produce the final output
  2. Only guarantees the ordering of rows within each reducer, not across the whole output

Order by

  1. Uses a single reducer to produce the final output, so the rows are totally ordered
  2. LIMIT can be used to reduce sort time
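
The contrast between the two clauses can be simulated in plain Python. This is a rough sketch rather than Hive's execution model: the round-robin assignment of rows to reducers is an assumption made for the example, and the function names are hypothetical.

```python
from typing import List


def sort_by(rows: List[int], num_reducers: int) -> List[List[int]]:
    """Each reducer sorts only the rows it receives."""
    buckets: List[List[int]] = [[] for _ in range(num_reducers)]
    for index, value in enumerate(rows):
        # Round-robin distribution is an assumption made for this sketch.
        buckets[index % num_reducers].append(value)
    return [sorted(bucket) for bucket in buckets]


def order_by(rows: List[int]) -> List[int]:
    """A single reducer sees every row, so the output is totally ordered."""
    return sorted(rows)


rows = [5, 1, 4, 2, 3, 6]

per_reducer = sort_by(rows, num_reducers=2)
# Reading the reducer outputs one after another gives the final result.
concatenated = [value for bucket in per_reducer for value in bucket]

total = order_by(rows)
```

Here `per_reducer` is `[[3, 4, 5], [1, 2, 6]]`: each list is sorted, but `concatenated` is `[3, 4, 5, 1, 2, 6]`, which is not globally sorted. `total` is `[1, 2, 3, 4, 5, 6]`, at the cost of pushing every row through one reducer.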