Class SequenceFile

java.lang.Object
org.apache.hadoop.io.SequenceFile

@Public @Stable public class SequenceFile extends Object
SequenceFiles are flat files consisting of binary key/value pairs.

SequenceFile provides SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs:
  1. Writer : Uncompressed records.
  2. RecordCompressWriter : Record-compressed files, only compress values.
  3. BlockCompressWriter : Block-compressed files, both keys & values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

The actual compression algorithm used to compress key and/or values can be specified by using the appropriate CompressionCodec.

The recommended way is to use the static createWriter methods provided by the SequenceFile to chose the preferred format.

The SequenceFile.Reader acts as the bridge and can read any of the above SequenceFile formats.

SequenceFile Formats

Essentially there are 3 different formats for SequenceFiles depending on the CompressionType specified. All of them share a common header described below.

  • version - 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
  • keyClassName -key class
  • valueClassName - value class
  • compression - A boolean which specifies if compression is turned on for keys/values in this file.
  • blockCompression - A boolean which specifies if block-compression is turned on for keys/values in this file.
  • compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
  • metadata - SequenceFile.Metadata for this file.
  • sync - A sync marker to denote end of the header.

Uncompressed SequenceFile Format

  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync-marker every few 100 kilobytes or so.

Record-Compressed SequenceFile Format

  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Compressed Value
  • A sync-marker every few 100 kilobytes or so.
Block-Compressed SequenceFile Format
  • Header
  • Record Block
    • Uncompressed number of records in the block
    • Compressed key-lengths block-size
    • Compressed key-lengths block
    • Compressed keys block-size
    • Compressed keys block
    • Compressed value-lengths block-size
    • Compressed value-lengths block
    • Compressed values block-size
    • Compressed values block
  • A sync-marker every block.

The compressed blocks of key lengths and value lengths consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.

See Also: