Archiving small files
The Hadoop Archive's data format is called har, with the following layout:
foo.har/_masterindex //stores hashes and offsets
foo.har/_index //stores file statuses
foo.har/part-[1..n] //stores actual file data
The file data is stored in multiple part files, which are indexed for keeping the original separation of data intact. Moreover, the part files can be accessed by MapReduce programs in parallel. The index files also record the original directory tree structures and the file statuses. In Figure 1, a directory containing many small files is archived into a directory with large files and indexes.
HarFileSystem – A first-class FileSystem providing transparent access
Most archival systems, such as tar, are tools for archiving and de-archiving. Generally, they do not fit into the actual file system layer and hence are not transparent to the application writer in that the user had to de-archive the archive before use.
Hadoop Archive is integrated in the Hadoop’s FileSystem interface. The HarFileSystemimplements the FileSystem interface and provides access via the
har://
scheme. This exposes the archived files and directory tree structures transparently to the users. Files in a har can be accessed directly without expanding it. For example, we have the following command to copy a HDFS file to a local directory:hadoop fs –get hdfs://namenode/foo/file-1 localdir
Suppose an archive
bar.har
is created from the foo directory. Then, the command to copy the original file becomeshadoop fs –get har://namenode/bar.har#foo/file-1 localdir
Users only have to change the URI paths. Alternatively, users may choose to create a symbolic link (from
hdfs://namenode/foo
to har://namenode/bar.har#foo
in the example above), then even the URIs do not need to be changed. In either case,HarFileSystem
will be invoked automatically for providing access to the files in the har. Because of this transparent layer, har is compatible with the Hadoop APIs, MapReduce, the shell command -ine interface, and higher-level applications like Pig, Zebra, Streaming, Pipes, and DistCp.
0 comments: