File Storage: The Digital Home of Data 🏠

Storing data in files is a core element of any computing system. In the past, we stored our data on local hard disks or USB drives; today, the "central storage location" for most organizations is the cloud. Cloud storage provides cost-effective, secure, and reliable storage for large volumes of data.

As architects, we ask three critical questions when choosing a file format:

What type of data is it (structured, semi-structured, or unstructured)?
Which applications and services will read, write, and process the data?
Should the data files be readable by humans, or optimized for efficient storage and processing?

📝 1. The Language Everyone Speaks: Common Formats

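Most of these formats are "human-readable": you can open them in a text editor and make sense of the contents. The exception is the BLOB, which closes the list.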

Delimited Text Files (CSV/TSV): This is the common language of the data world. Fields are separated by commas (CSV) or tabs (TSV), and each row ends with a newline. It is a good choice when the data must be readable by a wide range of applications.
JSON (JavaScript Object Notation): A ubiquitous format for modern systems. It uses a hierarchical document schema to define objects with multiple attributes. It is very flexible for both structured and semi-structured data.
XML (Extensible Markup Language): A popular format from the 90s and 2000s. It defines elements and attributes using tags (<../>). It is more verbose than JSON but still used by some systems.
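BLOB (Binary Large Object): All files are ultimately stored as binary data (1s and 0s). In the human-readable formats above, those bytes map to text characters. We use the term BLOB for raw binary data such as images, video, audio, and application-specific documents, which must be interpreted by the application that reads them.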
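
To make the difference between the three text formats concrete, here is a minimal sketch that writes the same two records as CSV, JSON, and XML using only the Python standard library. The customer records and the helper names (to_csv, to_json, to_xml) are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records, invented for this illustration.
customers = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Grace", "city": "New York"},
]


def to_csv(rows):
    """Delimited text: one header line, then one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["id", "name", "city"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """JSON: a hierarchical document holding a list of objects."""
    return json.dumps({"customers": rows}, indent=2)


def to_xml(rows):
    """XML: one tagged element per record, with fields as attributes."""
    root = ET.Element("customers")
    for row in rows:
        # XML attribute values must be strings.
        ET.SubElement(root, "customer", {k: str(v) for k, v in row.items()})
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(customers)
json_text = to_json(customers)
xml_text = to_xml(customers)

print(csv_text)
print(json_text)
print(xml_text)
```

Printing the three strings side by side shows the trade-off: CSV is the most compact, JSON carries the field names with every object, and XML wraps each record in tags.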

🚀 2. The Architect's Secret Weapons: Optimized Formats

In the Big Data world, human readability is traded for performance and compression. These formats use compression, indexing, and specialized storage layouts so that applications read less data and process it faster.

🛰️ Avro (Row-based)

Created by Apache, this is a row-based format.

The structure of the records is described by a header stored as JSON. In an Avro data file, this schema is written once in the file header.
The data itself is stored as binary information, which minimizes storage and network bandwidth needs.
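
The idea of a JSON description followed by compact binary rows can be sketched with the standard library alone. This is a toy layout, not the real Avro encoding: the schema, the length prefix, and the fixed-width record format are all assumptions made for the illustration. A real project would use an Avro library instead.

```python
import io
import json
import struct

# Hypothetical schema, invented for this illustration.
SCHEMA = {
    "name": "Reading",
    "fields": [
        {"name": "sensor_id", "type": "int"},
        {"name": "temperature", "type": "double"},
    ],
}

# Toy fixed-width row: little-endian 4-byte int + 8-byte double (12 bytes).
RECORD_FORMAT = "<id"


def write_container(records):
    """Write a JSON header once, then append each record as binary."""
    header = json.dumps(SCHEMA).encode("utf-8")
    out = io.BytesIO()
    out.write(struct.pack("<I", len(header)))  # length prefix for the header
    out.write(header)
    for sensor_id, temperature in records:
        out.write(struct.pack(RECORD_FORMAT, sensor_id, temperature))
    return out.getvalue()


def read_container(blob):
    """Recover the schema from the header, then decode the rows with it."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    names = [field["name"] for field in schema["fields"]]
    body = blob[4 + header_len:]
    rows = [
        dict(zip(names, values))
        for values in struct.iter_unpack(RECORD_FORMAT, body)
    ]
    return schema, rows


blob = write_container([(1, 21.5), (2, 19.0)])
schema, rows = read_container(blob)
print(rows)
```

Because each row is appended as a self-contained unit, a writer can keep adding records as they arrive, which is why row-based layouts suit write-heavy workloads.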

🏛️ ORC (Optimized Row Columnar)

Developed by Hortonworks to optimize read and write operations in Apache Hive.

It organizes data into columns rather than rows.
An ORC file contains stripes of data. Each stripe includes an index, the data for each row, and a footer with statistical information (count, sum, max, min).
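
Why do those statistics matter? A reader can check them first and skip any stripe that cannot contain matching values. The sketch below is a toy model of that idea in plain Python, not the ORC file layout; the Stripe class, the stripe size of three rows, and the order data are assumptions made for the illustration.

```python
from dataclasses import dataclass

STRIPE_SIZE = 3  # hypothetical; real stripes hold far more rows


@dataclass
class Stripe:
    columns: dict  # column name -> list of values (columnar layout)
    stats: dict    # column name -> {"count", "sum", "min", "max"}


def build_stripes(rows, stripe_size=STRIPE_SIZE):
    """Group rows into stripes, pivot them to columns, and record statistics."""
    stripes = []
    for start in range(0, len(rows), stripe_size):
        chunk = rows[start:start + stripe_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {
            name: {
                "count": len(values),
                "sum": sum(values),
                "min": min(values),
                "max": max(values),
            }
            for name, values in columns.items()
        }
        stripes.append(Stripe(columns, stats))
    return stripes


def amounts_above(stripes, threshold):
    """Return (matching amounts, number of stripes actually scanned)."""
    hits, scanned = [], 0
    for stripe in stripes:
        if stripe.stats["amount"]["max"] <= threshold:
            continue  # the statistics prove no value here can match
        scanned += 1
        hits.extend(v for v in stripe.columns["amount"] if v > threshold)
    return hits, scanned


# Hypothetical order data, invented for this illustration.
amounts = [10, 20, 15, 500, 30, 25, 12, 18, 11]
orders = [{"order_id": i, "amount": a} for i, a in enumerate(amounts, start=1)]

stripes = build_stripes(orders)
hits, scanned = amounts_above(stripes, 100)
print(hits, scanned, len(stripes))
```

In this sample only one of the three stripes has a maximum above the threshold, so the other two are never scanned.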

💎 Parquet (Columnar)

Created by Cloudera and Twitter (now X), this is another powerful columnar data format.

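A Parquet file contains row groups, and within each row group the data for each column is stored together.
Its biggest strength is its metadata, which describes the rows found in each chunk. An application can use this metadata to quickly find the correct chunk of data and read only the specified columns.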
It is very efficient at storing and processing nested data types.
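
The row group and column chunk layout can be sketched in a few lines of plain Python. This is a toy in-memory model, not the Parquet file format; the function names (write_table, read_columns), the group size of two rows, and the sales data are assumptions made for the illustration.

```python
def write_table(rows, rows_per_group=2):
    """Split rows into row groups; each group keeps one chunk per column."""
    row_groups = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        row_groups.append(
            {name: [row[name] for row in group] for name in group[0]}
        )
    # Metadata tells a reader what the file holds without scanning the data.
    metadata = {
        "columns": list(rows[0]),
        "num_rows": len(rows),
        "row_group_sizes": [len(next(iter(g.values()))) for g in row_groups],
    }
    return {"metadata": metadata, "row_groups": row_groups}


def read_columns(table, wanted):
    """Read only the requested columns; other chunks are never touched."""
    missing = [c for c in wanted if c not in table["metadata"]["columns"]]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    result = {c: [] for c in wanted}
    for group in table["row_groups"]:
        for c in wanted:
            result[c].extend(group[c])
    return result


# Hypothetical sales data, invented for this illustration.
sales = [
    {"region": "EU", "product": "A", "amount": 120},
    {"region": "EU", "product": "B", "amount": 80},
    {"region": "US", "product": "A", "amount": 200},
    {"region": "US", "product": "C", "amount": 50},
    {"region": "APAC", "product": "B", "amount": 95},
]

table = write_table(sales)
projection = read_columns(table, ["amount"])
print(table["metadata"])
print(sum(projection["amount"]))
```

A report that only totals the amount column reads one chunk per row group and ignores region and product entirely, which is the main reason columnar formats suit analytical queries.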

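Important Note: Avro is usually the better fit when you are writing records from a continuous stream, because each record is appended as a complete row. Parquet is usually the better fit for complex analysis and reporting on massive datasets, where queries read a few columns across many rows.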
