Hadoop: An Architecture for Large Datasets

Post by **bitheerani319** » Tue Jan 28, 2025 4:48 am

By Pedro César Tebaldi Gomes, April 18, 2024

The exponential increase in the volume of data created in the 21st century has triggered a radical change in the way we store, process and use information.

Traditional data processing systems, often built on monolithic and centralized architectures, face increasing challenges in dealing with the volume, velocity and variety of data in this era of Big Data.

According to IDC's “Data Age 2025” report, the forecast for 2025 is that each individual will generate around 5.3 GB of data per day. In the business context, the volume is even greater. Every day, the world generates around 2.5 quintillion bytes of data. The intriguing fact about this is that 90% of the data available today was generated in the last 3 years.

The challenge of handling large datasets efficiently and scalably is what led to the development of technologies like Hadoop.

Hadoop's origins date back to the early 2000s, when Doug Cutting and Mike Cafarella were working on the open-source Nutch project. Nutch was a web search engine designed to crawl and search billions of pages on the Internet, a goal that meant handling volumes of data beyond the capabilities of existing solutions.

Around the same time, Google published two groundbreaking papers on its internal technologies: the Google File System (GFS) in 2003 and MapReduce in 2004. These technologies addressed the very problems the Nutch team was facing. Inspired by these papers, Cutting and Cafarella decided to implement similar solutions in Nutch.

In 2006, they split this part of the code out of Nutch and named it Hadoop, after a toy elephant belonging to Cutting's son. The Apache Software Foundation adopted Hadoop, and in 2008 it became a top-level Apache project and one of the foundation's flagships.

The objective of this article is to give an overview of this free software architecture, which is oriented towards Big Data and used by large companies such as Facebook, Amazon, Netflix, Uber and Google itself.