Introduction to Big Data
What is Big Data?
"Any data that is large in volume(in GBs, TBs, etc.), is generated with fast velocity and can be of multiple varieties (log files, audio files, mp4, etc.) is known as Big Data". Volume, Velocity, and Variety are also known as 3V characters of Big Data.
Why Big Data?
Now, what is so special about volume, velocity, and variety in Big Data?
They matter because traditional monolithic systems cannot handle the storage and processing of Big Data. A monolithic system can only scale vertically (by increasing the resources of a single machine), and vertical scaling has limits: beyond a point, doubling the resources of a single system does not double its performance.
What is the solution?
We need to distribute both the storage and the processing of Big Data across multiple nodes (computer systems) working together as a cluster. When we need more capacity or performance, we simply add more nodes to the cluster (horizontal scaling), and performance grows roughly in proportion to cluster size, unlike a monolithic system.
Hadoop is one such framework: it provides a collection of tools and technologies to solve Big Data problems, with HDFS for distributed storage and MapReduce for distributed computing.
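To make the distributed-storage idea concrete, here is a minimal sketch, in Python, of what HDFS does conceptually: a file is split into fixed-size blocks, and each block is replicated on several nodes so the cluster tolerates node failures. This is an illustration only, not the real HDFS API; the block size, replication factor, node names, and round-robin placement below are simplified assumptions (real HDFS uses 128 MB blocks, a default replication factor of 3, and rack-aware placement).

```python
# Conceptual sketch of HDFS-style distributed storage (NOT the real HDFS API).
# A file is split into fixed-size blocks, and each block is copied to
# several nodes so the data survives individual node failures.

BLOCK_SIZE = 4           # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 3          # copies of each block (HDFS default replication factor)
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i in range(len(blocks)):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
placement = place_blocks(blocks, NODES)
for idx, replicas in placement.items():
    print(f"block {idx} ({blocks[idx]!r}) -> {replicas}")
```

Adding a node to `NODES` immediately gives the placement step more machines to spread blocks over, which is the horizontal scaling described above.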
About Hadoop and its History
Hadoop is a framework that provides a collection of tools and technologies for solving Big Data problems. The Hadoop ecosystem comprises other technologies that work on top of Hadoop and make Big Data jobs easier by providing prewritten Java code for specific use cases that can run directly on Hadoop.
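The MapReduce model mentioned above can be illustrated with the classic word-count example. A real Hadoop job would be written in Java against the Hadoop API; the sketch below is only a single-machine Python simulation of the three phases (map, shuffle, reduce) that the framework runs across many nodes, with the input lines made up for illustration.

```python
# Conceptual word count in the MapReduce style, simulated on one machine.
# In real Hadoop, mappers and reducers run in parallel on different nodes
# and the framework performs the shuffle between them.

from collections import defaultdict

def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

Because each line is mapped independently and each word is reduced independently, both phases can be spread over as many nodes as the cluster has, which is what makes the model scale.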
In 2003 Google published a whitepaper on GFS (the Google File System), introducing the idea of distributed storage; in 2004 they published the MapReduce paper, introducing the idea of distributed computing. In 2006, Hadoop was launched at Yahoo, inspired by those two papers: its storage layer, modeled on GFS, was named HDFS, and the core components of Hadoop were HDFS and MapReduce.
Hadoop later became an Apache project, and Hadoop 1.0 was released around 2011-2012. In Hadoop 1, MapReduce (the compute engine) also did the work of resource management, which caused a significant performance hit, so in 2013 Hadoop 2 was released with a new component, YARN. Hadoop 2 therefore comprises three core components: HDFS (for distributed storage), MapReduce (for distributed computing), and YARN (for resource management).