Posts

Showing posts from April, 2017

Top 5 mistakes to avoid when writing Apache Spark applications

Spark is one of the big data engines that has been trending in recent times. One of the main reasons is its ability to process real-time streaming data. Its advantages over traditional MapReduce are: it is faster than MapReduce; it is well equipped with machine learning capabilities; and it supports multiple programming languages. However, in spite of all these advantages over Hadoop, we often get stuck in situations that arise from inefficient application code. The situations and their solutions are discussed below: always try to use reduceByKey instead of groupByKey; prefer treeReduce over reduce; keep the map side as small as possible; try to shuffle as little as possible; and try to steer clear of skew and partitioning problems. Do not let the jobs slow down: when the application shuffles data, it takes much longer to run, around 4 hours. This makes th

Why is Apache Spark faster than MapReduce?

Why is Apache Spark getting all the attention in the Big Data space? Why is Apache Spark up to 100x faster than MapReduce, and how is that possible? Many people in this space ask these questions, and this blog post is my way of answering them. Why is Apache Spark getting attention in the Big Data space? For scenarios that require parallel processing and involve many interdependent tasks, Apache Spark's in-memory processing offers the best big data processing platform. Hence the attention. Why is Apache Spark faster than MapReduce? Data processing requires computing resources such as memory and storage. In Apache Spark, the data needed is loaded into memory as a Resilient Distributed Dataset (RDD) and processed in parallel by applying various transformations and actions to it. In some cases, the output RDD from one task is used as the input to another task, creating a lineage of interdependent RDDs. However, in traditional MapReduce,