Top 5 mistakes to avoid when writing Apache Spark applications
Spark is one of the big data engines trending in recent times, in large part because of its ability to process real-time streaming data. Its advantages over traditional MapReduce are:

- It is faster than MapReduce, since it keeps intermediate results in memory instead of writing them to disk between stages.
- It is well equipped with machine learning abilities through its built-in MLlib library.
- It supports multiple programming languages, including Scala, Java, Python, and R.

However, in spite of all these advantages over Hadoop, applications often get stuck in situations that arise from inefficiently written code. The most common situations and their solutions are discussed below, each with a short code sketch at the end of this section:

1. Always try to use reduceByKey instead of groupByKey.
2. Prefer treeReduce over reduce for large aggregations.
3. Keep the amount of data produced on the map side as small as possible.
4. Try not to shuffle more than necessary.
5. Keep away from data skew and poor partitioning.

Do not let the jobs slow down: when an application shuffles heavily, it takes much more time to run, around 4 long hours in the case described here. This makes...
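A minimal sketch of the first point, using a toy word count (the data and the local SparkSession are illustrative assumptions, not from the original article). groupByKey ships every raw pair across the network before counting, while reduceByKey combines values on each partition first:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ReduceVsGroup").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(word => (word, 1))

// groupByKey shuffles every single (word, 1) pair across the network and
// only sums the values after they reach the reducers.
val countsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey sums values locally on each partition first (a map-side
// combine), so far less data crosses the network during the shuffle.
val countsFast = pairs.reduceByKey(_ + _)

countsFast.collect().foreach(println)
```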
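For the second point, a sketch contrasting reduce with treeReduce (the dataset size and depth value are made up for illustration). reduce sends every partition's partial result straight to the driver; treeReduce merges partials on the executors in stages first:

```scala
// Reuses the SparkContext `sc` from the first sketch.
val nums = sc.parallelize(1L to 10000000L, numSlices = 200)

// reduce: all 200 partial sums travel directly to the driver, which can
// become a bottleneck as the partition count grows.
val total = nums.reduce(_ + _)

// treeReduce: partials are combined on the executors in a tree of stages,
// so the driver receives only a handful of values at the end.
val treeTotal = nums.treeReduce(_ + _, depth = 2)
```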
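For the third point, one common way to keep the map side small is to broadcast a small lookup table instead of shuffling both sides of a join. The two tables here are illustrative assumptions:

```scala
import org.apache.spark.sql.functions.broadcast
// Reuses the SparkSession `spark` from the first sketch.
import spark.implicits._

val orders = Seq((1, "laptop"), (2, "phone"), (1, "mouse")).toDF("user_id", "item")
val users  = Seq((1, "alice"), (2, "bob")).toDF("user_id", "name")

// broadcast() copies the small table to every executor once, so the large
// table is joined where it sits and never has to move across the cluster.
val joined = orders.join(broadcast(users), "user_id")
joined.show()
```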
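For the fourth point, one frequent source of avoidable shuffles is using repartition where coalesce would do. A sketch, with made-up partition counts:

```scala
// Reuses the SparkContext `sc` from the first sketch.
val wide = sc.parallelize(1 to 1000000, numSlices = 500)

// repartition(50) triggers a full shuffle of every record.
val reshuffled = wide.repartition(50)

// coalesce(50) merges existing partitions in place instead, cutting the
// partition count without a full shuffle.
val merged = wide.coalesce(50)
```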
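For the fifth point, a classic remedy for a skewed key is two-step salted aggregation; the toy DataFrame and the salt range of 10 are assumptions for illustration:

```scala
import org.apache.spark.sql.functions._
// Reuses the SparkSession `spark` from the first sketch.
import spark.implicits._

// A toy dataset where the key "hot" dominates and would land on one reducer.
val skewed = Seq.fill(1000)(("hot", 1)).toDF("key", "value")
  .union(Seq(("cold", 1), ("warm", 1)).toDF("key", "value"))

// Step 1: add a random salt so rows for "hot" spread over ~10 partitions,
// then aggregate per (key, salt).
val partial = skewed
  .withColumn("salt", (rand() * 10).cast("int"))
  .groupBy("key", "salt")
  .agg(sum("value").as("partial"))

// Step 2: drop the salt and combine the partial sums into final totals.
val totals = partial.groupBy("key").agg(sum("partial").as("total"))
totals.show()
```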
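On keeping shuffle-heavy jobs from slowing down, two settings worth checking are the shuffle partition count and adaptive query execution (available from Spark 3.0). The values below are illustrative, not recommendations from the article:

```scala
import org.apache.spark.sql.SparkSession

val tuned = SparkSession.builder
  .appName("ShuffleTuning")
  // The default of 200 shuffle partitions is often wrong for the data
  // volume; size it to the job rather than accepting the default.
  .config("spark.sql.shuffle.partitions", "400")
  // Let adaptive query execution coalesce small shuffle partitions at runtime.
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
```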