What are the main features of Apache Spark?

Performance: Spark's key feature is performance. It can run programs up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk, mainly by reducing the number of disk reads and writes. RDDs keep data in memory transparently and persist it to disk when needed.

Ease of Use: Spark provides APIs in Java, Python, R, and Scala, which makes it easier to develop applications for it. It also comes with a built-in set of over 80 high-level operators.

Integration with Hadoop and Existing Hadoop Data: Spark can run independently of Hadoop, and it can also run on Hadoop 2’s YARN cluster manager. It can read data from HBase and HDFS.

Support for Advanced Analytics: Spark supports sophisticated analytics by seamlessly combining SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box.

Integrated Solution: In Spark we can create an integrated solution that combines the power of SQL, Streaming and data analytics.

Run Everywhere: Apache Spark can run on many platforms: on Hadoop, on Mesos, in the cloud, or standalone. It can also connect to many data sources, such as HDFS, Cassandra, HBase, and S3.

Stream Processing: Apache Spark also supports real-time stream processing. Spark’s lightweight yet powerful APIs make it easy to build streaming applications with Spark Streaming. Spark Streaming recovers lost work and delivers exactly-once semantics out of the box, with no extra code or settings. Streaming data can be integrated seamlessly with historical data by reusing the same code for batch and stream processing, which makes it possible to build real-time analytics solutions.

What is a Resilient Distributed Dataset (RDD) in Apache Spark?

A Resilient Distributed Dataset (RDD) is an immutable, fault-tolerant collection of records that is split into partitions and spread across the nodes of a cluster. An RDD hides the data partitioning and distribution behind the scenes.

Main features of RDD:

Distributed: Data in an RDD is distributed across multiple nodes.

Resilient: An RDD is a fault-tolerant dataset. If a node fails, Spark can recompute the lost data from the RDD's lineage (the recorded sequence of transformations that produced it).

Immutable: Data in an RDD cannot be modified after creation. Instead, applying a Transformation produces a new RDD.

Dataset: It is a collection of data similar to collections in Scala.
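
The features above can be illustrated with a small toy model in plain Python. This is only a sketch of the ideas, not Spark's API: the class ToyRDD and its methods partition and lose_partition are hypothetical names invented for this example, and the round-robin partitioning is an assumption made for simplicity.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


class ToyRDD:
    """Hypothetical toy stand-in for an RDD (not part of Spark)."""

    def __init__(self, source: Iterable[int], num_partitions: int,
                 fn: Optional[Callable[[int], int]] = None):
        self._source = tuple(source)          # input records, never mutated
        self._num_partitions = num_partitions
        self._fn = fn                         # lineage: how each record is derived
        self._materialized: Dict[int, Tuple[int, ...]] = {}

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Immutable: return a new ToyRDD instead of changing this one.
        prev = self._fn
        composed = fn if prev is None else (lambda x: fn(prev(x)))
        return ToyRDD(self._source, self._num_partitions, composed)

    def partition(self, index: int) -> Tuple[int, ...]:
        # Distributed: records are split round-robin across partitions.
        # Resilient: a missing partition is recomputed from the lineage.
        if index not in self._materialized:
            records = self._source[index::self._num_partitions]
            self._materialized[index] = tuple(
                self._fn(x) if self._fn else x for x in records)
        return self._materialized[index]

    def lose_partition(self, index: int) -> None:
        # Simulate a node failure by discarding the computed partition.
        self._materialized.pop(index, None)


base = ToyRDD([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = base.map(lambda x: x * 2)
first = doubled.partition(0)        # computed from records 1 and 4
doubled.lose_partition(0)           # simulated failure
recovered = doubled.partition(0)    # rebuilt from the source and the lineage
print(first, recovered, base.partition(0))
```

In real Spark the partitions live on different machines and the lineage is tracked by the engine; in PySpark the corresponding calls would be sc.parallelize to create an RDD and rdd.map to derive a new one.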


What is a Transformation in Apache Spark?

A Transformation in Apache Spark is a lazy operation (it is not executed immediately) that can be applied to an RDD. A Transformation function produces a new Resilient Distributed Dataset and does not change the input RDD. Transformations are executed only once we call an action. We can also chain Transformations into a pipeline to create a data flow.
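
Lazy evaluation can be sketched in plain Python as follows. This is a toy model under stated assumptions, not Spark's implementation: the class LazyDataset is a hypothetical name, its map and filter methods only record the steps, and collect plays the role of an action that finally runs the pipeline.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Hypothetical toy model of lazy transformations (not part of Spark)."""

    def __init__(self, data: Iterable[Any],
                 steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._data = tuple(data)   # input records, never changed
        self._steps = steps        # recorded transformations, not yet run

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: only now are the recorded steps executed, in order.
        out = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)   # record every invocation to observe when work happens
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
pipeline = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # square has not been called yet
result = pipeline.collect()        # the action triggers the whole pipeline
print(calls_before_action, result)
```

The same shape applies in PySpark, where rdd.map and rdd.filter are transformations and rdd.collect is an action; the input dataset (numbers here) is left unchanged by the pipeline.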