Big Data developers interview questions

1、Differentiate between Spark Datasets, DataFrames and RDDs.

  • Representation of data: Spark Datasets combine the features of DataFrames and RDDs, adding static type safety and an object-oriented interface. Spark DataFrames are a distributed collection of data organized into named columns. Spark RDDs are a distributed collection of data without a schema.
  • Optimization: Datasets use the Catalyst optimizer for optimization. DataFrames also use the Catalyst optimizer. RDDs have no built-in optimization engine.
  • Schema projection: Datasets find the schema automatically using the SQL engine. DataFrames also find the schema automatically. In RDDs, the schema needs to be defined manually.
  • Aggregation speed: Dataset aggregation is faster than on RDDs but slower than on DataFrames. Aggregations are fastest in DataFrames thanks to their easy and powerful APIs. RDDs are slower than both DataFrames and Datasets, even for simple operations such as grouping.
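
To make the differences concrete, here is a minimal Scala sketch that builds all three abstractions from the same data; the Person case class, the sample values and the local master setting are assumptions made purely for illustration.

import org.apache.spark.sql.SparkSession

object AbstractionsDemo {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AbstractionsDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // encoders and the toDF()/toDS() conversions

    val people = Seq(Person("Asha", 31), Person("Ravi", 28))

    // RDD: distributed collection with no schema and no built-in optimizer.
    val rdd = spark.sparkContext.parallelize(people)

    // DataFrame: named columns with an inferred schema, optimized by Catalyst.
    val df = rdd.toDF()

    // Dataset: Catalyst optimization plus compile-time type safety over Person objects.
    val ds = people.toDS()

    df.printSchema()
    ds.filter(_.age > 30).show()

    spark.stop()
  }
}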

2、List the types of Deploy Modes in Spark.

There are two deploy modes in Spark (a small runtime check is sketched after the list):

  • Client Mode: The deploy mode is said to be client mode when the Spark driver runs on the machine from which the Spark job is submitted.
  1. The main disadvantage of this mode is that if that machine fails, the entire job fails.
  2. This mode supports both the interactive shell and the job-submission commands.
  3. Its performance is the worst of the two modes, so it is not preferred for production environments.
  • Cluster Mode: If the driver does not run on the machine from which the Spark job was submitted, the deploy mode is said to be cluster mode.
  1. The Spark job launches the driver inside the cluster as a sub-process of the ApplicationMaster.
  2. This mode supports deployment only through the spark-submit command (the interactive shell is not supported).
  3. Since the driver runs inside the ApplicationMaster, the driver program is re-instantiated if it fails.
  4. In this mode, a dedicated cluster manager (standalone, YARN, Apache Mesos, Kubernetes, etc.) allocates the resources required for the job to run.
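
As a hedged illustration, the deploy mode a job was submitted with can be read back from the Spark configuration at runtime. The spark.submit.deployMode property and the spark-submit flags shown in the comment are standard Spark; the application name is an arbitrary choice.

import org.apache.spark.sql.SparkSession

object DeployModeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DeployModeCheck")
      .getOrCreate()

    // Set by spark-submit, e.g.:
    //   spark-submit --master yarn --deploy-mode cluster DeployModeCheck.jar
    // Prints "client" or "cluster"; falls back to "client" if the property is absent.
    val mode = spark.sparkContext.getConf.get("spark.submit.deployMode", "client")
    println(s"Running with deploy mode: $mode")

    spark.stop()
  }
}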

3、Under what scenarios do you use Client and Cluster modes for deployment?

  • If the client machine is not close to the cluster, Cluster mode should be used for deployment. This avoids the network latency that Client mode would incur, since in Client mode the driver on the client machine has to communicate with the executors across the network. Also, in Client mode the entire job is lost if the client machine goes offline.
  • If the client machine is inside the cluster, Client mode can be used for deployment. Since the machine is inside the cluster there are no network-latency issues, and since the maintenance of the cluster is already handled, failures are less of a concern.

4、 Define Executor Memory in Spark

Spark applications run with a fixed core count and a fixed heap size defined for each executor. The heap size is the executor memory, controlled through the spark.executor.memory property, which corresponds to the --executor-memory flag of spark-submit. Every Spark application has one executor allocated on each worker node it runs on, and the executor memory measures how much of that worker node's memory the application uses.
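
A minimal sketch of the two equivalent ways to set these values; the 4g and 2-core figures are arbitrary illustrative choices, while spark.executor.memory, spark.executor.cores and the spark-submit flags in the comment are standard Spark settings.

import org.apache.spark.sql.SparkSession

object ExecutorMemoryDemo {
  def main(args: Array[String]): Unit = {
    // Programmatic form; equivalent to:
    //   spark-submit --executor-memory 4g --executor-cores 2 ... ExecutorMemoryDemo.jar
    val spark = SparkSession.builder()
      .appName("ExecutorMemoryDemo")
      .config("spark.executor.memory", "4g")   // heap size of each executor
      .config("spark.executor.cores", "2")     // fixed core count per executor
      .getOrCreate()

    println(spark.sparkContext.getConf.get("spark.executor.memory"))
    spark.stop()
  }
}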

5、Write a Spark program to check whether a given keyword exists in a huge text file.

from pyspark import SparkContext

def keywordExists(line):
    # 1 if the keyword occurs in this line, 0 otherwise
    if line.find("my_keyword") > -1:
        return 1
    return 0

sparkContext = SparkContext(appName="KeywordSearch")
lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)   # number of lines containing the keyword
print("Found" if total > 0 else "Not Found")

6、What can you say about Spark Datasets?

Spark Datasets are the data structures of Spark SQL that provide JVM objects with all the benefits of RDDs (such as data manipulation through lambda functions) together with the Spark SQL optimized execution engine. They were introduced in Spark 1.6.

  • Spark Datasets are strongly typed structures that represent structured queries along with their encoders.
  • They provide type safety for the data and also give an object-oriented programming interface.
  • Datasets are more structured and are evaluated lazily: a query is only executed when an action is triggered. They combine the strengths of both RDDs and DataFrames. Internally, each Dataset represents a logical plan that describes the computation required to produce the data. Once the logical plan is analyzed and resolved, a physical query plan is formed that performs the actual query execution.

Datasets have the following features:

  • Optimized Query feature: Spark Datasets provide optimized queries using the Tungsten and Catalyst Query Optimizer frameworks. The Catalyst Query Optimizer represents and manipulates a data-flow graph (a graph of expressions and relational operators), while Tungsten improves the execution speed of a Spark job by exploiting the hardware architecture of the Spark execution platform.
  • Compile-Time Analysis: Datasets allow syntax and analysis errors to be caught at compile time, which is not possible with DataFrames or regular SQL queries.
  • Interconvertible: Typed Datasets can be converted to "untyped" DataFrames using the following methods provided by the DatasetHolder (sketched in the example after this list):
    • toDS(): Dataset[T]
    • toDF(): DataFrame
    • toDF(colNames: String*): DataFrame
  • Faster Computation: Dataset operations are typically much faster than the equivalent RDD operations, which improves system performance.
  • Persistent storage qualified: Since Datasets are both queryable and serializable, they can easily be stored in any persistent storage.
  • Less Memory Consumed: Spark uses caching to create a more optimal data layout, so less memory is consumed.
  • Single Interface, Multiple Languages: A single API is provided for both Java and Scala, the languages most widely used with Apache Spark. This reduces the burden of using different libraries for different types of input.
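
The interconversion mentioned above can be shown with a small hedged sketch; the Order case class, the column names and the local master setting are assumptions for illustration only.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetConversions {
  case class Order(id: Long, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetConversions")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // toDS()/toDF() conversions and encoders

    // Strongly typed Dataset: the compiler checks field names and types.
    val orders: Dataset[Order] = Seq(Order(1L, 99.5), Order(2L, 15.0)).toDS()

    // Convert to an "untyped" DataFrame (Dataset[Row]), optionally renaming columns.
    val df: DataFrame = orders.toDF("order_id", "order_amount")

    // Typed lambda on the Dataset vs. a column expression on the DataFrame.
    orders.filter(_.amount > 50.0).show()
    df.filter($"order_amount" > 50.0).show()

    spark.stop()
  }
}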

7、The Spark SQL DSL (the DataFrame/Dataset API) is an important interview topic. It is very popular abroad, while domestically aggregation is the most commonly used part of it.
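
A hedged sketch of a DSL-style aggregation follows; the sales data and the column names "city" and "amount" are made up for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DslAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DslAggregation")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(("Pune", 120.0), ("Pune", 80.0), ("Delhi", 200.0)).toDF("city", "amount")

    // DSL (DataFrame API) aggregation instead of a SQL string.
    sales.groupBy($"city")
      .agg(sum($"amount").as("total"), avg($"amount").as("avg_amount"))
      .orderBy(desc("total"))
      .show()

    spark.stop()
  }
}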

8、What are case classes in Scala?

Scala case classes are like regular classes except that they are good for modeling immutable data and are useful in pattern matching. Case class constructor parameters are public and immutable by default. These classes support pattern matching, which makes it easier to write logical code. The following are some of the characteristics of a Scala case class (a short example follows the list):

  • Instances of the class can be created without the new keyword.
  • As part of a case class, Scala automatically generates methods such as equals(), hashCode(), and toString().
  • Scala generates accessor methods for all constructor arguments for a case class.
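
A minimal, self-contained sketch illustrating these points; the Point class and its values are made up for illustration.

object CaseClassDemo {
  // Constructor parameters become public, immutable vals.
  case class Point(x: Int, y: Int)

  def main(args: Array[String]): Unit = {
    val p = Point(1, 2)           // no `new` keyword needed: the generated apply() is used
    println(p)                    // generated toString: Point(1,2)
    println(p == Point(1, 2))     // generated equals()/hashCode(): structural equality, prints true
    println(p.x)                  // generated accessor for the constructor argument

    // Pattern matching against the generated extractor (unapply).
    p match {
      case Point(x, 0) => println(s"on the x-axis at $x")
      case Point(x, y) => println(s"at ($x, $y)")
    }
  }
}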

9、Explain Traits in Scala.

The concept of a trait is similar to an interface in Java, but traits are even more powerful because they can also implement members. A trait is a unit that encapsulates methods and their variables or fields; it can contain either only abstract methods or a mixture of abstract and non-abstract methods. Scala allows traits to be partially implemented, but traits may not have constructor parameters. Traits are created with the trait keyword.
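
A short sketch with made-up trait and class names, showing a trait that mixes an abstract member, a concrete field and a concrete method:

object TraitDemo {
  trait Greeter {
    val punctuation: String = "!"                        // concrete field
    def name: String                                     // abstract member
    def greet(): String = s"Hello, $name$punctuation"    // concrete method built on the abstract one
  }

  // The partial implementation is completed by the class that mixes the trait in.
  class Employee(val name: String) extends Greeter

  def main(args: Array[String]): Unit = {
    println(new Employee("Asha").greet())   // prints: Hello, Asha!
  }
}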

10、Explain how you would define a function in Scala.

Functions are first-class values in Scala and are defined using the def keyword. When defining a function, the type of each parameter must be specified. The return type is optional: if it is omitted, Scala infers it from the function body (a function that produces no meaningful result returns Unit). Function declarations in Scala have the following form:

 def function_name ([list of parameters]) : [return type] = 
{ 
      //Statement to be executed 
} 
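
Following that form, a small sketch (the function names and values are illustrative):

object FunctionDemo {
  // Explicit return type.
  def add(a: Int, b: Int): Int = a + b

  // Return type omitted: Scala infers String here.
  def greet(name: String) = s"Hello, $name"

  // No meaningful result: returns Unit.
  def logLine(msg: String): Unit = println(msg)

  def main(args: Array[String]): Unit = {
    println(add(2, 3))        // 5
    println(greet("Scala"))   // Hello, Scala
    logLine("done")
  }
}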

11、Explain Implicit Parameter.

An implicit parameter, unlike a regular parameter, can be passed to a method silently, without appearing in the regular argument list. Such parameters are marked with the implicit keyword, and every parameter that follows the keyword is implicit. Essentially, when we do not pass these arguments explicitly, the compiler looks for an implicit value of the required type in scope and uses it. When the implicit keyword appears in a function's parameter list, it marks all parameters of that list as implicit, and a method may have only one implicit parameter list.

Syntax:

def func1(implicit a: Int)               // a is implicit
def func2(implicit a: Int, b: Int)       // both a and b are implicit
def func3(a: Int)(implicit b: Int)       // only b is implicit

Example: 

object InterviewBit {
  def main(args: Array[String]): Unit = {
    val value = 20
    implicit val addition: Int = 5
    def add(implicit to: Int) = value + to
    val result = add
    println(result)
  }
}

Output: 

 25 

 12、Explain Option and write its usage.

Scala's Option[T] is a container that carries either one element of a given type or none at all. As the name suggests, it holds either a Some[T] wrapping a stored value or the None object, which represents a missing value. In simple words, Scala Options are wrappers for possibly missing values.

Example:

object OptionDemo {
  def main(args: Array[String]): Unit = {
    val employee = Map("InterviewBit" -> "Aarav", "Scaler" -> "Ankesh")
    val a = employee.get("InterviewBit")   // key present -> Some("Aarav")
    val b = employee.get("Informatica")    // key missing -> None
    println(a)
    println(b)
  }
}

Output:  

Some(Aarav)   
None  