NPTEL Big Data Computing Week 1 and 2 Assignment Answers 2024
1. Which of the following best describes the concept of ‘Big Data’?
Options:
A. Data that is physically large in size
B. Data that is collected from multiple sources and is of high variety, volume, and velocity
C. Data that requires specialized hardware for storage
D. Data that is highly structured and easily analyzable
Answer: B
Explanation: Big Data is defined by the 3 V’s: Volume (large size), Variety (diverse sources and formats), and Velocity (speed of generation and processing). It often includes both structured and unstructured data.
2. Which technology is commonly used for processing and analyzing Big Data in distributed computing environments?
Options:
A. MySQL
B. Hadoop
C. Excel
D. SQLite
Answer: B
Explanation: Hadoop is an open-source framework that enables distributed storage and processing of large datasets across clusters of computers using simple programming models.
3. What is a primary limitation of traditional RDBMS when dealing with Big Data?
Options:
A. They cannot handle structured data
B. They are too expensive to implement
C. They struggle with scaling to manage very large datasets
D. They are not capable of performing complex queries
Answer: C
Explanation: Traditional RDBMS are designed for structured data and often do not scale efficiently to handle the massive volume and variety in Big Data environments.
4. Which component of Hadoop is responsible for distributed storage?
Options:
A. YARN
B. HDFS
C. MapReduce
D. Pig
Answer: B
Explanation: HDFS (Hadoop Distributed File System) is the storage layer of Hadoop, designed to store large files across multiple machines.
5. Which Hadoop ecosystem tool is primarily used for querying and analyzing large datasets stored in Hadoop’s distributed storage?
Options:
A. HBase
B. Hive
C. Kafka
D. Sqoop
Answer: B
Explanation: Hive provides a SQL-like interface to query data stored in HDFS. It translates queries into MapReduce jobs under the hood.
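For reference, here is a minimal sketch of issuing a Hive-style SQL query from PySpark with Hive support enabled. It assumes a Hive metastore is configured and that a hypothetical Hive table named page_views already exists; it is an illustration, not the only way to query Hive.

```python
from pyspark.sql import SparkSession

# Minimal sketch: query a Hive-managed table from PySpark.
# Assumes a Hive metastore is configured and a hypothetical table
# `page_views` already exists.
spark = (SparkSession.builder
         .appName("hive-query-sketch")
         .enableHiveSupport()
         .getOrCreate())

# SQL-like query; the engine compiles it into distributed jobs under the hood.
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```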
6. Which YARN component is responsible for coordinating the execution of tasks within containers on individual nodes in a Hadoop cluster?
Options:
A. NodeManager
B. ResourceManager
C. ApplicationMaster
D. DataNode
Answer: A
Explanation: NodeManager runs on each node and is responsible for managing containers and monitoring resource usage and task status.
7. What is the primary advantage of using Apache Spark over traditional MapReduce for data processing?
Options:
A. Better fault tolerance
B. Lower hardware requirements
C. Real-time data processing
D. Faster data processing
Answer: D
Explanation: Apache Spark uses in-memory computation which allows much faster processing compared to disk-based MapReduce.
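A minimal PySpark sketch of that idea: a dataset cached in memory can be reused by several actions without re-reading from disk, which is where much of Spark's speed advantage over disk-based MapReduce comes from. The input path below is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="in-memory-sketch")

# Hypothetical input path; after the first action, later actions reuse the
# cached data instead of re-reading from disk as MapReduce would.
events = sc.textFile("hdfs:///data/events.log").cache()

total_lines = events.count()                                  # materialises the cache
error_lines = events.filter(lambda l: "ERROR" in l).count()   # served from memory

print(total_lines, error_lines)
```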
8. What is Apache Spark Streaming primarily used for?
Options:
A. Real-time data visualization
B. Batch processing of large datasets
C. Real-time stream processing
D. Data storage and retrieval
Answer: C
Explanation: Spark Streaming enables real-time stream processing of live data streams using Spark's API.
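A minimal DStream sketch, assuming text is arriving on a local socket (the host and port are placeholders): words are counted in 5-second micro-batches.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Count words arriving on a socket in 5-second micro-batches.
sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```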
9. Which operation in Apache Spark GraphX is used to perform triangle counting on a graph?
Options:
A. connectedComponents
B. triangleCount
C. shortestPaths
D. pageRank
Answer: B
Explanation: triangleCount is used to count triangles (three-node cycles) in a graph, which is useful for social network analysis and other graph computations.
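GraphX itself is a Scala API; as a rough Python-side illustration, the third-party GraphFrames package exposes an analogous triangleCount operation. The toy graph below and the package availability are assumptions.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # third-party package, assumed installed

spark = SparkSession.builder.appName("triangle-sketch").getOrCreate()

# Toy graph: vertices a, b, c form one triangle.
vertices = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.triangleCount().show()   # one row per vertex with its triangle count
```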
10. Which component in Hadoop is responsible for executing tasks on individual nodes and reporting back to the JobTracker?
Options:
A. HDFS Namenode
B. TaskTracker
C. YARN ResourceManager
D. DataNode
Answer: B
Explanation: In the classic MapReduce (Hadoop v1), TaskTracker is responsible for executing individual map and reduce tasks and reporting back to the JobTracker.
NPTEL Big Data Computing Week 2 Assignment Answers
1. Which statement best describes the data storage model used by HBase?
Options:
A. Key-value pairs
B. Document-oriented
C. Encryption
D. Relational tables
Answer: A – Key-value pairs
Explanation: HBase is a NoSQL database that stores data in a key-value pair format, built on top of HDFS for scalability and quick access to large datasets.
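As a rough illustration of that key-value access pattern, here is a sketch using the third-party happybase client. The connection details, table name, and column family below are hypothetical, and an HBase Thrift server is assumed to be running.

```python
import happybase  # third-party Python client for HBase's Thrift interface

# Hypothetical setup: table 'users' with column family 'info' already exists.
connection = happybase.Connection("localhost")
table = connection.table("users")

# Write: the row key plus column-family:qualifier address a single value.
table.put(b"user42", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Read the whole row back by its key.
print(table.row(b"user42"))
```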
2. What is Apache Avro primarily used for in the context of Big Data?
Options:
A. Real-time data streaming
B. Data serialization
C. Machine learning
D. Database management
Answer: B – Data serialization
Explanation: Apache Avro is a framework used for data serialization and deserialization, making it ideal for transferring large volumes of data between systems in Hadoop.
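A minimal sketch of Avro serialization from Python, using the third-party fastavro package (the schema and file name are made up for the example): records are written to a compact, schema-tagged binary file and read back.

```python
from fastavro import writer, reader, parse_schema  # third-party package

# Hypothetical schema for a small user record.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Serialize to a compact binary file ...
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# ... and deserialize it on the other side.
with open("users.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec)
```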
3. Which component in HDFS is responsible for storing actual data blocks on the DataNodes?
Options:
A. NameNode
B. DataNode
C. Secondary NameNode
D. ResourceManager
Answer: B – DataNode
Explanation: DataNodes are responsible for storing the actual blocks of data in HDFS, while the NameNode manages metadata and the directory structure.
4. Which feature of HDFS ensures fault tolerance by replicating data blocks across multiple DataNodes?
Options:
A. Partitioning
B. Compression
C. Replication
D. Encryption
Answer: C – Replication
Explanation: HDFS uses replication to copy data blocks across multiple nodes. The default is 3 replicas, ensuring fault tolerance and high availability.
5. Which component in MapReduce is responsible for sorting and grouping the intermediate key-value pairs before passing them to the Reducer?
Options:
A. Mapper
B. Reducer
C. Partitioner
D. Combiner
Answer: C – Partitioner
Explanation: The Partitioner assigns intermediate key-value pairs to specific Reducers based on the key; together with the framework’s shuffle-and-sort phase, this ensures that all values for the same key arrive grouped and sorted at the same Reducer.
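Hadoop's default HashPartitioner behaves roughly like the illustrative Python sketch below: hashing the key modulo the number of reducers, so every occurrence of a key lands on the same reducer.

```python
def default_partition(key: str, num_reducers: int) -> int:
    """Illustrative sketch of hash partitioning: all pairs sharing a key map
    to the same reducer index, so their values meet at one Reducer after the
    shuffle-and-sort phase."""
    return hash(key) % num_reducers

# Example: with 4 reducers, both calls for key "cat" return the same index.
print(default_partition("cat", 4), default_partition("cat", 4))
```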
6. What is the default replication factor in Hadoop Distributed File System (HDFS)?
Options:
A. 1
B. 2
C. 3
D. 4
Answer: C – 3
Explanation: The default replication factor in HDFS is 3, meaning each block of data is copied to 3 different DataNodes for redundancy.
7. In a MapReduce job, what is the role of the Reducer?
Options:
A. Sorting input data
B. Transforming intermediate data
C. Aggregating results
D. Splitting input data
Answer: C – Aggregating results
Explanation: The Reducer receives intermediate data (key-value pairs) and aggregates or summarizes them (e.g., sums, averages, counts) to produce the final output.
8. Which task can be efficiently parallelized using MapReduce?
Options:
A. Real-time sensor data processing
B. Single-row database queries
C. Image rendering
D. Log file analysis
Answer: D – Log file analysis
Explanation: MapReduce is ideal for processing large, distributed datasets like logs. It splits the logs and processes them in parallel across multiple nodes.
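A minimal sketch of the same map/reduce pattern for log analysis, written in PySpark for brevity (the log directory and line format are hypothetical): filter error lines, key them by date, and count per day in parallel.

```python
from pyspark import SparkContext

sc = SparkContext(appName="log-analysis-sketch")

# Hypothetical log directory; assume each line begins with a date field.
logs = sc.textFile("hdfs:///logs/*.log")

errors_per_day = (logs.filter(lambda line: "ERROR" in line)
                      .map(lambda line: (line.split()[0], 1))  # key by date field
                      .reduceByKey(lambda a, b: a + b))

for day, count in errors_per_day.collect():
    print(day, count)
```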
9. Which MapReduce application involves counting the occurrence of words in a large corpus of text?
Options:
A. PageRank algorithm
B. K-means clustering
C. Word count
D. Recommender system
Answer: C – Word count
Explanation: Word count is a classic example of a MapReduce program, demonstrating how to tokenize, map, and reduce text data to count word frequencies.
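A minimal Hadoop Streaming-style sketch of word count in Python, shown as a single self-contained script that simulates the map, sort, and reduce steps locally; in practice the mapper and reducer would be separate scripts passed to the Hadoop Streaming jar.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map: emit (word, 1) for every word on every input line.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Reduce: pairs arrive grouped and sorted by key; sum the 1s per word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of map -> sort -> reduce on stdin.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```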
10. What does reversing a web link graph typically involve?
Options:
A. Removing dead links from the graph
B. Inverting the direction of edges
C. Adding new links to the graph
D. Sorting links based on page rank
Answer: B – Inverting the direction of edges
Explanation: Reversing a web link graph inverts the direction of every edge: instead of listing the pages each page links to, the result lists, for each page, the pages that link to it (its backlinks). This is a common step in web crawling and PageRank-style analyses.
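A minimal Python sketch of the map/reduce pattern for reversing a link graph, using a tiny made-up graph: the map step emits (target, source) for every outgoing link, and the reduce step groups sources by target to produce the backlink lists.

```python
from collections import defaultdict

def map_phase(pages):
    """For each (source, [targets]) adjacency list, emit (target, source)."""
    for source, targets in pages:
        for target in targets:
            yield target, source

def reduce_phase(pairs):
    """Group the emitted pairs by target: for each page, collect the pages
    that link to it (its backlinks)."""
    backlinks = defaultdict(list)
    for target, source in pairs:
        backlinks[target].append(source)
    return dict(backlinks)

# Tiny hypothetical link graph: a -> b, a -> c, b -> c
web_graph = [("a", ["b", "c"]), ("b", ["c"])]
print(reduce_phase(map_phase(web_graph)))   # {'b': ['a'], 'c': ['a', 'b']}
```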