Learn about the Big Data Tools and Technologies

Big Data Tools

Apache Hadoop

Apache Hadoop is an open-source software framework designed for distributed storage and distributed processing of massive data sets on clusters of commodity hardware. Hadoop services cover the following areas:


  • Data storage
  • Data processing
  • Data access
  • Data governance
  • Security
  • Operations

Benefits of Hadoop

  • It is scalable and performance oriented. Hadoop processes data in a distributed manner, local to each node in a cluster, which enables it to store, manage, analyze, and process data in the petabyte range.
  • It is reliable. Large computer clusters are prone to failures of individual nodes. Hadoop is fundamentally resilient: when a node fails, processing is redirected to the remaining nodes in the cluster, and the data is automatically re-replicated to prepare for future node failures.
  • Hadoop is highly flexible. Traditional relational database management systems require a structured schema before data can be stored. Hadoop lets you store data in any format, whether structured, semi-structured, or unstructured, and apply a schema to the data when it is read.
  • It is cost-effective. Hadoop is an open-source software platform that runs on low-cost commodity hardware, which keeps expenses down.
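The distributed processing model behind Hadoop can be sketched conceptually in Python. This is a toy, single-process illustration of the MapReduce pattern, not the Hadoop API: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. In real Hadoop each phase runs across many cluster nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values seen for one key.
    return (key, sum(values))

lines = ["big data tools", "big data platforms"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'platforms': 1}
```

Because each map call and each reduce call is independent, the framework is free to run them on whichever node holds the data locally, which is the source of Hadoop's scalability.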

Azure HDInsight

Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform. Azure HDInsight is easy to use, fast, and cost-effective for processing large amounts of data. It supports the leading open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. These frameworks enable a wide range of scenarios such as extract, transform, and load (ETL), machine learning, IoT, and data warehousing.


Why should I use Hadoop on HDInsight?

  • It is cloud native. Azure HDInsight lets you create optimized clusters for Hadoop, Spark, Interactive Query (LLAP), Kafka, Storm, HBase, and ML Services on Azure. HDInsight also provides an end-to-end SLA on all your production workloads.
  • It is cost-effective and scalable. HDInsight lets you scale workloads up or down as requirements change. You can reduce costs by creating clusters on demand and paying only for what you use. You can also build data pipelines to operationalize your jobs. Decoupled compute and storage provide better performance and flexibility.
  • It is secure and compliant. HDInsight lets you protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the leading industry and government compliance standards.
  • It offers integrated monitoring. Integration with Azure Log Analytics provides a single interface for monitoring all your clusters.
  • It is available globally. HDInsight is available in more regions than comparable big data analytics offerings, including Azure Government, China, and Germany, which lets you meet enterprise requirements in key sovereign areas.
  • It is productive. HDInsight provides rich productivity tools for Hadoop and Spark in your preferred development environments, including Visual Studio, VSCode, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data scientists can also work in popular notebooks such as Jupyter and Zeppelin.
  • It is extensible. You can extend HDInsight clusters with installed components (Hue, Presto, and so on) by using script actions, by adding edge nodes, or by integrating with other big-data-certified applications. HDInsight enables one-click deployment of leading big data solutions.


Apache Hive

Apache Hive is a data warehouse system for Hadoop. Hive enables the following:

  • Data summarization
  • Querying
  • Analysis of data

Hive queries are written in HiveQL, a query language similar to SQL.
Hive enables you to project structure onto largely unstructured data. Once the structure is defined, you can use HiveQL to query the data without knowledge of Java or MapReduce.
HDInsight provides several cluster types, which are optimized for specific workloads. The following cluster types are most often used for Hive queries:
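The "project structure onto unstructured data" idea, often called schema on read, can be illustrated with a small Python sketch. The log format and query here are made-up examples; in Hive the equivalent query would be written in HiveQL (for instance, `SELECT level, COUNT(*) FROM logs GROUP BY level`).

```python
import re

# Schema on read: raw lines are stored as-is, and a schema is projected
# onto them only when a query runs. (Hypothetical log format.)
raw_logs = [
    "2023-01-05 ERROR disk full",
    "2023-01-05 INFO job started",
    "2023-01-06 ERROR timeout",
]

def read_with_schema(line):
    # Parse one unstructured line into (date, level, message) at read time.
    date, level, message = re.match(r"(\S+) (\S+) (.*)", line).groups()
    return {"date": date, "level": level, "message": message}

# The "query": count rows per level, over the structure we just projected.
counts = {}
for row in map(read_with_schema, raw_logs):
    counts[row["level"]] = counts.get(row["level"], 0) + 1
print(counts)  # {'ERROR': 2, 'INFO': 1}
```

The point is that nothing about the stored data changed; the structure exists only in the reading code, which is what lets Hive accept data in any format up front.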

  • Interactive Query – A Hadoop cluster that provides Low Latency Analytical Processing (LLAP) functionality to improve response times for interactive queries.
  • Hadoop – A Hadoop cluster that can handle batch processing workloads and workflows.
  • Spark – Apache Spark has built-in functionality for working with Hive.
  • HBase – HiveQL can be used to query data stored in HBase.



Presto

Presto (or PrestoDB) is an open source, distributed SQL query engine, built from the ground up to run fast analytic queries against data of any size. It supports both non-relational sources (HBase, HDFS, MongoDB, Cassandra, and Amazon S3) and relational sources (MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata).

Presto has the built-in capability to query data where it is stored, so it does not need to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. Many leading companies such as Facebook, Airbnb, Netflix, Atlassian, and Nasdaq use Presto.

Presto and Hadoop

Presto is an open source, distributed SQL query engine designed for fast, interactive queries on data in HDFS and other sources. Unlike Hadoop/HDFS, however, Presto does not have its own storage system, so Presto and Hadoop go hand in hand, with enterprises deploying both to meet a broad range of business goals. Presto can be installed with any implementation of Hadoop, and it is packaged in the Amazon EMR Hadoop distribution.

Who uses Presto?

Presto runs in production at many leading businesses, including Facebook, Airbnb, Netflix, Atlassian, and Nasdaq. At Facebook, over a thousand employees use Presto, running more than 30,000 queries and processing a petabyte of data daily. Netflix runs around 3,500 queries per day on its Presto clusters. Airbnb built and open-sourced Airpal, a web-based query execution tool that works on top of Presto.

Apache Sqoop

Apache Sqoop is a tool designed to transfer bulk data efficiently between Apache Hadoop and structured data stores such as relational databases. Sqoop can also offload certain tasks (such as ETL processing) from an enterprise data warehouse (EDW) into Hadoop, where they run at lower cost, and it can export data from Hadoop into external structured data stores. Sqoop integrates with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
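A typical Sqoop import is driven from the command line. The sketch below assembles one such invocation in Python purely for illustration; the JDBC URL, table name, and HDFS directory are hypothetical, and in practice you would run the resulting command on a cluster edge node.

```python
# Assemble a representative `sqoop import` command.
# The connection string, table, and target directory are hypothetical.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # hypothetical source DB
    "--table", "orders",                               # hypothetical table
    "--target-dir", "/user/etl/orders",                # HDFS destination
    "--num-mappers", "4",                              # parallel transfer tasks
]
print(" ".join(sqoop_import))
```

The `--num-mappers` flag is what enables the parallel data transfer described below: Sqoop splits the source table and runs that many map tasks, each copying one slice.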

What is Sqoop capable of doing?

  • Sqoop helps us to import sequential datasets from the mainframe. It easily copes with the ever growing need to shift data from the mainframe to HDFS.
  • Sqoop can import directly to ORC files. ORC offers better compression and lightweight indexing, which leads to much-improved query performance.
  • Sqoop can handle data imports efficiently. It is capable of moving specific data from external stores and EDWs into Hadoop to reduce the cost involved in combining data storage and processing. This ultimately leads to more savings.
  • Sqoop is capable of parallel data transfer. This results in improved performance and optimum utilization of system resources.
  • Sqoop is optimized to offer fast data copies. It can copy massive amounts of data accurately from external systems into Hadoop.
  • Sqoop has an efficient data analysis system. Sqoop helps improve the efficiency of data analysis by synthesizing structured data with unstructured data in a schema-on-read data lake.
  • Sqoop provides load balancing, mitigating excessive storage and processing loads on other systems.
  • YARN coordinates data ingest from Apache Sqoop and other services that deliver data into the Enterprise Hadoop cluster.


Apache Storm

Apache Storm is a tool designed to process streaming data in real time, bringing reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios that demand real-time analytics, machine learning, and continuous monitoring of operations.
Storm integrates with YARN via Apache Slider: YARN governs Storm while also managing cluster resources for the data, and it handles the security and operational aspects of a modern data architecture.

What does Storm do?

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Businesses harness this speed and combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
The following are a few examples of new business opportunities:

  • Real-time customer service management
  • Data monetization
  • Operational dashboards
  • Cybersecurity analytics
  • Threat detection

The following table shows typical “prevent” and “optimize” use cases for Storm.

Industry | “Prevent” Use Cases | “Optimize” Use Cases
Financial Services | Securities fraud; operational risks & compliance violations | Order routing; pricing
Telecom | Security breaches; network outages | Bandwidth allocation; customer service
Retail | Shrinkage; stock-outs | Offers; pricing
Manufacturing | Preventative maintenance; quality assurance | Supply chain optimization; reduced plant downtime
Web | Application failures; operational issues | Personalized content
Transportation | Driver monitoring; predictive maintenance | Routes; pricing

The upside of using Storm is that it is very simple and the design is minimalistic. Developers can easily write Storm topologies utilizing any programming language.
The following five attributes make Storm an ideal tool to process data workloads in real-time:

  • Speed – Storm is extremely fast, benchmarked at processing one million 100-byte messages per second per node.
  • Scalability – Storm is scalable with parallel computations spanning a cluster of machines.
  • Resiliency – Storm is fault-tolerant: if a worker dies, Storm automatically restarts it, and if a node dies, the worker is restarted on a different node.
  • Reliability – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are replayed only when there are failures.
  • Ease of operation – Storm's standard configurations are suitable for production on day one, and once deployed, Storm is easy to operate.
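Storm structures computation as a topology of spouts (stream sources) and bolts (stream transformations). The following toy sketch mimics that model with ordinary Python generators; it is a conceptual illustration, not the Storm API, and real Storm runs each spout and bolt as parallel tasks across a cluster.

```python
from collections import Counter

def sentence_spout():
    # Spout: the source of the tuple stream (hypothetical sentences).
    for sentence in ["storm processes streams", "storm is fast"]:
        yield sentence

def split_bolt(stream):
    # Bolt: split each sentence tuple into individual word tuples.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: keep running counts, like a fields-grouped counting bolt.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["storm"])  # 2
```

In Storm the wiring between spout and bolts is declared once as a topology, and the framework handles distributing the stream, restarting failed workers, and replaying tuples.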


Apache Spark

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads. The technology was originally developed in 2009 by researchers at the University of California, Berkeley, to speed up processing jobs in Hadoop systems. Spark became a top-level project of the Apache Software Foundation in February 2014; version 1.0 was released in May 2014, and version 2.0 followed in July 2016.

Why should we use Apache Spark?

  • Speed – Apache Spark can run workloads up to 100x faster, for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
  • Ease of use – The Spark engine offers over 80 high-level operators that make it easy to build parallel applications. You can use Spark interactively from the Scala, Python, R, and SQL shells, and write applications quickly in Java, Scala, Python, R, and SQL.
  • Generality – Spark combines SQL, streaming, and complex analytics, powering a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
  • Runs everywhere – Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, in standalone cluster mode (including on EC2), or in the cloud, and it can access diverse data sources including HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and many others.



Scala

Scala is an efficient hybrid language for big data. Scala stands for "scalable language": a general-purpose, highly scalable programming language that brings together object-oriented and functional programming. In data science, Scala is regarded as one of the most important programming languages, rivaling more established languages such as Java and Python. The rise of Apache Spark (which is written in Scala) has driven Scala's growth and earned it a reputation as a powerful language for data processing, machine learning, and streaming analytics.
In Scala, every value is an object and every operator is a method, and functions can be passed around as values. Scala provides a flexible, modular mixin composition that combines the advantages of mixins and traits, letting programmers reuse class definitions without inheritance. Its syntax supports higher-order functions as well as anonymous functions.
Scala also supports other paradigms, including imperative and declarative styles. Compared with conventional imperative languages, Scala has an advantage in parallelization: it describes algorithms at a higher level of abstraction, so the same algorithm can run serially, in parallel across all available cores on a single machine, or in parallel across a cluster of machines, without changing any code.
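The "same algorithm, serial or parallel" idea can be made concrete. Scala expresses it with parallel collections; the sketch below shows the same principle in Python, where the algorithm is described once with `map` and the execution strategy is chosen separately.

```python
from concurrent.futures import ThreadPoolExecutor

def algorithm(x):
    # The algorithm itself knows nothing about how it will be scheduled.
    return x * x + 1

data = list(range(8))

# Serial execution on one core.
serial = list(map(algorithm, data))

# The same algorithm, unchanged, executed by a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(algorithm, data))

print(serial == parallel)  # True: identical results either way
```

Because the algorithm is expressed as a pure mapping over the data, switching between serial and parallel execution is a one-line change, which is exactly the abstraction the paragraph above describes.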
Scala runs on the Java Virtual Machine and interoperates smoothly with Java. You can use Java libraries directly, call Java code, implement Java interfaces, and subclass Java classes in Scala, and vice versa. Note, however, that some Scala features, such as traits with defined methods and Scala's advanced types, are not accessible from Java.
In addition, Scala is efficient and concise: a single expression can replace many lines of loops, making Scala far less verbose than typical Java. Furthermore, Scala is statically typed, and its functional nature makes it type-safe.
In one comparison, a well-specified compact algorithm was implemented in four languages: Scala, Java, Go, and C++. Scala's concise notation and language features allowed the best management of code complexity.
Globally, Scala has become a decisive tool for machine learning and big data. Technology leaders including Twitter, LinkedIn, and The Guardian have built their sites with Scala.

Why should we use Scala?

1. Scala is a concise programming language. Scala balances readability and conciseness, which keeps code easy to understand. Scala is concise for the following reasons:

  • Type inference – Unlike many other functional languages, Scala's type inference is local. The compiler infers types, so they stay out of your way.
  • Pattern matching – Pattern matching is one of Scala's most used features, allowing any type of data to be matched with a first-match policy.
  • Functions as values – Functions can be assigned to variables, and utility functions can be reused.

2. Leading-edge class composition. Scala is an object-oriented language, so it allows users to extend classes through subclassing and a flexible mixin composition. This provides an effective way to reuse code and a substitute for multiple inheritance that avoids inheritance ambiguity.
3. Real-time stream processing. Hadoop MapReduce processes and generates massive datasets in parallel, but it cannot handle real-time stream processing. Scala, paired with Spark, has an advantage here over other programming languages, which has made it a preferred computational engine for fast data processing.
4. Streamlined integration with Java. Scala has a vast ecosystem because of its seamless interoperability with Java, and its integration with the Java-based big data ecosystem is smooth. Java libraries, IDEs (Eclipse, IntelliJ, etc.), frameworks (Spring, Hibernate, etc.), and tools all work with Scala, and most popular frameworks now offer dual APIs for Scala and Java. Scala programmers commonly use the following frameworks and APIs in big data projects:

  • Apache Spark – Apache Spark is an open-source parallel processing framework. It is used for running large-scale data analytics applications spanning numerous clustered computers.
  • Apache Flink – Apache Flink is a framework for distributed stream and batch data processing. Flink offers APIs for batch processing (DataSet API), real-time streaming (DataStream API), and relational queries (Table API).
  • Apache Kafka – Apache Kafka is an example of a distributed streaming platform. Programmers use it to handle real-time data feeds.
  • Apache Samza – Apache Samza is an example of a distributed stream-processing tool. Apache Samza combines with Apache Kafka and Apache Hadoop Yarn to provide fault tolerance, processor isolation, security, and resource management.
  • Akka – Akka is an example of an existing framework. It is used to build distributed applications.
  • Summingbird – Summingbird is a framework for integrating batch and online MapReduce computations.
  • Scalding – Scalding is a Scala API for Cascading, an abstraction over MapReduce.
  • Scrunch – We use the Scrunch framework to write, test, and run MapReduce pipelines.

In addition, numerous Java-based data storage tools integrate well with Scala, including Apache Cassandra (Phantom and Cassie are Scala Cassandra clients), Apache HBase, Datomic, and Voldemort.
5. Numerous libraries for data analysis and data science. Scala’s libraries are comprehensive and provide a rock-solid base for big data projects. The libraries mentioned below are some of the most used machine learning and data analysis libraries:

  • Saddle – Saddle is a high-performance data manipulation library inspired by Python's pandas library.
  • ScalaNLP – ScalaNLP is a combination of different libraries, including Breeze (a set of libraries used for numerical computing and machine learning), and Epic (a structured prediction library and high-performance statistical parser).
  • Apache Spark MLlib – It is a library for machine learning for Scala, Java, Python, and R.
  • Apache PredictionIO – It is an example of a machine learning server. Apache PredictionIO is based on Apache Spark, HBase, and Spray. It can be installed as a full machine learning stack.
  • Deeplearning4j – This is a distributed deep-learning library for Java and Scala.
  • Scala DataTable and Framian – Libraries for data frames and data tables.

6. A vibrant, evolving community. Scala has a vibrant and fast-growing community. The KDnuggets Analytics Data Science 2016 Software Poll rated Scala among the tools with the highest growth.

Apache Kafka

Apache Kafka is a distributed publish-subscribe messaging system designed to replace traditional message brokers. Message brokers are middleware that translate messages from the sender's messaging protocol to the receiver's.
Apache Kafka was originally created at LinkedIn and open sourced in 2011. It is now developed by the Apache Software Foundation, targeting the new class of data infrastructures built on massively parallel commodity clusters. Kafka improves on orthodox message brokers with higher throughput, built-in partitioning, replication, lower latency, and greater reliability.
We use Kafka for a number of reasons.

  • Messaging
  • Real-time tracking of website activity
  • Monitoring operational metrics of distributed applications
  • Log aggregation from multiple servers
  • Event sourcing where we log and order state alterations in a database
  • Commit logs that keep data in sync across distributed systems
  • Restoring data from failed systems
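Kafka's core abstractions, partitioned append-only logs and consumers that track their own read offsets, can be modeled in a few lines of Python. This is a toy, in-memory sketch for intuition only; a real Kafka cluster replicates partitions across brokers and persists the logs to disk.

```python
from collections import defaultdict

class Topic:
    def __init__(self, partitions=2):
        # A topic is a set of append-only partition logs.
        self.partitions = [[] for _ in range(partitions)]

    def publish(self, key, message):
        # Messages with the same key land in the same partition.
        part = hash(key) % len(self.partitions)
        self.partitions[part].append(message)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # per-partition read position

    def poll(self):
        # Read everything past our offsets; the log itself is untouched.
        out = []
        for i, log in enumerate(self.topic.partitions):
            out.extend(log[self.offsets[i]:])
            self.offsets[i] = len(log)
        return out

clicks = Topic()
consumer = Consumer(clicks)
clicks.publish("user1", "page_view")   # hypothetical event stream
clicks.publish("user2", "login")
print(len(consumer.poll()))  # 2
print(len(consumer.poll()))  # 0 (offsets advanced; log is retained)
```

Because the broker never deletes messages on read, many independent consumers can replay the same log at their own pace, which is what makes Kafka suitable for log aggregation, event sourcing, and recovery scenarios like those listed above.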



Apache HBase

HBase is an open source, non-relational, distributed database, released in 2008 and modeled after Google's Bigtable. HBase is written in Java and developed as part of the Apache Software Foundation's Apache Hadoop project. HBase runs on top of HDFS (the Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. It is an efficient, fault-tolerant way to store massive quantities of sparse data, meaning a small amount of meaningful information within a huge set of empty or unimportant data, for example, finding the 50 smallest items in a collection of 2 billion records, or finding the non-zero items representing less than 0.1% of a large collection.
HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and they can be accessed through the Java API or through REST, Avro, or Thrift gateway APIs. As a column-oriented key-value data store, HBase is widely used because of its interoperability with Hadoop and HDFS: running on top of HDFS, it supports fast read and write operations on large datasets with high throughput and low input/output latency.
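HBase's data model, a row key mapping to column families, which map qualifiers to values, can be sketched as a nested mapping. This is a toy illustration of the model, not the real HBase client API, and the rows and columns used here are hypothetical.

```python
# Toy sketch of HBase's model: row key -> column family -> qualifier -> value.
# Missing columns simply are not stored, which is what makes sparse data cheap.
table = {}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier, default=None):
    return table.get(row, {}).get(family, {}).get(qualifier, default)

# Two rows with entirely different columns; no NULL padding is stored.
put("user#1", "info", "name", "Ada")
put("user#2", "metrics", "clicks", 7)

print(get("user#1", "info", "name"))       # Ada
print(get("user#1", "metrics", "clicks"))  # None: the cell simply does not exist
```

Unlike a relational table, where every row carries every column (filled with NULLs if unused), absent cells here occupy no space at all, which is why HBase handles sparse data so efficiently.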
HBase is not a direct substitute for a classic SQL database. However, the Apache Phoenix project provides a SQL layer for HBase as well as a JDBC driver that can be integrated with various business intelligence and analytics applications. The Apache Trafodion project uses HBase as its storage engine and provides a SQL query engine with ODBC and JDBC drivers, plus distributed ACID transaction protection spanning multiple statements, tables, and rows.
Facebook deployed HBase for its messaging platform in November 2010. As of February 2017, the 1.2.x series is the latest stable release line. The following is a list of leading organizations that have used or are using HBase:

  • Yahoo
  • Tuenti – Tuenti uses HBase for its messaging platform
  • Spotify – Spotify uses HBase as the base for Hadoop and machine learning jobs
  • Sophos – Sophos uses HBase for specific back-end systems
  • Sears
  • Salesforce.com
  • Rocket Fuel
  • Richrelevance
  • Quicken Loans
  • Pinterest
  • Netflix
  • Kakao
  • Imgur – Imgur uses HBase to power its notifications system
  • Flurry
  • Facebook – Facebook used HBase for its messaging platform between 2010 and 2018
  • Bloomberg – Bloomberg uses HBase for time series data storage
  • Amadeus IT Group – HBase is the primary long-term storage DB
  • Airbnb – Airbnb uses HBase as part of its AirStream real-time stream computation framework
  • Adobe
  • 23andME


Apache Oozie

Apache Oozie is a server-based workflow scheduling system used to manage Hadoop jobs; it is also called an orchestration system. Oozie can run multistage Hadoop jobs as a single Oozie job. Oozie jobs can be configured to run on demand or periodically: on-demand jobs are called workflow jobs, and periodically running jobs are called coordinator jobs. Oozie also has bundle jobs, where a bundle is a collection of coordinator jobs managed as a single job.
In Oozie, a workflow is defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and end of a workflow (start, end, and failure nodes) and provide a mechanism to control the workflow execution path (decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation or processing task. Oozie supports many types of actions out of the box, including Hadoop MapReduce, Hadoop distributed file system operations, Pig, SSH, and email, and it can be extended to support additional action types.
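The control-flow/action-node structure amounts to walking a directed acyclic graph from the start node along "ok" transitions. The sketch below models that in Python; the graph, node names, and actions are hypothetical, and real Oozie defines the workflow in XML and submits each action to the Hadoop cluster.

```python
# Toy Oozie-style workflow: a DAG of nodes walked from "start" to "end".
workflow = {
    "start":     {"ok": "import"},
    "import":    {"action": "sqoop import", "ok": "transform"},  # hypothetical
    "transform": {"action": "hive query", "ok": "end"},          # hypothetical
    "end":       {},
}

def run(workflow):
    executed, node = [], "start"
    while node != "end":
        step = workflow[node]
        if "action" in step:
            # A real engine would submit the job to Hadoop here and wait.
            executed.append(step["action"])
        node = step["ok"]
    return executed

print(run(workflow))  # ['sqoop import', 'hive query']
```

Failure handling is the part this sketch omits: each real Oozie action also has an "error" transition, so the graph encodes both the happy path and the recovery path.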
Oozie workflows can be parameterized using variables such as ${inputDir} within the workflow definition. The parameter values are supplied when a workflow job is submitted. With properly parameterized output directories, several identical workflow jobs can run concurrently. Oozie is implemented as a Java web application that runs in a Java servlet container, and it is distributed under the Apache License 2.0.
Benefits of Apache Oozie

  • It is a general-purpose system for running multistage Hadoop jobs efficiently.
  • Oozie offers a well-understood programming model, which reduces developer ramp-up time.
  • Jobs are easy to troubleshoot and easy to recover after a failure.
  • It is extensible, supporting new types of jobs.
  • It is highly scalable, supporting several thousand concurrent jobs.
  • Oozie runs in a Java servlet container, which improves reliability.
  • It is a multitenant service, which reduces the cost of operations.


The Spark Python API (PySpark)

The Spark Python API (PySpark) exposes the Spark programming model to Python. Python is a powerful, widely used language for complex data analysis and data munging tasks, with many built-in libraries and frameworks for data mining. That said, no programming language can handle big data processing efficiently on its own; a distributed computing framework such as Spark or Hadoop is always required. Apache Spark supports Scala, Java, and Python, three of the most widely used languages in this space.
Apache Spark's core is written in Scala, which compiles to JVM bytecode for big data processing. The open source community developed PySpark, a utility that lets data scientists work with Spark's distributed datasets from Python. Within PySpark, the Py4J library allows Python programs to dynamically access JVM objects (such as RDDs).
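To show the chained style a PySpark program uses, here is a minimal, in-memory mimic of the RDD interface. This `FakeRDD` class is an illustration only, not the PySpark API: in real PySpark the equivalent calls (`sc.parallelize`, `map`, `reduceByKey`, `collect`) go through Py4J to the JVM and run distributed across executors.

```python
class FakeRDD:
    """Toy stand-in for an RDD, to illustrate the chained call style."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Apply fn to every element, yielding a new dataset.
        return FakeRDD(fn(x) for x in self.data)

    def reduceByKey(self, fn):
        # Combine values that share a key, as Spark does after a shuffle.
        acc = {}
        for k, v in self.data:
            acc[k] = fn(acc[k], v) if k in acc else v
        return FakeRDD(acc.items())

    def collect(self):
        # "Action": materialize the results (sorted here for determinism).
        return sorted(self.data)

words = FakeRDD(["spark", "python", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()
print(counts)  # [('python', 1), ('spark', 2)]
```

The same three-call chain, written against a real `SparkContext`, would run the map and reduce phases in parallel across the cluster while the Python program only orchestrates.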

November 9, 2018
© 2023 Hope Tutors. All rights reserved.
