As a company with deep expertise and a long-standing track record in Big Data solutions, we are eager to shed some light on its architecture. Although it seems complicated, Big Data architecture in practice has crystal-clear logic and is vital for businesses to leverage their data to the fullest and make informed decisions.
Many promising companies across the globe adopt the Big Data approach to process vast volumes of structured and unstructured data and produce accurate results that remain reliable as the business keeps growing. For instance, PepsiCo has used Big Data analytics to attract its most valuable customers: targeting these clients delivered an 80% increase in sales within the first 12 weeks after a product launch.
To see how data flows through a company’s systems and to ensure that it is managed properly and meets the business’s information needs, we need a well-structured Big Data architecture. Data architecture is one of the domains of enterprise architecture, connecting business strategy and technical implementation. When it’s well-structured, it allows companies to:
- Transform unstructured data for analysis and compiling reports;
- Record, process and analyze unconnected streams in real-time or with low latency;
- Conduct more accurate analysis, make informed decisions, and reduce costs.
In practical terms, Big Data architecture can be seen as a model for data collection, storage, processing, and transformation for subsequent analysis or visualization. The choice of an architectural model depends on the basic purpose of the information system and the context of its application, including the levels of processes and IT maturity, as well as the technologies currently available.
The best-known paradigms are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform), used in conjunction with data lake, lakehouse, and data warehouse approaches.
There are a number of Big Data architecture components or layers. The key layers include data ingestion, storage, processing, analytics, and application, from the bottom to the top. Let’s take a closer look at these Big Data components to understand what architectural models consist of.
Data sources, as the name suggests, are the sources of data for systems based on Big Data architecture. These sources include software and hardware capable of collecting and storing data. The variety of data collection methods depends directly on the source.
The most common data sources are:
- Relational databases (Oracle, PostgreSQL, etc.)
- NoSQL solutions (Document, Key/Value, Graph databases)
- Time-series databases (TimescaleDB, InfluxDB)
- File systems such as cloud storage and FTP/NFS/SMB storage
- Distributed file systems (HDFS, AWS EFS, etc.)
- Search engines (Elasticsearch)
- Message queues (RabbitMQ, Kafka, Redis)
- Enterprise systems accessed via API
- Legacy enterprise systems like mainframes
Each data source can hold one or more types of data:
- Structured data is data arranged around a predefined schema (various databases, existing archives, enterprise internal systems, etc.)
- Unstructured data is data not structured according to a predefined data model (GPS, audio and video files, text files, etc.)
- Semi-structured data refers to data that doesn’t conform to the structure of a data model but still has definite classifying characteristics (internal system event logs, network services, XML, etc.)
As described above, data is initially stored in external systems that act as data sources for a Big Data architecture platform. That data may already exist in the source or may be generated in real time.
The first step is to extract data from an external system or a data source and ingest it into a Big Data architecture platform for subsequent processing. In practice, this means the following:
- Collect data from an external data source using a pull or a push approach
- The pull approach is when your Big Data platform retrieves bulk data or individual records from an external data source. Data is usually collected in batches if the external data source supports it. In this case, the system can better control the amount and throughput of data ingested per unit of time.
- The push approach is when an external data source pushes data into your Big Data platform, usually as real-time events and messages. In this case, the system should support a high ingestion rate or use an intermediate buffer/event-log solution such as Apache Kafka as internal storage (a minimal consumer sketch follows this list).
- Persist data in a data lake, lakehouse, or distributed data storage as raw data. Raw means that the data keeps its original format and representation, which guarantees that subsequent processing does not lose any of the original information.
- Transfer data to the next processing layer in the form of bulk items (batch processing) or individual messages for real-time processing (with or without intermediate data persistence).
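As an illustration of the push approach described above, here is a minimal sketch of a consumer that reads events from Kafka and appends them unchanged to a raw landing file. The topic name, broker address, and file path are hypothetical, and the kafka-python package is assumed to be installed.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; adjust to your environment.
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Persist every event in its original form (raw zone) so that later
# processing steps never lose information from the source.
with open("raw/clickstream_events.jsonl", "a", encoding="utf-8") as raw_file:
    for message in consumer:
        raw_file.write(json.dumps(message.value) + "\n")
```

In a production setting, the raw file would typically live in a data lake or distributed file system rather than on local disk.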
The next step is processing or transformation of previously ingested data. The main objectives of such activity include:
- Transforming data into structured data format based on predefined schema
- Enriching and cleaning data, converting data into the required format
- Performing data aggregation
- Ingesting data into an analytical database, storage, data lake, lakehouse, or data warehouse
- Transforming data from raw format into intermediate format for further processing
- Implementing Machine or Deep Learning analysis and predictions
Depending on project requirements, there are different approaches to data processing:
- Batch processing
- Near real-time stream processing
- Real-time stream processing
Let’s review each type in detail below.
3.1. Batch Processing
After a dataset has been accumulated over a period of time, it moves to the processing stage. Batch processing presupposes that algorithms process and analyze datasets previously stored in intermediate distributed storage such as a data lake, lakehouse, or distributed file system.
As a rule, batch processing is used when data can be processed in chunks on a daily, weekly, or monthly basis and end users can wait some time for the results. Batch processing uses resources more efficiently, but it increases the latency before processed data becomes available for storage and analysis.
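As a hedged illustration, a daily batch job of this kind could look like the following PySpark sketch; the paths, column names, and aggregation logic are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch-aggregation").getOrCreate()

# Read the raw data previously ingested into the lake (hypothetical path).
raw = spark.read.json("s3a://data-lake/raw/clickstream/2024-01-01/")

# Clean and transform: drop malformed rows, cast types, aggregate per user.
daily_stats = (
    raw.filter(F.col("user_id").isNotNull())
       .withColumn("event_time", F.to_timestamp("event_time"))
       .groupBy("user_id")
       .agg(
           F.count("*").alias("events"),
           F.countDistinct("session_id").alias("sessions"),
       )
)

# Write the result to a curated zone in a columnar format for analytics.
daily_stats.write.mode("overwrite").parquet(
    "s3a://data-lake/curated/daily_user_stats/2024-01-01/"
)
```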
3.2. Stream Processing
Incoming data can also be presented as a continuous stream of events from any external data source, which is usually pushed to the Big Data ingestion layer. In this case data is ingested and processed directly by consumers in the form of individual messages.
This approach is used when the end user or an external system needs to see or use the result of computations almost immediately. Its advantages are per-message resource efficiency and the low latency of processing data in a near real-time manner.
3.2.1 Near Real-Time Stream Processing
If, according to the non-functional requirements, incoming messages can be processed with latency measured in seconds, near real-time stream processing is the way to go. This type of streaming processes individual events in small micro-batches, combining items that arrive close together into a single processing window. For example, Spark Streaming processes streams in this way, striking a balance between latency, resource utilization, and overall solution complexity.
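A micro-batch pipeline of this kind could look like the following Spark Structured Streaming sketch. The topic, broker, and storage paths are assumptions, and the Kafka connector package for Spark is assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("near-real-time-stream").getOrCreate()

# Read a continuous stream of events from a hypothetical Kafka topic.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "clickstream-events")
         .load()
         .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

# Group incoming events into ten-second micro-batches and persist them.
query = (
    events.writeStream
          .trigger(processingTime="10 seconds")  # micro-batch interval
          .format("parquet")
          .option("path", "s3a://data-lake/stream/clickstream/")
          .option("checkpointLocation", "s3a://data-lake/checkpoints/clickstream/")
          .outputMode("append")
          .start()
)

query.awaitTermination()
```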
3.2.2 Real-Time Stream Processing
When a system is required to process data in real time, processing is optimized to achieve millisecond latency. These optimizations include in-memory processing, caching, and asynchronous persistence of input/output results, in addition to classical near real-time stream processing techniques.
Real-time stream processing handles individual events with maximum throughput per unit of time, but at the cost of additional resources such as memory and CPU. For example, Apache Flink processes each event immediately, applying the approaches mentioned above.
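The sketch below is not Flink itself but a plain-Python illustration of the ideas just mentioned: each event is handled immediately, state is kept in memory, and results are persisted asynchronously so that the hot path is never blocked. All names and the simulated events are made up for the example.

```python
import json
import queue
import threading

state = {}                     # in-memory state: running counts per user
persist_queue = queue.Queue()  # results waiting to be written asynchronously

def persist_worker():
    """Background thread: write results without blocking event handling."""
    with open("realtime_counts.jsonl", "a", encoding="utf-8") as out:
        while True:
            record = persist_queue.get()
            out.write(json.dumps(record) + "\n")
            persist_queue.task_done()

threading.Thread(target=persist_worker, daemon=True).start()

def handle_event(event):
    """Process a single event immediately, keeping latency minimal."""
    user_id = event["user_id"]
    state[user_id] = state.get(user_id, 0) + 1  # in-memory update
    persist_queue.put({"user_id": user_id, "count": state[user_id]})

# In a real system events would arrive from a broker; here we simulate a few.
for event in [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}]:
    handle_event(event)

persist_queue.join()  # wait until buffered results are flushed
```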
The majority of Big Data solutions are built in a way that facilitates further analysis and reporting in order to gain valuable insights. The analysis reports should be presented in a user-friendly format (tables, diagrams, typewritten text, etc.), meaning that the results should be visualized. Depending on the type and complexity of visualization, additional programs, services, or add-ons can be added to the system (table or multidimensional cube models, analytical notebooks, etc.).
To achieve this goal, ingested and transformed data should be persisted in an analytical data store, solution, or database in a format or structure optimized for fast ad-hoc queries, quick access, and scalability to support a large number of users.
Let’s see a couple of typical approaches.
The first popular approach is the data warehouse, which is in essence a database optimized for read operations through column-based storage, a reporting-optimized schema, and an SQL engine. This approach is usually applied when the data structure is known in advance and sub-second query latency is critical to support rich reporting functionality and ad-hoc user queries. Examples include Amazon Redshift, HP Vertica, ClickHouse, and Citus for PostgreSQL.
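To give a feel for this approach, here is a minimal sketch of an ad-hoc reporting query against a PostgreSQL-compatible warehouse (such as Redshift or Citus). The connection details, table, and columns are assumptions, and the psycopg2 package is assumed to be installed.

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical connection to a PostgreSQL-compatible analytical database.
conn = psycopg2.connect(
    host="warehouse.example.com",
    dbname="analytics",
    user="report_user",
    password="secret",
)

# A typical ad-hoc reporting query over a pre-aggregated fact table.
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales_facts
        WHERE sale_date >= %s
        GROUP BY region
        ORDER BY total_revenue DESC
        LIMIT 10
        """,
        ("2024-01-01",),
    )
    for region, total_revenue in cur.fetchall():
        print(region, total_revenue)
```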
The next popular approach is the data lake. The original goal of a data lake was to democratize access to data for different use cases, including machine learning algorithms, reporting, and post-processing, all on the same ingested raw data. It works, but with some limitations. This approach reduces the complexity of the overall solution because a data warehouse is not required by default, so fewer tools and data transformations are needed. However, the performance of the engines used for reporting is significantly lower, even with optimized formats such as Parquet, Delta, and Iceberg. A typical example of this approach is a classic Apache Spark setup that persists ingested data in Delta or Iceberg format and exposes it through the Presto query engine.
The latest trend is to combine both previous approaches into one, known as a lakehouse. In essence, the idea is to have a data lake with a highly optimized data format and storage plus a vectorized SQL engine similar to that of a data warehouse, built on the Delta format, which supports ACID transactions and versioning. For example, the Databricks enterprise platform has achieved better performance on typical reporting queries than classical data warehouse solutions.
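A minimal sketch of the lakehouse idea with PySpark and the open-source Delta format might look like the following; it assumes the delta-spark package is installed, and the paths and columns are illustrative only.

```python
from pyspark.sql import SparkSession

# A Spark session configured for Delta (requires the delta-spark package).
spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame(
    [(1, "books", 12.5), (2, "games", 30.0)],
    ["order_id", "category", "amount"],
)

# ACID append to a Delta table stored directly in the data lake.
orders.write.format("delta").mode("append").save("/data-lake/lakehouse/orders")

# The same files can be queried like a warehouse table, including time travel
# back to an earlier version of the data.
latest = spark.read.format("delta").load("/data-lake/lakehouse/orders")
first_version = (
    spark.read.format("delta").option("versionAsOf", 0)
    .load("/data-lake/lakehouse/orders")
)
latest.show()
```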
For the most part, Big Data analytics solutions share similar, repetitive business processes: data collection, transfer, processing, loading into analytical data stores, or direct delivery to reports. That’s why companies leverage orchestration technology to automate and optimize all the stages of data analysis.
Different Big Data tools can be used in this area depending on goals and skills.
The first level of abstraction is the Big Data processing solutions themselves, described in the data transformation section. They usually provide orchestration mechanisms where a pipeline and its logic are implemented directly in code, following the functional programming paradigm. For example, Spark, Apache Flink, and Apache Beam all have such functionality. This level of abstraction is very powerful but requires programming skills and deep knowledge of the Big Data processing solutions.
The next level is orchestration frameworks, which still involve writing code to implement the flow of steps in an automated process, but these tools require only basic knowledge of the language to link steps together, without special knowledge of how a specific step or component is implemented. Such tools come with a list of predefined steps that advanced users can extend. For example, Apache Airflow and Luigi are popular choices for people who work with data but have limited programming knowledge.
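For instance, a minimal Airflow DAG linking two steps of a hypothetical daily pipeline could look like the sketch below; the task bodies are only placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw data from the source system")  # placeholder step

def transform():
    print("clean, enrich, and aggregate the ingested data")  # placeholder step

with DAG(
    dag_id="daily_big_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    ingest_task >> transform_task  # link the steps into a simple flow
```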
The last level is end-user GUI editors that allow orchestration flows and business processes to be created in a rich editor with graphical components, which are linked visually and configured via component properties. BPMN notation is often used in such tools, in conjunction with custom components for data processing.
To efficiently handle customer requests and perform tasks well, applications have to interact with the data warehouse. In a nutshell, we’ll look at the two most popular Big Data architectures, known as Lambda and Kappa, which serve as the basis for various corporate applications.
- Lambda has long been the dominant Big Data architecture. It separates real-time and batch processing, with the batch layer used to ensure consistency. This approach covers most application scenarios. However, the batch and stream layers often handle different cases even though their internal processing logic is almost the same, so data and code duplication creeps in and becomes a source of numerous errors.
- For this reason, the Kappa architecture was introduced; it consumes fewer resources and is well suited for real-time processing. Kappa builds on Lambda but serves both stream and batch workloads through a single stream-processing pipeline, with the incoming data retained in the data lake. The essence of this architecture is to optimize data processing by applying the same code base to both processing models, which simplifies management and eliminates the need to keep two code paths in sync.
Diagram: Lambda architecture
Diagram: Kappa architecture
Analysts use various Big Data tools to monitor current market trends, clients’ needs and preferences, and other information vital for business growth. When building a solution for clients, we always take all these factors into consideration, offering high-quality Big Data services and delivering a product that brings the greatest value.
Let’s take a glimpse at the most common Big Data tools and techniques used nowadays:
Distributed Storage and Processing Tools
Accommodating and analyzing ever-expanding volumes of diverse data requires distributed database technologies. Distributed databases split data across multiple physical servers, allowing many machines, located anywhere, to work on it in parallel. Some of the most widespread processing and distribution tools include:
Hadoop
Big Data will be difficult to process without Hadoop. It’s not only a storage system, but also a set of utilities, libraries, frameworks, and development distributions.
Hadoop consists of four components:
- HDFS — a distributed file system designed to run on standard hardware and provide instant access to data across Hadoop clusters.
- MapReduce — a distributed computing model used for parallel processing in different cluster computing environments.
- YARN — a technology designed to manage clusters and use their resources for scheduling users’ applications.
- Hadoop Common — libraries and utilities that support the other Hadoop modules
Spark
Spark is a solution capable of processing data in real time, in batches, and in memory to deliver quick results. The tool can run on a local machine, which facilitates testing and development. Today, this powerful open-source Big Data tool is one of the most important in the arsenal of top-performing companies.
Spark was created for a wide range of tasks such as batch applications, iterative algorithms, interactive queries, and machine learning. This makes it suitable both for casual use and for professional processing of large amounts of data.
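As an illustration of how easily Spark runs on a single machine for development, a minimal local PySpark session might look like this (the data and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# local[*] runs Spark on all cores of the developer machine: no cluster needed.
spark = SparkSession.builder.master("local[*]").appName("local-dev").getOrCreate()

sales = spark.createDataFrame(
    [("books", 12.5), ("games", 30.0), ("books", 7.0)],
    ["category", "amount"],
)

sales.groupBy("category").agg(F.sum("amount").alias("total")).show()

spark.stop()
```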
NoSQL Databases
NoSQL databases differ from traditional SQL-based databases in that they support flexible schemas. This simplifies the handling of vast amounts of all types of information, especially unstructured and semi-structured data that is poorly suited to strict SQL systems.
Here are the four main NoSQL categories adopted in business:
- Document-Oriented DB stores data elements in structures like documents.
- Graph DB connects data into graph-like structures to emphasize the relationships between information elements.
- Key-value DB pairs unique keys with their associated values in a relatively simple, easily scalable model.
- Column-based DB stores information in tables that can contain many columns to handle a huge number of elements.
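To make the flexible-schema idea concrete, here is a small sketch using a document-oriented database; it assumes a local MongoDB instance and the pymongo package, and the collection and fields are made up. Documents in the same collection can carry different fields.

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Documents in the same collection may have different shapes: no fixed schema.
events.insert_one({"type": "click", "user_id": "a42", "page": "/pricing"})
events.insert_one({"type": "purchase", "user_id": "a42",
                   "items": [{"sku": "B-100", "qty": 2}], "total": 59.90})

# Query by any field, including fields that only some documents contain.
print(events.find_one({"type": "purchase", "user_id": "a42"}))
```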
MPP
A distinguishing feature of the massively parallel processing (MPP) architecture is that data and memory are physically partitioned across the nodes of a cluster. When data is received, only the necessary records are selected and the rest are discarded so they don’t take up space in RAM, which speeds up disk reads and the processing of results. Predictive analytics, regular reporting, corporate data warehousing (CDW), and churn-rate calculation are among the typical applications of MPP.
Cloud Computing Tools
Clouds can be used in the initial phase of working with Big Data for experimenting with data and testing hypotheses. It’s easier to try new assumptions and technologies when you don’t need your own infrastructure. Clouds also make it faster and cheaper to launch a solution into production while meeting requirements such as data-storage reliability and infrastructure performance. As a result, more and more companies are moving their Big Data to clouds, which are scalable and flexible.
When choosing a database solution, you have to bear in mind the following factors:
1. Data Requirements
Before launching a Big Data solution, find out which processing type (real-time or batch) is more suitable for your business to achieve the required ingestion speed and extract the relevant data for analysis. Don’t overlook requirements such as response time, accuracy, consistency, and fault tolerance, which play a crucial role in the data analytics process.
2. Stakeholders’ Needs
Identify your key external stakeholders and study their information needs to help them achieve mission-critical goals. This presupposes that the choice of a data strategy must be based on a comprehensive needs analysis of the stakeholders to bring about benefits to everyone.
3. Data Retention Periods
Data volumes keep growing exponentially, which makes their storage far more expensive and complicated. To keep these costs under control, you have to determine the period during which each data set can bring value to your business and should therefore be retained.
4. Open-Source or Commercial Big Data Tools
Open-source analytics tools will work best for you if you have the people and the skills to work with them. This software is more adaptable to your business needs, as your staff can add features, updates, and other adjustments and improvements at any moment.
If you don’t have enough staff to maintain your analytics platform, opting for a commercial tool can bring more tangible outcomes. Here, you depend on a software vendor, but you get regular updates and tool improvements and can use the vendor’s support services to solve problems as they arise.
5. Continuous Evolution
The Big Data landscape is quickly changing as the technologies keep evolving, introducing new capabilities and offering advanced performance and scalability. In addition, your data needs are certainly evolving, too.
Make sure that your Big Data approach accounts for these changes, meaning your Big Data solution should make it easy to introduce enhancements such as integrating new data sources, adding custom modules, or implementing additional security measures when needed.
If built correctly, Big Data architecture can save money and help predict important trends, but as a ground-breaking technology, it has some pitfalls.
Budget Requirement
A Big Data project can often be held back by the cost of adopting Big Data architecture. Your budget requirements can vary significantly depending on the type of Big Data application architecture, its components and tools, management and maintenance activities, as well as whether you build your Big Data application in-house or outsource it to a third-party vendor. To overcome this challenge, companies need to carefully analyze their needs and plan their budget accordingly.
Data Quality
When information comes from different sources, it’s necessary to ensure consistency of the data formats and avoid duplication. Companies have to sort out and prepare data for further analysis with other data types.
Scalability
The value of Big Data lies in its quantity but it can also become an issue. If your Big Data architecture isn’t ready to expand, problems may soon arise.
- If the infrastructure isn’t managed, the cost of its maintenance will increase, hurting the company’s budget.
- If a company doesn’t plan for expansion, its productivity may fall significantly.
Both of these issues need to be addressed at the planning stage.
Security
Cyberthreats are a common problem since hackers are very interested in corporate data. They may try to inject fake information or browse corporate data to obtain confidential details. Thus, a robust security system should be built to protect sensitive information.
Skills Shortage
The industry is facing a shortage of data analysts, as many candidates lack the experience and necessary skills. Fortunately, this problem is solvable today by outsourcing your Big Data architecture challenges to an expert team that has broad experience and can build a fit-for-purpose solution to drive business performance.
Once, the use of Big Data revolutionized information technology. Today, both established companies and newbies adopt top-notch Big Data architectures to make informed decisions, drive revenue, and stay a step ahead of competitors. In this way, analyzed data becomes an indispensable source for business optimization, reduced costs, and valuable insights.
Discuss data strategy with an expert
Big Data analytics experts will review your data struggles and help map out steps to achieve data-driven decision-making.