Nice read – https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/
Emerging Architectures for Modern Data Infrastructure
Event-Driven Patterns for the Internet of Things
IoT devices and sensors are everywhere, gathering voluminous amounts of data, so one of the most challenging issues is analyzing these continuously growing data sets, particularly in real time. An IoT system needs to respond to ever-changing trends quickly enough to deliver real-time business intelligence (BI), which plays a major role in an IoT solution’s success. If you are using an event-driven architecture for your internet of things ecosystem, the following patterns can be hugely beneficial.
Complex Event Processing
Complex event processing (CEP) tackles the problem of matching patterns across streams of incoming events. The matches typically produce higher-level, complex events that are themselves derived from the input events.
In a conventional DBMS, a query is run against stored data; complex event processing, by contrast, runs incoming data against stored queries.
Irrelevant data can be discarded immediately. CEP is useful because it applies to unbounded data streams and processes input data quickly. As soon as the engine has seen all the events of a matching sequence, it emits a result, which is what gives complex event processing its analytical power.
Put simply, event processing identifies events (such as threats or opportunities) and generates a response before they occur or in their aftermath. Bear in mind that event-driven architecture and complex event processing are closely interlinked.
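To make the idea concrete, here is a minimal hand-rolled sketch of a standing query (it is not tied to any particular CEP engine, and the threshold and "overheating" pattern are purely illustrative): incoming sensor events are matched against a pattern, and a higher-level event is signalled when three consecutive readings exceed a threshold.
import java.util.ArrayDeque;
import java.util.Deque;

public class OverheatDetector {
    private static final double THRESHOLD = 80.0;        // illustrative threshold
    private final Deque<Double> window = new ArrayDeque<>();

    // Called for every incoming sensor event; returns true when the
    // "three consecutive readings above threshold" pattern is matched.
    public boolean onReading(double temperature) {
        window.addLast(temperature);
        if (window.size() > 3) {
            window.removeFirst();                         // keep only the last three readings
        }
        return window.size() == 3 && window.stream().allMatch(t -> t > THRESHOLD);
    }
}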
Event Processing
For building scalable IoT applications, one of the more popular distributed architecture patterns is the event-driven architecture. The pattern comprises highly decoupled, single-purpose event-processing components that receive and process events asynchronously.
An event-driven architecture is particularly handy for complex engineered systems with a loosely coupled structure. There is no need to specify a formal or bounded infrastructure to categorize components. Instead, the components are autonomous, free to couple to and decouple from various networks while responding to a wide range of events. Hence, IoT components can be reused across separate networks.
Mediator and broker are the two primary topologies of event-driven architecture. The mediator topology is used when a central mediator must orchestrate several steps for an event. The broker topology is useful when you do not want a central mediator and prefer to chain events together.
Mediator Topology
The mediator topology suits events that involve many steps and require orchestration to process.
The mediator topology consists of event queues, an event mediator, event channels, and event processors. The flow begins when a client sends an event to an event queue, which transports it to the event mediator. After receiving this initial event, the mediator orchestrates it by sending additional asynchronous events to the event channels, one for each step of the process. The event processors listen on those channels, pick up the events handed to them by the mediator, and execute the business logic needed to process them.
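As a rough sketch of this flow (the class and step names are illustrative, not a framework API), a client drops an event on a queue, a mediator picks it up and publishes a sub-event for each processing step, and processors registered against the channels run their business logic asynchronously:
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class MediatorTopologySketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();   // event queue
        ExecutorService workers = Executors.newFixedThreadPool(2);

        // Event processors, each listening on a named event channel.
        Map<String, Consumer<String>> channels = Map.of(
                "validate", e -> System.out.println("validating " + e),
                "store",    e -> System.out.println("storing " + e));

        eventQueue.put("sensor-reading-42");                 // client sends the initial event

        String initialEvent = eventQueue.take();             // event mediator receives it
        for (String step : List.of("validate", "store")) {   // mediator orchestrates the steps
            workers.submit(() -> channels.get(step).accept(initialEvent));
        }
        workers.shutdown();
        workers.awaitTermination(5, TimeUnit.SECONDS);
    }
}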
Broker Topology
The broker topology differs from the mediator topology in that there is no central event mediator. Instead, a message broker distributes the message flow across the event-processor components in a chain-like fashion. This topology is primarily used when there is no need for central event orchestration and the event-processing flow is relatively simple and straightforward.
The broker topology is composed of two component types: event processors and the broker. The broker component can be centralized or federated, and it holds all of the event flow’s channels. These channels can be message queues, message topics, or both.
Event Sourcing
This pattern describes an approach to handling data operations that is driven by a sequence of events, all of which are recorded in an append-only store. The application code sends to the event store a series of events that describe every action performed on the data.
The concept revolves around the idea that every change to a system’s state is captured as an event, which means the system can be reconstructed later by reprocessing those events. The event store is the primary source of truth, and the system state is derived entirely from it. A version-control system is one example of this idea.
Events are raised by the event source, and tasks perform operations in response to those events. This decoupling of tasks allows for greater flexibility and extensibility. Tasks know the type of event and its data, but not which operation raised it. Furthermore, each event can be handled by several tasks, which makes it easy to integrate other systems and services that watch the event store for new events.
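A minimal sketch of the idea, assuming a simple shopping-cart domain with string events (the event format and class are illustrative): state changes are only ever appended to the event store, and the current state is rebuilt by replaying them.
import java.util.ArrayList;
import java.util.List;

public class EventSourcedCart {
    private final List<String> eventStore = new ArrayList<>();   // append-only event store

    // Record a state change as an event, e.g. "ADD shirt" or "REMOVE shirt".
    public void record(String event) {
        eventStore.add(event);
    }

    // Rebuild the current cart contents by replaying every stored event in order.
    public List<String> replay() {
        List<String> cart = new ArrayList<>();
        for (String event : eventStore) {
            if (event.startsWith("ADD ")) {
                cart.add(event.substring(4));
            } else if (event.startsWith("REMOVE ")) {
                cart.remove(event.substring(7));
            }
        }
        return cart;
    }
}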
The events in event sourcing are typically low level, so you may need to create specific integration events on top of them.
Events are stored with an append-only operation; they are asynchronous and immutable. Tasks run in the background to handle them. If transaction processing does not suffer from contention, scalability and performance can improve significantly for IoT applications, particularly at the presentation and user-interface level.
However, event sourcing has drawbacks as well. For instance, there are no standard strategies or off-the-shelf solutions, such as SQL queries, for reading the events to obtain information. To determine an entity’s current state, all the events that relate to it must be replayed against its original state. Replaying events has its own pitfalls, such as interactions with outside systems that the results depend on.
Introduction to Apache Kafka
Apache Kafka is a fault-tolerant and scalable messaging system built on the publish-subscribe model. It helps developers design distributed systems, and many major web companies, including Airbnb, Twitter, and LinkedIn, use it.
Need for Kafka
To design innovative digital services going forward, developers need access to a wide, integrated stream of data. Typically, transactional data such as shopping carts, inventory, and orders is integrated with searches, recommendations, likes, and clicks. This data plays an important role in offering insight into customers’ purchasing behavior, and predictive analytics systems use it to forecast future trends. It is in this domain that Kafka gives companies an edge over their competitors.
How Was It Conceptualized
Around 2010, a team comprising Neha Narkhede, Jun Rao, and Jay Kreps developed Apache Kafka at LinkedIn. At the time, they were trying to solve a hard problem: the voluminous event data generated by LinkedIn’s website and infrastructure could not be ingested with low latency. They planned a lambda architecture that combined batch systems like Hadoop with real-time event processing, but back then no real-time applications were available to them that could solve the problem.
For data ingestion, there were solutions in the form of offline batch systems. However, doing so risked exposing a lot of implementation information. These solutions also utilized a push model, capable of overwhelming consumers.
While the team had the option to use conventional messaging queues like RabbitMQ, they were deemed as overkill for the problem at hand. Companies do wish to add machine-learning but when they cannot get the data, the algorithms are of no use. Data extraction from the source systems was difficult, particularly moving it reliably. The existing enterprise messaging and batch-based solutions did not resolve the issue.
Hence, Kafka was designed as the ingestion backbone for such issues. By 2011, Kafka’s data ingestion was close to 1 billion events per day. In less than 5 years, it reached 1 trillion messages per day.
How Does Kafka Work?
Kafka offers scalable, persistent, in-order messaging. Like other publish-subscribe systems, it is built around topics, publishers, and subscribers, and it supports highly parallel consumption through topic partitioning. Each message written to Kafka is replicated and persisted to peer brokers. You can configure how long messages are retained; for instance, with a 30-day retention period, messages expire after a month.
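As a rough illustration of that publish-subscribe flow with the standard Kafka Java client (the broker address, topic name, and group id below are placeholders), a producer appends messages to a topic and a consumer reads them back in order, tracking its position by offset:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish an event; Kafka appends it to the partition log of the "cart-activity" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("cart-activity", "cart-1", "add product shirt"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "cart-service");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("auto.offset.reset", "earliest");

        // Subscribe and read the messages back in log order, identified by their offsets.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("cart-activity"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}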
Kafka’s central abstraction is its log: an append-only, time-ordered data structure to which records are added in order. The data itself can be of any type.
A database typically writes changes to a log and then derives column values from it. In Kafka, messages are written to a topic, which maintains the log, and subscribers read from those topics to build whatever representation of the data they need.
For instance, a shopping cart’s log activity might read: add product shirt, add product bag, remove product shirt, checkout. The log presents this activity to downstream systems. When a shopping cart service reads the log, it can reconstruct the cart objects that reflect its contents: a product bag, ready for checkout.
Since Apache Kafka stores messages for long periods of time, applications can rewind to earlier log positions and reprocess them. Consider, for instance, a scenario in which you want to test a new analytics algorithm or application against past events.
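Continuing the consumer sketch above, rewinding is just a seek back to an earlier offset; a minimal version (again with placeholder names) looks like this:
// After subscribing, poll once so the consumer receives its partition assignments,
// then seek back to the earliest retained offset and reprocess the log.
consumer.subscribe(Collections.singletonList("cart-activity"));
consumer.poll(Duration.ofMillis(0));
consumer.seekToBeginning(consumer.assignment());
// Subsequent poll() calls re-deliver the historical events to the new algorithm.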
What Apache Kafka Does Not Do
Apache Kafka is blazing fast because it treats the log data structure as a first-class citizen, which sets it far apart from conventional message brokers.
It is important to note that Kafka does not assign individual IDs to messages; messages are addressed by their offset in the log. It also does not track which messages each consumer has consumed on a topic; consumers are responsible for that themselves.
Because its design differs from conventional message brokers, Kafka can make the following optimizations.
- It reduces load by not maintaining indexes of the message records. It also offers no random access: consumers specify an offset, and Kafka delivers the messages in order starting from that offset.
- There are no per-message deletes. Kafka retains log segments for a defined period.
- It can use kernel-level I/O to stream messages to consumers efficiently, rather than buffering messages in user space.
- It leverages the operating system’s page cache for write operations to disk.
Kafka and Microservices
Because of Kafka’s robust performance for big data ingestion, it has a range of use cases in microservices. Microservices often rely on event sourcing, CQRS, and other domain-driven concepts for scalability, and Kafka can serve as their backing store.
Event sourcing applications often produce a large volume of events, which are awkward to implement on top of conventional databases. With Kafka’s log compaction feature, events can be preserved for as long as needed: instead of discarding the log after a defined retention period, Kafka keeps at least the latest event for every key. The application becomes loosely coupled as a result; it can lose or discard its local state and, at any point in time, restore the domain state from the preserved events.
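For instance, a compacted topic can be created along these lines with Kafka’s admin client (the topic name, broker address, and sizing below are placeholders); Kafka will then retain at least the latest event for every key instead of discarding data by age:
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest event per key instead of deleting by time.
            NewTopic topic = new NewTopic("cart-events", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}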
When to Use Kafka?
Whether to use Apache Kafka depends on your use case. It solves many modern problems for web-scale companies but, like conventional message brokers, it does not perform well in every scenario. If your intention is to design a reliable set of data applications and services, Apache Kafka can serve as your source of truth, gathering and storing all of the system’s events.
Memory-Centric Architectures
Businesses both small and large build mobile apps, web applications, and IoT projects on top of their IT infrastructure. To obtain the scalability and speed needed for mission-critical systems, they often turn to in-memory computing (IMC). It is therefore no surprise that in-memory computing platforms are trending. Moreover, IMC technology has evolved into memory-centric architectures, which provide greater ROI and flexibility across different data sets.
Background
The restrictions of disk-based platforms were noticed as far back as the second half of the 20th century. Running data analysis and processing on transactional databases was found to hurt database performance, so separate analytical databases became necessary.
In this decade, businesses have accelerated their processes to launch a wide range of digital-transformation initiatives, meet real-time regulatory requirements, and deploy omnichannel marketing strategies. ETL processes, however, stand in the way of real-time data analysis and action. In-memory computing solutions that use HTAP (hybrid transactional/analytical processing) are therefore used to analyze and process data in real time.
In the past, RAM was costly and servers were considerably slower, so the ability to cache data in RAM and process it quickly to remove latency was limited. Distributed computing strategies, such as deploying in-memory data grids on commodity servers, allowed scaling across the available CPU and RAM, but RAM remained expensive.
However, with the passage of time, the RAM costs have decreased. Additionally, APIs and 64-bit processors have enabled the in-memory data grids to assist integration with data layers and existing applications, offering high availability, scalability, and in-memory speeds.
In-memory databases, meanwhile, went into production as replacements for older disk-based databases. Despite being a step forward, they unintentionally added fragmentation and complexity to the in-memory computing market.
More recently, in-memory databases, in-memory data grids, streaming analytics, machine learning, ANSI-99 SQL, and ACID transactions have all converged into single, reliable in-memory computing platforms. These platforms are easier to deploy and use than point solutions that offer a single product capability, and they have significantly cut implementation and operating expenses. They have also made it possible to dramatically scale out and speed up existing applications, and to build modern applications on memory-centric architectures across a wide range of industries such as healthcare, retail, SaaS, software, and the internet of things.
How In-Memory Computing Solved Real-World Issues
Sberbank, the biggest Russian bank, struggled with digital transformation in the past. The bank wanted to support mobile and online banking, store and work with 1.5 petabytes of data in real time, and serve its 135 million customers with a very large number of transactions every second. It also required ACID transactions to monitor and track transactions, and singled out high availability as a requirement. Using in-memory computing, the bank built a modern web-scale infrastructure consisting of 2,000 nodes. Experts reckon that in-memory computing lets this infrastructure compete with the best supercomputers in the world.
Similarly, Workday is one of the best-known enterprise cloud vendors in the HR and finance market. The company serves close to 2,000 customers, a significant portion of whom belong to the Fortune 500 and Fortune 50, and it employs around 10,000 people. To deliver its SaaS solutions, Workday uses IMC platforms to process more than 185 million transactions daily.
Memory-Centric Architectures
A crucial restriction of in-memory computing solutions is that all the data has to somehow “fit” in memory. Keeping everything in memory is far more expensive than keeping most of the data on disk, so businesses usually opt against holding all of their data in RAM. Memory-centric architectures eliminate this issue: they provide the means to use other storage and memory media such as 3D XPoint, Flash memory, SSDs, and other storage technologies. The idea behind memory-centric architectures is simply “memory-first”. The most recent and important data is kept in memory as well as on disk, so in-memory speed is achieved where it matters, but, unlike earlier approaches, the data set is allowed to exceed the amount of RAM. The complete dataset can therefore reside on disk while the platform still delivers robust performance, processing either against data in memory or against the underlying disk store.
Keep in mind that this is different from caching disk-based data in memory. Companies leverage the ability to exceed the amount of memory: the entire data set resides on disk, the most important and valuable data is also held in memory, and the less critical data stays on disk only. Memory-centric architectures have thus allowed companies to improve performance while reducing infrastructure expenses.
Essentially, a memory-centric architecture also removes the need to wait for RAM to be reloaded with data after a reboot. Those delays can consume a great deal of time, depending on network speed and database size, and violate SLAs in the process. Because the system can compute on data directly from disk while it is still warming up and reloading memory, recovery is quick. Performance may initially be comparable to a disk-based system, but it improves rapidly as data is reloaded into memory, until all operations run at in-memory speed.
Introduction to REST with Examples – Part 2
In the previous post, we talked about what REST APIs are and walked through a few examples, using cURL for our requests. So far, we have established that a request is composed of four parts: endpoint, method, headers, and data. We have already explained the endpoint and the method, so let’s go over the headers, the data, and some more relevant information on the subject.
Headers
Headers carry information for both the server and the client. They are used for a wide range of purposes, such as authentication or describing the body content. HTTP headers are property-value pairs separated by a colon. For instance, the following header tells the server to expect JSON content.
Content-Type: application/json
Using cURL (which we covered in the last post), you can send HTTP headers with the --header (or -H) option. For instance, to send the header above to the GitHub API, you can write the following.
curl -H "Content-Type: application/json" https://api.github.com
To inspect all of the headers you sent, add the --verbose or -v option to the end of the request. For instance, you can run the following command against the GitHub root endpoint.
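curl https://api.github.com -v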
Keep in mind that in your result, “*” indicates cURL’s additional information, “<” indicates the response headers and “>” indicates the request headers.
The Data (Body)
Let’s now come to the final component of a request, also known as the message or the body. It contains the information you want to send to the server. To send data with cURL, use the --data or -d option.
To send multiple fields, simply repeat the -d option.
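As a generic sketch (the endpoint here is a placeholder for your own URL), the two forms look like this:
curl -X POST <endpoint> -d property1=value1
curl -X POST <endpoint> -d property1=value1 -d property2=value2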
It is also possible to break requests into several lines for better readability. Once you learn how to spin up (start) a server, you can create your own API and test it with any data. If you are not interested in spinning up a server, you can go to Requestbin.com and hit “create endpoint”; you will get a URL that you can use for testing requests. Keep in mind that these request bins have a lifespan of 48 hours. You can then send data to your request bin as follows.
curl -X POST https://requestb.in/1ix963n1 \
-d name=adam \
-d age=28
The way cURL sends data mirrors a web page’s form fields. To send JSON instead, set your “Content-Type” header to “application/json”, like this.
curl -X POST https://requestb.in/1ix963n1 \
-H "Content-Type: application/json" \
-d '{
"name":"adam",
"age":"28"
}'
And with this, your request’s anatomy is finished.
Authentication
When you send POST requests to the GitHub API, a message displays “Requires authentication”. What does this mean exactly?
Developers ensure that there are certain authorization measures so specific actions are only performed by the right parties; this negates the possibilities of impersonation by any malicious third party. PUT, PATCH, DELETE, and POST requests change the database, forcing the developers to design some sort of authentication mechanism. What about the GET request? It also needs authentication but only in some cases.
On the web, authentication is done in two main ways. The first is the familiar username/password combination, known as basic authentication. The second uses a secret token; this category includes approaches such as OAuth, which lets users authenticate through Google, Facebook, and other platforms. To use username/password authentication, pass the -u option.
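For instance (the repository-creation endpoint below is just an illustration; cURL prompts for the password when only the username is given):
curl -X POST -u "yourusername" https://api.github.com/user/repos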
You can test this authentication yourself. Afterward, the previous “requires authentication” response is changed to “Problems parsing JSON”. The reason behind this is that so far, you have not sent any data. Since it is a POST request, data transfer is a must.
HTTP Error Messages and Status Codes
Messages like the “Problems parsing JSON” or “Requires authentication” above are HTTP error messages. They appear whenever something is wrong with a request. HTTP status codes let you see the status of a response at a glance; they range from the 100s up to the 500s.
- 200-level codes mean your request succeeded.
- 300-level codes mean the request was redirected to another URL.
- 400-level codes mean an error caused by the client.
- 500-level codes mean an error caused by the server.
To debug the status of a response, use the head or verbose options. For instance, if you send a POST request with the “-I” option and leave out the username/password details, you get a 401 status code back. When your request is malformed, because of missing or incorrect data, you get a 400 status code.
Versions of APIs
Developers upgrade their APIs time and again; it is an ongoing process. When too many modifications are required, they should consider creating a new version. When that happens, your application may start receiving errors, because your code was written against the previous API version while your requests now point at the brand-new one.
There are two ways to request a specific version of an API. Depending on how the API is structured, you can use either of them.
- Use the endpoint.
- Use the request header.
Twitter, for instance, follows the first strategy. A request might look like this:
https://api.abc.com/1.1/account/settings.json
GitHub, on the other hand, uses the second method. In the following example, version 4 of the API is requested in the request header.
curl https://api.abc.com -H Accept:application/abc.v4+json
An Introduction to REST with Examples – Part 1
REST stands for Representational State Transfer. If you have just transitioned from a desktop application development environment to the world of web development, then you have probably heard about REST APIs.
To understand REST APIs, let’s take an example. Suppose you search for “Avengers: Endgame” in YouTube’s search bar. You get back a seemingly endless list of videos on the results pages; that is exactly what a REST API is supposed to do: return results for whatever you want to find.
Broadly speaking, an API is a set of fixed functionalities that lets programs communicate with each other. The developer creates the API on the server, and the client talks to it. REST is a software architectural style that determines how the API works, and developers use it to design their APIs. One of REST’s rules is that a specific URL should return a specific piece of data, referred to as a resource. In this context, the URL is known as a request, and the data the user receives is known as a response.
What Makes Up a Request?
A request is usually made up of four components.
- The endpoint
- The method
- The headers
- The data
The Endpoint
The URL that a user requests is known as the endpoint. It has the following structure.
root-endpoint/path?query-parameters
Here, the root-endpoint is the starting point of the API from which you request data. For instance, Twitter’s API root-endpoint is https://api.twitter.com.
The path indicates the requested resource. Just think of it as a simple automated menu: you get what you click on. Paths can be accessed the same way as a website’s sections. For instance, if there is a website https://techtalkwithbhatt.com on which you want to check all the tagged posts on Java, then you can go to https://techtalkwithbhatt.com/tag/java. Here, as you can guess, https://techtalkwithbhatt.com is the root-endpoint while the path is the /tag/java.
To find out which paths are available, you need to go through the documentation of the specific API. For instance, suppose you want to list a user’s repositories via GitHub’s API; the documentation gives the following path.
/users/:username/repos
You replace the part after the colon, :username, with any username of your choice. For instance, if there is a user named sarveshbhatt, you would write the following:
https://api.github.com/users/sarveshbhatt/repos
Lastly, we also have the query parameters in the endpoint. Strictly speaking, they do not come under the REST architecture but they are extensively used in the APIs. Query parameters offer the functionality to change your request by adding key-value pairs. These pairs start off with a question mark and all parameter pairs are separated by adding a “&” character. The format is listed below.
?queryone=valueone&querytwo=valuetwo
Using Curl
These requests can be sent with different languages. For example, Python developers use Python Requests; JavaScript developers use the jQuery’s Ajax method and the Fetch API.
However, for this post we are going to use cURL, a user-friendly utility that is probably already installed on your computer. Since API documentation is typically written with cURL examples, once you get the hang of it you will be able to understand almost any API documentation and create requests easily.
But first, let’s check whether or not cURL is installed on your PC. Depending on your OS, open the Terminal and enter the following.
curl --version
In response, you should see cURL’s version information along with its supported protocols.
If cURL is not installed, you will see a “command not found” error; in that case, install cURL before continuing. To work with cURL, you enter “curl” followed by any endpoint. For instance, to check the GitHub root endpoint, you can use the following line.
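curl https://api.github.com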
The response is a JSON listing of the endpoints that the GitHub API makes available.
Similarly, you can check the repositories of a user by adding a path to the endpoint—we discussed how to do this above. Just add “curl” and write the following command.
curl https://api.github.com/users/Sarveshbhatt/repos
However, keep in mind that when you use query parameters on the command line, you have to add a backslash (“\”) before the “?” and “=” characters. This is because “?” and “=” are treated as special characters by the shell, so the backslash tells the command line to interpret them as part of the command.
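For instance, with an illustrative endpoint and the query-parameter names used earlier:
curl https://api.example.com/items\?queryone\=valueone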
JSON
JSON stands for JavaScript Object Notation; it is a popular format for sending and requesting data with a REST API. If you send a request to GitHub, the response comes back in JSON format. As the name suggests, a JSON object is essentially a JavaScript object whose property-value pairs are wrapped in double quotation marks.
The Method
Going back to the request, now let’s come to the second component: the method. The method is simply a request type which is sent to the server. The method types are: GET, POST, PUT, PATCH and DELETE.
The function of these methods is to give meaning to the request. CRUD (Create, Read, Update, and Delete) operations are performed by these methods.
- GET
The GET request is used when a server resource is needed. When it is used, the server searches for your requested data and sends it to you. It is the default request method.
- POST
The POST request generates a new resource on the server. When you use this request, the server generates a new record in the DB and responds to you about its success.
- PUT and PATCH
These requests provide an update on a server’s resource. When these requests are used, a database entry is updated by the server and you get a message about its success.
- DELETE
The DELETE request effectively eliminates a server resource. When you use this request, it performs a deletion in the database entry and informs you about its success.
Automation Anti-Patterns That You Must Avoid
Regardless of your testing experience or the effectiveness of your automation infrastructure, you need a robust test design before you start automating. Without a strong test design, testing teams run into a wide range of issues that produce incomplete, inefficient, and difficult-to-maintain tests. To make sure cost efficiency, quality, and delivery do not suffer, you should familiarize yourself with the indicators that reflect how your tests perform. To begin with, avoid the following automation anti-patterns.
Longer Length of Sequence
Tests are often built from long sequences of small steps, which makes them hard to manage and maintain. For instance, when the application under test changes, it is considerably difficult to carry those modifications through to the other tests.
Instead of starting bottom-up, generate a high-level design first. Depending on the method you follow, such a design defines test products, with their scope and definition, tied to the main objective and the test objective of the different tests. For instance, one test product could consist of the test cases that verify the calculation of home-loan mortgage premiums.
Business-Tests
When testers focus too heavily on interaction tests, they may end up with weak tests that ignore major business-level concerns, such as how the application responds in unusual circumstances.
Alongside interaction tests, testers should emphasize business tests, which exercise business rules, objects, and processes. In a business test, for example, a user logs in, enters a few orders, and views financial information through high-level activities that hide the interaction details.
In an interaction test, a user types a name/password combination and checks whether the submit button is enabled; such a test could apply to any type of business application.
Blurred Lines
While it is important to run both interaction and business tests, keep in mind that they should be kept separate. The rules, lifecycles, processes, and calculations of business objects must not be mixed with interaction-test details such as confirming that the submit button is present or that a welcome message appears after login. Mixing them makes maintenance hard: if the welcome message changes in a new version of the application, all of the associated tests have to be checked and maintained.
A modular, high-quality test design gives testers the right way to avoid these blurred lines and keep tests maintainable and manageable. Test modules should have a well-defined scope; they avoid checks that fall outside that scope and hide the detailed steps of the UI interactions.
Life Cycle Tests
Most business applications work with business objects such as products, invoices, orders, and customers. The lifecycle of these objects in the application consists of create, retrieve, update, and delete, known as CRUD. The major issue is that the tests for these lifecycles tend to be scattered, incomplete, and hard to find. This leaves real gaps in test coverage for business objects, particularly when those objects keep changing. In a car rental application, for example, several cars and vans may be exercised, while buses and motorcycles get far less coverage.
Life cycle tests are easy to design. Start by choosing business objects and their operations, and include variations such as updating or canceling an order. Remember that life cycle tests resemble business tests rather than interaction tests.
Poorly Developed Tests
Factors such as time pressure can lead to shallow test cases that do not test the application properly. Quality suffers as a result: situations are missed and never handled. It also makes test maintenance more expensive.
More importantly, testing and test automation need to stay in sync throughout the Scrum sprint. Tests and their automation require close cooperation, which becomes much harder if they are still being worked on after the sprint has ended.
When better automation architecture and test design are still not enough to keep up with the velocity, consider outsourcing some tasks so the testers can match the pace.
Think like a professional tester: someone with a knack for breaking things. Testing techniques such as error guessing and decision tables help pinpoint the situations that most urgently need test cases, while equivalence partitioning and state-transition diagrams help you think of different designs for the test cases.
Scope
Tests that lack a clear scope are a common problem. When each test has a well-defined scope, such as a single entry dialog or a group of financial transactions, the tests are easy to find and to update when the application changes, and duplication is avoided.
Duplicate Checks
Testers often add a check with an expected output after every single step, a practice that many test-management tools encourage. The result is that the same check is repeated across separate tests; for instance, an earlier test may already have verified that the welcome screen appears after login.
Begin with a carefully thought-out test design and make sure every test module has a clear, well-differentiated scope. While developing such tests, avoid checking after each step; add checks only where the scope calls for them.
AWS S3 Tips for Performance
Amazon S3 is used by many companies for storage. As an object store, it handles a wide variety of data, from small objects to massive datasets, and has carved out a niche as a resilient, highly available service for a broad range of data types. Since your S3 objects are accessed and read by other AWS services, applications, and end users, are they optimized for the best possible performance? Follow these tips to improve your performance with Amazon S3.
Perform TCP Window Scaling
TCP window scaling lets you improve network throughput by modifying the TCP packet header to include a window scale factor, which allows more data to be sent per window than the traditional 64 KB limit. Note that this practice is not exclusive to S3; it works at the protocol level, so you can apply window scaling from any client when it establishes a connection with a server.
When TCP establishes a connection between a source and a destination, a 3-way handshake takes place, initiated by the source. From the S3 point of view, this means that before a client can upload an object to S3, it must first establish a connection with the S3 servers.
The client sends a TCP packet with its proposed TCP window scale factor in the header; this request, called a SYN request, is the first part of the 3-way handshake. When S3 receives it, it responds with a SYN/ACK message carrying the corresponding window scale factor, which forms the second part of the handshake. The third and final part is an ACK message sent back to the S3 server, acknowledging the response. Once the 3-way handshake completes, a connection is established and the client and S3 can exchange data.
To send more data, you can widen the window size via the scale factor. This lets voluminous amounts of data be sent in a single window and speeds up transfers.
Use Selective Acknowledgment
Packets occasionally get lost when using TCP, and it is hard to work out which packets within a TCP window went missing. One option is to resend all of the packets, but the receiver may already have received some of them, so this is an inefficient strategy.
Instead, you can use TCP selective acknowledgment (SACK) to improve performance: the sender is notified of exactly which packets in a window failed to arrive and can then resend only those packets.
Note that the sender (the source client) must negotiate SACK while the connection is being established, during the SYN phase of the handshake; this option is known as SACK-permitted. Consult your operating system’s documentation for how to enable and implement SACK.
Setting Up S3 Request Rates
Alongside TCP scaling and TCP SACK, S3 itself is well optimized for high request throughput. In 2018, AWS announced a change to these request rates: before the announcement, it was recommended to randomize prefixes within a bucket to optimize performance, but that is no longer necessary. Request-rate performance now scales with the number of prefixes used in the bucket.
Developers get 3,500 PUT/POST/DELETE requests per second and 5,500 GET requests per second per prefix. A single prefix is what imposes those limits, but there is no limit on the number of prefixes that can be used in an S3 bucket. This means that if you use two prefixes, you can achieve 11,000 GET requests and 7,000 PUT/POST/DELETE requests per second for the same bucket.
S3 folders are not hierarchical; storage uses a flat structure. All you need is a bucket, and all objects are stored in the bucket’s flat space. You can create folders and store objects in them without relying on a hierarchical system; the object’s prefix is what makes its key unique. For instance, suppose you have the following objects in a bucket:
- Design/Meeting.ppt
- Objective/Plan.pdf
- Will.jpg
Here, the ‘Design’ folder serves as a prefix for identifying the object—such a pathname is also referred to as the object key. The ‘Objective’ folder is also an object’s prefix while the ‘Will.jpg’ is without any prefix.
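As a rough sketch with the AWS SDK for Java v2 (the bucket name is a placeholder, and the keys reuse the examples above), writes spread across the two prefixes count against separate request-rate limits:
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class PrefixSketch {
    public static void main(String[] args) {
        S3Client s3 = S3Client.create();

        // Objects under the "Design/" and "Objective/" prefixes count against
        // separate request-rate limits, so writes can be parallelized across them.
        s3.putObject(PutObjectRequest.builder()
                        .bucket("my-example-bucket").key("Design/Meeting.ppt").build(),
                RequestBody.fromString("slide data"));
        s3.putObject(PutObjectRequest.builder()
                        .bucket("my-example-bucket").key("Objective/Plan.pdf").build(),
                RequestBody.fromString("plan data"));
    }
}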
Amazon CloudFront
Another optimization strategy is to put Amazon CloudFront in front of Amazon S3. This is a wise choice when most requests for S3 data are GET requests. CloudFront is a content delivery network that speeds up the distribution of static and dynamic content through a global network of edge locations.
Normally, a GET request for S3 data is routed by the S3 service and the content is returned from the relevant servers. When you use CloudFront with S3, commonly requested objects are cached at the edge, so a user’s GET request is directed to the nearest edge location and the cached object is returned with low latency, giving the best performance. It can also lower AWS costs, since fewer GET requests hit the bucket directly.
Best Code Review Practices
How do you run code reviews? Code reviews are vital: they improve code quality, contribute to the stability and reliability of the code, and build and foster relationships among team members. Here are some tips for conducting them.
1. Have a Basic Understanding of What to Search in the Code
To begin with, you should have a basic idea about what you are looking for. Ideally, you should look for major aspects like the following.
- What structure has the programmer followed?
- How is the logic building so far?
- What style has been used?
- How is the code performing?
- How are the test results?
- How readable is the code?
- Does it look maintainable?
- Is it ticking up all the boxes for the functionality?
You can also perform static analysis or other automated checks to evaluate the logic and structure of your code. However, some things are best reviewed from a pure manual check like functionality and design.
Moreover, you also have to consider the following questions for the code.
- Are you able to understand how the code works and what does it do?
- Is the code following the client requirement?
- Are all modules and functions running as expected?
2. Take 60-90 Minutes for a Review
Avoid spending too much time reviewing a codebase in a single sitting. After about 60 minutes, a reviewer naturally begins to tire and no longer has the same mental and physical sharpness to pick out defects in the code. Studies of attention support this: when people commit to an activity that requires concentration, their performance starts to dip after about 60 minutes. Within that time frame, a reviewer can cover roughly 300 to 600 lines of code at best.
3. Assess 400 Lines of Code at Max
According to a study of code review at Cisco, developers get the best results by reviewing 200 to 400 LOC (lines of code) at a time; beyond that, the ability to identify bugs begins to wither away. With an average review lasting about 1.5 hours, this yields a defect-discovery rate of roughly 70-90%. In other words, if there were 10 faults in the code, you would find about 7 to 9 of them.
4. Make Sure Authors Annotate Source Code
Authors can eliminate most of the flaws in their code before a review is even required. Making it mandatory for developers to re-check their own code lets reviews finish faster while keeping code quality intact.
Before the review, authors can annotate their code. Annotations guide the reviewer through the modifications, showing what to look at first and explaining the reason and method behind each change. These notes are not merely code comments; they serve as a guide for the reviewer.
Because authors have to re-assess and explain their changes while annotating the code, many flaws surface before the review even begins, which makes the review considerably more efficient.
5. Setup Quantifiable Metrics
Start by defining goals for the code review and deciding how you will measure its effectiveness. With clear goals in place, you can tell whether peer review is actually delivering the required results.
You can use external metrics such as “cut the defects that escape development by 50%” or “decrease support calls by 15%”; these give you a picture of how well your code is doing from an outside perspective, and a quantifiable measurement is wiser than a vague objective to “resolve more bugs”.
Note that external metrics do not show results until much later; for instance, support calls will not change until the new version is released and users start working with it. You should therefore also track internal process metrics to count defects, identify the spots that cause issues, and understand how much time your developers spend reviewing code. Some internal code review metrics are the following.
- Inspection rate: measured in kLOC (thousands of lines of code) per work hour; it represents how quickly the code is reviewed.
- Defect rate: measured in defects found per hour; it represents how quickly defects are discovered.
- Defect density: measured in defects per kLOC; it represents how many defects are found in a given amount of code.
6. Create Checklists
Checklists are vital for reviews because they help the reviewer keep the important checks in mind. They are a good way to catch the things you would otherwise forget, and they help authors as well as reviewers.
One of the hardest defects to catch is an omission, since it is obviously difficult to notice code that was never written. A checklist is one of the best ways to address this: with it, both the reviewer and the author can verify that errors have been handled, that function arguments have been tested with invalid values, and that the required unit tests have been created.
Another good idea is a personal checklist. Each developer tends to repeat the same mistakes in their code, so maintaining a personal checklist helps both the author and the reviewers.
Tips to Manage Garbage Collection in Java
With each evolution, garbage collectors get new improvements and advancements. However, garbage collections in Java face a common issue: unpredictable and redundant object allocations. Consider the following tips to improve your garbage collection in Java.
1. Estimate Capacity of the Collections
All Java collection types, including extended and custom implementations, are backed by primitive or object arrays. Because an array’s size cannot change once it is allocated, adding items to a collection may force a new, larger array to be allocated and the old one to be dropped.
Most collection implementations try to optimize this re-allocation and keep its cost amortized, even when the expected size of the collection is not given. To get the best results, however, supply the expected size when creating the collection.
For instance, consider the following code.
public static <T> List<T> reverse(List<? extends T> list) {
    List<T> answer = new ArrayList<>();
    for (int x = list.size() - 1; x >= 0; x--) {
        answer.add(list.get(x));
    }
    return answer;
}
In this method, a new list is allocated and then filled with another list’s items in reverse order. The optimization applies to the line that adds items to the new list: every time an item is added, the list has to make sure its underlying array has a free slot for it. If there is a free slot, the item is simply stored; otherwise a new underlying array must be allocated, the old array’s contents copied over, and only then the new item added. The result is multiple array allocations that eventually have to be garbage collected.
To avoid these redundant, inefficient allocations, you can “inform” the collection of how many items it is expected to store when you create it.
public static <T> List<T> reverse(List<? extends T> list) {
    List<T> answer = new ArrayList<>(list.size());
    for (int x = list.size() - 1; x >= 0; x--) {
        answer.add(list.get(x));
    }
    return answer;
}
As a result, the array allocated in the ArrayList constructor is big enough to hold list.size() items, so no memory reallocation is needed in the middle of the iteration.
2. Compute Streams with a Direct Approach
When data is processed as it is downloaded from the network or read from a file, it is not uncommon to see code like the following.
byte[] fData = readFileToByteArray(new File("abc.txt"));
The byte array is then handed to a parser as a JSON object, XML document, or Protocol Buffer message. With large files, or files of uncertain size, you are exposed to OutOfMemoryErrors if the Java Virtual Machine cannot allocate a buffer for the whole file.
Even if we assume the data size is manageable, the pattern above creates considerable garbage collection overhead, because a huge blob is allocated on the heap to hold the file’s data.
There are better ways to tackle this. One is to use the appropriate InputStream and pass it to the parser directly instead of converting it into a byte array first; most major libraries expose APIs for parsing directly from streams. For instance, consider the following code.
FileInputStream fstream = new FileInputStream(fileName);
MyProtoBufMessage message = MyProtoBufMessage.parseFrom(fstream);
3. Make Use of Immutable Objects
Immutability brings many benefits to the table, but perhaps the biggest one concerns garbage collection. An object whose fields cannot be altered after the object is constructed is known as an immutable object. For instance,
public class TwoObjects {
    private final Object a;
    private final Object b;

    public TwoObjects(Object a, Object b) {
        this.a = a;
        this.b = b;
    }

    public Object getA() {
        return a;
    }

    public Object getB() {
        return b;
    }
}
The instantiation of the above-mentioned class provides an immutable object in which all the fields are set as ‘final’ and cannot be altered.
Immutability implies that all objects referenced by an immutable container were created before the container itself was constructed. In garbage collection terms, the container is at least as young as the objects it references.
Therefore, during young-generation garbage collection cycles, the collector can skip immutable objects in the older generations, because they cannot hold references to anything in the younger generation.
Fewer objects to scan means less memory to touch and shorter collection cycles, which improves garbage collection throughput.
4. Leverage Specialized Primitive Collections
Java’s standard collection library is convenient and generic, giving you compile-time type safety in collections. That works well if, for instance, you need a set of strings or a map from strings to lists.
The real problem arises when developers need to store a list of ints or a map with double values. Since primitives cannot be used with generic types, the alternative is to use boxed types.
That approach wastes a lot of space: an Integer is an object with a 12-byte header and a 4-byte int field, 16 bytes in total, four times the size of the equivalent primitive int. On top of that, every one of those Integer instances has to be tracked by the garbage collector.
To resolve this issue, you can use the Trove collection library, which gives up some of the generics convenience in exchange for memory-efficient, specialized primitive collections. For instance, rather than a Map<Integer, Double>, you can use a TIntDoubleMap.
TIntDoubleMap mp = new TIntDoubleHashMap();
mp.put(6, 8.0);
mp.put(-2, 8.555);
Trove’s underlying implementation works with primitive arrays, so no boxing occurs when the collection is manipulated and no objects are created in place of the primitives.