How to Increase the Availability of an Application with RabbitMQ?


If you have used RabbitMQ to handle the messaging layer of your applications, you are probably familiar with its recurring configuration pitfalls. Engineers who have run RabbitMQ for a long time know how easily a misconfigured broker causes problems.

It is therefore essential to configure it correctly: that is how you get the best possible performance and the most stable cluster. This article collects a number of tips and techniques that can improve the availability and uptime of your application.

Limit Queue Size

How you arrange your queues matters in RabbitMQ. To get the best performance out of your application, keep queues as short as possible. A very long queue carries processing overhead that slows the application down, so aim to keep queue length close to 0.

Turn On Lazy Queues

Since RabbitMQ 3.6, the broker has offered lazy queues. In a lazy queue, messages are written to disk as soon as they arrive and are only loaded into memory when a consumer needs them. Because messages live on disk rather than in RAM, memory usage stays low and predictable; the trade-off is some additional latency and lower peak throughput.

Lazy queues therefore help you build more stable clusters with more predictable performance, which gives the server’s availability a welcome boost. If your application runs large batch jobs and handles a high volume of messages, you may worry that consumers will not always keep up with the publishers; that is exactly the situation where lazy queues should stay turned on.
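
As an illustration, here is a minimal sketch of declaring a lazy queue with the Python pika client. The queue name and connection settings are assumptions for the example; the same effect can also be achieved through a policy instead of a queue argument.

import pika

# Connect to a local broker (illustrative connection settings).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare the queue in lazy mode so messages go to disk on arrival
# instead of being kept in RAM.
channel.queue_declare(
    queue="batch_jobs",                      # hypothetical queue name
    durable=True,
    arguments={"x-queue-mode": "lazy"},
)

connection.close()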

Manage Cluster Setup

A collection of nodes in RabbitMQ is known as a cluster. To improve availability, make sure data is replicated so that clients can keep working through any possible or expected failure: even if one node goes down, clients can still reach the data they need.

In CloudAMQP, clusters are created with three nodes. The nodes are located in different availability zones (on AWS), and queues are replicated (HA) between those availability zones.

If a node goes down, there should be a mechanism that shifts its work to the remaining cluster nodes so the application stays available. You can do this by putting a load balancer in front of your RabbitMQ instances, so that publishers are distributed transparently across the brokers.

Keep in mind that in CloudAMQP the maximum failover time is 60 seconds: the DNS TTL is 30 seconds, and endpoint health is checked every 30 seconds.

By default, a queue lives on a single node, although it is visible to and reachable from all the other nodes. To achieve high availability, queues can be mirrored (replicated) across the nodes of the cluster; the RabbitMQ documentation on mirrored queues and on clustering covers how to configure this.

Using Durable Queues and Persistent Messages

If it is absolutely critical that your application never loses a message, declare your queues as “durable” and publish your messages in “persistent” mode. To avoid losing messages at the broker, you have to plan for broker crashes, broker hardware failure, and broker restarts. To ensure that broker definitions and messages survive a restart, they must be stored on disk: queues, exchanges, and messages that are not declared durable and persistent are lost when the broker restarts.
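
As a rough sketch, here is how a durable queue and persistent publishing might look with the pika client; the queue name, message body, and connection settings are placeholders.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A durable queue survives a broker restart...
channel.queue_declare(queue="orders", durable=True)

# ...and delivery_mode=2 marks the message as persistent, so the broker
# writes it to disk rather than keeping it only in memory.
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b'{"order_id": 1}',
    properties=pika.BasicProperties(delivery_mode=2),
)

connection.close()

Publisher confirms can additionally tell you when the broker has actually written a persistent message to disk.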

Using the Federation for the Cloud

Clustering across regions or between clouds is generally not recommended, so you do not have to distribute nodes across data centers or regions. If an entire cloud region goes down, CloudAMQP may be affected as well, although that is not guaranteed; the nodes of a cluster are distributed among availability zones within the same region.

If you want to protect your setup against a whole region going down, configure two clusters in separate regions and apply “federation” between them. Federation is a RabbitMQ plugin that lets your application use more than one broker spread across separate systems.

Turn Off HiPE

Avoid HiPE. It noticeably increases server throughput, but it also increases startup time: with HiPE enabled, RabbitMQ is compiled at startup, which can add as much as 3 minutes. That delay can cause problems during a server restart.

Disable Detailed Mode

If you have set the RabbitMQ Management statistics rate mode to “detailed”, disable it immediately. This mode hurts the performance and availability of the application and should not be used in production.

Limit Priority Queues

Each priority level is backed by an internal queue on the Erlang VM, and each of those queues consumes resources. For most use cases, you do not need more than 5 priority levels.

How to Enhance the Performance of Your RabbitMQ Application?


The performance in RabbitMQ is dependent on multiple variables. Go through the following tips to enhance the performance of your RabbitMQ application.

Use Shorter Queues

Keep all your queues as short as possible to achieve maximum performance. Long queues carry more processing overhead than short ones, so experts recommend keeping queue length close to 0 for the best possible performance.

Setting up Max-length

If your application is hit by a huge load of messages, consider setting a “max-length” on your queue. With this in place, the queue is kept short by discarding messages from its head, so it can never grow beyond the configured limit.
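
For example, the limit can be set with the x-max-length queue argument (a policy works too); the sketch below uses pika with an illustrative limit of 10,000 messages.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Once the queue holds 10,000 messages, the oldest messages at the head
# of the queue are dropped to make room for new ones.
channel.queue_declare(
    queue="metrics",                       # hypothetical queue name
    arguments={"x-max-length": 10000},
)

connection.close()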

Implement Lazy Queues

Lazy queues are enabled by default in CloudAMQP. If you are unfamiliar with them, lazy queues are queues whose messages are automatically written to disk and only loaded into memory when they are needed. Because incoming messages go straight to disk, RAM usage stays efficient; the trade-off is higher latency and lower throughput.

Utilize Transient Messages

If you want the fastest possible throughput, use transient messages. Keep in mind that persistent messages are written to disk as soon as they reach the queue.

Think About Multiple Queues

In RabbitMQ, queues are single-threaded, and one queue can handle up to about 50,000 messages. On a multi-core system you get better throughput with multiple queues and multiple consumers, and throughput is maximized when the number of queues equals the number of cores on the underlying node(s).

The RabbitMQ management interface collects and stores information about every queue, which can weigh on the server and slow it down. RAM and CPU usage also suffer when the number of queues gets high (in the 1,000+ range). Each queue consumes a share of resources, and when there are thousands of queues, the contention for CPU and disk can become a problem.

Spread Each Queue over Separate Cores

A queue’s performance is limited to a single core, so a better strategy for higher performance is to spread your queues over separate cores. Likewise, if you are working with a RabbitMQ cluster, spread queues over separate nodes. Note that when a queue is declared, it is attached to a specific node.

When you configure a cluster of RabbitMQ brokers, messages published to a given queue are routed to the node where that queue lives. You can spread queues across the nodes manually, but there is a catch: you then have to keep track of where each queue is located. Two plugins help with this, and they are useful both when you have more than one node and when you have a single-node cluster with multiple cores.

Leveraging the Consistent Hash Exchange Plugin

If you are struggling to load-balance messages across queues, a plugin called the consistent hash exchange plugin can help. It provides an exchange that spreads incoming messages evenly across multiple queues, using the message’s routing key.

The plugin hashes the routing key and distributes each message to one of the queues bound to that exchange. Doing the same thing manually is error-prone, because the publisher would need to know the number of queues and their bindings.

If you want to make maximum use of all the cores in your cluster, the consistent hash exchange plugin can serve you well. Bear in mind that you must consume from all of the queues. The plugin’s documentation describes it in more detail.
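
A minimal sketch with pika might look like the following; it assumes the rabbitmq_consistent_hash_exchange plugin is enabled, and the exchange and queue names are illustrative.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# The exchange type is provided by the consistent hash exchange plugin.
channel.exchange_declare(exchange="jobs", exchange_type="x-consistent-hash")

# With this exchange type, the binding key is a numeric weight that
# controls how many hash buckets each queue receives.
for name in ("jobs_0", "jobs_1", "jobs_2", "jobs_3"):
    channel.queue_declare(queue=name)
    channel.queue_bind(queue=name, exchange="jobs", routing_key="1")

# Messages are spread across the queues based on a hash of the routing key.
channel.basic_publish(exchange="jobs", routing_key="user-42", body=b"payload")

connection.close()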

Use Plugin for Sharding

Sharding is the practice of partitioning a large data set into smaller pieces to improve speed and efficiency. In RabbitMQ, the sharding plugin automates this partitioning: you declare an exchange as sharded, and the plugin creates the required queues on every cluster node and shards the incoming messages among them.

With this plugin, a consumer sees what looks like a single queue, even though several shard queues run behind it. The plugin gives you one central place for publishing messages, attaching queues to cluster nodes, and load-balancing messages across multiple nodes.

Turn Off Manual Acks

Message acknowledgements and publisher confirms both have a performance cost, so if you are chasing the best possible throughput, turn off manual acks.
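
As a hedged sketch with pika, turning off manual acks simply means consuming with auto_ack=True; the queue name and callback are placeholders.

import pika

def handle(channel, method, properties, body):
    print("received:", body)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="events")

# auto_ack=True removes the bookkeeping cost of manual acknowledgements,
# at the price of possible message loss if the consumer crashes.
channel.basic_consume(queue="events", on_message_callback=handle, auto_ack=True)
channel.start_consuming()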

Stay Away from Multiple Nodes

A single node gives you the best possible throughput compared with a highly available cluster setup, because queues and messages are not mirrored to other nodes.

Turn Off Non-Used Plugins

Different situations call for different RabbitMQ plugins. Some of them are extremely valuable, but remember that their processing overhead can hurt performance, so they are not always recommended in production.

What Are the Components of a Sharded Cluster in MongoDB?


A sharded cluster in MongoDB is composed of three components: shards, config servers, and mongos. They are described below.

Shards

A shard stores a subset of the sharded cluster’s data; together, all the shards hold the data of the entire cluster. For deployment, each shard should be configured as a replica set to achieve high availability and redundancy. Connect directly to a shard only for local maintenance and administrative operations.

Primary Shard

Every database in a sharded cluster has its own primary shard, which stores all of that database’s un-sharded collections. Bear in mind that there is no link between the primary shard and the primary member of a replica set, so do not be confused by the similarity in their names.

A primary shard is chosen by mongos when a new database is created; mongos picks the shard that currently holds the least data.

It is possible to change the primary shard with the “movePrimary” command, but do not do it casually. Changing the primary shard is a time-consuming procedure: while the migration runs, you cannot use the un-sharded collections of that database, and cluster operations can be disrupted to a degree that depends on how much data is being migrated.
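
As a hedged sketch, the command can be issued from PyMongo roughly as follows; the router address, database name, and shard name are placeholders.

from pymongo import MongoClient

# Connect to a mongos router (address is illustrative).
client = MongoClient("mongodb://mongos.example.net:27017")

# Move the primary shard of the "inventory" database to "shard0001".
# Run this only in a maintenance window: the database's un-sharded
# collections are unavailable while the migration is in progress.
client.admin.command("movePrimary", "inventory", to="shard0001")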

To get a general view of sharding in the cluster, run the sh.status() method from the mongo shell. Its output identifies the primary shard of each database, along with other useful information such as how the chunks are distributed among the shards.

Config Servers

The sharded cluster’s metadata is stored in the config servers. This metadata covers the organization of the components in the cluster along with the state of those components and of the data; information about every shard and its chunks is maintained in the config servers.

The mongos instances cache this metadata and use it to route read and write operations to the appropriate shards. When the metadata changes, the mongos instances update their caches.

Note that MongoDB’s authentication configuration, such as internal authentication and RBAC (role-based access control), is also stored in the config servers. Additionally, MongoDB uses them to manage distributed locks.

Although it may be technically possible to point several sharded clusters at the same config servers, this practice is not recommended: give each sharded cluster its own config servers.

Config Servers and Read/Write Operations

Write Operations

If you use MongoDB, you are probably familiar with its admin and config databases; both are maintained on the config servers. The “admin” database holds the collections related to authentication and authorization as well as the other system collections, while the “config” database holds the sharded cluster’s metadata.

When the metadata changes, for instance when a chunk is split or migrated, MongoDB writes to the config database, using “majority” write concern for these operations.

As a developer, however, you should refrain from writing to the config database yourself in the course of normal operation or maintenance.

Read Operations

MongoDB reads from the admin database for authentication, authorization, and other internal operations.

When the metadata changes, for example while a chunk is being migrated or a mongos starts up, MongoDB reads from the config database, using the “majority” read concern for these operations. Moreover, it is not only mongos that reads from the config servers: shards also read chunk metadata from them.

Mongos

The mongos instances route queries and shard-related write operations. From the application’s perspective, mongos is the only interface to the sharded cluster: applications never communicate with the shards directly.

Mongos keeps track of what is on each shard by caching metadata from the config servers, and it uses that metadata to route operations from clients and applications to the right mongod instances. Mongos instances hold no persistent state of their own, which keeps their resource consumption to a minimum.

Mongos instances are usually run on the same systems as the application servers, but if your situation demands it, you can also run them on other dedicated resources, such as alongside the shards.

Routing and Results

When routing a query, a mongos instance determines which shards the query must reach and then establishes a cursor on each of those shards.

Mongos then merges the data returned by those shards into the result document. Some query modifiers, such as sorting, may need to be executed on the shards before mongos retrieves the results. To handle query modifiers, mongos does the following.

  • For un-sorted query results, mongos uses a “round-robin” strategy to pull results from the shards.
  • If the result size is restricted by the limit() method, mongos passes the limit on to the shards and then re-applies it to the merged result before returning it to the client.

If the query uses the skip() method, mongos cannot pass it on to the shards. Instead, it retrieves the unskipped results from the shards and applies the skip while assembling the complete result.

What Are Replica Set Members in MongoDB?


Among the many database concepts out there, one of the most important is replication. Replication copies data between multiple systems so that each of them holds the same information. This keeps data available, which in turn improves the performance of the application, and it also helps with backups, for instance when one of the systems is damaged by a cyberattack. In MongoDB, replication is achieved with a group of mongod processes known as a replica set.

What Is a Replica Set?

A replica set is a group of mongod processes that work together. Replica sets are the key to redundancy and high data availability in MongoDB. Members of a replica set fall into three categories.

  1. Primary member.
  2. Secondary member.
  3. Arbiter.

Primary Member

There can be only one primary member in a replica set; it acts as the “leader” of the other members. The major difference is that the primary receives the write operations: when MongoDB has to process writes, it forwards them to the primary, which records them in its operation log (oplog). The secondary members then read the oplog and apply the recorded operations to their own copies of the data.

All members of a replica set can serve read operations, but by default MongoDB directs reads to the primary member.

Members of a replica set report their availability to each other through a “heartbeat”. If a member’s heartbeat is missing for the configured time limit (usually a matter of seconds), that member is considered unreachable. A primary can become unavailable for many reasons, for instance when the data center hosting it loses power.

Since replication cannot operate without a primary member, an “election” is held to appoint a primary member from the secondary members.

Secondary Member

A secondary member depends on the primary to maintain its data: it replicates the primary’s oplog asynchronously. Unlike the primary, a replica set can have many secondary members, and the more the merrier. As explained above, however, only the primary accepts write operations, so a secondary is not allowed to process them.

If a secondary member is allowed to vote, then it is eligible to trigger an election and take part as a candidate to become the new primary member.

A secondary member can also be customized to achieve the following.

  • It can be made ineligible to become the primary member, so it can serve as a standby in a secondary data center.
  • Applications can be blocked from reading from it, which lets it serve special applications that need to be kept apart from the normal traffic.
  • It can be configured to keep a historical snapshot of the data as a backup and contingency plan, for example so the snapshot can be used when a user deletes a database by mistake.

Arbiter

An arbiter is distinct in that it does not keep its own copy of the data, and it is ineligible to become the primary. So why are arbiters used? They make elections more efficient: an arbiter casts a single vote, and it is added to break potential “tie” scenarios between members.

For example, suppose a replica set has 5 members and the primary becomes unavailable. In the election that follows, two of the secondaries could receive two votes each, so neither can become the primary. Arbiters are added to break such standoffs: their vote completes the election so a primary can be selected.

It is also possible to add another secondary member to improve elections, but a new secondary is heavyweight because it has to store and maintain data. Arbiters carry no such overhead, which makes them the most cost-effective solution.

It is recommended that an arbiter never run on a site that also hosts the primary or a secondary member.

If the authorization configuration option is enabled, an arbiter also exchanges credentials with the other members of the replica set, where authentication is implemented using “keyfiles”. MongoDB encrypts this authentication process.

These cryptographic measures keep the authentication exchange safe from exploitation. However, because arbiters do not store or maintain data, they cannot hold the internal table that contains the users and roles used for authentication. The arbiter therefore relies on the localhost exception for authentication purposes.

Note that in recent MongoDB versions (3.6 onwards), if you upgrade a replica set from an older version, the priority value of its arbiters is changed from 1 to 0.

What Is the Ideal Replica Set?

There is a limit on the number of members in a replica set: at most 50, of which no more than 7 can be voting members. For a member to vote, its members[n].votes setting in the replica set configuration must be 1.

Ideally, a replica set should have at least three data-bearing members: one primary and two secondaries. A three-member set with one primary, one secondary, and an arbiter is also possible, but it comes at the expense of redundancy.
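
A minimal sketch of initiating a primary-secondary-arbiter set with PyMongo follows; the host names are placeholders, it assumes a recent driver that supports the directConnection URI option, and in practice three data-bearing members are usually preferable.

from pymongo import MongoClient

# Connect directly to the member that will run the initiation.
client = MongoClient("mongodb://node1.example.net:27017/?directConnection=true")

config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "node1.example.net:27017", "priority": 2},
        {"_id": 1, "host": "node2.example.net:27017", "priority": 1},
        {"_id": 2, "host": "node3.example.net:27017", "arbiterOnly": True},
    ],
}

# Every listed member votes by default (members[n].votes is 1).
client.admin.command("replSetInitiate", config)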

How to Work with Data Modeling in MongoDB with an Example


Were you assigned to work on the MEAN Stack? Did your organization choose MongoDB for data storage? As a beginner in MongoDB, it is important to familiarize yourself with data modeling in MongoDB.

One of the major considerations for data modeling in MongoDB is to assess the DB engine’s performance, balance the requirements of the application, and think about the retrieval patterns. As a beginner in MongoDB, think about how your application works with queries and updates and processes data.

MongoDB Schema

MongoDB’s NoSQL schema is considerably different from that of relational SQL databases. In a relational database, you have to design and specify a table’s schema before you can insert rows into it. MongoDB has no such requirement.

MongoDB collections work differently: two documents (the rough equivalent of rows in SQL databases) in the same collection do not have to share the same schema, so a document is not forced to adhere to a fixed set of fields and data types.

If you need to add fields to a document or modify its existing fields, you simply write the document with its new structure; there is no schema migration step.

How to Improve Data Modeling

During data modeling in MongoDB, you must analyze your data and queries and consider the following tips.

Capped Collections

Think about how your application will use the database. For instance, if it performs a very high volume of insert operations, you can benefit from capped collections.
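
As a small sketch with PyMongo, a capped collection could be created like this; the database name, collection name, and size limits are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["appdb"]                      # hypothetical database name

# A capped collection has a fixed size and overwrites its oldest documents
# once full, which suits high-volume, insert-heavy workloads such as logging.
db.create_collection("recent_events", capped=True, size=5 * 1024 * 1024, max=10000)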

Manage Growing Documents

Some write operations make documents grow quickly, for example when new elements keep being pushed onto an array or new fields keep being added. Keeping track of how your documents grow is a good practice when modeling your data.

Assessing Atomicity

In MongoDB, operations are atomic at the level of a single document: a single write operation can change the contents of only one document. Some write operations touch multiple documents, but behind the scenes they still modify one document at a time. You therefore have to make sure your model captures the atomicity your requirements demand.

Embedded data models are often recommended when you want to rely on atomic operations. If you use references instead, your application is forced to issue separate read and write operations.

When to Use Sharding?

Sharding is an excellent option for horizontal scaling. It pays off when you have to deploy very large data sets under a heavy read and write load. Sharding splits a database collection across shards, which lets the documents of the collection be used more efficiently.

To distribute data across shards, MongoDB requires you to choose a shard key. The shard key has to be selected carefully, or it can hurt the application’s performance; the right shard key enables query isolation and a notable increase in write capacity. Take your time when selecting the field that will serve as the shard key.
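
A hedged sketch of enabling sharding and choosing a hashed shard key via PyMongo is shown below; the database, collection, and field names are assumptions for the example.

from pymongo import MongoClient

# Connect to a mongos router (address is illustrative).
client = MongoClient("mongodb://mongos.example.net:27017")

# Enable sharding for the database, then shard one of its collections on
# a hashed key so writes are spread evenly across the shards.
client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.events", key={"device_id": "hashed"})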

When to Use Indexes?

The first tool for improving query performance in MongoDB is an index. To decide what to index, go through your queries and look for the fields that appear most often; build your indexes from that list and you should see a considerable improvement in query performance. Bear in mind that indexes consume both RAM and disk space, so keep these costs in mind when creating them, or the strategy can backfire.
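
For instance, a minimal PyMongo sketch of indexing a frequently queried field might look like this; the collection and field names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["appdb"]["users"]          # hypothetical collection

# Index the field that appears most often in query filters.
users.create_index([("email", 1)])

# Queries filtering on the indexed field can now avoid a full collection scan.
users.find_one({"email": "bruce@ittechbook.com"})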

Limit the Number of Collections

Sometimes the application genuinely needs several collections. At other times, you may find you created two collections where one would have sufficed. Since each collection carries its own overhead, avoid creating more than you need.

Optimize Small Document in Collections

If you have one or more collections holding a large number of small documents, you can often gain performance by switching to an embedded data model: roll the small documents up into larger ones by grouping them according to a logical relationship.

Every document in MongoDB carries some overhead. For a single document this is negligible, but across many documents it adds up. To optimize your collections, you can use the following techniques.

  • If you have worked with MongoDB, you will have noticed that it automatically creates an “_id” field in each document, assigns it a unique 12-byte “ObjectId”, and indexes it. This default can waste storage. To optimize the collection, supply your own value for the “_id” field when creating a document, reusing a value the document would have stored anyway instead of adding a generated one. There is no explicit restriction on which value you use for “_id”, but do not forget that it acts as the primary key in MongoDB, so the value must be unique.
  • MongoDB documents store the names of all their fields, which does not bode well for small documents: in a small document, a large share of the size can come from the field names themselves. Therefore, use short but still meaningful field names. For instance, if there is the following field.

father_name: "Martin"

Then you can change it by writing the following.

f_name: "Martin"

At first glance you have only dropped 5 letters, but for the application you have shaved off a significant number of bytes used to represent the field, and doing the same across all your fields can make a noticeable difference.

Let’s see how this works with an example.

The major decision you face when modeling data in MongoDB is what your document structure should look like. You have to decide how to represent the relationships in your data, whether they are one-to-one or one-to-many. There are two strategies for this.

Embedded Data Models

MongoDB lets you “embed” related data inside the same document; embedded data models are also called de-normalized models. Consider the following example. The document starts with two ordinary fields, “_id” and “name”, to describe a user. For contact information and residence, however, a single flat field is not enough because each has several attributes, so the embedded model nests a document within the document: for “contact_information”, curly brackets turn the field into a sub-document holding “email_address” and “office_phone”, and the same technique is applied to the user’s residence.

{
  _id: "abc",
  name: "Bruce Wayne",
  contact_information: {
    email_address: "bruce@ittechbook.com",
    office_phone: "(123) 456-7890"
  },
  residence: {
    country: "USA",
    city: "Bothell"
  }
}

Storing related data in the same document is useful because it limits the number of queries needed for standard operations. The question on your mind is probably: when is the right time to use an embedded data model? Two scenarios are a good fit.

First, whenever you find a one-to-many relationship between entities, embedding is worth considering. Second, whenever an entity has a “contains” relationship with another, this model also works well. In embedded documents, you access nested information using dot notation.

One-to-Many Relationship with Embedding

One-to-many is a common type of database relationship in which a document in collection A can match multiple documents in collection B, while a document in B matches only one document in A.

Consider the following example to see when de-normalized data models hold an advantage over normalized data models. To select one, you have to consider the growth of documents, access frequency, and similar factors.

In the normalized (referenced) form, the data is spread over three documents.

{
  _id: "barney",
  name: "Barney Peters"
}

{
  user_id: "barney",
  street: "824 XYZ",
  city: "Baltimore",
  state: "MD",
  zip: "xxxxx"
}

{
  user_id: "barney",
  street: "454 XYZ",
  city: "Boston",
  state: "MA",
  zip: "xxxx"
}

If you look closely, it is clear that the application has to keep fetching the address data with separate queries, retrieving fields it does not really need, so several queries are wasted.

On the other hand, embedding the addresses boosts the application’s efficiency: a single query now returns everything.

{
  _id: "barney",
  name: "Barney Peters",
  addresses: [
    {
      user_id: "barney",
      street: "763 UIO Street",
      city: "Baltimore",
      state: "MD",
      zip: "xxxxx"
    },
    {
      user_id: "barney",
      street: "102 JK Street",
      city: "Boston",
      state: "MA",
      zip: "xxxxx"
    }
  ]
}

Normalized Data Models (References)

In normalized data models, you use references to represent relationships between documents. To understand references, consider the following example.

References or normalized data models are used in the cases of one-to-many relationship models and many-to-many relationship models. In some instances of embedded documents model, we might have to repeat some data which could be avoided by using references. For example, see the following example.


In this example, the “_id” field of the user document is referenced by two other documents, which store the same value in a field of their own.

References are suitable for datasets which are based on hierarchy. They can be used to describe multiple many-to-many relationships.

While references provide more flexibility than embedding, they also require the application to issue follow-up queries to resolve them. Put simply, references mean more round trips between the client application and the server.

One-to-Many Relationship with References

There are some cases in which a one-to-many relationship is better off with references rather than embedding. In these scenarios, embedding can cause needless repetition.

{
  title: "MongoDB Guide",
  author: ["Mike Jones", "Robert Johnson"],
  published_date: ISODate("2019-01-01"),
  pages: 1200,
  language: "English",
  publisher: {
    name: "ASD Publishers",
    founded: 2009,
    location: "San Francisco"
  }
}

{
  title: "Java for Beginners",
  author: "Randall James",
  published_date: ISODate("2019-01-02"),
  pages: 800,
  language: "English",
  publisher: {
    name: "BNM Publishers",
    founded: 2005,
    location: "San Francisco"
  }
}

As you can see, whenever a query needs publisher information, that information is repeated in every book document. By using references and storing the publisher information in a separate collection, you avoid that repetition and improve performance. When the number of books per publisher is small and unlikely to grow much, keeping an array of book references inside the publisher document works well. For instance,

{
  name: "ASD Publishers",
  founded: 2009,
  location: "San Francisco",
  books: [101, 102, …]
}

{
  _id: 101,
  title: "MongoDB Guide",
  author: ["Mike Jones", "Robert Johnson"],
  published_date: ISODate("2019-01-01"),
  pages: 1200,
  language: "English"
}

{
  _id: 102,
  title: "Java for Beginners",
  author: "Randall James",
  published_date: ISODate("2019-01-02"),
  pages: 800,
  language: "English"
}

Factors to Consider in IoT Implementation


Have you resolved to use IoT to power your organization?

Building an entire IoT ecosystem is a big challenge. An IoT implementation is not comparable to other, largely software-based IT deployments, because it involves many components: devices, gateways, and platforms. If you plan to adopt IoT, the following factors deserve your consideration.

Security

The year 2016 turned out to be unforgettable for the telecommunications industry: telecom infrastructure was badly hit by a DDoS attack, and many users struggled to connect to the internet. Early speculation pointed to an element of cyber warfare, and nation-backed attacks are nothing new; over the past few years the battleground has shifted from land to the digital realm as cybercriminal groups find support from different countries.

In this case, however, the culprit was something else: a piece of malware known as Mirai. It soon emerged that Mirai built a “botnet”, a network of compromised systems that are used as digital zombies to carry out malicious actions. Mirai compromised a range of IoT devices, including residential gateways, digital cameras, and even baby monitors, breaking into them with a brute-force strategy.

Unfortunately, this is just the tip of the iceberg. Botnets are not the only threat looming over IoT: ransomware has made matters worse in medical IoT, and the recent cyber attacks on Bristol Airport and the Atlanta police paint a worrisome picture of how insecure IoT devices are against modern attacks.

If any part of a business’s IoT ecosystem is hacked, the attackers gain remote access to trigger actions. The cybersecurity strategy of any organization using IoT must therefore cover the security of the complete system, from sensors and actuators to the IoT platform, to minimize loopholes.

Authentication

Authentication is one of the key points of an IoT implementation. Make sure the system’s security does not rely entirely on one-time authentication mechanisms. An enterprise IoT infrastructure demands that connectivity for devices and endpoints be carefully assessed.

There should be a “trusted” environment: all users, applications, and devices are properly identified so they can be authenticated easily, while unknown devices are kept off the network. Appropriate roles and access rights must be defined for every connected device so the network only permits authorized activities, and data entering or leaving the ecosystem can only be accessed by users with the required clearance and authority.

Reference Datasets

The data produced by IoT devices is only useful in the proper context, and that context often comes from third-party data: aggregated values, look-up tables, and historical trends. For instance, in home automation IoT can adjust the temperature of a home, and the decision to run the air conditioner or the heater depends on real-time data pulled from weather sources. Likewise, a connected car has to send its location coordinates to the closest service center. Make sure your IoT devices have adequate reference data available.

Standards

During IoT implementation, you have to factor in all the activities involved in managing, processing, and saving the data coming from the sensors. Aggregation enriches the value of that data by enhancing its frequency, scope, and scale, but it requires the correct use of several standards.

There are two standards which are associated with aggregation.

  • Technology Standards – These include data aggregation standards such as ETL (Extract, Transform, Load), communication protocols such as HTTP, and network protocols such as Wi-Fi.
  • Regulatory Standards – These are specified and overseen by federal authorities; examples include HIPAA and the FIPPs.

The use of standards raises several questions, for example: which standard should be used to manage unstructured data? Traditional relational databases store structured data and are queried with SQL, while modern databases such as MongoDB use NoSQL to store unstructured data.

Data Sensitivity

When organizations began providing services and products in the digital realm, they collected user information for processing, but no one knew exactly what happened behind the scenes. Was the information used only to provide a better user experience, or was it exploited for hidden purposes? Studies revealed that a large share of organizations sold private customer data to third parties.

In the past few years, a strong push has emerged to improve data transparency for clients. For instance, in May 2018 the European Union implemented the GDPR (General Data Protection Regulation), a comprehensive set of data privacy regulations designed to make organizations transparent about their data processing. The objective of the GDPR is to protect the privacy of EU residents, and it applies to any business that engages with EU residents, irrespective of where the business is located.

These trends matter for IoT because its sensors and devices store and process large data sets, so it is vital that none of this data handling breaches privacy laws. There are four main tips for data sensitivity.

  • Understand the exact nature of data which is going to be stored by the IoT equipment.
  • Know the security measures used to encrypt or secure data.
  • Identify roles in the businesses which access the data.
  • Learn data processing of each component of the IoT ecosystem.

 

What Is the Internet of Things Ecosystem Composed of?



The internet of things is quite popular nowadays. Many people are familiar with the technology and its role in improving our lives, yet few know what the internet of things is actually made of. Read this post to get the complete picture of the internet of things ecosystem.

Hardware in Device

The “things” in IoT are devices; they act as the intermediary between the digital and the real world. The chief job of an IoT device is to gather data, which it does through sensors. If you only need a little data, a basic sensor may be enough, whereas industrial applications require an extensive array of sensors. Devices may also carry actuators, which physically trigger actions.

To choose an IoT device, you have to weigh factors such as size, reliability, lifespan, and, most importantly, cost. A small device like a smartwatch does fine with a System on a Chip (SoC), while bigger and more complex solutions call for programmable boards such as Arduino or Raspberry Pi. If your aim is to instrument a manufacturing plant, you are looking at industrial-scale platforms such as PXI.

Software in Device

A smart device uses sensors to interact with the real world, but much like a robot it also needs software to act on what it senses. It is the combination of hardware and software that elevates a device to a “smart device.” The software communicates with other devices (it is called the internet of things for a reason) and with other parts of the ecosystem such as the cloud, and it lets you perform real-time business intelligence on the data your hardware (the sensors) collects.

Getting the device software right is critical: the better the code, the more capabilities you can build into your IoT ecosystem. Bear in mind that the hardware side of IoT is tricky and costly to change, so invest your effort in the software of your devices; you can then fit it onto almost any piece of hardware. The software splits into two layers.

  • Edge OS

This is the operating system layer: for instance, the kind of I/O functionality you will need in the future, plus the OS-level settings that let your application layer run smoothly.

  • Edge Applications

This is the application layer of the software, and it is where the real customization happens. For instance, if an IoT device is installed in a manufacturing plant, the application can watch for a rapid rise or fall in temperature; when the device senses a big swing, it can instantly notify the operators and let the plant’s systems react in time.

Communications

In the internet of things ecosystem, communication refers to how your device is networked: how it connects to the outside digital world and which protocols it uses. For example, your business may run on a LAN, and to ensure your devices only “talk” to other approved devices, you have to design a tailored network for your sensors.

Today, smart buildings use BACnet protocol with their systems. If you plan to use your device for home automation purposes, then you can make sure that it runs on the BACnet protocol. Even if your objective is to use it for a different purpose, it may prove helpful in the future to establish a connection between your device and others.

Likewise, you have to plan the connection of your sensors with the Cloud. In some cases, you may want to keep the data private in certain sensors.

Gateway

A gateway handles the bidirectional exchange of data between IoT networks and protocols. Its job is similar to a translator’s, and it glues the ecosystem together: it can take data from a sensor and forward it to any other component of the ecosystem.

Gateways can also be used to perform specific operations. For instance, an IoT provider can use a gateway to trigger an action on the sensor’s accumulated information. When gateways complete their set of routines, they transfer information to other parts of the ecosystem.

Gateways are especially useful for adding security to a solution: encryption can be applied at the gateway to protect data, so it serves as a vital shield against cyber attacks, especially those aimed at IoT, such as botnets.

Cloud Platform

The cloud platform is perhaps the most important component of all. Cloud enthusiasts will find it similar to the “software-as-a-service” model. The platform is linked to the following segments.

Amassing Data

Remember how we said earlier that sensors collect data? They stream that data to the cloud. While designing your IoT solution, you should know your data requirements, such as the total amount of data you will process in a day, a week, a month, or a year. Data management is necessary here to address scalability concerns.

Analytics

Analytics covers processing the data, identifying patterns, forecasting, and applying machine learning algorithms. It is what turns a cluster of disorganized sensor data into meaningful information.

APIs

APIs can be introduced on the device or at the Cloud level. With APIs, you can link different stakeholders in your IoT ecosystems, such as your clients and partners, to allow for seamless communication.

Cloud Applications

This is the part most familiar to non-IT folks: the end-user side of the ecosystem, where the client interacts with the IoT system. The app on your wearable smartwatch is an example.

As long as your smart device has a display, your client will have an application to interact with, which makes the smart device accessible from any place at any time.

How Does Random Forest Algorithm Work?


One of the most popular machine learning algorithms is the random forest. In nature, a forest is judged by its trees: more trees usually means a healthier forest. The random forest algorithm works on the same principle. As its name suggests, it builds a forest of trees, specifically the decision trees we covered in an earlier post, and those decision trees are what let a random forest make accurate predictions.

The forest is referred to as random because of the randomness of its components. Each of the trees in this forest receives training through a procedure known as the bagging method. This method enhances the final result of the algorithm.

The random forest classifier is convenient to use, and it injects extra randomness as the trees grow.

Other tree algorithms pick the best feature among all features when splitting a node. A random forest instead searches for the best feature within a random subset of the features, which increases the model’s diversity.

Bear in mind that only a random selection of the features is considered by the random forest at each node split. You can add even more randomness to the trees with a technique known as random thresholds, where random thresholds are tried for each feature instead of searching for the best possible threshold.

To illustrate this point, take the example of a man named Bob who wants to dine at a new restaurant. Bob asks his friend James for a few suggestions. James asks Bob questions related to his likes and dislikes in food, budget, area, and other relevant questions. In the end, James uses Bob’s answers to suggest a suitable restaurant. This whole process mirrors a decision tree.

Bob, however, is not fully happy with James’s recommendation and wants to explore other dining options, so he asks more friends. He goes to 5 more people, who behave like James: they ask him relevant questions and each gives a recommendation. In the end, Bob looks at all the answers and picks the most common one. Each of Bob’s friends acts as a decision tree, and their combined answers form a random forest.

Determining the relative importance of each feature to a prediction is easy in a random forest, especially compared with other algorithms. If you are looking for a tool to compute these values, consider scikit-learn, a machine learning library for Python.

After training, a score is assigned to every feature, and the scores are scaled so that all the importances sum to one.
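
As a short, hedged sketch with scikit-learn (using the bundled iris dataset purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ is scaled so that the values sum to 1.
for name, score in zip(load_iris().feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")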

Assessing feature importance matters when deciding whether to drop a feature. A feature is usually dropped when it adds little or nothing to the prediction, because too many features lead to over-fitting.

Hyper-parameters

One of the things that makes the random forest special is that it produces good results without much tuning of its hyper-parameters. Like a decision tree, a random forest has hyper-parameters of its own.

In a random forest, some hyper-parameters are also used to speed the model up. The following are scikit-learn’s main hyper-parameters; a short sketch putting them together follows the list.

  • n_jobs

It tells the engine how many processors it may use. A value of “1” means only a single processor can be used, while “-1” means there is no restriction.

  • n_estimators

It is the number of trees built before the maximum vote or the average of the predictions is taken. A larger number of trees increases reliability but slows the model down.

  • random_state

It makes the model’s output reproducible: given the same training data, the same hyper-parameters, and a fixed value for random_state, the model produces identical results.

  • min_samples_leaf

It sets the minimum number of samples required at a leaf node; a split of an internal node is only considered if it leaves at least that many samples in each branch.

  • max_features

It sets the maximum number of features the model considers when looking for the best split at a node.
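
Putting these together, a sketch of a classifier configured with the hyper-parameters above might look like this; the values shown are illustrative, not recommendations.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,     # trees built before voting/averaging predictions
    n_jobs=-1,            # use all available processors
    random_state=42,      # makes the output reproducible
    min_samples_leaf=5,   # minimum samples required at a leaf node
    max_features="sqrt",  # features considered when splitting a node
)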

Now that you understand how the random forest algorithm works, you can start implementing it in code. For more information about machine learning, check the other blogs on our website.

What Is Classification and Regression in Machine Learning?



Machine learning is divided into two types: supervised and unsupervised. In supervised machine learning, the model is given inputs together with their known outputs and learns a mapping it can apply to future data. Supervised machine learning powers applications such as chatbots and facial expression recognition systems.

Both classification and regression fall into the category of supervised learning. So what are they and why is it necessary to understand them?

Classification

If your problem requires you to predict discrete values, use classification. Whenever the solution demands a definite, predetermined range of outputs, you are most likely dealing with classification. The following scenarios are a few examples of where classification is used.

  • To determine consumer demographics.
  • To predict the likelihood of a loan.
  • To check who wins or loses a coin toss.

When a problem can have only two answers (yes or no), the classification falls under the category of binary classification.

Multi-label classification, on the other hand, can assign several labels at once. It comes in handy for the consumer segmentation mentioned above, for grouping images, and for text and audio analysis. For instance, a single post on a sports blog can be about several sports at once, such as basketball, baseball, tennis, and football.

There is also multi-class classification, in which each sample is assigned to exactly one of several classes. For example, a fruit can be an apple or a banana, but it cannot be both at the same time.

Classification works from observed values: it uses the input features to predict which of the possible categories a sample belongs to. The algorithm that maps a given input to a specific category is called the classifier, and a feature is a measurable property of the input.

Before building a classification model, you first pick a classifier and initialize it. You then train the classifier on your data, and finally you use it to predict the label y for observed values of x.
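
As a rough sketch, that pick-initialize-train-predict workflow looks like this in scikit-learn; the dataset and the choice of classifier are purely illustrative.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Observed features X and their known labels y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Pick and initialize a classifier, 2. train it, 3. predict labels.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.predict(X_test[:5]))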

Regression

Regression is the counterpart of classification: it is used to predict results where continuous values are at play. Unlike classification, the output is not restricted to a fixed set of labels; it can take any value in a continuous range.

Linear regression is one of the leading algorithms. It is sometimes underestimated because it looks too simple, but precisely because it is simpler than the alternatives it is useful in many situations. You can use linear regression to estimate property prices or to assess the churn rate of customers, among other tasks.
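
A minimal sketch with scikit-learn follows; the numbers are made-up toy data, used only to show that the prediction is a continuous value.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: house size in square metres vs. price (illustrative values).
sizes = np.array([[50], [80], [110], [140], [170]])
prices = np.array([150_000, 210_000, 285_000, 340_000, 410_000])

model = LinearRegression()
model.fit(sizes, prices)

# Predict a continuous value for an unseen input.
print(model.predict([[100]]))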

For more details, keep following this blog series. If you have any questions, you can contact us to clear up your confusion.

What Is Virtualization? What Are the Types of Virtualization?


Recently I was working on a POC that led me to a very intriguing topic: virtualization and how it is changing IT culture.

Virtualization is everywhere in today’s cloud-driven IT. As the name suggests, it involves creating a “virtual” environment on a server so that the required programs can run without interfering with the server’s other platforms and applications. In a traditional architecture, a server runs one operating system with a single application on top of it; with virtualization, a new layer is added on top of the server that can host multiple operating systems, each with its own applications.

To understand virtualization, you also need to know what a virtual machine (VM) is. A VM is a strongly isolated container that bundles an operating system and an application, and every VM is independent of the others. Many VMs can sit on a single physical machine, so one host can run multiple operating systems and applications at once.

The software that links the host machine and the virtual machines is called a hypervisor. The hypervisor mediates between the virtual environment and the server’s hardware, and it is responsible for allocating resources such as memory and CPU cores.

Types of Virtualization

Virtualization can be applied to multiple components in the IT infrastructure of an organization. Following are the fundamental types of virtualization.

  • Server Virtualization

Server virtualization is also known as hardware virtualization. The idea behind such virtualization is to break down a server “logically” into multiple segments of hardware which effectively work as virtual servers. All of these smaller servers are powered to host their own virtual machines.

To any process making a hardware request, however, the whole set still appears as a single server. This technique boosts an organization’s processing capacity without spending on new servers, and it also improves uptime and hardware utilization.

  • Storage Virtualization

In storage virtualization, all storage devices are consolidated so that they appear and function as a single stand-alone device. This homogenization brings greater speed and capacity, load balancing, better optimization, and less downtime, problems that often plague organizations juggling several different storage devices.

You see a form of storage virtualization whenever you create more than one partition out of a hard drive. When several storage devices are combined into a single device, the technique is more commonly known as block virtualization.

  • Memory Virtualization

This type of virtualization collects and combines the physical memory from all servers and accumulates it into a virtualized pool. This is beneficial because it allows expanding the contiguous working memory.

Operating systems use similar techniques. For instance, Microsoft’s operating system can let part of the disk act as an extension of RAM; that is OS-level memory virtualization, where the operating system provides the memory pool. At the application level, applications provide the memory pool instead.

  • Software Virtualization

In software virtualization, several virtual environments are generated which are then run on the host machine. Such virtualization allows the guest OS to function. For instance, if a host machine is natively running on Windows 10, then software virtualization can help to run Android 8.0 on the machine where it makes use of the exact hardware resources which are granted to the host machine.

Software virtualization is subdivided into OS virtualization, application virtualization, and service virtualization. In application virtualization, you host applications in the virtualized space; in service virtualization, you host an application’s services and processes there.

  • Network Virtualization

This kind of virtualization divides a single physical network into several mini-networks. These mini-networks can be configured to communicate with each other, or communication between them can be cut off entirely.

Network virtualization is effective because it can restrict file transfer between networks. It is also a secure option: if the company is hit by a cyberattack such as a man-in-the-middle attack, only a single mini-network is breached instead of the whole network.

Similarly, it enables better monitoring and helps networking professionals track how data is used. The result is greater reliability, because a problem in one network does not necessarily disrupt another.

  • Desktop Virtualization

Organizations routinely use this form of virtualization to improve productivity. In desktop virtualization, a remote server stores the users’ desktop environments, allowing them to work from anywhere in the world. Employees can handle urgent work whether they are on vacation or on sick leave, which gives organizations a big advantage and helps them maximize revenue.

  • Data Virtualization

When organizations use several data sources, such as MongoDB, SQL Server, and other database products, they run into integration issues. To address them, they use data virtualization to aggregate data efficiently.

Data virtualization is an approach in which data is gathered from multiple sources to build a single, virtual, logical view. Such a view may be consumed by a dashboard, an analytics tool, or a software application. Data virtualization puts a layer of abstraction over the stored data, hiding its location, storage structure, access language, and application programming interface.

  • Service Virtualization

When an organization runs parallel development streams, a service provider often becomes a bottleneck for its consumers; even an existing service can become a bottleneck because of availability or connectivity issues. Service virtualization is the approach of emulating the behavior of the original component by reproducing its request and response structure, which lets teams develop and test faster and more independently.

Service virtualization is different from a mock service. A mock service is largely a static request-response implementation, whereas service virtualization is a fuller emulation. Also, mock services are stateless and context-unaware, while a virtualized service can be stateful and state-aware.

Final Thoughts

With the emergence and advancement of cloud computing, virtualization has changed the game for many organizations. Companies make it a major part of their disaster recovery and backup strategy: when a disaster such as a cyberattack or a power outage strikes, virtualization simplifies and speeds up recovery. It also helps organizations spend their hardware budget effectively and avoid unnecessary investment in servers, an effect felt especially strongly in SMEs (small and medium enterprises).

Today, by limiting the wastage of computing equipment and minimizing costs, virtualization has truly become a key IT component of any successful organization.