The Best Practices for Hybrid Cloud Development


Hybrid cloud has changed the way applications are built and run. It has raised consumer expectations, which in turn has a strong impact on business revenue streams. To adapt quickly, consider the following best practices for hybrid cloud development.

Selection of IDE

There is an extensive list of IDEs (integrated development environments) that can be used for hybrid cloud development. When creating hybrid cloud applications, you should target IDEs with the following features.

  • An IDE must be compatible with several programming languages to power the requirements of enterprise development.
  • Prefer IDEs that are open source and come with strong community support. Likewise, standardize on only a few IDEs so that patching, upgrading, and maintenance tasks remain convenient.
  • Because the target is the cloud, it is almost essential that the IDE can connect to cloud solutions like Oracle, Salesforce, Azure, and AWS.
  • IDEs must have seamless integration for popular source code repositories like Gitlab, Bitbucket, Subversion, etc.
  • The chosen IDE must help developers with importing common libraries, automatic code generation, generating template projects, shortcut keys for common code constructs, auto-complete, assistance with compilation errors, and code recommendations.
  • If API management tools like Akana, Mule, and Apigee are compatible with the IDE and developers can use them for publishing and pulling content directly, then it can save a lot of time.

These are must-have features if you wish to breeze through enterprise-level hybrid cloud development. AWS Cloud9 and Eclipse are among the IDEs that check all of the above boxes.

Automation

A powerful toolchain requires integrated governance and security, an efficient sequence of procedures, planned configuration, and cloud-agnostic tools. Developers should limit the number of steps in the toolchain so that builds stay fast while quality assurance for the written code is maintained.

Design Patterns

The use of design patterns ensures that applications conform to industry standards. For hybrid cloud development, you have to check which design patterns fit your use case. Selecting the wrong design pattern can create confusion and undermine the robustness of the application.

Application replicas are spun up and torn down constantly in hybrid cloud environments. Therefore, it is necessary to analyze in detail how and when objects are created and destroyed. In some use cases, rapid scaling may require special handling for these objects.

Agile

Agile and Lean practices allow development teams to deliver more value to stakeholders. VersionOne, Jira, and similar tools can help you not only monitor, analyze, and prioritize your work but also provide transparency for stakeholders. Practices like XP (Extreme Programming) can have a positive impact on code written for hybrid cloud development. Similarly, with peer review and pair programming, you can ensure that your code is written and verified effectively.

Server Selection

Developers can pick from many types of servers for hybrid cloud applications. The majority of IDEs ship with their own default servers. Cloud experts recommend using a server type and version as close as possible to the one the code will be deployed on. Try to choose a cloud-agnostic, lightweight server for the job; lightweight servers can save a great amount of time when starting and deploying code.

Service Discovery

Whether your IT architecture is cloud-based or on-premises, the number of services is bound to grow. There comes a point where a strong service discovery process becomes absolutely necessary. Depending on your tools and use case, you can opt for server-side discovery, gateway moderation, or self-registration. Tools such as Mule Service Registry, Istio, and API Connect can help here. Before choosing any of these tools, assess their impact on the performance of your hybrid cloud application.

Enhancement

As cloud development gained traction, it became easy to rebuild and tear down instances regularly. This enables frequent patching and enhancement of the OS, third-party libraries, middleware, and the existing code base. Developers should patch and upgrade whenever possible; doing so not only limits defects and vulnerabilities but also improves the cloud application's upgrade cycle. A good practice is to put provisioned servers on a regular upgrade cycle, treated as a portfolio activity for your IT infrastructure.

Versioning

For IT artifacts, versioning is one of the most important aspects. Perhaps the most popular scheme is Semantic Versioning, though depending on your use case you can use others as well. With tools like Git and Bitbucket, you can not only apply versioning to your artifacts but also branch, merge, and revert them as needed. Versioning should therefore be applied to all of your artifacts.

Third-Party Libraries

When the complexity of your hybrid cloud application reaches higher levels, you can take advantage of third-party libraries. However, they come with their own concerns: vulnerabilities, dependencies, and vendor lock-in. For best results, use these libraries sparingly, and pick those that are open source and backed by strong community support. Choosing libraries with a proven track record will give you better results.

Conversions

XML and JSON are the most common formats for exchanging information between heterogeneous systems. You need a standard solution that can convert or translate these messages into objects. If you are working with C#, you can use XML-to-C# and JSON-to-C# converters, while Java programmers can use JAXB, Jackson, and the Java API for JSON Processing. This assists developers in handling incoming messages. Moreover, RAML and Swagger are gaining traction, so their tools for Java and .NET can also prove helpful.


How to Increase the Availability of an Application with RabbitMQ?


If you have used RabbitMQ to manage the messaging part of your applications, then you are certainly aware of the configuration issues that recur with it. Software engineers who have used RabbitMQ for an extended period know that it is prone to certain configuration errors.

Therefore, it is necessary to configure it correctly; that way, you can achieve the best possible performance and the most stable cluster. This article focuses on a number of tips and techniques that can improve the availability and uptime of your application.

Limit Queue Size

The arrangement of queues matters in a RabbitMQ environment. If you wish to attain maximum performance for your application, ensure that queues remain as short as possible. When a queue grows large, the accompanying processing overhead slows the application down. For optimal performance, queue length should stay close to 0.

Turn On Lazy Queues

Since RabbitMQ 3.6, the messaging platform has offered lazy queues. In a lazy queue, messages are automatically written to disk and stored there; when a message is needed, it is loaded from disk into main memory. Because messages live on disk, RAM usage stays manageable and optimized. On the other hand, throughput takes additional time.
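As a rough sketch, assuming the Node.js amqplib client and a local broker, a queue could be declared in lazy mode like this; the queue name and connection URL are placeholders, and the same effect can also be achieved with a policy instead of a queue argument.

const amqp = require('amqplib');

async function declareLazyQueue() {
  const connection = await amqp.connect('amqp://localhost'); // placeholder broker URL
  const channel = await connection.createChannel();
  // 'x-queue-mode': 'lazy' tells RabbitMQ to keep messages on disk and load them
  // into RAM only when they are about to be delivered to a consumer.
  await channel.assertQueue('batch-jobs', {
    durable: true,
    arguments: { 'x-queue-mode': 'lazy' }
  });
  await channel.close();
  await connection.close();
}

declareLazyQueue().catch(console.error);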

Lazy queues help you build more stable clusters with a degree of predictability in their performance, which gives the server's availability a much-needed boost. If your application processes a lot of batch jobs and deals with a high number of messages, and you are concerned that consumers may not always keep up with the publishers, then you might give some thought to turning lazy queues off.

Manage Cluster Setup

A collection of nodes in RabbitMQ is known as a cluster. To improve availability, make sure that clients are served by replicated data in the event of any possible or expected failure. This way, even if one node becomes disconnected, clients can still access the data they need.

In CloudAMQP, you can create a cluster of three nodes. The nodes are located in different availability zones (on AWS), and queues should be replicated (HA) across those availability zones.

If a node goes down, there should be a mechanism that transfers processing to the remaining cluster nodes to keep the application available. To do this, you can put a load balancer in front of the RabbitMQ instances. The load balancer lets message publishers reach the brokers transparently and distributes traffic among them.

Keep in mind that in CloudAMQP, the maximum failover time is 60 seconds: the DNS TTL is 30 seconds, and the endpoint health check runs every 30 seconds.

By default, a message queue is positioned on a single node, though it can be reached and seen from all the other nodes. To replicate queues across the nodes of a cluster for high availability, and for more details on configuring the nodes in your cluster, consult the RabbitMQ documentation.

Using Durable Queues and Persistent Messages

If it is absolutely critical that your application not lose a message, configure your queues as “durable” and send your messages with the “persistent” delivery mode. To prevent message loss at the broker, you have to consider broker crashes, broker hardware failures, and broker restarts. To make sure that broker definitions and messages survive restarts, store them on disk. Queues, exchanges, and messages that are not declared durable and persistent are lost when the broker restarts.
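A minimal sketch with the Node.js amqplib client, assuming a local broker and a hypothetical “orders” queue: the queue is declared durable and the message is published in persistent delivery mode so that both survive a broker restart.

const amqp = require('amqplib');

async function publishDurably() {
  const connection = await amqp.connect('amqp://localhost'); // placeholder broker URL
  const channel = await connection.createChannel();
  // durable: true -> the queue definition survives a broker restart.
  await channel.assertQueue('orders', { durable: true });
  // persistent: true -> the broker writes the message to disk.
  channel.sendToQueue('orders', Buffer.from('{"orderId": 1}'), { persistent: true });
  await channel.close();
  await connection.close();
}

publishDurably().catch(console.error);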

Using the Federation for the Cloud

According to some experts, clustering across regions or clouds is not recommended, which means you do not have to think about distributing nodes across data centers or regions. If an entire cloud region is disconnected, CloudAMQP can falter as well, though it is not guaranteed to be affected; the nodes in a cluster are distributed among availability zones within the same region.

If you wish to safeguard the setup when a complete region goes down, you can configure two clusters in separate regions and apply “federation” between them. Federation is a RabbitMQ plugin; using it allows your application to leverage more than one broker spread across separate systems.

Turn Off HiPE

If you plan to use HiPE, avoid it. When HiPE is enabled, server throughput increases noticeably, but so does startup time, because RabbitMQ is compiled at startup. Startup can take as much as 3 minutes, so you might face issues during a server restart.

Disable Detailed Mode

If you have set the RabbitMQ management statistics rate mode to “detailed”, disable it immediately. When this mode is enabled, it hurts the performance and availability of the application; it should not be used in production.

Limit Priority Queues

Each priority level is backed by an internal queue on the Erlang VM, and each of these queues consumes considerable resources. For general use, you rarely need more than 5 priority levels.
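For illustration only, a priority queue might be declared as follows with amqplib, assuming a channel opened as in the earlier sketches; the queue name and the limit of 5 levels are example values.

async function declarePriorityQueue(channel) {
  // 'x-max-priority' caps the number of priority levels the queue supports.
  await channel.assertQueue('tasks', {
    durable: true,
    arguments: { 'x-max-priority': 5 }
  });
  // Publish a message with a priority between 0 and 5.
  channel.sendToQueue('tasks', Buffer.from('urgent job'), { priority: 5 });
}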

How to Enhance the Performance of Your RabbitMQ Application?


The performance in RabbitMQ is dependent on multiple variables. Go through the following tips to enhance the performance of your RabbitMQ application.

Use Shorter Queues

Ensure that all your queues are as short as possible to achieve maximum performance. Long queues carry more processing overhead than short ones, so experts recommend keeping queue length close to 0 for the best possible performance.

Setting up Max-length

If your application is hit by a huge load of messages, consider setting a “max-length” on your queue. With this technique, the queue is kept short by discarding messages from its head, so it can never exceed the max-length setting.
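As a rough sketch with amqplib, again assuming a channel opened as in the earlier examples, a queue can be capped with the 'x-max-length' argument; the queue name and the limit of 10,000 messages are placeholders, and the same limit can also be applied through a policy.

async function declareBoundedQueue(channel) {
  // Once the queue holds 10,000 messages, the oldest messages at the head are dropped.
  await channel.assertQueue('metrics', {
    durable: true,
    arguments: { 'x-max-length': 10000 }
  });
}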

Implement Lazy Queues

Lazy queues are enabled by default in CloudAMQP. If you are unfamiliar with them, lazy queues store their messages automatically on disk and load them into memory only when they are needed. Incoming messages go straight to disk, which keeps RAM usage efficient; on the other hand, throughput takes longer.

Utilize Transient Messages

If you desire the fastest possible throughput, use transient messages. Keep in mind that persistent messages are directed to the disk as soon as the queue receives them.

Think About Multiple Queues

In RabbitMQ, queues are single-threaded, and a single queue can handle up to around 50,000 messages. On a multi-core system, you gain better throughput with multiple consumers and queues. For maximum throughput, the number of queues should equal the number of cores on the underlying nodes.

That said, the management interface in RabbitMQ collects and processes information about every queue, which burdens the server and slows it down. RAM and CPU usage also suffer when the number of queues gets high (in the 1,000+ range). Each queue consumes some resources of its own, and when queues number in the thousands, the shared disk and CPU resources can become a problem.

Spread Each Queue over Separate Cores

A queue's performance is limited to a single core. Hence, a better strategy for improved performance is to spread your queues over separate cores and, if you are working with a RabbitMQ cluster, over separate nodes. Note that when a queue is declared in RabbitMQ, it is attached to a particular node.

When you configure a cluster of RabbitMQ brokers, messages published to a given queue are routed to the node where that queue lives. You can spread queues among the nodes manually, but there is a catch: you have to keep track of where each queue is located. Instead, two plugins are commonly recommended, and they help whether you have more than one node or a single-node cluster with multiple cores.

Leveraging the Consistent Hash Exchange Plugin

If you are struggling to load-balance messages across queues, the consistent hash exchange plugin can help. It provides an exchange that spreads incoming messages evenly and consistently across multiple queues, using each message's routing key.

The plugin hashes the routing key and distributes each message to one of the queues bound to the exchange. Doing this manually would be error-prone, because the publisher would need to know how many queues exist and how each of them is bound.

If you need to make maximum use of your cluster's cores, the consistent hash exchange plugin can serve you well. Bear in mind that you must consume from all of the queues. The plugin's documentation describes it in more detail.
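The sketch below, again using amqplib and hypothetical names, shows the general shape of the setup: the exchange type 'x-consistent-hash' comes from the plugin, each binding key acts as a weight, and the routing key of every published message is hashed to pick a queue.

async function setUpConsistentHashing(channel) {
  // Exchange type provided by the consistent hash exchange plugin.
  await channel.assertExchange('images', 'x-consistent-hash', { durable: true });
  await channel.assertQueue('images-1', { durable: true });
  await channel.assertQueue('images-2', { durable: true });
  // The binding key is a weight; equal weights give each queue an equal share.
  await channel.bindQueue('images-1', 'images', '1');
  await channel.bindQueue('images-2', 'images', '1');
  // The routing key is hashed to choose one of the bound queues.
  channel.publish('images', 'photo-4711', Buffer.from('image payload'));
}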

Use Plugin for Sharding

Sharding is a process in which large databases are partitioned into smaller divisions to increase the speed and efficiency of the application. In RabbitMQ, you can use a sharding plugin to automate this partitioning. All you have to do is mark an exchange as sharded. The plugin then creates the required queues on every cluster node and shards incoming messages across the nodes.

With this plugin, a consumer sees a single queue, even though multiple queues may be running behind it in the background. The plugin gives users one central place for sending messages, attaching queues to cluster nodes, and load-balancing messages across multiple nodes.

Turn Off Manual Acks

Acknowledgements and publisher confirms affect application performance, so for the best possible throughput, disable manual acks.
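A minimal consumer sketch with amqplib, assuming a channel and a hypothetical 'tasks' queue: setting noAck to true means the broker treats messages as acknowledged on delivery, removing the bookkeeping of manual acks (at the cost of possible message loss if the consumer crashes).

async function consumeWithAutoAck(channel) {
  await channel.consume('tasks', (msg) => {
    if (msg !== null) {
      console.log(msg.content.toString());
    }
  }, { noAck: true }); // automatic acknowledgement
}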

Stay Away from Multiple Nodes

A single node provides the best possible throughput compared with a high-availability cluster setup, because queues and messages are not mirrored to other nodes.

Turn Off Non-Used Plugins

Different situations call for different RabbitMQ plugins. While some of them are extremely valuable, keep in mind that their processing overhead can hurt performance, so plugins you do not actually use are not recommended in production.

What Are the Components of a Sharded Cluster in MongoDB?


A sharded cluster in MongoDB is composed of three components: shards, config servers, and mongos. They are described below.

Shards

Each shard stores a subset of the sharded cluster's data; together, the shards hold the data of the entire cluster. For deployment, each shard should be configured as a replica set to achieve high availability and redundancy. Shards also support maintenance and administrative operations.

Primary Shard

Every database in a sharded cluster has its own primary shard, which stores all of that database's un-sharded collections. Bear in mind that there is no link between a primary shard and the primary member of a replica set, so do not be confused by the similarity in their names.

A primary shard is chosen by mongos when a new database is created. Mongos picks the shard that contains the least data.

It is possible to change the primary shard with the “movePrimary” command. However, do not change the primary shard casually: the migration is time-consuming, and while it is in progress you cannot use the collections of that shard's database. Likewise, cluster operations can be disrupted, and the extent of the disruption depends on how much data is being migrated.

To get a general view of the cluster's sharding state, use the sh.status() method in the mongo shell. Its output specifies the primary shard of each database, along with other useful information such as how chunks are distributed among the shards.
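In the mongo shell, the two operations mentioned above might look like the following; “inventory” and “shard0001” are placeholder names.

// Print the sharding status of the cluster, including each database's primary shard
// and how chunks are distributed across the shards.
sh.status()

// Move the primary shard of the "inventory" database to the shard named "shard0001".
db.adminCommand({ movePrimary: "inventory", to: "shard0001" })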

Config Servers

The sharded cluster's metadata is stored in the config servers. This metadata includes the organization of the cluster's components along with the state of those components and of the data. Information about all the shards and their chunks is also maintained in the config servers.

The mongos instances cache this data and use it to route read and write operations to the appropriate shards. When the metadata changes, the mongos cache is updated as well.

It must be noted that MongoDB's authentication configuration, such as internal authentication and role-based access control (RBAC), is also stored in the config servers. Additionally, MongoDB uses them to manage distributed locks.

While it is possible to use one set of config servers for several sharded clusters, this practice is not recommended. If you have multiple sharded clusters, give each of them its own config servers.

Config Servers and Read/Write Operations

Write Operations

If you are a MongoDB user, then you are familiar with the admin and config databases in MongoDB. Both of these databases are maintained on the config servers. The “admin” database stores the collections related to authorization and authentication, along with other system collections. The “config” database, on the other hand, stores the metadata of the sharded cluster.

When the metadata is modified, for example when a chunk is split or migrated, MongoDB writes to the config database using the “majority” write concern.

However, as a developer, you should refrain from writing to the config database yourself in the course of normal operations or maintenance.

Read Operations

MongoDB reads from the admin database for authorization, authentication, and other internal operations.

MongoDB also reads from the config database when the metadata changes, for example when a chunk is migrating or a mongos instance starts up, and it uses the “majority” read concern for these reads. Moreover, it is not only mongos that reads from the config servers; shards also read chunk metadata from them.

Mongos

The mongos instances route queries and shard-related write operations. From the application's perspective, mongos is the only interface to the sharded cluster; applications never communicate with the shards directly.

Mongos tracks what is on each shard by caching the metadata provided by the config servers, and it uses this metadata to route operations from clients and applications to the appropriate mongod instances. Mongos instances have no persistent state of their own, which keeps their resource consumption low.

Usually, mongos instances run on the same systems as the application servers, but if your situation demands it, you can also run them on other dedicated resources, such as the shard hosts.

Routing and Results

While routing a query, a mongos instance determines the list of shards that must be queried and opens a cursor on each of those shards.

Afterward, the mongos instance merges the data from these shards into the result it returns. Some query modifiers, such as sorting, may need to be executed on the shards before mongos retrieves the results. To manage query modifiers, mongos does the following.

  • For unsorted query results, mongos uses a “round-robin” strategy to gather results from the shards.
  • If the result size is restricted with the limit() method, mongos passes the limit to the shards and then re-applies it to the merged result before sending it to the client.

If the skip() method is used in the query, mongos cannot forward it to the shards. Instead, it retrieves the unskipped results from the shards and applies the specified skip while assembling the complete result.

What Are Replica Set Members in MongoDB?


Among the many database concepts out there, one of the most important is replication. Replication involves copying data across multiple systems so that each of them holds the same information. This keeps data available, which in turn improves application performance. It also helps with backups, for instance when one of the systems is damaged by a cyberattack. In MongoDB, replication is achieved by a group of mongod processes known as a replica set.

What Is a Replica Set?

A replica set is a group of mongod processes working together. Replica sets are the key to redundancy and high data availability in MongoDB. Members of a replica set fall into three categories.

  1. Primary member.
  2. Secondary member.
  3. Arbiter.

Primary Member

There can only be one primary member in a replica set. It is a sort of “leader” among the other members. The major difference between the primary member and the others is that the primary receives the write operations: when MongoDB has to process writes, it forwards them to the primary member, which records them in its operation log (oplog). The secondary members then read the oplog and apply the recorded changes to their own copies of the data.

All members of a replica set can serve read operations, but by default MongoDB directs reads to the primary member.

All members in a replica set signal their availability through a “heartbeat”. If a heartbeat is missing for the configured time limit (defined in seconds), the member is considered disconnected. There are cases where the primary member becomes unavailable, for instance when the data center hosting it is disconnected by a power outage.

Since replication cannot operate without a primary member, an “election” is held to appoint a primary member from the secondary members.

Secondary Member

A secondary member depends on the primary member to maintain its data; it replicates the primary's oplog asynchronously. Unlike the primary, a replica set can have multiple secondary members, and generally the more the better. However, as explained before, write operations are handled only by the primary member, so a secondary member cannot process them.

If a secondary member is allowed to vote, then it is eligible to trigger an election and take part as a candidate to become the new primary member.

A secondary member can also be customized for the following purposes; the sketch after this list illustrates the corresponding settings.

  • It can be made ineligible to become the primary member, for example so it can serve as a standby in a secondary data center.
  • It can be hidden from applications so that they cannot “read” from it, which lets it serve special workloads that need to be kept apart from standard traffic.
  • It can be delayed so that it acts as a historical snapshot, a backup strategy that serves as a contingency plan, for example when a user deletes a database by mistake.
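A rough sketch of these settings in the mongo shell; the host names are placeholders, and rs.add() is run against an existing replica set. (In newer MongoDB releases the delay option is called secondaryDelaySecs; slaveDelay applies to the versions discussed here.)

// A member that can never become primary, e.g. a standby in a secondary data center.
rs.add({ host: "dc2-standby.example.net:27017", priority: 0 })

// A hidden member: invisible to applications, so normal read traffic never reaches it.
rs.add({ host: "reporting.example.net:27017", priority: 0, hidden: true })

// A delayed member that lags the primary by one hour and can serve as a rolling snapshot.
rs.add({ host: "delayed.example.net:27017", priority: 0, hidden: true, slaveDelay: 3600 })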

Arbiter

An arbiter is distinct in that it does not store its own copy of the data, and it is ineligible for the primary member position. So why are arbiters used? They make elections more efficient: each arbiter holds one vote, and it is added so that its vote can break a possible “tie” between members.

For example, suppose a replica set has five members and the primary becomes unavailable. In the election that follows, the remaining members could split their votes evenly between two secondary candidates, so neither can become primary. Arbiters are added to break such standoffs: their vote completes the election and a primary member is selected.

It is also possible to add another secondary member to the replica set to improve elections, but a new secondary is heavyweight because it stores and maintains data. Arbiters carry no such overhead, which makes them the most cost-effective solution.
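Adding an arbiter is a one-liner in the mongo shell; the host name below is a placeholder.

// The arbiter votes in elections but stores no data.
rs.addArb("arbiter.example.net:27017")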

It is recommended that an arbiter never reside at a site that hosts both the primary and secondary members.

If the authorization configuration option is enabled on an arbiter, it also exchanges credentials with the other members of the replica set, with authentication implemented using “keyfiles”. In such scenarios, MongoDB encrypts the entire authentication procedure.

The use of cryptographic techniques ensures that this authentication is secure from exploitation. However, since arbiters do not store or maintain data, they cannot hold the internal table that contains the users and roles used for authentication. Therefore, the localhost exception is used to authenticate against an arbiter.

Do note that starting with MongoDB 3.6, arbiters have a priority of 0; if you upgrade a replica set from an older version, each arbiter's priority value is changed from 1 to 0.

What Is the Ideal Replica Set?

There is a limit on the number of members in a replica set: it can contain at most 50 members, and the number of voting members cannot exceed 7. For a member to be allowed to vote, its members[n].votes setting in the replica set configuration must be 1.

Ideally, a replica set should have at least three data-bearing members: one primary and two secondaries. It is also possible to run a three-member replica set with one primary, one secondary, and one arbiter, but at the expense of redundancy.
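As a sketch, this is how a member beyond the seven-voter limit could be made non-voting in the mongo shell; the member index 7 is illustrative, and a non-voting member must also have priority 0.

cfg = rs.conf()             // fetch the current replica set configuration
cfg.members[7].votes = 0    // remove the member's vote
cfg.members[7].priority = 0 // non-voting members cannot be eligible for primary
rs.reconfig(cfg)            // apply the new configuration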

How to Work with Data Modeling in MongoDB with an Example


Were you assigned to work on the MEAN Stack? Did your organization choose MongoDB for data storage? As a beginner in MongoDB, it is important to familiarize yourself with data modeling in MongoDB.

The major considerations for data modeling in MongoDB are to assess the database engine's performance, balance the requirements of the application, and think about data retrieval patterns. As a beginner in MongoDB, think about how your application queries, updates, and processes its data.

MongoDB Schema

MongoDB's NoSQL schema model is considerably different from that of relational SQL databases. In the latter, you have to design and specify a table's schema before you can insert rows into it. MongoDB has no such requirement.

MongoDB collections work differently: two documents (roughly analogous to rows in SQL databases) in the same collection do not need to share the same schema. As a result, a document is not required to have the same fields and data types as the other documents in its collection.

If you need to add fields to a document or modify its existing fields, you simply write the document with its new structure.

How to Improve Data Modeling

During data modeling in MongoDB, you must analyze your data and queries and consider the following tips.

Capped Collections

Think about how your application is going to use the database. For instance, if your application is insert-heavy, you can benefit from capped collections.
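For instance, a capped collection could be created like this in the mongo shell; the collection name and limits are placeholders.

// At most 1 MB / 5,000 documents; the oldest documents are overwritten automatically
// once the limit is reached, which suits insert-heavy, log-like workloads.
db.createCollection("activity_log", { capped: true, size: 1048576, max: 5000 })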

Manage Growing Documents

Write operations can grow documents quickly, for example when new elements are added to an array. It is good practice to keep track of document growth when modeling your data.

Assessing Atomicity

Operations in MongoDB are atomic at the document level: a single write operation can change the contents of only one document. Some write operations modify multiple documents, but behind the scenes they process one document at a time. Therefore, make sure your data model accounts for the atomicity guarantees your requirements depend on.

The use of embedded data models is often recommended where you want to take advantage of atomic operations. If you use references instead, your application is forced to issue separate read and write operations.
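As a small illustration of why embedding pairs well with atomicity (the collection and field names here are hypothetical): a single updateOne call changes several embedded fields of one document, and MongoDB applies the whole change atomically.

// Both embedded fields change in one atomic write on a single document.
db.users.updateOne(
  { _id: "abc" },
  { $set: { "residence.city": "Seattle", "contact_information.office_phone": "(123) 555-0100" } }
)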

When to Use Sharding?

Sharding is an excellent option for horizontal scaling. It is advantageous when you have to deploy very large datasets with heavy read and write workloads. Sharding partitions database collections, which allows the documents of a collection to be used more efficiently.

To manage and distribute data in MongoDB, you are required to choose a shard key. The shard key has to be selected carefully, or it may have an adverse impact on the application's performance. The right shard key supports query isolation and can also noticeably increase write capacity, so take your time selecting the field that will serve as the shard key.
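In the mongo shell, sharding a collection might look like the following sketch; the database name, collection name, and shard key are placeholders.

// Enable sharding for the database, then shard the collection on the chosen key.
sh.enableSharding("appdb")
sh.shardCollection("appdb.users", { user_id: 1 }) // a hashed key ({ user_id: "hashed" }) is also possible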

When to Use Indexes?

The first option for improving query performance in MongoDB is an index. To decide where to use indexes, go through all your queries and look for the fields that appear most often, then build your indexes from that list. You should notice a considerable improvement in query performance. Bear in mind that indexes consume space both in RAM and on disk, so keep this in mind while creating them or your strategy can backfire.
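For example, an index on a frequently queried field could be created and checked like this; the collection and field names are placeholders.

// Create the index, then confirm with explain() that the query actually uses it.
db.orders.createIndex({ customer_id: 1 })
db.orders.find({ customer_id: 12345 }).explain("executionStats")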

Limit the Number of Collections

Sometimes it is necessary to use different collections based on application requirements. At other times, however, you may have used two collections when a single one would have been enough. Since each collection comes with its own overhead, avoid creating more than you need.

Optimize Small Documents in Collections

If you have one or more collections containing a large number of small documents, you can achieve better performance with the embedded data model. To roll several small documents up into a larger one, see whether there is a logical relationship you can use to group them.

Each document in MongoDB carries its own overhead. For a single document this may not be much, but across many documents it adds up and can make matters worse for your application. To optimize your collections, you can use these techniques.

  • If you have worked with MongoDB, you will have noticed that it automatically creates an “_id” field for each document, assigns it a unique 12-byte “ObjectId”, and indexes that field. This default can waste storage. To optimize a collection, supply your own value for the “_id” field when you create a document, saving the space the generated value would have taken. There is no explicit restriction on the value you use for “_id”, but remember that it serves as the primary key in MongoDB, so it must be unique.
  • MongoDB stores every field name inside each document, which does not bode well for small documents: a large portion of their size can come from the field names themselves. Therefore, use short but still meaningful field names. For instance, consider the following field.

father_name: "Martin"

Then you can change it by writing the following.

f_name: "Martin"

At first glance you have only saved five characters, but you have reduced the number of bytes used to represent the field in every document. Do the same across your documents and queries and it can make a noticeable difference.

Let's see how this works with an example.

The major choice you have to think about while modeling data in MongoDB is what your document structure should look like. You have to decide what kind of relationships represent your data, for example whether you are dealing with a one-to-one or a one-to-many relationship. For this, you have two strategies.

Embedded Data Models

MongoDB allows you to “embed” related data in the same document. The alternative term for embedded data models is de-normalized models. Consider the following embedded data model. In this example, the document begins with two standard fields, “_id” and “name”, to represent the details of a user. However, for contact information and residence, a single field each is not enough, because each has more than one attribute. Therefore, you can use the embedded data model to nest a document within a document: for “contact_information”, curly brackets turn the field into a sub-document holding “email_address” and “office_phone”, and the same technique is applied to the user's residence.

{
  _id: "abc",
  name: "Bruce Wayne",
  contact_information: {
    email_address: "bruce@ittechbook.com",
    office_phone: "(123) 456-7890"
  },
  residence: {
    country: "USA",
    city: "Bothell"
  }
}

Storing related data in the same document is useful for a reason: it limits the number of queries needed to execute standard operations. The question on your mind might be: when is the right time to use an embedded data model? Two scenarios are appropriate for it.

First, whenever you find a one-to-many relationship within an entity, it is wise to embed. Second, whenever you identify a “contains” relationship between entities, it is also a good time to use this data model. In embedded documents, you can access information by using dot notation.
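For instance, with the embedded user document shown above, dot notation queries might look like this; the collection name “users” is an assumption.

db.users.find({ "residence.city": "Bothell" })
db.users.find({ "contact_information.email_address": "bruce@ittechbook.com" })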

One-to-Many Relationship with Embedding

One-to-many is a common type of database relationship in which a document in collection A can match multiple documents in collection B, but a document in B matches only one document in A.

Consider the following example to see when de-normalized data models hold an advantage over normalized data models. To choose between them, you have to consider document growth, access frequency, and similar factors.

In this example, the normalized version with references uses three separate documents.

{
  _id: "barney",
  name: "Barney Peters"
}

{
  user_id: "barney",
  street: "824 XYZ",
  city: "Baltimore",
  state: "MD",
  zip: "xxxxx"
}

{
  user_id: "barney",
  street: "454 XYZ",
  city: "Boston",
  state: "MA",
  zip: "xxxx"
}

If you look closely, it is obvious that the application has to repeatedly query for the address documents, retrieving repeated field names it does not really need. As a result, several queries are wasted.

On the other hand, embedding the addresses boosts the efficiency of your application, which will then require only a single query.

{
  _id: "barney",
  name: "Barney Peters",
  addresses: [
    {
      user_id: "barney",
      street: "763 UIO Street",
      city: "Baltimore",
      state: "MD",
      zip: "xxxxx"
    },
    {
      user_id: "barney",
      street: "102 JK Street",
      city: "Boston",
      state: "MA",
      zip: "xxxxx"
    }
  ]
}

Normalized Data Models (References)

In normalized data models, you can use references to represent relationships between multiple documents. To understand references, consider the example below.

References, or normalized data models, are used for one-to-many and many-to-many relationships. In some instances of the embedded document model, we might have to repeat data, which can be avoided by using references.


In this pattern, the “_id” value of the user document is referenced by two other documents, which store it in a shared field.
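Since the original diagram is not reproduced here, the following is a rough sketch of the pattern it described: a user document whose _id value is stored as a user_id reference in two other documents (the field values are illustrative).

// User document
{
  _id: "abc",
  name: "Bruce Wayne"
}

// Contact document referencing the user
{
  user_id: "abc",
  email_address: "bruce@ittechbook.com",
  office_phone: "(123) 456-7890"
}

// Residence document referencing the user
{
  user_id: "abc",
  country: "USA",
  city: "Bothell"
}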

References are suitable for hierarchical datasets, and they can be used to describe many-to-many relationships.

While references provide a greater level of flexibility than embedding, they also require applications to issue suitable follow-up queries to resolve them. Put simply, references can increase the amount of round-trip processing between the client-side application and the server.

One-to-Many Relationship with References

There are some cases in which a one-to-many relationship is better off with references rather than embedding. In these scenarios, embedding can cause needless repetition.

{
  title: "MongoDB Guide",
  author: [ "Mike Jones", "Robert Johnson" ],
  published_date: ISODate("2019-01-01"),
  pages: 1200,
  language: "English",
  publisher: {
    name: "ASD Publishers",
    founded: 2009,
    location: "San Francisco"
  }
}

{
  title: "Java for Beginners",
  author: "Randall James",
  published_date: ISODate("2019-01-02"),
  pages: 800,
  language: "English",
  publisher: {
    name: "BNM Publishers",
    founded: 2005,
    location: "San Francisco"
  }
}

As you can see, whenever a query needs a book's publisher, that information is repeated in every book document. By using references and storing the publisher's information in a separate collection, you can improve performance. In cases where a book has a limited number of publishers and the likelihood of growth is low, references make a good impact. For instance:

{
  name: "ASD Publishers",
  founded: 2009,
  location: "San Francisco",
  books: [101, 102, …]
}

 

{
  _id: 101,
  title: "MongoDB Guide",
  author: [ "Mike Jones", "Robert Johnson" ],
  published_date: ISODate("2019-01-01"),
  pages: 1200,
  language: "English"
}

{
  _id: 102,
  title: "Java for Beginners",
  author: "Randall James",
  published_date: ISODate("2019-01-02"),
  pages: 800,
  language: "English"
}