This
blog outlines the basic patterns and principles for using MongoDB as a
persistent storage engine for log data from servers and other machine
data.Servers generate a large number of events (i.e. logging,) that
contain useful information about their operation including errors,
warnings, and users behavior. By default, most servers, store these data
in plain text log files on their local file systems.While plain-text
logs are accessible and human-readable, they are difficult to use,
reference, and analyze without holistic systems for aggregating and
storing these data.
1. Schema Design
The schema for storing log data in MongoDB depends on the format of the event data that you’re storing.The preferred approach is to extract the relevant information from the log data into individual fields in a MongoDB document.When you extract data from the log into fields, pay attention to the data types you use to render the log data into MongoDB. Using proper types for your data also increases query flexibility: if you store date as a timestamp you can make date range queries, whereas it’s very difficult to compare two strings that represent dates. The same issue holds for numeric fields; storing numbers as strings requires more space and is difficult to query.When extracting data from logs and designing a schema, also consider what information you can omit from your log tracking system. In most cases there’s no need to track all data from an event log, and you can omit other fields.
2.System Architecture
Insertion speed is the primary performance concern for an event logging system. At the same time, the system must be able to support flexible queries so that you can return data from the system efficiently.
MongoDB has a configurable write concern. This capability allows you to balance the importance
of guaranteeing that all writes are fully recorded in the database with the speed of the insert.
For example, if you issue writes to MongoDB and do not require that the database issue any response, the writeoperations will return very fast (i.e. asynchronously,) but you cannot be certain that all writes succeeded.
The following command will insert the event object into the events collection.
>>> db.events.insert(event, w=0)
By setting w=0, you do not require that MongoDB acknowledges receipt of the insert. Although very fast, this is risky
because the application cannot detect network and server failures. See write-concern for more information.
Conversely,if you require that MongoDB acknowledge every write operation, the database will not return as quickly but you can be certain that every item will be present in the database.
In this case use pass w=1 argument as follows:
>>> db.events.insert(event, w=1)
Finally, if you have extremely low tolerance for event data loss, you can require that MongoDB replicate the data to multiple secondary replica set members before returning:
>>> db.events.insert(event, w=majority)
Sharding
Eventually your system’s events will exceed the capacity of a single event logging database instance. In these situations you will want to use a sharded cluster, which takes advantage of MongoDB’s sharding functionality.
In a sharded environment the limitations on the maximum insertion rate are:
• the number of shards in the cluster.
• the shard key you chose.
Because MongoDB distributed data in using “ranges” (i.e. chunks) of keys, the choice of shard key can control how MongoDB distributes data and the resulting systems’ capacity for writes and queries.
Shard key choices:
1. Schema Design
The schema for storing log data in MongoDB depends on the format of the event data that you’re storing.The preferred approach is to extract the relevant information from the log data into individual fields in a MongoDB document.When you extract data from the log into fields, pay attention to the data types you use to render the log data into MongoDB. Using proper types for your data also increases query flexibility: if you store date as a timestamp you can make date range queries, whereas it’s very difficult to compare two strings that represent dates. The same issue holds for numeric fields; storing numbers as strings requires more space and is difficult to query.When extracting data from logs and designing a schema, also consider what information you can omit from your log tracking system. In most cases there’s no need to track all data from an event log, and you can omit other fields.
2.System Architecture
Insertion speed is the primary performance concern for an event logging system. At the same time, the system must be able to support flexible queries so that you can return data from the system efficiently.
MongoDB has a configurable write concern. This capability allows you to balance the importance
of guaranteeing that all writes are fully recorded in the database with the speed of the insert.
For example, if you issue writes to MongoDB and do not require that the database issue any response, the writeoperations will return very fast (i.e. asynchronously,) but you cannot be certain that all writes succeeded.
The following command will insert the event object into the events collection.
>>> db.events.insert(event, w=0)
By setting w=0, you do not require that MongoDB acknowledges receipt of the insert. Although very fast, this is risky
because the application cannot detect network and server failures. See write-concern for more information.
Conversely,if you require that MongoDB acknowledge every write operation, the database will not return as quickly but you can be certain that every item will be present in the database.
In this case use pass w=1 argument as follows:
>>> db.events.insert(event, w=1)
Finally, if you have extremely low tolerance for event data loss, you can require that MongoDB replicate the data to multiple secondary replica set members before returning:
>>> db.events.insert(event, w=majority)
Sharding
Eventually your system’s events will exceed the capacity of a single event logging database instance. In these situations you will want to use a sharded cluster, which takes advantage of MongoDB’s sharding functionality.
In a sharded environment the limitations on the maximum insertion rate are:
• the number of shards in the cluster.
• the shard key you chose.
Because MongoDB distributed data in using “ranges” (i.e. chunks) of keys, the choice of shard key can control how MongoDB distributes data and the resulting systems’ capacity for writes and queries.
Shard key choices:
- Shard by Time
- Shard by a Semi-Random Key
- Shard by an Evenly-Distributed Key in the Data Set
- Shard by Combine a Natural and Synthetic Key