March 02, 2013

Storing Log Data using MongoDB

This blog outlines the basic patterns and principles for using MongoDB as a persistent storage engine for log data from servers and other machine data. Servers generate a large number of events (i.e. logging) that contain useful information about their operation, including errors, warnings, and user behavior. By default, most servers store these data in plain-text log files on their local file systems. While plain-text logs are accessible and human-readable, they are difficult to use, reference, and analyze without holistic systems for aggregating and storing these data.

1. Schema Design
The schema for storing log data in MongoDB depends on the format of the event data that you’re storing. The preferred approach is to extract the relevant information from the log data into individual fields in a MongoDB document. When you extract data from the log into fields, pay attention to the data types you use to render the log data into MongoDB. Using proper types for your data also increases query flexibility: if you store a date as a timestamp you can make date range queries, whereas it’s very difficult to compare two strings that represent dates. The same issue holds for numeric fields; storing numbers as strings requires more space and makes them difficult to query. When extracting data from logs and designing a schema, also consider what information you can omit from your log tracking system. In most cases there’s no need to track all data from an event log, and you can omit the fields you will never query.
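
As a rough illustration of the point about data types, the sketch below parses one access-log line into a dictionary whose fields use native types. The Apache-style line format, the field names, and the parse_log_line helper are assumptions made for this example, not something prescribed by MongoDB.

```python
import re
from datetime import datetime

# Pattern for an Apache-style access log line (an assumed format for this sketch).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Turn one log line into a dict with typed fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "host": fields["host"],
        "user": None if fields["user"] == "-" else fields["user"],
        # Keep the timestamp as a datetime, not a string, so range queries work.
        "time": datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "request": fields["request"],
        # Keep numbers as integers rather than strings.
        "status": int(fields["status"]),
        "response_size": 0 if fields["size"] == "-" else int(fields["size"]),
    }


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = parse_log_line(sample)
```

A dictionary shaped like event can then be handed to the insert calls shown in the next section. Fields that are never queried can simply be left out of the returned dictionary.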

2. System Architecture
Insertion speed is the primary performance concern for an event logging system. At the same time, the system must be able to support flexible queries so that you can return data from the system efficiently.
MongoDB has a configurable write concern. This capability allows you to balance the importance of guaranteeing that all writes are fully recorded in the database against the speed of the insert.
For example, if you issue writes to MongoDB and do not require that the database issue any response, the write operations will return very fast (i.e. asynchronously), but you cannot be certain that all writes succeeded.
The following command will insert the event object into the events collection.
>>> db.events.insert(event, w=0)
By setting w=0, you do not require that MongoDB acknowledge receipt of the insert. Although very fast, this is risky because the application cannot detect network and server failures. See write-concern for more information.

Conversely, if you require that MongoDB acknowledge every write operation, the database will not return as quickly, but you can be certain that every item will be present in the database.
In this case, pass the w=1 argument as follows:
>>> db.events.insert(event, w=1)

Finally, if you have extremely low tolerance for event data loss, you can require that MongoDB replicate the data to multiple secondary replica set members before returning:
>>> db.events.insert(event, w='majority')
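
The three choices above can be summarized in a small helper. This is only a sketch: the write_options function and its "high", "normal", and "low" labels are hypothetical names invented for this example, and the returned dictionaries are meant to be passed as keyword arguments to an insert call written in the style used in this post.

```python
def write_options(loss_tolerance):
    """Map a hypothetical loss-tolerance label to the w value discussed above."""
    options = {
        "high": {"w": 0},           # no acknowledgement: fastest, failures go undetected
        "normal": {"w": 1},         # wait for MongoDB to acknowledge the write
        "low": {"w": "majority"},   # wait for a majority of replica set members
    }
    return options[loss_tolerance]


# Intended use, following the insert style shown in this post:
#   db.events.insert(event, **write_options("low"))
```

Keeping the choice in one place makes it easier to loosen or tighten the guarantee for a whole class of events without touching every insert call.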

Sharding
Eventually your system’s events will exceed the capacity of a single event logging database instance. In these situations you will want to use a sharded cluster, which takes advantage of MongoDB’s sharding functionality.
In a sharded environment the limitations on the maximum insertion rate are:
• the number of shards in the cluster.
• the shard key you choose.
Because MongoDB distributes data using “ranges” (i.e. chunks) of keys, the choice of shard key can control how MongoDB distributes data and the resulting system’s capacity for writes and queries.
Shard key choices:
  • Shard by Time
  • Shard by a Semi-Random Key
  • Shard by an Evenly-Distributed Key in the Data Set
  • Shard by Combining a Natural and Synthetic Key
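
To see why the shard key matters for insertion rate, the following sketch simulates range-based distribution in plain Python. It is a toy model, not MongoDB's actual chunk balancing: the four equal-sized chunks, the one-day key range, the MD5-based semi-random key, and all function names are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta

NUM_CHUNKS = 4  # hypothetical number of key ranges, one per shard


def chunk_for_time(ts, start, end):
    """Range-partition on the timestamp: each chunk owns an equal slice of [start, end)."""
    fraction = (ts - start) / (end - start)
    return min(int(fraction * NUM_CHUNKS), NUM_CHUNKS - 1)


def chunk_for_hash(event):
    """Range-partition on a semi-random key: the first byte of an MD5 digest of the event."""
    digest = hashlib.md5(repr(sorted(event.items())).encode("utf-8")).digest()
    return digest[0] * NUM_CHUNKS // 256


start = datetime(2013, 3, 1)
end = start + timedelta(days=1)

# 1000 events that all arrive during the last hour of the day.
events = [
    {
        "time": start + timedelta(hours=23, seconds=3 * i),
        "host": "web%d" % (i % 5),
        "path": "/page/%d" % i,
    }
    for i in range(1000)
]

# Count how many inserts each chunk would receive under each key choice.
time_load = Counter(chunk_for_time(e["time"], start, end) for e in events)
hash_load = Counter(chunk_for_hash(e) for e in events)

print("shard by time:       ", dict(time_load))
print("shard by random key: ", dict(hash_load))
```

In this model, sharding by time sends every new insert to the chunk that owns the most recent range, so a single shard absorbs all the writes, while the semi-random key spreads the same inserts across all chunks. The trade-off is that a semi-random key scatters time-range queries across every shard.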

Choosing a Mobile BI Solution


Whether they are helping remote employees manage supply chains more efficiently or keeping traveling executives informed of the latest financial developments, today’s mobile ad hoc reporting solutions provide the dynamic capabilities organizations need to stay competitive and drive innovation in the field.
While working in the field used to mean relying on static data, today’s mobile BI solutions offer the ability to generate interactive reports with in-depth analytic functionality.


  • Solutions that provide a unified user experience across all devices are the most suitable for mobile BI.
  • Rather than relying on static data, users should be able to use real-time updates to inform their decisions.
  • A mobile BI solution should facilitate sharing reports, both over wireless networks and in person.
  • Users may need to access mobile BI solutions from remote locations where internet connectivity is low or absent, such as on a plane. While a lack of connectivity prohibits real-time updates, a good mobile BI offering should have some form of reliable offline access to recent and saved reports so that employees can tap into data-driven insights.


Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CSV Example").getOrCreate()
sc = spark.sparkContext
Sp...