Data Engineering with Avishkar: 03/01/2014

March 31, 2014

Embedded Business Intelligence

Embedded BI is the integration of reports, dashboards, and analytic views inside an application. The information is typically displayed and managed by a BI platform and is placed directly within the application user interface to improve the context and usability of the data. Use of an embedded BI platform delivers state-of-the-art reporting and analytics without the time and expense of having to build it.

Embedding BI in operational applications helps users make better decisions in real time, using information that is relevant, timely and actionable.

Real Time BI – Embedded BI acts on real-time data, not time-delayed data stored in a separate warehouse or OLAP cube. A key factor in this is the source of the data: it comes from the application (or uses the same source as the application), not from a data warehouse or data mart.
Seamless Integration – Users do not want to switch applications between operational and reporting activities. Integrated security and a consistent look and feel help create seamless integration between the host application and reporting.
End User Centric – Embedded BI is much more end-user focused than traditional BI. With embedded BI you cannot assume that your users have knowledge of both the BI application and the data set being analysed. Embedded BI needs to be significantly easier to use, without training.

Benefits of Embedded BI:

Eckerson intimated that because users of standalone BI solutions are required to exit operational applications in order to access relevant reports, then subsequently re-enter the operational application to take appropriate action based on the intelligence garnered from the BI tool, their productivity is reduced.

Eckerson said reduced productivity was a result of two key factors:

· Having to exit, enter and re-enter different applications breaks user “train of thought”; and

· Having to view analytical information via a separate BI application means the data is not viewed in its optimal context.

The reported benefits of embedded BI included:

· Higher perception of BI ease-of-use

· Higher perception of information relevancy

· Higher perception of reporting and analytics accessibility

· Boost BI user adoption: Embedding BI functionality into an existing software application enables users to access and interact with those analytical features within a framework that they are already accustomed to, thereby increasing ease-of-use and lowering resistance to adoption. Not only does replicating the look and feel of the core application reduce barriers to adoption, embedding BI into an existing operations-specific application also ensures the relevancy of the analytics produced for the user base.

· Boost BI effectiveness: Embedded BI can directly link reporting and analytics capabilities to operational processes, improving the immediacy and relevance with which users gain data-based insights and helping to link insight directly to action.

· Support pervasive BI: Embedded BI enables more pervasive use of reporting and analytics – and facilitates and underpins the development of an organizational culture based on fact-based decision-making – because BI insights are delivered via the applications and processes that users already utilize on a regular basis to perform their job. Therefore, embedding analytical capabilities into existing applications and processes is an effective way to deliver BI to a wide range of business departments without having to purchase a standalone BI platform to meet the requirements of each user group.

· Build a bridge between information and action: By combining analytical and operational functions, embedded BI empowers users with the context they need to understand the relationships between operational processes and business data, enabling them to react faster to emergent internal and external business threats or opportunities.

· Boost organizational effectiveness and efficiency by facilitating process automation: Embedded BI, directly linked to operational applications, can trigger automated actions and / or alerts that improve or address function-specific business processes (based on pre-determined benchmarks) in drastically reduced timeframes.

· Enhance the salability and value of your core applications: If you’re a software vendor, adding an analytics module to your core application(s) can significantly increase the salability and value of your product(s).
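
The process-automation benefit above (alerts triggered against pre-determined benchmarks) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Benchmark` class, the metric names and the threshold values are hypothetical and do not come from any particular BI platform.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Benchmark:
    """A pre-determined threshold for one operational metric (hypothetical)."""
    metric: str
    threshold: float
    breached: Callable[[float, float], bool]  # (value, threshold) -> bool


def check_benchmarks(metrics: dict, benchmarks: List[Benchmark]) -> List[str]:
    """Return an alert message for every benchmark the live metrics breach."""
    alerts = []
    for b in benchmarks:
        value = metrics.get(b.metric)
        if value is not None and b.breached(value, b.threshold):
            alerts.append(f"ALERT {b.metric}={value} (benchmark {b.threshold})")
    return alerts


# Metrics read straight from the operational application, not a warehouse.
live_metrics = {"open_orders": 120, "avg_response_ms": 340.0}
benchmarks = [
    Benchmark("open_orders", 100, lambda v, t: v > t),
    Benchmark("avg_response_ms", 500.0, lambda v, t: v > t),
]
alerts = check_benchmarks(live_metrics, benchmarks)
print(alerts)
```

In an embedded setting, a check like this would run inside the host application, so the alert can be shown next to the screen where the user acts on it.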

March 16, 2014

QlikView

Qlik

Qlik's QlikView product has become a market leader with its capabilities in data discovery, a segment of the BI platform market that it pioneered. QlikView is a self-contained BI platform, based on an in-memory associative search engine and a growing set of information access and query connectors, with a set of tightly integrated BI capabilities.

Strengths

· Qlik has embarked on one of the boldest strategies of any vendor to address enterprises' unmet need for a BI platform standard that can fulfill both business users' requirements for ease of use and IT's requirements for enterprise features relating to reusability, data governance and control, scalability, and so on. In the second half of 2014, Qlik plans to release a completely rearchitected product, QlikView.Next, featuring a redesigned interactive visualization user experience called Natural Analytics, to make it easier for users to discover and share new insights. Natural Analytics builds on the company's associative search capability and incorporates enhanced comparisons, collaboration, workflow, sharing and data dialogs, as well as enhanced insights from unique visualization techniques that Qlik acquired from NComVA in June 2013. QlikView.Next will also provide completely rearchitected enterprise server and administration capabilities, including reusable semantic intelligence and modeling that draws on its acquisition of Expressor Software, open APIs for extensibility, expanded data connectivity, and enhanced scalability and security features. By providing both business-user-oriented and IT-friendly capabilities, QlikView.Next has the potential to make Qlik a differentiated and viable enterprise-standard alternative to the incumbent BI players.

· Customers choose QlikView for the intuitive interactive experience it offers; this is most often deployed in dashboards, where it enables business users to freely explore and find connections, patterns and outliers in data without having to model those relationships in advance. In particular, QlikView's associative search enables users easily to see which query results are related, to compare them, and more importantly to identify which data elements are not related, without having to write complex SQL. Users can also filter data using search capabilities. The percentage of QlikView customers that choose the platform because of its ease of use for end users is in the top two of all the vendors surveyed; an above-average percentage also select QlikView because of its ease of use for developers. QlikView's ease of use is coupled with an above-average score for the complexity of the types of analysis that users can conduct with the platform, and an above-average score for the breadth of functionality used. As a result, Qlik received one of the highest scores for market understanding of any vendor in the Magic Quadrant survey. In common with those of other stand-alone data discovery vendors, Qlik's customers also report achieving above-average business benefits. This powerful combination of advantages has been a key driver of data discovery success for vendors in general, and for Qlik in particular.

· Qlik's customers also have a positive view of QlikView's composite functional capabilities, which, weighted for use, were rated above the survey average, including above-average individual scores for dashboards, interactive visualization, search-based data discovery (rated No. 1), geospatial intelligence, business user data mashup, collaboration (a score near the top), big data support (also near the top) and mobile BI. As a result of a high degree of satisfaction with its mobile functionality, Qlik has among the highest percentage of users deploying, piloting or planning to deploy mobile capabilities in the next 12 months.

· Qlik's above-average scores for ease of use for developers, particularly when compared with traditional IT-centric enterprise vendors, have resulted in better-than-average implementation costs, IT developer costs and overall three-year BI platform ownership costs per user. The perception that QlikView offers a relatively low cost of ownership, when compared with other vendors' products, is also evident from the high percentage of customers that choose QlikView because of its implementation cost and associated effort, as well as its TCO.

· Qlik has been successfully expanding its reach and awareness beyond its traditional stronghold of Europe (it was founded in Sweden) to North America, as well as to the growing regions of Asia/Pacific and Latin America. The partner channel is more important to Qlik than to any other BI platform vendor except Microsoft, particularly in comparison to its stand-alone data discovery competitors. The partner channel will be particularly important to Qlik's growth after the introduction of QlikView.Next, given the expectation that partners will use the platform's planned improved openness to build new QlikView.Next-based solutions.

Cautions

· The enterprise-readiness of the current release of QlikView remains a work in progress. Despite QlikView being deployed in multiple departments and around the world, only half the QlikView customers we surveyed identified QlikView as their enterprise standard. This is far below the figures of most other incumbent BI vendors, whose customers report standardization rates of over 70%. QlikView received below-average customer survey scores for enterprise features such as metadata management, BI infrastructure and embeddable analytics. Additionally, customers and implementers continued to express concerns about QlikView's facilities for managing security and administering large numbers of named users. Although user deployment sizes and average data sizes continue to increase, they are around the survey average.

· Customers most often select QlikView for its ease of use for end users, particularly in terms of its interactive dashboards and when compared with the offerings of the incumbent IT-centric vendors. However, in terms of visual-based interactive exploration and analysis capabilities, user experience, and the time it takes for business users to gain proficiency in authoring, the current QlikView 11.x release is considered more limited than offerings from other stand-alone data discovery vendors. With QlikView.Next, Qlik is placing major emphasis on filling this gap.

· Qlik plans for QlikView.Next to deliver the combination of business user and IT capabilities that is currently lacking in the market. However, QlikView.Next will be delivered more than a year later than expected, which creates opportunities for its competitors to narrow any gaps. Moreover, no major rearchitecting is without risks to both customers and vendor, especially when the latter is also facing a more intense competitive landscape, as is the case with Qlik. It is not unusual for initial "point versions" of major releases to take time to reach complete stability. In addition, adopting this major new release will require some degree of migration, which could delay some deployments that might otherwise have occurred in 1H14. During the extended period before QlikView.Next's arrival, its competitors are not standing still. Incumbent vendors, stand-alone data discovery players and new market entrants continue aggressively to build and enhance their data discovery features, to innovate and make progress (some quickly) toward narrowing Qlik's "land and expand" potential and, more importantly, toward addressing the big "white space" opportunity (to delight business users while still offering IT control) that Qlik plans to address with QlikView.Next.

· Qlik's customer experience results remain mixed. QlikView earned positive scores for product quality, which led to an overall above-average customer experience score. However, support scores for QlikView were again just below the survey average. Similarly, sales experience continued to be rated below the survey average. We believe these results are partly influenced by Qlik's rapid growth, since both support and sales proficiency are strongly correlated with employees' length of service; high growth means a larger percentage of relatively new sales and support people. Moreover, Qlik's sales and support organizations are in transition from selling to and supporting departments to selling to and supporting strategic enterprise deployments. A successful transformation on both fronts is critical if Qlik is to fulfill its enterprise aspirations for QlikView.Next.

Magic Quadrant for Business Intelligence and Analytics

For this Magic Quadrant, Gartner defines BI and analytics as a software platform that delivers 17 capabilities across three categories: information delivery, analysis and integration.

Information Delivery

Reporting: Provides the ability to create highly formatted, print-ready and interactive reports, with or without parameters.

Dashboards: A style of reporting that graphically depicts performance measures. Includes the ability to publish multi-object, linked reports and parameters with intuitive and interactive displays; dashboards often employ visualization components such as gauges, sliders, checkboxes and maps, and are often used to show the actual value of the measure compared to a goal or target value. Dashboards can represent operational or strategic information.

Ad hoc report/query: Enables users to ask their own questions of the data, without relying on IT to create a report. In particular, the tools must have a reusable semantic layer to enable users to navigate available data sources, predefined metrics, hierarchies and so on.

Microsoft Office integration: Sometimes, Microsoft Office (particularly Excel) acts as the reporting or analytics client. In these cases, it is vital that the tool provides integration with Microsoft Office, including support for native document and presentation formats, formulas, charts, data "refreshes" and pivot tables. Advanced integration includes cell locking and write-back.

Mobile BI: Enables organizations to develop and deliver content to mobile devices in a publishing and/or interactive mode, and takes advantage of mobile devices' native capabilities, such as touchscreen, camera, location awareness and natural-language query.

Analysis

Interactive visualization: Enables the exploration of data via the manipulation of chart images, with the color, brightness, size, shape and motion of visual objects representing aspects of the dataset being analyzed. This includes an array of visualization options that go beyond those of pie, bar and line charts, including heat and tree maps, geographic maps, scatter plots and other special-purpose visuals. These tools enable users to analyze the data by interacting directly with a visual representation of it.

Search-based data discovery: Applies a search index to structured and unstructured data sources and maps them into a classification structure of dimensions and measures that users can easily navigate and explore using a search interface. This is distinct from the ability to search for reports and metadata objects, which would be a basic feature of a BI platform.

Geospatial and location intelligence: Specialized analytics and visualizations that provide a geographic, spatial and time context. Enables the ability to depict physical features and geographically referenced data and relationships by combining geographic and location-related data from a variety of data sources, including aerial maps, GISs and consumer demographics, with enterprise and other data. Basic relationships are displayed by overlaying data on interactive maps. More advanced capabilities support specialized geospatial algorithms (for example, for distance and route calculations), as well as layering of geospatial data on to custom base maps, markers, heat maps and temporal maps, supporting clustering, geofencing and 3D visualizations.

Embedded advanced analytics: Enables users to leverage a statistical functions library embedded in a BI server. Included are the abilities to consume common analytics methods such as Predictive Model Markup Language (PMML) and R-based models in the metadata layer and/or in a report object or analysis to create advanced analytic visualizations (of correlations or clusters in a dataset, for example). Also included are forecasting algorithms and the ability to conduct "what if?" analysis.

Online analytical processing (OLAP): Enables users to analyze data with fast query and calculation performance, enabling a style of analysis known as "slicing and dicing." Users are able to navigate multidimensional drill paths. They also have the ability to write-back values to a database for planning and "what if?" modeling. This capability could span a variety of data architectures (such as relational, multidimensional or hybrid) and storage architectures (such as disk-based or in-memory).
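
As a rough illustration of the "slicing and dicing" style of analysis described under OLAP, the sketch below filters a small fact table on chosen dimension values and then aggregates a measure along other dimensions. The data, dimension names and helper functions are hypothetical, and a real OLAP engine would use precomputed or in-memory structures rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, revenue)
SALES = [
    ("EMEA", "widgets", "Q1", 100),
    ("EMEA", "gadgets", "Q1", 50),
    ("EMEA", "widgets", "Q2", 70),
    ("APAC", "widgets", "Q1", 30),
    ("APAC", "gadgets", "Q2", 90),
]
DIMENSIONS = ("region", "product", "quarter")


def slice_rows(rows, **fixed):
    """Keep only rows where each named dimension equals the given value."""
    idx = {name: DIMENSIONS.index(name) for name in fixed}
    return [r for r in rows if all(r[idx[n]] == v for n, v in fixed.items())]


def rollup(rows, *by):
    """Total the revenue measure, grouped by the chosen dimensions."""
    idx = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[i] for i in idx)] += r[3]
    return dict(totals)


by_region = rollup(SALES, "region")
q1_by_product = rollup(slice_rows(SALES, quarter="Q1"), "product")
print(by_region)
print(q1_by_product)
```

Fixing one dimension (`quarter="Q1"`) corresponds to a slice; passing several dimension values to `slice_rows` would select a smaller sub-cube.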

Integration

BI infrastructure and administration: Enables all tools in the platform to use the same security, metadata, administration, object model and query engine, and scheduling and distribution engine. All tools should share the same look and feel. The platform should support multitenancy.

Metadata management: Tools for enabling users to leverage the same systems-of-record semantic model and metadata. They should provide a robust and centralized way for administrators to search, capture, store, reuse and publish metadata objects, such as dimensions, hierarchies, measures, performance metrics/key performance indicators (KPIs), and report layout objects, parameters and so on. Administrators should have the ability to promote a business-user-defined data mashup and metadata to the systems-of-record metadata.

Business user data mashup and modeling: Code-free, "drag and drop," user-driven data combination of different sources and the creation of analytic models, such as user-defined measures, sets, groups and hierarchies. Advanced capabilities include semantic autodiscovery, intelligent joins, intelligent profiling, hierarchy generation, data lineage and data blending on varied data sources, including multistructured data.

Development tools: The platform should provide a set of programmatic and visual tools and a development workbench for building reports, dashboards, queries and analysis. It should enable scalable and personalized distribution, scheduling and alerts of BI and analytics content via email, to a portal and to mobile devices.

Embeddable analytics: Tools including a software developer's kit with APIs for creating and modifying analytic content, visualizations and applications, embedding them into a business process, and/or an application or portal. These capabilities can reside outside the application, reusing the analytic infrastructure, but must be easily and seamlessly accessible from inside the application, without forcing users to switch between systems. The capabilities for integrating BI and analytics with the application architecture will enable users to choose where in the business process the analytics should be embedded.

Collaboration: Enables users to share and discuss information, analysis, analytic content and decisions via discussion threads, chat and annotations.

Support for big data sources: The ability to support and query hybrid, columnar and array-based data sources, such as MapReduce and other NoSQL databases (graph databases, for example). Support could include direct Hadoop Distributed File System (HDFS) query or access to MapReduce through Hive.

Apache Cassandra

Introduction

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

Apache Cassandra is a highly scalable and high-performance distributed database
management system that can serve as both an operational datastore (the “system of record”) for
online/transactional applications, and as a read-intensive database for business intelligence
systems. Cassandra is able to manage the distribution of data across multiple data centers and
offers incremental scalability with no single points of failure.
Cassandra is a logical choice for enterprises that need high degrees of uptime, reliability, and
very fast performance.
Cassandra was originally incubated at Facebook and is based upon Google’s BigTable and
Amazon’s Dynamo software. The end result is an extremely scalable and fault-tolerant data
infrastructure that solves small to big data problems, handles write intensive user traffic, delivers
sub-millisecond caching layer reads, and supports demanding workloads involving petabytes of
data.

Cassandra Architecture

Cassandra is a peer-to-peer distributed data management system where every
node is essentially the same with respect to how it functions in the cluster. In Cassandra, there is
no concept of a “master node” or anything similar, with the benefit being derived that no single
point of failure exists for any key process or function.

The scale-out aspect of Cassandra allows node additions to occur with no disruption to
application uptime. Cassandra automatically partitions data across nodes once one or more
nodes have been added to a cluster and “seeds” the new nodes from existing machines in the
cluster. Data redundancy to protect against hardware failure and other data loss scenarios is also built
into and managed transparently by Cassandra.
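
The automatic partitioning described above can be sketched with a toy token ring. This is a simplified illustration and not Cassandra's implementation: the `Ring` class and the node names are hypothetical, each node gets a single token, and MD5 is used only as a convenient hash.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 used for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A toy token ring: each node owns the keys up to and including its token."""

    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def add_node(self, node: str) -> None:
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.owner(k) for k in keys}

# Keys that changed owner all moved to the new node; the rest stayed put.
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), all(after[k] == "node-d" for k in moved))
```

The point of the design is that adding a node only reassigns the slice of the ring the new node takes over, which is why capacity can be added without reshuffling the whole data set.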

An administrator, architect, or developer only has to specify a replication and data-partitioning
strategy. From there, Cassandra takes care of everything.
All nodes in the cluster communicate with each other through the gossip protocol. If a node goes
down, the cluster detects the failure and automatically routes user requests away from the failed
machine. Once the failed node is operational again, it rejoins the cluster, and its data is brought
back up to date via the other nodes.
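
A minimal sketch of the replication and failure-routing behaviour described above, under loudly simplified assumptions: the node list, the replication factor and the placement rule (hash the key, then take consecutive nodes) are hypothetical, and the `down` set stands in for what the cluster would learn through gossip.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICATION_FACTOR = 3                            # hypothetical setting


def replicas_for(key: str, nodes=NODES, rf=REPLICATION_FACTOR):
    """Pick rf consecutive nodes, starting from one chosen by hashing the key."""
    start = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def live_replicas(key: str, down: set):
    """Replicas a request could still be routed to, given nodes believed down."""
    return [n for n in replicas_for(key) if n not in down]


all_replicas = replicas_for("jsmith")
reachable = live_replicas("jsmith", down={all_replicas[0]})
print(all_replicas, reachable)
```

With three copies of each key, losing one node still leaves two replicas able to serve the request, which is the property the paragraph above relies on.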

Why Cassandra

MySQL drives too many random I/Os
File-based solutions require far too many locks

The new face of data

Scale out, not up
Online load balancing, cluster growth
Flexible schema
Key-oriented queries
CAP-aware

CQL Language

CQL provides a syntax very similar to the SQL used by RDBMSs, making it very easy for
developers and administrators coming from the relational world to begin working with Cassandra.

DDL, DML, and SELECT functionality all can be found in CQL.

cqlsh> CREATE TABLE monkeySpecies (
    species text PRIMARY KEY,
    common_name text,
    population varint,
    average_size int
) WITH comment='Important biological records'
   AND read_repair_chance = 1.0;

CREATE TABLE timeline (
    userid uuid,
    posted_month int,
    posted_time uuid,
    body text,
    posted_by text,
    PRIMARY KEY (userid, posted_month, posted_time)
) WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

cqlsh> INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');

cqlsh> SELECT * FROM users
... WHERE gender='f' AND
... state='TX' AND
... birth_year='1968';
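
To make the compound primary key of the timeline table above more concrete, here is a toy Python model of it. It assumes the usual CQL reading of PRIMARY KEY (userid, posted_month, posted_time): the first column is the partition key and the remaining columns order rows within the partition. The helper functions are hypothetical, and plain integers stand in for the uuid values.

```python
from collections import defaultdict

# Toy model of: PRIMARY KEY (userid, posted_month, posted_time)
# userid is the partition key; posted_month and posted_time are clustering columns.
partitions = defaultdict(dict)


def insert(userid, posted_month, posted_time, body):
    """Upsert a row; rows with the same full primary key overwrite each other."""
    partitions[userid][(posted_month, posted_time)] = body


def select(userid, posted_month=None):
    """Read one partition in clustering order, optionally restricted to a month."""
    rows = sorted(partitions.get(userid, {}).items())
    return [(k, v) for k, v in rows if posted_month is None or k[0] == posted_month]


insert("u1", 201403, 2, "second")
insert("u1", 201403, 1, "first")
insert("u1", 201402, 5, "older")
insert("u1", 201403, 1, "first (edited)")  # same key: replaces the earlier row

march = select("u1", 201403)
print(march)
```

Queries that name the partition key touch a single partition and read rows already in order, which is why the table is keyed by user first and by time second.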

Batched Operations
Cassandra supports tunable consistency on a per-operation basis, meaning developers can
choose how strong or loose they want data consistency to be for a particular request. If a
developer wants to apply a certain consistency level for a number of different requests, he or she
can encase them in a BEGIN and APPLY BATCH statement.

BEGIN BATCH USING CONSISTENCY QUORUM
INSERT INTO users (KEY, password) VALUES ('user1', 'mypass');
UPDATE users SET password = 'newpass' WHERE KEY = 'user1';
INSERT INTO users (KEY, password) VALUES ('user2', 'user2pass');
DELETE name FROM users WHERE KEY = 'user5';
APPLY BATCH;

Batched operations allow a developer to retry (if necessary) a group of changes in an idempotent
fashion.
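
The tunable consistency mentioned above comes down to simple arithmetic on replica counts. The sketch below is illustrative only: the function names are hypothetical, and it assumes the common rule that a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > rf


RF = 3
q = quorum(RF)
print(q, overlaps(q, q, RF), overlaps(1, 1, RF))
```

Reading and writing at QUORUM gives overlapping replica sets at the cost of waiting for more nodes, while reading and writing a single replica is faster but only eventually consistent.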

Cassandra highlights

High availability
Incremental scalability
Eventually consistent
Tunable tradeoffs between consistency and latency
Minimal administration
No SPOF (Single Point of Failure)

Applications suitable to use Cassandra

Dispersed applications that need to serve numerous geographies with the same fast response times

Web online applications or other systems needing around-the-clock transactional input capabilities.

Applications needing extreme degrees of uptime and no single point of failure

Applications that need easy data elasticity, so capacity can be added to service peak workloads for various periods of time and then shrink back when user traffic reduction allows – all done in an online fashion

Write-intensive applications that must take in large volumes of data continuously, e.g. credit card systems, music download purchases, device/sensor data, web clickstream data, archiving systems, event logging.

Management of large data volumes (terabytes-petabytes) that must be kept online for query access and business intelligence processing.

Systems that need to store and directly deal with a combination of structured, unstructured, and semi-structured data, with a requirement for a flexible schema/data storage paradigm that allows for easy and online structure modifications
