We’re pleased to announce the release of MapR 6.0.1 and MEP 5.0.

 

With the 6.0.1 release, MapR includes streaming enhancements that bring real-time applications to market faster, enable more accurate analytics of IoT data, and support a wider array of real-time use cases. In MEP 5.0, major enhancements in Apache Drill 1.13 deliver a performance boost for memory-intensive analytical queries.

 

Key Features:

MapR-ES API updates, including support for event-time timestamps, time indexing of events, event headers, and event interceptors. These enable more accurate analytics of data generated by IoT devices and sensors.

 

Spark 2.2 with Structured Streaming, allowing for more powerful stream processing capabilities, including windowing and aggregations, using the event-time timestamps added in MapR-ES.
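To make the combination concrete, here is a minimal sketch of a Spark 2.2 Structured Streaming job that aggregates sensor readings over event-time windows. It assumes MapR-ES is consumed through the standard Kafka source; the stream path /streams/iot:sensors, the JSON schema, and the field names are illustrative assumptions, not taken from the release.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-windowed-agg").getOrCreate()

# Illustrative schema for the JSON payload carried in each event.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# MapR-ES topics are addressed as stream:topic paths; the path here is hypothetical.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "ignored-by-mapr-client:9092")  # placeholder value
       .option("subscribe", "/streams/iot:sensors")
       .load())

events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# 10-minute tumbling windows keyed on the event-time timestamp, tolerating 5 minutes of lateness.
windowed = (events
            .withWatermark("event_time", "5 minutes")
            .groupBy(window(col("event_time"), "10 minutes"), col("sensor_id"))
            .agg(avg("reading").alias("avg_reading")))

windowed.writeStream.outputMode("update").format("console").start().awaitTermination()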

 

Drill 1.13

  • Drill queries against MapR-ES, allowing for interactive data exploration & ad-hoc SQL queries on data in streams.
  • Spill to disk for tables.
  • Filter pushdown for Parquet.
  • CPU limits on multiple Drill clusters in YARN.
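For the streams item above, a quick way to experiment is to push an ad hoc SQL statement through Drill's REST API. The POST /query.json endpoint shown here is standard Drill; the host name, the stream path in the FROM clause, and any storage-plugin setup needed to expose MapR-ES to Drill are assumptions for illustration.

import requests

DRILL_URL = "http://drill-node:8047/query.json"  # hypothetical Drillbit host

sql = """
SELECT sensor_id, COUNT(*) AS events
FROM dfs.`/streams/iot:sensors`
GROUP BY sensor_id
LIMIT 10
"""

resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql}, timeout=60)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)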

 

Audit Logs Sent to Streams, letting any “listener” collect, monitor, and act on MapR audit events.

 

MapR-DB REST Gateway, allowing developers to use their preferred language to access MapR-DB JSON.
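As a rough illustration of what language-agnostic access looks like, the sketch below reads and writes a JSON document over plain HTTP from Python. The gateway host, port, endpoint paths, table name, and credentials are all assumptions; consult the MEP 5.0 documentation for the exact REST API.

import requests

BASE = "https://gateway-node:8243/api/v2/table"   # hypothetical gateway address and path
TABLE = "%2Fapps%2Fcustomers"                     # URL-encoded table path /apps/customers
AUTH = ("mapr_user", "mapr_password")             # placeholder credentials

doc = {"_id": "cust0001", "name": "Acme Corp", "tier": "gold"}

# Insert a document, then read it back by _id (endpoint shapes are illustrative).
requests.post(f"{BASE}/{TABLE}", json=doc, auth=AUTH, verify=False)
resp = requests.get(f"{BASE}/{TABLE}/document/{doc['_id']}", auth=AUTH, verify=False)
print(resp.json())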

 

Native MapR-DB exploration via Data Science Refinery notebook.

 

Release Notes:

MapR Expansion Pack 5.0.0 Release Notes 

Drill Release Notes 

 

Documentation:

MEP 5.0.0 Reference Information 

 

Related Resources:

Blog:  Extending Your Stream of Record with MapR 6.0.1 and MEP 5.0

MapR Expansion Pack (MEP)

Date: March 6th, 2018

 

We’re pleased to announce the general release of MapR Data Fabric for Kubernetes. MapR Data Fabric for Kubernetes provides persistent storage for containers and enables the deployment of stateful containerized applications. It provides easy and full data access from within and across clouds and on-premises deployments.

 

 

Key Features

MapR integrates with Kubernetes through a storage plugin, so that MapR volumes can be mounted for access by containers.

Static Provisioning: Mount already created MapR volumes for easy access by Kubernetes.

Dynamic Provisioning: Maximize resource usage by creating volumes on-demand as and when needed by applications.

Storage Classes: Enforce SLAs using storage classes to define volume characteristics.

FlexVolume Driver with POSIX Client: Harness the performance benefits of the POSIX client when mounting volumes.

Secure Data for Containers: Leverage MapR tickets to establish a secure end-to-end connection between containers and the data residing on the MapR Converged Data Platform.
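As a small sketch of what dynamic provisioning looks like from the application side, the snippet below uses the official Kubernetes Python client to request a volume through a storage class. The storage class name mapr-secure-sc and the namespace are assumptions; the actual class is whatever your administrator defined for the MapR provisioner.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

# A PVC against a (hypothetical) MapR-backed storage class; the provisioner
# creates the MapR volume on demand when the claim is bound.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="app-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="mapr-secure-sc",
        resources=client.V1ResourceRequirements(requests={"storage": "5Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)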

Key Benefits

  • Persist and scale data with MapR as containers grow in production.
  • Easily synchronize and update applications with unified access and viewing of data, using the MapR global namespace.
  • Reduce development and deployment costs by deploying multiple tenants, isolating and sharing resources.
  • Deploy portable, smart applications using machine learning and MapR Data Fabric for Kubernetes.

 

Download

http://archive.mapr.com/tools/KubernetesDataFabric/

 

 

Installer:

https://maprdocs.mapr.com/home/PersistentStorage/kdf_installation.html

 

Documentation:

https://maprdocs.mapr.com/home/PersistentStorage/kdf_overview.html

 

Related Resources:

https://mapr.com/solutions/data-fabric/kubernetes/ 

https://maprdocs.mapr.com/home/MapROverview/MapR-XD.html

https://maprdocs.mapr.com/home/AdvancedInstallation/UsingtheMapRPACC.html

MapR-XD

pacc

Date: Feb 8th, 2018

 

We’re pleased to announce the general release of the MapR Expansion Pack (MEP) version 4.1.

MapR Expansion Packs are an expanded version of the MapR Ecosystem Pack (MEP), which is a way to deliver ecosystem upgrades decoupled from core platform upgrades. This expansion means that we can also deliver some core functionality in a faster way, using the framework we put together for ecosystem projects, allowing you to upgrade independently of your core platform.

 

MEP 4.1 features new releases of Apache Drill and the MapR Data Science Refinery, as well as Python and Java bindings for the MapR-DB OJAI connector for Apache Spark.

Key Features

Apache Drill 1.12

  • Index-based query plans for queries without filters, including queries with GROUP BY, JOIN, and DISTINCT projections.
  • Ability to submit queries from the REST API when impersonation is enabled and authentication is disabled.
  • Support for NaN (Not-a-Number) and Infinity (both POSITIVE and NEGATIVE) as numeric values.
  • System options improvements, including a new internal system options table.

Release notes: https://maprdocs.mapr.com/home/EcosystemRN/drill-1.12.0-release-notes.html

Also check out our latest Drill blog post:  Apache Drill 1.12 on MapR 6.0: Release Highlights 

 

Apache Spark

 

  • Support for Java and Python APIs for the MapR-DB OJAI connector (see the sketch below)
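A rough sketch of what the Python side of the connector can look like is below. The loadFromMapRDB entry point and the table path are assumptions based on the connector's published examples; check the linked release notes for the exact API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ojai-connector-sketch").getOrCreate()

# Load a MapR-DB JSON table into a DataFrame (method name and table path assumed);
# write-side methods exist as well, but their names vary by connector version.
customers = spark.loadFromMapRDB("/apps/customers")
customers.filter(customers["tier"] == "gold").show()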

Release notes: https://maprdocs.mapr.com/home/EcosystemRN/SparkRN-2.1.0-1801.html#concept_ebx_m53_mbb

 

MapR Data Science Refinery 1.1

  • Support for the Spark interpreter, configured to launch Spark jobs in YARN client mode
  • Enhancements in installing custom Python environments for the Livy and Spark interpreters
  • Improvements in launching multiple Zeppelin containers on the same host

Release notes: https://maprdocs.mapr.com/home/DataScienceRefinery/whats_new_DSR_1.1.html

 

All Ecosystem Components

 

MapR support for Sentry is limited to Impala users.

 

 

Download

MEP 4.1.0:

http://package.mapr.com/releases/MEP/MEP-4.1.0/

 

UI Installer:

Index of /releases/installer

 

Release Notes

https://maprdocs.mapr.com/home/EcosystemRN/MEP4.1.0.html

 

Documentation

 

Related Resources

Have a Question?

Ask in the comments below.

In Case You Missed My Release Pitch

We just released Apache Drill 1.12 on MapR 6.0 as part of MEP 4.1 (MapR Expansion Pack). Continuing with the Drill 1.11 theme that I outlined in my previous post here in late November, we have made improvements in the most recent release.

 

Here are the highlights:

  • Exploratory queries (those not requiring any filters) on operational data in JSON tables in MapR-DB can leverage secondary indexes for a speed-up.
  • Exploratory queries on Parquet files in the MapR file system (MapR-XD) have improved by at least 2x. 
  • Several contributions from the open source community, including UDFs for facilitating network analysis and usability improvements.
  • 140+ bug fixes that improve quality overall. 

 

Data Exploration on Operational Data on JSON Tables in MapR-DB and Historical Data on Parquet in MapR-XD

One of the key features of MapR-DB and MapR-XD is that they allow data scientists to reuse the same data for advanced analytics, such as machine learning, AI, or predictive analytics, without the need to export the data. Prototyping, where the focus is on exploring the data while running experiments, is critical to designing new algorithms. In MapR 6.0, we launched a new product called the MapR Data Science Refinery, an easy-to-deploy and scalable data science toolkit with native access to all platform assets and superior out-of-the-box security. To enable data exploration with Drill while prototyping algorithms, data scientists can use the same notebook in Apache Zeppelin to run in-place ad hoc SQL queries (as shown in Figure 1) and visualize the results.

 

From a technical standpoint, we enhanced the performance of exploratory queries on JSON tables in MapR-DB by:

  • Enhancing the query planner to:
    • Leverage secondary indexes for queries lacking filters (i.e., an explicit WHERE clause).
    • Use the sortedness of the data in the index to avoid costly sorting operations.
  • Improving performance for result subsets (i.e., queries using a LIMIT clause) through index pushdown, which reduces the amount of data scanned.
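As a hedged illustration, the query below is the kind of exploratory statement these planner changes target: no WHERE clause, but an ORDER BY plus LIMIT that an index on the sort column can satisfy without a full scan and sort. The table path and index column are hypothetical.

# Representative exploratory query; with a secondary index on o_orderdate, the
# planner can read the index in sorted order and push the LIMIT down instead of
# scanning and sorting the whole JSON table.
exploratory_sql = """
SELECT o_orderkey, o_custkey, o_orderdate
FROM dfs.`/tables/orders`
ORDER BY o_orderdate DESC
LIMIT 100
"""
# Submit through any Drill client (REST API, JDBC/ODBC, or a Zeppelin notebook).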

 

With this feature, Drill on MapR-DB JSON tables can leverage secondary indexes to improve performance both for exploratory queries (which require no filters) and for highly selective queries (which have filters) that require sorting, aggregation, and joins.

 

Figure 1: Sample of modified TPCH exploratory SQL queries on JSON tables in MapR-DB that would benefit from the performance feature. 

 

Parquet, the columnar file format, is considered a standard among our customers for historical analytics on the MapR Platform. To improve Drill's performance on Parquet, we conducted an investigation into the scanner itself last year that revealed the following:

  • An opportunity to move the data from the file directly into direct memory, bypassing the heap. Since the data is moved in 4KB chunks, Drill can leverage the CPU L1 cache and avoid touching the heap, which hurts performance.
  • Vector processing with Java intrinsics (a native function implementation) could give a performance boost. Our tests showed that this improvement was on the order of 2x.
  • Implicit column optimization: Implicit columns carry metadata about the rows in a batch processed from the Parquet file by Drill, such as the file path and name for each row. Testing revealed as much as 20% overhead from carrying this metadata, which was identical across many rows, so we deduplicated these values and represented them by a single value.
  • Implicit data type optimizations include pattern detection in the Parquet metadata: even if the metadata claims variable length for fixed-length data, Drill overrides it and treats the column as fixed-length to leverage JVM optimizations.

Our tests, shown below, demonstrate that we are able to get about a 2-4x improvement in scan performance on Parquet files. The performance gain will be most pronounced for SELECT * queries that need to scan the entire table.

 

Test Results of Running Exploratory Queries on Operational Data in MapR-DB JSON Tables

This was the test setup that we put together to see how the combination of the above performance optimizations would benefit the sample queries:

  • Cluster setup:
    • 10 data nodes; each node had 12 SSD disks of 0.5 TB each, 20 cores, and 256 GB RAM
    • 10 Drill nodes (drillbits) collocated with the data nodes
  • MapR Converged Data Platform configuration:
    • MapR File System (MapR-XD): 4 instances per node, 1 CLDB
    • 4 storage pools per node
    • 2 RPC connections between MapR file system nodes
    • MapR-DB: 4GB tablet size for primary and index tables
  • Drill configuration:
    • planner.memory.max_query_memory_per_node = 4GB
    • planner.width.max_per_node = 14, the default value (70% of 20 cores)
  • Data set:
    • TPCH with SF1000 (1TB)
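For reference, the Drill settings above can be applied with ALTER SYSTEM (or ALTER SESSION for a single connection) through any Drill client; this is a minimal sketch mirroring the values used in the test rig.

tuning_statements = [
    # 4 GB of query memory per node, expressed in bytes.
    "ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 4294967296",
    # Maximum per-node parallelization width used in the test (the default of 70% of 20 cores).
    "ALTER SYSTEM SET `planner.width.max_per_node` = 14",
]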

 

The results are shown in Figure 2. All the queries show significant improvement in performance except for two queries that require retrieving the maximum values. Such a query would require scanning the entire index table to ensure that the maximum value was identified.

Figure 2: Performance test of a sample of modified TPCH exploratory SQL queries on JSON tables in MapR-DB.

 

Test Results of Running Exploratory Queries on Historical Data on Parquet Files in MapR-XD

This was the test setup that we put together to see how the combination of Parquet scanner optimizations would impact the performance of a SELECT * type of query:

  • Cluster setup
    • 10 data nodes; 23 TB HDD per node; each node had 32 cores and 256 GB RAM
    • 10 Drill processing nodes (drillbits) collocated with the data nodes
  • MapR Converged Data Platform configuration:
    • MapR File System (MapR-XD): 1 instance per node, 1 CLDB
  • Drill configuration
    • planner.memory.max_query_memory_per_node = 8GB, 4 times the default of 2GB, to avoid any spill-to-disk scenarios
    • planner.width.max_per_node = 1; parallelization was reduced to measure the impact of a single scan thread more accurately
  • Data sets
    • TPCH SF100 Parquet, Snappy compressed

 

The results of the testing are shown in Figure 3. We measured the performance gain of an individual scan fragment across all fragments for multiple runs of a query and observed the 2x improvement. However, as predicted, the overall query performance gain (30% in Figure 3) depends on other factors such as filter complexity, aggregations, joins, or sorting operations.

Figure 3: Performance test of an exploratory query on Parquet in MapR-XD.

 

Wild Card Text Search Performance on Parquet Files

Unknown to many customers, Drill, much like standard SQL, has the ability to search text (see Figure 4) within a document as part of a filter. A "regular expression," specified as a grammar in the filter, can help detect text patterns. We introduced several improvements to this search in Drill 1.11 but tested them in this release. Prior to Drill 1.11, Drill used the Java Regular Expressions library for pattern matching. The library required that, for each record, the data be copied from direct memory (the area of memory that Drill controls) into the heap (managed by the garbage collector, outside Drill's control) before the regular expression was evaluated, which hurt performance.
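A simple example of the kind of filter being discussed is below: a LIKE predicate with one wildcard per word, which is the pattern that benefits most from the improvements described next. The table location and column are hypothetical.

wildcard_sql = """
SELECT COUNT(*)
FROM dfs.`/data/tpch/lineitem`
WHERE l_comment LIKE '%carefully%'
"""
# Patterns with many %-wildcards still fall back to a full text scan, as discussed in the test results.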

 

To improve upon this feature, we did the following:

  • First, we introduced character-based inspection for commonly seen patterns in direct memory itself.
  • Second, we introduced optimizations to the ‘contains’ for-loop and minimized comparisons.

 

We carried out tests in the same cluster as the tests for exploratory queries in MapR-DB JSON tables described above. The queries ran on the TPCH dataset with a scale factor of 1000, stored as Parquet. The results are shown in Figure 4. We see an increase in performance for regular expressions that had one wildcard per word. As more % wildcards (i.e., match-any text) were present in the query, a full text scan had to be done, which hurt performance. Improving this is part of our roadmap for the next phase of this project.

 

Figure 4: Performance test of wildcard text search queries on Parquet in MapR-XD.

 

Community Contribution Highlights

I am happy to report that the Apache Drill community has ramped up its activity in the last several months. In September of last year, we organized a Drill Developer Day that attracted users and developers from around the Bay Area. I thought it would be worthwhile to highlight some of the contributions, as these are available in the current release as well. Note that we have not subjected these features to our internal testing and hence do not support them. But that should not deter you from trying them out and suggesting improvements through the dev and user mailing lists.

 

Here are the highlights:

  • New plugins: Kafka storage plugin and OpenTSDB plugin.
  • Graceful shutdown of Drillbits: Shutdown Drillbits to be reused for something else without disrupting the service.
  • A collection of networking functions that facilitate network analysis using Drill (DRILL-5834).
  • Geometry functions, ST_AsGeoJSON and ST_AsJSON, that return GeoJSON and JSON representations.
  • Filter pushdown for Parquet with multiple row groups to improve performance.
  • IF NOT EXISTS support for CREATE TABLE and CREATE VIEW: previously, creating a table or view whose name already existed would error out; IF NOT EXISTS makes the statement a no-op instead (see the example after this list).
  • Syntax highlighting and checks during storage plugin configuration: Users can get feedback about the storage plugin information they submit. Prior to Drill 1.12, all storage plugins were initialized for every query. This had two disadvantages: a performance hit and a query failure if any of the storage plugins failed to initialize due to incorrect information. In this release, we have introduced a feature that allows initialization of only the essential plugins required to run a query. Both the above features should improve the user experience overall.
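As an illustration of the IF NOT EXISTS contribution noted above, the statement below simply succeeds without effect instead of failing when the target table already exists. The workspace, table, and source paths are hypothetical.

ctas_sql = """
CREATE TABLE IF NOT EXISTS dfs.tmp.daily_summary AS
SELECT sensor_id, AVG(reading) AS avg_reading
FROM dfs.`/data/readings`
GROUP BY sensor_id
"""
# The same guard applies to CREATE VIEW IF NOT EXISTS.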

 

It's Your Turn to Try

No release is ever complete without you giving it a try and seeing for yourself if it delivers the value you are looking for. With that, I invite you to give the new Apache Drill 1.12 on MapR 6.0 a try, and let me know your feedback. 

 

The MapR Converge Community pages for Apache Drill also have a lot of good material. Check them out here

 

Happy Drill-ing!

In Case You Missed My Release Pitch

Apache Drill 1.11 on MapR 6.0 was released with new enterprise capabilities for a faster, more secure, and robust interactive BI analytics experience. Drill now leverages the new secondary index technology on semi-structured operational data stored in MapR-DB, the database for the world's most data-intensive applications, to speed up analytics, deliver insights, and drive better decisions. New security features ensure protection of sensitive data as it is accessed, processed, and delivered to end users. Several enhancements were added to improve the handling of analytic workloads running on the system, including spooling data-intensive queries to disk and managing them through queues.

 

Some Introduction is a Good Thing

The current trend in big data analytics marketing dictates that every product release, such as this one, warrants that the product manager (yours truly!) make a big splash in a blog post, promoting the greatest features that the team has shipped. Performance benchmark numbers must be published in carefully orchestrated setups to make the query engine appear like the North Star. Self-service? Fast ETL? Sub-second response times faster than the speed of thought? No problems. They got it all.

 

Big data marketing myths are far from enterprise reality. I am not going to do any of that. Instead, I want to talk about the Apache Drill release on MapR in the context of the problems we see our enterprise customers facing on their big data journey toward enterprise-wide analytics access. Democratization of access and self-service analytics continue to be the cornerstones of the current BI wave. From that perspective, there are three big challenges we see:

  1. Analytics market shift continues: Traditional warehousing solutions continue to be heavily optimized for smaller scales, ranging in the several hundreds of terabytes, yet they still strain the IT budget to scale. Many of these vendors, seeing the inevitable market shift, are still harvesting the market by sticking to their pricing. This, among other factors, is leading to increased displacement of these tools from use cases where they are overkill. For example, scheduled reports are a classic use case where a low-cost big data solution can be designed to deliver the same customer-relevant SLAs. Over the next several years, I expect to see query engines like Apache Drill mature to displace more enterprise-critical use cases from these traditional systems, with the only differences being that they will do it at massive scale and low cost.
  2. Enterprise-grade needs: Customer expectations for enterprise features that traditional solutions have provided are non-negotiable. These needs include reliability, performance, and security at scale, at both the query and data layers. I mention the data layer as well because there is no point building an MPP query engine without a tight integration to an industry-grade data platform.
  3. Analytics on data as-it-happens: Analysts do not wish to wait for a slow and expensive ETL process to kick in to make the operational data available for analytics in the data mart or the warehouse. 

None of these problems are easy to solve, especially when you are designing an industry-grade solution at massive scale. 

 

MapR Analytics Approach

One of our core design beliefs is that a query engine can never be designed in isolation but must be tightly integrated with the underlying data layer. It should come as no surprise that we continue to invest time and effort to strengthen the core data platform that we have built from the ground up. Keeping reliability and performance at scale as our end goal, we have built an industry-grade distributed file system (MapR-XD), a NoSQL database for data-intensive operational applications (MapR-DB), and real-time streams for IoT scale (MapR-ES). To simplify operation, management, and deployment, these technologies are unified within the software layer, yielding convergence (i.e., data can be streamed directly into the database and stored as files, all within a single machine). Such a design precludes the need to maintain these as separate systems and manage their complexity.

 

For several years now, MapR has contributed to Apache Drill, an open source MPP query engine project. What is truly unique about Apache Drill is its ability to discover schema of the data on the fly. This capability becomes important as the amount of unstructured data stored continues to increase exponentially in the enterprise. In my opinion, most of the MPP query engines currently have a short-term focus on solving the structured data problem, reminiscent of the relational world. On D-Day, they will have to contend with querying all this unstructured data in-place without complete knowledge of their underlying schema. And in real-time. From the MapR perspective, we believe that Apache Drill prepares the world better for this impending data explosion. 

 

So What Challenges Have We Solved?

To address the above challenges we see in the enterprise, the Apache Drill 1.11 integration on MapR has the following capabilities:

1. Operational analytics access: Apache Drill had an existing integration with MapR-DB (a row-based NoSQL database), but queries containing filter conditions on columns that were not part of the primary key were understandably slow. This was due to the lack of secondary indexes that could speed them up by indexing those columns. In this release, we leverage the native secondary indexes introduced in MapR-DB to speed up highly selective queries by a factor of at least 10x.

Conceptual diagram, showing how Apache Drill integration on MapR enables operational and historical data analytics

 

The performance benefits can be substantial at scale, at a fraction of the cost of a traditional warehousing system! Analysts no longer have to wait for the operational data collected from transaction systems to be ETL-ed into historical data stored in columnar Parquet format. Besides, columnar formats are ill-suited for queries that are highly selective. For example, we ran a simple test rig that confirmed these results, as shown below.

Results from a simple test verify that secondary indexes speed up Drill queries on MapR-DB JSON, especially when queries are highly selective (corresponding to a low selectivity percentage).

 

We also carried out a concurrency test with different numbers of concurrent users (each user is a user stream) sending simple queries into the same system on a TPC-DS dataset. The queries were fired in batches, with the next query being executed when a query slot became available. Based on the type of query, the filter selectivity, and the capacity available, we observed that highly selective queries increase system throughput.

Simple concurrency test results showing query throughput and response times as the type of query, selectivity, and number of users are varied.

 

As stated earlier, our vision at MapR is to leverage Apache Drill as a unified SQL access layer across files, tables, and streams. We have made significant strides in enabling this capability on files and, with this release, are expanding it to tables. Our near-term focus is to introduce streaming analytics on the platform. For the curious, we are beginning to experiment with streaming SQL analytics in the open source community by building an experimental Kafka plugin. Even though KSQL from Confluent has a developer preview out, we are not convinced that its semantics and architecture considerations have been well thought out, and this requires deeper discussion on our end.

 

2. Enterprise-grade query engine capabilities: Security and resource management are the primary concerns.

a. Security: One of the enterprise security requirements for analytics we see is the protection of sensitive data as it is accessed, processed, and delivered to analysts. We delivered the first set of these features in Apache Drill 1.10 and have completed the rest in this release. In brief, the security features include multiple authentication mechanisms (PAM, SSL, Kerberos, MapR Security) and associated state-of-the-art encryption, all implemented through MapR SASL. ZooKeeper contains key information used by Drill for discovery and coordination of cluster nodes. In this release, this ZooKeeper information is viewable but protected through ACLs (Access Control Lists). In a future release, we are looking to secure the entire cluster through full authentication and encryption support on all paths leading in and out of ZooKeeper.

Authentication options on various communication paths within the Drill architecture

 

b. Improved resource management: Apache Drill's initial architecture and design, much like Google's Dremel, was based on optimistic execution semantics that assumed the availability of ample memory and CPU core resources to execute queries. Such an assumption was justified at Google scale, where clusters can run into thousands of nodes, or when queries operate on smaller workloads; if such a query failed, it was perfectly okay to re-run it. In the enterprise, resources are limited, and workloads can vary widely. Hence, the expectation is that the query engine be pessimistic, especially in scenarios where there is resource contention. In this release, Apache Drill has introduced more pessimistic execution semantics.

 

Apache Drill supports spooling to disk for two memory-intensive operators that inevitably appear in most BI/SQL queries: aggregation and sorting. Thus, queries containing such operators will slow down rather than fail. We have tested this functionality on MapR-XD and found that queries that would normally fail under limited memory conditions now complete. As expected, these queries do undergo a deterioration in performance. In the next several releases, there are plans in open source to add this functionality to other operators, such as join, so that queries as a whole can spool to disk if the need arises.

 

We also support the ZooKeeper-based queue feature to better manage the concurrency of the system, with a setting that decides the number of queries that can run concurrently in the Drill cluster at any point in time, while all other queries wait. This Drill cluster concurrency should not be confused with user concurrency, which is typically defined as the number of concurrent queries being sent by all users through some front-end client (i.e., Tableau, the Web UI, or the REST API). That number includes queries that are being processed and queries waiting to be processed by the Drill cluster; the waiting time can slow the response time seen by all users, and it depends in turn on the cluster's response time.

 

The feature was first released in open source in November of last year, but we have since tested it and added new features to this capability. The key idea behind using queues is that you can tune the system to allow light-workload queries (typically interactive) to run more frequently than heavy-workload queries (typically reporting or batch data-intensive queries) that can burden the system, leading to slower system response times or even failures.

For every 5 queries in the small queue that are getting executed, only 1 query in the large queue is allowed to execute. All other queries must wait in FIFO order and are routed to the appropriate queue when a slot opens. 
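A minimal sketch of the corresponding queue settings is below, applied with ALTER SYSTEM through any Drill client. The option names follow the Drill queueing documentation, and the values reproduce the 5:1 small-to-large ratio described here; the cost threshold is an illustrative figure that has to be tuned per workload.

queue_settings = [
    "ALTER SYSTEM SET `exec.queue.enable` = true",
    "ALTER SYSTEM SET `exec.queue.small` = 5",             # concurrent light-workload queries
    "ALTER SYSTEM SET `exec.queue.large` = 1",             # concurrent heavy-workload queries
    "ALTER SYSTEM SET `exec.queue.threshold` = 30000000",  # planner cost separating small from large
]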

 

If you are smart enough to manage the spill-to-disk options, then the heavy-workload queries will slow down but not fail. This slowdown, due to queuing and spilling, discourages those users who may have overloaded the system. Incidentally, the control on concurrency also sets an upper bound on the number of CPU cores that can be used overall, as each query uses the same planner.width.max_per_query parameter. If these resources are sized appropriately, the response time of the system for interactive workloads at a given user concurrency can be improved, since the waiting time for such queries is reduced.

Sample parameters of Drill queues, showing that 5 light workload queries ("Small") run for every heavy workload ("Large") query. 

 

As shown above, Cost Threshold is a measure of the workload size computed by Drill at query time. Since the parameter is an estimate, it needs to be tuned. To enable tuning during POCs and testing, we provide this estimate in the query profile itself, so that you can trace how queries with well-defined workloads were queued based on that parameter. For the curious, query profiles are JSON files that can be queried by Drill itself. From our point of view, this manual queuing feature represents an important step toward our future goal of making queuing automatic and dynamic by adding more intelligent control through feedback mechanisms.

To enable threshold parameter tuning, query profiles contain a new column for total cost and its associated queue.

 

And One Last Thing

If you are an aspiring data scientist, learning SQL could be one thing worth doing now. We recently announced the launch of a new product called the MapR Data Science Refinery, a scalable data science offering that comes pre-packaged with a notebook, Apache Zeppelin. You can use the notebook as a SQL interface to retrieve data and visualize it.

Drill queries alongside a simple pie chart representation.

 

Now, It's Your Turn to Try

No release is ever complete without you giving it a try and seeing for yourself if it delivers the value you are looking for. With that, I invite you to give the new Apache Drill 1.11 on MapR 6.0 a try, and let me know your feedback. 

 

The MapR Converge Community pages for Apache Drill also got a face-lift. Check it out here

 

Happy Drill-ing!

 Today, we are proud to announce the next-generation MapR Control System (MCS). Building on the Spyglass Initiative, this new administrative product couples a unified and intuitive management interface with scalable and secure monitoring for on-premises, cloud, and edge clusters.  With its integration of files, tables, and streams, MapR has always been the simplest and most cost-efficient platform to manage. Today it gives administrators further control and visibility to manage the converged infrastructure.

 

Organizations are not only looking to drive a lot more business value from data investments but also to benefit from the wave of infrastructure agility. They are encountering several challenges:

  • Data silos and isolated tools that bring additional complexity.
  • An ever-growing ecosystem that brings new innovations but at a higher operational cost.
  • Lack of knowledge and visibility in tuning to get the best ROI.
  • Difficulty setting up critical enterprise-grade features for high availability, disaster recovery, and data protection.

 

MCS has been redesigned to take an already self-managed platform and make cluster operations easy, autonomous, and intelligent. In its first release, MCS makes it easy for customers to manage their cluster infrastructure and MapR platform using a simple, intuitive, and actionable interface. With a unified management solution, built-in real-time insights, and fast time-to-action, MCS is a vital component in the DataOps movement.

 

In conversations with our users, we learned that monitoring is a big piece of day-to-day big data operations. MCS in MapR 6.0 expands on the Spyglass Initiative by correlating events with metrics and logs and providing actionable recommendations. It also simplifies common tasks such as managing tenant quotas, monitoring nodes and services, and configuring data elements.


The top features of MapR 6.0 MCS include:

  • A quick-glance cluster dashboard
  • Resource utilization by node and by service
  • Capacity planning using storage utilization trends and per-tenant usage
  • Easy setup of replication, snapshots, and mirrors
  • The ability to manage cluster events with related metrics and expert recommendations
  • Direct access to default metrics and pre-filtered logs
  • The power to manage MapR-ES and configure replicas
  • Access to MapR-DB tables, indexes, and change logs
  • Intuitive mechanisms to set up volume, table, and stream ACEs for access control

 

We released MapR Monitoring with 5.2 to build a customizable and extensible monitoring framework. With 6.0 we are making it even more scalable with the use of MapR-ES in the core architecture. We are also driving more granular visibility through, for example, volume performance metrics such as latency, IOPs, and throughput. With additional flexible deployment options such as Installer Stanzas and cloud offerings, we are on a mission to empower our administrators.

 

Learn more about the new MCS with MapR 6.0 at mapr.com.

Announcing: MEP 3.0.2, 2.0.3, 1.1.4 

Date: 11/21/17

 

We are pleased to announce the following MEP maintenance releases:

  • MEP 3.0.2: a maintenance release for our MEP 3.0 branch
  • MEP 2.0.3: a maintenance release for our MEP 2.0 branch
  • MEP 1.1.4: a maintenance release for our MEP 1.1 branch

 

MEP 3.0.2

MEP 3.0.2 is a maintenance release on the MEP 3.0 train that includes fixes to Apache Spark, Hive, Sqoop, Impala, and Hue, among others. 

 

The list of fixes can be found here:

Spark 2.1-1710

Hive 2.1-1710

Hue 3.12-1710

Sqoop 1.4.6-1710

Impala 2.7-1710

 

MEP 2.0.3

MEP 2.0.3 is a maintenance release on the MEP 2.0 train that includes fixes to Apache Hive, Oozie, and Sqoop.

 

The list of fixes can be found here:

Hive 1.2.1-1710

Oozie 4.2.0-1710

Sqoop 1.4.6-1710

 

MEP 1.1.4

MEP 1.1.4 is a maintenance release on the MEP 1.1 train that includes fixes to Apache Oozie.

 

Oozie fixes can be found here:

Oozie 4.2.0-1710

 

Download

MEP 3.0.2:

Index of /releases/MEP/MEP-3.0.2 

 

MEP 2.0.3:

Index of /releases/MEP/MEP-2.0.3 

 

MEP 1.1.4:

Index of /releases/MEP/MEP-1.1.4 

 

ECO-1707:

http://package.mapr.com/releases/ecosystem-4.x/

http://package.mapr.com/releases/ecosystem-5.x/

 

UI Installer:

Index of /releases/installer 

 

Documentation

Announcing: MapR Expansion Pack (MEP) 4.0 Released

Date: Nov. 21st, 2017

 

We’re pleased to announce the general release of the MapR Expansion Pack (MEP) version 4.0.

 

MapR Expansion Packs are an expanded version of the MapR Ecosystem Pack (MEP), which is a way to deliver ecosystem upgrades decoupled from core platform upgrades. This expansion means that we can also deliver some core functionality in a faster way, using the framework we put together for ecosystem projects, allowing you to upgrade independently of your core platform.

 

MEP 4.0 is the first MEP for the MapR 6.0 release train, and there's a lot of new content to support the new 6.0 initiatives, including security enhancements across the ecosystem and feature-complete language bindings for MapR-DB OJAI.

 

Key Features

 

Security Enhancements

MEP 4.0 is designed to have security enabled with a single click. This means that wire-level encryption and authentication for most network-based connections are automatically enabled for new clusters, delivering an innovative approach to simplified security.
The MapR Installer has a simple Enable Security check box that ensures the platform and required ecosystem components are configured properly. If you're doing a manual install, 'configure.sh -secure' has been enhanced to enable security for the platform and ecosystem at once.

Read More

 

MapR Container for Developers

The MapR Container for Developers is a Docker image containing a single-node MapR deployment that includes MapR-FS, MapR-DB, MapR Event Streams, and Apache Drill.
This Docker image was built for developers who want to create new applications and services or simply learn more about MapR.

Read More

 

MapR-DB OJAI Connector for Apache Spark: Support for DataFrames and DataSets 

The MapR-DB OJAI Connector for Apache Spark is a tool that makes it easier to build real-time or batch pipelines between your data and MapR-DB and leverage Spark within the pipeline. 

MapR previously released support for RDDs and is now expanding this support to enable in-place ML/AI and real-time analytics via native Spark connectivity, using all key Spark constructs: RDDs, DataFrames, and Datasets.

Read More

 

Hive support for MapR-DB JSON tables

Hive support for MapR-DB JSON tables enables ETL and batch processing using native Hive integration, deployed via a new Hive storage handler for MapR-DB JSON tables. This provides the ability to use complete Hive functionality on MapR-DB JSON tables.

Read More

 

Myriad Refresh: Myriad 0.2

MapR has updated the MapR-packaged version of Apache Myriad to 0.2, bringing in some bug fixes. We've added support for Myriad security, meaning all web/API endpoints are authenticated and Myriad is supported on a secure cluster. We've also added an option for Myriad to accept GPU offers from Mesos.
Read More

 

OpenStack Manila GA

MapR is officially releasing the MapR plugin for OpenStack Manila, which allows OpenStack-based clouds to provide file services to users backed by MapR storage. This plugin, combined with MapR Cloud-scale Multi-tenancy, allows a MapR platform to be shared among multiple organizations.

Read More

 

Apache Drill 1.11

MapR-Drill 1.11 was released with new enterprise capabilities for a faster, more secure, and robust interactive BI analytics experience. Drill now leverages the new secondary index technology on semi-structured operational data stored in MapR-DB, the database for the world's most data-intensive applications, to speed up analytics, deliver insights, and drive better decisions. New security features ensure protection of sensitive data as it is accessed, processed, and delivered to end users. Several enhancements were added to improve the handling of analytic workloads running on the system, including spooling data-intensive queries to disk and managing them through queues.

Read More

 

MapR Monitoring Updates

We have some new updates to our MapR Monitoring Stack, including the use of MapR Event Streams for security and scale.

  • Metrics are now written to MapR Streams before being stored in OpenTSDB, for enhanced security and scale.
  • Additional performance metrics for MapR Volumes (throughput, latency, IOPS) are available with the updated "Volume Dashboard" in Grafana.
  • Updates to the log stack: FluentD 0.14.2, Elasticsearch 5.4.1, Kibana 5.4.1.

 

All Ecosystem Components (*denotes re-release)

The following is a list of components included in the MEP 4.0 release, supported for MapR 6.0.X.

 

 

MEP 4.0 Contents (Release Notes | Documentation)

Apache Drill 1.11: Release Notes | Documentation
Apache Hive 2.1.1*: Release Notes | Documentation
Apache Flume 1.7: Release Notes | Documentation
AsyncHBase 1.7: Release Notes | Documentation
Apache Myriad 0.2.0: Release Notes | Documentation
Apache Oozie 4.3.0: Release Notes | Documentation
Apache Pig 0.16: Release Notes | Documentation
Apache Sentry 1.7: Release Notes | Documentation
Apache Spark 2.1.0*: Release Notes | Documentation
Apache Sqoop 1.4.6*: Release Notes | Documentation
Apache Sqoop2 1.99.7*: Release Notes | Documentation
HttpFS 1.0*: Release Notes | Documentation
Hue 3.12*: Release Notes | Documentation
Impala 2.7: Release Notes | Documentation

 

Download
MEP 4.0.0:

Index of /releases/MEP/MEP-4.0 


UI Installer:

Index of /releases/installer 

 

Documentation

 

Related Resources

Have a Question?

Ask in the comments below.

Date: Nov. 21, 2017

 

Data Science is a hot topic in boardrooms right now. Everybody wants to adopt AI/ML, hire the best and brightest data scientists, and enable them to experiment and build intelligent applications. New deep learning libraries have made it possible to analyze new types of data and even gain new insights from historical data. Massive amounts of data are being generated from the boom in IoT computing, which means there’s even more demand for ML aggregation at the edge. Everybody wants in.

 

But what we’re seeing is that our customers are struggling with existing solutions not scaling sufficiently to allow them to derive business value from ML. Most solutions currently available require the use of entirely new clusters with limited access to data and high IT overhead. Models are built on the small samples of data that can be accommodated and then deployed into production. Many offer closed platforms that cannot be extended to include popular emerging tools and libraries.

 

At MapR, the approach that we take is highly governed by what we’re hearing from our customers. And what we’re hearing is that they want a complete, open, secure, and converged solution. They want the ability to collaborate, visualize, and build while still keeping things easy to deploy and manage. And they don’t want another cluster.

 

That is why we’re launching the MapR Data Science Refinery. MapR will provide a scalable data science offering with native platform access, superior out-of-the-box security, and access to global event streaming and a multi-model NoSQL database.

 

 

We’ve seen that our customers need agile, easy-to-deploy solutions that can scale to fit the needs of all types of data science teams. Within our platform, we’re offering support for popular open source tooling in a small footprint, containerized, and preconfigured offering that can be distributed to many data science teams across multitenant environments.

 

The MapR Data Science Refinery plans to initially ship with a data science notebook, Apache Zeppelin, offering:

 

 

  • Access to All Platform Assets - The MapR FUSE-based POSIX Client allows app servers, web servers, and other client nodes and apps to read and write data directly and securely to a MapR cluster, like a Linux filesystem. In addition, connectors are provided for interacting with both MapR-DB and MapR-ES via Apache Spark connectors.
  • Superior Security - The MapR Platform is secure by default, and Apache Zeppelin on MapR leverages and integrates with this security layer using the built-in capabilities provided by the MapR Persistent Application Container (PACC).
  • Extensibility - Apache Zeppelin is paired with the Helium framework to offer pluggable visualization capabilities.
  • Simplified Deployment - A preconfigured Docker container provides the ability to leverage MapR as a persistent data store.

 

This product is supported by and extended by our Data Science Quick Start Solutions (QSS), which are data science-led product-and-services offerings that enable the training of complex deep learning algorithms (i.e., Deep Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks) at scale. Learn more here.

 

ML is an active area of research and market innovation, and there are game-changing ML companies investing to improve data science productivity and build domain-specific machine learning solutions. As a data platform company, we want to be open and give our customers flexibility to use these solutions on the petabytes of business data they are relying on MapR to store and manage. So, we have extended this offering with selected Refinery partnerships as a holistic approach to enabling the MapR platform for all types of data science teams.

 

You can find out more about this offering and our partnerships in the MapR Data Science Refinery Converge Community Area.

Today, we are delighted to announce the next level of advancement of the MapR Converged Data Platform with the latest release of MapR-DB 6.0–the modern database for global data-intensive applications.

 

At MapR, our goal has been to build a complete data platform with a built-in, modern, scalable database for creating a broad variety of operational, analytic, and real-time applications spread across on-premises, edge, and multi-cloud environments, with no complex trade-offs or compromises. MapR-DB enables this broad variety of applications by bringing critical database capabilities into one system, as shown below.

 

 

 

MapR has systematically built MapR-DB to be a converged and complete database over the past 3 years, and the latest MapR-DB 6.0 release delivers on this broader vision. 

 

MapR-DB 6.0 is a significant milestone. With this release, we are introducing several new capabilities and performance improvements to expand the usage of the database in organizations.

 

Here is the summary of the key features in this release.

Powerful and Efficient Data Access with Native Secondary Indexes

Prior to 6.0, MapR-DB was optimized only for rowkey-based access. The new built-in secondary indexes expand on this by supporting flexible and efficient queries on any columns in DB tables at scale. This capability enables application developers to build rich new types of applications that support complex user interaction patterns, and it lets business users perform optimized, high-performance SQL queries using familiar BI/analytics tools.

The key features of the secondary indexing functionality include:

  • Native secondary indexes for MapR-DB JSON tables–no external indexing system, such as Elasticsearch or Solr, necessary
  • Scalable and enterprise-grade indexing with auto-propagation, auto-scale, and auto-management
  • Extreme index scalability and performance with SSD optimizations
  • Rich indexing functionality–unlimited indexes, composite indexes with large # of columns, comprehensive data types support, hashed indexes, covering/non-covering query support, security, and more
  • Highly functional and seamless queries across primary and secondary index tables
  • Optimized index-based access for application development and BI/Analytics

 

Rich and Expanded Application Development with MapR-DB OJAI 2.0 APIs

OJAI (Open JSON Application Interface) is the API for developing applications against the MapR-DB document data model. In 6.0, we are expanding the API for more functionality and performance.

The new capabilities include:

  • New and intuitive OJAI query interface
  • JSON grammar and fluent API semantics
  • Rich expressive language support, including conditional filtering, sorting, and pagination support
  • Efficient queries with seamless index-based access
  • Smart query execution to support operational and operational analytic applications on any data scale and with any query complexity

 

Optimized Drill/DB Integration for In-Place SQL Data Exploration and Operational BI

Apache Drill provides flexible SQL analytics on the data in MapR-DB JSON tables. Drill is a distributed SQL query engine and serves as a unified interactive access layer for the MapR Platform, bringing together data from MapR-FS and MapR-DB.

The new capabilities of the MapR-DB and Drill integration optimize SQL data access on MapR-DB, speeding up ad hoc queries. They include:

  • Ability for a variety of Drill SQL queries to seamlessly leverage MapR-DB secondary indexes, significantly speeding up query performance and avoiding large scans
  • Statistics, selectivity, and cost-based index selection
  • Index support for Filter/Sort/Offset/Limit operators
  • Comprehensive index functionality support, including single, composite, covering/non-covering indexes, and index intersection

 

In-Place Advanced Analytics/ML on MapR-DB JSON with Native Spark Connectivity

MapR-DB 6.0 deeply integrates with Apache Spark and MapR-DB JSON tables. Customers can use these capabilities to perform real-time data processing as well as build and serve machine learning models on MapR-DB tables directly without creating analytic silos.

The new capabilities of this integration include:

  • Batch and real-time data processing support with native Spark connectivity
  • Supports for all key Spark constructs–RDDs, data frames/data sets
  • Optimized Spark performance with projection and filter pushdown

 

In-Place ETL/Data Processing on MapR-DB JSON with Native Hive Support

MapR-DB 6.0 deeply integrates with Apache Hive and MapR-DB JSON tables. Customers can use these capabilities to perform ETL/batch processing of the data in MapR-DB tables directly.

The new capabilities of this integration include:

  • New Hive storage handler for MapR-DB JSON tables
  • Support for extensive Hive SQL functionality and data types on MapR-DB tables

 

Real-Time Data Integration and Micro-Services with MapR-DB Change Data Capture API

Built on the foundations of global table replication and MapR Event Streaming, the MapR-DB Change Data Capture API provides a powerful and easy-to-use interface for real-time integration of changes arriving at a MapR-DB table with arbitrary external systems. Users can now build applications that consume and process MapR-DB table data changes, published as 'change log' streams in real time, in a highly scalable way. The change data propagation is granular down to selected columns/fields and supports ordered, at-least-once delivery.

This capability enables use cases such as:

  • Track changes happening to the MapR-DB (Inserts, Updates, Deletes) and perform real-time processing on the data
  • Synchronize data in MapR-DB with a downstream search index (such as Elasticsearch, Solr), materialized views, or in-memory caches

 

All of this new functionality expands the data access capabilities of MapR-DB and can be leveraged in a variety of use cases, such as customer 360, personalization, real-time analytics, IoT, and building scalable, high-performance enterprise apps. MapR-DB 6.0 will be generally available in Q4 2017.

 

For more information on MapR-DB, refer to the following:

Learn more about MapR-DB

MapR-DB performance benchmarks

It's 2017. Every company in business today is either running part of its operations in the cloud or making a plan to do so. Cloud is simply too attractive to ignore, mostly because of the agility that comes from creating infrastructure instantly at the click of a button, but also because of the rock-bottom cost of storing data in cloud object storage. However, tapping into those advantages is anything but straightforward. Each of the major cloud providers presents a moving target of prices and portfolios, so the offering that sounds best today may fall behind next week. Meanwhile, other vendors in the industry are rushing to present themselves as somewhere between "cloud-first" and "cloud-agnostic" in an effort to sound relevant.

 

At MapR, we believe it’s important to focus on customer value when describing our approach to cloud, which comes from several years of helping customers set up public cloud and hybrid cloud data environments.  Today, we’re excited to share those key enablers, as well as some new capabilities, with the MapR Orbit Cloud Suite.  The MapR Orbit Cloud Suite is a set of features and capabilities for the MapR Converged Data Platform that helps companies go native-cloud as well as cross-cloud.  Let’s explore what each of these terms means, how MapR is delivering them, and why they matter.

 

Native-cloud refers to the ability to leverage the best aspects of the cloud, cost and agility, through deep technical integrations between the MapR Converged Data Platform and cloud services.  To this end, we’re announcing two new capabilities.

  • MapR Object Tiering: data stored in the MapR Platform can be automatically and seamlessly offloaded to lower-cost cloud object storage, based on predefined policies.  Best of all, metadata for offloaded data is retained in the MapR Platform as part of the global namespace, so the tiering is transparent to any applications.  This game-changing capability helps customers achieve almost bare-metal performance for hot data, while offering cloud economics for cold data.  For more information about MapR Object Tiering, please see the data sheet.
  • MapR Cloud-Native Installation and Management (available today!) helps companies leverage the agility benefits of the cloud.  By integrating MapR with the VM provisioning APIs of the major cloud providers, users can provision and manage MapR and cloud infrastructure together in a single click.  This integration not only simplifies initial provisioning of clusters, but also allows clusters to be resized on-demand, based on changing data volume.  To get started today, please visit our community spaces for AWS and Azure.

 

Cross-cloud refers to a company’s ability to exist across multiple environments, whether different public clouds, a combination of public and private cloud, or from cloud to edge.  This capability is critical in a world where cloud providers’ prices, portfolios, and strategies change frequently.  Several companies in the industry position themselves as “cloud agnostic” as an answer to this issue, but being able to run in multiple clouds doesn’t help companies that need to run across locations.  Running across multiple clouds, including coordinating data flows and applications, requires much more, and MapR has been uniquely delivering this capability through:

  • MapR Mirroring and real-time data replication for MapR-DB and MapR-Streams, both with built-in reliability and bandwidth optimization.
  • MapR Edge, a scaled-down form factor that distributes processing to the edge.
  • MapR Edge-to-Cloud File Migrate, a new capability for directly integrating edge locations with cloud object storage in real time.  

You can find out more about these capabilities, as well as some additional capabilities for cloud builders, at the MapR Orbit Cloud Suite page.

Announcing: MEP 3.0.1, 2.0.2, 1.1.3, and ECO-1707

Release Date: 8/2/2017

 

We are pleased to announce the following maintenance releases:

  • MEP 3.0.1: a maintenance release for our MEP 3.0 branch
  • MEP 2.0.2: a maintenance release for our MEP 2.0 branch
  • MEP 1.1.3: a maintenance release for our MEP 1.1 branch
  • ECO-1707: a maintenance release for our pre-MEP Eco trains

 

MEP 3.0.1

MEP 3.0.1 is a maintenance release on the MEP 3.0 train that includes some additional/completed feature sets for the MapR Installer, Apache Spark, and the MapR Connector for Teradata.

 

In addition, operating system support for SUSE 12sp2 has been added as part of this MEP.

 

MapR Installer 1.6

Our newest installer release adds support for the following:

  • Adding nodes within different service groups without affecting any other node
  • Running MapR Installer in Docker Container

Read more here.

 

Support for Spark on Mesos

The 2.X branch of Spark changed how Spark Mesos support was handled and moved it into a separate build profile. As part of this release, we’re adding support for this profile.

Read more here.

 

MapR Connector for Teradata

In this release, we complete the feature set of the Teradata Connector for Hadoop, initially released along with MEP 2.0. Added options include:

  • Teradata Fastload Sqoop support
  • Input methods: split.by.amp, split.by.value, split.by.partition, split.by.hash
  • Output methods: batch.insert, internal.fastload
  • Other new options: --batch-size, --access-lock, --query-band, --error-table, --error-database, --fastload-socket-hostname, --num-partitions-for-staging-table, --skip-xviews, --date-format, --time-format, --timestamp-format, --keep-staging-table, --staging-table, --staging-database, --staging-force

Read more here.

 

All Components (**denotes re-release)

The following is a list of components included in the MEP 3.0.1 release, supported for MapR 5.2:

 

MEP 3.0.1 Contents (Release Notes | Documentation)

Apache Drill 1.10**: Release Notes | Documentation
Apache Hive 2.1.1**: Release Notes | Documentation
Apache Flume 1.7: Release Notes | Documentation
Apache HBase 1.1.8: Release Notes | Documentation
AsyncHBase 1.7: Release Notes | Documentation
Apache Mahout 0.12.0: Release Notes | Documentation
Apache Myriad 0.1.0: Release Notes | Documentation
Apache Oozie 4.3.0**: Release Notes | Documentation
Apache Pig 0.16: Release Notes | Documentation
Apache Sentry 1.7: Release Notes | Documentation
Apache Spark 2.1.0**: Release Notes | Documentation
Apache Sqoop 1.4.6**: Release Notes | Documentation
Apache Sqoop2 1.99.7: Release Notes | Documentation
Apache Storm 0.10.0: Release Notes | Documentation
HttpFS 1.0: Release Notes | Documentation
Hue 3.12**: Release Notes | Documentation
Impala 2.7**: Release Notes | Documentation

 

More detail available here.

 

Release Notes and Versions for Other Releases

MEP 2.0.2: MapR Ecosystem Pack 2.0.2 Release Notes 

MEP 1.1.3: MapR Ecosystem Pack 1.1.3 Release Notes 

 

ECO-1707 Re-Releases

This maintenance release for pre-MEP customers contains re-releases of the project versions listed below:

 

ECO-1707 Projects

  • Hive 1.2.1: Release Notes | Documentation
  • Oozie 4.2: Release Notes | Documentation
  • Spark 1.6.1: Release Notes | Documentation
  • Sqoop 1.4.6: Release Notes | Documentation

 

Download

MEP 3.0.1:

Index of /releases/MEP/MEP-3.0.1 

 

MEP 2.0.2:

Index of /releases/MEP/MEP-2.0.2 

 

MEP 1.1.3:

Index of /releases/MEP/MEP-1.1.3 

 

ECO-1707:

http://package.mapr.com/releases/ecosystem-4.x/

http://package.mapr.com/releases/ecosystem-5.x/

 

UI Installer:

Index of /releases/installer 

 

Documentation

 

Have a Question?

Ask in Answers or comment below.

Have you noticed anything different in the Products and Services space? We hope you have! Last month we unveiled new product pages related to Open Source projects that are of interest to members in the MapR Community, such as:

 

- Apache Zeppelin

- Apache Apex

- ElasticSearch

- StreamSets

 

These spaces exist to provide an area to discuss and share expertise on products outside of the formally supported ecosystem. These pages are a result of Community interest, and we hope to drive more engagement from Community members who are committers and contributors on these projects.

 

What is the advantage?

Since we announced the new Products and Services pages earlier this year, we have noticed a dramatic change in community behavior: members now go directly to a product's page when searching for content. The Products and Services pages have become the center stage through which community content is filtered and more easily discovered.

 

Why did we choose these tools?

The initial tools selected represent the most popular projects in our Community, gauged by interest and engagement. We evaluated which tools our Community members and customers use or search for most, and what content already existed. We are also working to partner with the companies behind these projects, which will allow us to expand this offering.

 

How to contribute?

Products and Services leverages the content in other spaces. We encourage you to share your knowledge as a blog post, sample code, video, or other format in The Exchange, making sure to use the appropriate tag for each product so it can be automatically filtered and surfaced.

 

What is next?

Tell us what tools you are interested in by voting or commenting on What other product pages would you like to see in the community? 

I recently joined MapR and was thrown right into a new product launch. I have to admit it’s an exciting way to start a new gig, especially given what we’re launching. Below is a quick look at MapR-XD and what it means for you.

 

Storage and data management are in the midst of an exciting generational replatforming for the digital age. With data volumes growing exponentially and infrastructures remaining rigid, moving data and integrating analytics with operational processes is becoming increasingly difficult. The resulting data silos make it even harder to derive meaning and intelligence from valuable data, and the cost of processing and storing that data only grows with its volume. Today's storage and data management technologies simply were not designed to take advantage of distributed computing environments, cloud infrastructures, containers and virtualization, or IoT. What is needed is a new kind of data platform: one that supports intelligent applications that automate real-time operational decisions on the basis of deep analytical insights.

 

This is where MapR-XD comes in.

 

MapR-XD Cloud-Scale Data Store is the industry's only exabyte-scale data store for building intelligent applications with the MapR Converged Data Platform. The unique MapR Platform enables the creation of a data fabric with a global view of data and metadata, supporting a wide diversity of data types for both analytics and operations. MapR-XD is a high-scale, reliable, globally distributed data store that delivers an organization's data fabric for managing files, objects, and containers, and it supports the most stringent speed, scale, and reliability requirements within and across edge, on-premises, and cloud environments.

 

 

Let's look at some of the key features of MapR-XD (a brief usage sketch follows the list):

 

  • Global Namespace. Offers simple data access management by providing a consolidated view of files that reside in different physical locations.
  • Data Replication. MapR protects data through replication, ensuring no data loss or loss of access during failures.
  • Topologies. A unique MapR feature that ensures reliability and efficient data placement across the cluster.
  • Auto Tiering. Manages the lifecycle of data, allowing efficient use of cluster capacity and simpler data management.
  • ACEs. MapR offers a powerful and unique authorization model in the form of Access Control Expressions (ACEs).
  • Self-Healing. MapR self-heals from multiple simultaneous failures by reconstructing data from copies, keeping the cluster available at all times.
  • Open Interfaces. By supporting open protocol interfaces, a wide range of applications with diverse characteristics and requirements can be hosted on a single, flexible platform.
  • Instant Snapshots. MapR Snapshots are read-only, volume-level snapshots, useful for rollback from errors, hot backups, and real-time analysis.
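As a brief, hypothetical sketch of a few of these features in action (the cluster name, volume name, and paths are placeholders; check the MapR-XD documentation for full command options), a volume can be created, snapshotted, and then read by any POSIX application through the global namespace:

    # Hypothetical sketch: cluster name, volume name, and file path are
    # placeholders. Assumes a node with maprcli installed and the cluster
    # mounted under /mapr/<cluster-name> (the global namespace).
    import subprocess

    CLUSTER = "my.cluster.com"  # placeholder cluster name

    # Create a volume, then take an instant (read-only) snapshot of it.
    subprocess.run(["maprcli", "volume", "create",
                    "-name", "projects.analytics", "-path", "/projects/analytics"],
                   check=True)
    subprocess.run(["maprcli", "volume", "snapshot", "create",
                    "-volume", "projects.analytics", "-snapshotname", "daily-0701"],
                   check=True)

    # Any POSIX application can read the same data through the global namespace.
    with open(f"/mapr/{CLUSTER}/projects/analytics/README.txt") as f:
        print(f.read())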

 

Some examples of how customers are benefiting from MapR-XD include:

 

  • For financial services, speed in identifying potential fraudulent activity is critical for keeping clients safe from cyber criminals. MapR-XD enables companies to unify, manage, and act on data rapidly, ultimately resulting in critical advantages.
  • For data warehouse use cases, MapR is being used to drive consistent speed at scale, hosting multi-tenant applications, while maintaining the different tiers of data.
  • MapR-XD is also extensively used by enterprise organizations building a cloud platform, because of its scale, reliability, and ability to host different applications across different user groups.

 

These are just a few examples of how MapR-XD is a revolutionary data platform. If you are looking for globally distributed scale and reliability, functionality with any data type, and integrated analytics to operationalize data, talk to us about how MapR-XD can help.

 

 

RELATED

MapR-XD

WEBINAR: Cloud-scale data fabric - On-Demand after July 11, 2017 - Remote attendees welcomed 

MapR-XD | MapR Website

MapR-XD | Press Release

We are pleased to announce the MapR Distributed Deep Learning QSS, a data science-led product and services offering that enables the training of complex deep learning algorithms (e.g., deep neural networks, convolutional neural networks, recurrent neural networks) at scale.  Within a few weeks, this new Quick Start Solution will provide an environment for continuous learning, enable experimentation with deep learning libraries, and deliver a production framework for quickly operationalizing deep learning applications.

 

The new offering features access to distributed deep learning libraries (TensorFlow, Caffe, MXNet, etc.), a framework that intelligently switches storage and workflow between CPUs and GPUs, and the stability, scale, and performance of the MapR Converged Data Platform to form the basis for advanced, data-driven applications such as the following:

 

  1. Convolutional Neural Networks for images (see the brief sketch after this list)
     - Retail: in-store activity analysis of video to measure traffic
     - Satellite imagery: labeling terrain, classifying objects
     - Automotive: recognition of roadways and obstacles
     - Healthcare: diagnostic opportunities from x-rays, scans, etc.
     - Insurance: estimating claim severity based on photographs
  2. Recurrent Neural Networks for sequenced data
     - Customer satisfaction: transcription of voice data to text for NLP analysis
     - Social media: real-time translation of social and product forum posts
     - Photo captioning: searching archives of images for new insights
     - Finance: predicting behavior via time series analysis (also enhances recommendation systems)
  3. Deep Neural Networks for improving traditional algorithms
     - Finance: enhanced fraud detection through identification of more complex patterns
     - Manufacturing: enhanced identification of defects based on deeper anomaly detection
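As a brief, hypothetical sketch of the image case (item 1 above), a small convolutional network can be trained directly against image data stored on the MapR Converged Data Platform. The data path, class count, and model shape below are placeholders, not part of the MapR offering itself:

    # Hypothetical sketch: the data directory under /mapr/<cluster>, the
    # number of classes, and the model architecture are placeholders.
    import tensorflow as tf

    DATA_DIR = "/mapr/my.cluster.com/projects/images/train"  # placeholder MapR-FS mount path

    # Read labeled image folders straight from the POSIX-mounted cluster.
    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
        DATA_DIR, image_size=(224, 224), batch_size=32)

    # Place the model on a GPU when one is visible, otherwise fall back to CPU.
    device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"
    with tf.device(device):
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, activation="relu"),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10, activation="softmax"),  # placeholder: 10 classes
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(train_ds, epochs=5)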

 

KEY SOLUTION CAPABILITIES

The Deep Learning Quick Start Solution is a major step toward transforming your business with deep learning. By the end of the engagement, the customer can expect the following:

  • A MapR Converged Data Platform cluster installed and configured for efficient experimentation with deep learning libraries (such as TensorFlow on Kubernetes) and access to both CPUs and GPUs.
  • An in-depth collaboration between business stakeholders and a deep learning scientist to identify the tools and methods that will provide the optimal results for the business problem.
  • A complete model-building initiative, including experimentation with neural network parameters (number of nodes, learning rates, modeling layers, etc.) to achieve maximum performance gains.
  • Training on model implementation, interpreting reason codes, and applying model metrics to business goals; stakeholders are trained on the process as a whole to ensure a clear path forward.
  • A fully functional deep learning platform that will continue to fuel cutting-edge research and provide scalable access to the newest, most powerful algorithms as they become available.

 

Reference Architecture for Distributed Deep Learning on MapR


With this approach, MapR data scientists combine the MapR Converged Data Platform with distributed machine learning algorithms to deliver an enterprise-grade analytics capability that can be continually refined and extended to respond to new data sets, new algorithms, and new intelligent applications.

 

KEY BUSINESS BENEFITS INCLUDE:

  • A scalable deep learning platform that will continue to enable cutting-edge research opportunities long after the QSS has delivered initial results.
  • An enterprise-class data platform with virtually limitless scale that supports a rich choice of open source and commercial processing engines and analytical tools.
  • Extensive collaboration with key stakeholders to build a high-quality, customized image classification model on which to build intelligent applications.
  • A clear demonstration of value to business and technical stakeholders.
  • Continuous training and knowledge transfer during the engagement, covering tools, techniques, and the use case roadmap.

 

 

LEARN MORE

Visit the MapR Distributed Deep Learning QSS page.

Read Distributed Deep Learning on the MapR Converged Data Platform 

TensorFlow on MapR Tutorial
