In an effort to streamline costs, increase efficiency, and get IT to focus on delivering real business value, a tremendous mind-shift is taking place. The battleground is the "build-your-own" approach versus pre-built, pre-tested, pre-integrated, built-for-purpose appliances using commodity hardware.
Big Data infrastructure is the next wave of advanced information infrastructure, focused on delivering competitive advantage and insight into pattern-based behaviours.
Virtually every hardware vendor in the market has an offering: Greenplum/EMC's (Pivotal) PivotalHD Hadoop Distribution with underpinning VMware/Isilon infrastructure, HP’s Haven and AppSystem ecosystem, IBM PureData, Teradata's Portfolio for Hadoop and indeed Oracle’s Big Data Appliance / Exalytics.
Many share the Cloudera distribution (Cloudera being home to Hadoop co-creator Doug Cutting, formerly of Yahoo) or go directly to the roots in Apache Hadoop. Others are implementing with the MapR or Hortonworks (the spinoff of Yahoo’s Hadoop team) distributions, which are highly performant.
Clearly credibility is important to customers, whether gained directly by using Apache Hadoop or inherited through distributions such as Cloudera. Cluster management tools are critical differentiators when examining operations at scale, and they typically incur licensing costs.
Significant players such as Facebook and Yahoo are developing Hadoop further and feeding their work back into core Apache Hadoop. This allows anyone using the Apache distribution to take advantage of those improvements.
Over the coming blogs I will take a quick peek at these Big Data Infrastructures, their approaches, key differentiators and integration with the larger information management architecture. The focus will be density, speed and smallest form factors for infrastructure.
Questioning "Accepted Wisdom"
Whilst it is nice to know that we can have clusters of thousands of nodes performing some mind-boggling Big Data processing, it is almost MORE interesting to know how this can be performed on much smaller infrastructures, at the same or better speed and with simpler operational management.
With that in mind, we should be asking ourselves if Big Data can be processed in other ways to take advantage of some very important technology drivers:
- Multi-core/Multi-GHz/Hyper-threaded processors
- Multi-terabyte (TB) system memory
- Non-volatile RAM (mainly PCIe SSD today, but possibly Memristor or even carbon nanotube based NanoRAM (NRAM)) as a replacement for spinning disk storage and volatile RAM
- Terabit Interconnects for system components or to external resources
- In-Memory/In-Database Big Data capabilities to span information silos vs. recreating RDBMS systems again.
With systems today such as the recently released Oracle SPARC T5-8 already having 128 cores and 1024 threads at 3.6GHz in an 8RU form factor, the compute side of the Big Data equation seems to be shaping up nicely: 16 cores/RU or 128 threads/RU.
That is still too small, as Hadoop clusters sport much greater aggregate processing power, at the cost of requiring more server nodes of course.
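As a quick sanity check on those density figures, here is a minimal Python sketch. The `Server` class and the `servers_needed` helper are hypothetical names introduced purely for illustration; only the T5-8 numbers (128 cores, 1024 threads, 8RU) come from the text above, and the 1000-core target is an arbitrary assumption, not a measured cluster size.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical record of a server's compute capacity and rack footprint."""
    name: str
    cores: int
    threads: int
    rack_units: int

    def cores_per_ru(self) -> float:
        # Compute density: cores per rack unit.
        return self.cores / self.rack_units

    def threads_per_ru(self) -> float:
        # Thread density: hardware threads per rack unit.
        return self.threads / self.rack_units


def servers_needed(server: Server, target_cores: int) -> int:
    """Number of whole servers required to reach a target core count."""
    return math.ceil(target_cores / server.cores)


# Figures quoted in the post for the Oracle SPARC T5-8.
t5_8 = Server("SPARC T5-8", cores=128, threads=1024, rack_units=8)

print(f"{t5_8.name}: {t5_8.cores_per_ru():.0f} cores/RU, "
      f"{t5_8.threads_per_ru():.0f} threads/RU")

# Illustrative assumption only: an arbitrary 1000-core target.
count = servers_needed(t5_8, target_cores=1000)
print(f"{count} servers, {count * t5_8.rack_units} RU for 1000 cores")
```

The same helper can be pointed at any other server specification to compare rack footprint for a given core target, which is the kind of density KPI this series is interested in.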
With InfiniBand running at 40Gbps (QDR) and faster, component interconnects are also shaping up nicely. Many vendors are now using InfiniBand to really get that performance up compared to Ethernet or even Fibre Channel for storage elements. Indeed some are skipping SAN/NAS altogether and just moving to server-based storage.
Many database vendors are actively using adapters to send Big Data jobs to the infrastructure and pull results back into the database. It will only be a matter of time before Big Data is sucked into the RDBMS itself, just as Java and Analytics have been.
However, the memory technology vector is the one that is absolutely critical. With the promise of non-volatile memory whose performance outstrips the fastest RAM chips out there, the very shape of infrastructure for Big Data will be radically different!
Not only is small beautiful - but it is essential to continuous consolidation paradigms allowing much greater densification of data and processing thereof!
Why is this important for the CIO, CTO & CFO?
Big Data technologies are certainly revolutionizing the way decisions are informed, providing lightning-fast insight into very dynamic landscapes.
Executive management should be aware that these technologies rely on scale-out paradigms. Left unchecked, this would effectively reverse the gains made through virtualization and workload-optimized systems in reducing the datacenter estate footprint!
CIOs/CFOs should be looking at investing in technology infrastructure that minimizes the IT footprint and yet still delivers revenue-generating capabilities derived through innovative IT. In some cases this will be off-premise clouds; in others, competitive advantage will be derived from on-premise resources that are tightly integrated into the central information systems of the organization.
Action ideas include:
- Funding the change to smaller/denser Big Data infrastructure, to ensure server sprawl (physical or virtual) is avoided
- Funding a continuous consolidation paradigm by structuring business cases for faster ROI
- Funding efficiency of operations. If Big Data can be done in-memory within a larger server with a single OS image vs. 1000 OS images, this will be the option that makes operations simpler and more efficient. There may be a cost premium at the server and integration layers.
- Advanced integration of the Big Data infrastructure and information management architectures to allow seamless capability across information silos (structured/unstructured)
- Cloud capabilities for scaling, preferably within the box/system using what Gartner calls fabric-based infrastructures, should be the norm rather than building your own environment.
Continuous workload consolidation needs to take center stage again - think thousands of Big Data workloads per server, not just thousands of servers to run a single workload! Think in-memory for workloads rather than in simple spinning disk terms. Remember that commodity hardware/OS is not the only competitive advantage - it is only the low-hanging fruit!
We'll take a closer look at Big Data infrastructure in the coming blogs, with a view to how Cloud fits in while still ensuring deep efficiency from an information management infrastructure perspective, using relevant KPIs.