|By Bill McColl||
|October 15, 2010 04:45 PM EDT||
Big data is creating a massive disruption for the IT industry. Faced with exponentially growing data volumes in every area of business and the web, companies around the world are looking beyond their current databases and data warehouses for new ways to handle this data deluge.
Taking a lead from Google, a number of organizations have been exploring the potential of MapReduce, and its open source clone Hadoop, for big data processing. The MapReduce/Hadoop approach is based around the idea that what's needed is not database processing with SQL queries, but rather dataflow computing with simple parallel programming primitives such as map and reduce.
As Google and others have shown, this kind of basic dataflow programming model can be implemented as a coarse-grain set of parallel tasks that can be run across hundreds or thousands of machines, to carry out large-scale batch processing on massive data sets.
Google themselves have been using MapReduce for batch processing for over six years, and others, such as Facebook, eBay and Yahoo have been using Hadoop for the same kind of batch processing for several years now. So today, parallel dataflow is firmly established as an alternative to databases and data warehouses for offline batch processing of big data. But now the game is changing again...
In recent months, Google has realized that the web is now entering a new era, the realtime era, and that batch processing systems such as MapReduce and Hadoop cannot deliver performance anywhere near the speed required for new realtime services such as Google Instant. Google noted that
- "MapReduce isn't suited to calculations that need to occur in near real-time"
- "You can't do anything with it that takes a relatively short amount of time, so we got rid of it"
Other industry leaders, such as Jeff Jonas, Chief Scientist for Analytics at IBM, have made similar remarks in recent weeks. In his recent video "Big Thoughts on Big Data", Jonas notes that with only batch processing tools to handle it, organizations grappling with a relentless avalanche of realtime data will get dumber over time rather than getting smarter.
- "The idea of waiting for a batch job to run doesn't cut it. Instead, how can an organization make sense of what it knows, as a transaction is happening, so that it can do something about it right then"
- "I'm not a big fan of batch processes... I've never seen a batch system grow up an become a realtime streaming system, but you can take a realtime streaming system and make it eat batches all day long"
- "I like Hadoop but it's meant for batch activities. That's not the kind of back-end you would use for realtime sense-making systems"
So coarse-grain dataflow architectures such as Hadoop are good for batch, but bad for realtime.
To power realtime big data apps we need a completely new type of fine-grain dataflow architecture. An architecture that can, for example, continuously analyze a stream of events at a rate of say one million events per second per server, and deliver results with a maximum latency of five seconds between data in and analytics out. At Cloudscale we set out to crack this major technical problem, and to build the world's first "realtime data warehouse". The linearly scalable Cloudscale parallel dataflow architecture not only delivers game-changing realtime performance on commodity hardware, but also, as Jeff Jonas notes above "can eat batches all day long" like a traditional MapReduce or Hadoop architecture. There isn't really an established name yet for such a system. I guess we could call it a "Redoop" architecture (Realtime Dataflow on Ordinary Processors, or Realtime Hadoop).
- WebRTC Summit at Cloud Expo Agenda Announced
- Google’s Enterprise Problem
- Building Video Calling with PubNub and WebRTC
- DataStax Announces New Startup Programme Offering Free Software, As Well As Free Training Courses For Cassandra Users And New Developer Tool
- Get Ready to Think Out (C)loud With Cloud Sherpas’ Upcoming Webinar Series
- Evaluation Report on Virtual Backup Software
- New PubNub App Template for WebRTC
- Series: Exchange 2013 and Lync 2013 Integration with AsteriskNOW PBX Pt. 1
- Strategic Enough to Matter, Code Halos and Mobile Apps
- GAMA : Quatre acteurs clefs, quatre stratégies différentes !
- Box and NSI Partnership Brings the Cloud to Businesses in the Middle East
- 7 Christmas Gifts For Your Business
- WebRTC Summit at Cloud Expo Agenda Announced
- OneLogin Raises $13M to Power Expansion
- Cloud Security Alliance Releases Cloud Controls Matrix, Version 3.0
- Survey Finds Large Enterprises Adopting WebRTC
- WebRTC Summit | WebRTC: Test then Disrupt
- WebRTC Summit Speaker Submissions Open
- WSO2 Expands Identity Management Capabilities Across Cloud, Mobile and Web Applications With the Launch of WSO2 Identity Server 4.5
- Twilio and LiveOps to Deliver WebRTC Deployments
- BMC Software to Exhibit at Cloud Expo Silicon Valley
- Oracle Demonstrates WebRTC Solution with CounterPath's Bria
- OpenStack for the Enterprise – Showcasing the OpenStack Ecosystem
- XIRSYS Launches WebRTC Hosting Service
- Where Are RIA Technologies Headed in 2008?
- The Top 250 Players in the Cloud Computing Ecosystem
- Dolphin Announces Open API With Over 50 Add-ons Including Dropbox and Wikipedia
- Personal Branding Checklist
- AJAXWorld 2006 West Power Panel with Google's Adam Bosworth
- Why Microsoft Loves Google's Android
- Google's OpenSocial: A Technical Overview and Critique
- Cloud Expo New York Call for Papers Now Open
- Wal-Mart To Sell $399 Ubuntu Linux-based Laptop with Google Operating System
- i-Technology Blog: Google Trends on Java, McNealy, AJAX, and SOA Give Pause For Thought
- i-Technology Blog: Is There Life Beyond Google?
- Android: Who Hates Google Over the Phone?