By Jnan Dash
July 11, 2011 04:25 PM EDT
The phrase “Big Data” is thrown around a lot these days. What exactly does it refer to? When I was part of IBM’s DB2 development team, the maximum size of a DB2 table was 64 Gigabytes (GB), and I wondered who on earth could use a database that large. Thirty years later, that number looks small. Now you can buy a 1 Terabyte external drive for less than $100.
Let us start with a level set on the units of storage. In multiples of 1000, we go from Byte to Kilobyte (KB), Megabyte (MB), Gigabyte (GB), Terabyte (TB), Petabyte (PB), Exabyte (EB), Zettabyte (ZB), and Yottabyte (YB). The last one, the YB, is 10 to the power of 24 bytes. A typed page is 2KB. The entire book collection at the US Library of Congress is 15TB. The amount of data processed in one hour at Google is 1PB. The total amount of information in existence is around 1.27ZB. These figures give some context to the numbers.
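To make the unit ladder concrete, here is a minimal Python sketch that converts between the decimal units and reproduces some of the figures quoted above. The names `UNITS` and `to_bytes` are hypothetical helpers introduced only for illustration, and the sketch assumes the decimal (multiples of 1000) convention used in this article rather than the binary (multiples of 1024) one.

```python
# Decimal storage units, each 1000 times the previous one.
# (Assumption: decimal convention, as in the text; not the 1024-based one.)
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a whole-number quantity in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)


# Figures taken from the article.
yottabyte = to_bytes(1, "YB")             # 10 to the power of 24 bytes
typed_page = to_bytes(2, "KB")            # one typed page
library_of_congress = to_bytes(15, "TB")  # book collection

# How many typed pages fit in the Library of Congress collection?
pages = library_of_congress // typed_page

# How many Terabytes make up one Petabyte?
tb_per_pb = to_bytes(1, "PB") // to_bytes(1, "TB")
```

By this arithmetic, the 15TB collection corresponds to 7.5 billion typed pages of 2KB each.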
When we say Big Data, we enter the petabyte space (1000 Terabytes). There is talk of a “personal petabyte” to store all your audio, video, and pictures. The cost has come down from $2M in 2002 to $2K in 2012, a real Moore’s law for disk storage technology. This is not the stuff for current commercial database products such as DB2, Oracle, or SQL Server. Such RDBMSs handle a maximum of 10 to 100 Terabytes. Anything bigger would cause serious performance nightmares. These large databases are mostly found in decision support and data warehousing applications. Walmart is known to run its main retail transaction data warehouse, at 100-plus terabytes, on a Teradata DBMS.
Most of the growth in data is in “files”, not in DBMSs. We now see huge volumes of data at social networking sites like Facebook. At the beginning of 2010, Facebook was handling more than 4TB per day (compressed). Now that it has grown to 750M users, that number is at least 50% higher. The new Zuck’s (Zuckerberg’s) law is “Shared content doubles every 24 months”. The question is how to deal with such volumes.
Google pioneered a programming model called MapReduce to process massive amounts of data in parallel across hundreds of thousands of commodity servers. A simple Google query probably touches 700 to 1000 servers to yield that half-second response time. An open source implementation of MapReduce was released under the Apache umbrella as Hadoop (created by Doug Cutting, formerly of Xerox PARC and Apple, now at Cloudera). Besides the MapReduce computational engine, Hadoop has a file store called HDFS. Hadoop is therefore a “flexible and available architecture for large scale computation and data processing on a network of commodity servers”. Cloudera (a new VC-funded company) is to Hadoop what Red Hat is to Linux.
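For readers who have not seen MapReduce before, the idea can be illustrated with the classic word-count example. What follows is a minimal single-process sketch in Python, not Hadoop’s actual API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real cluster would run the map and reduce steps on many servers in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = []
    # On a cluster, each document (or file split) would go to a different server.
    for document in documents:
        pairs.extend(map_phase(document))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big servers process data"])
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread across commodity servers with no coordination beyond the shuffle step.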
While Hadoop is becoming a de facto standard for big data, its pedigree is batch processing. For near-real-time analytics, better answers are needed. Yahoo, for example, has a real-time analytics project called S4. Several other innovations are happening in this area of real-time or near-real-time analytics. Visualization is another hot area for big data.
Big Data offers many opportunities for innovation in the next few years.