Cloud Expo: Article

Lessons Learned from Real-World Big Data Implementations

The value of Big Data is in the insights that the data can provide

In the past few weeks I visited several Cloud and Big Data conferences that gave me a lot of insight. Some people consider only the technology side of Big Data, such as Hadoop or Cassandra. The real driver, however, is a different one. Business analysts have discovered Big Data technologies as a way to leverage large volumes of existing data and to ask questions about customer behavior and all sorts of relationships in order to drive business strategy. In doing so, they are pushing their IT departments to run ever bigger Hadoop environments and ever faster real-time systems.

What's interesting from a technical perspective is that ad-hoc analytics on existing data is allowed to take some time. However, ad-hoc implies that people are waiting for an answer, which means we are talking about minutes, not hours. Another interesting insight is that Hadoop environments are never static or standalone. Most companies take in new data continuously via technologies like Flume. This means Hadoop MapReduce jobs need to keep up with the data flow, either through additional hardware or through optimization.

Big Data has many drivers, but the two most important are analytics and the technical need for speed. Let's look at what follows from them and at the resulting takeaways.

The Value Is in the Insight Not the Volume
The value of Big Data is in the insights that the data can provide, not the sheer volume of it. The reason that more and more companies are keeping all of their log and transaction data is that they want to gain those insights. The sheer size of the data is rather an obstacle to this goal and has been for a long time. With Big Data technologies this value can be harnessed.

Don't Forget That Data Analysts Are People Too
Ad-hoc analytics doesn't have to be instant, but it must not take hours either. It was interesting to see that time to result for ad-hoc analytics is considered important. People run those queries, and people don't like to wait for hours. Even more important, business analytics is often an iterative process: ask a question, check the answer, refine or change the question. MapReduce jobs that run for hours make this process impractical.

New Data Is Coming in All the Time
Big Data environments are constantly fed new data. This is not really big news, but I was still surprised by how often this fact was repeated. The constant data growth means that ad-hoc queries either get slower over time or need to work on samples. To remedy this, companies are writing MapReduce jobs that scrub and categorize the data. These jobs strip out everything unimportant and write cleansed, streamlined, easy-to-access data into new files. Instead of executing analytics against raw files, the analyst works on a cleansed data set. The implications are that scrubbing jobs need to be maintained all the time (because the data input changes over time) and that they need to keep up with the velocity of the input. MapReduce is not allowed to run for hours; it needs to be quick and iterative.
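To make the scrubbing idea concrete, here is a minimal sketch in plain Python rather than in an actual Hadoop job. The log format, the `KEEP_ACTIONS` filter, and the function names `scrub` and `categorize` are assumptions made up for illustration; a real job would read from and write to files in the cluster.

```python
from collections import defaultdict

# Hypothetical raw log format: "timestamp level user_id action extra..."
RAW_LINES = [
    "2013-01-05T10:00:01 INFO u42 purchase item=17 debug=xyz",
    "2013-01-05T10:00:02 DEBUG u42 heartbeat",
    "malformed line",
    "2013-01-05T10:00:03 INFO u7 view item=17",
    "2013-01-05T10:00:04 INFO u42 view item=3",
]

# Assumption: only these actions matter to the analysts.
KEEP_ACTIONS = {"purchase", "view"}


def scrub(line):
    """Map step: return a cleansed (user_id, action) record, or None to drop the line."""
    parts = line.split()
    if len(parts) < 4:
        return None  # malformed input is stripped out
    _timestamp, level, user_id, action = parts[:4]
    if level != "INFO" or action not in KEEP_ACTIONS:
        return None  # noise such as debug output or heartbeats
    return user_id, action


def categorize(records):
    """Reduce step: count actions per user."""
    counts = defaultdict(lambda: defaultdict(int))
    for user_id, action in records:
        counts[user_id][action] += 1
    return {user: dict(actions) for user, actions in counts.items()}


cleansed = [record for record in map(scrub, RAW_LINES) if record is not None]
summary = categorize(cleansed)
```

The analyst then queries `summary` (or the cleansed records) instead of the raw lines. Note that `scrub` encodes knowledge about the input format, which is why such jobs need maintenance whenever the input changes.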

Big Data Is Not Cheap
While it sounds obvious, it's something the vendors don't talk about unless specifically asked. Hadoop requires a lot of hardware and a lot of expertise, and the expertise in particular is still hard to come by. While hardware might be cheap (you don't need expensive boxes for Hadoop), the bigger the environment, the higher the operational costs. That operational cost is the reason some Hadoop vendors exist on services alone and also why customers are demanding better monitoring and management solutions.

Data Must Be Accessible at Low Latencies to Provide Value
One very interesting fact is that most early adopters that use Hadoop for analytics use it for ad-hoc analytics and not as a traditional warehouse. They use MapReduce to do the heavy lifting that is usually reserved for ETL jobs and put the resulting dimensions in existing data warehouses or into a NoSQL solution like HBase, Cassandra or MongoDB. These solutions provide low latency access semantics and are then integrated in the transactional application world, e.g. to provide recommendations to the end users.

This does not absolve them from optimizing their Hadoop environment where they can, but it gives them the much-needed real-time access that Hadoop so far does not provide. It also adds complexity that needs to be maintained and monitored.

NoSQL Solutions Need Management and Monitoring as Well
NoSQL solutions are most often used to provide low-latency databases with failover and horizontal scaling characteristics. As expected, practitioners quickly run into new issues such as data distribution and wrong access patterns. Most NoSQL solutions lack sophisticated monitoring or performance analysis tools and require experts instead. Fortunately, several companies are working on providing those tools, and some APM vendors are working hard to support NoSQL databases in the same way as traditional databases. This is emphasized by another interesting finding: with fast and scalable data storage, the application itself quickly becomes the response time and scaling bottleneck.

Applications Using NoSQL Technologies Are More Complex
Most NoSQL solutions give up more complex logic such as joins in order to achieve horizontally scalable data distribution. That logic moves to the application, which arguably is where it should be anyway. NoSQL solutions require data to be stored in a way that is optimized for query access, so de-normalization is key. The flip side is that data is stored multiple times and must be kept in sync on updates, which makes the storage logic more complex again. More application logic usually means less performance.
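The trade-off can be sketched with a small in-memory stand-in for a key-value store. The class `DenormalizedStore` and its methods are hypothetical and are not the API of any particular NoSQL product; the point is only to show where the extra logic ends up.

```python
class DenormalizedStore:
    """In-memory stand-in for a key-value NoSQL store (hypothetical, for illustration)."""

    def __init__(self):
        self.users = {}           # user_id -> {"name": ...}
        self.orders_by_user = {}  # user_id -> list of orders, each with a copy of the user name

    def add_user(self, user_id, name):
        self.users[user_id] = {"name": name}
        self.orders_by_user.setdefault(user_id, [])

    def add_order(self, user_id, item):
        # De-normalization: copy the user name into the order at write time,
        # so that reading orders never needs a join.
        name = self.users[user_id]["name"]
        self.orders_by_user[user_id].append({"item": item, "user_name": name})

    def rename_user(self, user_id, new_name):
        # The flip side: the application has to update every stored copy.
        self.users[user_id]["name"] = new_name
        for order in self.orders_by_user[user_id]:
            order["user_name"] = new_name

    def orders_for(self, user_id):
        # Query-optimized read: a single key lookup, no join.
        return self.orders_by_user[user_id]


store = DenormalizedStore()
store.add_user("u1", "Ann")
store.add_order("u1", "book")
store.rename_user("u1", "Anna")
orders = store.orders_for("u1")
```

Reads are a single lookup, but `rename_user` now touches every order of that user. In a distributed store those writes may land on several nodes, which is where the added complexity and cost show up.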

My conclusion as a performance engineer is relatively clear: Big Data requires performance management and monitoring tools to fulfill its promise in a cost-effective and timely manner. Here are some suggestions on what to think about when you start a Big Data project.

  1. Large Hadoop environments are hard to manage and operate. Without automation of deployment, operations, monitoring and root cause analysis, they quickly become unmanageable. Make sure to have a monitoring solution in place that proactively informs you of any infrastructure or software issues that would affect your operation. It needs to give you an easy way to pinpoint the root cause.
  2. The easiest way to identify new performance issues is to detect and analyze change. Adopt a life cycle and 24/7 production APM approach. It will enable you to notice changes in data and compute distribution over time. In addition, a life cycle approach will allow you to immediately pinpoint any negative changes introduced by a new software release.
  3. Don't just throw more and more hardware at the problem. While you can use cheaper hardware for Hadoop, it still costs money. More than that, you have to consider the operational drag: every node you add makes traditional log-based analysis more complicated. Instead, ensure that you have an APM solution in place that lets you understand and optimize MapReduce jobs at their core and reduce both the time and the resources it takes to run them.
  4. Your Hadoop cluster is no island; it will always be connected in some form to a real-time or at least transactional system. Make sure that you have a monitoring solution in place that can support both.
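The "detect and analyze change" suggestion can be illustrated with a very small sketch: compare job run times before and after a release and flag jobs that got noticeably slower. The job names, the durations, and the 20 percent threshold are invented for illustration; a real APM solution collects far more than run times.

```python
from statistics import median


def find_regressions(baseline, current, threshold=1.2):
    """Return jobs whose median duration grew by more than `threshold` times.

    `baseline` and `current` map a job name to a list of run durations in seconds.
    """
    regressions = {}
    for job, runs in current.items():
        if job not in baseline:
            continue  # new job, nothing to compare against
        before = median(baseline[job])
        after = median(runs)
        if before > 0 and after / before > threshold:
            regressions[job] = round(after / before, 2)
    return regressions


# Hypothetical durations (seconds) before and after a software release.
baseline = {"scrub_logs": [300, 310, 290], "build_dimensions": [120, 125, 118]}
current = {"scrub_logs": [450, 460, 440], "build_dimensions": [121, 119, 124]}

slow = find_regressions(baseline, current)
```

Here `scrub_logs` is flagged because its median run time grew by half, while `build_dimensions` stays within the threshold. The same comparison could be applied to data volume or to the distribution of work across nodes.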

NoSQL applications tend to have more complex logic. The performance and scalability of the store depend on correct data access and data distribution. A good monitoring solution allows you to monitor and optimize that additional complexity with ease; it also enables you to understand how your application accesses the data and how that access is distributed across your NoSQL cluster in your production system. The best way to ensure a scalable and fast NoSQL store is to ensure optimal distribution and access patterns.
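As a rough illustration of what "access distribution" means, the following sketch assigns keys to nodes by hashing and counts how many accesses each node receives. Simple hash-modulo placement is an assumption chosen for brevity (real stores use their own partitioning schemes), and the key names and node count are made up.

```python
import hashlib
from collections import Counter


def node_for(key, node_count):
    """Assign a key to a node by hashing (simplified stand-in for a real partitioner)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def access_load(accessed_keys, node_count):
    """Count accesses per node, including nodes that receive none."""
    load = Counter({node: 0 for node in range(node_count)})
    for key in accessed_keys:
        load[node_for(key, node_count)] += 1
    return load


def skew(load):
    """Ratio of the busiest node's load to the average load (1.0 means perfectly even)."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


# Hypothetical access log: one hot key dominates the traffic.
accesses = ["user:1"] * 90 + ["user:%d" % i for i in range(2, 12)]
load = access_load(accesses, node_count=4)
hot_spot_factor = skew(load)
```

With one key receiving 90 of 100 accesses, a single node carries most of the load no matter how evenly the keys themselves are spread. Adding nodes does not help in that case; changing the access pattern or the key design does.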

Conclusion
Big Data is still very much an emerging technology and its promises are huge. But in order to deliver on those promises, it must be cost- and time-effective for those who harness its value: the business, and not just the technology experts.

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes in application performance management in large-scale production environments, with a special focus on virtualized and cloud environments. His current focus is how to effectively leverage Big Data solutions and how these technologies impact and change the application landscape.
