Cloudera Impala – Closing the Near Real Time Gap Working with Big Data

Building data structures and loading data

On October 24, 2012, Cloudera announced the release of Cloudera Impala and the commercial support subscription service Cloudera Enterprise Real Time Query (RTQ). During the Hadoop World/STRATA Conference in NYC, I was invited to see a demonstration. Impala is a SQL-based real-time, ad hoc query engine built on top of HDFS or HBase. As I watched the demonstration unfold, I wondered whether one of the remaining technology gaps in the NoSQL arsenal had been closed. What gap, you ask? Near real time analytics on a NoSQL stack. Customers across the Cyber Security space face not only the familiar BIGDATA horsemen of the apocalypse (Volume, Velocity and Variety) but one more large challenge that has crept in: Time (V3T). The Near Real Time analysis capability that Cloudera Impala provides is essential in many high-value Cyber Security use cases: comparing current activity with observed historical norms, correlating and enriching many disparate data sources, and running automated threat detection algorithms.

When the demonstration concluded, the Cloudera representatives and I discussed performing an informal independent evaluation of Cloudera Impala against some of the common Real Time/Near Real Time use cases in Cyber Security. I agreed to step up, perform the evaluation and develop a demonstration platform for FedCyber 2012 (almost three weeks away, for inquiring minds). So let us set the field: a new BETA technology, NO prior exposure to the technology or its documentation, a vendor making promises, a large technology gap to address and three weeks to implement. It seemed straightforward; no pressure.

The day after the STRATA Conference, I returned to my office and provisioned four Virtual Machines to build the Impala demonstration. As a committer/contributor for SherpaSurfing, an open source Cyber Security solution, I have an abundance of data sets, enrichment sources, Hive data structures and services. Given the time available and the audience for FedCyber 2012, I decided to focus on Intrusion Detection and Netflow use cases. The base data sets for the demonstration were 20 million Netflow events and 8 million Intrusion Detection System events; the enrichment data covered Geographic, Blacklist, Whitelist and Protocol information. Each of the selected use cases is critical to the Perform Near-Real Time Network Analysis domain in Cyber Security. I named the demonstration system the Impala Mission Demonstration Platform (IMDP). The IMDP was implemented based on vendor recommendations with no tuning or optimization.

The IMDP effort gave me my first opportunity to work with Cloudera Manager. Although this post is focused on Cloudera Impala, I would be remiss not to mention Cloudera Manager. I have worked with Hadoop since 1.0 and built more than a few clusters over the years. I used the installation and configuration guides provided with Cloudera Impala and followed their recommendations, one of the first of which was to use Cloudera Manager. Using Cloudera Manager (CDH 4.1), I rolled out a four node cluster in two hours. I was able to discover the hosts, manage services and provision them in accordance with the IMDP deployment plan. The deployment plan consisted of:

  • node1 – hbase, hdfs, impala, mapreduce
  • node2 – hbase, hdfs, impala, mapreduce
  • node3 – hbase (region server, master), hdfs (namenode), impala (impalad, statestore), mapreduce (job tracker, tasktracker), hue, oozie and zookeeper
  • node4 – Application Tier, Cloudera Manager
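As a rough illustration, the plan above can be written down as a small lookup table so that questions such as "which nodes run Impala?" are easy to check. This is only a sketch: the dictionary shape, the role labels and the helper name `nodes_running` are assumptions made for this example and are not part of Cloudera Manager.

```python
# Role layout copied from the deployment plan above. The dict shape and the
# helper below are illustrative only; they are not a Cloudera Manager API.
DEPLOYMENT_PLAN = {
    "node1": ["hbase", "hdfs", "impala", "mapreduce"],
    "node2": ["hbase", "hdfs", "impala", "mapreduce"],
    "node3": ["hbase", "hdfs", "impala", "mapreduce", "hue", "oozie", "zookeeper"],
    "node4": ["application-tier", "cloudera-manager"],
}


def nodes_running(service, plan=DEPLOYMENT_PLAN):
    """Return the sorted node names that host the given service."""
    return sorted(node for node, services in plan.items() if service in services)


if __name__ == "__main__":
    print("impala:", nodes_running("impala"))
    print("hue:", nodes_running("hue"))
```

The three Impala nodes in this table match the three impalad instances described later in the post.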

Cloudera Manager saved at least two days of effort in deploying the cluster. It offers tight integration with the support portal, comprehensive help, and one place to work with all properties of the entire cluster and view space consumption metrics. Verdict on Cloudera Manager: a masterful, bold stroke by Cloudera; thumbs up.

With the cluster build-out complete, I shifted attention to deploying and configuring the Cloudera Impala service. Using Cloudera Manager, I deployed Impala on three nodes (three instances of impalad and one Impala state store) in a matter of minutes. I then completed the deployment and configuration of the Hive MetaStore. Keeping in mind that this is a BETA, the documentation was complete but fragmented on deployment and configuration (the Hive MetaStore portion). Verdict on Impala deployment and configuration: solid for a BETA (needs an example hive-site.xml, and the configuration guide needs better flow).

With all configuration and deployment complete, attention turned to building data structures and loading data. I took the Data Definition Language (DDL) scripts for ten data sources and enrichment sets, ported them over to Hive and tested them in less than four hours. It is worth mentioning that the main data sources for this demonstration, netflow and intrusion detection system events, are large flat tables. Cloudera Impala relies on Hive as its Extract, Transform, Load (ETL) engine. Using Hive, I defined all of the data structures in source files and sourced them from the Hive shell, which created a database (Sherpa) and its tables. Hive was then used to load data into the tables that had just been created. Creating data structures in Hive was simple as usual, and loading data sets was quick (20 million netflow events in 57 seconds). After logging into impala-shell, I issued a refresh of the MetaStore and was working with data. I verified the data load: all data loaded and no issues were revealed. One area of potential improvement would be more comprehensive messages on load failure. Defining the data structures and loading data using Hive was nothing new. Verdict: really good; easy to use, easy to load, but failed load messages need to improve.
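To make the define-then-load flow concrete, here is a minimal sketch that builds the kind of HiveQL statements involved. The column names, types, delimiter and HDFS path are assumptions for illustration; the post does not publish the actual SherpaSurfing schema, and only the database name (Sherpa) comes from the text. The last helper simply restates the reported load figure as a rate.

```python
# Hypothetical column layout for a flat netflow table. The post names the
# five-tuple fields plus bytes and packets but does not give the real schema.
NETFLOW_COLUMNS = [
    ("sip", "STRING"),
    ("sport", "INT"),
    ("dip", "STRING"),
    ("dport", "INT"),
    ("protocol", "INT"),
    ("byte_count", "BIGINT"),
    ("packet_count", "BIGINT"),
]


def hive_create_table(database, table, columns, delimiter=","):
    """Build a HiveQL CREATE TABLE statement for a delimited text table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE;"
    )


def hive_load_data(database, table, hdfs_path):
    """Build a HiveQL LOAD DATA statement for a file already in HDFS."""
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {database}.{table};"


def events_per_second(events, seconds):
    """Average load rate implied by an event count and elapsed seconds."""
    return events / seconds


if __name__ == "__main__":
    print("CREATE DATABASE IF NOT EXISTS sherpa;")
    print(hive_create_table("sherpa", "netflow", NETFLOW_COLUMNS))
    # The path below is a placeholder, not the one used in the demonstration.
    print(hive_load_data("sherpa", "netflow", "/data/netflow/flows.csv"))
    print(round(events_per_second(20_000_000, 57)), "events/second")
```

At the reported 20 million events in 57 seconds, the average load rate works out to roughly 350,000 events per second.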

Finally, we moved on to the most interesting stage: using Cloudera Impala in a series of Real Time Query (RTQ) scenarios that are common across the Cyber Security customer space. The real world scenarios selected come from the perform netflow analysis set of use cases. In each scenario, the exact same queries were executed on the same cluster using Hive and then Impala against the same data structures (database and tables). With Hive we traverse the batch processing stack, and with Impala we traverse the Real Time Query (RTQ) stack, performing the same series of analytics. In the first use case, I ran a five tuple (sip, sport, dip, dport, protocol) summary over a 20 million event set, summing bytes and packets and computing bytes per packet. The result sets were identical: Hive took 82 seconds, Impala 6 seconds. In the second use case, I summarized destination ports where the source port is 80. Again the result sets were identical: Hive 57 seconds, Impala 5 seconds. In the third use case, I correlated netflow with intrusion detection system events over several hours of data. The result sets were identical: Hive 40 seconds, Impala sub-second. Finally, for FedCyber 2012, I developed a Java-based situational awareness dashboard that connected to Cloudera Impala via ODBC and executed analytics (correlation of blacklists, Intrusion Detection and Netflow, plus statistical cubes) for ten hours with a refresh every five seconds, without failure or issue. The ODBC implementation also made it easy to export data to desktop tools and common BI tools, as advertised. The Impala shell is limited, but much of the demonstration work was done using result sets, so it was not an impediment. Verdict on developing with and using Cloudera Impala: this is as advertised; easy to use, easy to implement on, very fast, very flexible and more than capable of running real time analytics.
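For readers who want to see the shape of the first two queries, the sketch below runs comparable SQL against an in-memory SQLite table as a local stand-in. It is not the author's HiveQL and does not talk to Hive or Impala: the table layout, the column names `byte_count` and `packet_count`, the sample rows, and the reading of "bytes per packet" as total bytes over total packets are all assumptions, and the exact syntax accepted by Hive or the Impala beta may differ.

```python
import sqlite3

# Use case 1: five-tuple summary with summed bytes/packets and bytes per packet.
FIVE_TUPLE_SUMMARY = """
SELECT sip, sport, dip, dport, protocol,
       SUM(byte_count) AS total_bytes,
       SUM(packet_count) AS total_packets,
       SUM(byte_count) * 1.0 / SUM(packet_count) AS bytes_per_packet
FROM netflow
GROUP BY sip, sport, dip, dport, protocol
ORDER BY total_bytes DESC
"""

# Use case 2: destination ports seen where the source port is 80.
DPORT_SUMMARY_FOR_SPORT_80 = """
SELECT dport, COUNT(*) AS flows
FROM netflow
WHERE sport = 80
GROUP BY dport
ORDER BY flows DESC, dport
"""


def build_sample_db():
    """Create an in-memory table with a handful of made-up netflow rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE netflow (sip TEXT, sport INTEGER, dip TEXT, "
        "dport INTEGER, protocol INTEGER, byte_count INTEGER, "
        "packet_count INTEGER)"
    )
    rows = [
        ("10.0.0.1", 80, "10.0.0.9", 51000, 6, 1500, 3),
        ("10.0.0.1", 80, "10.0.0.9", 51000, 6, 500, 1),
        ("10.0.0.2", 80, "10.0.0.9", 51001, 6, 400, 2),
        ("10.0.0.3", 53, "10.0.0.9", 40000, 17, 120, 1),
    ]
    conn.executemany("INSERT INTO netflow VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


conn = build_sample_db()
five_tuple_rows = conn.execute(FIVE_TUPLE_SUMMARY).fetchall()
dport_rows = conn.execute(DPORT_SUMMARY_FOR_SPORT_80).fetchall()

if __name__ == "__main__":
    for row in five_tuple_rows:
        print(row)
    for row in dport_rows:
        print(row)
```

For scale, the timings reported above correspond to Impala finishing roughly 13.7 times faster than Hive on the first query (82 seconds versus 6) and 11.4 times faster on the second (57 seconds versus 5).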

In summation, I have worked for over a decade across the vast BIGDATA technology space, covering Legacy Relational Database, Data Warehouse and NoSQL technologies. Cloudera Impala proved more than capable of running near real time analytics and providing mission relevance to customers with a Near Real Time (NRT) requirement. Based on my initial review, Cloudera Impala appears to be a bold step in closing the gap of near real time analytics on a NoSQL stack. I did encounter some minor problems, but the few problems and limitations encountered in this demonstration are already documented in the published known issues document, so they are not repeated here; none were show stoppers.

The notes, lessons learned, data structures and configuration guide from the demonstration will be published on Github under SherpaSurfing in the coming days. These documents cover everything in detail and will enable developers to replicate the demonstration platform and get a jump start on Cloudera Impala. Finally, I would like to thank two contributors, Hanh Le and Robert Webb, as well as Six3 Systems, for helping me pull this off.

More Stories By Bob Gourley

Bob Gourley writes on enterprise IT. He is a founder of Crucial Point and publisher of CTOvision.com
