A data lake was initially described as a storage system that holds a very large amount of data in its original format until needed. The term was originally synonymous with Apache Hadoop, which enabled organizations to both store and process data on commodity hardware: storage in HDFS and compute in MapReduce and other managed services. Today some define the data lake simply as a place for storage, such as Amazon Web Services S3 buckets. I like to define a data lake as an affordable place where organizations can land massive amounts of data in its raw form for future extraction, analysis, and knowledge management. Streaming technologies also play a vital role in populating the data lake.
Landing data is not enough! Until you actually do something with that data, it is virtually worthless. In this article, we focus on five classes of use cases where a data lake can transform an organization: oil and gas, big government and smart city initiatives, life sciences, cybersecurity, and marketing and customer data platforms.
Oil and Gas
The oil and gas industry is one of the earliest adopters of data lakes in the cloud, IoT, and digital transformation. According to the World Economic Forum, this could unlock $1.6 trillion of value for the entire industry by 2025. The list of use cases includes optimized directional drilling, minimized unplanned downtime, lowered lease operating expenses, and improved safety and regulatory compliance. It is estimated that the average oil and gas company generates at minimum 1.5 terabytes of IoT data per day. Collecting and analyzing data in real time, and using data for historical modeling, are vital to exploration companies. Using data science and GPS, geologists are able to steer drill bits horizontally instead of vertically, increasing production 20+ times.
Not only can this industry benefit from production gains; it can also gain from predicting failures and performing proactive maintenance. Approximately 75% of all downtime incidents lasting more than six hours are caused by mechanical failure of drilling parts. Being able to predict failures and proactively repair wells is vital because revenue is lost whenever a well is down, especially since the average offshore rig can cost up to $500,000 per day to operate. Given the sheer volume of data and the economics of these use cases, it is no surprise that oil and gas was an early adopter of data lakes and analytics technologies.
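The predictive maintenance idea above can be sketched with a simple anomaly check: flag a sensor reading that deviates sharply from its recent history so a crew can inspect the part before it fails. The sensor values, window size, and threshold here are all illustrative assumptions, not a real drilling model.

```python
# Minimal sketch: flag sensor readings that deviate sharply from the
# trailing window -- a candidate trigger for proactive maintenance.
# All values and thresholds are hypothetical.
from statistics import mean, stdev

def flag_anomalies(readings, window=5, z_threshold=3.0):
    """Return indices of readings more than z_threshold standard
    deviations away from the mean of the preceding window."""
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(readings[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies

# Steady vibration levels, then a sudden spike at the end.
vibration = [1.0, 1.1, 0.9, 1.0, 1.2, 1.1, 0.9, 5.0]
print(flag_anomalies(vibration))  # [7] -- the spike is flagged
```

In practice this kind of rule would run as a streaming job over the IoT data landing in the lake, but the core logic is the same.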
Big Government – Smart Cities Initiatives
Populations across the world are growing rapidly and outpacing city resources. Data lakes and the analytics that help cities become more livable, workable, and sustainable have never been more important. Consortiums of nonprofits, governments, universities, and private companies are working together to build the cities of tomorrow. From connected vehicles to intelligent power grids, the places we live will look nothing like they do today.
Ever see those black hoses on the road and wonder what they are? They are pressure sensors that track vehicle patterns, speed, and other traffic data. They will one day be used to manage traffic signals, prevent congestion, divert traffic, and drive variable tolling programs. These smart tubes can differentiate between 19 different types of vehicles. They are also used by law enforcement to help manage speed and assist in pedestrian safety.
According to IDC, investment in smart city technology is expected to reach $135 billion by 2025. Smart cities will behave like living organisms that can load-balance traffic, steer law enforcement, enhance education systems, and optimize power grids, waterways, tolls, and much more. The volume of this data is massive: a single connected vehicle can send 25GB of data per hour to the cloud. Multiply that across the number of potentially connected vehicles and the numbers are staggering for that one endpoint type alone.
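To see how quickly the 25GB-per-hour figure compounds, here is a back-of-the-envelope calculation. The fleet size and driving hours are assumptions for illustration; only the per-vehicle rate comes from the text.

```python
# Data volume from one endpoint type: connected vehicles at 25 GB/hour.
# Fleet size and hours driven per day are illustrative assumptions.
GB_PER_VEHICLE_HOUR = 25

def fleet_data_tb_per_day(vehicles, hours_per_day=2):
    """Terabytes generated per day by a fleet (1 TB = 1000 GB)."""
    return vehicles * GB_PER_VEHICLE_HOUR * hours_per_day / 1000

# A city with one million connected vehicles, each driven two hours a day:
print(f"{fleet_data_tb_per_day(1_000_000):,.0f} TB/day")  # 50,000 TB/day
```

Fifty petabytes a day from one city's vehicles alone is exactly the kind of volume that makes affordable data lake storage a prerequisite rather than a nice-to-have.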
Life Sciences
Estimating the size of the human genome is a lot of fun. There are plenty of good estimates, but a reasonable general figure is around 715 megabytes, and that is before sequencing. Sequencing breaks the genome into a series of short 'reads' that let scientists study it piece by piece, and the redundancy involved increases the data to roughly 175-225 gigabytes per genome. With the current world population of 7.6 billion people, sequencing all of us would produce a staggering amount of data. And the genome is just one measurement of the human body, a very complex machine. With heart rate, blood pressure, enzymes, white blood cell counts, temperature, and countless other measures that change over time, life sciences is one of the biggest data sources for the data lake. It is my very bold prediction that through data analytics and IoT, human life expectancy could be extended by 20 years!
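Working the numbers from the paragraph above makes the scale concrete. Taking the midpoint of the 175-225 gigabyte range per sequenced genome:

```python
# Rough totals from the figures in the text: ~200 GB of sequencing reads
# per genome (midpoint of 175-225 GB) across 7.6 billion people.
GB_PER_SEQUENCED_GENOME = 200          # midpoint of the 175-225 GB range
WORLD_POPULATION = 7_600_000_000

total_gb = GB_PER_SEQUENCED_GENOME * WORLD_POPULATION
total_zb = total_gb / 1e12             # 1 zettabyte = 1e12 gigabytes
print(f"{total_zb:.2f} ZB")            # 1.52 ZB
```

Roughly one and a half zettabytes just to hold everyone's sequencing reads, before a single vital sign or lab result is recorded.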
Let’s review sepsis. Sepsis is a life-threatening condition that arises when the body’s overwhelming immune response to an infection damages its own tissues. Signs of sepsis include fever, confusion, rapid breathing, and a rapid heart rate. The risk of mortality from sepsis ranges from 30-80% depending upon severity. The best way to treat sepsis is to detect it early and administer antibiotics as soon as possible. Hospitals use alerts generated from electronic medical records to flag the condition and initiate treatment as early as possible. Researchers at Carnegie Mellon University’s Heinz College are applying machine learning algorithms to data in the data lake to predict sepsis more accurately. Around 270,000 people in the US die each year from sepsis, and it is a major driver of hospital costs, estimated at $27 billion annually.
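The alerts hospitals generate today are often simple rule-based screens over vitals. The sketch below is a classic SIRS-style rule, shown purely to illustrate the kind of signal a machine learning model would improve upon; it is not the CMU researchers' model, and the field names are assumptions.

```python
# Hypothetical illustration (not the CMU model): a SIRS-style screening
# rule over vitals pulled from electronic medical records in the lake.
# Thresholds follow the classic SIRS criteria; field names are assumed.
def sirs_flags(vitals):
    """Count SIRS criteria met; two or more commonly triggers an alert."""
    flags = 0
    if vitals["temp_c"] > 38.0 or vitals["temp_c"] < 36.0:
        flags += 1                      # abnormal temperature
    if vitals["heart_rate"] > 90:
        flags += 1                      # tachycardia
    if vitals["resp_rate"] > 20:
        flags += 1                      # rapid breathing
    if vitals["wbc_per_ul"] > 12_000 or vitals["wbc_per_ul"] < 4_000:
        flags += 1                      # abnormal white blood cell count
    return flags

patient = {"temp_c": 38.6, "heart_rate": 112,
           "resp_rate": 24, "wbc_per_ul": 14_500}
if sirs_flags(patient) >= 2:
    print("sepsis screening alert")     # all four criteria met here
```

Rules like this are sensitive but not specific, which is precisely why training models on richer data-lake history can cut false alarms while catching cases earlier.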
Cybersecurity
Every organization, private or public, has computing resources. Endpoints such as laptops, servers, mobile phones, cloud computing instances, and virtual systems are all vulnerable to constantly evolving attacks and threats. Threats like ‘cryptoware,’ a type of ransomware that infects and locks your devices and attempts to extort money, are very real as crime-as-a-service expands globally. Adding to this complexity are the potential billions of devices that IoT will bring. These devices are often insecure by design and therefore offer a virtual smorgasbord of hacking opportunities. Cybersecurity vulnerabilities create business and reputational risks, as data breaches can sever customer trust and wreak havoc on the top and bottom lines. Over the past several years we have seen organizations shift from passive security postures to pervasive, proactive security. This is a very big data problem, as threats come from both inside and outside an organization.
The General Data Protection Regulation (GDPR) is the first comprehensive replacement of the now 20-year-old European data protection legislation. It is intended to standardize expectations and protect personally identifiable information on employees, clients, and other data subjects. The scope of GDPR is simple: any organization that holds data about EU citizens, regardless of where the organization is located, is covered. What does this mean in practice? Organizations must notify data protection authorities of any personal data breach that presents a risk to data subjects within 72 hours of becoming aware of it, and must communicate high-risk breaches to the affected data subjects themselves without undue delay. This means cybersecurity data collection and analysis has to become proactive and always on. This, of course, is an enormous data challenge, prime for the data lake, data collection, streaming analytics, event notification, and artificial intelligence.
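An always-on breach pipeline ultimately has to track one hard number: the 72-hour clock. A minimal sketch of that deadline calculation, with a hypothetical detection timestamp:

```python
# Minimal sketch of the GDPR 72-hour clock: given when a breach was
# detected, compute the deadline for notifying the authority.
# The detection timestamp is a hypothetical example.
from datetime import datetime, timedelta

NOTIFICATION_WINDOW = timedelta(hours=72)

def notification_deadline(detected_at):
    """Latest time the data protection authority must be notified."""
    return detected_at + NOTIFICATION_WINDOW

detected = datetime(2018, 5, 25, 9, 30)   # breach detected
print(notification_deadline(detected))    # 2018-05-28 09:30:00
```

The hard part, of course, is not the arithmetic but detecting the breach early enough that the clock starts when the incident does, which is what streaming analytics over the data lake is for.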
Marketing and Customer Data Platforms
A customer data platform (CDP) is a marketing-focused data management platform. It creates a unified customer database that pulls data from multiple behavioral and transactional sources such as customer profile data, web and mobile behaviors, brick-and-mortar systems, loyalty systems, and service center data. A consistent identifier links all the data together and supports marketing segmentation and exploration for personalized marketing efforts. Most organizations don’t realize that much of this data is stored in the data lake. Business intelligence tools like Arcadia Enterprise allow marketing teams to leverage the data in the data lake to blend sources together and perform segmentation analytics that can then be shared with cloud-based marketing automation providers such as Adobe, Marketo, Salesforce, HubSpot, and SharpSpring. A good CDP solution closes the loop by measuring campaign effectiveness.
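The "consistent identifier" idea at the heart of a CDP can be sketched in a few lines: merge records from several sources into one profile keyed by a shared customer ID. The sources and field names below are hypothetical.

```python
# Sketch of the CDP pattern: stitch records from several sources into a
# unified profile keyed by a shared customer identifier.
# Source data and field names are hypothetical.
from collections import defaultdict

web_events = [{"customer_id": "c1", "pages_viewed": 12}]
loyalty    = [{"customer_id": "c1", "tier": "gold"}]
service    = [{"customer_id": "c1", "open_tickets": 0}]

def unify(*sources):
    """Merge records sharing a customer_id into one profile per customer."""
    profiles = defaultdict(dict)
    for source in sources:
        for record in source:
            profiles[record["customer_id"]].update(record)
    return dict(profiles)

print(unify(web_events, loyalty, service)["c1"])
# {'customer_id': 'c1', 'pages_viewed': 12, 'tier': 'gold', 'open_tickets': 0}
```

A production CDP does this at scale with identity resolution across fuzzy keys (email, device ID, loyalty number), but the join-on-a-common-identifier shape is the same.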
The five classes of use cases described above all have one thing in common: massive amounts of valuable data waiting to be exploited for business or human benefits. Arcadia Enterprise allows your users to access and analyze the entire data lake with a data-native visual platform. It has a web-based architecture for big data visualization and was built for the business user to perform self-service analytics. Arcadia Data is the first visualization engine for Confluent’s KSQL platform and has brought real-time and event analytics within the reach of anyone. Arcadia Data builds tools that enable anyone to find value in the data lake.