Databricks

Software : Cloud Computing : Data & Analytics


San Francisco, California, United States

VC-H

With origins in academia and the open source community, Databricks was founded in 2013 by the original creators of Apache Spark™, Delta Lake and MLflow. As the world’s first and only lakehouse platform in the cloud, Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI.

Assembly Line

Solution Accelerator: LLMs for Manufacturing

📅 Date:

✍️ Authors: Will Block, Ramdas Murali, Nicole Lu, Bala Amavasai

🔖 Topics: Generative AI, Large Language Model

🏢 Organizations: Databricks


In this solution accelerator, we focus on item (3) above, which is the use case on augmenting field service engineers with a knowledge base in the form of an interactive context-aware Q/A session. The challenge that manufacturers face is how to build and incorporate data from proprietary documents into LLMs. Training LLMs from scratch is a very costly exercise, costing hundreds of thousands if not millions of dollars.

Instead, enterprises can tap into pre-trained foundation LLMs (like MPT-7B and MPT-30B from MosaicML) and augment and fine-tune these models with their proprietary data. This brings costs down to tens, if not hundreds, of dollars, effectively a 10,000x cost saving.
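One common way to augment a pre-trained model with proprietary documents is to retrieve the most relevant passage and place it in the prompt. The sketch below is a minimal illustration of that idea, not the accelerator's actual implementation: it assumes a plain bag-of-words cosine similarity instead of learned embeddings, and the manual snippets, function names and prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(doc))), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(question, documents, top_k=1):
    """Assemble a context-aware prompt to send to a pre-trained LLM."""
    context = "\n".join(retrieve(question, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical excerpts from proprietary service manuals.
MANUALS = [
    "Pump P-100: if the seal leaks, replace the gasket and torque bolts to 40 Nm.",
    "Conveyor C-7: belt slipping is usually fixed by adjusting the tension roller.",
    "Compressor K-2: high discharge temperature indicates a clogged intake filter.",
]

QUESTION = "How do I fix a leaking seal on pump P-100?"
PROMPT = build_prompt(QUESTION, MANUALS)
print(PROMPT)
```

The resulting prompt would then be passed to whichever foundation model is in use; that call is left out because it depends on the serving setup.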

Read more at Databricks Blog

Chevron Phillips Chemical tackles generative AI with Databricks

Databricks Raises Series I Investment at $43B Valuation

📅 Date:

🔖 Topics: Funding Event

🏢 Organizations: Databricks, T Rowe Price, NVIDIA


Databricks, the Data and AI company, today announced its Series I funding, raising over $500 million. This funding values the company at $43 billion and establishes the price per share at $73.50. The series is led by funds and accounts advised by T. Rowe Price Associates, Inc., which is joined by other existing investors, including Andreessen Horowitz, Baillie Gifford, ClearBridge Investments, funds and accounts managed by Counterpoint Global (Morgan Stanley), Fidelity Management & Research Company, Franklin Templeton, GIC, Octahedron Capital and Tiger Global along with new investors Capital One Ventures, Ontario Teachers’ Pension Plan and NVIDIA.

The Databricks Lakehouse unifies data, analytics and AI on a single platform so that customers can govern, manage and derive insights from enterprise data and build their own generative AI solutions faster. “Enterprise data is a goldmine for generative AI,” said Jensen Huang, founder and CEO of NVIDIA. “Databricks is doing incredible work with NVIDIA technology to accelerate data processing and generative AI models.”

Read more at PR Newswire

Bringing Scalable AI to the Edge with Databricks and Azure DevOps

📅 Date:

✍️ Authors: Andres Urrutia, Howard Wu, Nicole Lu, Bala Amavasai

🔖 Topics: Cloud-to-Edge Deployment, Machine Learning, Cloud Computing, Edge computing

🏢 Organizations: Databricks, Microsoft


The ML-optimized runtime in Databricks contains popular ML frameworks such as PyTorch, TensorFlow, and scikit-learn. In this solution accelerator, we will build a basic Random Forest ML model in Databricks that will later be deployed to edge devices to execute inferences directly on the manufacturing shop floor. The focus will be the deployment of an ML model built on Databricks to edge devices.

Read more at Databricks Blog

📊 Accelerating Innovation at JetBlue Using Databricks

📅 Date:

✍️ Authors: Sai Ravuru, Yared Gudeta

🔖 Topics: Data Architecture

🏭 Vertical: Aerospace

🏢 Organizations: JetBlue, Databricks, Microsoft


The role of data, and in particular analytics, AI and ML, is key for airlines to provide a seamless experience for customers while maintaining efficient operations for optimum business goals. For a single flight, for example, from New York to London, hundreds of decisions have to be made based on factors encompassing customers, flight crews, aircraft sensors, live weather and live air traffic control (ATC) data. A large disruption such as a brutal winter storm can impact thousands of flights across the U.S. Therefore it is vital for airlines to depend on real-time data and AI & ML to make proactive real-time decisions.

JetBlue has sped AI and ML deployments across a wide range of use cases spanning four lines of business, each with its own AI and ML team. The following are the fundamental functions of the business lines:

  • Commercial Data Science (CDS) - Revenue growth
  • Operations Data Science (ODS) - Cost reduction
  • AI & ML engineering - Go-to-market product deployment optimization
  • Business Intelligence - Reporting enterprise scaling and support

Each business line supports multiple strategic products that are prioritized regularly by JetBlue leadership to establish KPIs that lead to effective strategic outcomes.

Read more at Databricks Blog

A Data Architecture to assist Geologists in Real-Time Operations

📅 Date:

✍️ Author: Nicola Lamonaca

🔖 Topics: Data Architecture

🏭 Vertical: Petroleum and Coal

🏢 Organizations: Eni, Databricks


Data plays a crucial role in making exploration and drilling operations for Eni a success all over the world. Our geologists use real-time well data collected by sensors installed on drilling pipes to keep track and to build predictive models of key properties during the drilling process.

Data is delivered by a custom dispatcher component designed to connect to a WITSML Server on all oil rigs and send time-indexed and/or depth-indexed data to any supported applications. In our case, data is delivered to Azure ADLS Gen2 in the format of WITSML files, each accompanied by a JSON file for additional custom metadata.

The visualizations generated from this data platform are used both on the oil rigs and in HQ, with operators exploring the curves enriched by the ML models as soon as they’re generated on a web application made in-house, which shows in real time how the drilling is progressing. Additionally, it is possible to explore historic data via the same application.

Read more at Medium

Databricks Announces Lakehouse for Manufacturing, Empowering the World's Leading Manufacturers to Realize the Full Value of Their Data

📅 Date:

🔖 Topics: Cloud Computing

🏢 Organizations: Databricks, DuPont, Honeywell, Rolls-Royce, Shell, Tata Steel


Databricks, the lakehouse company, today announced the Databricks Lakehouse for Manufacturing, the first open, enterprise-scale lakehouse platform tailored to manufacturers that unifies data and AI and delivers record-breaking performance for any analytics use case. The sheer volume of tools, systems and architectures required to run a modern manufacturing environment makes secure data sharing and collaboration a challenge at scale, with over 70 percent of data projects stalling at the proof of concept (PoC) stage. Available today, Databricks’ Lakehouse for Manufacturing breaks down these silos and is uniquely designed for manufacturers to access all of their data and make decisions in real-time. Databricks’ Lakehouse for Manufacturing has been adopted by industry-leading organizations like DuPont, Honeywell, Rolls-Royce, Shell and Tata Steel.

The Lakehouse for Manufacturing includes access to packaged use case accelerators that are designed to jumpstart the analytics process and offer a blueprint to help organizations tackle critical, high-value industry challenges.

Read more at PR Newswire

A Deeper Look Into How SAP Datasphere Enables a Business Data Fabric

📅 Date:

✍️ Author: Juergen Mueller

🔖 Topics: Partnership, Data Architecture

🏢 Organizations: SAP, Databricks, Collibra, Confluent, DataRobot


SAP announced the SAP Datasphere solution, the next generation of its data management portfolio, which gives customers easy access to business-ready data across the data landscape. SAP also introduced strategic partnerships with industry-leading data and AI companies – Collibra NV, Confluent Inc., Databricks Inc. and DataRobot Inc. – to enrich SAP Datasphere and allow organizations to create a unified data architecture that securely combines SAP software data and non-SAP data.

SAP Datasphere, and its open data ecosystem, is the technology foundation that enables a business data fabric. This is a data management architecture that simplifies the delivery of an integrated, semantically rich data layer over underlying data landscapes to provide seamless and scalable access to data without duplication. It’s not a rip-and-replace model, but is intended to connect, rather than solely move, data using data and metadata. A business data fabric equips any organization to deliver meaningful data to every data consumer, with business context and logic intact. As organizations require accurate data that is quickly available and described in business-friendly terms, this approach enables data professionals to carry the clarity that business semantics provide into every use case.

Read more at SAP News

Rolls-Royce Civil Aerospace keeps its Engines Running on Databricks Lakehouse

How Corning Built End-to-end ML on Databricks Lakehouse Platform

📅 Date:

✍️ Author: Denis Kamotsky

🔖 Topics: MLOps, Quality Assurance, Data Architecture, Cloud-to-Edge Deployment

🏢 Organizations: Corning, Databricks, AWS


Specifically for quality inspection, we take high-resolution images to look for irregularities in the cells, which can be predictive of leaks and defective parts. The challenge, however, is the prevalence of false positives due to the debris in the manufacturing environment showing up in pictures.

To address this, we manually brush and blow the filters before imaging. We discovered that by notifying operators of which specific parts to clean, we could significantly reduce the total time required for the process, and machine learning came in handy. We used ML to predict whether a filter is clean or dirty based on low-resolution images taken while the operator is setting up the filter inside the imaging device. Based on the prediction, the operator would get the signal to clean the part or not, thus reducing false positives on the final high-res images, helping us move faster through the production process and providing high-quality filters.

Read more at Databricks Blog

Maersk embraces edge computing to revolutionize supply chain

📅 Date:

✍️ Author: Paula Rooney

🔖 Topics: IIoT, 5G

🏢 Organizations: Maersk, Microsoft, Databricks


Gavin Laybourne, global CIO of Maersk’s APM Terminals business, is embracing cutting-edge technologies to accelerate and fortify the global supply chain, working with technology giants to implement edge computing, private 5G networks, and thousands of IoT devices at its terminals to elevate the efficiency, quality, and visibility of the container ships Maersk uses to transport cargo across the oceans.

“Two to three years ago, we put everything on the cloud, but what we’re doing now is different,” Laybourne says. “The cloud, for me, is not the North Star. We must have the edge. We need real-time instruction sets for machines [container handling equipment at container terminals in ports] and then we’ll use cloud technologies where the data is not time-sensitive.”

Laybourne’s IT team is working with Microsoft to move cloud data to the edge, where containers are removed from ships by automated cranes and transferred to predefined locations in the port. To date, Laybourne and his team have migrated about 40% of APM Terminals’ cloud data to the edge, with a target to hit 80% by the end of 2023 at all operated terminals. Maersk has also been working with AI pioneer Databricks to develop algorithms to make its IoT devices and automated processes smarter. The company’s data scientists have built machine learning models in-house to improve safety and identify cargo. Data scientists will some day up the ante with advanced models to make all processes autonomous.

Read more at CIO

Solution Accelerator: Multi-factory Overall Equipment Effectiveness (OEE) and KPI Monitoring

📅 Date:

✍️ Authors: Jeffery Annor, Tarik Boukherissa, Bala Amavasai

🔖 Topics: Manufacturing Analytics

🏢 Organizations: Databricks


The Databricks Lakehouse provides an end-to-end data engineering, serving, ETL, and machine learning platform that enables organizations to accelerate their analytics workloads by automating the complexity of building and maintaining analytics pipelines through open architecture and formats. This facilitates connecting high-velocity Industrial IoT data, ingested using standard protocols like MQTT, Kafka, Event Hubs, or Kinesis, to external datasets, like ERP systems, allowing manufacturers to converge their IT/OT data infrastructure for advanced analytics.

Using a Delta Live Tables pipeline, we leverage the medallion architecture to ingest data from multiple sensors in a semi-structured format (JSON) into our bronze layer, where data is replicated in its natural format. The silver layer transformations include parsing the key fields from sensor data that need to be extracted and structured for subsequent analysis, and ingesting the preprocessed workforce data from ERP systems needed to complete the analysis. Finally, the gold layer aggregates sensor data using structured streaming stateful aggregations, calculates OT metrics such as OEE and TA (technical availability), and combines the aggregated metrics with workforce data based on shifts, allowing for IT-OT convergence.
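The bronze, silver and gold flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: it runs as a small batch rather than as a Delta Live Tables streaming pipeline, the field names, ideal run rate and workforce table are hypothetical, and OEE is computed with the standard decomposition (availability × performance × quality), which may differ from the metrics the accelerator defines.

```python
import json
from collections import defaultdict

# Bronze: raw sensor payloads kept exactly as received (hypothetical fields).
BRONZE = [
    '{"machine": "press-1", "shift": "A", "run_min": 200, "planned_min": 240, "units": 450, "good_units": 430}',
    '{"machine": "press-1", "shift": "A", "run_min": 200, "planned_min": 240, "units": 450, "good_units": 425}',
    '{"machine": "press-1", "shift": "B", "run_min": 120, "planned_min": 240, "units": 240, "good_units": 240}',
]

# Assumed reference data: ideal output rate per machine and workforce per shift.
IDEAL_RATE_PER_MIN = {"press-1": 2.5}
WORKFORCE = {"A": {"operators": 4}, "B": {"operators": 2}}


def to_silver(bronze_rows):
    """Silver: parse each JSON payload into a structured record."""
    return [json.loads(row) for row in bronze_rows]


def to_gold(silver_rows, ideal_rates, workforce):
    """Gold: aggregate per machine and shift, compute OEE, join workforce data."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in silver_rows:
        key = (row["machine"], row["shift"])
        for field in ("run_min", "planned_min", "units", "good_units"):
            totals[key][field] += row[field]

    gold = {}
    for (machine, shift), t in totals.items():
        availability = t["run_min"] / t["planned_min"]
        performance = t["units"] / (t["run_min"] * ideal_rates[machine])
        quality = t["good_units"] / t["units"]
        gold[(machine, shift)] = {
            "availability": availability,
            "performance": performance,
            "quality": quality,
            "oee": availability * performance * quality,
            "operators": workforce[shift]["operators"],
        }
    return gold


GOLD = to_gold(to_silver(BRONZE), IDEAL_RATE_PER_MIN, WORKFORCE)
for key, metrics in sorted(GOLD.items()):
    print(key, round(metrics["oee"], 4), metrics["operators"])
```

Keeping the raw strings in the bronze list untouched mirrors the idea that parsing mistakes in later layers can be fixed and replayed without re-ingesting sensor data.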

Read more at Databricks Blog

Part Level Demand Forecasting at Scale

📅 Date:

✍️ Authors: Max Kohler, Pawarit Laosunthara, Bryan Smith, Bala Amavasai

🔖 Topics: Demand Planning, Production Planning, Forecasting

🏢 Organizations: Databricks


The challenges of demand forecasting include ensuring the right granularity, timeliness, and fidelity of forecasts. Due to limitations in computing capability and the lack of know-how, forecasting is often performed at an aggregated level, reducing fidelity.

In this blog, we demonstrate how our Solution Accelerator for Part Level Demand Forecasting, built on the Databricks Lakehouse Platform, helps your organization forecast at the part level rather than at the aggregate level. Part-level demand forecasting is especially important in discrete manufacturing, where manufacturers are at the mercy of their supply chain, because the constituent parts of a discrete manufactured product (e.g. cars) depend on components provided by third-party original equipment manufacturers (OEMs). The goal is to map the forecasted demand values for each SKU to quantities of the raw materials (the input of the production line) that are needed to produce the associated finished product (the output of the production line).
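The final mapping step, from forecasted SKU demand to raw material quantities, amounts to multiplying each forecast by a bill of materials (BOM) and summing per part. The sketch below assumes a single-level BOM and made-up SKUs and quantities; a real BOM is typically multi-level and would need to be traversed recursively.

```python
from collections import defaultdict

# Hypothetical forecasted demand per finished product (SKU).
FORECAST = {"sedan": 100, "suv": 50}

# Hypothetical single-level bill of materials: parts needed per unit of SKU.
BOM = {
    "sedan": {"wheel": 4, "engine_v4": 1},
    "suv": {"wheel": 4, "engine_v6": 1, "roof_rail": 2},
}


def material_requirements(forecast, bom):
    """Translate forecasted SKU demand into total quantity needed per part."""
    needs = defaultdict(int)
    for sku, demand in forecast.items():
        for part, qty_per_unit in bom[sku].items():
            needs[part] += demand * qty_per_unit
    return dict(needs)


REQUIREMENTS = material_requirements(FORECAST, BOM)
print(REQUIREMENTS)
```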

Read more at Databricks Blog

How to pull data into Databricks from AVEVA Data Hub

Using MLflow to deploy Graph Neural Networks for Monitoring Supply Chain Risk

📅 Date:

🔖 Topics: Graph Neural Network, MLOps

🏢 Organizations: Databricks


We live in an ever more interconnected world, and nowhere is this more evident than in modern supply chains. Due to the global macroeconomic environment and globalisation, modern supply chains have become intricately linked and woven together. Companies worldwide rely on one another to keep their production lines flowing and to act ethically (e.g., complying with laws such as the Modern Slavery Act). From a modelling perspective, the procurement relationships between firms form an intricate, dynamic, and complex network spanning the globe.

Lastly, it was mentioned earlier that GNNs are a framework for defining deep learning algorithms over graph structured data. For this blog, we will utilise a specific architecture of GNNs called GraphSAGE. This algorithm does not require all nodes to be present during training, is able to generalise to new nodes efficiently, and can scale to billions of nodes. Earlier methods in the literature were transductive, meaning that the algorithms learned embeddings for nodes. This was useful for static graphs, but the algorithms had to be re-run after graph updates such as new nodes. Unlike those methods, GraphSAGE is an inductive framework which learns how to aggregate information from neighborhood nodes; i.e., it learns functions for generating embeddings, rather than learning embeddings directly. Therefore GraphSAGE ensures that we can seamlessly integrate new supply chain relationships retrieved from upstream processes without triggering costly retraining routines.
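To make the inductive idea concrete, here is a small sketch of a single GraphSAGE-style layer with a mean aggregator, written in plain Python. It is an illustration under assumptions rather than the blog's implementation: the weight matrices are fixed by hand to stand in for learned parameters, the node names and features are hypothetical, and neighbor sampling and output normalisation are omitted.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def sage_layer(node, features, neighbors, w_self, w_neigh):
    """Embed a node from its own features plus the mean of its neighbors'."""
    h_self = features[node]
    neigh = neighbors.get(node, [])
    if neigh:
        h_neigh = mean([features[n] for n in neigh])
    else:
        h_neigh = [0.0] * len(h_self)
    combined = [
        a + b for a, b in zip(matvec(w_self, h_self), matvec(w_neigh, h_neigh))
    ]
    return relu(combined)


# Stand-ins for weights that training would normally produce.
W_SELF = [[1.0, 0.0], [0.0, 1.0]]
W_NEIGH = [[0.5, 0.0], [0.0, 0.5]]

# Hypothetical supply chain graph with 2-dimensional firm features.
FEATURES = {
    "supplier_a": [1.0, 0.0],
    "supplier_b": [0.0, 1.0],
    "manufacturer": [1.0, 1.0],
}
NEIGHBORS = {"manufacturer": ["supplier_a", "supplier_b"]}

EMB_MANUFACTURER = sage_layer("manufacturer", FEATURES, NEIGHBORS, W_SELF, W_NEIGH)

# A new supplier appears after "training": the same weights produce its
# embedding directly, with no retraining step.
FEATURES["supplier_c"] = [2.0, 0.0]
NEIGHBORS["supplier_c"] = ["manufacturer"]
EMB_NEW = sage_layer("supplier_c", FEATURES, NEIGHBORS, W_SELF, W_NEIGH)

print(EMB_MANUFACTURER, EMB_NEW)
```

Because the layer is a function of features and neighborhoods rather than a lookup table of per-node embeddings, adding `supplier_c` only requires one more function call.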

Read more at Ajmal Aziz on Medium

Optimizing Order Picking to Increase Omnichannel Profitability with Databricks

📅 Date:

✍️ Authors: Peyman Mohajerian, Bryan Smith

🔖 Topics: BOPIS, Operations Research

🏢 Organizations: Databricks


The core challenge most retailers are facing today is not how to deliver goods to customers in a timely manner, but how to do so while retaining profitability. It is estimated that margins are reduced by 3 to 8 percentage points on each order placed online for rapid fulfillment. The cost of sending a worker to store shelves to pick the items for each order is the primary culprit, and with the cost of labor only rising (and customers expressing little interest in paying a premium for what are increasingly seen as baseline services), retailers are feeling squeezed.

But by parallelizing the work, the days or even weeks often spent evaluating an approach can be reduced to hours or even minutes. The key is to identify discrete, independent units of work within the larger evaluation set and then to leverage technology to distribute these across a large computational infrastructure. In the picking optimization explored above, each order represents such a unit of work, as the sequencing of the items in one order has no impact on the sequencing of any others. At the extreme end of things, we might execute optimizations on all 3.3 million orders simultaneously to perform our work incredibly quickly.
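The independence of orders is what makes this pattern easy to express in code: one function optimizes a single order, and an executor maps it over all orders. The sketch below is an assumption-laden illustration, not the article's solution. It uses a local thread pool purely to show the structure (for CPU-bound Python work, threads give little real speed-up, and a cluster would do the distribution in practice), a brute-force search over pick sequences, and hypothetical shelf positions along a single aisle.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations

# Hypothetical shelf positions (distance from the store entrance at 0).
SHELF_POSITION = {"milk": 12, "bread": 3, "eggs": 9, "apples": 5, "soap": 20}


def route_length(sequence):
    """Total walking distance when picking items in the given order."""
    position, total = 0, 0
    for item in sequence:
        total += abs(SHELF_POSITION[item] - position)
        position = SHELF_POSITION[item]
    return total


def best_sequence(order_items):
    """Brute-force the shortest pick sequence for one order."""
    return min(permutations(order_items), key=route_length)


# Each order is an independent unit of work.
ORDERS = {
    "order-1": ["milk", "bread", "eggs"],
    "order-2": ["soap", "apples"],
}


def optimize_all(orders, max_workers=4):
    """Optimize every order independently and collect the results by order id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(best_sequence, orders.values())
        return dict(zip(orders.keys(), results))


PLANS = optimize_all(ORDERS)
print(PLANS)
```

Since `best_sequence` never reads or writes shared state, swapping the thread pool for a distributed engine changes only `optimize_all`, not the per-order logic.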

Read more at Databricks Blog

Virtualitics’ integration with Databricks sorts out what’s under the surface of your data lake

📅 Date:

🏢 Organizations: Virtualitics, Databricks


Databricks users can benefit from Virtualitics’ multi-user interface because it can enable hundreds more people across the business to get value from complex datasets, instead of a small team of expert data scientists. Analysts and citizen data scientists can do self-serve data exploration by querying large datasets with the ease of typing in a question and using AI-guided exploration instead of writing lines of code. Business decision makers get their hands on AI-generated insights that can help them take smart, predictive actions.

Read more at Virtualitics Blog

How to Build Scalable Data and AI Industrial IoT Solutions in Manufacturing

📅 Date:

✍️ Authors: Bala Amavasai, Vamsi Krishna Bhupasamudram, Ashwin Voorakkara

🔖 Topics: IIoT, manufacturing analytics

🏢 Organizations: Databricks, Tredence


Unlike traditional data architectures, which are IT-based, in manufacturing there is an intersection between hardware and software that requires an OT (operational technology) architecture. OT has to contend with processes and physical machinery. Each component and aspect of this architecture is designed to address a specific need or challenge when dealing with industrial operations.

The Databricks Lakehouse Platform is ideally suited to manage large amounts of streaming data. Built on the foundation of Delta Lake, it lets you work with the large quantities of data streams delivered in small chunks from these multiple sensors and devices, providing ACID compliance and eliminating job failures compared to traditional warehouse architectures. The Lakehouse platform is designed to scale with large data volumes. Manufacturing produces multiple data types, both semi-structured (JSON, XML, MQTT, etc.) and unstructured (video, audio, PDF, etc.), which the platform pattern fully supports. By merging all these data types onto one platform, only one version of the truth exists, leading to more accurate outcomes.

Read more at Databricks Blog