


AI is driving a remarkable transformation throughout the industry, delivering unprecedented productivity gains and enabling rapid insights from vast amounts of data.
In this two-episode season premiere, Tirthankar Lahiri, SVP of Mission-Critical Data and AI Engines, discusses how Oracle AI Vector Search and embedded machine learning are harnessing the power of AI to unlock value from enterprise data, allowing developers to build sophisticated RAG and agentic frameworks that leverage the full power of the converged database architecture of Oracle Database — including its class-leading scalability, fault tolerance, and enterprise-grade security. Furthermore, Oracle Database provides several mechanisms to make data "AI-ready" by enabling declarative data intent for AI. In this session, we will describe these techniques, and more, to explain how to truly build an AI for data solution in this rapidly changing AI landscape! ------------------------------------ Episode Transcript:
00:00:00:00 - 00:00:34:07 Unknown Welcome to the Oracle Academy Tech Chat. This podcast provides educators and students in-depth discussions with thought leaders around computer science, cloud technologies, and software design to help students on their journey to becoming industry ready technology leaders of the future. Let's get started. Welcome to Oracle Academy Tech Chat, where we discuss how Oracle Academy prepares the next generation's workforce.
00:00:34:09 - 00:01:03:23 Unknown I'm your host, Tara Pierce. This is the first of two episodes on AI for data: when data meets intelligence. Our guest speaker is Tirthankar Lahiri, Senior Vice President for Mission-Critical Data and AI Engines at Oracle. He's responsible for the data engine for Oracle Database, including areas like AI Vector Search, indexing, and data compression. He also manages the Oracle TimesTen In-Memory and the Oracle NoSQL Database product teams. Tirthankar
00:01:03:23 - 00:01:33:13 Unknown has 30 years of experience in the database industry and has worked on a variety of areas such as performance, scalability, manageability, caching, in-memory architectures, and developer-focused functionality. He has 71 issued and several pending patents, a bachelor's in Computer Science from the Indian Institute of Technology, and a master's in Electrical Engineering from Stanford University. In the first episode, Tirthankar talks about how data makes AI intelligent and how enterprises are using AI to get greater value from their data.
00:01:33:15 - 00:01:59:19 Unknown Over to you, Tirthankar. Hi. Hey, guys. Thank you very much for joining. It's a great pleasure to be presenting AI for data. This is an exciting time in technology. AI is ubiquitous. AI changes everything. And AI actually makes data intelligent. Let's talk about that today. So, you know, Oracle is working on AI, as many of you know, at many levels in the enterprise stack.
00:01:59:21 - 00:02:31:22 Unknown We have AI initiatives for applications, AI initiatives for services, AI for data. And we're building a lot of AI infrastructure, as you've seen from the news. Now I'm going to focus on AI for data. That's the focus of my presentation today: how we bring the power of AI and unleash it on enterprise data. So Oracle's goal is to make AI for data extremely simple for basically everyone.
00:02:32:00 - 00:02:54:08 Unknown So no matter what kind of end user you are, whether you're an expert in AI, or a developer, or a DBA, or an analyst, every single persona should be able to leverage AI for data. We want to make it possible for all applications to leverage AI for data and benefit all workloads with AI for data. So this is the goal that we have for AI for data.
00:02:54:08 - 00:03:25:05 Unknown Now, there are again basically two kinds of AI in the classical sense. So let's quickly talk about one before I get to what's new. So traditional AI was basically what's called algorithmic AI. Algorithmic AI is based on machine learning models, typically non-neural-net, designed to do predictions, classifications, forecasting, etc. And for data science people, you know that there are many different machine learning algorithms.
00:03:25:07 - 00:03:44:06 Unknown And these are all now available in Oracle Database. So if you want, you can use one of these. This is an ever-evolving list. You can use one of these models to, first of all, to train, you know, a... sorry, you could use one of these algorithms. Excuse me, I keep mixing that up.
00:03:44:08 - 00:04:05:22 Unknown These are algorithms. You can use one of these to train models, and then to run inferencing using those models. So imagine you can take, you know, linear regression, use that algorithm to train a model, and then apply that model to data in real time to basically do predictions. So that's what in-database machine learning lets you do.
00:04:06:00 - 00:04:30:18 Unknown And we've had this capability for a while now. What is new is something called AI Vector Search, which is the primary focus of my presentation today. And this is newer, you know, and this is beyond classical machine learning. So basically, AI Vector Search is a new technology that enables searching for data by semantics rather than by values.
00:04:30:20 - 00:04:54:11 Unknown Why is this important? Because if you look at what databases traditionally do, for those of you who've been in the database field or have studied databases, databases essentially do what we call value-based searches, where given a value, they can search by that value, like, for instance, finding the revenue by each product. That's a very typical search you run inside of a database.
00:04:54:13 - 00:05:22:10 Unknown And they've excelled at this through various, you know, techniques like query optimization, SQL processing, etc. However, there is an ever-increasing volume of unstructured data which you really can't search by value; it has to be searched by semantics or meaning, like, you know, photos or images, or long, complex textual descriptions. There's no real value that you can search those with.
00:05:22:10 - 00:05:52:08 Unknown Effectively, you need to search them essentially by their semantic content, not by the value content. For instance, finding products that match a particular photo or match a description, that's not really something a database could do very well in the past. And this is a very important and ever-growing use case, because, you know, businesses need to do this today on a routine basis, forgetting about AI, just in general to keep the business running in a healthy fashion.
00:05:52:10 - 00:06:25:14 Unknown There are a lot of examples of use cases where a business needs to search its data by, sort of, the semantics. For instance, if, you know, you have parts going into the assembly line for manufacturing, a photo of the part should quickly tell you whether that part might be defective. When customers log in to e-commerce sites and browse products, say you try to check out a certain product, there is a desire from the e-commerce site to see what else they could then recommend to you in real time.
00:06:25:16 - 00:06:43:19 Unknown These are all examples. Another one is, of course, biometrics. You know, I'm coming in to the airport, I go through facial recognition. They want to make sure that I'm the person I said I am when I, you know, when I submitted my visa application. So all of these cases require semantic search, not value-based search.
00:06:43:21 - 00:07:11:12 Unknown And vector search does exactly that: it enables searching data by semantics. That's precisely what it does. And it does that using a primitive known as a vector, which is very simple, actually. You know, if you think about this, the beauty of this is the basic concept is very easy, very simple. A vector is simply a long string of numbers that captures the semantics of much more complex data.
00:07:11:12 - 00:07:36:14 Unknown And they're produced by something I call black magic: deep learning machine learning models that take this, you know, unstructured set of data on the left, apply these complex machine learning algorithms to that data, and then out comes a vector. It's actually incredible that this actually works, that you can take something as sophisticated as a Picasso painting and convert that into a string of numbers
00:07:36:14 - 00:07:59:16 Unknown that represents that painting. That's basically what a vector does. It's a string of numbers encoding the semantics. And once you do that, well, how do you then measure for similarity? The way you do that is by measuring the mathematical distance between the vectors. Now, I'm sure all of you are familiar with the vector concept from mathematics and physics.
00:07:59:18 - 00:08:22:09 Unknown Basically, vectors are points in multidimensional space, and there are many different ways to measure distance between them. You know, a simple example of a distance function is what we call Euclidean squared: we just take the sum of the squared differences of each coordinate. That's one distance function. However, there are many formulas for distance.
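[Editor's note: as a rough illustration of the Euclidean squared distance described here, and not code from the talk, this short Python sketch sums the squared per-coordinate differences between two toy vectors.]

```python
def euclidean_squared(a, b):
    """Euclidean squared distance: sum of squared per-coordinate differences."""
    assert len(a) == len(b), "vectors must have the same dimension count"
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two small made-up vectors; real embeddings have hundreds of dimensions.
v1 = [1.0, 2.0, 3.0]
v2 = [1.0, 4.0, 7.0]
print(euclidean_squared(v1, v2))  # 0^2 + (-2)^2 + (-4)^2 = 20.0
```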
00:08:22:11 - 00:08:45:07 Unknown And each machine learning model and each data scientist prefers a different one. Let's talk about how vectors get used in the real world. Now, if you think about a very simple business example, I know that most of you are not, you know, business people, but most of you use products, and sometimes, you know, products go wrong and you have to file, you know, ask for help from customer support.
00:08:45:08 - 00:09:10:11 Unknown Support incidents are very complex documents, very complex entities. They involve a lot of different attributes, and typically looking for similarity between support incidents is tricky. So a very simple example here is you take a simple incident and you encode the different properties of the incident as a vector. That's really what vector embedding models do.
00:09:10:11 - 00:09:36:17 Unknown They take the different aspects of data and they convert each aspect to a numerical value. These numbers are called dimensions. Now, in practice, of course, in the real world nothing is as simple as this. This is a very simple example, highly idealized, but it shows you that you can take a fairly complex structure, like a support incident document, which has lots of fields and lots of textual content, and convert that into a vector.
00:09:36:23 - 00:10:01:12 Unknown Okay, let's just take that as our baseline example for what I'm about to show you next. When you have collapsed incidents into two-dimensional vectors, they might look like this. So if you have incidents for laptops running slowly, and maybe an incident for a desktop that's crashing, they might look like this in the space.
00:10:01:14 - 00:10:26:19 Unknown And basically the similarity property of vectors is that things that are more similar have a smaller distance between them. So in this example, for instance, support incidents for laptops are more similar to each other than to those for desktops. And that's why you can use vector distance as a measure of similarity of these really complex components. So you've just seen how this was done.
00:10:26:19 - 00:10:51:09 Unknown We've taken this complex entity known as the support incident, broken that down into vectors, and then measured distances between them. Now, why are we doing this in the Oracle database? There are a lot of vector databases out there, like Pinecone, etc., and lots of dedicated specialists do this really well. And that's all they do, really, is measure vector similarity between two given vectors.
00:10:51:11 - 00:11:14:06 Unknown But we think the big requirement we have is in business applications: sometimes you need to combine semantic search and value-based search, and that requires both searches to run together. So you could imagine your business database takes the incident data, moves that to a vector database, and runs the search there. That's one way of doing it.
00:11:14:08 - 00:11:38:21 Unknown However, you have to send other data as well, because sometimes you want to filter the customer support similarity search with customer information. Maybe I only want customers from a certain region, or I only want to look for incidents for a certain product. Those are searches that are better off with other kinds of data added to the search.
00:11:38:23 - 00:12:02:07 Unknown So this way, if you see what happens is you have to send a lot of extra data, because you don't know what might be asked. What kind of question does my support incident query involve? Customer information, product information, region information? All of that has to be sent to the vector database. And this causes some issues. It makes the data stale and adds a bunch of complexity.
00:12:02:09 - 00:12:39:04 Unknown It also compromises security, because now your vector database becomes the weakest link in your security architecture. As you know, security works in the following way: you're as secure as your weakest link. Your house is as secure as your weakest window. So the minute you add more products to your ecosystem, you end up compromising security. And of course, databases like Oracle have a lot more capabilities: you know, first of all, much more sophisticated queries, much better fault tolerance, much better security than dedicated vector specialists.
00:12:39:06 - 00:13:04:03 Unknown So what we said was: instead of customers using a vector database to search data by semantics, let's put that functionality into the Oracle database to begin with. That way, every search runs on current data. You don't need to guess what data might be needed, because it's all there in the database. There is no data movement required, no need to manage multiple products, and no compromising
00:13:04:03 - 00:13:32:23 Unknown the security and fault tolerance, by having your vectors inside your, you know, production enterprise database. And now let's get into some, you know, under-the-hood stuff. Now, once I've done this, well, what kind of queries can I run? How do I query for similarity? It turns out SQL is really, really powerful. I would encourage everyone here who has not taken a class involving SQL to brush up on SQL.
00:13:33:01 - 00:13:56:15 Unknown You know, because SQL is actually an intergalactic language standard for declarative, simple queries, as this example shows. If I want to find support incidents that are similar to my current incident, this is the query: I vectorize my current incident into, you know, the search vector. Imagine that's the vector. So I'm just going to show you, make sure the pointer works here.
00:13:56:17 - 00:14:15:10 Unknown Can you guys see the pointer here? There we go. Yeah. So the search vector is a vectorization of my own incident, let's say. So I create that search vector, and then I find the vectors and I rank them by distance from the search vector. And I say I only want the top ten. So that's how SQL works.
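[Editor's note: the speaker's slide isn't in the transcript, so here is a Python sketch, with made-up incident data, of what the ranking he describes computes: order every row by distance to the search vector and keep the top k. Oracle would express this in SQL with an ORDER BY on the vector distance.]

```python
def euclidean_squared(a, b):
    """Sum of squared per-coordinate differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical incident embeddings, collapsed to two dimensions.
incidents = {
    "INC-1": [0.9, 0.1],  # "laptop running slowly"
    "INC-2": [0.7, 0.3],  # "laptop battery drains fast"
    "INC-3": [0.1, 0.9],  # "desktop crashing"
}
search_vector = [0.85, 0.15]  # vectorization of my current incident

# Rank all incidents by distance and fetch the first k rows only.
top_k = sorted(incidents,
               key=lambda i: euclidean_squared(incidents[i], search_vector))[:2]
print(top_k)  # the two laptop incidents rank closest
```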
00:14:15:10 - 00:14:42:12 Unknown It's very simple, extremely user friendly, and very easy to express sophisticated searches with. So this gets me the top ten incidents closest to mine. Now let's get more interesting: I only want to look at incidents that are similar and that are for a laptop. So then I would just take that same SQL statement and extend it with a simple join, and filter by the product type laptop.
00:14:42:14 - 00:15:07:23 Unknown Okay, so that makes it clear to the database engine not to return, you know, incidents that don't correspond to laptops. And these are again ranked by the same vector distance, and again I only want the first ten rows. And you can see that you can keep expanding on this. Maybe I only care about incidents for laptops reported by customers in Las Vegas, because I happen to be in Las Vegas.
00:15:08:01 - 00:15:25:17 Unknown Las Vegas is kind of a hot area in the summer, for those of you who've been there, and maybe there are some unique issues for Las Vegas customers. So I can add a filter on Las Vegas customers to the list of joined tables, and then run the vector search on that joined result. This is the beauty of SQL.
00:15:25:17 - 00:15:53:04 Unknown SQL is declarative, and SQL is extensible and composable. And you can create really sophisticated queries with very simple building blocks. If you look at what's on the right, it's very powerful, because this combines vector data with production relational data in a few lines of SQL, providing you with a single solution where everything is consistent.
00:15:53:04 - 00:16:16:16 Unknown There's nothing stale here. Every customer, every product, every support incident is current. And I think developers can learn to use this within minutes. There's nothing new to learn for anybody who has even basic SQL knowledge. So this is the power of putting vectors into the production enterprise database and running basically this type of converged SQL on it.
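[Editor's note: a Python sketch, with invented rows, of the composed search the speaker describes: a relational filter (product type) applied together with vector ranking. In Oracle this would be a WHERE clause plus an ORDER BY on the vector distance; the names and vectors here are hypothetical.]

```python
def euclidean_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical incident rows with both relational and vector columns.
rows = [
    {"id": "INC-1", "product": "laptop",  "vec": [0.9, 0.1]},
    {"id": "INC-2", "product": "desktop", "vec": [0.8, 0.2]},
    {"id": "INC-3", "product": "laptop",  "vec": [0.2, 0.8]},
]
query_vec = [0.85, 0.15]

# Value-based filter first (WHERE product = 'laptop') ...
laptops = [r for r in rows if r["product"] == "laptop"]
# ... then semantic ranking (ORDER BY distance, FETCH FIRST 10 ROWS ONLY).
ranked = sorted(laptops, key=lambda r: euclidean_squared(r["vec"], query_vec))[:10]
print([r["id"] for r in ranked])
```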
00:16:16:18 - 00:16:38:12 Unknown Okay, let's now do a little deeper dive under the hood into AI Vector Search itself. This is designed more from an end-user standpoint, but I'll highlight some areas that might be of academic interest, that might merit exploration or learning more about. And there's tons of literature out there in the field and lots of stuff that's evolving daily.
00:16:38:14 - 00:16:58:02 Unknown But hopefully this will give you a high-level idea of what the space includes. Okay, so AI Vector Search, from a user point of view, in Oracle Database basically consists of four steps. You first have to take the data that you want to search and encode the data into vectors. Right. That's, of course, step one.
00:16:58:02 - 00:17:28:16 Unknown And that's done using something we call embedding models, or vector embedding models, interchangeably. Embedding is a natural language processing term from the 80s; I think it's become a standard synonym for vectors. So vectors, vector embeddings: same thing. You first encode the data, then the data you're searching for. Like, if you have a question you want to ask, you encode that search data, the question, into a vector with that same model.
00:17:28:16 - 00:17:53:01 Unknown So whatever model was used to embed the original data — like if I have a list of images I'm going to search, I use a certain vector model to encode them into vectors. When I want to search for a certain image afterwards, I encode that image using that same model. Then, of course, I find the k nearest vectors ordered by distance to the question, and then return the data corresponding to those vectors.
00:17:53:01 - 00:18:22:17 Unknown That's the way vector search works. So let's talk about vector generation. I want to skip some of the details in vector generation. So typically, for a database application, there are three ways customers can do this. One is they can use pre-created vectors. There's a lot of data that is already vectorized, and if you already have a table of data with the image or text and its vector, you can load that directly into the database.
00:18:22:19 - 00:18:47:08 Unknown No problem. The second thing is, often you want to use a third-party embedding service like OpenAI. They provide REST endpoints to generate vector embeddings from your data, and you can do that from the database using a PL/SQL function. And of course, the third approach is you can load the model into the database and do the vector generation inside the database.
00:18:47:10 - 00:19:16:03 Unknown Okay, so that's the way vector embeddings are produced. So, very simply, the benefit of this approach is that you can make the database the API hub for your operation. So even if the vectors are being produced outside the database, like using OpenAI or Cohere or Google's embedding models, you can do that call-out from the database, so that your end user doesn't need to do two different things.
00:19:16:05 - 00:19:41:08 Unknown They can supply the credentials to the database and have the database essentially generate the embeddings directly from source data. So this simple function takes, like, a support incident description and creates the vector for it using the supplied embedding credential. This is an OpenAI credential, for instance, as an example. So, very simple.
00:19:41:10 - 00:20:10:02 Unknown It keeps everything simple, consistent, and SQL-oriented. Now, if you want to run everything inside the database, you can load models into the database. And there is a technique known as ONNX, the Open Neural Network Exchange. That's the standard sort of runtime that supports vector embedding generation. So step one is you load that model into the database, whatever model you want.
00:20:10:02 - 00:20:35:20 Unknown This is a very common model from Sentence Transformers. It has a nice long name: all-MiniLM-L6-v2. You can load the model into the database using a PL/SQL function. And then, once the model is in the database, you can use that model in the database to convert the incident descriptions into vectors.
00:20:35:22 - 00:20:55:15 Unknown So this vector embedding function runs inside the database and produces vectors. Very simple, okay. So now we know how to get vectors from data, hopefully. What do we now do with them? Well, the first thing to do is to store them, because vectors have to go somewhere persistent for long-term search by the database. So let's see, how do we store the vectors?
00:20:55:17 - 00:21:19:03 Unknown So vectors are a new data type in Oracle. We basically can declare columns with the vector type. We can optionally give it more details, like how big the vector is and what the types of each number in the vector are. These are all things that we can support inside the database: specify how many dimensions.
00:21:19:05 - 00:21:45:12 Unknown This is really a property of the vector embedding model. Some models have smaller dimension counts and some larger, but depending on the model that you intend to use, you should use vectors of that size. However, you can also just avoid specifying the dimension count and dimension type altogether, because, you know, models change rapidly.
00:21:45:14 - 00:22:05:20 Unknown And if you don't want to change your table definitions, you could leave the dimension count unspecified. And this format allows you to store vectors of any size and type inside the column, so that if the model changes, your schema doesn't change, which is very useful. And also, this lets you support multiple models in the same column.
00:22:05:20 - 00:22:30:23 Unknown For instance, I might have a model for Japanese résumés versus English résumés, and I have a column of data in the table that tells me what type of résumé is being stored, and I can use different models for those rows. So that's a very powerful capability, allowing vectors to be mixed and matched. Okay. As I said earlier, the main operation is vector distance: when I have two different vector values,
00:22:31:01 - 00:22:59:08 Unknown the only thing that really makes sense to do with them is to see how similar they are using distance. And again, there are many vector distance formulas: there's Euclidean, cosine similarity. They're all embedding-model specific. So each model is designed to use a certain distance function to measure similarity. And, you know, vectors for words like tiger and lion will be closer to each other than vectors for tiger and apple, as an example.
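[Editor's note: a toy Python sketch of the cosine similarity the speaker mentions. The three vectors below are invented stand-ins, not real embeddings of "tiger", "lion", and "apple".]

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-dimensional embeddings, for illustration only.
tiger = [0.9, 0.8, 0.1]
lion  = [0.85, 0.75, 0.15]
apple = [0.1, 0.2, 0.9]

# tiger points in nearly the same direction as lion, so its cosine
# similarity to lion is higher than to apple.
print(cosine_similarity(tiger, lion) > cosine_similarity(tiger, apple))  # True
```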
00:22:59:08 - 00:23:24:11 Unknown So this gives you the similarity property with which you can search for data by semantics. Okay. Now, the real rocket science behind vector search is called vector indexes. Let's talk about that. This is the part that's really interesting from an academic standpoint. Vector indexes allow you to make similarity searches happen hundreds of times faster.
00:23:24:13 - 00:23:45:09 Unknown That's why you need the indexes. So let's talk about them. You could, of course, search the entire column, every value in a vector column, exhaustively, and you'd get good results, but that's slow. So we index the vectors so that you get much faster access, and you find your top k nearest neighbors much more quickly.
00:23:45:11 - 00:24:17:12 Unknown So let's talk about the neighbor graph vector index. We have two types of indexes; let's talk about the graph index first here. This is basically a type of index where the index is stored as a graph, where edges between the vectors represent vector similarity. This is an in-memory index designed for high accuracy and speed. It's not meant for very large data, but today memories are getting pretty big, so it can store a reasonable amount of data.
00:24:17:17 - 00:24:44:13 Unknown And one example of a graph vector index that's very popular is called HNSW: Hierarchical Navigable Small World. It has sort of become the B-tree of vector indexes. It looks a bit like this under the hood. The way the HNSW structure works, it's a multi-layer graph, and each layer is a subset of the layer below it.
00:24:44:15 - 00:25:10:00 Unknown So layer zero has all the vectors, and layers one and two and three have a subset of that data; the top layer has only a few. So what happens is you begin your search from the top layer. You go down a level each time to find the nearest neighbor in the next layer, and once you get to the last layer, you return that vector and its neighbors.
00:25:10:02 - 00:25:31:17 Unknown This is sort of a high-level description of how vector index navigation works for graph-based vector indexes. It makes the vector search very, very fast, because this is basically a log n traversal compared to being an order n traversal. So it's very fast, and especially because these are basically pointers in memory, the navigation happens at memory speed.
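[Editor's note: a highly simplified, single-layer Python sketch of the greedy graph navigation described above. Real HNSW adds multiple layers and a candidate list; the graph, vectors, and entry point here are all invented for illustration.]

```python
def euclidean_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# A tiny made-up neighbor graph: edges connect similar vectors.
vectors = {
    "A": [0.0, 0.0], "B": [0.5, 0.5], "C": [1.0, 1.0],
    "D": [1.5, 1.0], "E": [0.2, 0.9],
}
neighbors = {
    "A": ["B", "E"], "B": ["A", "C", "E"], "C": ["B", "D"],
    "D": ["C"], "E": ["A", "B"],
}

def greedy_search(query, entry="A"):
    """Hop to whichever neighbor is closer to the query, until none is."""
    current = entry
    while True:
        best = min(neighbors[current],
                   key=lambda n: euclidean_squared(vectors[n], query))
        if euclidean_squared(vectors[best], query) >= euclidean_squared(vectors[current], query):
            return current  # local minimum: no neighbor is closer
        current = best

print(greedy_search([1.4, 1.1]))  # walks A -> B -> C -> D and prints D
```

Because each hop roughly halves the remaining distance, the number of hops grows far more slowly than the number of vectors, which is the log n behavior the speaker describes.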
00:25:31:19 - 00:25:56:14 Unknown So this is probably the fastest vector index out there. The other approach, when you have very large data that does not fit in memory, the way to search that is another vector index that's partition-based. Here, what we do is we take the vectors and we divide them into different groups, or clusters, or partitions, based on vector similarity.
00:25:56:16 - 00:26:25:10 Unknown And one example is something called the inverted file index, or IVF Flat, which basically groups vectors into these clusters based on vector similarity, and this essentially scales up to unlimited sizes of data. HNSW is faster, but it only works if your data fits in memory. IVF can handle basically terabytes and petabytes of data. And I'll give you a quick illustration of how IVF works.
00:26:25:10 - 00:26:48:22 Unknown It's very simple. So imagine I have a two-dimensional set of vectors. The first step is to take these vectors and classify them into groups, and this is done using an algorithm called K-means clustering. It's a very familiar algorithm for those of you who've done data science or machine learning. K-means basically identifies k clusters, given a data set.
00:26:48:22 - 00:27:09:22 Unknown When you get a query vector that says find me the nearest neighbors, the first thing we do is, instead of looking at all the vectors, we first look at the nearest clusters to that vector. So we look at all the cluster distances by measuring the distance between the query vector and the middle of each cluster.
00:27:09:22 - 00:27:32:02 Unknown Let me go back a step. So these are different clusters of vectors. Each of them has a centroid, or a center of gravity. We define that as basically the average of those vectors. That lets us measure the distance between the query vector and the center of gravity, or centroid. And what we do is we find the nearest centroids.
00:27:32:04 - 00:27:58:01 Unknown In this case, here are two clusters that look like the closest to the search vector. Once I've done that, then I can do a search within those clusters for the nearest vectors. So what I did first was I reduced the search space by looking at only the nearest clusters, and then searching only within those clusters. And this is essentially how, you know, neighbor partition vector indexes work.
00:27:58:03 - 00:28:23:00 Unknown Very simple. Again, this works for unlimited sizes of data. There are standard algorithms or heuristics for how many clusters you want. Typically, what happens is if you have n vectors altogether, the number of clusters is the square root of n, so that gets us a quadratic reduction in the search space. Here's a simple DDL, just to show you how this looks from an end-user standpoint.
00:28:23:02 - 00:28:47:13 Unknown You basically define the vector index and you specify how you want the index to be organized, and using what distance function. Now, what's new in Oracle is you can also specify how accurate you want the vector index to be: you can specify a default target accuracy. The more accurate the index, the slower the index construction is, but the better the results are.
00:28:47:15 - 00:29:13:01 Unknown So it's easy for developers to specify, because you could specify low-level index parameters that only a data scientist would understand, but an accuracy number is easy. Saying, hey, I want this index to be 95% accurate, because this is meant for, like, facial recognition at border control. But if it's product recommendations, I might be okay with a lower accuracy in order to make the index construction faster.
00:29:13:01 - 00:29:40:13 Unknown So you can define the target accuracy based on your target use case. Thank you, Tirthankar. That's all for this episode. Please join us for episode two to hear the conclusion of Tirthankar's presentation. Until then, thank you for listening. That wraps up this episode. Thanks for listening, and stay tuned for the next Oracle Academy Tech Chat podcast.
AI is driving a remarkable transformation throughout the industry, delivering unprecedented productivity gains and enabling rapid insights from vast amounts of data.
In this two-episode season premiere, Tirthankar Lahiri, SVP of Mission-Critical Data and AI Engines, discusses how Oracle AI Vector and embedded machine learning search are harnessing the power of AI to unlock value from enterprise data, and allow developers to build sophisticated RAG and Agentic frameworks that leverage the full power of the converged database architecture of Oracle Database — including its class-leading scalability, fault-tolerance, and enterprise-grade security. Furthermore, Oracle database provides several mechanisms to make data "AI-ready" by enabling declarative data intent for AI. In this session, we will describe these techniques, and more, to explain how to truly build an AI for data solution in this rapidly changing AI landscape! ------------------------------------ Episode Transcript:
00:00:00:00 - 00:00:34:07 Unknown Welcome to the Oracle Academy Tech Chat. This podcast provides educators and students in-depth discussions with thought leaders around computer science, cloud technologies, and software design to help students on their journey to becoming industry ready technology leaders of the future. Let's get started. Welcome to Oracle Academy Tech Chat, where we discuss how Oracle Academy prepares the next generation's workforce.
00:00:34:09 - 00:01:03:23 Unknown I'm your host, Tara Pierce. This is the first of two episodes on AI for data: when data meets intelligence. Our guest speaker is Tirthankar Lahiri, senior vice president for Mission-Critical Data and AI Engines at Oracle. He's responsible for the data engine for Oracle Database, including areas like AI Vector Search, indexing, and data compression. He also manages the Oracle TimesTen In-Memory and the Oracle NoSQL Database product teams.
00:01:03:23 - 00:01:33:13 Unknown Tirthankar has 30 years of experience in the database industry and has worked on a variety of areas such as performance, scalability, manageability, caching, in-memory architectures, and developer-focused functionality. He has 71 issued and several pending patents, a bachelor's in Computer Science from the Indian Institute of Technology, and a master's in Electrical Engineering from Stanford University. In the first episode, Tirthankar talks about how data makes AI intelligent and how enterprises are using AI to get greater value from their data.
00:01:33:15 - 00:01:59:19 Unknown Over to you, Tirthankar. Hi. Hey, guys. Thank you very much for joining. It's a great pleasure to be presenting AI for data. This is an exciting time in technology. AI is ubiquitous. AI changes everything. And AI actually makes data intelligent. Let's talk about that today. So, you know, Oracle is working on AI, as many of you know, at many levels in the enterprise stack.
00:01:59:21 - 00:02:31:22 Unknown We have AI initiatives for applications, AI initiatives for services, AI for data. And we're building a lot of AI infrastructure, as you've seen from the news. Now I'm going to focus on AI for data. That's the focus of my presentation today: how we bring the power of AI and unleash it on enterprise data. So Oracle's goal is to make AI for data extremely simple for basically everything.
00:02:32:00 - 00:02:54:08 Unknown So no matter what kind of end user you are, whether you're an expert in AI, or a developer, or a DBA, or an analyst, every single persona should be able to leverage AI for data. We want to make it possible for all applications to leverage AI for data and benefit all workloads with AI for data. So this is the goal that we have for AI for data.
00:02:54:08 - 00:03:25:05 Unknown Now, there are basically two kinds of AI here. So let's quickly talk about the classical one before I get to what's new. Traditional AI was basically called algorithmic AI. Algorithmic AI here is based on machine learning models, typically non-neural-net, designed to do predictions, classifications, forecasting, etc., and for data science people, you know that there are many different machine learning algorithms.
00:03:25:07 - 00:03:44:06 Unknown And these are all now available in Oracle Database, so if you want, you can use one of them. This is an ever-evolving list of algorithms.
00:03:44:08 - 00:04:05:22 Unknown You can use one of these algorithms to train models and then run inferencing using those models. So imagine you take linear regression, use that algorithm to train a model, and then apply the model to data in real time to do predictions. That's what in-database machine learning lets you do.
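The train-then-infer loop described above can be sketched in plain Python. This is a toy, not Oracle's in-database ML API: it fits a simple linear regression in closed form (the algorithm), producing a model that can then be applied to new data for predictions (the inference step). The data values are made up.

```python
# Toy "algorithmic AI": train a model with an algorithm, then run inference.

def train_linear_model(xs, ys):
    """Fit y = a*x + b by least squares and return the model (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def predict(model, x):
    """Apply the trained model to new data in real time (inference)."""
    a, b = model
    return a * x + b

# Train on historical data, then predict for a new input.
model = train_linear_model([1, 2, 3, 4], [2, 4, 6, 8])
print(predict(model, 5))  # 10.0
```

The same two-phase shape (train once, then score many rows) is what the in-database algorithms described above provide, just with much richer model types.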
00:04:06:00 - 00:04:30:18 Unknown And we've had this capability for a while now. So what is new is something called AI Vector Search, which is the primary focus of my presentation today. And this is newer, beyond classical machine learning. Basically, AI Vector Search is a new technology that enables searching for data by semantics rather than by values.
00:04:30:20 - 00:04:54:11 Unknown Why is this important? Because if you look at what databases traditionally do, for those of you who've been in the database field or have studied databases, databases essentially do what we call value-based searches, where given a value, they can search by that value, like, for instance, finding the revenue by each product. That's a very typical search you run inside of a database.
00:04:54:13 - 00:05:22:10 Unknown And they've excelled at this through various techniques like query optimization, SQL processing, etc. However, there is an ever-increasing volume of unstructured data which you really can't search by value; it has to be searched by semantics or meaning, like photos or images or long, complex textual descriptions. There's no real value that you can search those with.
00:05:22:10 - 00:05:52:08 Unknown Effectively, you need to search them by their semantic content, not by value content. For instance, finding products that match a particular photo or match a description is not really something a database could do very well in the past. And this is a very important, ever-growing use case, because businesses need to do this today on a routine basis, setting AI aside, just in general to keep the business running in a healthy fashion.
00:05:52:10 - 00:06:25:14 Unknown There are a lot of examples of use cases where a business needs to search its data by semantics. For instance, if you have parts going into the assembly line for manufacturing, a photo of the part should quickly tell you whether that part might be defective. When customers log in to e-commerce sites and browse products, or try to check out a certain product, there is a desire from the e-commerce site to see what else they could recommend in real time.
00:06:25:16 - 00:06:43:19 Unknown These are all examples. Another one is, of course, biometrics. I'm coming into the airport, I go through facial recognition, and they want to make sure that I'm the person I said I am when I submitted my visa application. So all of these cases require semantic search, not value-based search.
00:06:43:21 - 00:07:11:12 Unknown And vector search is exactly that: it enables searching data based on semantics. That's precisely what it does. And it does that using a primitive known as a vector, which is very simple, actually. The beauty of this is that the basic concept is very easy, very simple. A vector is simply a long string of numbers that captures the semantics of much more complex data.
00:07:11:12 - 00:07:36:14 Unknown And they're produced by something I call black magic: deep learning, machine learning models that take this unstructured data on the left, apply these complex machine learning algorithms to that data, and out comes a vector. It's actually incredible that this works, that you can take something as sophisticated as a Picasso painting and convert that into a string of numbers.
00:07:36:14 - 00:07:59:16 Unknown That represents that painting. That's basically what a vector does. It's a string of numbers encoding the semantics. And once you do that, how do you then measure similarity? The way you do that is by measuring the mathematical distance between the vectors. Of course, all of you are familiar with the vector concept from mathematics and physics.
00:07:59:18 - 00:08:22:09 Unknown Basically, vectors are points in multidimensional space, and there are many different ways to measure distance between them. A simple example of a distance function is what we call Euclidean squared: we just take the sum of the squares of the differences of each coordinate. That's one distance function. However, there are many formulas for distance.
00:08:22:11 - 00:08:45:07 Unknown And each machine learning model and each data scientist prefers a different one. Let's talk about how vectors get used in the real world. Now, think about a very simple business example. I know that most of you are not business people, but most of you use products, and sometimes products go wrong and you have to file a request for help from customer support.
00:08:45:08 - 00:09:10:11 Unknown Support incidents are very complex documents, very complex entities. They involve a lot of different attributes, and typically looking for similarity between support incidents is tricky. So a very simple example here is you take a simple incident and you encode the different properties of the incident as a vector. That's really what vector embedding models do.
00:09:10:11 - 00:09:36:17 Unknown They take the different aspects of data and they convert each aspect to a numerical value. These numbers are called dimensions. Now, in practice, of course, nothing in the real world is as simple as this. This is a very simple, highly idealized example, but it shows you that you can take a fairly complex structure, like a support incident document, which has lots of fields and lots of textual content, and convert that into a vector.
00:09:36:23 - 00:10:01:12 Unknown Okay, let's just take that as our baseline example for what I'm about to show you next. When you have collapsed incidents into two-dimensional vectors, they might look like this. So if you have incidents for laptops running slowly, and maybe an incident for a desktop that's crashing, they might look like this in the space.
00:10:01:14 - 00:10:26:19 Unknown And basically the similarity property of vectors is that things that are more similar have a smaller distance between them. So in this example, for instance, support incidents for laptops are more similar to each other than to those for desktops. And that's why you can use vector distance as a measure of similarity of these really complex entities. So you've just seen how this is done.
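The idea above can be made concrete with a toy sketch in plain Python: encode support incidents as small vectors and use Euclidean-squared distance as the similarity measure. Real embedding models produce hundreds of dimensions; these two-dimensional vectors and incident names are made up for illustration.

```python
# Similar incidents get nearby vectors; distance measures similarity.

def euclidean_squared(u, v):
    """Sum of the squares of the differences of each coordinate."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical 2-D embeddings of support incidents.
incidents = {
    "laptop running slowly": (1.0, 2.0),
    "laptop overheating":    (1.2, 2.5),
    "desktop crashing":      (5.0, 8.0),
}

query = incidents["laptop running slowly"]
ranked = sorted(incidents, key=lambda k: euclidean_squared(incidents[k], query))
print(ranked[1])  # most similar *other* incident: "laptop overheating"
```

The two laptop incidents end up much closer to each other than to the desktop incident, which is exactly the similarity property being described.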
00:10:26:19 - 00:10:51:09 Unknown We've taken this complex entity known as the support incident, broken it down into vectors, and then measured distances between them. Now, why are we doing this in the Oracle database? There are a lot of vector databases out there, like Pinecone, etc., and lots of dedicated specialists do this really well. And that's all they really do: measure vector similarity between two given vectors.
00:10:51:11 - 00:11:14:06 Unknown But we think the big requirement we have in business applications is that sometimes you need to combine semantic search and value-based search, and that requires both searches to run together. So you could imagine your business database takes the incident data, moves that to a vector database, and runs the search there. That's one way of doing it.
00:11:14:08 - 00:11:38:21 Unknown However, you have to send other data as well, because sometimes you want to filter the customer support similarity search with customer information. Maybe I only want customers from a certain region, or I only want to look for incidents for a certain product. Those are searches you're better off running with other kinds of data added to the search.
00:11:38:23 - 00:12:02:07 Unknown So what happens is you have to send a lot of extra data, because you don't know what might be asked. What kind of question does my support incident query involve? Customer information, product information, region information? All of that has to be sent to the vector database. And this causes some issues: it makes the data stale and adds a bunch of complexity.
00:12:02:09 - 00:12:39:04 Unknown It also compromises security, because now your vector database becomes the weakest link in your security architecture. As you know, security works in the following way: you're as secure as your weakest link. Your house is as secure as your weakest window. So the minute you add more products to your ecosystem, you end up compromising security. And of course, databases like Oracle have a lot more capabilities: much more sophisticated querying, much better fault tolerance, much better security than dedicated vector specialists.
00:12:39:06 - 00:13:04:03 Unknown So what we said was, instead of customers using a vector database to search data by semantics, let's put that functionality into the Oracle database to begin with. That way, every search runs on current data. You don't need to guess what data might be needed, because it's all there in the database. There is no data movement required, no need to manage multiple products, and no compromise
00:13:04:03 - 00:13:32:23 Unknown to security and fault tolerance, because your vectors are inside your production enterprise database. And now let's get into some under-the-hood stuff. Once I've done this, what kind of queries can I run? How do I query for similarity? It turns out SQL is really, really powerful. I would encourage everyone here who has not taken a class involving SQL to brush up on SQL.
00:13:33:01 - 00:13:56:15 Unknown You know, because SQL is actually an intergalactic language standard for declarative, simple queries, as this example shows. If I want to find support incidents that are similar to my current incident, this is the query: I vectorize my current incident into the search vector.
00:13:56:17 - 00:14:15:10 Unknown So the search vector is a vectorization of my own incident, let's say. I create that, and then I find the vectors and rank them by distance from the search vector, and I say I only want the top ten. So that's how SQL works.
00:14:15:10 - 00:14:42:12 Unknown It's very simple, extremely user-friendly, and very easy to express sophisticated searches with. So this gets me the top ten incidents closest to mine. Now, let's get more interesting: I only want to look at incidents that are similar and that are for a laptop. So then I would just take that same SQL statement and extend it with a simple join and filter by the product type laptop.
00:14:42:14 - 00:15:07:23 Unknown Okay, so that makes it clear to the database engine not to return incidents that don't correspond to laptops. And these are again ranked by the same vector distance, and again I only want the first ten rows. And you can see that you can keep expanding on this. Maybe I only care about incidents for laptops reported by customers in Las Vegas, because I happen to be in Las Vegas.
00:15:08:01 - 00:15:25:17 Unknown Las Vegas is kind of a hot area in the summer, for those of you who've been there, and maybe there are some unique issues for Las Vegas customers. So I can add a filter on Las Vegas customers to the list of joined tables, and then run the vector search on that join result. This is the beauty of SQL.
00:15:25:17 - 00:15:53:04 Unknown SQL is declarative, and SQL is extensible and composable, and you can create really sophisticated queries with very simple building blocks. If you look at what's on the right, it's very powerful, because this combines vector data with production relational data in a few lines of SQL, providing you with a single solution where everything is consistent.
00:15:53:04 - 00:16:16:16 Unknown There's nothing stale here. Every customer, every product, every support incident is current. And I think developers can learn to use this within minutes. There's nothing new to learn for anybody who has even basic SQL knowledge. So this is the power of putting vectors into the production enterprise database and running basically this type of converged SQL on it.
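The converged query being described (relational filters joined with a vector-distance ranking) can be sketched in plain Python. This is only a stand-in for what the SQL join plus ORDER BY vector distance does inside the database; all rows, column names, and vectors here are hypothetical.

```python
# Combine value-based filtering with semantic (vector-distance) ranking.

def euclidean_squared(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Each row carries relational columns plus its embedding vector.
incidents = [
    {"id": 1, "product": "laptop",  "region": "Las Vegas", "vec": (1.0, 2.0)},
    {"id": 2, "product": "laptop",  "region": "Seattle",   "vec": (1.1, 2.1)},
    {"id": 3, "product": "desktop", "region": "Las Vegas", "vec": (1.0, 2.0)},
    {"id": 4, "product": "laptop",  "region": "Las Vegas", "vec": (6.0, 9.0)},
]

def top_k(rows, query_vec, k, **filters):
    """Filter rows by exact column values, then rank by vector distance."""
    candidates = [r for r in rows
                  if all(r[col] == val for col, val in filters.items())]
    candidates.sort(key=lambda r: euclidean_squared(r["vec"], query_vec))
    return [r["id"] for r in candidates[:k]]

# "Incidents like mine, for laptops, from Las Vegas customers, top 2."
print(top_k(incidents, (1.0, 2.0), 2, product="laptop", region="Las Vegas"))
```

In the database, the same intent is a short SQL statement, and both the filter and the ranking run over current data in one engine instead of two products.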
00:16:16:18 - 00:16:38:12 Unknown Okay, let's now do a little deeper dive under the hood into AI Vector Search itself. This is designed more from an end-user standpoint, but I'll highlight some areas that might be of academic interest, that might merit exploration or learning more about. And there's tons of literature out there in the field, and lots of stuff that's evolving daily.
00:16:38:14 - 00:16:58:02 Unknown But hopefully this will give you a high-level idea of what the space includes. Okay, so AI Vector Search, from a user point of view, in the Oracle database basically consists of four steps. You first have to take the data that you want to search and encode the data into vectors. That's of course step one.
00:16:58:02 - 00:17:28:16 Unknown And that's done using something we call embedding models, or vector embedding models, interchangeably. Embedding is a natural language processing term from the 80s; I think it's become a standard synonym for vectors. So vectors and vector embeddings are the same thing. You first encode the data, then you encode the data you're searching for, like a question you want to ask, using that same model.
00:17:28:16 - 00:17:53:01 Unknown So whatever model was used to embed the original data: if I have a list of images I'm going to search, I use a certain vector model to encode them into vectors. When I want to search for a certain image afterwards, I encode that image using that same model. Then, of course, I find the k nearest vectors ordered by distance to the question, and then return the data corresponding to those vectors.
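The four steps above can be sketched end to end in plain Python. The "model" here is a toy word-overlap embedding standing in for a real deep-learning model; the documents, vocabulary, and question are all made up. The key point it illustrates is step two: the question must be encoded with the same model as the data.

```python
# Four steps: (1) encode the data, (2) encode the question with the SAME
# model, (3) find the k nearest vectors by distance, (4) return the data.

def toy_embed(text, vocab=("slow", "crash", "battery", "screen")):
    """Hypothetical embedding model: one dimension per vocabulary word."""
    words = text.lower().split()
    return tuple(float(words.count(w)) for w in vocab)

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

docs = ["laptop is slow", "desktop crash on boot", "battery drains fast"]
index = [(toy_embed(d), d) for d in docs]            # step 1: encode the data

question = "why is my laptop so slow"
qvec = toy_embed(question)                           # step 2: same model

nearest = sorted(index, key=lambda e: distance(e[0], qvec))  # step 3: k-NN
print(nearest[0][1])                                 # step 4: return the data
```

Using a different model for the question than for the data would put the two in incompatible vector spaces, which is why the talk stresses reusing the same model.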
00:17:53:01 - 00:18:22:17 Unknown That's the way vector search works. So let's talk about vector generation; I'll skip some of the details. Typically, for a database application, there are three ways customers can do this. One is they can use pre-created vectors. There's a lot of data that is already vectorized, and if you already have a table of data with the image or text and its vector, you can load that directly into the database.
00:18:22:19 - 00:18:47:08 Unknown No problem. The second way is that often you want to use a third-party embedding service like OpenAI. They provide REST endpoints to generate vector embeddings from your data, and you can do that from the database using a PL/SQL function. And of course, the third approach is you can load the model into the database and do the vector generation inside the database.
00:18:47:10 - 00:19:16:03 Unknown Okay, so that's the way vector embeddings are produced. Very simply, the benefit of this approach is that you can make the database the API hub for your operations. So even if the vectors are being produced outside the database, like using OpenAI's or Cohere's or Google's embedding models, you can do that call-out from the database so that your end user doesn't need to do two different things.
00:19:16:05 - 00:19:41:08 Unknown They can supply the credentials to the database and have the database essentially generate the embeddings directly from source data. So this simple function takes, say, a support incident description and creates the vector for it using the supplied embedding credential; this is an OpenAI credential, for instance, as an example. So very simple.
00:19:41:10 - 00:20:10:02 Unknown It keeps everything simple, consistent, and SQL-oriented. Now, if you want to run everything inside the database, you can load models into the database. And there is a technique known as ONNX, the Open Neural Network Exchange, which is the standard runtime format that supports vector embedding generation. So step one is you load the model you want into the database.
00:20:10:02 - 00:20:35:20 Unknown This is a very common sentence-transformer model with a nice long name, all-MiniLM-L6-v2. You can load the model into the database using a PL/SQL function, and then once the model is in the database, you can use it to convert the incident descriptions into vectors.
00:20:35:22 - 00:20:55:15 Unknown So this vector embedding function runs inside the database and produces vectors. Very simple. Okay, so now we know how to get vectors from data. What do we now do with them? Well, the first thing to do is store them, because vectors have to go somewhere persistent for long-term search by the database. So let's see how we store the vectors.
00:20:55:17 - 00:21:19:03 Unknown So vectors are a new data type in Oracle. We can basically declare columns with the vector type. We can optionally give it more details, like how big the vector is and what the type of each number in the vector is. These are all things that we can support inside the database.
00:21:19:05 - 00:21:45:12 Unknown How many dimensions to specify is really a property of the vector embedding model: some models have smaller dimension counts and some larger, but depending on the model that you intend to use, you should use vectors of that size. However, you can also just avoid specifying the dimension count and dimension type altogether, because models change rapidly.
00:21:45:14 - 00:22:05:20 Unknown And if you don't want to change your table definitions, you can leave the dimension count unspecified. This format allows you to store vectors of any size and type inside the column, so that if the model changes, your schema doesn't change, which is very useful. And this also lets you support multiple models in the same column.
00:22:05:20 - 00:22:30:23 Unknown For instance, I might have a model for Japanese résumés versus English résumés, and I have a column of data in the table that tells me what type of résumé is being stored, and I can use different models for those rows. So that's a very powerful capability, allowing vectors to be mixed and matched. Okay. As I said earlier, the main operation on vectors is vector distance. If I have two different vector values,
00:22:31:01 - 00:22:59:08 Unknown the only thing that really makes sense to do with them is to see how similar they are using distance. And again, there are many vector distance formulas: Euclidean, cosine similarity, and so on. They're all embedding-model specific; each model is designed to use a certain distance function to measure similarity. So vectors for words like tiger and lion will be closer to each other than vectors for tiger and apple, as an example.
00:22:59:08 - 00:23:24:11 Unknown So this gives you the similarity property with which you can search for data by semantics. Okay. Now, the real rocket science behind vector search is called vector indexes. Let's talk about that. This is the part that's really interesting from an academic standpoint. Vector indexes allow you to make similarity searches happen hundreds of times faster.
00:23:24:13 - 00:23:45:09 Unknown That's why you need the indexes. So you could, of course, search the entire vector column, every value, exhaustively, and you'd get good results, but slowly. So we index the vectors so that you get much faster access and you find your top k nearest neighbors much more quickly.
00:23:45:11 - 00:24:17:12 Unknown So let's talk about the neighbor graph vector index. We have two types of indexes; let's talk about the graph index first. This is basically a type of index where the index is stored as a graph, where edges between the vectors represent vector similarity. This is an in-memory index designed for high accuracy and speed. It's not meant for very large data, but today memories are getting pretty big, so it can store a reasonable amount of data.
00:24:17:17 - 00:24:44:13 Unknown And one example of a graph vector index that's very popular is called HNSW, Hierarchical Navigable Small World. It has sort of become the B-tree of vector indexes. It looks a bit like this under the hood. The way the HNSW structure works, it's a multi-layer graph, and each layer is a subset of the layer below it.
00:24:44:15 - 00:25:10:00 Unknown So layer zero has all the vectors, and layers one and two and three have a subset of that data; the top layer has only one vector. What happens is you begin your search from the top layer. You go down a level each time to find the nearest neighbor in the next layer, and once you get to the last layer, you return that vector and its neighbors.
00:25:10:02 - 00:25:31:17 Unknown This is a high-level description of the way vector index navigation works for graph-based vector indexes. It makes the vector search very, very fast, because this is basically a log n traversal compared to an order n traversal. So it's very fast, and especially because these are basically pointers in memory, the navigation happens at memory speed.
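A much-simplified, single-layer sketch of the greedy navigation idea behind graph indexes like HNSW: from an entry point, repeatedly hop to whichever neighbor is closer to the query, and stop at a local minimum. The real HNSW index uses multiple layers and a beam of candidates; this toy graph and its edges are made up for illustration.

```python
# Greedy nearest-neighbor search over a (toy) similarity graph.

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Node id -> (vector, neighbor ids); edges connect similar vectors.
graph = {
    0: ((0.0, 0.0), [1]),
    1: ((2.0, 2.0), [0, 2]),
    2: ((4.0, 4.0), [1, 3]),
    3: ((5.0, 5.0), [2]),
}

def greedy_search(graph, query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    while True:
        vec, neighbors = graph[current]
        # Candidate hop: the neighbor closest to the query.
        best = min(neighbors, key=lambda n: distance(graph[n][0], query))
        if distance(graph[best][0], query) >= distance(vec, query):
            return current  # no neighbor improves on the current node
        current = best

print(greedy_search(graph, (4.2, 4.1)))  # 2
```

Because each step only follows a handful of pointers instead of scanning all vectors, the traversal cost grows roughly logarithmically rather than linearly, which is the speedup being described.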
00:25:31:19 - 00:25:56:14 Unknown So this is probably the fastest vector index out there. The other approach, if you have very large data that does not fit in memory, is another vector index that's partition-based. Here, what we do is take the vectors and divide them into different groups or clusters or partitions based on vector similarity.
00:25:56:16 - 00:26:25:10 Unknown And one example is something called the Inverted File index, or IVF Flat, which basically groups vectors into these clusters based on vector similarity, and this essentially scales up to unlimited sizes of data. HNSW is faster, but it only works if your data fits in memory; IVF can handle basically terabytes and petabytes of data. And I'll give you a quick illustration of how IVF works.
00:26:25:10 - 00:26:48:22 Unknown It's very simple. So imagine I have a two-dimensional set of vectors. The first step is to take these vectors and classify them into groups, and this is done using an algorithm called k-means clustering. It's a very familiar algorithm for those of you who've done data science or machine learning. K-means basically identifies k clusters
00:26:48:22 - 00:27:09:22 Unknown given a data set. When you get a query vector that says find me the nearest neighbors, the first thing we do, instead of looking at all the vectors, is look at the nearest clusters to that vector. So we look at all the cluster distances by measuring the distance between the query vector and the middle of each cluster.
00:27:09:22 - 00:27:32:02 Unknown Let me go back a step. So these are different clusters of vectors. Each of them has a centroid, or a center of gravity; we define that as basically the average of those vectors. That lets us measure the distance between the query vector and the center of gravity, or centroid. And what we do is we find the nearest centroids.
00:27:32:04 - 00:27:58:01 Unknown In this case, here are the two clusters that look like the closest to the search vector. Once I've done that, then I can do a search within those clusters for the nearest vectors. So what I did first was reduce the search space by looking at only the nearest clusters, and then search only within those clusters. And this is essentially how neighbor partition vector indexes work.
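The IVF search just described can be sketched in plain Python. The clusters here are hand-picked stand-ins for what k-means clustering would produce, and the vectors are made up; the point is the two-phase search: rank centroids first, then scan only the nearest cluster(s) instead of every vector.

```python
# IVF-style search: compare against centroids, then scan only nearby clusters.

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(vectors):
    """Average of the cluster's vectors (its center of gravity)."""
    dims = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / len(vectors) for d in range(dims))

clusters = [
    [(0.0, 0.0), (1.0, 1.0)],      # cluster A (as k-means might group them)
    [(10.0, 10.0), (11.0, 11.0)],  # cluster B
]
centroids = [centroid(c) for c in clusters]

def ivf_search(query, nprobe=1):
    """Scan only the nprobe clusters whose centroids are nearest the query."""
    order = sorted(range(len(clusters)),
                   key=lambda i: distance(centroids[i], query))
    candidates = [v for i in order[:nprobe] for v in clusters[i]]
    return min(candidates, key=lambda v: distance(v, query))

print(ivf_search((10.1, 10.2)))  # (10.0, 10.0)
```

With square root of n clusters of roughly square root of n vectors each, the query touches about square root of n centroids plus a few clusters of that size, which is the quadratic reduction in search space mentioned next.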
00:27:58:03 - 00:28:23:00 Unknown Very simple. Again, this works for unlimited sizes of data. There are standard heuristics for how many clusters you want: typically, if you have n vectors altogether, the number of clusters is the square root of n, so that gets us a reduction in the search space by a quadratic amount. Here's a simple DDL just to show you how this looks from an end-user standpoint.
00:28:23:02 - 00:28:47:13 Unknown You basically define the vector index and you specify how you want the index to be organized and using what distance function. Now, what's new in Oracle is that you can also specify how accurate you want the vector index to be; you can specify a default target accuracy. The more accurate the index, the slower the index construction is, but the better the results are.
00:28:47:15 - 00:29:13:01 Unknown It's easy for developers to specify, because you could specify low-level index parameters that only a data scientist would understand, but an accuracy number is easy: saying, hey, I want this index to be 95% accurate, because this is meant for something like facial recognition at border control. But if it's product recommendations, I might be okay with a lower accuracy in order to make the index construction faster.
00:29:13:01 - 00:29:40:13 Unknown So you can define the target accuracy based on your target use case. Thank you, Tirthankar. That's all for this episode. Please join us for episode two to hear the conclusion of Tirthankar's presentation. Until then, thank you for listening. That wraps up this episode. Thanks for listening and stay tuned for the next Oracle Academy Tech Chat podcast.