DUE TO SOME HEADACHES IN THE PAST, PLEASE NOTE LEGAL CONDITIONS:

David Yakobovitch owns the copyright in and to all content in and transcripts of The HumAIn Podcast, with all rights reserved, as well as his right of publicity.

WHAT YOU’RE WELCOME TO DO: You are welcome to share the below transcript (up to 500 words but not more) in media articles (e.g., The New York Times, LA Times, The Guardian), on your personal website, in a non-commercial article or blog post (e.g., Medium), and/or on a personal social media account for non-commercial purposes, provided that you include attribution to “The HumAIn Podcast” and link back to the humainpodcast.com URL. For the sake of clarity, media outlets with advertising models are permitted to use excerpts from the transcript per the above.

WHAT IS NOT ALLOWED: No one is authorized to copy any portion of the podcast content or use David Yakobovitch’s name, image or likeness for any commercial purpose or use, including without limitation inclusion in any books, e-books, book summaries or synopses, or on a commercial website or social media site (e.g., Facebook, Twitter, Instagram, etc.) that offers or promotes your or another’s products or services. For the sake of clarity, media outlets are permitted to use photos of David Yakobovitch from the media room on humainpodcast.com or (obviously) license photos of David Yakobovitch from Getty Images, etc.

You are listening to the HumAIn Podcast. HumAIn is your first look at the startups and industry titans that are leading and disrupting artificial intelligence, data science, future of work, and developer education. I am your host, David Yakobovitch, and you are listening to HumAIn. If you like this episode, remember to subscribe and leave a review. Now onto our show.  

David Yakobovitch

Welcome back, listeners, to HumAIn, your first look at the technology that is expanding how humans augment their work with machines, developer tools, and the world of data science. Today on the show, I have Edo Liberty, who's the founder and CEO of Pinecone. Edo comes from a tremendous background of working in technology at big tech companies and startups alike, and is releasing the newest version of the Pinecone technology. Edo, thanks so much for joining us on the show.

Edo Liberty

Thank you.

David Yakobovitch

Well, just to start us off, can you share with the listeners a little bit about yourself and what you've built in technology throughout your career that led you to become the founder of Pinecone?

Edo Liberty

Sure. So I spent most of my career being an academic. I did my Ph.D. and postdoc in computer science and applied math, working on high-dimensional geometry and functional analysis, which, though I didn't know it at the time, were kind of the mathematical foundations for machine learning and, really, for deep learning. In some sense, all deep learning models are high-dimensional functions. And so if you learn functional analysis, you end up being really well equipped to deal with machine learning theory, which was a very happy happenstance for me, because that field ended up taking off like a rocket, as everybody knows.

During my postdoc, I started building my first startup, which we later sold to Vizio. That was a real-time video search solution that relied a lot on vector search and visual similarity search, which we'll talk about a lot in this podcast. And then I joined Yahoo as a research director to lead research efforts there on anything that had to do with big data, from ads to spam detection to recommendation engines and search, you name it.

And then I moved to AWS and helped build an organization called Amazon AI, which built a lot of different applications and platforms for machine learning, including SageMaker and all the AI solutions that you see on your AWS console; they came out of that organization. And two and a half years ago, I started Pinecone with the mission of creating the next-generation data platform for vectors. It is a vector database, built to enable search, similarity, and recommendation at scale for everyone.

David Yakobovitch

It's so exciting that you spent a good portion of your career building at preeminent big tech companies. I remember, at the time that you were at Yahoo, they were leading some of the largest research projects. We think today of researchers from Microsoft, AWS, GCP, Azure, and others at all the top ML, data science, NLP, and AI conferences, and often we forget that Yahoo was leading there; they were a core part of the industry, with a tremendous number of engineers all over the world working on that.

And it’s amazing to see that you were there building this big data wave, which of course evolved into the new data science, ML-driven wave that we’ve seen in the last few years, with all that growth, with the cloud, and all these products. So I’m sure you’ve seen a lot of interesting things over the years that informed your decision to become a founder again. 

Edo Liberty

100%. So you're right, people don't remember, going back in time, that Yahoo Labs was an organization started by Prabhakar Raghavan, who literally wrote the book on randomized algorithms, is a Stanford professor, and became one of the top 10 people at Google. I don't know if he's in the top three or top 10, but he built, together with Ron Brachman and others, really one of the best research organizations in the world back then.

And I remember that to get in you basically had to qualify to be a professor at a top-tier university; if you couldn't, you couldn't become a scientist at Yahoo back then. And a lot of the way that people think about machine learning today wasn't there yet. So you're right: if you dig up research papers on machine learning and data science going back 5, 6, 7 years, half of them would have Yahoo authors on them. So yes, there were a lot of really exciting things happening there.

David Yakobovitch

And then after that, as you mentioned, Edo, you spent a few years at AWS and got to see a lot of those core products get built out on the cloud, both from a research perspective and a practical AI perspective. I remember for myself, having worked with enterprise customers at my previous companies, we were building and consulting on data science and ML projects.

And I remember when AWS was just coming out with SageMaker, and we were like, that's experimental. And then suddenly today, in 2021, SageMaker is great and there are really great products there. So I'd love to hear a little bit about your experience on both the research and practical sides when you were at AWS as well.

Edo Liberty 

It's funny how being a scientist, building applications, and building platforms are so different. By analogy, think of some technical achievement as the top of a mountain: a scientist is trying to hike there, trying to be the first person to the summit, right? When you build an application, you have to build a road there, one you can drive up with a car.

And when you're building a platform at AWS, or at Pinecone, you have to build a city there. You have to completely cover it. And so, for me, the experience of building platforms at AWS was transformational, because the way you think about problems is completely different. It's not about proving that something is possible; it is about building the mechanisms that make it possible always, in any circumstance. And that's very different.

David Yakobovitch

It's interesting that all the research you did at Yahoo and AWS informed building Pinecone into what it is today and what it's evolving into. Having looked through your mission statement and seeing Pinecone described as a fully managed vector database, it's so smart and clever, because it comes back to a lot of the competitions we see today, where students solving ML problems at these conferences are constantly tuning, thinking, I want to get that extra 1% improvement using a certain algorithm. But when you really think about it, it all comes back to the data and how you structure that unstructured data; having better-organized search can make all the difference.

Edo Liberty 

There's absolutely nothing wrong with tuning models and getting your accuracy 4% higher or something; that's great. That needs to be done, and there are great tools for it. But like you said, in the end, a lot of it is about the data. In my experience, it's mostly about the data.

When I need to know whether something is spam or not, if I'm, say, a spam classification service, I can run an email or an image I received through a classifier and try to get an answer. But it would be a lot better if I could just ask a database: give me the 20 other emails or images that are most like it, and tell me whether they were spam. That doesn't look like a model; that is a retrieval engine, a search engine. And that's really a vector database.

That's what Pinecone is. A lot of applications that require and work with these vector embeddings and features need this mechanism. And that cuts across not only similarity search for something like spam classification, but also recommendation, semantic search over text, and so on. And the interesting thing is that a lot of the models people train actually convert complex data, whether it be text or images or anything else, into these high-dimensional vectors.

And once you have them in that form, usually called embeddings, you can then feed them into Pinecone and start really acting on them and interacting with them in real time, which makes a whole set of machine learning applications so much more powerful and so much more capable.
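To make that concrete, here is a minimal sketch of the retrieval-style spam workflow Edo describes, written against Pinecone's Python client of roughly this era. The index name, dimension, credentials, and the `embed()` helper are placeholders I've invented for illustration, and exact method signatures may differ between client versions.

```python
import pinecone

# Connect to Pinecone (placeholder credentials and region).
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# Create an index sized to match the embedding model's output dimension.
pinecone.create_index("spam-demo", dimension=384, metric="cosine")
index = pinecone.Index("spam-demo")

# Upsert previously labeled emails as (id, embedding, metadata) tuples.
# `embed()` stands in for whatever model produces the embeddings.
index.upsert(vectors=[
    ("email-1", embed("WIN A FREE CRUISE!!!"), {"label": "spam"}),
    ("email-2", embed("Agenda for Monday's meeting"), {"label": "not_spam"}),
])

# For a new email, retrieve the 20 most similar stored emails and look at
# their labels, instead of running a classifier.
results = index.query(vector=embed("You won a prize, click here"),
                      top_k=20, include_metadata=True)
for match in results.matches:
    print(match.id, match.score, match.metadata["label"])
```

The point of the sketch is the shape of the workflow: embed, upsert, then answer questions by nearest-neighbor retrieval rather than by a single classifier call.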

David Yakobovitch

When you think of the layperson, right, they see an image classified as spam or ham, spam or not spam, or dog or cat, and they think it's magic, but there is underlying mathematics performing these calculations. And today we're moving beyond recommendations. As you've described in your research, we're moving beyond the leaps in recommendation work from companies like AWS, Yahoo, Google, Spotify, Facebook, and Pinterest. Now, this change is moving toward the use of vectors and vector similarity search. Can you unpack for the audience why you think we're at this inflection point today?

Edo Liberty 

The inflection point is not a change in technology, but rather a change in the depth of investment that companies are making into building those solutions. So let me explain: the companies that you named already use vector search and vector similarity, and already built in-house solutions to do the kind of recommendation that you experience as a user, and they do a very good job at it. What they have under the hood is a technology that most companies don't have. That's what Pinecone is.

And so up until very recently, other companies had two options: either they had to invest and go all in with a dedicated engineering team just to build the infrastructure to do advanced recommendation, or they would go and buy some off-the-shelf retail recommendation product, right? One that does everything for them, so they really don't have to deal with any of this.

And today, I see this very strong motion by a lot of technology-savvy companies, whether they be retailers or text-based search solutions or analytics companies and so on, to really want to build these things in-house, to say, hey, we hired a handful of scientists, we know how to train models already, we have great tools for it, can we do more than just buy some off-the-shelf thing?

And the answer is almost always yes. And it doesn't even take a whole lot of energy to do that. With something like Pinecone as the backend, it doesn't even take a lot of energy to go to production. And so they find that they don't need those black-box solutions: if they want to create a Pinterest-like experience, they don't have to build Pinterest from scratch. They can train a few models, use Pinecone, and go to production in a few weeks, not a few years.

David Yakobovitch

And building any of those networks and systems involves a lot of data. We've seen the rise of data augmentation, data labeling, data generation, synthetic data design; there's data everywhere, and a lot of it's not standardized. It sounds to me like vectors are opening up a new way to retrieve that data, known as similarity. Can you talk to us more about some of the similarity use cases you've seen with Pinecone?

Edo Liberty 

I'll use an example. A very common technique for what's called question answering is the following. If I'm going to get a question on my website, say, or from a user of my app, and I want to give them the most appropriate answer, the conceptually easy thing to do is to say: I'll just use my FAQ and look for the most similar questions I've gotten before, and if I have a very similar question, I'll give back that question's answer. That ends up being pretty efficient. But how do you search for other questions that mean the same thing?

Do they contain the same words? Maybe, maybe not, and then you go into this very complicated discussion of how do I parse the language, how do I measure similarity, and so on. And today, with machine learning, you don't really have to do any of that. You have pre-trained NLP models that convert a string, say a sentence in English, into an embedding, a high-dimensional vector, such that the distance or the angle between two such vectors is analogous to the conceptual, semantic similarity between the sentences. And so I can convert my texts into high-dimensional vectors, and I can search my database not for similar sentences in English, but rather for vectors that are highly correlated with the vector generated for the query.

And that ends up being a) a lot easier and b) a lot more accurate, because now you don't do text search, you don't have to create super elaborate Elasticsearch queries; you just use machine learning for it. So that is maybe an example of how you translate from the real world to a similarity, and similarity in that space is measured either by angle, or by distance, or by the dot product, which is the sum of the products of the corresponding coordinates. But in the end, those are just mathematical constructs that are simulating what you really want to happen in real life. That's the beauty of machine learning.
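As a toy illustration of those three measures, here is a small NumPy sketch; the two three-dimensional vectors are made up and simply stand in for real embeddings of two questions that mean roughly the same thing.

```python
import numpy as np

# Pretend these are embeddings of two semantically similar questions.
q1 = np.array([0.9, 0.1, 0.3])
q2 = np.array([0.8, 0.2, 0.4])

# Dot product: sum of the products of corresponding coordinates.
dot = float(np.dot(q1, q2))

# Euclidean distance: smaller means more similar.
distance = float(np.linalg.norm(q1 - q2))

# Cosine similarity: the dot product of the normalized vectors, i.e. the
# cosine of the angle between them (closer to 1 means more similar).
cosine = dot / (np.linalg.norm(q1) * np.linalg.norm(q2))

print(f"dot={dot:.3f}  distance={distance:.3f}  cosine={cosine:.3f}")
```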

David Yakobovitch

I remember those days in math classes in high school and college, right? We were looking at cosine similarity, geometry, and trigonometry; it's all vectors.

Edo Liberty 

That is what machine learning does. That's the beauty of it. You can train models that map sentences that mean roughly the same thing to vectors that are highly correlated with each other in the vector representation. And if you train that model well, and you're successful, then you have a vector representation that's very useful, because now two vectors that are aligned correspond to two sentences that mean roughly the same thing. And so now you can search with that.

David Yakobovitch

And so this kind of search is helping with recommendations, with lookalike audiences, and with letting businesses go deeper and create results that previously would have been very challenging to build; as you describe, companies would have had to hire full teams. I know that today at SingleStore, I spoke to one of our leaders who previously worked at an enterprise analytics company, and we were talking about how they were building a search use case specifically for the finance industry.

At one point, they had a team of 30 engineers, and all they were doing, I kid you not, was the labeling, the design, the understanding, the modeling, the algorithms, just to get one use case for one industry right. There was so much work. But it sounds like there are better or newer solutions being repurposed, such as vector similarity search, which today mean you don't need a team of 30 algorithm developers in-house if you're looking to build some of these solutions.

Edo Liberty 

100%. And we see this across the industry as AI and machine learning mature: MLOps, model training itself, deep learning networks, and all that stuff. They're all great tools and great shortcuts, and Pinecone is one of them. It's part of an arsenal of tools that turn what used to be a 30-person, multi-year project into a three-person, one-quarter project.

David Yakobovitch

And talking about technologies, there are always more tools coming out. There are open-source projects, there are functions like you described in MLOps and DevOps. Among all these new AI and ML platforms, where do you see Pinecone fitting in? You mentioned it's a database, you mentioned it works with vector similarity search; how is it a good solution that companies should be considering today?

Edo Liberty 

Well, if they are experiencing the pain, if they see that they have search problems or recommendation problems over large amounts of data, then they will know. It's very easy to know that you have a problem and you're in pain, right? And so yes, if they're trying to search or recommend over a very large collection of objects, and they're trying to use machine learning for it, whether for semantic search, recommendation, shopping, or any of those use cases, Pinecone almost always ends up being a lot easier, a lot faster, a lot more production-ready, and a lot more functional than what they would build in-house.

We've spent two and a half years now baking a lot of really great features into Pinecone, and we've just launched version 2.0, which contains all sorts of filtering capabilities and cost reduction measures, you name it. We've been hard at it for a long time, and we've managed to build something really compelling that very few companies, if any, could really pull off.

David Yakobovitch

It's very exciting, because whether you're a startup or a Fortune 500 company, chances are you do not want to build a team of 30 algorithm developers, AI specialists, ML engineers, data scientists, and data engineers all together. You need millions of dollars to build that whole team, develop your models, choose your models, choose your algorithms, and build and maintain infrastructure. It's a lot of moving parts, and it's not always practical.

Edo Liberty 

We see this even when we work with huge companies worth many billions of dollars in market cap. They could easily mobilize 30 headcount, but they just don't, because it's not their core value. And today, with the hiring market being what it is, they're like, we can't even hire 30 engineers for this right now; I would rather mobilize the engineers we have on our core value and our core product, and delegate everything that we don't have to build to somebody else who knows how to build it. And so you're right. Some companies can afford it. But amazingly, a lot of companies that can afford it choose not to, because they have better ways to use their capital.

David Yakobovitch

And there are different ways to solve these problems. Some of them, of course, use Pinecone. But beyond that, when we think about each one of these, right: choose or develop models, choose algorithms, build and maintain infrastructure, we can unpack all three of them. So the first one, choosing or developing models: what can a company do today if they want to choose or develop models?

Edo Liberty 

So, look, I'm a data scientist by training. And so for me, when companies just pick up something random from the web and hope for it to perform, I get the heebie-jeebies.

David Yakobovitch

That code in that Medium article better work.

Edo Liberty 

Exactly: I found some notebook by some guys somewhere, and for some reason I have high hopes for it. It's good for a prototype, it's good for getting up to speed on how do I do this in-house. But one thing customers oftentimes ask me is, why not use some open-source or pre-baked thing? And I say: because your data is yours. You know your customers, your application, and your business logic 1000 times better than anybody else, and there's absolutely no reason for you to believe that they will do a better job than you could.

And so, I'm a great believer in knowing your own data, knowing your own customers, and training your own models. That doesn't mean you have to train them from scratch, and it doesn't mean you shouldn't use the right tools for it; you don't have to reinvent the wheel. But I'm not a big believer in completely pre-trained models plucked off of a random place on the internet. I do want to say that there are great models for just feature engineering, for objects that don't change so much.

So we have language models like BERT and so on that transform text and create great embeddings, and they're a good starting point. I don't think you should stop there, but they're a great starting point, and you should take advantage of that. Same thing for images and audio and so on. So yes, by all means, use the right tools and use what's available, but don't buy into the hype; you're still going to have to think about it and train some stuff, in my opinion.
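As an example of using a pre-trained model as that starting point, here is a minimal sketch with the open-source sentence-transformers library; the model name is just one commonly used example, the sentences are invented, and in practice you would likely fine-tune on your own data, as Edo suggests.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small, general-purpose pre-trained text embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials, what should I do?",
    "What is your refund policy?",
]

# Encode sentences into fixed-size vectors (embeddings).
embeddings = model.encode(sentences)

# Normalize and compare: semantically similar sentences should score higher.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normed @ normed.T
print(np.round(similarity, 2))
```

The first two sentences should come out more similar to each other than either is to the third, which is exactly the property a vector database then exploits at scale.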

David Yakobovitch

Absolutely. You can even go to Papers with Code, you can go to GitHub, you can grab that code, but definitely check when it was last updated. Was it last updated five years ago?

Edo Liberty 

Be careful. Yes.

David Yakobovitch

Be careful, always. Okay. So, that was number one to solve for, right: choosing or developing models. The second challenge that you mentioned is about choosing algorithms. There are dozens of options, and they all have different parameters and hyperparameters, different data sets and use cases. Like you said, Edo, in your company you know the business logic best, especially if you've worked in that industry for 10, 20, 30 years. So if you're trying to solve for choosing algorithms, what should we look into? Should it be experimentation, benchmarking, or whatever you find important there?

Edo Liberty 

So what do you mean by choosing algorithms? Whether you choose a random forest or deep learning or stuff like that?

David Yakobovitch

The big thing that a business stakeholder will ask a data scientist is: why did you choose a random forest? Why didn't we go with gradient boosting?

Edo Liberty 

I always love those discussions. Look, in the end, as a developer or as a business driver, you have to really care about the results. And so I really think you should experiment with whatever you can. I do want to say that you should keep it simple; machine learning has come a long way, and some fairly simple things work pretty well. And so yes, you can look at the latest and greatest, whatever, transfer reinforcement learning library from Alpha Centauri.

But odds are that a better decision tree would do great, or some linear classifier would probably get you most of the way there. And so if I had to say one thing, it's: experiment with what you can and choose the best thing, but make sure that you have some pretty basic solutions in there, and 9 out of 10 times you'll be surprised at how well they do.
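To illustrate the "keep a simple baseline" point, here is a small scikit-learn sketch on synthetic data that compares a plain logistic regression against gradient boosting; the dataset and both models are illustrative only, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real tabular dataset.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=0)

# A simple linear baseline and a fancier ensemble model.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Cross-validated accuracy: the simple baseline often lands surprisingly close.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```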

David Yakobovitch

It's great to really think about that. Number two, choosing algorithms: you're right, you can have your Swiss Army knife of algorithms and you'll get pretty far. But even once you choose or develop the models and choose the algorithms, number three is the sticky one that a lot of companies struggle with, and we saw it in the pandemic: building and maintaining infrastructure.

Having a DevOps, a CloudOps, a cloud infrastructure team; there's so much going on today, from distributed computing to high availability of resources, consistency, and systems going down. We were talking before the show about a major service that went down and impacted the web. We hear these stories every few months, and it's surprising how one service can take so much down. So there's a lot to solve for in building and maintaining infrastructure. What's your take on that area?

Edo Liberty 

This is the reason for something like Pinecone to exist. It's hard, it's really hard to maintain infrastructure; it's hard to optimize how it works, it's hard to keep it up. And it's never-ending work; it's not something that you build and are done with. A live service needs updates, it needs maintenance, and stuff happens. And so I encourage companies that find good infrastructure to use it. Building in-house should be taken very seriously, and I would argue against building anything that's outside of your core value proposition.

If this is not who you are as a company, then don't invest in it, because you'll find yourself three years later and 20 headcount in, asking, why are we doing all of this? I speak with customers who are in that spot, who say, hey, I invested a ton of energy into this thing, and it kind of roughly works, but I just don't want to maintain it anymore. I don't want to keep building it. I have a feature request list that will keep me busy for five years, and I just don't want to do that. And so you definitely should look at the right infrastructure for whatever it is you're trying to do.

David Yakobovitch

That’s right. And it’s always evolving, but it comes down to the old adage, do I really want to fork this branch and maintain my own version? Right?

Edo Liberty 

It's hard. Operations is one of those things people know is hard, but they don't know how hard it is until they try it. It's kind of like parenting: you hear it's hard, but then you have a child and you're like, oh, shit, this is way harder than people told me. So yes, it's one of those things.

David Yakobovitch

That makes a lot of sense. And thinking through all of this, these are a lot of the reasons why you built and have scaled Pinecone, and you've come out with version 2.0. Can you share more? Tell us about the updated product and the new features. Why are you excited for the new version?

Edo Liberty 

We've been very busy improving Pinecone significantly; it's improved in 1000 different ways, from a new console to new clients and Go, Rust, and curl APIs, and so on. But there are two things that I'm most excited about. The first is filtering. It's not enough to find similarity: if you're a retailer, you have to find similarities between the items you want to recommend, but it's really important for you that you recommend them only if they're in stock, or if the price is above or below some threshold.

If you're doing similarity search, you want to make sure that you're searching the right corpus of data, or maybe just one category, and so on. So being able to do both, to fuse together traditional NoSQL-type behavior and search capabilities with similarity search and vector search in the same index, is something that took us a very long time to get right. We're very excited to be able to launch it now, and it's already available.
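As a sketch of what that fused query might look like, reusing the `index` object and the `embed()` stand-in from the earlier example: the field names and thresholds are invented for illustration, and the exact filter syntax may vary by client version.

```python
# Recommend items similar to what the shopper is viewing, but only those that
# are in stock, under a price ceiling, and in the right category.
results = index.query(
    vector=embed("running shoes for trail use"),   # `embed()` is a stand-in
    top_k=10,
    include_metadata=True,
    filter={
        "in_stock": {"$eq": True},
        "price": {"$lte": 120},
        "category": {"$in": ["footwear"]},
    },
)
for match in results.matches:
    print(match.id, round(match.score, 3), match.metadata)
```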

The second thing is cost. People told us, hey, this is really performant, but we don't really need this to come back in 20 milliseconds; we're happy for it to take 300 or 400 milliseconds or even a second. I don't really care if the search takes that long, I just want it to be a lot cheaper.

And so we actually spent a very significant amount of energy building a hybrid index that is mostly on disk and that cuts costs by 10x. And so now we're coming out with a new pricing model that people don't see right now, but by the time this airs, it will already be public. Our costs are going to be 10 times lower than what they used to be, so you will be able to achieve 10x as much for the same amount of money with Pinecone as you could up until today. Very exciting for us, and obviously exciting for our customers.

David Yakobovitch

Minimizing the total cost of ownership is critical, and I can echo your statements on hybrid systems. At SingleStore, where I am, we also work hybrid, and it's so important; we've heard that from customers over the years. If you do everything in memory, the cost goes up pretty quickly. To hear that your team has built that architecture on disk, I mean, that is incredible; there are so few systems out there today that are hybrid. So hats off to you and the engineering team for getting that functional.

Edo Liberty 

Thank you. Appreciate it. 

David Yakobovitch

And thinking more about the company, I'd love to hear: tell us about Pinecone. Where are you based? Are you in the States? Are you a global company?

Edo Liberty 

We are an American company based mostly out of the Bay Area and New York. We do also have a Tel Aviv office and that’s where we are. 

David Yakobovitch

Startup Nation, Startup Alley, and Silicon Valley. That's great.

Edo Liberty 

All three places that I call home.

David Yakobovitch

I love them. And looking forward, what trends or predictions do you see, either for the continued product evolution of Pinecone in the next few quarters, if you can tease that out for the audience, or more broadly for the industry around vector similarity search?

Edo Liberty 

So you'll see two things. First of all, with Pinecone specifically, we're focused on really only two things: making it easy to use and get value out of Pinecone, and making it cheaper. That's it. Those are the only two things we care about. If you can get a ton of value out of it and it doesn't cost you too much, you're a happy customer and we're happy to get you there. So that pretty much sums up all of our focus.

In terms of the industry, you'll see a lot more happening around vector representations of data, and around people working with unstructured data and transacting with it. People have had unstructured data forever, like images and text and so on, but it's mostly just kept in files and stored. It's not really searched through or transacted against; there's no SQL over images.

Well, now you do have it; it's spoken in the language of vectors. And so you will see a rise in awareness, in tools around it, in ways to convert data into that format, and in the vector database becoming a de facto standard piece of infrastructure that every company has, whether or not it's at the core of what they do.

Maybe they're a search engine and that's how they work, or maybe they're a shop, a retail site, and that's just how their recommendation carousel works; whether it's in their spam classification or in their text semantic search, it will be somewhere in almost every company, very soon.

David Yakobovitch

Edo Liberty, the founder, and CEO of Pinecone. Thanks so much for joining us today on HumAin. 

Edo Liberty 

Thank you so much. It’s a pleasure.

David Yakobovitch

Thank you for listening to this episode of the HumAIn podcast. Did the episode measure up to your thoughts on ML and AI, data science, developer tools, and technical education? Share your thoughts with me at humainpodcast.com/contact. Remember to share this episode with a friend, subscribe and leave a review, and listen for more episodes of HumAIn.