Welcome to our newest season of HumAIn podcast in 2021. HumAIn is your first look at the startups and industry titans that are leading and disrupting ML and AI, data science, developer tools, and technical education. I am your host, David Yakobovitch, and this is HumAIn. If you like this episode, remember to subscribe and leave a review. Now, on to our show.
Welcome back, listeners, to the HumAIn podcast. Today, I bring to you Dave Bechberger, who is a Senior Graph Architect at Amazon Web Services, known as AWS. He is also a co-author of Graph Databases In Action, by Manning Publications.
Dave, thanks so much for joining us on the show.
Thanks for having me.
I love graph databases, because the work I do every single day is around understanding knowledge and the relationships of that, whether from traditional data or more unstructured data. And I’m so looking forward to our conversation, because we’re going to get into a lot of these topics about graphs, graph databases and what connected data means for our listeners. But let’s start out from a high level, Dave. Can you tell us a little bit about what you do in the industry?
I’ve actually been in the software development industry for a little over 20 years now. I started out my career doing full stack software development, but about four or five years ago, I got really into working with graph data and graph databases, specifically around how to use them within enterprise or corporate environments to help solve certain types of problems that traditional relational databases or other data technologies are not very good at solving.
And using that to build out high-performance data platforms on top of, usually, a mix of technologies, but really focused around solving them with graph database technologies.
It’s definitely become a passion of mine. I wrote a book about it, so obviously I enjoy the topic. But I really fell into it. It was one of those technologies that clicked with me when I started using it, and it really started to make sense to use it to solve certain types of problems very efficiently.
Now, when we talk about this term graph, I think back to my days of Algebra II in high school, where I’m drawing X, Y, and connecting dots and vectors together. Can you break down for our audience what is a graph? What is connected data?
When we’re talking about graphs, we’re not talking about bar charts, line charts, things of that nature. What we’re really talking about is the mathematical construct of a graph, but don’t let that scare you. Because really, when you break it down, it’s really about networks, connected data of different people connected to other people or trucks connected to other trucks or things of that nature. It’s really about building out those networks.
The canonical example of a graph would be something like a social network. Think about something like Facebook or Twitter and how people are connected to other people that are connected to other people. When we’re talking about graphs, we’re talking about that sort of web of interconnected entities and items, and using those connections to answer specific types of questions and draw certain types of insight and information out of that data that isn’t necessarily easily available from other technologies.
And when we think about graphs, how are graphs connected to data and how are they being used today?
The common use case that comes to my mind, of course, is the pandemic and tracking COVID-19. Where are the different infections and the movement of this virus, but there’s so much more than that. Dave, you just shared about the social networks. Our time spent on Facebook and TikTok and Twitter and messaging and integrating with each other. But where are you seeing additional use cases of graphs and connected data today?
In addition to social networks, obviously contact tracing is actually a really good example of how you can use those connected data to be able to go back and find a super spreader event and that sort of thing. But from a corporate environment, I’ve seen people using it in enterprises today, it really comes down to a couple of different areas. But really, a lot of them are focused on being able to work with what’s known as a knowledge graph inside most enterprises.
So, especially over the past two years or so, the concept of being able to build enterprise knowledge graphs has really bubbled up. And what an enterprise knowledge graph is, depends on who you’re talking to, but really, you can think about it as ways to connect disparate data silos inside of an organization together, in a way that basically allows you to see a holistic view of information, and where information comes from inside of a company.
So, a great example like this, let’s say, you’re working with something that’s maybe an e-commerce site. If you have an e-commerce site, maybe you have one system that’s basically running your web page and tracking your clicks and how people are moving through your site. And you have another system that’s basically tracking orders, and another system that’s in charge of handling distribution inside your warehouse. And another one that’s in charge of storing data for customers that are complaining about things or questions coming in on forms, things of that nature about your thing.
With all these different disparate data silos inside your organization, if a customer calls in, how are customer service agents, for example, supposed to be able to actually see the entirety of that customer’s path through the different parts and pieces of the organization? What purchases have they made? What is the current state of the shipment? What issues have they had with it in the past?
An enterprise knowledge graph works by connecting those different pieces together in such a way that you can easily get a holistic view of that sort of entity, and that entity could be a person. It could be a truck, because maybe you want to track different things about it: how maintenance was performed, what routes it went on, things of that nature. So enterprise is definitely one area. There are a couple of other areas we definitely see quite frequently, one of them being fraud. Fraud is kind of another canonical use case, because it is all about figuring out connections and patterns within data to be able to discern whether this activity is fraudulent or not.
Another one we’ve actually seen quite frequently lately is this concept of an identity graph. Being able to take something like data from click stream analysis, inside different parts of your site and be able to identify down, into a certain subset, that this person is likely also this person. And that’s usually used to power something like a recommendation engine, because now that I know what you’ve been looking at and that this mobile device and this web browser here, and this web browser here are all likely the same person, I can personalize the recommendations specifically to that customer the next time we see them come in.
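As a rough illustration of the identity graph idea described above, the sketch below groups identifiers that are linked to one another, directly or indirectly, using a union-find structure. The identifiers, the `links` list, and the function names are all made up for the example; a real identity graph would use probabilistic matching over clickstream data rather than exact links.

```python
def find(parent, x):
    # Follow parent pointers to the root, shortening the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    # Merge the groups containing a and b.
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def resolve_identities(links):
    """Group identifiers that are linked, directly or indirectly."""
    parent = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        union(parent, a, b)
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())


# Hypothetical observations: a device or browser seen with a login.
links = [
    ("phone-1", "login:alice"),
    ("laptop-browser-7", "login:alice"),
    ("tablet-3", "login:bob"),
]
identities = resolve_identities(links)
```

Here the phone and the laptop browser end up in the same group because both were seen with the same login, which is the kind of "this person is likely also this person" conclusion a recommendation engine could then build on.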
I’m a big fan of identity graphs. This year I was getting my renewal for the global entry trusted traveler program in the US and I remember after I started searching to make sure I have the renewal with the TSA suddenly I started getting these recommendation searches.
Would you like to buy this additional suitcase? Would you like these additional security features to help you with your trust and your privilege? So it’s amazing how everything goes hand in hand. And thinking about identity and fraud and knowledge, graphs are all about security and trust, but why are we spending so much effort on graph technology? Why aren’t we using other technology for these relationships? Why do you think graphs is where it’s at?
It comes down to the fact that other technologies don’t do a great job of being able to link together entities in such a way that those links and those connections are also treated as first-class citizens inside that data. Because if you think about something like a social network, a very simple thing, the fact that you’re connected to someone else, and that person is connected to someone else, it doesn’t necessarily matter as much.
The entities that are connected together are important, but the connections between them are really what starts to drive a lot of the insight on how you can use that data to be able to do things like recommend another friend. Essentially recommend who you might want to be friends with, based on the fact that you’re connected to different people who are connected to different people that aren’t necessarily connected to you, that sort of thing. Graphs bring those connections in your data up to being first-class citizens. That’s the term I like to use.
Those connections are as important or more important than the entities they connect. So, with a graph, those connections are brought up and given first-class status in the languages and queries that you run. Graphs definitely have a different query language than something like a relational database, but those query languages treat those connections in such a way that it’s very easy to move across them. In graph terms, that’s called traversing them: moving across them to drive insight from how those connections are made and how they connect this network of data together.
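To make the friend recommendation example concrete, here is a minimal sketch of a two-hop traversal over a small in-memory graph. The names and the `friends` data are invented, and a real graph database would express this in its own query language rather than in Python.

```python
from collections import Counter

# Hypothetical social graph: each person maps to the set of their friends.
friends = {
    "ana": {"ben", "cal"},
    "ben": {"ana", "dee"},
    "cal": {"ana", "dee", "eli"},
    "dee": {"ben", "cal"},
    "eli": {"cal"},
}


def recommend_friends(graph, person):
    """Rank people two hops away by how many mutual friends they share."""
    direct = graph[person]
    mutual_counts = Counter()
    for friend in direct:                 # first hop
        for candidate in graph[friend]:   # second hop
            if candidate != person and candidate not in direct:
                mutual_counts[candidate] += 1
    return mutual_counts.most_common()


recommendations = recommend_friends(friends, "ana")
```

The ranking comes entirely from the connections: a candidate reachable through more of your existing friends scores higher than one reachable through only one of them.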
Now, being someone in the database industry with analytics and data science, a lot of our listeners are always wondering about graphs. I’ve heard of something called GraphQL. Dave, is GraphQL graphs, or can you demystify this for us as well?
I come from a full stack development background and I love GraphQL for what it’s meant for me. But it is really not about graphs themselves. I like to think of GraphQL as a query language for your API. That’s actually even the tagline that GraphQL uses.
It isn’t so much about being able to traverse those connections between your data as much as it is about being able to take, essentially, what in traditional terms would be a restful response, but being able to specify how you want to get the return data back from it. So, it’s not so much traversing graphs as it is being able to shape the results of a restful response coming back to you.
But that being said, there’s definitely some overlapping concepts between them. Not only is the name graph in GraphQL, but also, GraphQL does allow you to specify, “I want to get a parent-child relationship” and things of that nature. So it’s one of those areas where there’s a little bit of overlap between graphs and GraphQL.
And there are certain databases that actually use GraphQL, or, well, extensions to GraphQL, as their query language. As a listener, I don’t think you should conflate the two in such a way that you think graphs and GraphQL are the same thing. They’re complementary technologies meant to solve different problems.
This makes sense. Today, a lot of real-time analytics platforms, including the work that we’re doing at SingleStore, are about accelerating data to be faster, accelerating full-stack systems to be near real time or in real time. And to add to that, there are so many different data techniques and data structures that are critical for every workload.
And so it seems that we’re discussing today about graphs and how they’re in use and how they’re powerful for large enterprises across every industry being sector agnostic. Thinking about connected data, thinking about these enterprises. What are you seeing as some of the emerging trends or the newer use cases for graphs and connected data?
I wouldn’t even say these are necessarily new and emerging, but the ones I’ve seen bubble to the top recently are some of the ones we’ve talked about, especially around enterprise knowledge graphs, identity graphs, things of that nature, customer 360 type solutions.
Those have been around for a while, but what I’ve seen bubble up is being able to not only process those in a real-time transactional mode, but also use them along with graph analytics, and then use that in conjunction with AI and ML technologies to augment data back into your graph. Using this trifecta of techniques, you can provide the real-time, highly personalized recommendation engines and user experiences that customers basically require of modern-day applications.
And now, if we pivot this conversation from the enterprise perspective back to consumers like you and I, who are on our smartphones, our mobile devices and our personal machines, connected data does impact us every day. We talked at the onset of this conversation about contact tracing, about COVID and the vaccines and the variants. We’re building these new healthcare systems that are going to be immunity passport-first or confirming your last COVID testing date, and it seems that graphs are going to be at the heart of this as well. Do you think that perhaps COVID and the pandemic have been a reemergence for graphs, to say, this is a great data structure that we should be using for more use cases?
Much like remote work from home, I don’t know that it’s necessarily because of COVID, but that realization has been accelerated due to the nature of that problem. And, as we mentioned earlier, something like contact tracing is very much a network type problem. You’re trying to find out how two people are connected inside this web of different contacts that they have. And people have started to see that and started to think about, okay, how can I use this to solve other problems, not only inside my enterprise, but also within my consumer application, my website, things like this?
How can this technology be used to help drive a better customer experience? Because at the end of the day, any enterprise build or any consumer service build is really about creating a better, faster and easier to use experience for your customers, whoever those customers are, internal or external, because those are really the driving forces behind any kind of business initiative.
So, I definitely have seen graphs as one of those technologies that’s going to be key to being able to do things like I was mentioning earlier with the enterprise knowledge graph, being able to link those disparate data sources together in such a way that you can get a holistic view of a customer, and use that holistic view to be able to better recommend products to that end user, things of that nature.
Now, from these classic cases, we’re diving into the more emerging trends. Of course, the 2020s is the decade where AI goes mainstream. A lot of organizations and policies and countries are talking about Vision 2030. At the onset of 2020, only 15% of organizations and countries are AI-first, but by 2030, almost 75% of companies and organizations will be building products and infrastructure augmented by AI, and graphs do have a relationship with AI. What is that relationship that you’ve seen, Dave?
So there’s a couple of very specific relationships that I’ve seen work with graphs and AI. One of the things that was mentioned briefly earlier is that there are definitely certain types of analytics that can be run on top of graphs that are very helpful as inputs into machine learning algorithms of different types.
So, for examples of some of those, especially if you’re working in a fraud area or something like that, you really want to find groups of people that are similar to each other. So you may want to run a clustering type algorithm like Louvain, or a weakly connected components type algorithm, on top of your graph of connected data.
You do that in order to find out that all of these people, person A, B, Z, and Q, are all very tightly connected together. And that may be a very predictive feature for a fraud-based machine learning model. Now, another common use case that we see is things along the lines of embeddings, or being able to create a very simple structure that helps define a network of people around a certain person in the graph.
So, if you want to think about something like a fraud use case, being able to see how closely somebody is connected to other known fraudsters is probably going to be very helpful inside of your machine learning algorithm. If somebody is directly connected to the same address as somebody who’s a known fraudster, you’re probably going to risk that person very differently than you would somebody that isn’t connected to any known fraudulent activity. So you’re using that connected data structure as inputs into your machine learning models.
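One way to turn "how closely is somebody connected to a known fraudster" into a model input is a hop count. The sketch below is only an illustration of that idea, using a breadth-first search that starts from every flagged node at once; the accounts, address, and phone number are invented, and the resulting number is a candidate feature, not a fraud score.

```python
from collections import deque


def hops_to_nearest_flagged(graph, flagged):
    """Multi-source BFS: hop count from each node to the nearest flagged node.

    Nodes that cannot reach any flagged node are absent from the result.
    """
    distance = {node: 0 for node in flagged}
    queue = deque(flagged)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical graph linking accounts through shared addresses and phones.
graph = {
    "acct:known_fraudster": {"addr:12 Elm St"},
    "addr:12 Elm St": {"acct:known_fraudster", "acct:applicant_1"},
    "acct:applicant_1": {"addr:12 Elm St", "phone:555-0100"},
    "phone:555-0100": {"acct:applicant_1", "acct:applicant_2"},
    "acct:applicant_2": {"phone:555-0100"},
    "acct:applicant_3": set(),
}
hops = hops_to_nearest_flagged(graph, ["acct:known_fraudster"])
```

In this toy data, the applicant who shares an address with the flagged account is two hops away, the one who shares only a phone with that applicant is four hops away, and the unconnected applicant has no entry at all.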
And then the last one, which is one of the very hot topics in machine learning in general and graph-based machine learning specifically, is this concept of a graph neural network, which is basically a neural network that, instead of taking only vector features as input, actually takes in a graph itself. So, graph as an input to be able to create predictive models on the output. The irony, I always thought, about neural networks in general is: under the covers, what is a neural network but a graph of different connected objects that it builds inside the algorithm itself as it’s training and learning?
But they weren’t able to actually take in graphs as features. So you ended up having to embed the features in your graph into a vector or something like that. But graph neural networks actually allow you to use graph inputs for predictive modeling. And they’re very useful in certain scenarios where the connected nature of the data that you’re working with is a very predictive measure. Take the example I was just giving a minute ago for fraud: being connected within one or two connections to a known fraudster is most likely a predictive measure, so you might want to risk that person a little bit differently on a fraud score than somebody who doesn’t have any known fraudulent activity within five hops of them or something like that.
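The core mechanism that lets a graph neural network use the graph itself is usually described as message passing: each node updates its features from its neighbors’ features. The sketch below shows one heavily simplified round of that, assuming plain mean aggregation with no learned weights or nonlinearity, so it is not a trainable model. The three-node graph and the feature meanings are made up.

```python
def message_passing_step(graph, features):
    """One round of mean aggregation over each node and its neighbors.

    Real GNN layers also apply learned weights and a nonlinearity,
    which this sketch leaves out.
    """
    updated = {}
    for node, neighbors in graph.items():
        vectors = [features[node]] + [features[n] for n in neighbors]
        updated[node] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return updated


# Hypothetical chain a - b - c; features are [is_flagged, is_new_account].
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 1.0]}

one_hop = message_passing_step(graph, features)
two_hop = message_passing_step(graph, one_hop)
```

After one round, node `c` still has no trace of the flag on `a`, because `a` is two hops away. After a second round, some of that signal has reached `c`, which is how stacking rounds lets information from several hops out influence a node’s representation.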
When we’re thinking about the growing financial domain with neobanks and underbanked consumers, there’s a lot of opportunities to build KYC and AML correctly from the start. And that’s looking at these use cases that you’re talking about, Dave. Are you checking all the right boxes and ensuring that everything is compliant?
It sounds like compliance in the financial services industry is a core use case for graphs and why that’s been growing so much. In fact, I know in your book that you published with Manning Publications, in chapters two and seven, you dive very deep into graphs and their relationships, especially with AI. What motivated you to build and launch this book to dive deep into graphs and graph neural networks?
What motivated me was, when I started working with graphs about four or five years ago, there were a lot of hello world type tutorials out there. And there were a lot of very complex academic types of papers on how graphs can be used for these sorts of things. But what wasn’t really there was the middle of the road of, okay, I understand a little bit about graphs and I can write a hello world, but I need to go now to actually build a real application against this information.
There wasn’t a lot in that sort of domain. So my background is, I was a full stack software developer and I’d used a few NoSQL databases before jumping into graphs. But basically, most of my background was all in building on top of relational databases like Postgres, Microsoft SQL Server, or Oracle, building with that very table and column type mindset.
What I wanted this book to be was really the book that I wish I had when I was starting. It was kind of a write-a-book-to-solve-your-own-problems scenario. So the thought behind the entirety of the book was really, how do you take a relational developer and get them to start thinking about problems as if they are a graph, and what do they really need to know in order to get started?
Because if you start really diving deep into it, you get very quickly into all of this very complicated math. As for my own schooling, I was an electrical engineer. I’d never even taken a discrete systems class. I didn’t have that formalized math background.
So that was very intimidating. And to really be able to build graph-based stack applications or applications on top of graph databases, you don’t necessarily need to have all of that very academic understanding. There’s a lot of ways, because graphs are so much around us in our real life every day, we were already using a lot of these concepts and thinking about problems in these sorts of ways. And being able to condense that down into a system that helps people start to think about problems that way was really what I was hoping to achieve with this book. And hopefully I did achieve that.
That’s something I’ve definitely seen through the previews and through the chapters I’ve looked into myself. As someone who started out as an actuarial scientist and then got involved in computer science and discrete math, I can tell you, learning the material takes a lot of effort, a lot of hard work to pick it up. And then you see these amazing packages in programming languages like Python, Java, and others where, hey, here’s the graph package. And guess what?
If you know the math, that’s fantastic. You can research, you can build up the system, design something proprietary from the ground up. Or alternatively, you can use some of the best practices from open source and closed source and scale from there. So, we think often in software engineering, as well as in data science, as a service about build versus buy. Sometimes buying the technology is the best option to scale your product quicker.
That comes down to what is the end business driver for what you’re trying to solve. There’s definitely a certain group of people that are really trying to solve problems like: okay, how do I optimize a shortest path algorithm using A* or Dijkstra? However, for most people, that’s not really the problem they’re trying to solve. How do I deliver this package to my customer faster? How do I route this better? Those sorts of questions are really the business drivers. So, we can use the work that’s done by those people that are focused on the academic aspects (well, they’re not all academic; the academic and the practical aspects) of building those algorithms, and leverage it to build more complex systems.
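For readers who have not seen one of these shortest path algorithms, here is a compact sketch of Dijkstra’s algorithm applied to a made-up delivery network. The location names and edge weights are invented, and in practice most teams would call an existing library or database feature rather than write this by hand, which is the point being made above.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over non-negative edge weights.

    Returns (total_cost, path), or (float("inf"), []) if goal is unreachable.
    """
    best = {start: 0}
    previous = {}
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            # Walk the predecessor links back to the start.
            path = [goal]
            while path[-1] != start:
                path.append(previous[path[-1]])
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return float("inf"), []


# Hypothetical road network; weights could be travel time in minutes.
roads = {
    "warehouse": {"hub_a": 4, "hub_b": 2},
    "hub_b": {"hub_a": 1, "customer": 7},
    "hub_a": {"customer": 3},
}
cost, route = shortest_path(roads, "warehouse", "customer")
```

The cheapest route here goes through both hubs even though a more direct road exists, which is the kind of answer that is hard to spot by eye once the network grows.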
A lot of these things always bring me back to Legos. I’m a big Lego fan, as a lot of people are. People have spent a lot of time building these little bricks, and when I go to build an application, the more I can use the bricks that other people have built, the faster it accelerates me in actually building the entire room, the model that I’m trying to build, or the application I’m trying to build in software terms. So it’s interesting sometimes to be able to dive deep into how some of the shortest path algorithms or connected components algorithms work, and there’s value in that.
It’s also valuable to be able to understand the outputs of those things. But coming from the background that I had, what my primary focus was on was: how do I solve my customer’s problem better, faster, and make it easier for them, and especially in a more maintainable way from a software development perspective? So I’ve definitely looked at that buy versus build decision as: okay, what’s going to drive the business value of this project I’m working on faster?
And speaking of faster, the world continues to accelerate. As we move into a hybrid digital-first world, we’re always thinking about the greatest technology and what are those use cases. So looking at graphs, moving forward, thinking of vision 2030, where do you see the use of graph technology headed?
One of the biggest ways that I’ve seen graphs being adopted today, and I expect more and more of it to be adopted, I see them being adopted to use in conjunction with other technologies, be those relational databases or document databases or key value stores or whatever other technologies that are out there.
I always look at any of these applications like this. They all have a hundred problems that they need to solve differently, a hundred different questions they need to be able to answer, and relational databases might do 80 of them really well. Document DBs might do five of them really well, and key value stores solve another couple of them.
That leaves a certain percentage where you’re dealing with these connected data problems, and by leveraging the right technology and using graphs or graph databases at the right time, you will solve those types of problems. What I see moving forward, what I’ve started to see in the adoption of it, is a realization from developers and companies that your relational database isn’t going to necessarily solve all these problems.
And when you get to a point where you realize that your relational database isn’t the right choice for this problem, graphs are starting to become more and more the go-to choice for solving those sorts of connected data issues and being able to drive these insights. So that’s definitely one area where I see graph technology headed: tighter integration between different data sources to solve different problems.
And then, the other way I definitely see them being used is, as you mentioned, graphs as both inputs and outputs of machine learning models: being able to use graphs as inputs to machine learning, and being able to take the outputs from those machine learning models and augment those back into your graph to build an iterative cycle of graph augmentation to solve some of these types of problems.
All these problems sound exciting. There are so many possibilities for building technology and for solving problems for businesses at the onset. Beyond that, we always like to think about technology on the show as human-first, but there are sometimes dangers and bad actors there. Do you see any dangers in the way that graphs can or could be used?
Absolutely. It’s one of those technologies, it’s definitely a double-edged sword because you’re able to drive insights and you’ll be able to see connections between things. People could use those connections in nefarious type ways. Let’s just take the example of an identity graph. I was actually talking to a friend here a little while ago. She was wondering why is Facebook all of a sudden showing her ads about something that her boyfriend had been looking at on Google.
Because of the identity graph world, they were able to put those people down into a household or whatever they were using, and they would basically be able to identify that these two people were in the same place. In this case, they were using that sort of information to do targeted advertising.
But you could think of that in the same way where someone would be able to use that to start doing things like identity fraud or identity theft against people, or to blackmail people, or any sorts of other bad actor type things.
That’s because you’re able to take this information and connect together these pieces of information, which independently wouldn’t necessarily give you a lot of detailed information about somebody, but putting them together and connecting them using a graph starts to give you a more holistic and fuller picture of somebody.
And so, tying everything together, from both the positives of the technology and the opportunities for it to improve, a lot of this inspired you to write a book, Dave. You and your colleague, Josh Perryman, together wrote Graph Databases In Action by Manning Publications. What are some of the takeaways you’ve seen with the book being out in the wild on the shelves by now?
Some of the biggest takeaways I’ve seen are just the almost completely positive reactions from people, and really their appreciating the fact that someone has gone through the hassle of filling in that gap between hello world and an academic paper: how do I actually use this to solve my practical everyday problems? I’ve gotten a lot of really positive feedback from people who, even if they don’t necessarily agree with everything that’s written in the book, at least like the fact that there’s been a conversation started about how we actually take this technology and practically apply it to everyday types of problems in ways that are pragmatic, given where the technology exists today.
Graph databases are relatively new technology. They’ve been around, depending on exactly which one you’re talking about, or which type you’re talking about, let’s say 10 to 20 years. But if you compare that to something like a relational database, that’s been around for 40 plus years, or maybe even longer than that at this point, they’re definitely not quite to the same stage of maturity.
So, the feedback I’ve generally had is that people appreciate the fact that we’ve been able to take a real pragmatic look at when to use them, when not to use them, and what are some of the drawbacks and complications that not only the technology presents, but learning that technology presents, to developers on a daily basis when they’re trying to solve their daily problems.
I’m so looking forward to the future of graph databases. There’s so much that we’ve covered today. And it’s only the beginning as they become a stronger use case, as data is moving with us, every place that we go. Dave Bechberger, Senior Graph Architect at AWS and co-author of Graph Databases In Action, by Manning Publications. Thanks so much for joining us on the show.
Thank you for having me.
Thank you for listening to this episode of the HumAIn podcast. Did the episode measure up to your thoughts on ML and AI, data science, developer tools, and technical education? Share your thoughts with me at humainpodcast.com/contact. Remember to share this episode with a friend, subscribe and leave a review. And listen for more episodes of HumAIn.