DUE TO SOME HEADACHES IN THE PAST, PLEASE NOTE LEGAL CONDITIONS:
David Yakobovitch owns the copyright in and to all content in and transcripts of The HumAIn Podcast, with all rights reserved, as well as his right of publicity.
WHAT YOU’RE WELCOME TO DO: You are welcome to share the below transcript (up to 500 words but not more) in media articles (e.g., The New York Times, LA Times, The Guardian), on your personal website, in a non-commercial article or blog post (e.g., Medium), and/or on a personal social media account for non-commercial purposes, provided that you include attribution to “The HumAIn Podcast” and link back to the humainpodcast.com URL. For the sake of clarity, media outlets with advertising models are permitted to use excerpts from the transcript per the above.
WHAT IS NOT ALLOWED: No one is authorized to copy any portion of the podcast content or use David Yakobovitch’s name, image or likeness for any commercial purpose or use, including without limitation inclusion in any books, e-books, book summaries or synopses, or on a commercial website or social media site (e.g., Facebook, Twitter, Instagram, etc.) that offers or promotes your or another’s products or services. For the sake of clarity, media outlets are permitted to use photos of David Yakobovitch from the media room on humainpodcast.com or (obviously) license photos of David Yakobovitch from Getty Images, etc.
Welcome to our newest season of HumAIn podcast in 2021. HumAIn is your first look at the startups and industry titans that are leading and disrupting ML and AI, data science, developer tools and technical education. I am your host David Yakobovitch and this is HumAIn. If you liked this episode, remember to subscribe and leave a review. Now onto our show.
Welcome back, listeners, to this episode of HumAIn podcast, where we’re featuring the CEO and founder of Nomad Data, Brad Schneider. Brad has an extensive history in the data industry in New York City, with quite a few startups, and has built great products. Today we’re talking about his latest venture, Nomad Data. Brad, thanks so much for joining us on the show.
Thanks for having me, David, couldn’t be more excited.
I’m really excited for many reasons that we’ll unpack during the show, but to start off for our listeners: Can you tell us a little about who you are, your career, and your previous ventures?
Absolutely. I’ve spent my career split between technology and finance. I started out even in my early days as a data guy, a tech guy. I did my undergrad at MIT in computer science. I came from the world of bulletin boards, pre-internet, and very quickly started to work with data.
So, right out of MIT I did a startup in the analytics space, basically helping large companies, the Dells, the Barnes and Nobles of the world, look at their transactional datasets to learn more and more about their customers, about their products, about how to think about pricing. From there I moved to the world of investment, and even though it sounds like a big shift, I was investing in technology. So I had that area of comfort as I made that move. Basically my job was to try to figure out which technology companies were doing better. Which companies were doing worse. And that sort of brought me back into the field of data.
Because I had just come from this world where the data actually told me how the company was doing. It gave me more insight into how Dell was generating higher or lower gross margins, where they were succeeding, where they were maybe struggling. From there I spent a good portion of my career investing in tech companies at a couple of different hedge funds.
Then, ultimately, I saw that the world is becoming more and more interested in data. I built a lot of software over the years to help me, as the user of data, more easily interact with that data. So I created some tools that would combine data across a lot of different sources and very easily allow me to visualize it.
So I decided to launch a venture back in 2015 to commercialize this software. The goal was really a single user interface for data, and I ran that company for about five years. We sold it right before COVID, in February of last year, and then started Nomad Data. So Nomad came out of a problem that I had seen over and over again.
So, the world of data is exciting because there are so many new data sets coming to market that tell you everything: from voter information to win-loss rates for lawyers in different court cases. And, even this morning, we were looking at the volume of shipments coming in and out of the U.S., and what type of shipments, as we think about economic recovery.
But the problem was there was no good way to find these data sets, and we always had customers asking us, basically: This is our problem. How do we go from knowing what our problem is to knowing what data will help address it? And really there was nothing set up to do that. Google isn’t really built to index datasets in that way.
There are a lot of other platforms that will give you descriptions of datasets, but none of them help you connect the use case to the dataset, and that seems to be an area where people are really stuck. So Nomad Data’s goal, at a high level, is to be the search engine for these datasets, making it a lot easier for people in the AI space, for researchers, for computer scientists, for marketers, for strategy professionals, consultants, and investors to connect the everyday business problems that they have to real datasets.
And we’ve, basically, been in business about a year. We launched a product. We started selling. We raised some money, and, so far, the reception has been great. This is clearly a problem that a lot of people are having, and one where we’re helping fill the gap and make things a little bit easier.
That takes me back to my days at both General Assembly and Galvanize, where we worked with a lot of institutions and enterprise clients. We did consulting and advisory on datasets and data science workflows, and you’d often find the clients really wanted to do the machine learning and the performance monitoring.
But if you didn’t have good datasets to start with, you really struggled to get off the ground. And there were a lot of platforms out there, all over the place, hot messes. You’ve got the Kaggles of the world, but there’s a lot of opportunity to improve, and so it comes to this age-old problem: Do you make your data? Do you buy your data? So, speaking to that, let’s talk about buying data. Since your platform is very focused on data sources, what types of data are available for purchase?
It really runs the gamut. We currently have about 650 data providers on the platform, and that represents thousands of different data sets. Some of the more common ones in the industries that we serve are, let’s say, credit transaction data. So there are datasets that allow you to see every single dollar being spent by, literally, tens of millions of Americans on their credit cards and, even, internationally. And people, obviously, worry about things like privacy, but this data is heavily aggregated.
So, it really lets me see, let’s say at the store level, how The Home Depot is progressing. As we look at what happened last year, how are the cruise companies going to rebound? It’s a great way to see high-level economic performance. Then you’ve got things like customs data, which allows you to see trade flows: exactly what goods is a certain retailer bringing into the country? Where are they sourcing them from?
Which factories are producing more or less? There’s ship-to-ship communications data where you can literally see every ship in real time on the planet. You’ve got consumer credit data, where there is information on about 50 million Americans and every single loan that they have.
And, again, the data is anonymized, but it allows you to really understand the health of either an economy or a certain type of loan, or where people are borrowing to spend. So the problem is that these data sets are so powerful, but they’re also so broad. I can use that customs data set to understand a single company, one aspect of a company, a region, an economy, or competitive wins and losses for factories.
And because they’re so broad, it’s very hard to describe on a webpage what this dataset can be used for. Because of that, search engines can’t really do a good job indexing that data, and it becomes really hard to find. If I know that my problem requires credit card data or, let’s say, import-export data, then that data is a lot easier to find and the problem is easier to solve, but that isn’t what the problem looks like. Most people, even experts, don’t know that you can take data from one industry and apply it to a completely different industry.
So, it sounds like when we’re thinking about solving problems, whether it’s analytics, whether it’s data science, whether it’s building products, when it comes to build versus buy for data, you are in the school of thought of: let’s buy data to speed up processes. Is that right?
So I’ve been on both sides of this one. So it really depends on your timeline and the availability of the data you need. So, early in my career I was using data to make better investments. And, for example, I would scrape hiring sites. So I created a lot of my own hiring data.
We would scrape things like LinkedIn. We would scrape different job boards. We would scrape company job boards. An issue that we had is that we really wanted to correlate this data that we were collecting with something that we could observe. So the companies reported headcount, the companies reported growth. But these things are typically only reported once a quarter. So, even if we collected data for a year, we only had four data points to run that test against. So that made it somewhat problematic. Even if the data we collected was a hundred percent accurate, it became very challenging, because we didn’t have enough data points to even make a simple linear regression model.
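To make the "four data points" problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the daily posting counts and the quarterly headcount figures are made-up illustration values, not data from the interview. It only shows how a year of daily scraped observations collapses to four paired points once it has to be lined up against a metric that is reported quarterly.

```python
from collections import defaultdict
from datetime import date, timedelta


def quarter_of(d):
    """Return (year, quarter) for a calendar date."""
    return (d.year, (d.month - 1) // 3 + 1)


def quarterly_means(daily_counts):
    """Average the daily observations that fall inside each quarter."""
    buckets = defaultdict(list)
    for day, count in daily_counts.items():
        buckets[quarter_of(day)].append(count)
    return {q: sum(v) / len(v) for q, v in sorted(buckets.items())}


# Hypothetical scraped series: one job-posting count per day of 2020.
start = date(2020, 1, 1)
daily_postings = {start + timedelta(days=i): 100 + i // 10 for i in range(366)}

# Hypothetical reported headcount: one number per quarter.
reported_headcount = {
    (2020, 1): 5000,
    (2020, 2): 5100,
    (2020, 3): 5250,
    (2020, 4): 5400,
}

postings_by_quarter = quarterly_means(daily_postings)

# Only quarters with a reported value can be used to test the scraped signal.
pairs = [(postings_by_quarter[q], reported_headcount[q]) for q in reported_headcount]

n = len(pairs)
# A straight-line fit estimates a slope and an intercept, so only n - 2
# degrees of freedom remain to judge how well the line fits.
residual_dof = n - 2

print(len(daily_postings), "daily observations ->", n, "paired points,",
      residual_dof, "residual degrees of freedom")
```

With only two residual degrees of freedom, almost any line will appear to fit, which is why a longer purchased history can be more useful for testing than a perfectly accurate but short scrape.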
So, in a lot of cases, it’s better to buy if you need to be able to do that testing. If really what you need is point-in-time data, that is scrapable. For example, if I need to know the employees that work at a certain company today, or if I need to know what goods were shipped into the US from China in the last month, then these are things that I can create myself.
It’s really the time series aspect of the data that is the issue: I can’t start scraping something in the past. That’s not a hundred percent true. There are a few ways where other people may have stored it, and you’re really scraping it from them, but you can’t go back in time. You can’t see what something looked like at a different point in time and get that data. So, there are companies out there that have been doing these scrapes for years, or they’ve been plugged into a certain system for a decade. And so they’re wonderful sources to speed things up.
So, when you think about the challenges around using data to inform decision-making, you’ve spoken about finding the right data, with 650-plus data providers and thousands of datasets available today for institutions and startups to access. You’ve spoken about testing the data, making sure it has enough data points to get accurate, valid, and reliable results from models. There’s also a third area that you’ve spoken about before, which is putting data into production. Why is that area very important for informing decision-making?
You have to get the data into some sort of a store where you can interact with it, and that can be a challenge, especially when you’re talking about extremely large datasets. So I’ll give one example from the past, in the consumer credit space.
So we were working with one of these credit files from the credit bureaus, and they are literally terabytes in size. It would take the company that is producing it roughly six to eight weeks just to physically give it to you: to schedule a job, to produce the files to a partition on an FTP server, and for you to actually download them. Then you basically had to go through and import them into a database.
So this whole process took months before you were writing your first query. There are a lot of great companies focused on improving that process, but what’s been a real bottleneck historically is getting that data from where it started, whoever is creating it or whoever you’re purchasing it from, and getting it somewhere that you can write that first query.
You can make that first API call. You can run that first analysis. Unfortunately, that’s still a bottleneck. But I’d say there’s light at the end of the tunnel on that one with services like Snowflake that are creating these marketplaces where people are putting the data in a common database format.
Speaking of marketplaces like Snowflake, you are building a marketplace to grow the data market. You mentioned that you are building the search engine for data of tomorrow. What is Nomad Data doing today to improve that discoverability among these data providers?
So one thing we learned about data is that it’s hard to fully automate the data search process today. The main reason is that the data you need, the metadata about the data, doesn’t really exist, and the term metadata is used very broadly.
The sort of metadata that does exist is basically: what format is the data in? What are the different columns? What do they represent? But none of that really encodes the knowledge that those datasets possess. So we have been building that metadata database.
So keeping track of use cases; keeping track of sectors of coverage; keeping track of entities and types of metrics. So we build that database and as new searches are conducted on the platform, we’re actually learning from each one. So those are being incorporated into our model. Then, we also use cutting-edge NLP and machine learning to find similar concepts.
So, I was actually just running a test on our platform to see if typing in Bitcoin would pull up all the cryptocurrency-related datasets, and looking at different acronyms, and it works quite well. So a really important piece of it is to have this sort of expansion of vocabulary, from what the user said they wanted to actually finding something that covers that, but maybe doesn’t use the same language to describe it.
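The vocabulary expansion described above can be illustrated with a small Python sketch. This is not Nomad Data’s implementation: the interview describes NLP and machine learning models, whereas this example swaps in a hand-written lookup table, and the concept map, dataset names, and use-case tags are all hypothetical. It only shows the shape of the idea, which is that a query and a dataset can match on a shared concept even when they use different words.

```python
# Hypothetical concept map: each known term points to a broader concept label.
# A production system would learn these relationships rather than hard-code them.
CONCEPTS = {
    "bitcoin": "cryptocurrency",
    "btc": "cryptocurrency",
    "ethereum": "cryptocurrency",
    "crypto": "cryptocurrency",
    "cryptocurrency": "cryptocurrency",
    "customs": "trade flows",
    "imports": "trade flows",
}

# Hypothetical metadata records: datasets tagged with use-case terms.
DATASETS = [
    {"name": "Exchange order books", "use_cases": ["ethereum", "crypto"]},
    {"name": "Bill of lading records", "use_cases": ["customs", "imports"]},
    {"name": "Wallet activity", "use_cases": ["btc"]},
]


def concepts_for(terms):
    """Map each term to its concept; unknown terms stand for themselves."""
    return {CONCEPTS.get(t.lower(), t.lower()) for t in terms}


def search(query):
    """Return names of datasets that share at least one concept with the query."""
    wanted = concepts_for(query.split())
    return [d["name"] for d in DATASETS if wanted & concepts_for(d["use_cases"])]


print(search("Bitcoin"))
print(search("imports"))
```

Neither crypto dataset is tagged with the word "bitcoin", yet both are returned for that query because the query and the tags resolve to the same concept.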
When you think about discoverability, we also think back to the past year and a half as the world has continued to emerge from the great pandemic of 2020. The thought process has been how can data be used successfully and how has the pandemic impacted data? Is it available and what trends have you seen around data as a result of the pandemic?
I’d say the biggest change that the pandemic caused was really the need for data. So if I think about pre-pandemic, a lot of things about an economy, a lot of things about a company were fairly well known. So we knew what back to school looked like.
We knew what Black Friday looked like. We knew what pre-Christmas shopping looked like. We knew that certain businesses had one cycle and other businesses had a different cycle. We knew that inventory moved smoothly through the supply chain and we could expect them to produce whatever they needed.
Whereas in the post-COVID world, basically, all that stuff gets thrown out the window. We don’t have business cycles anymore that we understand, we don’t have any history. So all of a sudden, we go from a world where there might be one or two things that we don’t know about a country or a company to basically everything becoming unknown.
We have no idea what back to school will look like this coming year. We have no idea what the impact on the lumber market will be, whether or not we’re in a short-term bubble in housing or a completely new trend. So because of this, you have found buyers looking at more and more datasets to fill in the holes in their understanding. And because of the increasing number of those holes in their knowledge, there’s been an increasing need for data. So, I’ve spoken with many different data sellers, and many of them have seen their businesses double and triple over the last year because of this, because there’s just unprecedented uncertainty both in the economy and with particular businesses.
One of the challenges that I’ve seen with the data science workflow is being able to access data. Being able to know what’s the right data, and when to use it for projects goes back, not only to build versus buy, but it goes back to just the data market overall. What have you seen, Brad, being at the forefront of the data market: what’s holding the data market back?
Search, the area that we’re focused on, is one of the biggest problems. People know they want to see something, they want to be able to calculate some statistics, but they don’t really know what data would provide what’s required to do that.
So you have a lot of people that would be interested in buying data, but they don’t know how to even begin to approach the market. Your average data buyer, or would-be data buyer, is not well served by looking at a list of a thousand data providers or 5,000 data providers. That only makes the problem more and more challenging.
It’s similar to going to a diner and having 500 things on the menu: it’s really hard to know which direction you want to go in. And you can only eat a few items before you’re full. The same is true of data. We can’t test all the data in the world out, so we need to be really pinpointed on what we focus on. And because people have a really difficult time finding the right data, finding the best data to address their use case, they’ll get stuck at other stages: testing the wrong data, finding out that it isn’t helpful, going through, maybe, an NDA or a purchase process to find out that that wasn’t really what they wanted in the first place.
For most buyers it’s just too intimidating to even get started. So you really only have experts in the market. And because of that, the market is a lot smaller than I think it will be in the future. One of the analogies I love to use is that of the internet. If you think about the early days of the internet, you could only remember so many websites in your head. You couldn’t have millions of websites and expect any of them to be found. So, then you implement this layer called the search engine, and all of a sudden you unlock the long tail of the market that allows these smaller, newer businesses to rise to the forefront very quickly. So you stir things up. It’s not too dissimilar from what happens in evolution. You get a mutation that’s highly advantageous and it explodes if there’s the right selection event in the population.
So, I expect that with services like Nomad, we’re going to really help unlock this industry, and ultimately what that means is you bring more and more buyers into the market. The data providers start to generate more and more income. It attracts new data providers. Ultimately, it brings down pricing in a positive way where providers are spending less money selling.
They can make a lot more earnings at a lower price, and as that price reduction is felt by consumers, you bring even more consumers into the market. So you, ultimately, will see a market that’s orders of magnitude larger than what we see today, just like we saw with the internet, like we saw with app stores.
Just think about what the size of the mobile app market was in 2000. We had cell phones, we had cell phone apps. Apple didn’t really come up with an app store until it was circa 2010. Looking back, it seems like we had phones and the next second we had apps, but it was a long time before someone came in and standardized all the steps in the process, whether it was search, whether it was development, whether it was testing, whether it was: how do I pay for this? And once that was in place, the market grew from a hundred million to, the last time I checked, it was over a hundred billion. So I expect the same thing to happen with the types of commercial datasets that we’re talking about.
It sounds like many of the companies today haven’t given as much thought to data as they have to software. Software has been IP; software has been proprietary. Today data is the new part of the toolkit, and data should be looked at as R&D, as IP, and as that competitive edge so that companies can be more effective and efficient to win in the marketplace.
There’s no question about it. I’d say that the data revolution has already started. And I think the first step in that was companies looking at their internal data. How can we make better use of what we’re producing out of our CRM systems, or out of our logistics systems? And as we squeeze more and more value out of that, then people start to say: Well, what more can I do?
My data informs mostly on my business, but only from my point of view. What do I look like from my competitor’s point of view? What do I look like to my customers when they’re not in my stores, when they’re not on my website? That’s where a lot of these companies will lose visibility. So, the next frontier is, whether you want to call it external data or alternative data, these data sets that are coming from outside your four walls. In a lot of different businesses, it gives you a perspective that you don’t have. It gives you a perspective that isn’t biased by your own internal processes, but it still is biased by other things. And that’s really the learning curve behind using external data: familiarizing yourself with what those biases are and how they impact the analysis that you’re doing.
Now, when we look at a platform like what you’re building, Brad, at Nomad Data, how can corporations get involved? If I’m a corporate company today, can I have my data on Nomad Data? What does that look like?
We’re already seeing that. So a lot of corporations are fairly new to selling data, but there’s an interest. I’d say it really depends on the company. If you’re a company where your brand is extremely important, you’re consumer facing, you’re a Coca-Cola, you’re a Nike.
Those types of companies in general are more reticent to sell data because there’s potential brand risk associated with doing that. It’s the companies that may be deeper in the supply chain, where their brand isn’t really what they sell on. They’re an intermediary in a market, but have incredible visibility. We’re seeing more and more of those companies start to bring their data to market. The reason that Nomad is such a great fit for them, and, even, for those in the former category, is that we support anonymity on both sides of the market.
So a lot of corporates are very hesitant to list their dataset in a marketplace where there’s a giant list of what’s being sold. They want to be very selective about who they share that information with, including the fact that their data is being sold at all. So in Nomad, they can post their data. It’s completely anonymous.
If someone were to conduct a search, our NLP algorithm matches that corporation to that particular search, and then it’s up to them, whether or not they want to reveal their identity. They can actually engage in a conversation while still being anonymous until they get to the point that they feel comfortable. And we’ve seen that generate a lot of interest from potential data sellers.
Now Brad, you also recently raised a round of funding for Nomad Data. Can you share with our audience more about that capital round and what it will accelerate for your business moving forward?
We just announced this last week. We raised 1.6 million, and that was led by Bloomberg Beta and then some other higher profile VCs as well. Some great angels in the data space.
The goal for this is really to start to reach scale with this business. We’ve got very good scale on the data provider side, and now we’re focused on bringing new types of buyers into the market. So that’s going to be focused on hiring. We’ve already brought on someone, even within the last week, and we’re going to start making announcements about that in the coming week or two.
We’re going to be putting a lot of that into marketing, generating awareness, not only of Nomad Data, but of alternative data as a sector, and just how vitally important it is to making the right decisions, especially in this environment.
It’s great to see that Bloomberg Beta led your round with participation from Alumni Ventures, Great Oak Ventures, Correlation Ventures, and ourselves, DataFrame Ventures. We’re excited to participate as well. Turning things over to the alternative data space and where you see the market going: what are some of the predictions or trends that you’re going to see in the next few years?
As we get out three to five years, awareness of this space and interest in this space is going to explode. We’re going to go from a market where people are excited that there’s something out there but have no idea how to approach it and get involved, to a market where most companies have begun some sort of effort to incorporate this data and have started to make decisions based on this data. As a result, I expect you to see, literally, orders of magnitude growth in both the number of people selling data and the number of people buying data.
As the founder of your company, Nomad Data, you are based out of New York City, and both of us live in New York. I’m a big fan and proponent of aggregation economies like New York City. I wanted to hear from your end: why New York for your latest venture?
New York is just an exciting place to be. The energy here is unrivaled. It’s also an interesting city because you’ve got so many new faces coming to the city. You’ve got so much turnover at big companies that are looking to do new and exciting things. Especially after COVID, we’ve seen just a record number of people look to change careers.
So, if you’re a startup it’s a wonderful environment to be in. It’s also helping a lot that housing prices are coming down. I think we’re attracting more and more people. People that don’t want to commute here don’t have to anymore. So, it’s going to be a renaissance for the city.
Well, Brad Schneider, the CEO and founder of Nomad Data. Thanks so much for joining us on HumAIn.
Thanks, David. Thanks again.
Thank you for listening to this episode of the HumAIn Podcast. Did the episode measure up to your thoughts on ML and AI, data science, developer tools and technical education? Share your thoughts with me at humainpodcast.com/contact. Remember to share this episode with a friend, subscribe and leave a review. And listen for more episodes of HumAIn.