DUE TO SOME HEADACHES IN THE PAST, PLEASE NOTE LEGAL CONDITIONS:
David Yakobovitch owns the copyright in and to all content in and transcripts of The HumAIn Podcast, with all rights reserved, as well as his right of publicity.
WHAT YOU’RE WELCOME TO DO: You are welcome to share the below transcript (up to 500 words but not more) in media articles (e.g., The New York Times, LA Times, The Guardian), on your personal website, in a non-commercial article or blog post (e.g., Medium), and/or on a personal social media account for non-commercial purposes, provided that you include attribution to “The HumAIn Podcast” and link back to the humainpodcast.com URL. For the sake of clarity, media outlets with advertising models are permitted to use excerpts from the transcript per the above.
WHAT IS NOT ALLOWED: No one is authorized to copy any portion of the podcast content or use David Yakobovitch’s name, image or likeness for any commercial purpose or use, including without limitation inclusion in any books, e-books, book summaries or synopses, or on a commercial website or social media site (e.g., Facebook, Twitter, Instagram, etc.) that offers or promotes your or another’s products or services. For the sake of clarity, media outlets are permitted to use photos of David Yakobovitch from the media room on humainpodcast.com or (obviously) license photos of David Yakobovitch from Getty Images, etc.
Welcome to our newest season of the HumAIn Podcast in 2021. HumAIn is your first look at the startups and industry titans that are leading and disrupting ML and AI, data science, developer tools, and technical education. I am your host, David Yakobovitch, and this is HumAIn. If you like this episode, remember to subscribe and leave a review. Now onto our show.
Welcome back, listeners, to the HumAIn Podcast, where we talk all about human augmented intelligence, real-time insights, and data-intensive apps in the new modern data stack. Today’s episode features Ariel Utnik, the Chief Revenue Officer and General Manager of Verbit. Verbit is a very exciting startup that’s building the future of AI transcription and captions so that content can be accessible to all audiences anywhere, anytime. Ariel, thanks so much for joining us on the show.
Thanks, David, for having me.
Well, I’m really excited to have you here today because I’m a big fan of practical AI: taking technology, making it commercially viable, and making it available for enterprises and audiences. So we’re going to unpack and dive into that in today’s episode. But before we learn more about your business and how it has continued to scale with exciting growth, can you share with the listeners a little about yourself, your career, and what brought you to scaling Verbit?
Yes, sure. I’ve been part of high tech for the last 25 years, and I would say that my journey may be a bit different. I started in QA testing, implementing automation and load testing, so I began on the technical side. Then I moved to serve customers, running customer ATPs (acceptance test procedures) and internal lab acceptance tests with customers, at Comverse, which was a provider of value-added services for cellular. From there, I moved back to the technology side as VP of R&D in a big organization delivering cloud software for project management, a company called Clarizen.
And after three years, I was part of the customer success evolution and built the customer success team. Moving back to the business side, I understood the importance of renewal and upsell and what drives SaaS companies. From there, I continued evolving on the business side and moved to a company called Feedvisor, where I spent more than three years as Chief Customer Officer and really built everything around post-sale activity, from implementation to customer upsell and renewal expansion, and so on. And today I am with Verbit as Chief Revenue Officer, responsible for marketing, sales, customer success, and partnerships, really trying to serve the business and support Verbit’s growth over the last three and a half years.
It’s great to follow your entire journey, Ariel, going from supporting customers through QA and customer success to now being on the front lines of scaling a fast-growing startup, which is now in scale-up mode. Recently Verbit raised its Series E round of funding, and you continue to scale the product. I’m really passionate about this entire space of transcription, and the 99% accuracy that Verbit works towards connects to a fun fact that I’ve shared with our HumAIn listeners in the past: in my first job in college, I was one of those transcriptionists, working for what we called university copywriting services out in Gainesville, Florida, part of the University of Florida ecosystem.
And this was going back more than a decade ago, when I was listening to the audio and manually clicking, using shortcuts, typing, and making marks so that doctors and lawyers and different teams could have their voice memos transcribed in near real-time or the next day, so to speak. Of course, the industry has evolved a lot since then, because now there’s technology like what your team is scaling at Verbit. To start off and frame this for our listeners, we can see how everything has moved from human-only to human plus AI. So can you share with us a little about how AI effectively captures this explosion of audio and video content?
First, it’s great that you had that experience; I think it may be easier to explain the difference from the days when you were listening to audio and trying to type as you heard it, when you needed to listen to every sentence several times until you got it right. And if you were transcribing one or two hours, I’m sure that you had no technology to support consistency in quality. What do I mean by that? For example, let’s say someone is transcribing this podcast, and he’s transcribing your name, David Yakobovitch. There is no guarantee that the spelling of the name will stay consistent, let’s say, 20 minutes in.
So one piece of the AI is really helping with that, making sure that the transcript is consistent and at the right quality level. It works in several parts. We take the audio through what we call machine transcription, really transcribed by the machine as a first pass, which, depending on the audio, the accent, and the background noise, can get you somewhere from, I don’t know, 75 to 85 percent accuracy, sometimes a bit more. But this is not good enough for a medical meeting; we need a 99% accurate transcript, and then we have the human going over it once.
You already have a draft, and he can just correct the wrong words. What is very nice at Verbit is the ability of the ASR to also learn during the session, meaning if I corrected the way that David is written one time, the ASR will be able to pick up the correction, let’s say, 20 minutes later. Instead of getting the full pass of the ASR and needing to correct David 10 times, I do it once, it is added to the glossary using our AI, and then the correction is already applied through the rest of the transcription work. This is one piece of the AI.
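The session-glossary behavior Ariel describes can be sketched roughly like this. It is a toy illustration only, not Verbit’s actual implementation; the `SessionGlossary` class and its methods are invented for the example. The idea: a human fixes a term once, and the fix is applied to every later chunk of ASR output.

```python
import re

# Toy sketch of session-level correction propagation (not Verbit's real code):
# once a human corrects a term, it joins a session glossary that is applied
# to every subsequent chunk of ASR output.

class SessionGlossary:
    def __init__(self):
        self.corrections = {}  # misrecognized form -> corrected form

    def add_correction(self, wrong: str, right: str):
        self.corrections[wrong] = right

    def apply(self, asr_text: str) -> str:
        for wrong, right in self.corrections.items():
            # whole-word replacement so partial matches are left alone
            asr_text = re.sub(rf"\b{re.escape(wrong)}\b", right, asr_text)
        return asr_text

glossary = SessionGlossary()
# the human fixes the name once...
glossary.add_correction("David Yakobovich", "David Yakobovitch")
# ...and every later ASR chunk comes out with the consistent spelling
later_chunk = "As David Yakobovich mentioned earlier in the session..."
print(glossary.apply(later_chunk))
```

A production system would of course fold corrections back into the acoustic and language models rather than doing plain string substitution, but the user-visible effect is the same: correct once, consistent everywhere after.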
Another piece of the AI is controlling all those transcribers. You said you were a transcriber in university, but let’s assume there are 200 people like you. How do we know they are effectively working and making the changes that we need? How can we assess, using AI, who among those transcribers is a good transcriber, and who is weaker in some area? So let’s say we have one transcriber who is weaker on punctuation.
If the AI is able to identify it, then we can send this transcriber to a punctuation lesson, right, and help him improve. Can we compensate differently? Can I compensate a high-quality transcriber, incentivizing him to work with me more, versus one who is producing lower-quality transcription? So the AI starts with voice-to-text, but that is only the beginning. At Verbit, we build different types of AI components to support the efficiency and the quality of the transcription.
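A crude sketch of how per-skill scoring might work (my illustration, not Verbit’s actual method; the function names and the 0.8 threshold are invented): compare a transcriber’s output against a reviewed reference and track word accuracy and punctuation accuracy separately, so a specific weakness can be flagged for targeted training.

```python
import string

# Hypothetical per-skill scoring: separate word errors from punctuation
# errors so a weakness (e.g. punctuation) can be flagged on its own.

def tokenize(text):
    words, punct = [], []
    for tok in text.split():
        stripped = tok.strip(string.punctuation)
        if stripped:
            words.append(stripped.lower())
        punct.extend(ch for ch in tok if ch in string.punctuation)
    return words, punct

def skill_scores(reference: str, submission: str) -> dict:
    ref_w, ref_p = tokenize(reference)
    sub_w, sub_p = tokenize(submission)
    word_acc = sum(a == b for a, b in zip(ref_w, sub_w)) / max(len(ref_w), 1)
    punct_acc = sum(a == b for a, b in zip(ref_p, sub_p)) / max(len(ref_p), 1)
    return {"word_accuracy": word_acc, "punctuation_accuracy": punct_acc}

# all the words are right, but the punctuation is off
scores = skill_scores("Hello, world. How are you?", "Hello world, How are you.")
if scores["punctuation_accuracy"] < 0.8:
    print("flag transcriber for a punctuation lesson")
```

A real pipeline would use proper alignment (e.g. edit distance) rather than naive position-by-position comparison, but the separation of skills is the point of the sketch.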
It’s fascinating to hear how the technology has changed. Back in my days working as a transcriber, you mentioned ASR, Automatic Speech Recognition; what we had was Human Speech Recognition. Just as you’re describing, Ariel, I was on a team, in a big office with cubicles, with about 200 transcribers. You can imagine everyone with their headphones and their keyboards with all their shortcuts, making sure they could get every phrase, and if you heard a phrase over and over from the doctor, the way he said it, you had a shortcut to write that phrase. In essence, we were being our own smart AI, but it was manual, from the human. It was this Human Speech Recognition, which is so repetitive and so manual.
And it’s incredible to see how the technology has evolved over time; it has allowed this consumption of content to be contextualized, as you’re describing how Verbit makes that possible with 99% accuracy and above. And there have been a lot of industries where we’ve seen that digital transformation occur.
I recall that some of the biggest clients I had when I was doing these transcriptions were actually lawyers. You’d have lawyers on the road who would call in: “February 11, 2010. This is the recording for the client, Joe Smith. Docket report: Joe Smith went to court, Joe Smith won,” and we would transcribe that. So there was a lot of information, and in that industry specifically, information is needed in real time, because decisions could be delayed or accelerated based on how quickly information is processed to the right third parties. So my next question is about that industry and the clients you work with: what have you seen in the legal industry? Why does the legal industry need to embrace digital transformation today?
So, the legal industry is a very interesting industry. We are talking about a transcription-only market of four to five billion dollars a year, with a very high standard of accuracy: you cannot allow yourself a mistake in a deposition that is being submitted to the court, because we are talking about people’s lives, right? And there is a need, as you said, to get it live during the deposition, to go forward and back, and so on; there is a need to get a first draft very quickly, so that when the lawyer is back at his office, his only work is to read it, and he can start working on it.
And there is a need for something of very high accuracy; we are talking about above 99%, and what will be submitted to the court is not even measured in percentages. Traditionally, the only way to serve this was using a stenographer who spent five years learning how to type very fast and react. But what we have seen in the market over the last 10 years is a shortage of stenographers, with fewer and fewer people registering to learn to be one. So there is a high volume of litigation that is only increasing year over year, alongside a shortage of stenographers. And this is exactly the space that Verbit identified and built its technology around.
And Verbit is the only company that is basically able to provide the same level of service, both on the live rough draft and on the final transcript, which can be submitted to the court. But what’s interesting is that it’s not only the services we can provide today; there is an opportunity for additional services. Think about a deposition that goes on for eight hours. One of the things the lawyer wants from the deposition is to know how the witness behaves, and what mental state he is in.
So we can use sentiment analysis, and look through the camera, because today most depositions are done on Zoom, to see the witness’s eyes and his reactions, and provide some feedback to the lawyer about his mental state. And it doesn’t stop there, because if it’s an eight-hour deposition, during those eight hours the witness is asked about the same topic from different angles.
The lawyer is trying to see if he is consistent in his answers, and we can use the AI to help the lawyer identify whether the witness provides consistent answers. So I would say that the revolution that Verbit is leading around legal transcription is only the beginning. Today, we are able to provide all the services, comparable to or better than the stenographer, and we are already looking into what additional services we can provide the lawyer during the deposition.
And this is game-changing, absolutely game-changing, for the legal industry. And I can see how this verticalized solution, of course, is not only for the legal industry but can expand. I recall that the other clients I worked with when I was doing transcription included the healthcare industry, as well as managers working with 360 reviews in HR systems, recording conversations and making those insights available to different team members. So it seems there’s a lot of opportunity for everyone to embrace digital transformation.
And in fact, it’s not only about having things as an archive or docket for the record. There’s even more to it. We know that today, living in our digital-first world where we’re a Zoom nation and a Zoom society, it’s no longer only about audio. In fact, it’s audio and video content, and we’re always on, always looking at these Samsung, Apple, and Google displays, which, in fact, I think we look at for almost two-thirds of the day now.
And that content isn’t only audio; a lot of it is video, and the big challenge is making sure that video is accessible to everyone. How do you make it so that people with disabilities can access the content, get the same benefits, and have the same accessibility as everyone else? Can you share more with our listeners, Ariel, about how AI-driven transcription and captioning as a tool from Verbit provides this new accessibility for all audiences?
Accessibility is very important, but it does not stop there. When we are talking about being inclusive, different types of people would like to consume content in different ways. One student may be hard of hearing and need the captions. Another student is blind and needs an audio description of the video being played: he hears very well, but he cannot see what’s going on on the screen, so there is a service called audio description that translates what’s happening on the screen into voice. Other students prefer to get a summary. Others will say, I want the full transcription of the session.
So being inclusive really creates an environment where everyone can consume the information in the way that helps him, and Verbit is very proud to serve this market and really provide equal opportunity to everyone. We talked about the digital revolution, and you called it a Zoom nation, I like the phrase, so everything is digital now. So how can we provide those types of services in the way that you need and at the time that you need them?
So you can be on Zoom, and you need the captions now; you want to have an equal opportunity with everyone else in the class if you’re hard of hearing, so you need the captions live. Or maybe you consume a pre-recorded session, and then you would like to get the captions prepared for you in advance. This is called offline captions, or post-production captions.
There are different types of industries that need it. If you are watching a movie on Netflix, you would like to see captions in your own language. If you’re blind, you would like to get an audio description. But maybe you also prefer to get the captions in another language, so you would like a translation of the episode you are watching. So this whole inclusive environment, providing you the ability to consume digital content in the way that you prefer, is really the basis of the new era we are living in.
This is important with multiple languages. And when you’re on the go, perhaps you’re in an environment where you can’t always turn the audio on, so you need the ability to consume more content through a contextualized text approach. I think back to the famous show that became a world-renowned phenomenon, Squid Game.
This is a show that, of course, captured society’s imagination. And what made it such a big hit was not only the culture: the show was never recorded in English. In fact, it was a Korean-language show that was dubbed over in English, and you could pull up Korean subtitles or English subtitles if you wanted to engage with the show.
And so when we talk about accessibility and accessing content in different engaging modalities, I think about shows like this that can benefit from using technology to get those results more quickly and more accurately. Sometimes I think about shows I watch on TV; I’ll occasionally watch an episode of Saturday Night Live on Hulu, and you’ll see the translation that’s been provided.
And sometimes I’m unsure where that translation comes from, whether it’s actually Human Speech Recognition or Automatic Speech Recognition, because from the words alone you can say, this is clearly wrong, or, like with a Netflix movie, this transcription is not right. And so I wonder what’s going on there. It sounds like a lot of organizations are still going manual and have not jumped on the bandwagon of using automatic technology like Verbit’s.
Definitely. If we talk about the media industry, there is the FCC regulation that ensures we get accurate captions for a show, but it does not regulate all the channels; only the top 25 need to be regulated, and the rest can do whatever they think is right. It really depends on the audience to come to the channel and say, this is not good enough.
I’m not going to use your services because you don’t invest in providing quality content. And as you said, it’s not only about providing equal opportunity for people who are deaf, hard of hearing, or the like; it’s about being inclusive, talking about languages, talking about the way you prefer to watch your movie. Maybe the kids are sleeping, and you prefer to mute the volume and only read.
And you mentioned social media, which is huge today. We all know the environment: we open Instagram or Facebook or TikTok, and we cannot have the audio on all the time, but we would like to understand what a clip is talking about. One way is to get it through captions. Some of the big players are investing a lot in building ASR, which is the machine capability to provide some level of captions. In some cases, that may be sufficient; in other cases, it’s just not accurate, as you said. And there are other challenges, because you don’t know, for a video that is about to be uploaded to Instagram, what language it will be in. So how do you know which AI to choose to caption it? Not to mention the ability to translate it into other languages.
So there is a lot of investment by the big companies in the technology there. And the only way to improve the technology itself is really to train the model, and the only way to train a model is to provide it with the truth. What is the truth? The truth is what companies like Verbit are producing. So we are working with a lot of technology companies, helping them improve their models. Now, the understanding and the assumption is that it can only be improved up to a certain level.
I gave the range of somewhere between 75, 85, maybe 90%. If you watch a video on TikTok and you have no expectations, maybe it’s fine to make a mistake every 10 words. But we would all agree that if you’re going to university and you would like to understand the professor, that is not accurate enough, not to mention watching a movie on Netflix, and definitely not submitting something to court, where every word has a meaning.
And from everything we’ve discussed today about the future of Automatic Speech Recognition and AI-powered transcription with Verbit and your platform, there’s been a tremendous amount of growth in the space and the industry as a whole. There are a lot of trends and changes around the consumption and generation of content.
At Verbit alone, last year you raised both a Series D and a Series E round of venture capital, most recently the $250M Series E at unicorn territory, led by and with participation from investors such as Sapphire, Third Point Ventures, Vertex Growth, 40North, Samsung Next, and TCP. I’d love to hear from you about how this round of funding will continue to accelerate Verbit, and if you can share with us more about what’s next for the company.
So, first and foremost, we invest in technology, building the AI and HI (human intelligence) community, to make sure that we are able to scale and provide a high level of accuracy for additional verticals. And we keep expanding the verticals that we serve, the geographical locations, and the languages that we serve.
In addition to that, as you said, the majority of the transcription market is still not using the AI to the level that Verbit is offering and we use part of this capital to acquire companies and better serve the customers using our technology. We did two acquisitions last year. And we plan to continue to consolidate the market and provide higher accuracy to the rest of the market.
And you mentioned the acquisitions. Of course, as a company, Verbit is running a suite of products, and those products are complementary, building out great solutions for the entire AI-powered transcription market. Can you share more about the evolution of your product roadmap and how that shapes your identity in the market?
So, the approach that Verbit took from day one was that transcription is not one general transcription across verticals. Whether we do transcription for education, for media, for finance, or for legal, it’s not the same approach. There are companies out there where you give them audio or video and they bring you back text; we understand that it’s different, and I can explain. When you do transcription for a student, he needs to have it in class or remotely.
So we would probably need to support a tablet so he can consume it in class, or different remote capabilities such as Zoom, Microsoft Teams, and others. And he may need some functionality such as: the professor was talking about a specific term, and he would like to mark this term, find what it means on Wikipedia or maybe other sources, and write himself some comments. This is very specific to education.
Now, if we go to media, there is no research and no commenting there; that functionality doesn’t exist. You expect to consume it right there, live if it’s a live session, or with very high accuracy if it’s post-production, using different capabilities such as translation, dubbing, and audio description. But the technology plays a different part. Let’s say we are going to a sports event and we would like to get live captions of it; the technology is able to prepare and help the HI, the human, do a better job.
Because, let’s say, we know the names of the teams, so we can upload them and train in advance, and provide all those shortcuts that you described at the beginning, which you used to have to sit and create; the AI can create them for you. We know the names of the players, so we can load them for you. We know the venue. So we can help the human get ready and do better, more efficient work for those types of events.
Thinking about legal, legal behaves totally differently. The transcription is not captions, and there is specific formatting: the number of lines per page, the number of characters per line. Consistency between names is super important and needs to be accurate; you cannot have a mistake in the witness’s name. And some of the data is available, because at the beginning of the deposition the witness spells his name. So if you’re able to capture it, and from that moment make sure it stays consistent through the eight hours of the deposition, the technology plays a big part here.
Maybe there are specific rules. So whenever you mention someone’s name, you would like to capitalize it. These are what we call the deposition guidelines. The technology is able not only to help you with that, but also to audit, before we produce the final transcript, that it has been done. And if it has not been done, it will not allow the work to be submitted and will ask the transcriber to fix it, because it doesn’t meet the requirements. So, those types of capabilities.
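As a rough illustration of that kind of automated guideline audit (the rules, values, and function here are invented for the example, not Verbit’s actual engine), a checker might scan the transcript for line-length and name-consistency violations and block submission until they are fixed:

```python
# Invented example rules: a maximum number of characters per line, and
# consistent spelling/capitalization of the witness's name throughout.

def audit_transcript(lines, witness_name, max_chars=60):
    """Return a list of (line_number, problem) tuples; empty means OK to submit."""
    violations = []
    for i, line in enumerate(lines, start=1):
        if len(line) > max_chars:
            violations.append((i, f"exceeds {max_chars} characters"))
        # flag the witness's name when it appears with different casing/spelling
        if witness_name.lower() in line.lower() and witness_name not in line:
            violations.append((i, f"inconsistent spelling of '{witness_name}'"))
    return violations

transcript = [
    "Q. Please state your full name for the record.",
    "A. My name is jane doe.",
]
issues = audit_transcript(transcript, "Jane Doe")
if issues:
    print("cannot submit:", issues)  # the transcriber is asked to fix these first
```

Real deposition guidelines cover far more (lines per page, speaker labels, timestamps), but the gate is the same: the audit must come back clean before the final transcript can be produced.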
So, Verbit’s approach to the market is by vertical, making sure that we understand what the vertical needs and how we can use the technology to better serve it. This is why I said we are using the money to expand to other verticals. If we look at new verticals, we ask ourselves: what does this vertical specifically need?
And how can we serve it in a better way, both in technology and with the people who will work with our customers? Our customer success teams are split by vertical, so the person working with you knows what you are facing today, is able to speak your jargon, and maybe can also advise you, because he’s working with 10 other customers like you. So he can consult with you on challenges that you probably face day to day.
Well, that makes a lot of sense. And I’m excited to see the Verbit platform and family of products continue to scale in the market for all things AI-powered transcription and captioning. Ariel, what are the next steps or action items you’d like to share with our listeners so they can learn more?
As I mentioned at the beginning, the next evolution is really making the machine transcription more adaptive during the session. What does that mean? Traditionally, if you run the machine transcription, the ASR, you get a full transcript and then you start fixing it, which is not efficient, because let’s say the ASR misspelled David: you need to fix it 10 times. What we are doing today is having the machine do its part in pieces, building an on-the-fly model for this specific session from the corrections done by the human.
So the ASR gets more and more accurate as the session evolves, which provides us with two things: first, much higher accuracy and efficiency, and second, the ability to use the ASR, which after 20 or 30 minutes is already very accurate, because the human perfects it and creates an on-the-fly model, as almost a live transcript that can be consumed by the end user. That is one revolution underway. And the second revolution is the apps that we are building for each of the verticals to better serve them.
So if I’m in education: give me all the sessions that exist at the university, help me search through them, and find all the sessions that are relevant, for example, to a specific class, maybe with some summarization of the class. There are other apps that help you better understand the topic.
So sometimes the professor is less clear, or redundant in a sentence, and the technology makes the information easier to digest. This is in education, and I gave you the example in legal of how we can better serve the lawyer to do a better job. So those types of apps are being built on top of the transcription that we provide.
Excellent. Well, it’s been such a pleasure to learn more about this space. I feel I’ve come full circle from my early career days, seeing how you, Ariel, and your team continue to scale Verbit. And for our listeners, if you want to learn more about Verbit, you can check it out at Verbit.ai; they’re hiring in engineering and many other functions.
So be sure to check out the careers page as well, as we continue to build for the AI-powered and data-powered economy. Listeners, thanks for joining us. This has been our episode with Ariel Utnik, the Chief Revenue Officer and General Manager of Verbit. Ariel, thanks so much for joining us on the show.
Thank you, David. It was a pleasure.
Thank you for listening to this episode of the HumAIn Podcast. Did the episode measure up to your thoughts on ML and AI, data science, developer tools, and technical education? Share your thoughts with me at humainpodcast.com/contact. Remember to share this episode with a friend, subscribe, and leave a review, and listen for more episodes of HumAIn.