Modern Natural Language Processing and AI during COVID-19 with Daniel Whitenack
Daniel Whitenack is a Ph.D. trained data scientist working with Pachyderm. Daniel develops innovative, distributed data pipelines which include predictive models, data visualizations, statistical analyses, and more. He has spoken at conferences around the world (ODSC, Spark Summit, PyCon, GopherCon, JuliaCon, and more), teaches data science/engineering with Purdue University and Ardan Labs, maintains the Go kernel for Jupyter, and is actively helping to organize contributions to various open source data science projects.
Daniel Whitenack’s LinkedIn: https://www.linkedin.com/in/danielwhitenack/
Daniel Whitenack’s Twitter: @dwhitena
Daniel Whitenack’s Website: https://datadan.io/
Podcast website: https://www.humainpodcast.com
YouTube Full Episodes: https://www.youtube.com/channel/UCxvclFvpPvFM9_RxcNg1rag
Support and Social Media:
– Check out the sponsors above; it’s the best way to support this podcast
– Support on Patreon: https://www.patreon.com/humain/creators
– Twitter: https://twitter.com/dyakobovitch
– Instagram: https://www.instagram.com/humainpodcast/
– LinkedIn: https://www.linkedin.com/in/davidyakobovitch/
– Facebook: https://www.facebook.com/HumainPodcast/
– HumAIn Website Articles: https://www.humainpodcast.com/blog/
Here are the timestamps for the episode:
(00:00) – Introduction
(02:13) – Being online is pretty normal for me and my team. I’m fairly often on calls with people all across the U.S., but also in Singapore, India, Africa, and all over, mostly via Zoom.
(02:55) – Our teammates in India went fully remote from their office because they’re all programmers and software engineers, so they’re all working from home.
(03:56) – What’s really boosted NLP in the last couple of years is large-scale language models. Oftentimes, in an AI model that processes text, you’ll have one encoder or a series of encoders, for example for text classification. What’s really been interesting is these large-scale language models that have been trained, like GPT-2, BERT, and ELMo, and a bunch of others. They’re trained on a massive set of data, sometimes even for multiple languages, such that you can apply that model to a wide range of tasks by just fine-tuning it to one of those tasks, like translation, sentiment analysis, or text classification, with a much smaller amount of data than was required before. That led to this explosion in the application of AI and NLP.
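The fine-tuning pattern described here can be sketched in a few lines. This is a toy illustration, not a real language model: the "pretrained encoder" is a fixed random projection standing in for something like BERT, and only a small classification head is trained on a modest labeled dataset, which is the essence of the fine-tuning workflow Daniel describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" encoder (stand-in for a large language model):
# maps 50-dim input vectors to 16-dim embeddings. Its weights are
# never updated during fine-tuning.
W_enc = rng.normal(size=(50, 16))

def encode(x):
    return np.tanh(0.1 * (x @ W_enc))

# Small labeled dataset for the downstream task (synthetic stand-in
# for, say, sentiment labels).
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Trainable task head: logistic regression on the frozen embeddings.
E = encode(X)
w, b, lr = np.zeros(16), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(E @ w + b)))   # predicted probabilities
    grad = p - y                              # logistic-loss gradient
    w -= lr * (E.T @ grad) / len(y)
    b -= lr * grad.mean()

acc = (((1.0 / (1.0 + np.exp(-(E @ w + b)))) > 0.5) == y).mean()
print(f"head-only training accuracy: {acc:.2f}")
```

The point of the sketch is the division of labor: with the encoder frozen, only 17 parameters are trained, which is why fine-tuning works with far less task data than training from scratch.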
(06:12) – The size of the models has increased a lot, and they’re processing a lot of data. These word embeddings, or representations of text that are learned in the model, encode a lot about language in general. It’s been shown in a couple of studies that you can recover from these embeddings the traditional syntactic structure of text that linguists are familiar with, like grammars, so a lot of information is encoded in these embeddings.
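The "recovering syntax from embeddings" studies typically use a probing classifier: if a simple model trained on the embeddings can predict a syntactic label, that label must be encoded in them. A toy version of the idea, with synthetic embeddings rather than real BERT or ELMo vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

# Synthetic setup: a hidden direction in embedding space encodes a
# fake noun/verb tag, plus Gaussian noise. Real probing studies do
# this with actual model layers instead of synthetic vectors.
pos_direction = rng.normal(size=d)
tags = rng.integers(0, 2, size=400)                # 0 = noun, 1 = verb
emb = rng.normal(size=(400, d)) + np.outer(2 * tags - 1, pos_direction)

# Fit a least-squares linear probe on the first 300 examples.
X_tr, y_tr = emb[:300], 2.0 * tags[:300] - 1.0
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Evaluate on held-out embeddings: high accuracy means the syntactic
# tag is linearly decodable from the vectors.
pred = (emb[300:] @ w > 0).astype(int)
probe_acc = (pred == tags[300:]).mean()
print(f"probe accuracy on held-out embeddings: {probe_acc:.2f}")
```

High held-out accuracy here is the kind of evidence the studies Daniel mentions use to argue that grammar-like structure lives inside the learned representations.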
(08:07) – Transfer learning depends a lot on the parent model that you transfer from, and there are very multilingual models out there, some including up to 104 languages. There are actually 7,117 languages currently being spoken in the world. If we think about a multilingual model that supports 104 languages in its embeddings and language model, that’s a drop in the bucket, and some tasks like speech-to-text or text-to-speech, especially in NLP platforms, only support maybe 10 to 20 languages. So there’s a long way to go in terms of NLP for the world’s languages.
(11:29) – I’m really hoping that what we start to see in 2020 is an acceleration of this technology through the long tail of languages, because with 7,000 languages, if we tackle one language every six or twelve months, it’s going to take us a long time to support things like translation or speech-to-text in 7,000 languages. So I’m hoping we see some sort of rapid adaptation technology come about in 2020 that will let us tackle 40, 50, or a hundred languages at a time.
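The numbers in these two points are easy to check back-of-the-envelope: how much of the world's ~7,117 living languages a 104-language model covers, and how long serial, one-language-at-a-time support would take (assuming, as in the episode, roughly one new language every six months):

```python
# Coverage of a 104-language multilingual model vs. the ~7,117
# living languages cited in the episode.
covered, total = 104, 7117
print(f"multilingual model coverage: {covered / total:.1%}")  # about 1.5%

# Serial pace: one new language every six months for the remainder.
remaining = total - covered
years_per_language = 0.5
print(f"serial support for the rest: ~{remaining * years_per_language:,.0f} years")
```

About 1.5% coverage, and on the order of 3,500 years at a serial pace, which is why Daniel argues for technology that adapts to dozens of languages at once.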
(13:46) – Teams are starting to leverage those existing resources, which really haven’t been tapped into, I don’t think, because they’re archived in odd ways; they’re not in the sort of formats that AI people are typically used to working in. So we’re just at the tipping point where we can really jump in and utilize a lot of that data in creative ways.
(15:17) – There are certain languages that maybe aren’t being used in the same way they were before. There are other languages that would be used digitally; they’re just not supported yet. And there are economic concerns and literacy concerns all wrapped up in this, so we have a lot of data around all of those things.
(18:09) – For chatbots in general, I would say there’s less support than there is for a general technology like Google Translate or machine translation, so it’s fewer languages than that. But you can do, again, some creative things to bridge the gap, like doing some of this transfer learning and building custom components under the hood to support new languages. Whoever does crack the nut of rapidly supporting new languages will be in a strong position.
(22:38) – Imagine going into a new language community with a virtual assistant, and imagine if that virtual assistant had the ability to answer queries in the natural language. There are still other pieces of that puzzle, like document search and that sort of thing, but this would be a big step in the right direction.
(26:40) – There’s a lot of disruption, that’s definitely true, and there are a lot of people experiencing real suffering out there. But at the same time, there are also some new opportunities arising.
(36:15) – Our show is really focused on, as you might have guessed, the practicalities of being an AI developer these days, not only for those who are currently AI developers but for those who would like to be. So we dig into a bunch of the different technologies.
(38:03) – Reinforcement learning and generative adversarial networks (GANs) both get a lot of hype because of some of the things they power, like deepfakes. But we haven’t really entered a season where reinforcement learning and GANs are powering a lot of enterprise applications the way that deep learning models have actually penetrated.