Machine Learning with R, the tidyverse, and mlr with Hefin Rhys
Hefin Rhys is a Senior Scientist (flow cytometry) at UCB. He completed his PhD at the William Harvey Research Institute at Queen Mary University of London in 2017, and graduated with his MPharmacol degree from the University of Bath in 2013. His main academic interests are conventional, imaging and small particle flow cytometry, data science and machine learning.
Hefin Rhys’ LinkedIn: https://www.linkedin.com/in/hefin-rhys/
Hefin Rhys’ Twitter: @HRJ21
Hefin Rhys’ Website: https://www.manning.com/books/machine-learning-with-r-the-tidyverse-and-mlr
Podcast website: https://www.humainpodcast.com/
YouTube Full Episodes: https://www.youtube.com/channel/UCxvclFvpPvFM9_RxcNg1rag
Support and Social Media:
– Check out the sponsors above, it’s the best way to support this podcast
– Support on Patreon: https://www.patreon.com/humain/creators
– Twitter: https://twitter.com/dyakobovitch
– Instagram: https://www.instagram.com/humainpodcast/
– LinkedIn: https://www.linkedin.com/in/davidyakobovitch/
– Facebook: https://www.facebook.com/HumainPodcast/
– HumAIn Website Articles: https://www.humainpodcast.com/blog/
Here are the timestamps for the episode:
(00:00) – Introduction
(01:44) – My view is not that of someone who is an expert on this virus, but it’s clearly something that’s very serious and that we need to take seriously and treat with respect. So as much as the virulence of the virus itself is concerning, I particularly consider how viral misinformation and misinformed practices have gone along with it.
(08:24) – I trained as a pharmacologist, and my PhD was in immunology. The traditional analysis methods that we had been using, and that other people in biological fields were using, started to not quite suit our needs and not quite answer our questions. In the biological life sciences, the level of maths people are taught often leaves them short. I started to teach statistics, R and machine learning during my PhD. Manning wanted a book that was not necessarily for computer scientists, but more for people who are experts in their own area and who could benefit from understanding and learning machine learning to make predictions and extract meaningful insights from the data that they have.
(14:57) – The answer to the question of whether somebody should learn R or Python is yes: people should use either or both. Python would probably have been a more convenient choice for a lot of people for machine learning. Packages such as caret or mlr in R were kind of an answer to scikit-learn: they create a common interface, so that once you learn how to use the package, substituting in a variety of different machine learning techniques and algorithms is extremely simple. The tidyverse is a collection of data science packages designed to make common data science tasks extremely easy, clean and reproducible.
(22:21) – There’s basically no reason for Python and R to compete; we can incorporate code from both languages.
(24:11) – R has a phenomenal community of people. You need only tweet a question or ask for opinions with the hashtag #rstats, and you get a ton of really nice, supportive answers back, as well as a huge amount of support on GitHub or Stack Overflow.
(25:41) – Submitting a package to CRAN, the Comprehensive R Archive Network, is not a difficult process at all if you write your package well. But a package has to meet certain criteria to be accepted onto CRAN. The documentation has to be of a certain quality and structured in a certain way. The script files have to be laid out and documented in a certain way. So the whole CRAN submission process selects for good quality packages.
(27:30) – People that are asking the really important questions, whether to do with business or science or health or whatever, the people that know how to ask and are asking those important questions are the ones that should be able to harness and implement statistics, data science, and machine learning to get those answers. I don’t think that machine learning should be the purview only of mathematicians and computer scientists.
(28:13) – As long as you teach people how to do things properly, so that they have enough of an understanding of how the techniques work, what they do and what they don’t do, then, absolutely, we can democratize machine learning. We can absolutely teach people to use these techniques to extract the answers or make the predictions that they’re looking for in their field of expertise.
(29:18) – The mlr package, which stands for machine learning in R, provides a unified interface not only to a huge number of machine learning algorithms, but also to processes and functions like missing value imputation, hyperparameter tuning and validation techniques. Where mlr particularly shines is that it makes it extremely simple to validate your models, and it works very nicely with parallelization. You can do some extremely complicated validation and pre-processing with very small amounts of code.
(34:49) – The caret package has functions that you can use to split your data into train, test and validation sets. It also lets you perform data pre-processing steps like missing value imputation. mlr has become more popular recently, but caret has been the mainstay.
(38:15) – tidymodels is a set of packages that come from the tidyverse. In a similar way to how mlr is trying to create a uniform interface to machine learning, the tidymodels packages are trying to create a unified approach to modeling in general. That includes linear modeling, which is probably more widely used.
(41:53) – I really do think that Machine Learning with R, the tidyverse, and mlr is an excellent book. And it sounds very braggy of me and I don’t mean to be, because although I wrote the content, a huge number of people other than me have made the book very good. So I do think that people will learn a lot and get a lot from it.
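
The "common interface" idea discussed at 14:57 and 29:18 can be illustrated with a small language-agnostic sketch. This is plain Python using only the standard library; it is not mlr's or caret's API. The class names, the `cross_validate` helper and the toy data are all hypothetical and made up for illustration. The point is only that once every learner exposes the same `fit`/`predict` methods, a single validation routine can be reused for any of them.

```python
import random


class MajorityClass:
    """Baseline learner: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


class NearestNeighbour:
    """1-nearest-neighbour learner using squared Euclidean distance."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def closest(row):
            dists = [sum((a - b) ** 2 for a, b in zip(row, seen))
                     for seen in self.X]
            return self.y[dists.index(min(dists))]
        return [closest(row) for row in X]


def cross_validate(make_learner, X, y, folds=5, seed=0):
    """Mean accuracy over k folds for any object with fit/predict."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for k in range(folds):
        test = idx[k::folds]                      # every folds-th case
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = make_learner().fit([X[i] for i in train],
                                   [y[i] for i in train])
        preds = model.predict([X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Toy data (made up): two well-separated clusters labelled "a" and "b".
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.2],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.3], [4.9, 4.8]]
y = ["a"] * 5 + ["b"] * 5

# Swapping the algorithm means changing one name; the validation code is shared.
results = {cls.__name__: cross_validate(cls, X, y)
           for cls in (MajorityClass, NearestNeighbour)}
print(results)
```

Adding a third learner here would mean writing one more class with `fit` and `predict`; nothing in `cross_validate` would change, which is the convenience the episode attributes to packages like mlr, caret and scikit-learn.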